
Link Label Prediction in Signed Citation Network

Thesis by
Uchenna Akujuobi

In Partial Fulfillment of the Requirements

For the Degree of

Master of Science

King Abdullah University of Science and Technology, Thuwal,

Kingdom of Saudi Arabia

(March, 2016)
The thesis of Uchenna Akujuobi is approved by the examination committee

Committee Chairperson: Prof. Xiangliang Zhang

Committee Member: Prof. Mikhail Moshkov

Committee Member: Prof. Xin Gao

King Abdullah University of Science and Technology

2016

Copyright ©2016

Uchenna Akujuobi

All Rights Reserved


Link Label Prediction in Signed Citation Network

Uchenna Akujuobi

Abstract

Link label prediction is the problem of predicting the missing labels or signs of

all the unlabeled edges in a network. For signed networks, these labels can either be

positive or negative. In recent years, different algorithms have been proposed, such as

using regression, trust propagation and matrix factorization. These approaches have

tried to solve the problem of link label prediction by using ideas from social theories,

where most of them predict a single missing label given that labels of other edges are

known. However, in most real-world social graphs, the number of labeled edges is

usually less than that of unlabeled edges. Therefore, predicting a single edge label at

a time would require multiple runs and is more computationally demanding.

In this thesis, we look at the link label prediction problem on a signed citation network

with missing edge labels. Our citation network consists of papers from three major

machine learning and data mining conferences together with their references, and

edges showing the relationship between them. An edge in our network is labeled

positive (dataset relevant) if the reference is based on the dataset used in the
paper, or negative otherwise. We present three approaches to predict the missing

labels. The first approach converts the label prediction problem into a standard

classification problem. We then generate a set of features for each edge and
adopt Support Vector Machines to solve the classification problem. For the second

approach, we formalize the graph such that the edges are represented as nodes with

links showing similarities between them. We then adopt a label propagation method
to propagate the labels on known nodes to those with unknown labels. In the third

approach, we adopt a PageRank approach where we rank the nodes according to

the number of incoming positive and negative edges, after which we set a threshold.

Based on the ranks, we can infer that an edge is positive if it points to a node above

the threshold. Experimental results on our citation network corroborate the efficacy

of these approaches.

With each edge having a label, we also performed additional network analysis

where we extracted a subnetwork of the dataset relevant edges and nodes in our

citation network, and then detected different communities from this extracted
subnetwork. To understand the different detected communities, we performed a case

study on several dataset communities. The study shows a relationship between the

major topic areas in a dataset community and the data sources in the community.

ACKNOWLEDGEMENTS

I would like to thank God for his grace leading me thus far. I would like to acknowledge
my parents, Chief and Dr Njoku, whose support and prayers have kept me going. I would
also like to acknowledge my supervisor, Prof. Xiangliang Zhang, whose constant advice
and corrections have kept me on the right path. Also, I acknowledge King Abdullah
University of Science and Technology for granting me this study opportunity. I would
also like to thank the Computer Science department professors who taught me
one course or another, bestowing knowledge upon me, without which I would not
have been able to succeed.



TABLE OF CONTENTS

Examination Committee Approval 2

Copyright 3

Acknowledgements 6

List of Abbreviations 9

List of Symbols 10

List of Figures 11

List of Tables 13

List of Algorithms 14

1 Introduction 15
1.1 Signed Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Link Label Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Prior Work 22
2.1 Structural Balance Theory . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Social Status Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Page Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Label propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Link Prediction Using Propagation . . . . . . . . . . . . . . . . . . . 29

3 Link Label Prediction Methodology 31


3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Supervised Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Features and methodology . . . . . . . . . . . . . . . . . . . . 32
8
3.2.2 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . 36
3.3 Label Propagation Method . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Label propagation Theory . . . . . . . . . . . . . . . . . . . . 42
3.4 PageRank-based Method . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 PageRank Theory . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Experimental Results 52
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Community Network Analysis . . . . . . . . . . . . . . . . . . . . . . 61

5 Conclusion and Future Work 71


5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

References 74

LIST OF ABBREVIATIONS

AUC Area Under Curve

ICDM IEEE International Conference on Data Mining

KDD ACM SIGKDD Conference on Knowledge Discovery & Data Mining

MF-LiSP Matrix Factorization for Link Sign Prediction

RBF Radial Basis Function

ROC Receiver Operating Characteristic

RPR Rooted PageRank

SDM SIAM International Conference on Data Mining

SNSs Social Network Sites

SVM Support Vector Machines

T-RPR Time-sensitive Rooted-PageRank

UCI University of California, Irvine

UCR University of California, Riverside

LIST OF SYMBOLS

α Alpha indicates a variable

A parameter for SVM

∈ indicates a range

µ mu indicates a variable or parameter

φ mapping function

Φ mapping function

π A PageRank vector

ℜ Data points

Σ Sigma indicates a summation

σ sigma indicates a variable or parameter

ζ Zeta indicates slack variables

LIST OF FIGURES

1.1 Relationship between papers in a signed citation network. The local


vicinity of each edge from Paper C comprises an edge from paper D 17

2.1 Structural balance: Each triangle must have 1 or 3 positive edges to


be balanced. Figure 2.1 (a) and (c) are balanced, while Figure 2.1 (b)
and (d) are unbalanced . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 16 triads determined by three nodes and a fixed unsigned edge (from
A to B). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1 Example of a link label prediction problem with one unknown label. . 35
3.2 Example of a linear separating hyperplane of a separable problem in a
2D space. Support vectors (circled) show the margin of the maximal
separation between the two classes. . . . . . . . . . . . . . . . . . . . 36
3.3 A non-Linearly separable hyperplane in a 2D space mapped into a new
feature space where data are linearly separable. . . . . . . . . . . . . 40
3.4 A sample graph G. (a) Original graph G given as input to the algo-
rithm, (b) Resulting graph G* produced as output of the algorithm . 42
3.5 PageRank score. (a) Resulting rank scores of Normal PageRank algo-
rithm, (b) Resulting rank scores of our method, the red line shows the
threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 A Directed graph representing a small web of six web pages. . . . . . 50

4.1 The number of collected papers for each of the three conferences from
2001 to 2014 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Distribution of the number of citations . . . . . . . . . . . . . . . . 54
4.3 Full citation graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Performance Results of the evaluated approaches . . . . . . . . . . . 59
4.5 ROC curve of the different approaches. . . . . . . . . . . . . . . . . 60
4.6 Result of the evaluation on unseen papers . . . . . . . . . . . . . . . 60
4.7 Topic Cloud of the four largest communities . . . . . . . . . . . . . . 64
4.8 Topic Cloud of the two selected communities . . . . . . . . . . . . . . 65
4.9 The largest community. (a) UCI dataset repository - C.L. Blake and
C.J. Merz, (b) UCI KDD archive - S.D. Bay (c) WebACE: A Web
Agent for Document Categorization and Exploration - Han et al . . . 67
4.10 The second largest community. (a) UCI dataset repository - A. Asun-
cion and D. Newman . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.11 The third largest community. (a) RCV1: A New Benchmark Collec-
tion for Text Categorization Research - Lewis et al. (b) NewsWeeder:
Learning to Filter Netnews - Lang et al. (c) Gradient-based Learning
Applied to Document Recognition - Lecun et al. (d) Molecular Clas-
sification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring - Golub et al . . . . . . . . . . . . . . . . . . . 68
4.12 The fourth community. (a) GroupLens: An Open Architecture for
Collaborative Filtering of Netnews - Resnick et al. (b) Cascading Be-
havior in Large Blog Graphs - Leskovec et al. (c) Grouplens: Applying
Collaborative Filtering to Usenet News - Konstan et al. . . . . . . . . 69
4.13 The fifth community. (a) The UCR time Series Data Mining Archive
- Keogh and Folias (b) Emergence of Scaling in Random Networks -
Barabasi et al. (c) The UCR Time Series Classification/Clustering
Homepage - Keogh et al. (d) Exact Discovery of Time Series Motifs -
Abdullah et al. (e) www.cs.ucr.edu - Keogh . . . . . . . . . . . . . . 70

LIST OF TABLES

4.1 Directed Network Information . . . . . . . . . . . . . . . . . . . . . 54


4.2 Dataset Network Information . . . . . . . . . . . . . . . . . . . . . . 61

List of Algorithms

1 Graph Conversion Algorithm . . . . . . . . . . . . . . . . . . . . . . 41


2 Label Propagation, Zhu et al. 2002 . . . . . . . . . . . . . . . . . . . 44
3 Label Spreading, Zhou et al. 2004 . . . . . . . . . . . . . . . . . . . . 44

Chapter 1

Introduction

The emergence and rapid growth of Social Network Sites (SNSs) such as Twitter,

LinkedIn, eBay and Epinions have greatly increased the influence of the internet in

everyday life. People now depend more on the web for information, social inter-

actions, decision making, event planning, sales, etc. Due to this rapidly increasing
amount of interaction, social network analysis has attracted much attention
in recent years. A considerable amount of this analysis has focused on mining
and analyzing interesting user behaviors with the aim of enhancing user experience.

Using a similar representation, our study focuses on citation graphs. Citation graphs
are used to visualize and analyze papers or researchers, and can be utilized in several
areas, such as analyzing and visualizing the performance of researchers and the
relationships between different papers, ranking conferences, papers and datasets, and
finding and monitoring communities and the prevalent topics within them [1] [2] [3]. Throughout

this thesis, we will use edge and link interchangeably. We will also use the terms sign

and label interchangeably.


1.1 Signed Network

Online social networks have traditionally been represented as graphs with positively

weighted edges representing interactions amongst entities and the nodes representing

the entities. However, this representation is inadequate since there are also negative

effects at work in most network settings. For instance, in online social networks

like Slashdot and Epinions, users often tag others as friends or foes, give ratings to

other users or items [4]; and on Wikipedia, users can vote against or in favour of the

nomination of other users to be admins [5]. In a binary signed network, edges are given

positive or negative labels showing the relationship between two nodes. However, the

complexity and nature of many graph problems change once negatively labeled edges

are introduced. For example, the shortest-path problem in the presence of cycles

with negative edges is known to be NP-hard [6]. Several studies have been conducted

on binary signed networks, with different algorithms proposed using methods such as

structural balance theory, social status theory, matrix factorization, trust and distrust

propagation, and regression [4] [7] [8] [5].

In our directed citation network, the direction of the edges is from a paper to its

reference (see Figure 1.1). We define its sign to be negative or positive based on its

dataset relevancy (i.e., whether the citation is based on the dataset used in the paper).

The underlying question is then: Does the pattern of edge signs in the local vicinity
of a given edge affect the sign of the edge? The local vicinity of an edge comprises
other edges having the same target node as the edge. Knowing the answer to this
question would provide insight into the design of computing models, where we try
to infer the unobserved link relationship between two papers using the negative and

positive relationship we have observed in the vicinity of this link.



Figure 1.1: Relationship between papers in a signed citation network. The local
vicinity of each edge from Paper C comprises an edge from paper D

1.2 Link Label Prediction

Considering a directed graph, the label of a link can be defined to be positive (repre-

senting a positive demeanor of its originator towards the receiver) or negative (rep-

resenting a negative demeanor of its originator towards the receiver). In a citation

network, the sign can denote similarities between a pair of nodes based on properties

such as datasets, affiliation, research interest, etc.

Studies on social networks based on social psychology [9] [10] [11] [12] [4] have

shown that future unknown interactions and perceptions of an entity towards another

entity tend to be a↵ected by its current interactions or perceptions towards other

entities. For instance, if user A is known not to trust user B, and user B is known to

trust user C, it is likely that user A would not trust user C also. Considering online

shops such as eBay and Amazon, for instance, if everyone who purchased item A also
purchased item B, we can infer that if user X buys item A, then user X is also likely

to buy item B. Therefore, understanding the latent tension between the positive and
negative forces is very important in order to solve some networking problems like

recommending friends to users in a social network, propagation of trust and distrust,

recommending datasets/papers in a citation network, prediction, etc. The aim here

is to try to examine the relationship between negative and positive links in a network.

For this, we need to know the theories of signed networks, which will enable us to

reason and understand how di↵erent configurations of positive and negative edges

provide information for the explanation of the various interactions in the network [4].

The two most popular theories that have been used for positive and negative
relationship prediction on social networks are social status theory [12] [13] and structural
balance theory [10] [14].

Structural balance theory was introduced in the mid-20th century in social
psychology. This theory considers the possible ways a triad (a triangle on three entities)

can be signed. The main idea of this theory is that triads with one or three positive

signs (two friends with one enemy or three mutual friends) are balanced and thus,

more plausible than triads with none or two positive signs (three mutual enemies or

two enemies with a common friend) which are unbalanced.

Social status theory introduced in [12] takes into account the direction and sign

of edges and posits that a negative directed link implies that the initiator views the

recipient as having a lower status while a positively directed link implies that the

initiator views the recipient as having a higher status. Thus, an entity will
connect to another entity with a positive sign only if it views it as having a higher

status. Note that although these theories work well and have been proved useful,

neither of them has been able to explain the situation where the two nodes connected

by an observed edge have no mutual neighbor [15].

Generally, the edge sign prediction problem in most of these existing studies [11] [4] [12] [7] [8]

can be formalized as follows: Given a network with signs on all its edges except the
sign on the edge from node u to node v, denoted as s(u, v), the aim is to predict
this missing sign s(u, v) [4]. Our goal in this thesis, however, is to solve

the link label prediction problem on our citation network. The link label prediction

problem is defined as follows: Given the information about the signs of certain links
in a social network, which reveals the nature of relationships that exist among the
nodes, we want to predict the sign, positive or negative, of the remaining links [16].

Agrawal et al. [16] introduced the link label prediction problem and proposed a
matrix factorization based technique, MF-LiSP (Matrix Factorization for Link Sign
Prediction). The proposed method relies on prior knowledge of the network (i.e.,
whether it is a balanced network, a semi-balanced network, etc.), and uses this
information to try to complete the matrix in a way that preserves the structural
balance.

1.3 Problem Setting

Our problem setting is described as follows: Given a citation graph G = (V, E),
where the set of nodes V is a set of papers and data sources and the set of edges E
is the set of relationships between pairs of nodes, let Lu,v denote the type of
relationship between a pair of nodes. We focus on two types of relationships: dataset
relevant (positive) and non-dataset relevant (negative) relationships. A link from
node u to node v is said to be dataset related if the paper or data source represented
by node v is cited by the paper represented by node u based on the datasets used
or available in v, and

vice-versa. Given the information about the relationship nature of certain links in a

citation network, we want to learn the nature of relationships that exist among the

nodes by predicting the sign, positive or negative, of the remaining links. To solve
the link label prediction problem on our signed citation network, we present three
approaches, based respectively on SVM, the PageRank algorithm, and the label
propagation algorithm.

To study the link label prediction problem, a step-by-step research approach was

followed. First, we mined the dataset used here from the DBLP computer science

bibliography website [17]. We selected three conference datasets, the IEEE International
Conference on Data Mining (ICDM), the SIAM International Conference on Data
Mining (SDM), and the ACM SIGKDD Conference on Knowledge Discovery & Data
Mining (KDD), ranging from 2001 to 2014. From the mined dataset, we extracted the dataset

relevant references and constructed a citation graph with directed edges from each
paper to the papers or data sources it references. If a reference is dataset relevant,
we assign a positive label to the edge; otherwise, we assign a negative label.
Then the following tasks were accomplished

as part of this research:

• Convert the link label prediction problem to a standard binary classification

problem by generating sets of features for each edge and adopting SVM to solve

the classification problem.

• Convert the link label prediction problem to a label propagation problem by

formalizing the graph in such a way that the edges are transformed to labeled

and unlabeled nodes in the converted graph. In the new graph, the edges show

the similarities between the nodes. This similarity is based on the nodes the
edges are linked to in the original graph. We then adopt a label propagation
approach by propagating labels from known nodes to nodes with unknown
labels.

• Adopt a PageRank approach in solving the link label prediction problem by


ranking the nodes by their positive and negative incoming edges, such that
incoming positive edges add to the rank of a node while negative incoming edges
reduce it. Then we find a threshold that separates
dataset relevant nodes from non-dataset relevant nodes.

• Evaluate proposed algorithms for link label prediction problem on our citation

network.

• Extract a subgraph comprising dataset relevant nodes and edges.

• Discover and study the communities in the extracted subnetwork.
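As a concrete illustration of the signed citation graph described above, the following minimal Python sketch represents directed, labeled edges; the paper identifiers and labels are hypothetical examples, not entries from the mined DBLP dataset.

```python
# Directed edges run from a citing paper to its reference; the label is +1 if
# the citation is dataset relevant, -1 otherwise, and None if it is unknown.
# Paper identifiers below are hypothetical.
edges = {
    ("paper_A", "uci_repo"): +1,   # A cites the UCI repository for its data
    ("paper_A", "paper_B"):  -1,   # A cites B for methodology only
    ("paper_C", "uci_repo"): None, # label to be predicted
}

def signed_in_degree(node, edges):
    """Count the labeled incoming positive and negative edges of a node."""
    pos = sum(1 for (u, v), s in edges.items() if v == node and s == +1)
    neg = sum(1 for (u, v), s in edges.items() if v == node and s == -1)
    return pos, neg

print(signed_in_degree("uci_repo", edges))  # (1, 0)
```

The signed in-degree counts here are the quantities that the PageRank-based approach uses for ranking nodes before thresholding.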

This thesis is structured as follows: In chapter 2, we discuss related works based on

the algorithms and similar ideas used in this thesis. We also discuss previous works on

link label prediction that try to predict links using ideas from label propagation. In

chapter 3, we discuss the methodologies used in this thesis and how we approached the

problem. We also discuss in detail how the algorithms used in each approach work.
In chapter 4, we discuss the results obtained from the evaluation of these algorithms

and show them side by side for comparison. We also discuss the network analysis

on the dataset relevant subgraph. Finally, we conclude and propose future research

direction.

Chapter 2

Prior Work

A number of papers have investigated the positive and negative relationship between

members on signed networks using different approaches. Guha et al. [12] extended the

already existing works based on trust propagation [18] [19]. Their work introduced the

distrust propagation. The trust and distrust propagations are obtained by calculating

a linear combination of powers of a combined matrix. This combined matrix is made

up of the co-citation, trust coupling, direct propagation and transpose trust matrices.

This propagation method however, cannot be explained by the structural balance

theory.

Based on the social status and structural balance theory, Leskovec et al. [7] ex-

tended the work of Guha et al. [12] by using a machine-learning framework approach.

They evaluate which of a range of structural di↵erence give more information for

prediction task. In their work, they define two classes of features for the machine

learning approach. The first class is based on the signed degree of the nodes and the

second class is based on the social structure principles using a triad. Assuming we are

trying to predict the sign of the edge from u to v. For the first class of features, they

construct seven degree features based on the outgoing edges from u, the incoming edges
to v, and the embeddedness of the edge. These seven features are: d+_in(v) and
d-_in(v), the number of incoming positive and negative edges to v; C(u, v), the total
number of common neighbours of u and v in an undirected sense (the embeddedness
of the edge); d+_out(u) and d-_out(u), the number of outgoing positive and negative
edges from u; the total out-degree of u; and the total in-degree of v. For

the second class, they considered each triad involving the edge (u, v), consisting of

a node w (see Figure 2.2 ) and encoded the information in a 16-dimensional vector

specifying the number of triads of each type that (u, v) is involved in. Then, using

a logistic regression classifier, they combined the information from the two classes of

features into an edge sign prediction.
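The first class of degree features can be sketched as follows; the toy signed edge list and the helper function are illustrative, not the original implementation.

```python
# Signed directed edges as (source, target, sign) triples; toy data.
edges = [("u", "v", +1), ("w", "v", -1), ("u", "w", +1), ("x", "v", +1)]

def degree_features(u, v, edges):
    """Signed in/out-degree features and embeddedness for a candidate edge (u, v)."""
    d_in_pos  = sum(1 for a, b, s in edges if b == v and s > 0)
    d_in_neg  = sum(1 for a, b, s in edges if b == v and s < 0)
    d_out_pos = sum(1 for a, b, s in edges if a == u and s > 0)
    d_out_neg = sum(1 for a, b, s in edges if a == u and s < 0)
    # common neighbours of u and v, ignoring direction (embeddedness)
    nbr = lambda x: ({a for a, b, _ in edges if b == x}
                     | {b for a, b, _ in edges if a == x})
    c_uv = len(nbr(u) & nbr(v))
    return d_in_pos, d_in_neg, d_out_pos, d_out_neg, c_uv

print(degree_features("u", "v", edges))  # (2, 1, 2, 0, 1)
```

In the supervised approach of chapter 3, feature vectors of this kind are what the classifier consumes.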

Chiang et al. [8] extended the work of Leskovec et al. [7] by considering not

just triads, but longer cycles. In their work, they ignored the edge directions so

as to reduce the computational complexity. They extended the supervised learning

approach of [7] by using features derived from longer cycles [8]. However, Agrawal

et al. [16] were the first to tackle simultaneous label prediction of multiple links. They

formulated the link sign prediction problem as a matrix completion problem in a

setting, where the data is represented as a partially observed matrix [16]. In their

work, they proposed a new technique Matrix Factorization for Link Sign Prediction

(MF-LiSP) that is based on matrix factorizations. Note, many approaches based upon

social status theory or structural balance theory have been developed to perform edge

sign prediction in signed networks. However, they generally do not perform well in

networks with few topological features (i.e. long-range cycles and triads) [15]. The

most popular theories in the study of signed social networks are those of social status

and structural balance. However, there are some methods used in the link prediction
problem that can be extended to solve the link label prediction problem. Prior works

based on the approaches used in this thesis are described in sections below.

Figure 2.1: Structural balance: Each triangle must have 1 or 3 positive edges to
be balanced. Figure 2.1 (a) and (c) are balanced, while Figure 2.1 (b) and (d) are
unbalanced.

2.1 Structural Balance Theory

The formulation of the structural balance theory is based on social psychology [10].

Based on this theory, Cartwright and Harary [11] formally provided and proved the

notion of structural balance using undirected triads and mathematical theories of

graphs. Their notion of structural balance has four basic intuitive explanations each

representing a particular structure. They further claim that these four structures can

be divided into two groups, namely balanced and unbalanced (see Figure 2.1). The

intuitive explanations for the balanced structures are: A, B and C are all friends

(Figure 2.1(a)); B and C are friends with A as a mutual enemy (Figure 2.1(c)). The

intuitive explanations for the unbalanced structures are: A is friends with B and C,

but B and C are enemies (Figure 2.1(b)); A, B and C are mutual enemies (Figure

2.1(d)). In other words, the balanced triads follow the rule: the friend of my friend

is my friend (Figure 2.1(a)) and the enemy of my enemy is my friend, the friend of

my enemy is my enemy and the enemy of my friend is my enemy (Figure 2.1(c)).

Structural balance theory was initially intended for undirected networks but has been

Figure 2.2: 16 triads determined by three nodes and a fixed unsigned edge (from A
to B).
applied to directed networks by disregarding the edge direction [4].
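The balance condition can be stated compactly: a triad is balanced if and only if it has an odd number (1 or 3) of positive edges, equivalently if and only if the product of its three signs is positive. A minimal Python sketch, with signs encoded as +1/-1:

```python
def is_balanced(s_ab, s_bc, s_ca):
    """A triad is balanced iff the product of its three edge signs is positive,
    i.e. it contains 1 or 3 positive edges."""
    return s_ab * s_bc * s_ca > 0

# The four configurations of Figure 2.1:
print(is_balanced(+1, +1, +1))  # (a) three mutual friends        -> True
print(is_balanced(+1, -1, +1))  # (b) two friends, one enmity     -> False
print(is_balanced(-1, -1, +1))  # (c) two friends, mutual enemy   -> True
print(is_balanced(-1, -1, -1))  # (d) three mutual enemies        -> False
```

The product rule is why structural balance extends naturally to longer cycles, as exploited by Chiang et al. [8].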

2.2 Social Status Theory

Social status theory was motivated by the work of Guha et al. [12] while considering

the edge sign prediction problem. Leskovec et al. [4] developed the social status

theory using the structural balance theory and ideas from Guha et al. [12] to explain

signed networks. In this theory, they consider the direction and sign of an edge.

Consider a link from A to B. If the link has a positive sign, it indicates that A views
B as having a higher status; if the link has a negative sign, it indicates that A views
B as having a lower status. In this theory of status, A

will positively connect to B only if it views B as having a higher status, and negatively

otherwise. These relative levels of status can be propagated across the network [12].

Assuming all nodes in the network agree on the social status ordering and the sign of

the link from A to B is unknown, we could infer the sign from the contexts provided

by the rest of the network.
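As a toy sketch of this idea, suppose each node has already been assigned a scalar status level consistent with the observed signs (the values below are hypothetical); an unknown sign can then be inferred by comparing status levels:

```python
# Hypothetical status levels, assumed consistent with the observed signs.
status = {"A": 0.2, "B": 0.7, "C": 0.4}

def infer_sign(source, target, status):
    """Status theory: a link is positive iff the initiator views the
    recipient as having a higher status."""
    return +1 if status[target] > status[source] else -1

print(infer_sign("A", "B", status))  # +1: A ranks below B
print(infer_sign("B", "C", status))  # -1: B ranks above C
```

In practice the status levels themselves must be estimated from the observed signed edges, which is the harder part of applying the theory.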

2.3 Page Rank

The PageRank algorithm introduced by Brin and Page [20] shows the importance

of a node in a network based on the importance of the nodes that link to it. Some

works [21] [22] on link prediction adapted what is known as the rooted PageRank approach

to their studies. In the rooted PageRank approach, the random walk assumption of

the original PageRank is altered as follows: the similarity score between two vertices u
and v can be measured as the stationary probability of v in a random walk that returns
to u with probability 1 − α in each step, moving to a random neighbor with probability
α [23]. However, this metric is asymmetric; to make it symmetric, it is combined with
the counterpart score in which the roles of u and v are reversed [23]. Given a diagonal
degree matrix D with D[i, i] = Σ_j A[i, j],

RPR = (1 − α)(I − αN)^(−1)                                            (2.1)

where N = D^(−1) A is the adjacency matrix with row sums normalized to 1.
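Equation (2.1) can be computed directly on a small example; the 3-node graph and the value of α below are illustrative choices, not values from the thesis.

```python
import numpy as np

alpha = 0.85
# Undirected toy triangle graph.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))   # inverse of the diagonal degree matrix
N = D_inv @ A                           # row-normalized adjacency matrix
RPR = (1 - alpha) * np.linalg.inv(np.eye(3) - alpha * N)

# Row u of RPR holds the stationary probabilities of the walk rooted at u,
# so each row sums to 1.
print(np.allclose(RPR.sum(axis=1), 1.0))  # True
```

The matrix inverse expands to (1 − α) Σ_k α^k N^k, i.e. a geometric sum over walks of all lengths, which is the usual interpretation of rooted PageRank.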

Liben-Nowell et al. [21] assigned a connection weight score(u, v) to pairs of nodes (u, v)

based on an input graph G. Then from this score, they produced a ranked list

in decreasing order of score(u, v). This can be seen as computing a measure of

similarity or proximity between pairs of nodes (u, v), based on the network topology.

The random reset procedure in web page ranking was adopted. However, this work
focused on the link prediction problem, which is different from our problem setting.

Jabar et al. [22] proposed a new approach Time-sensitive Rooted-PageRank (T-

RPR) which is built upon Rooted-PageRank. This method captures the interaction

dynamics of the underlying network structure. They define the problem as: Given a
time-interaction graph Gk = (Vk, Ek, W), where Vk are the nodes, Tk is the time slot
in which there was an interaction between the set of nodes (in their case, actors), and
W is the edge weight vector, let score_{x,k}(vi) denote the RPR score of node vi when
x is used as the root on Gk. The resulting list of nodes is sorted in descending order
with respect to the RPR scores at time slot k:

L(x)_k = [v1, v2, ..., vn],                                           (2.2)

where n = |Vk| − 1, Vk \ {x} = {v1, v2, ..., vn}, and score_{x,k}(vi) ≥ score_{x,k}(vi+1) for
i = 1, ..., n − 1. The rankings obtained over the time slots are then aggregated
for each root node x and all remaining nodes vi ∈ V, resulting in an aggregate score
aggScore_x(vi). Just like in [21], the aggregate scores are sorted in descending
order. Finally, from the list of nodes L(x), the hierarchy graph GH is inferred. Their

work was focused on detecting hierarchical ties between a group of members in a

social network. Note that this is di↵erent from the objective of this thesis.

Though the methods used a similar idea of scoring and ranking nodes, there is no

prior work on using PageRank to infer the labels of links in a signed network. We

will use PageRank to rank the nodes and, based on their ranks, infer the label

of an edge directed towards a node.

2.4 Label propagation

Label propagation introduced in [24] is a semi-supervised algorithm. Given a graph

with labeled and unlabeled nodes, label propagation diffuses the labels recursively

across graph edges from the labeled to unlabeled nodes, until a stable result is reached.

This results in a final labeling of all the graph nodes. Because of its efficiency and

performance, it has been applied to several problems, with modified algorithms being

proposed [25] [24] [26] [27]. Several works [28] [29] applied label propagation to the

community detection problem. Some other works [30] [31] [32] extended the work on

community detection and proposed algorithms based on label propagation in finding

overlapping communities where each vertex can belong to up to v communities, where

v is a parameter of the algorithm. Boldi et al. [33] applied label propagation in

clustering and compressing very large web graphs and social networks. They proposed

a scalable parameter-free method (Layered Label Propagation) that keeps nodes

with the same labels close to each other at each iteration and yields a compression-

friendly ordering. On a similar note, Ugander et al. [34] proposed an efficient

algorithm (Balanced Label Propagation) for partitioning massive graphs while greedily

maximizing edge locality. There are also several works [35] [36] [37] [38] on

classification and prediction on social networks using label propagation. However, to the

best of my knowledge, there is no previous method proposed for solving the link label

prediction problem on signed graphs using label propagation. We present a method that

reformulates the graph such that we can propagate labels from labeled to unlabeled edges.

2.5 Link Prediction Using Propagation

A similar idea of propagating labels on links was used in the algorithm proposed

by Kashima et al. [39]. They used ideas from label propagation method to predict

links. Using ancillary information such as node similarities, they devised a

node-information-based link prediction method using a conjugate gradient method and

vec-trick techniques to make the computations efficient. Given two sets of nodes

X = {x1 , x2 , ..., xM } and Y = {y1 , y2 , ..., yN }, where M = |X| and N = |Y |.

The link propagation inference principle, obtained by modifying the label

propagation principle, states that two similar node pairs are likely to have the same

link strength. The link strengths indicate how likely a link exists for a given node

pair (xi , yj ), and are set to some positive value if a link exists in the observed graph,

some negative value if no link exists in the observed graph, or zero otherwise. Applying

the label propagation principle, they defined an objective function to minimize:

J(F ) = (λ/2) vec(F )^T L vec(F ) + (1/2) ‖ vec(F ∘ G) − vec(F ∗ ) ‖²₂           (2.3)

with Gi,j = 1 if (i, j) ∈ E, and Gi,j = √µ otherwise,
where F is a second order tensor representing link strength (vec(F ) is the vectorization

operation of the matrix F ), and F ∗ represents the observed parts of the network. The first

term of Eq. (2.3) indicates that two link strength values Fi,j and Fl,m for two node

pairs (xi , yj ) and (xl , ym ) should be close to each other if the similarity between them

is high. The second term is the loss function. The loss function fits the predictions

to their target values of the observed regions of the network. The second term also

acts as a regularization term so as to prevent the predictions from being too far from

zero, and for numerical stability (∘ is the Hadamard product). The λ > 0 and

µ > 0 are regularization parameters which balance the two terms of Eq. (2.3). L

is an M N × M N Laplacian matrix. To obtain F that minimizes Eq. (2.3), Eq. (2.3)

is differentiated with respect to vec(F ) [39]. This method can fill in missing parts

of tensors, and thus, is applicable to multi-relational domains, making it possible to

handle multiple types of links simultaneously [40].

Although the predictive quality of this method was better than that of other

state-of-the-art methods, its efficiency and effectiveness were overshadowed by its

computational time and space constraints, thereby limiting its application. To

address this issue, Raymond et al. [41] extended [39], proposing a fast and scalable

algorithm for link propagation. The proposed algorithm utilizes matrix factorization

and approximation techniques (such as the Cholesky and eigen-decomposition)

to reduce the computational time and space required in solving linear equations in

the Link Propagation algorithm. Note, however, this is solving the link prediction

problem and is di↵erent from our problem.



Chapter 3

Link Label Prediction

Methodology

In this section, we discuss the approaches used in solving our link label prediction

problem.

3.1 Problem Definition

The link label prediction problem is defined as follows: Given a citation graph G = (V, E),

the set of nodes V is the set of papers or data sources, and the set of directed

edges E is the set of relationships between pairs of nodes. An edge pointing from u

to v indicates that v is cited in u. We let Lu,v denote the label of the edge (u, v) from

node u to node v, representing the relationship between u and v. We focus on two

types of relationship: dataset relevant (positive) and non-dataset relevant (negative)

relationship. A link from node u to node v is said to be dataset relevant if node v is

cited by node u because of the dataset it uses, and non-dataset relevant otherwise. In this thesis, we will

address the node relationship using their graphical terms (labels or signs). Given the

information about the relationship nature of certain links in a citation network, we

want to learn the nature of relationships that exist among the nodes by predicting
the sign, positive or negative, of the remaining links. We assume that a link (u, v)

from node u to node v is most likely to be positive if the majority of node v's known

incoming links are positive; on the other hand, if the majority of node v's known

incoming links are negative, then link (u, v)

is most likely to be a negative link. This problem can thus be formulated as follows:

Given a citation graph G = (V, E) and a set of labels Lu,v ∈ {−1, 1} for some edges,

we predict the missing labels for other edges in the graph. We next introduce three

different approaches for solving this problem.
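As a minimal illustration of the assumption above (with made-up node names, not our real data), the majority-vote rule over a target node's labeled incoming edges can be sketched as:

```python
# Hypothetical sketch of the problem setup: a signed citation graph stored as
# labeled edges, plus the baseline assumption that an unlabeled edge (u, v)
# takes the majority sign of v's labeled incoming edges.
def majority_sign_of_target(labeled_edges, v):
    """labeled_edges: {(source, target): sign in {-1, 1}}.
    Return +1/-1 by majority vote over v's labeled incoming edges (0 on a tie)."""
    pos = sum(1 for (_, t), s in labeled_edges.items() if t == v and s == 1)
    neg = sum(1 for (_, t), s in labeled_edges.items() if t == v and s == -1)
    return 1 if pos > neg else (-1 if neg > pos else 0)

labeled = {("a", "v"): 1, ("b", "v"): 1, ("c", "v"): -1}  # Lu,v in {-1, 1}
print(majority_sign_of_target(labeled, "v"))  # -> 1
```

The three approaches introduced next can be read as increasingly global refinements of this local voting rule.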

3.2 Supervised Method

Here, we consider the link label prediction problem as a classification problem. For

each edge, we generate a set of features based on the network topology, structural

balance and status theory. Having obtained these features, we convert the link label

prediction problem to a binary classification problem: labeled edges are used for

training a classifier that will predict positive or negative relationship for unlabeled

edges. We adopt Support Vector Machines (SVM) in solving this binary classification

problem due to its well-known efficacy in classification. A broad description of how

SVM works can be found in section 3.2.2.

3.2.1 Features and methodology

A critical part of any machine-learning algorithm is choosing an appropriate feature

set. A poor feature set selection might have negative effects on the results

learned by many machine-learning algorithms. We divided the features into two

categories. The first category is based on the signed degree of the nodes and the

second category is based on the signed degree of common neighbors.


For an edge (u, v), pointing from node u to node v, we define the following features

in the first category.

1. d+_in (v) and d−_in (v), which denote the number of incoming positive and negative

edges to v

2. d+_out (u) and d−_out (u), which denote the number of outgoing positive and negative edges from u

3. The total degree of u and the total degree of v, which are d(u) and d(v) respectively

4. C(u, v), which denotes the embeddedness of the edge. We describe the

embeddedness as the total number of mutual neighbors of u and v in a semi-directed

sense, that is, the number of nodes w such that w is linked by an edge in either

direction with u and linked to node v in a directed sense (i.e., from w to v)

5. H(u, v), which infers the sign of the edge based on social status theory. Node

status is calculated as follows: σ(x) = d+_in (x) + d+_out (x) − d−_in (x)
in (x) + dout (x) din (x)

6. T (v), which is the number of dataset-related words contained in the paper title

or data source name represented by the target node v

The status heuristic σ(x) gives node x status benefits for each positive link it

receives and each positive link it generates, and a status reduction for each negative

link it receives. We then predict a positive sign +1 for (u, v) if σ(u) < σ(v), a negative

sign −1 if σ(u) > σ(v), and 0 otherwise. The list of dataset-related words was obtained

by inspecting the publications and data sources.

For the second category, we consider each node w such that w has an edge to or from

u and also an edge to v. Here, we used the idea that if nodes similar to node u link

to node v with a positive sign, then u is most likely to link to v with a positive sign,

or with a negative sign if nodes similar to node u link to v with a negative sign.

The features for an edge (u, v) in the second category are

1. N+ – the number of mutual neighbors with a positive link to v

2. N− – the number of mutual neighbors with a negative link to v

3. P (e(u,v) = 1 | N±), which denotes the probability of edge (u, v) being positive

based on their mutual neighbors.

To further explain the feature setup, a small network with one unknown label is

shown in Figure 3.1. In this small network, the features for edge(u,v) in the first

category are

1. d+_in (v) = 1, d−_in (v) = 4

2. d+_out (u) = 0, d−_out (u) = 1

3. d(u) = 3 and d(v) = 6

4. C(u, v) = 2

5. H(u, v) = −1, since σ(v) = −3 < σ(u) = −1

6. T (v) is obtained from the paper title or name of the data source represented by

v.

The features for edge(u,v) in the second category are

1. N+ = 0

2. N− = 2

3. P (e(u,v) = 1 | N±) = 0

Figure 3.1: Example of a link label prediction problem with one unknown label.
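The first-category feature computations can be sketched as follows; the toy graph here is invented for illustration and is not the exact graph of Figure 3.1:

```python
# A minimal sketch (on a made-up toy graph) of the first-category features for
# a candidate edge (u, v): signed in/out degrees and the status heuristic H(u, v).
edges = {  # (source, target) -> sign
    ("a", "v"): 1, ("b", "v"): -1, ("c", "v"): -1, ("u", "c"): -1,
}

def d_in(x, sign):  # number of incoming edges to x with the given sign
    return sum(1 for (_, t), s in edges.items() if t == x and s == sign)

def d_out(x, sign):  # number of outgoing edges from x with the given sign
    return sum(1 for (src, _), s in edges.items() if src == x and s == sign)

def status(x):  # sigma(x) = d+_in(x) + d+_out(x) - d-_in(x)
    return d_in(x, 1) + d_out(x, 1) - d_in(x, -1)

def H(u, v):  # +1 if sigma(u) < sigma(v), -1 if sigma(u) > sigma(v), else 0
    su, sv = status(u), status(v)
    return 1 if su < sv else (-1 if su > sv else 0)

print(d_in("v", 1), d_in("v", -1))  # -> 1 2
print(H("u", "v"))                  # -> -1
```

The second-category features follow the same pattern, counting only the mutual neighbors of u and v.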

Having obtained the features, we pass them to an SVM classifier to classify the edges into

positive and negative edges using the RBF kernel (see Section 3.2.2). We used the

LibSVM implementation of SVM for the classification. Since the network is made

up of many lowly cited nodes (see Figure 4.2), we sample the test dataset (edges)

from nodes with high in-degree to avoid a situation where we get dangling nodes by

randomly sampling all the edges linking (in an undirected sense) a node to any other

node in the network. Another reason for this is to keep a considerable amount of

links to each node for better predictions. For instance, in Figure 3.1, using only edge

(d, v) in the training dataset might lead to the prediction of positive labels on other

incoming edges to v in the test dataset. This is because a positively signed edge (d, v)

would be the only incoming edge to node v in the training dataset. Thus, for each

node with a high in-degree, we randomly sample 10% of the incoming links. The

parameter was selected by running a grid search cross-validation on the training data

and then selecting the parameters that achieved the best result. The evaluations and

results will be shown and discussed in Chapter 4.
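The classification step can be sketched as follows, using scikit-learn's SVC (which wraps LibSVM) with an RBF kernel and a grid-search cross-validation over its parameters; the feature matrix below is random stand-in data, not our real edge features:

```python
# Hedged sketch of the SVM classification step: RBF kernel plus grid search
# over C and gamma, mirroring the setup described in the text. The data here
# are synthetic placeholders for the per-edge feature vectors.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))                 # one row of features per labeled edge
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # stand-in edge labels in {-1, 1}

grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.score(X, y))
```

In practice the grid search is run on the training split only, and the selected parameters are then applied to the held-out test edges.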


3.2.2 Support Vector Machines (SVM)

Support Vector Machines (SVM) was introduced in 1979 (Vapnik, 1982). This

algorithm has been shown to support both classification and regression tasks and has

thus grown in popularity over the years. The basic concept of this algorithm is as

follows: Given an input vector, find a hyperplane (boundary) that separates a set

of objects having different class memberships. This raises the question of how to

find a good hyperplane that not only separates the training data but also

generalizes well, as not all hyperplanes that separate the training data will

automatically separate the testing data [42]. The optimal hyperplane for this work

can be defined as a decision function with maximal margin between positively signed

vectors and negatively signed vectors.

Figure 3.2: Example of a linear separating hyperplane of a separable problem in a 2D


space. Support vectors (circled) show the margin of the maximal separation between
the two classes.
Optimal Hyperplane

Suppose we have a set of labeled training examples:

(x1 , y1 ), (x2 , y2 ), ..., (xn , yn ),   yi ∈ {1, −1}           (3.1)

In our case, x can be seen as the link features and y as the link label. These training

examples are linearly separable if there exists a vector w and a scalar b such that the

inequalities:

w · xi + b ≥ 1    if yi = 1,                                     (3.2)

w · xi + b ≤ −1   if yi = −1,                                    (3.3)

are valid for all examples of the training set [42]. These inequalities can be combined

into one set of inequalities:

yi (xi · w + b) − 1 ≥ 0   ∀i                                     (3.4)

Lying on the hyperplane H1 : w · xi + b = −1 are the points for which equality

(3.3) holds. These points have a perpendicular distance |w · x + b| / ‖w‖ = 1/‖w‖ from the optimal

hyperplane with normal w. Similarly, lying on the hyperplane H2 : w · xi + b = 1

are the points for which equality (3.2) holds. These points have a perpendicular

distance |w · x + b| / ‖w‖ = 1/‖w‖ from the optimal hyperplane with normal w. Therefore,

d− = d+ = 1/‖w‖ and the margin is 2/‖w‖. The optimal hyperplane w∗ · x + b∗ = 0

can be obtained by minimizing ‖w‖² subject to the constraints (3.4).

Examples xi for which yi (w · xi + b) = 1 are called the support vectors. If the

optimal hyperplane can be built from few of its support vectors relative to the size

of the training dataset, the generalization ability will be high even in an infinite
dimensional space [42]. Moreover, [14] shows that w can be expressed as a linear

combination of a subset of the training dataset: the points that lie exactly on the

margin (i.e., at minimum distance to the hyperplane).

w∗ = Σ_{i=1}^{l} yi αi∗ xi ,                                     (3.5)

where αi∗ ≥ 0. Since αi > 0 only for support vectors, equation (3.5) represents a

concise form of writing w∗ . To handle non-separable noisy examples between H1 and

H2, the constraints (3.2) and (3.3) can be relaxed when necessary (i.e., at a

further cost for doing so) by introducing positive slack variables ξi , i = 1, 2, ..., n

[42]. The constraints will now be:

w · xi + b ≥ 1 − ξi    if yi = 1,    ξi ≥ 0 ∀i.                  (3.6)

w · xi + b ≤ −1 + ξi   if yi = −1,   ξi ≥ 0 ∀i.                  (3.7)

One way to assign an extra penalty for errors is to change the objective function to

be minimized from ‖w‖²/2 to ‖w‖²/2 + C(Σi ξi )^k , where a larger C corresponds to

assigning a higher penalty to errors.

Non-Linearly separable training

A question that may arise is: what if the data are not linearly separable? How can the

above methods be generalized? Boser et al. [43] showed that this problem can be

solved by mapping the data into a higher-dimensional feature space where it can be

separated (see Figure 3.3). For this, we need to find a function Φ that will perform

such a mapping:

Φ : ℝ^N → F
However, this feature space might be of a much higher dimension, and mapping the

data into a feature space of such dimension might affect the performance of the

resulting machine. One can show [44] that during training, the optimization problem

only uses the training examples to compute pair-wise dot products, xi · xj , where

xi , xj ∈ ℝ^N . This is notable because it turns out that there exist functions that,

given two vectors x and y in ℝ^N , implicitly compute the dot product between x and

y in a higher-dimensional space ℝ^M without explicitly transforming x and y to ℝ^M .

This process uses no extra memory and has a minimal effect on computation time.

These functions are called kernel functions. A kernel function can thus be formulated

as:

K(xi , xj ) = (Φ(xi ) · Φ(xj ))M ,   xi , xj ∈ ℝ^N               (3.8)

where (·, ·)M is an inner product of ℝ^M , Φ(x) transforms x to ℝ^M (Φ : ℝ^N → ℝ^M ), and

M > N . The most popular kernel functions K(xi , xj ) = Φ(xi ) · Φ(xj ) available in most

off-the-shelf classifiers are:


K(xi , xj ) = Φ(xi ) · Φ(xj ) =

    xi · xj                      (Linear)
    (γ xi · xj + C)^d            (Polynomial)
    exp(−γ | xi − xj |²)         (RBF)
    tanh(γ xi · xj + C)          (Sigmoid)

The most popular choice of kernel used in Support Vector Machines is the

RBF kernel. This is primarily because of its finite and localized responses across

the entire range of the real x-axis [45].
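The kernel idea can be checked numerically for a simple case: the homogeneous polynomial kernel of degree 2 in ℝ², where K(x, y) = (x · y)² equals the explicit dot product Φ(x) · Φ(y) with Φ(x) = (x1², √2 x1 x2, x2²). This worked example is illustrative and not specific to our method:

```python
# Numeric check of the kernel trick for the degree-2 homogeneous polynomial
# kernel in R^2: the kernel value equals the dot product in the mapped space.
import math

def K(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
print(K(x, y), dot(phi(x), phi(y)))  # equal up to floating-point error
```

The same equality holds for the RBF kernel, whose implicit feature space is infinite-dimensional, which is exactly why the implicit computation matters.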



Figure 3.3: Non-linearly separable data in a 2D space mapped into a new

feature space where the data are linearly separable.

3.3 Label Propagation Method

Based on an assumption that “If a publication is citing a popular database, the

link relationship is most likely to be dataset relevant (positive)”, a semi-supervised

approach to our link label prediction problem is to think of it as a label propagation

problem. Note, “popular” in our case means having most of its incoming links to

be positive. Given a graph G = (V, E) where nodes V = 1, ..., n represent papers

and dataset sources, and edges E are partially labeled (i.e., only some edges are labeled),

we want to infer the labels of the unlabeled edges using the labeled edges and the

topological proximity relationship of the graph. A broad description of how label

propagation works can be found in section 3.3.1.

In our case, in order to use a label propagation approach, we need a way to

convert our directed graph G = (V, E) to a directed graph G∗ = (V ∗ , E ∗ ) such that

the edges of G are represented as labeled and unlabeled nodes V ∗ and the edges E ∗ represent

similarities between them. These similarities are given by a weight matrix W : Wi,j is

non-zero iff ei and ej have the same target node or the target node of one is the source

node of the other, i.e., the edges have one of the following configurations ({ei,j and ek,j } or

{ei,j and ej,k }). We can then propagate the labels along the directed edges according
to the weight of the links in a random walk approach.

To convert our graph G = (V, E) with N edges to G∗ = (V ∗ , E ∗ )

with N nodes, we define a conversion algorithm (see Algorithm 1). This

algorithm creates a new graph G∗ with the edges of G as its nodes and generates

edges in such a way that two nodes in G∗ are linked only if their edges in G have the

same target node or the source node of one is the target node of the other. Two

nodes in G∗ are linked together with either a similarity score of two if the edges in G

represented by these nodes have the same target node, or a similarity score of one if

the source node of one is the target node of the other. Figure 3.4 shows an example

of a given graph G converted to graph G∗ .

Algorithm 1 Graph Conversion Algorithm

Input: Graph G(V, E)
Output: Graph G∗ = (V ∗ , E ∗ ) and weights W ∗
Vi∗ = ei for i = 1, ..., N
foreach Vi∗ do
    Create and add to E ∗ edges connecting Vi∗ (ei ) to all the incoming and outgoing
    edges Vj∗ (ej ) of the target node of edge ei
    if connecting to an incoming edge then
        Assign a weight W ∗i,j = 2
    else
        Assign a weight W ∗i,j = 1
    end
end
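Algorithm 1 can be sketched in a few lines of Python; the edge list and dictionary representation below are illustrative choices, not part of the algorithm itself:

```python
# Sketch of Algorithm 1: build G* whose nodes are the edges of G, linking two
# G*-nodes with weight 2 when their G-edges share a target node, and weight 1
# when the target of one is the source of the other.
def convert(edges):
    """edges: list of (source, target) pairs of G. Returns {(ei, ej): weight}."""
    w = {}
    for ei in edges:
        for ej in edges:
            if ej == ei:
                continue
            if ej[1] == ei[1]:      # ej is another incoming edge of ei's target
                w[(ei, ej)] = 2
            elif ej[0] == ei[1]:    # ej is an outgoing edge of ei's target
                w[(ei, ej)] = 1
    return w

g = [("a", "v"), ("b", "v"), ("v", "c")]
w = convert(g)
print(w[(("a", "v"), ("b", "v"))])  # -> 2: both edges point to v
print(w[(("a", "v"), ("v", "c"))])  # -> 1: target of one is source of the other
```

The quadratic pairwise loop is for clarity; grouping edges by target node first would make the conversion linear in the number of generated links.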

If our graph were a connected graph, using only the graph produced as output

by Algorithm 1 would be enough. However, our graph has more than one

connected component. For this, we introduce an adjustment: whenever there is

a dead end, we make a random jump to any other connected component. This is

implemented by introducing a random edge to a node in each connected component

CCi in G from a node in another connected component CCj . These edges are given

no label and are generated based on a probability distribution. The probability



(a) Original Graph G    (b) Converted Graph G∗

Figure 3.4: A sample graph G. (a) Original graph G given as input to the algorithm;
(b) resulting graph G∗ produced as output of the algorithm, with weights assigned to its edges.

distribution is made such that the random edge is more likely to be generated from

a node with high in-degree. This preference is made so that the result of the label

propagation algorithm is not affected by the generated edges. Based on the label

propagation algorithm (See Algorithm 2), starting with labeled nodes 1, 2, ..., l and

unlabeled nodes l + 1, ..., n, each node starts to propagate its label to its neighbors

in the manner of a Markov chain on the graph, and the process is repeated until

convergence [46].

3.3.1 Label propagation Theory

Label propagation is a popular graph-based semi-supervised learning framework. The

problem setup is defined as follows: denoting a set of items and their corresponding

labels (x1 , y1 ), ..., (xn , yn ), let X = {x1 , ..., xn } be the set of items and Y = {y1 , ..., yn } the

item labels. Let (x1 , y1 ), ..., (xl , yl ) be the items with known labels YL = y1 , ..., yl , and

(xl+1 , yl+1 ), ..., (xn , yn ) be the items with unknown labels YUL = yl+1 , ..., yn . The aim is to

predict the labels of the unlabeled items from X and YL . To achieve this goal, a graph

G = (V, E) is constructed, where the set of nodes V = x1 , ..., xn is the set of items X and
the weights of the set of edges E show the similarities among the items. Intuitively,

similar nodes should have similar labels. The label of one node is propagated to other

nodes based on the weights of the edges (i.e., labels are more likely to be propagated

through edges with larger weights). Different algorithms have been proposed for label

propagation. These include iterative algorithms [25][24] and Markov random

walks [26].

Iterative Algorithms

Given the graph G, the idea is to propagate labels from labeled nodes to unlabeled

nodes in the graph. Starting with labeled nodes (x1 , y1 ), ..., (xl , yl ) labeled with 1 or −1

and unlabeled nodes (xl+1 , yl+1 ), ..., (xn , yn ) labeled with 0, each node starts to propagate

its label to its neighbors, and the process is repeated until convergence.

Algorithms of this kind have been proposed by Zhu et al. and Zhou et al. [25][24]

(see Algorithms 2 and 3). Labels on the labeled and unlabeled data are denoted by

Ŷ = (Ŷl , Ŷu ). Zhu et al. proposed and proved the convergence of the propagation

algorithm (see Algorithm 2), in which the initial labels for the data points with known

labels (x1 , ..., xl ) are forced on their estimated labels, i.e., Ŷl is constrained to be equal

to Yl . A similar label propagation algorithm (see Algorithm 3),

known as label spreading, was proposed and proved to converge by Zhou et al. [25],

in which at each step a node i gets a contribution from its neighbors j (weighted by the

normalized weight of the edge (i, j)) and an additional small contribution given by its

initial value [46]. In general, we can expect the convergence rate of these two algorithms

to be at worst on the order of O(kn²), where k is the number of neighbors of a point

in the graph. In the case of a dense weight matrix, the computational time is thus

O(n³) [46].

Algorithm 2 Label Propagation, Zhu et al. 2002

Input: Graph G(V, E), labels Yl
Output: Labels Ŷ
Compute Dii = Σj Aij
Compute P = D^{-1} A
Initialize Y^0 = (Yl , 0), t = 0
while Y^t hasn't converged do
    Y^{t+1} ← P Y^t
    Yl^{t+1} ← Yl
    t ← t + 1
end
Ŷ ← Y^t

Algorithm 3 Label Spreading, Zhou et al. 2004

Input: Graph G(V, E), labels Yl
Output: Labels Ŷ
Compute Dii = Σj Aij
Compute S = D^{-1/2} A D^{-1/2}
Initialize Y^0 = (Yl , 0), t = 0
Choose a parameter α ∈ [0, 1)
while Y^t hasn't converged do
    Y^{t+1} ← α S Y^t + (1 − α) Y^0
    t ← t + 1
end
Ŷ ← Y^t
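Algorithm 2 can be sketched compactly with NumPy; the four-node chain graph below is a made-up example, with its two endpoints labeled +1 and −1:

```python
# Compact sketch of Algorithm 2 (Zhu et al.): iterate Y^{t+1} = P Y^t while
# clamping the labeled rows back to their known values, until convergence.
import numpy as np

def label_propagation(A, y, labeled, tol=1e-6, max_iter=1000):
    """A: weight/adjacency matrix; y: initial labels (+1/-1, 0 = unknown);
    labeled: boolean mask of clamped nodes. Returns signs of the final labels."""
    P = A / A.sum(axis=1, keepdims=True)   # P = D^{-1} A
    Y = y.astype(float).copy()
    for _ in range(max_iter):
        Y_new = P @ Y
        Y_new[labeled] = y[labeled]        # clamp known labels
        if np.abs(Y_new - Y).max() < tol:
            break
        Y = Y_new
    return np.sign(Y)

# chain 0 - 1 - 2 - 3 with the two ends labeled +1 and -1
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
y = np.array([1, 0, 0, -1])
print(label_propagation(A, y, np.array([True, False, False, True])))
```

Each unlabeled node converges to the sign of the nearer labeled endpoint, which is the intuition behind propagating edge labels through our converted graph G*.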
Markov Random Walks

Szummer et al. [26] took a different approach for label propagation on a similarity

graph by implementing a Markov random walks approach. To estimate probabilities

of class labels, they defined the transition probabilities of the Markov random walks on

the graph from i to k as:

pi,k = Wi,k / Σj Wi,j ,                                          (3.9)

where the weight Wi,j is given by a Gaussian kernel for neighbors and 0 for

non-neighbors, and Wi,i = 1 (but one could also use Wi,i = 0) [46]. They assume that the

starting point of the Markov random walk is chosen uniformly at random, i.e., P (i) =

1/N . A probability P (y = 1 | i) of being of class 1 is associated with each data point xi .

The probability P^t (ystart = 1 | j) that the walk started from a point of class

ystart = 1, given that it arrived at xj after t steps of the random walk, is given by:

P^t (ystart = 1 | j) = Σ_{i=1}^{n} P (y = 1 | i) P0|t (i | j),   (3.10)

where P0|t (i | j) is the probability that we started from xi given that we arrived at

j after t steps of random walk (this probability can be computed from the pi,k ) [26].

If P^t (ystart = 1 | j) > 0.5, then xj is classified as 1; otherwise, it is classified as

−1. They proposed two techniques for estimating the unknown parameters P (y | i):

maximum likelihood with expectation-maximization (EM), and maximum margin

subject to constraints. However, this algorithm’s performance is largely dependent

on the length of the random walk t. This parameter can be chosen heuristically (i.e.,

to the scale of the clusters we are interested in) or by cross-validation (if enough data

are available) [46].
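The row normalization of Eq. (3.9) can be sketched numerically on a toy weight matrix (the weights here are invented, standing in for Gaussian-kernel similarities):

```python
# Tiny numeric sketch of Eq. (3.9): row-normalize a weight matrix W into the
# transition probabilities p_{i,k} of the Markov random walk.
import numpy as np

W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])   # toy similarities, W_ii = 1
p = W / W.sum(axis=1, keepdims=True)
print(np.round(p, 3))  # each row sums to 1
```

Powers of this matrix give the t-step walk probabilities used in Eq. (3.10).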


3.4 PageRank-based Method

Here we consider another semi-supervised approach for predicting the missing link

labels in our dataset. In this method, we approach our link label prediction problem

as a task to rank the nodes in the network. The idea is to design an algorithm that

will assign higher scores to nodes with more positive in-degree than those with more

negative in-degree. Here, we rank the nodes using the PageRank algorithm, then we

find a threshold separating the nodes such that an edge directed towards any of the

nodes with a rank score above the threshold would be labeled as a positive edge and

an edge directed towards a node below the threshold would be labeled as a negative

edge. A broad description of how PageRank works can be found in section 3.4.1

below.

We define an approach: Given a graph G, we construct a Markov model that

represents the graph as a sparse square matrix M whose element Mu,v is the

probability of moving from node u to node v in one time step. For instance, using the

graph in Figure (3.6), our M would be the matrix P (see Section 3.4.1). We

then compute the adjustments to make our matrix irreducible and stochastic (See

Section 3.4.1). The PageRank scores are then calculated using a modified version of

the quadratic extrapolation algorithm. The quadratic extrapolation algorithm

accelerates the convergence of the power method [47]. We modified the algorithm such

that:

ri = Σ_{j∈Li}  { +rj /Nj   if edge (Pj , Pi ) is positive,
               { −rj /Nj   if edge (Pj , Pi ) is negative.       (3.11)

Equation 3.11 is set such that incoming negative links will decrease the rank score of

a node while incoming positive links will increase the rank score of a node. An initial

rank score of 1/N was assigned to each node (N is the total number of nodes). Figure
(3.5) shows the PageRank score distribution of both the original PageRank algorithm

and the modified PageRank algorithm. Another important thing in this approach is

the threshold. The performance of this approach relies on a good choice of threshold.

For our network, we chose the 85th percentile to be the threshold.
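One propagation step of the modified rule in Eq. (3.11) can be sketched as follows; the signed edge list is made up, and the full method iterates this to convergence with the stochasticity/irreducibility adjustments of Section 3.4.1 and the quadratic-extrapolation acceleration, which are omitted here:

```python
# Hedged sketch of the signed PageRank update of Eq. (3.11): positive
# in-edges add r_j / N_j to a node's score, negative in-edges subtract it.
import numpy as np

def signed_step(edges, r):
    """edges: {(j, i): sign in {+1, -1}}; r: current rank scores."""
    n = len(r)
    out_deg = np.zeros(n)
    for (j, _i), _s in edges.items():
        out_deg[j] += 1
    r_new = np.zeros(n)
    for (j, i), s in edges.items():
        r_new[i] += s * r[j] / out_deg[j]
    return r_new

edges = {(0, 1): 1, (2, 1): 1, (3, 4): -1, (1, 4): -1}
r = signed_step(edges, np.full(5, 1 / 5))
print(r)                     # node 1 gains score, node 4 loses score
print(np.percentile(r, 85))  # candidate threshold on the final scores
```

After convergence, an unlabeled edge pointing into a node whose score exceeds the threshold is labeled positive, and negative otherwise.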

3.4.1 PageRank Theory

The PageRank algorithm, which was first introduced by Brin and Page [20], is one of the

algorithms used by Google to rank their search engine results. The main idea behind

the PageRank algorithm is that the importance of any web page can be determined

based on the pages that link to it. It was intuitively explained using the concept

of an infinitely dedicated web surfer randomly going from page to page by choosing

a random link from one page to get to another. The PageRank of page(i) is the

probability that a random web surfer visits page(i). However, this random surfer can

end up in a loop (cycles around cliques of interconnected pages) or hit a dead end (a

web page with no outgoing links). Therefore, in order to address these problems, an

adjustment was made such that with a certain probability, the random surfer jumps

to a random web page. This random walk is known as a Markov process. Based on the

principle of PageRank, if we include a hyperlink to the web page(i) in our Data mining

lab site, this means that we consider page(i) important and relevant to the topic being

discussed on our site. If lots of other web pages link also to page(i), the logical belief is

that page(i) is important (contains some important news or information). However,

if page(i) has only one backlink, which comes from a very popular site page(j),

(like www.huffingtonpost.com, www.kaust.edu.sa, www.tmz.com) we say that page(j)

asserts that page(i) is important. With this understanding, we can say that PageRank

algorithm is like a counter of an online ballot, where pages vote for the importance

of other pages, and this result is then gathered by PageRank and is reflected in the

search results.

Figure 3.5: PageRank scores. (a) Resulting rank scores of the normal PageRank
algorithm; (b) resulting rank scores of our method, where the red line shows the
threshold.

For any given graph, PageRank is well-defined and can be used for

capturing quantitative correlations between pairs of vertices as well as pairs of subsets

of vertices.

The Algorithm

The original PageRank algorithm described by Brin et al. [20] is given by:

P R(A) = (1 − d) + d ( P R(T1 )/C(T1 ) + ... + P R(Tn )/C(Tn ) ) (3.12)

where P R(A) is the PageRank of page A, P R(Ti ) is the PageRank of the pages Ti which

link to page A, C(Ti ) is the number of outbound links on page Ti , and d is the damping

factor, which can be set between 0 and 1. The damping factor d is usually set to 0.85. This

means the amount of PageRank that a page has to vote will be its own value × 0.85,

and this vote is shared equally among all the pages it links to. To get accurate results,

this process is performed iteratively until it converges. The PageRank ri , i ∈ 1, 2, ..., n for

page Pi , i ∈ 1, 2, ..., n can be recursively defined as:

ri = Σ_{j∈Li} rj /Nj ,   i = 1, 2, ..., n,                       (3.13)

where Nj is the number of outlinks from page Pj and Li is the set of pages that link to page

Pi .

Operation On Matrix

Let us consider Figure (3.6), which is a graphical representation of a small web

consisting of six web pages. The graph contains six nodes, each of which represents a web

page, with the links showing the relationships between the web pages.

Figure 3.6: A directed graph representing a small web of six web pages.

Here, a hyperlink matrix P is introduced, which is a row-normalized hyperlink matrix with Pi,j = 1/|Pi | if

node(i) has a link to node(j) and 0 otherwise. At each iteration, a PageRank single

vector π^T containing all the PageRank values is calculated:

π^{(k+1)T} = π^{(k)T} P                                          (3.14)

where π is the PageRank vector and is usually initialized with πi = 1/N for a web of N

pages. The matrix P of the small web in Figure (3.6) is given by:

P =
        P1    P2    P3    P4    P5    P6
P1       0     0     0     0     0     0
P2      1/2    0    1/2    0     0     0
P3      1/3   1/3    0    1/3    0     0
P4       0     0     0     0     1     0
P5       0     0     0    1/2    0    1/2
P6       0     0     0    1/2   1/2    0

However, using just the hyperlink structure in building the Markov matrix is not

enough because of the presence of dangling nodes. Dangling nodes are nodes with no

outgoing edges, like P1 in our example; thus, P is not stochastic. These dangling nodes

appear very often on the web. There are two proposed adjustments to deal with this

problem. The first adjustment is to make the matrix stochastic. This is formulated

mathematically as:

S = P + a ( (1/N ) e^T )                                         (3.15)

where

ai = 1 if Pi is a dangling node, and 0 otherwise.

With this adjustment, our matrix will look like this:

S =
        P1    P2    P3    P4    P5    P6
P1      1/6   1/6   1/6   1/6   1/6   1/6
P2      1/2    0    1/2    0     0     0
P3      1/3   1/3    0    1/3    0     0
P4       0     0     0     0     1     0
P5       0     0     0    1/2    0    1/2
P6       0     0     0    1/2   1/2    0

However, we need the matrix to be irreducible so as to ensure the existence of the stationary vector of the chain (the PageRank vector) [48]. Therefore, to make it irreducible, another adjustment is made:

G = αS + (1 − α)E    (3.16)

where α is a scalar between 0 and 1 and E = (1/N) ee^T is the teleportation matrix. Setting α = 0.85 can be explained as the random surfer following the hyperlink structure of the web 85% of the time and jumping to a random new page 15% of the time.
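The full construction can be sketched in a few lines of Python. The six-page link structure below is an illustrative assumption (only P1's lack of outlinks is taken from the example, so it is not necessarily the exact graph of Figure 3.6); the α = 0.85 and uniform starting vector follow the text above:

```python
# Sketch of PageRank with both adjustments (Eqs. 3.14-3.16), in pure Python.
# The link structure is illustrative: page 0 (P1) is dangling, as in the text.
N = 6
links = {0: [], 1: [0, 2], 2: [0, 1, 3], 3: [4], 4: [3, 5], 5: [3, 4]}

# Row-normalized hyperlink matrix P: P[i][j] = 1/|outlinks(i)| if i links to j.
P = [[0.0] * N for _ in range(N)]
for i, outs in links.items():
    for j in outs:
        P[i][j] = 1.0 / len(outs)

# Adjustment 1 (Eq. 3.15): replace each dangling row with uniform 1/N -> S.
S = [[1.0 / N] * N if not links[i] else row[:] for i, row in enumerate(P)]

# Adjustment 2 (Eq. 3.16): Google matrix G = alpha*S + (1 - alpha)*E.
alpha = 0.85
G = [[alpha * S[i][j] + (1 - alpha) / N for j in range(N)] for i in range(N)]

# Power iteration (Eq. 3.14), starting from the uniform vector pi_i = 1/N.
pi = [1.0 / N] * N
for _ in range(100):
    pi = [sum(pi[i] * G[i][j] for i in range(N)) for j in range(N)]
# pi now approximates the stationary PageRank vector; its entries sum to 1.
```

Since every row of G sums to 1 and all entries are positive, the iteration converges to a unique positive vector regardless of the starting point.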

Chapter 4

Experimental Results

In this chapter, we present and discuss our dataset and the results of our experiments, compare the proposed approaches, evaluate their predictive performance on our dataset, and analyze the community structure of dataset-relevant citations.

4.1 Data Collection

The dataset used in this thesis comprises papers presented in three major conferences: the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), and the ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD). The dataset spans from 2001 until 2014 and was crawled from the DBLP computer science repository website: http://dblp.uni-trier.de/db/conf/icdm/index.html, http://dblp.uni-trier.de/db/conf/sdm/index.html, and http://dblp.uni-trier.de/db/conf/kdd/index.html. The main information in the collected dataset includes paper title, authors, abstract, and references. Figure 4.1 shows the number of papers collected for each conference from 2001 to 2014. To use this dataset efficiently in our work, we split the papers and their references and present them as separate entries, thus building an index over the whole dataset. For each paper, we create node pairs (u, vi), i = 1, 2, ..., n, where u is the paper, vi is a reference in the paper, and n is the total number of references in the paper.
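This index-building step can be sketched as follows; the record fields and titles are illustrative assumptions, not the actual crawl format:

```python
# Sketch: split each crawled paper record into (u, v_i) citation pairs,
# assigning one node id per distinct title so shared references collapse.
papers = [
    {"title": "Paper A", "references": ["Ref 1", "Ref 2"]},
    {"title": "Paper B", "references": ["Ref 2", "Ref 3"]},
]

node_index = {}   # title -> node id
edges = []        # directed citation edges (u, v_i)

def node_id(title):
    # Assign each distinct paper/reference a single node id.
    if title not in node_index:
        node_index[title] = len(node_index)
    return node_index[title]

for paper in papers:
    u = node_id(paper["title"])
    for ref in paper["references"]:
        edges.append((u, node_id(ref)))

# For this toy input: 5 distinct nodes and 4 directed edges,
# with "Ref 2", shared by both papers, mapped to a single node.
```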

Figure 4.1: The number of collected papers for each of the three conferences from 2001 to 2014: (a) the IEEE International Conference on Data Mining (ICDM), (b) the SIAM International Conference on Data Mining (SDM), and (c) the ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD).

4.2 Dataset description

The obtained dataset is presented as a citation network, with nodes representing papers and their references, and directed links representing the citation relationships. In total, there are 4,830 papers obtained from the three conferences. On average, each paper references one dataset source, although some papers reference more than one dataset source and others reference none. Each link is labeled as positive or negative based on the nature of the reference: if the reference is based on the dataset used in the paper, it is labeled as positive; otherwise, it is labeled as negative. The labels were obtained by a manual process of browsing through the papers to check which of their references are dataset related. The network contains 51,680 nodes and 101,503 edges, of which approximately 90% are negative. The full citation network is shown in Figure 4.3. We can see that the network is made up of one large connected component and a number of smaller connected components. In total, there are 63 connected components. Information about the network can be seen in Table 4.1.

Table 4.1: Directed Network Information

Number of nodes: 51,680
Number of edges: 101,503
Connected components: 63
Number of nodes with out-links: 4,830


Figure 4.2: Distribution of the number of citations (y-axis: number of citations; x-axis: papers and data sources, indexed, ×10^4).

Figure 4.3: Full citation graph.

The number of citations of each node in our network (Figure 4.3) can be seen in Figure 4.2, with the most cited paper in our dataset being Latent Dirichlet Allocation - Blei et al. The nodes in Figure 4.2, represented by their indices, are sorted in descending order according to their citation count in our dataset. We can see in Figure 4.2 that the dataset is mainly made up of lowly-cited papers and dataset sources. The highly-cited dataset sources in our dataset (with positive in-degree greater than 20 and based on different citations), listed according to their citation count, are:

1. UCI Repository of Machine Learning Databases - Blake and Merz

2. UCI Repository of Machine Learning Databases - Asuncion and Newman

3. UCI Repository of Machine Learning Databases - Frank and Asuncion

4. The UCI KDD Archive - S. D. Bay

5. GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al.

6. Gradient-based Learning Applied to Document Recognition - LeCun et al.

The UCI Repository of Machine Learning Databases [49] is a machine learning dataset repository maintained by the University of California, Irvine. This repository, which was created in 1987, currently comprises 348 datasets that encompass a wide variety of application areas and data types from a broad range of problem areas, including medicine, engineering, politics, finance, molecular biology, etc.

The UCI KDD Archive [50], published in 1999, is similar to the UCI Repository of Machine Learning Databases. However, its aim is to store large datasets that can challenge existing algorithms in terms of scaling as a function of the number of individual samples (or examples, or objects). This archive also spans a wide variety of problem tasks and data types.

GroupLens [51] is a research paper published in 1994. It introduced a system integrated into the Usenet news architecture: a distributed system for gathering, distributing, and using ratings from some users to predict other users' interest in articles. The news clients can be freely substituted, providing an environment for experimentation in predicting ratings and in user interfaces for collecting ratings and presenting predictions [51].

Gradient-based Learning Applied to Document Recognition is a paper on neural networks and handwriting recognition. In this paper, the authors created a new database called the Modified NIST dataset, or MNIST, comprising over 120,000 handwritten digit images. They also collected datasets of over 500,000 hand-printed characters (comprising upper-case letters, lower-case letters, digits, and punctuation) spanning the entire printable ASCII set, and over 3,000 words.
4.3 Results

The network that we study is overwhelmingly unbalanced, with most of the nodes having very low in-degree. Thus, we evaluate the three different algorithms by randomly sampling, as the test set, 10% of the edges incident to each node with high in-degree. The evaluation results are obtained by running the evaluation five times with random samples and then taking the average. Since our aim is to make predictions for new papers, we also set aside 300 papers as unseen data to be used to verify the performance of the algorithms. We measure the performance of our methods based on the precision, recall, and AUC (area under the ROC curve) of the algorithms, and also show their ROC curves. For the SVM method, we evaluate 5 different approaches with different feature sets:

1. SVM method with all 12 features, see Section 3.2.2 (noted as “Degree & Embedded” in Figure 4.4)

2. SVM method with the 9 features from node degree information (noted as “Degree” in Figure 4.4)

3. SVM method with the 3 features from mutual neighbors (noted as “Embedded” in Figure 4.4)

4. SVM method with 28 features: our 9 degree features and the 16 structural balance features presented in [7] (noted as “Degree & 16 Triads” in Figure 4.4)

5. SVM method with only the 16 structural balance features presented in [7] (noted as “16 Triads” in Figure 4.4)

In addition, we have the following:

1. Label Propagation method (noted as “Propagation” in Figure 4.4)

2. PageRank method (noted as “PageRank” in Figure 4.4)
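The evaluation protocol described above (five runs over fresh random 10% samples, with averaged scores) can be sketched as follows; the scorer here is a random stand-in for the actual methods, and the toy edge labels mirror the roughly 90/10 class imbalance of the network:

```python
import random

random.seed(0)

# Toy labeled edges: True = dataset-relevant (positive), about 10% of edges.
edges = [(i, i + 1) for i in range(1000)]
labels = {e: (i % 10 == 0) for i, e in enumerate(edges)}

def predict(edge):
    # Stand-in scorer; the thesis plugs in SVM, label propagation, or PageRank.
    return random.random()

def precision_recall(test_edges, threshold=0.5):
    preds = {e: predict(e) >= threshold for e in test_edges}
    tp = sum(1 for e in test_edges if preds[e] and labels[e])
    fp = sum(1 for e in test_edges if preds[e] and not labels[e])
    fn = sum(1 for e in test_edges if not preds[e] and labels[e])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Five runs, each on a fresh random 10% sample, then average the scores.
runs = [precision_recall(random.sample(edges, len(edges) // 10))
        for _ in range(5)]
avg_precision = sum(p for p, _ in runs) / len(runs)
avg_recall = sum(r for _, r in runs) / len(runs)
```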

The results shown in Figure 4.4 compare the AUC, precision, and recall of these different approaches. We can see that using just triad information or information obtained from mutual neighbors produces worse results since, as noted by [7], the triad features are only relevant when two nodes have neighbors in common. They are therefore expected to be most effective for edges of greater embeddedness; however, the network we study has very few edges with high embeddedness.

The ROC curves, comparing the 7 different approaches, are shown in Figure 4.5. We can see that the methods relying on the embeddedness of the network performed worse than the rest. The propagation method in general performed well, while the PageRank method performed worse than the SVM methods. However, in this thesis work, our aim is to obtain a high number of true positives with few false positives; hence, the region of the curves corresponding to low false-positive rates is of primary interest. In Figure 4.5, the SVM method based on the degree and structural balance features performs better than the other methods if a false-positive threshold of 0.1 is selected. The PageRank method performs worse than both the label propagation and SVM methods in the region of the curves corresponding to low false-positive rates, as well as over the full curves.

To evaluate the performance of our algorithms on the unseen papers, we use all the papers in the training and testing sets for training. We then predict the dataset-relevant links in the unseen dataset. Figure 4.6 shows the results of the prediction using the different algorithms. For SVM, we report the best result obtained from the different feature sets. This comparison is interesting, as it shows how the different methods respond to unseen data. We can see that the label propagation method

Figure 4.4: Performance results of the evaluated approaches: (a) average AUC, (b) average precision, (c) average recall.


Figure 4.5: ROC curves (true-positive rate versus false-positive rate) of the different approaches: Embedded, 16 Triads, Degree & 16 Triads, Degree & Embedded, Degree, Propagation, and PageRank.

Figure 4.6: Results of the evaluation on unseen papers.

outperforms the other methods, even on the unseen data, as it was able to predict more dataset-relevant links with fewer errors.


4.4 Community Network Analysis

In many networks, communities can be found as groups of vertices where the connections among vertices within a community are denser than those between communities. Various algorithms have been developed in recent years for detecting this structure. In our work, we are interested in analyzing the citation structure of data-reference papers. Analyzing the dataset communities can give us further information about the datasets, such as how the datasets or dataset sources affect the structure of the network, which datasets are more useful for a particular topic, what the prevalent topics within the communities are, and which people are more likely to use a dataset. We used the GLay community detection algorithm by Su et al. [52], implemented in clusterMaker [53], a tool for community detection in Cytoscape.

Table 4.2: Dataset Network Information

Number of nodes: 5,701
Number of edges: 4,627
Connected components: 1,265

Taking out all the non-dataset-related links from our network, we obtained a new network with only dataset-relevant links. Information about the new network can be seen in Table 4.2. Running the community detection algorithm and selecting communities with more than five nodes, in order to focus on fairly large communities, we obtained a total of 128 communities.
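The filtering step can be sketched as follows. GLay itself runs inside Cytoscape, so as a simple stand-in this sketch groups the positive (dataset-relevant) subgraph into connected components and keeps those with more than five nodes; the toy edge list is illustrative:

```python
from collections import defaultdict

# Toy edge list: (u, v, label), where True marks a dataset-relevant link.
edges = [
    (0, 1, True), (1, 2, True), (2, 3, True),
    (3, 4, True), (4, 5, True), (4, 0, True),
    (6, 7, True),                       # a small positive component
    (8, 9, False), (9, 10, False),      # non-dataset links, dropped below
]

# Keep only dataset-relevant links, building an undirected adjacency.
adj = defaultdict(set)
for u, v, positive in edges:
    if positive:
        adj[u].add(v)
        adj[v].add(u)

def components(adj):
    # Plain depth-first connected components (stand-in for GLay clusters).
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Focus on fairly large communities: more than five nodes.
large = [c for c in components(adj) if len(c) > 5]
```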

The three largest communities are shown in Figures 4.9, 4.10, and 4.11. We can see that the first and second largest communities (Figures 4.9 and 4.10) are largely made up of papers citing the UCI dataset repository. Although these papers might be using similar datasets, the repository is cited differently, according to the UCI repository librarians in charge of the repository at the time of use. The UCI dataset repository is one of the most popular repositories used by the machine learning community and comprises datasets of different types and varieties. The third largest community (Figure 4.11) consists of two main dataset sources, which were respectively introduced in two research papers:

• RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al., which introduced Reuters Corpus Volume I (RCV1), an archive of over 800,000 manually categorized newswire stories made available by Reuters, Ltd. for research purposes.

• NewsWeeder: Learning to Filter Netnews - Lang et al., which introduced NewsWeeder, a netnews-filtering system that collects user ratings on online news articles. The collected rating information is then used to learn a model of the users' interests.

In addition to the three largest communities, two other communities, shown in Figures 4.12 and 4.13, are analyzed. The dataset sources in the fourth community are mainly based on the use and behavior of online articles. These datasets were introduced and discussed in the following research papers:

• GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al., which introduced the GroupLens system. This system makes article predictions based on ratings obtained from Usenet news users.

• GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al., a further discussion of the GroupLens system.

• Cascading Behavior in Large Blog Graphs - Leskovec et al., which extracted 2,422,704 posts from 44,362 blogs from a larger dataset [54] containing 21.3 million posts from 2.5 million blogs from August and September 2005.
The fifth community mainly comprises citations of time-series dataset sources. Although the UCR time series repository is cited in different formats, some of the papers with different citation formats might be using similar datasets. This is due to the citation policy of the repository (similar to that of the UCI dataset repository), which requires citing the repository's librarians at the time of use. The main dataset sources (according to the different citation formats) in this community are:

• The UCR Time Series Data Mining Archive - Keogh and Folias

• The UCR Time Series Classification/Clustering Homepage - Keogh, Xi, Wei,

and Ratanamahatana

• www.cs.ucr.edu - Keogh

• Emergence of Scaling in Random Networks - Barabasi et al., which evaluated its work on the electrical power grid data of the western US (4,941 vertices), the collaboration graph of movie actors (212,250 vertices), and the world wide web network (325,729 vertices).

• Exact Discovery of Time Series Motifs - Abdullah et al., which built and made publicly available a web page containing all the time series datasets and code used in the work.

To further analyze these communities, we first look into the research topics of the papers in each community. A total of 20,542 topics in the fields of machine learning and data mining were crawled from the Microsoft Academic website [55], because the papers studied in this thesis are from conferences in these two areas. We then calculate the frequency of each topic in the titles of the papers represented by the nodes in a given community. The popular topics in each community are the frequent ones (with the frequency threshold varying from community to community). Figures 4.7 and 4.8 show the topic clouds of the three largest communities and the two selected communities.
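The topic-frequency step can be sketched as follows; the topic list and community titles are illustrative stand-ins for the 20,542 crawled topics and the actual paper titles:

```python
from collections import Counter

topics = ["clustering", "time series", "classification", "collaborative filtering"]
community_titles = [
    "Scalable Clustering of Time Series Data",
    "Time Series Classification with Shapelets",
    "Clustering High-Dimensional Streams",
]

# Count how often each topic phrase appears in the community's paper titles.
freq = Counter()
for title in community_titles:
    lower = title.lower()
    for topic in topics:
        if topic in lower:
            freq[topic] += 1

# The most frequent topics become the large words in the topic cloud.
popular = [t for t, _ in freq.most_common(2)]
```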

Figure 4.7: Topic clouds of the three largest communities: (a) Community 1, (b) Community 2, (c) Community 3.

The largest community (Community 1) has three main sources of datasets: the UCI dataset repository, the UCI KDD archive, and WebACE, an agent for exploring and categorizing documents on the world wide web. WebACE was introduced in the research paper WebACE: A Web Agent for Document Categorization and Exploration - Han et al. The largest dataset source in this community is the UCI dataset repository. This is expected due to the vast number of datasets in the repository. However, due to the repository citation policy of both the UCI dataset repository and the UCI KDD archive, the exact datasets used in the papers cannot be easily determined by looking at the repositories (or the citations). The topic cloud confirms that the repositories

Figure 4.8: Topic clouds of the two selected communities: (a) Community 4, (b) Community 5.

span a wide variety of data types and application areas.

The second largest dataset community (Community 2) has one main dataset source, the UCI dataset repository - A. Asuncion and D. Newman. This community's topic cloud also confirms the wide range of datasets and application areas of the UCI dataset repository.

The third largest dataset community (Community 3) has four main dataset sources, introduced in RCV1: A New Benchmark Collection for Text Categorization Research - Lewis et al., NewsWeeder: Learning to Filter Netnews - Lang et al., Gradient-based Learning Applied to Document Recognition - Lecun et al., and Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring - Golub et al., respectively. We can see (based on the descriptions of the dataset sources above) that three of the largely cited dataset sources in this community are relevant to text (or character) recognition and classification; hence the large number of papers on text-related topics. However, due to the fourth dataset source, the community also comprises a fair number of papers on gene expression.

The fourth community (Community 4) has two main dataset sources, introduced and discussed in three research papers: GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al., Cascading Behavior in Large Blog Graphs - Leskovec et al., and GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al. We can see in Figure 4.8 that this community mainly contains papers on collaborative filtering. This is because the main dataset source in this community is the GroupLens system, which uses collaborative filtering to predict (and recommend) Usenet news articles based on obtained user ratings. The second dataset source, however, is a collection of posts and articles from blogs.

The fifth community (Community 5) has three main dataset sources. The first is the UCR time series data mining archive, a repository of time series datasets. The second, introduced in Exact Discovery of Time Series Motifs - Abdullah et al., is also made up of time series datasets, while the third contains datasets from different networks. Since two of the three main dataset sources in this community consist of time series datasets, the community has a high number of time series papers.

To further analyze the predicted dataset-relevant links in the unseen data, we ran the community-detection process with the correctly predicted links (using the label propagation method) included, and analyzed the distribution of the links in the communities. Of the 109 predicted links, 68 fell in the 15 largest communities (those with more than 53 papers), with 19 of the 68 links in the largest community.

Figure 4.9: The largest community. (a) UCI dataset repository - C.L. Blake and C.J. Merz, (b) UCI KDD archive - S.D. Bay, (c) WebACE: A Web Agent for Document Categorization and Exploration - Han et al.

Figure 4.10: The second largest community. (a) UCI dataset repository - A. Asuncion
and D. Newman

Figure 4.11: The third largest community. (a) RCV1: A New Benchmark Collection
for Text Categorization Research - Lewis et al. (b) NewsWeeder: Learning to Filter
Netnews - Lang et al. (c) Gradient-based Learning Applied to Document Recognition
- Lecun et al. (d) Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring - Golub et al

Figure 4.12: The fourth community. (a) GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al. (b) Cascading Behavior in Large Blog Graphs - Leskovec et al. (c) GroupLens: Applying Collaborative Filtering to Usenet News - Konstan et al.

Figure 4.13: The fifth community. (a) The UCR Time Series Data Mining Archive - Keogh and Folias (b) Emergence of Scaling in Random Networks - Barabasi et al. (c) The UCR Time Series Classification/Clustering Homepage - Keogh et al. (d) Exact Discovery of Time Series Motifs - Abdullah et al. (e) www.cs.ucr.edu - Keogh

Chapter 5

Conclusion and Future Work

5.1 Conclusion

We studied the link label prediction problem and proposed three different approaches for our citation graph. The first approach, a supervised method, uses Support Vector Machines (SVM) to predict link labels based on generated features. The second approach applies a label propagation method on a transformed graph where nodes are citation relations and edges indicate similarities between citations. This similarity is based on their mutual nodes in the original graph. The known node labels are then propagated to the nodes with unknown labels. The third approach applies a PageRank method to assign scores to each node. Nodes with a high PageRank value are classified as dataset-relevant nodes and nodes with a lower PageRank value are classified as non-dataset-relevant nodes. Results from the experiments show that the label propagation method performs the best. Although the supervised method performed slightly better than the PageRank method, the supervised method using SVM is computationally demanding and needs several preprocessing steps, such as generating and choosing the best features, the best parameters, and the SVM kernel to use. Cross-validation can be employed to tune the parameters and choose the kernel function; however, the computation cost remains an issue to be resolved later. Therefore, the label propagation approach is the best choice, due to its lower demand for pre-processing work and its lower computation cost.

We also analyzed our citation network to find the communities that are dataset relevant. After obtaining these dataset communities, we tried to understand them better by studying their data sources. For this study, we examined five communities. We found that by looking at the data sources available in a dataset community, one can understand, to a certain degree, the community and its main topic areas.

5.2 Future Work

There are several future research directions based on this work. First, the current network is composed of papers from only three conferences. The results presented in this thesis can be improved by including more conferences and papers; adding more data would improve the prediction accuracy of the algorithms. However, as real-world data contain some noise, another possible approach would be to use the label spreading algorithm (Algorithm 3) [25], as it allows changes to the initial labeling to keep the labels consistent across the network. A disadvantage of using this method on our citation network is that although enforcing consistency might remove some noise, it would also change some true labels. For instance, if node u has an in-degree of five, four of whose edges have a true label of negative and one positive (i.e., one paper cites u because of the dataset used in it), then making the labels consistent among the edges would modify the label on the positive edge to negative, thereby producing a false label. A future research direction is to find a way to solve this problem in order to produce a cleaner dataset.

Second, we would like to improve the predictive accuracy of our methods on unseen papers. This will include using additional information. The analyses in this thesis are based mainly on the topological information of the graph, plus some minor additional heuristic and title information. An extension would be to use more side information, such as the sections in which the references are mentioned in the papers, the sentences in which the references are mentioned, the abstracts, etc.

Finally, another interesting study would be to analyze the affiliations of authors in these three conferences and how they change over time. This might include examining the number of authors with new affiliations taking part in the conferences, or the growth in the number of authors whose affiliations already take part in the conferences. It would also be interesting to study the increase (or decrease) in the number of authors within each affiliation over time, starting from its first involvement in any of the conferences.

REFERENCES

[1] M. Jacovi, V. Soroka, G. Gilboa-Freedman, S. Ur, E. Shahar, and N. Marmasse, “The chasms of CSCW: A citation graph analysis of the CSCW conference,” in Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work, ser. CSCW '06. New York, NY, USA: ACM, 2006, pp. 289–298. [Online]. Available: http://doi.acm.org/10.1145/1180875.1180920

[2] Y. An, J. Janssen, and E. E. Milios, “Characterizing and mining the citation graph of the computer science literature,” Knowledge and Information Systems, vol. 6, no. 6, pp. 664–678, 2004. [Online]. Available: http://dx.doi.org/10.1007/s10115-003-0128-3

[3] W. Lu, J. Janssen, E. Milios, N. Japkowicz, and Y. Zhang, “Node similarity in the citation graph,” Knowledge and Information Systems, vol. 11, no. 1, pp. 105–129, 2006. [Online]. Available: http://dx.doi.org/10.1007/s10115-006-0023-9

[4] J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg, “Signed networks in social media,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2010, pp. 1361–1370.

[5] M. Burke and R. Kraut, “Mopping up: Modeling wikipedia promotion decisions,”
in Proceedings of Conference on Computer-Supported Cooperative Work, 2008,
pp. 27–36.

[6] B. Korte and J. Vygen, Combinatorial Optimization: Theory and Algorithms,


4th ed. Springer Publishing Company, Incorporated, 2007.

[7] J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg, “Predicting positive and negative links in online social networks,” in Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 641–650.
[8] K.-Y. Chiang, N. Natarajan, A. Tewari, and I. S. Dhillon, “Exploiting longer cycles for link prediction in signed networks,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management, Glasgow, Scotland, 2011, pp. 1157–1162.

[9] C. Kadushin, Understanding social networks: Theories, concepts, and findings.


Oxford University Press, 2012.

[10] F. Heider, “Attitudes and cognitive organization,” Journal of Psychology, vol. 21,
pp. 107–112, 1946.

[11] D. Cartwright and F. Harary, “Structural balance: A generalization of heider’s


theory,” Psych. Rev., vol. 63, pp. 277–293, 1956.

[12] R. V. Guha, R. Kumar, P. Raghavan, and A. Tomkins, “Propagation of trust


and distrust,” in Proceedings of the 13th International Conference on World
Wide Web, New York, 2004, pp. 403–412.

[13] J. A. Davis and S. Leinhardt, “The structure of positive interpersonal relations


in small groups,” Sociological Theories in Progress 2, pp. 218–251, 1972.

[14] J. A. Davis, “Structural balance, mechanical solidarity, and interpersonal relations,” American Journal of Sociology, vol. 68, no. 4, pp. 444–462, 1963.

[15] D. Song and D. A. Meyer, “Link sign prediction and ranking in signed directed
social networks,” Social Network Analysis and Mining, 2015.

[16] P. Agrawal, V. K. Garg, and R. Narayanam, “Link label prediction in signed social networks,” in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013, pp. 2591–2597.

[17] “DBLP bibliography.” [Online]. Available: http://dblp.uni-trier.de

[18] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, “The eigentrust algo-


rithm for reputation management in p2p networks,” in Proceedings of the 12th
International World Wide Web Conference, 2003, pp. 640–651.
[19] M. Richardson, R. Agrawal, and P. Domingos, “Trust management for the se-
mantic web,” in Proceedings of the Second International Semantic Web Confer-
ence, 2003, pp. 351–368.

[20] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107–117, 1998.

[21] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” in Proceedings of the Twelfth International Conference on Information and Knowledge Management, ser. CIKM '03, 2003, pp. 556–559.

[22] M. Jaber, P. Papapetrou, S. Helmer, and P. T. Wood, “Using time-sensitive


rooted pagerank to detect hierarchical social relationships,” Advances in Intelli-
gent Data Analysis XIII, vol. 8819, pp. 143–154, 2014.

[23] M. A. Hasan and M. J. Zaki, A survey of link prediction in social networks.


Springer, 2011.

[24] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with
label propagation,” CMU CALD, Tech. Rep., 2002.

[25] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16. MIT Press, 2004, pp. 321–328.

[26] M. Szummer and T. Jaakkola, “Partially labeled classification with Markov random walks,” in Advances in Neural Information Processing Systems, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., vol. 14. MIT Press, 2002.

[27] W. Zhang, N. Johnson, B. Wu, and R. Kuang, “Signed network propagation for detecting differential gene expressions and DNA copy number variations,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, ser. BCB '12. New York, NY, USA: ACM, 2012, pp. 337–344. [Online]. Available: http://doi.acm.org/10.1145/2382936.2382979
[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to
detect community structures in large-scale networks,” Phys. Rev. E, vol. 76,
p. 036106, Sep 2007. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevE.76.036106

[29] X. Liu and T. Murata, “Advanced modularity-specialized label propagation


algorithm for detecting communities in networks,” Physica A: Statistical
Mechanics and its Applications, vol. 389, no. 7, pp. 1493–1500, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0378437109010152

[30] S. Gregory, “Finding overlapping communities in networks by label propagation,”


New Journal of Physics, vol. 12, no. 10, p. 103018, 2010. [Online]. Available:
http://stacks.iop.org/1367-2630/12/i=10/a=103018

[31] J. Xie, B. K. Szymanski, and X. Liu, “Slpa: Uncovering overlapping communities


in social networks via a speaker-listener interaction dynamic process,” in Data
Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on,
Dec 2011, pp. 344–349.

[32] Z.-H. Wu, Y.-F. Lin, S. Gregory, H.-Y. Wan, and S.-F. Tian, “Balanced multi-
label propagation for overlapping community detection in social networks,”
Journal of Computer Science and Technology, vol. 27, no. 3, pp. 468–479, 2012.
[Online]. Available: http://dx.doi.org/10.1007/s11390-012-1236-x

[33] P. Boldi, M. Rosa, M. Santini, and S. Vigna, “Layered label propagation: A
multiresolution coordinate-free ordering for compressing social networks,” in
Proceedings of the 20th International Conference on World Wide Web, ser.
WWW ’11. New York, NY, USA: ACM, 2011, pp. 587–596. [Online]. Available:
http://doi.acm.org/10.1145/1963405.1963488

[34] J. Ugander and L. Backstrom, “Balanced label propagation for partitioning
massive graphs,” in Proceedings of the Sixth ACM International Conference on
Web Search and Data Mining, ser. WSDM ’13. New York, NY, USA: ACM, 2013,
pp. 507–516. [Online]. Available: http://doi.acm.org/10.1145/2433396.2433461
[35] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge, “Twitter polarity
classification with label propagation over lexical links and the follower
graph,” in Proceedings of the First Workshop on Unsupervised Learning
in NLP, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2011, pp. 53–63. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2140458.2140465

[36] M. J. Rattigan, M. Maier, and D. Jensen, “Exploiting network structure for
active inference in collective classification,” in Data Mining Workshops, 2007.
ICDM Workshops 2007. Seventh IEEE International Conference on, Oct 2007,
pp. 429–434.

[37] S. Bhagat, G. Cormode, and S. Muthukrishnan, Social Network Data Analytics.
Boston, MA: Springer US, 2011, ch. Node Classification in Social Networks, pp.
115–148. [Online]. Available: http://dx.doi.org/10.1007/978-1-4419-8462-3_5

[38] M. D. Conover, B. Goncalves, J. Ratkiewicz, A. Flammini, and F. Menczer,
“Predicting the political alignment of twitter users,” in Privacy, Security, Risk
and Trust (PASSAT) and 2011 IEEE Third International Conference on Social
Computing (SocialCom), 2011 IEEE Third International Conference on, Oct
2011, pp. 192–199.

[39] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda, “Link
propagation: a fast semi-supervised learning algorithm for link prediction,” in
Proceedings of the 9th SIAM International Conference on Data Mining (SDM’09),
2009, pp. 1099–1110.

[40] P. Wang, B. Xu, Y. Wu, and X. Zhou, “Link prediction in social networks: the
state-of-the-art,” Science China Information Sciences, vol. 58, pp. 1–38, 2014.

[41] R. Raymond and H. Kashima, “Fast and scalable algorithms for semi-supervised
link prediction on static and dynamic graphs,” in Proceedings of ECML/PKDD’10,
2010, pp. 131–147.

[42] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, 1995.
[43] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal
margin classifiers,” in Proceedings of the fifth annual workshop on Computational
learning theory (COLT ’92). New York: ACM, 1992, pp. 144–152.

[44] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,”
Data Min. Knowl. Discov., vol. 2, no. 2, pp. 121–167, Jun. 1998. [Online].
Available: http://dx.doi.org/10.1023/A:1009715923555

[45] StatSoft, Inc. (2013) Electronic statistics textbook. [Online]. Available:
http://www.statsoft.com/textbook/

[46] O. Chapelle, B. Scholkopf, and A. Zien, Label Propagation and Quadratic
Criterion. MIT Press, 2006, pp. 193–216. [Online]. Available:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6280898

[47] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub,
“Extrapolation methods for accelerating PageRank computations,” in Twelfth
International World Wide Web Conference (WWW 2003), 2003.

[48] A. N. Langville and C. D. Meyer, “Deeper inside PageRank,” Internet
Mathematics, vol. 1, no. 3, pp. 335–380, 2004.

[49] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available:
http://archive.ics.uci.edu/ml

[50] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth, “The UCI KDD archive
of large data sets for data mining research and experimentation,” SIGKDD
Explor. Newsl., vol. 2, no. 2, pp. 81–85, Dec. 2000. [Online]. Available:
http://doi.acm.org/10.1145/380995.381030

[51] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an
open architecture for collaborative filtering of netnews,” in Proceedings of the
1994 ACM Conference on Computer Supported Cooperative Work. New York,
NY, USA: ACM, 1994, pp. 175–186.
[52] G. Su, A. Kuchinsky, J. H. Morris, D. J. States, and F. Meng, “GLay: community
structure analysis of biological networks,” Bioinformatics, vol. 26, no. 24, pp.
3135–3137, 2010.

[53] J. H. Morris, L. Apeltsin, A. M. Newman, J. Baumbach, T. Wittkop, G. Su,
B. Bailey, and T. E. Ferrin, “clusterMaker: a multi-algorithm clustering plugin
for Cytoscape,” BMC Bioinformatics, vol. 12, no. 1, pp. 1–14, 2011. [Online].
Available: http://dx.doi.org/10.1186/1471-2105-12-436

[54] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo,


“Deriving marketing intelligence from online discussion,” in Proceedings of the
Eleventh ACM SIGKDD International Conference on Knowledge Discovery in
Data Mining, ser. KDD ’05. New York, NY, USA: ACM, 2005, pp. 419–428.
[Online]. Available: http://doi.acm.org/10.1145/1081870.1081919

[55] “Microsoft Academic.” [Online]. Available:
http://academic.research.microsoft.com/RankList?entitytype=8&topDomainID=2&subDomainID=6
