
Community Detection using Statistically Significant Subgraph Mining


Mohit Kumar Garg, Prithvi Sharma, and Rohit Kumar Jha
Indian Institute of Technology Kanpur, India

Abstract. In our project, we try to find the relation, if any, among densest subgraphs, statistically significant subgraphs, and communities, and to understand the reasons for the existence or non-existence of such a relation. We have used synthetic as well as real-world datasets to analyze the relevant algorithms.
Keywords: statistically significant subgraphs, community detection, densest subgraphs

1 Introduction

Structured patterns in graphs are studied to understand the intrinsic characteristics of the scientific data the graphs represent. Graph data has been growing significantly in various commercial and scientific applications, which has led to increasing research on patterns within graphs. For instance, frequent pattern mining in drug-development data reveals substructures in chemical compounds that are medically effective and thus gives a better understanding of the data [13]. Most of these applications depend on finding significant substructures in the form of subgraphs; statistically significant subgraph mining is precisely the problem of extracting these substructures from graph data.
Given the existing work on community detection and significant subgraph mining, we explore the possibility of using statistically significant subgraph mining [14] to find communities in a graph. We have analysed the results of a few existing approaches and of our approach on several real-world datasets. We also explored labelling methods that can be used while finding significant subgraphs so that those subgraphs can serve as candidate communities.

2 Motivation

Just as society exhibits a wide variety of group organizations, online communities and virtual groups have formed on the internet. Social communities have long been studied, and these studies have yielded an enormous amount of insight into the behaviour of organizations and structures in general. Similar communities exist in networked systems across biology, politics, economics, computer science, and engineering.

Fig. 1. Interrelation Triangle

communities. For example, in protein-protein interaction networks, proteins that carry out the same function within the cell lie within a community. Communities correspond to functional structures such as pathways and cycles in metabolic networks, to groups of pages dealing with related topics in the graph of the internet, to compartments in food webs, and so on.
Finding communities or structured patterns in networks has manifold applications. In protein structure analysis, evolutionarily essential patterns of interaction can be revealed by subgraphs that are conserved in the contact maps of proteins [11]. Clustering clients who have similar interests and are near each other may improve the performance of internet applications, since each cluster of clients can then be served by a separate server. In purchase-relationship networks between products and consumers, finding groups of customers with interests in similar domains or product categories (as in the case of www.amazon.com) enables recommendation systems that are efficient and enhance business opportunities. Clusters of large graphs can also be used to handle navigational queries efficiently and to store graph data. In ad hoc networks, devices move and the network changes rapidly; there is no central routing table specifying how one node should communicate with another, so clustering can help organize routing.

3 Previous Work

The problem of detecting communities in various types of graphs has been studied in computer science for quite a long time, and various algorithms using different techniques have been proposed. Initially, the problem was approached via graph partitioning. Most variants of these algorithms do not have polynomial running time; several algorithms achieve better complexity, but their solutions are far from optimal (Pothen, 1997) [1]. One of the earliest but still frequently used methods in the field is the spectral bisection method (Barnes, 1982) [17]. Several algorithms use maximum flow in graphs, like the algorithm proposed by Goldberg and Tarjan (1988) [2], which takes O(n^3) time. Another variant was proposed by Flake et al. (2002) [3]. There are also hierarchical clustering algorithms, but they are not scalable.
Some authors have proposed extensions of k-means clustering (Schenker et al., 2003 [5]; Hlaoui and Wang, 2004 [4]). In partition clustering, the number of clusters must be specified at the outset, which is generally not known. Shi and Malik proposed a method based upon unnormalized spectral clustering (CVPR 1997) [6], and Ng et al. proposed a normalized spectral clustering technique to solve the community detection problem [7].
Girvan and Newman proposed a divisive algorithm (Girvan and Newman, 2002; Newman and Girvan, 2004) [8]. This method has historical importance because it opened the field of community detection to physicists. The complete calculation requires O(n^3) time on a sparse graph. Pinney and Westhead extended the Girvan-Newman algorithm to find overlapping communities in a graph [9].
Other algorithms were proposed specifically to detect overlapping communities, where a node can belong to more than one community. The most popular work in this line is the Clique Percolation Method (CPM) by Palla et al. (2005) [10].

4 Our Work

4.1 Statistically Significant Subgraphs and Communities

If a node lies at an abnormal distance from the other nodes in a random sample taken from a population, it is called an outlier. If we choose the degree of each node as the labelling method, then the outliers are the vertices with the smallest and largest degrees. The node labels follow a normal distribution after mean adjustment and normalization by standard deviation.
In general, a graph has a higher concentration of vertices with small degrees than of vertices with large degrees. So, with respect to the standard normal distribution, the vertex with the largest degree is an outlier. A contiguous region of outliers with respect to the null hypothesis is defined as statistically significant [14].
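As a concrete sketch of this null model, one can mean-adjust and std-normalize the degree labels and flag vertices whose normalized label lies far in the tail. The function name and the |z| > 2 cutoff below are our illustrative choices, not values from [14]:

```python
import statistics

def degree_outliers(adj, z_cut=2.0):
    """Flag vertices whose mean-adjusted, std-dev-normalized degree lies
    beyond |z_cut| under the standard-normal null (cutoff is illustrative)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    mu = statistics.mean(deg.values())
    sigma = statistics.pstdev(deg.values()) or 1.0  # guard a degenerate graph
    return {v for v, d in deg.items() if abs(d - mu) / sigma > z_cut}

# In a star graph the hub's degree is the clear outlier:
star = {i: {0} for i in range(1, 10)}
star[0] = set(range(1, 10))
print(degree_outliers(star))  # {0}
```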
There is no hard and fast rule or lemma establishing a direct relation between statistically significant subgraphs and dense subgraphs. Intuitively, however, the statistically significant subgraphs obtained can be close to the densest subgraphs. As described above, the vertices with extreme degrees (the outliers) will appear in the statistically significant subgraphs, and those same vertices, the ones with the largest degrees, are expected to be in the densest subgraph. By the same reasoning, the statistically significant and densest subgraphs we obtain are highly likely to be close.
Kumar et al. [15] defined web communities as dense bipartite subgraphs. Their hypothesis suggests that any topically focused community on the web will most likely contain a dense bipartite subgraph, and that almost every occurrence of such a subgraph corresponds to a web community. This gives us good reason to try using statistically significant subgraphs to find communities.

To compare the different algorithms, we have used the Jaccard index and the F-score. The Jaccard index for two sets A and B is given by

|A ∩ B| / |A ∪ B|

Similarly, the F-score for ground truth G and result R is given by the harmonic mean of precision and recall. Precision is given by

|G ∩ R| / |R|

and recall is given by

|G ∩ R| / |G|
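Both metrics can be computed directly from node sets. A minimal sketch (function names are ours):

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two node sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def f_score(ground_truth, result):
    """Harmonic mean of precision |G ∩ R|/|R| and recall |G ∩ R|/|G|."""
    g, r = set(ground_truth), set(result)
    inter = len(g & r)
    if inter == 0:
        return 0.0
    precision = inter / len(r)
    recall = inter / len(g)
    return 2 * precision * recall / (precision + recall)

# Community {1,2,3,4} detected as {2,3,4,5}:
print(jaccard({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6
print(f_score({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.75
```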

4.2 Importance of Labelling

In most applications, the nature of the graph is such that every node of the graph has to be assigned a label [14]. For example, in a communication network, labelling schemes are used to represent the amount of traffic at a particular node. Similarly, in biological networks, several biochemical entities like genes, molecules, etc. are represented by labels on the nodes. Beyond these, there are many other applications where labelling is important, including coding theory, circuit design, x-ray crystallography, database management, radar, astronomy, and more.

4.3 Labelling Methods

The algorithm we used [14] takes discrete or continuous node labels when finding statistically significant subgraphs. We employed the following labelling methods:
- Degree of the node as a discrete label: the degree of each node is taken and then quantized into at most a threshold number of label values.
- PageRank of the node as a discrete label: the PageRank of each node is calculated and then quantized into at most a threshold number of label values.
- Degree of the node as a continuous label: the degree of each node is taken, mean-adjusted, and normalized by the standard deviation, giving a distribution with mean zero and standard deviation one.
- PageRank of the node as a continuous label: the PageRank of each node is calculated, mean-adjusted, and normalized by the standard deviation, giving a distribution with mean zero and standard deviation one.
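The four methods above reduce to two transforms applied to a per-node score (degree here; PageRank scores, e.g. from networkx.pagerank, plug in the same way). A sketch under our own naming, with equal-width bucketing as an assumed quantization scheme:

```python
import statistics

def quantize(scores, num_labels=10):
    """Discrete labelling: map each node's score into one of `num_labels`
    equal-width buckets (bucket count stands in for the threshold)."""
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / num_labels or 1.0  # guard identical scores
    return {v: min(int((s - lo) / width), num_labels - 1)
            for v, s in scores.items()}

def standardize(scores):
    """Continuous labelling: mean-adjust and divide by the (population)
    standard deviation, giving labels with mean 0 and std dev 1."""
    vals = list(scores.values())
    mu = statistics.mean(vals)
    sigma = statistics.pstdev(vals) or 1.0
    return {v: (s - mu) / sigma for v, s in scores.items()}

# Degrees of a small undirected graph given as an adjacency dict:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
deg = {v: len(nbrs) for v, nbrs in adj.items()}
print(quantize(deg, num_labels=2))  # {1: 1, 2: 1, 3: 1, 4: 0}
print(standardize(deg))             # zero-mean, unit-std labels
```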

4.4 Description of Work

There already exists an O(n^3 log n) time algorithm (Goldberg's algorithm) to find the densest subgraph in a graph, so under our hypothesis we cannot expect to do better than that. There are, however, approximate algorithms with better time complexity for finding densest subgraphs, though they do not yield results that are as good. The code for statistically significant subgraph mining runs in linear time and can be run on larger graphs in practical scenarios. So, we ran the 2-approximation algorithm [16] to find dense subgraphs and compared it with the results of the statistically significant subgraph algorithm. We ran all the algorithms on the Facebook, DBLP, and Amazon datasets.
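The 2-approximation of [16] is Charikar's greedy peeling: repeatedly delete the minimum-degree vertex and keep the intermediate subgraph of highest density |E(S)|/|S|. A self-contained sketch (the O(n) min-scan per step is for clarity; a heap-based variant runs in O(m + n log n)):

```python
def densest_subgraph_2approx(adj):
    """Charikar's greedy peeling: returns (node set, density) where density
    is |E(S)|/|S|, guaranteed to be at least half the optimum."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    num_edges = sum(len(n) for n in adj.values()) // 2
    best_density, best_nodes = 0.0, set(adj)
    while adj:
        density = num_edges / len(adj)
        if density > best_density:
            best_density, best_nodes = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))  # peel a min-degree vertex
        num_edges -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best_nodes, best_density

# A 4-clique {1,2,3,4} with a pendant vertex 5: the clique is densest.
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
community, density = densest_subgraph_2approx(adj)
print(community, density)
```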
The null hypothesis plays a very important role in finding statistically significant subgraphs when the vertex labels are discrete. We experimented with various null hypotheses on the Facebook and DBLP datasets and observed how the results varied. Coming up with a null hypothesis that works for all kinds of graphs is not possible, as the null hypothesis essentially depends on what type of graph you are working with and how you define your communities. We therefore moved on to testing with continuous labels.

5 Datasets

- DBLP Collaboration Network (http://snap.stanford.edu/data/com-DBLP.html):
The DBLP computer science bibliography provides a comprehensive list of research papers in computer science. The dataset is a co-authorship network in which two authors are connected if they have published at least one paper together. Individual ground-truth communities are formed on the basis of publication venue (journal or conference). The dataset regards each connected component in a group as a community. Overall, the dataset has 317080 nodes and 1049866 edges, and the ground-truth set contains the top 5000 communities.
- Facebook user interaction data (https://sites.google.com/site/himeldevportfolio/codes):
The dataset contains 2421 interactions from 221 Facebook users. Each user is given a unique id ranging from 1 to 221. The interaction types considered include the number of common page likes, the number of wall posts, etc. Ground-truth communities are given on the basis of these interactions as they occur in the real world.
- Web data, Amazon product co-purchasing network (http://snap.stanford.edu/data/com-Amazon.html):
This dataset uses the "Customers Who Bought This Item Also Bought" feature of the Amazon website: if a product i is frequently purchased along with product j, then the graph contains an undirected edge between i and j. Ground-truth communities are provided as product categories; each connected component within a product category is given as a ground-truth community. Overall, the dataset has 334863 nodes and 925872 edges.

6 Results

As mentioned in the last section, we have primarily used three datasets for testing and experimentation. On these three datasets, we have tried to compare the results of the following approaches in various scenarios:
- Statistically significant subgraph using discrete labeling
- Statistically significant subgraph using continuous labeling
- Densest subgraph using Goldberg's algorithm
- 2-approximation algorithm for the densest subgraph

We have tried to compare the algorithm we worked on with the densest subgraph algorithms. For this, we check how close the results of these algorithms are to the ground-truth communities available to us. To measure this closeness, we use the F-score (computed from precision and recall) and the Jaccard similarity coefficient.
6.1 Closeness of Results to Ground-Truth Communities

Jaccard       Discrete  Continuous        Continuous          2-Approx.         Densest
coefficient   Labeling  Labeling: Degree  Labeling: Pagerank  Densest Subgraph  Subgraph
Facebook      0.1290    0.4059            0.3535              0.762             0.78125
Amazon        0.896     0.1774            0.1531              NA                NA
DBLP          0.275     0.536             0.868               NA                NA

Table 1. Closeness of results to ground-truth communities: Jaccard index (NA: not applicable, as these communities are defined as the densest subgraphs)

F-score       Discrete  Continuous            Continuous              2-Approx.         Densest
              Labeling  Labeling via Degree   Labeling via Pagerank   Densest Subgraph  Subgraph
Facebook      0.2285    0.3684                0.3889                  0.865             0.8772
Amazon        0.1645    0.3173                0.3215                  NA                NA
DBLP          0.536     0.1734                0.2252                  NA                NA

Table 2. Closeness of results to ground-truth communities: F-score (NA: not applicable, as these communities are defined as the densest subgraphs)

As is visible from the adjoining graph and the results table, the densest subgraph is most similar to the ground-truth communities in the case of the Facebook dataset. For the other two datasets, we found on experimentation that the ground-truth communities are in fact the same as the densest subgraphs.

Jaccard       Discrete  Continuous        Continuous
coefficient   Labeling  Labeling: Degree  Labeling: Pagerank
Facebook      0.409     0.1596            0.5420
Amazon        0.896     0.2994            0.6427
DBLP          0.275     0.1148            0.2352

Table 3. Closeness of results to densest communities: Jaccard index

F-score       Discrete  Continuous        Continuous
              Labeling  Labeling: Degree  Labeling: Pagerank
Facebook      0.786     0.2727            0.7030
Amazon        0.1645    0.4832            0.7916
DBLP          0.536     0.1764            0.2741

Table 4. Closeness of results to densest communities: F-score

6.2 Closeness of Results to Densest Communities

As is visible from the adjoining graph and the results table, the results of our algorithm with PageRank as continuous labels are the closest to the densest subgraph, compared with degree as continuous labels or discrete labelling. In all cases, recall was very high, but the F-score was still low because of the large size of the communities.

One of the primary tasks of our project was to find out to what extent statistically significant subgraphs and the ground-truth communities are related. When we started experimenting with different datasets and analyzing the results, one of our observations was that the algorithm for finding statistically significant subgraphs had high recall but low precision. On further analysis, we found that the reason was the large size of the communities produced by the algorithm: due to this size, we were including all relevant nodes of the communities (high recall), but also many irrelevant nodes along with them (low precision).
The algorithm for finding statistically significant subgraphs [14] first converts the given graph into a much smaller graph and then finds the most statistically significant subgraph on this smaller graph. The reduction takes place in two steps: the algorithm first creates a supergraph from the given graph and then shrinks this supergraph until the number of nodes falls below a threshold. This threshold determines how much time the algorithm takes to find the most significant subgraph: the higher the threshold, the larger the reduced graph and the more time taken. To report a community, we had to map these results (on the reduced graph) back to the original graph. In the case of the Facebook dataset, the original graph has 221 nodes, the supergraph has 70 nodes, and the reduced graph has 20 nodes. On this 20-node graph we found the 4-node subgraph that was the most significant among all subgraphs; mapping these 4 nodes back to the original graph gives a community of 68 nodes. When the same algorithm was applied to the very large datasets (DBLP: 317080 nodes, Amazon: 334863 nodes), the number of nodes in the resulting communities was very high (around 30,000), far larger than the ground-truth communities. One thing we conclude is that this algorithm, in its current form, cannot give useful results for very large datasets.
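The mapping-back step above can be sketched as a simple expansion through a node-to-supernode membership map (names and data are ours, for illustration); it also shows why a small reduced-graph result inflates into a large community:

```python
def expand_to_original(selected_supernodes, membership):
    """Expand a subgraph found on the reduced graph back into original-graph
    nodes; `membership` maps each original node to its supernode id."""
    chosen = set(selected_supernodes)
    return {v for v, s in membership.items() if s in chosen}

# 12 original nodes collapsed into 3 supernodes; selecting a single
# supernode already yields a sizeable original-graph community:
membership = {v: v % 3 for v in range(12)}
print(expand_to_original({0}, membership))  # {0, 3, 6, 9}
```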

To evaluate the part of the algorithm that creates the supergraph, we built a synthetic graph dataset: ten dense graphs connected by a small number of edges. After running the supergraph-creation step, we found that about 17 of the 19 supernodes comprising more than one node had all their constituent nodes drawn from the same community. However, on further observation, the number of constituent nodes per supernode was very small. This suggests that more work can be done on improving the supergraph-creation algorithm as well. Ideally, the supergraph would have had ten supernodes, each containing the vertices of one of the dense subgraphs, which would then correspond to the ten communities.
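A generator for this kind of synthetic benchmark can be sketched as follows. We use cliques as a simple stand-in for the dense blocks and one random bridge edge between consecutive blocks; all names and parameter values are illustrative, not the exact construction we used:

```python
import random

def planted_communities(num_blocks=10, block_size=8, bridges=1, seed=0):
    """Build `num_blocks` dense blocks (cliques here) joined by a few random
    inter-block edges. Returns (adjacency dict, list of community node sets)."""
    rng = random.Random(seed)
    adj, communities = {}, []
    for b in range(num_blocks):
        nodes = list(range(b * block_size, (b + 1) * block_size))
        communities.append(set(nodes))
        for v in nodes:
            adj[v] = set(nodes) - {v}          # clique inside the block
    for b in range(num_blocks - 1):            # sparse bridges between blocks
        for _ in range(bridges):
            u = rng.choice(sorted(communities[b]))
            w = rng.choice(sorted(communities[b + 1]))
            adj[u].add(w)
            adj[w].add(u)
    return adj, communities

adj, comms = planted_communities()
print(len(adj), len(comms))  # 80 10
```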

7 Conclusions

We tried to assess whether the reduction part of the algorithm could be eliminated by cross-checking the contents of the supernodes against the communities. If all the constituent nodes of a supernode belong to the same community, there is no essential information loss from the original graph to the supergraph. As we found, however, the number of supernodes with this property was not very high; in the Facebook dataset, for instance, approximately 9 out of 19 supernodes satisfied it. This led us to two conclusions: either the labeling methods we used need to be improved, or the supergraph-creation algorithm needs changes.
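The cross-check described here amounts to a purity count over multi-node supernodes. A sketch with illustrative names and data (not the authors' code):

```python
def supernode_purity(supernodes, communities):
    """Count how many multi-node supernodes have all of their constituent
    nodes inside a single ground-truth community."""
    def pure(members):
        return any(members <= c for c in communities)
    multi = [s for s in supernodes if len(s) > 1]
    return sum(pure(s) for s in multi), len(multi)

communities = [{1, 2, 3, 4}, {5, 6, 7, 8}]
supernodes = [{1, 2}, {3, 4}, {5, 6, 7}, {4, 5}, {8}]
# {1,2}, {3,4}, {5,6,7} are pure; {4,5} straddles; {8} is single-node
print(supernode_purity(supernodes, communities))  # (3, 4)
```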

8 Future Work

In statistically significant subgraph mining, labelling techniques play a very important role. We tried degree and PageRank as labels; having observed the issues with these labels, more work can be done on devising new labelling methods that capture the essence of a community.
The algorithm for finding significant subgraphs requires the graph to be shrunk to a small number of nodes, as the time taken on large graphs is very high. If this algorithm can be made faster, the shrinking step, which led to the creation of large communities and hence low precision despite very high recall, could be eliminated.
The supergraph-creation algorithm can be further improved so that it captures the essence of communities better, leading to a smaller number of supernodes while ensuring that the constituent nodes of each supernode all belong to the same ground-truth community.

References
1. Pothen, A., 1997, Graph Partitioning Algorithms with Applications to Scientific
Computing, Technical Report, Norfolk, VA, USA
2. Goldberg, A. V., and R. E. Tarjan, 1988, Journal of the ACM 35, 921.
3. Flake, G. W., S. Lawrence, C. Lee Giles, and F. M. Coetzee, 2002, IEEE Computer
35, 66.
4. Hlaoui, A., and S. Wang, 2004, in Neural Networks and Computational Intelligence,
pp. 158-163
5. Schenker, A., M. Last, H. Bunke, and A. Kandel, 2003, in IbPRIA 2003.
6. Shi, J., and J. Malik, 1997, in CVPR 97: Proceedings of the 1997 Conference on
Computer Vision and Pattern Recognition
7. Ng, A. Y., M. I. Jordan, and Y. Weiss, 2001, in Advances in Neural Information
Processing Systems
8. Girvan, M., and M. E. J. Newman, 2002, Proc. Natl. Acad. Sci. USA 99 (12), 7821
9. Pinney, J. W., and D. R. Westhead, 2006, in Interdisciplinary Statistics and Bioinformatics (Leeds University Press, Leeds, UK), pp. 87-90
10. Palla, G., I. Derényi, I. Farkas, and T. Vicsek, 2005, Nature 435, 814.
11. J. Hu, X. Shen, Y. Shao, C. Bystroff, and M. J. Zaki. Mining protein contact
maps. In BIOKDD, 2002.
12. R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler,
R. M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple
species. In Proc Natl Acad Sci, 2005.
13. S. Kramer, L. D. Raedt, and C. Helma. Molecular feature mining in HIV data.
In KDD, 2001
14. A. Arora, M. Sachan, and A. Bhattacharya, Mining Statistically Significant Connected Subgraphs in Vertex Labeled Graphs.
15. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Trawling the web for emerging cyber-communities. http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html
16. M. Charikar, Greedy Approximation Algorithms for Finding Dense Components in a Graph. http://link.springer.com/chapter/10.1007/3-540-44436-X_10
17. Barnes, E. R., 1982, SIAM J. Alg. Discr. Meth. 3, 541.
