Community Detection Using Statistically Significant Subgraph Mining
Introduction
Structured patterns in graphs are studied to understand the intrinsic
characteristics of the scientific data the graphs represent. Graph data has been
growing significantly in various commercial and scientific applications, which has
led to increasing research into patterns within graphs. For instance, frequent
pattern mining in drug development data reveals substructures in chemical
compounds that are medically effective, giving a better understanding of the
data [13]. Most of these applications depend on finding significant substructures
in the form of subgraphs, and statistically significant subgraph mining is precisely
the problem of extracting these significant substructures from graph data.
Considering the existing work on community detection and significant subgraph
mining, we explore the possibility of using statistically significant subgraph
mining [14] to find communities in a graph. We have analysed the results of a few
existing approaches and of our approach on several real-world datasets. We have
also explored labelling methods that can be used while finding significant
subgraphs, so that the subgraphs found can serve as candidate communities.
Motivation
Just as there is a wide variety of group organizations in society, online
communities and virtual groups have formed on the internet. Studies of social
communities have long provided an enormous amount of insight into the behaviour
of organizations and structures in general. In networked systems ranging from
biology, politics, and economics to computer science and engineering, similar
community structures exist.
Previous Work
The problem of detecting communities in various types of graphs has been studied
in computer science for quite a long time, and various algorithms using different
techniques have been proposed. Initially, attempts were made to solve the problem
using graph partitioning. Most variants of these algorithms do not run in
polynomial time; several algorithms achieve better complexity, but their
solutions are far from optimal (Pothen, 1997) [1]. One of the earliest but still
frequently used methods in the field is the spectral bisection method (Barnes,
1982) [17]. Several algorithms use maximum flow in graphs, such as the algorithm
proposed by Goldberg and Tarjan (Goldberg and Tarjan, 1988) [2], which takes
O(n³) time. Another variant was proposed by Flake et al. (Flake et al., 2002) [3].
There are also hierarchical clustering algorithms, but they are not scalable.
Some authors have proposed extensions of k-means clustering (Schenker et al.,
2003 [5]; Hlaoui and Wang, 2004 [4]). In partitional clustering, the number of
clusters must be specified in advance, which is generally not known. Shi and
Malik proposed a method based on unnormalized spectral clustering (CVPR 97) [6],
and Ng et al. proposed a normalized spectral clustering technique to solve the
community detection problem [7].
Girvan and Newman proposed a divisive algorithm (Girvan and Newman, 2002; Newman
and Girvan, 2004) [8]. The method is historically important because it opened the
field of community detection to physicists. The complete calculation requires
O(n³) time on a sparse graph. Pinney and Westhead extended the Girvan–Newman
algorithm to find overlapping communities in a graph [9].
Some algorithms detect overlapping communities, where a node can be part of more
than one community. The most popular work in this area is the Clique Percolation
Method (CPM) by Palla et al. (Palla et al., 2005) [10].
4 Our Work

4.1 Statistically Significant Subgraphs and Communities
If a node lies at an abnormal distance from the other nodes in a random sample
taken from a population, it is called an outlier. If we choose the node degree as
the labelling method, the outliers are the vertices with the smallest and the
largest degrees. After mean adjustment and normalization by the standard
deviation, the node labels follow a normal distribution.
In general, a graph has a higher concentration of vertices with small degrees
than with large degrees, so with respect to the standard normal distribution, the
vertex with the largest degree is an outlier. A contiguous region of outliers
with respect to the null hypothesis is defined as statistically significant [14].
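As a small illustration of this labelling, the sketch below computes degree
z-scores on a toy graph and flags the high-degree hub as an outlier. The graph
itself and the |z| > 2 cutoff are our own illustrative assumptions, not part of
the algorithm in [14]:

```python
from statistics import mean, pstdev

# Toy undirected graph as an adjacency list; the star-like hub 0 is a
# high-degree outlier. The structure is illustrative only.
graph = {
    0: [1, 2, 3, 4, 5, 6],
    1: [0, 2], 2: [0, 1], 3: [0, 4],
    4: [0, 3], 5: [0], 6: [0],
}

degrees = {v: len(nbrs) for v, nbrs in graph.items()}
mu = mean(degrees.values())
sigma = pstdev(degrees.values())

# Mean adjustment followed by standard-deviation normalization, as
# described above: labels become z-scores.
z = {v: (d - mu) / sigma for v, d in degrees.items()}

# Flag vertices that are extreme under the (assumed) normal null.
outliers = [v for v, s in z.items() if abs(s) > 2.0]
print(outliers)
```

Only the hub survives the cutoff; the many small-degree vertices sit well within
one standard deviation of the mean.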
There is no hard and fast rule or lemma establishing a direct relation between
statistically significant subgraphs and dense subgraphs. Intuitively, however,
the statistically significant subgraphs obtained can be close to the densest
subgraphs. As described above, the vertices with extreme degrees (the outliers)
appear in the statistically significant subgraphs, and the same vertices, the
ones with the largest degrees, are expected to appear in the densest subgraph.
Applying the same unfolding logic, the statistically significant subgraphs and
the densest subgraphs we obtain are highly likely to be close.
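For the densest-subgraph side of this comparison, Charikar's greedy peeling
algorithm [16] gives a 2-approximation to the subgraph maximizing edge density.
A minimal sketch on an adjacency-set graph (not the exact implementation we ran)
might look like:

```python
def densest_subgraph(graph):
    """Charikar's greedy 2-approximation: repeatedly remove the
    minimum-degree vertex and keep the prefix with the best density
    (edges / vertices)."""
    # Work on mutable copies of the adjacency sets.
    adj = {v: set(nbrs) for v, nbrs in graph.items()}
    edges = sum(len(n) for n in adj.values()) // 2
    best_density, best_set = 0.0, set(adj)
    while adj:
        density = edges / len(adj)
        if density >= best_density:
            best_density, best_set = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))  # min-degree vertex
        edges -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best_set, best_density
```

On a graph consisting of a 4-clique with one pendant vertex attached, the peeling
discards the pendant first and reports the clique, whose density of 6/4 beats the
whole graph's 7/5.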
Kumar et al. [15] defined web communities as dense bipartite subgraphs. Their
hypothesis suggested that any topically focused community on the web will most
likely contain a dense bipartite subgraph, and that almost every occurrence of
such a subgraph corresponds to a web community. This gives us sufficient reason
to try statistically significant subgraphs for finding communities.
4.2 Importance of Labelling
In most applications, the nature of the graph is such that every node must be
assigned a label [14]. For example, in a communication network, labelling schemes
represent the amount of traffic at a particular node. Similarly, in biological
networks, biochemical entities such as genes and molecules are represented by
labels on the nodes. There are many other applications where labelling is
important, including coding theory, circuit design, X-ray crystallography,
database management, radar, and astronomy.
4.3 Labelling Methods
The algorithm we used [14] works with discrete and continuous node labels to find
statistically significant subgraphs. We employed the following labelling methods:

- Degree of the node as a discrete label: the degree of each node is taken and
  then quantized into at most a threshold number of label values.
- PageRank of the node as a discrete label: the PageRank of each node is
  calculated and then quantized into at most a threshold number of label values.
- Degree of the node as a continuous label: the degree of each node is taken,
  mean-adjusted, and normalized by the standard deviation, giving a distribution
  with mean zero and standard deviation one.
- PageRank of the node as a continuous label: the PageRank of each node is
  calculated, mean-adjusted, and normalized by the standard deviation, giving a
  distribution with mean zero and standard deviation one.
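The four schemes above can be sketched in Python. The bin count, the
power-iteration PageRank, and the helper names below are illustrative
assumptions, not the exact implementation used in [14]:

```python
from statistics import mean, pstdev

def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank on an adjacency-list graph.
    Assumes every node has at least one neighbour (no dangling mass)."""
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in graph}
        for v, nbrs in graph.items():
            share = damping * pr[v] / len(nbrs) if nbrs else 0.0
            for u in nbrs:
                new[u] += share
        pr = new
    return pr

def discrete_labels(values, num_bins):
    """Quantize raw scores into at most num_bins integer labels."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / num_bins or 1.0  # guard against constant input
    return {v: min(int((x - lo) / width), num_bins - 1)
            for v, x in values.items()}

def continuous_labels(values):
    """Mean adjustment + standard-deviation normalization (z-scores).
    Assumes the values are not all equal."""
    mu, sigma = mean(values.values()), pstdev(values.values())
    return {v: (x - mu) / sigma for v, x in values.items()}
```

The same quantization and normalization helpers apply to either score, so the
choice of degree versus PageRank is just the choice of the `values` dictionary
passed in.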
4.4 Description of Work

5 Datasets

6 Results
As mentioned in the last section, we have primarily used three datasets for
testing and experimentation. On these three datasets, we compare the results of
the following approaches in various scenarios.
We compare the algorithm we have worked on with the densest subgraph algorithm
by checking how close the results of each are to the ground truth communities
available to us. To measure this closeness, we use the F-measure (computed from
precision and recall) and the Jaccard similarity coefficient.
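Both closeness measures operate on node sets; a minimal sketch (the set-based
signatures are our own convention):

```python
def f_score(pred, truth):
    """F-measure (harmonic mean of precision and recall) between a
    predicted community and a ground-truth community, both node sets."""
    tp = len(pred & truth)  # correctly recovered nodes
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def jaccard(pred, truth):
    """Jaccard similarity coefficient of two node sets."""
    return len(pred & truth) / len(pred | truth)
```

Note that a very large predicted community covering the whole ground truth scores
perfect recall but poor precision, and hence a low F-score, which is the pattern
we observe in our results.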
6.1
F-score              Facebook   Amazon   DBLP
Discrete Labeling    0.2285     0.1645   0.536

Table 2. Closeness of results to ground truth communities: F-score (NA: not
applicable, as communities are defined as the densest subgraph)
As is visible from the adjoining graph and the results table, the densest
subgraph is most similar to the ground truth communities in the case of the
Facebook dataset. For the other two datasets, we found on experimentation that
their ground truth communities are in fact the same as the densest subgraphs.
Jaccard coefficient   Discrete Labeling
Facebook              0.409
Amazon                0.896
DBLP                  0.275

F-score               Discrete Labeling
Facebook              0.786
Amazon                0.1645
DBLP                  0.536
6.2
As is visible from the adjoining graph and the results table, the results of our
algorithm with PageRank as continuous labels are the closest to the densest
subgraph, compared to degree as continuous labels or to discrete labelling. In
all cases, recall was very high, but the F-score was still low because of the
large communities produced.
One of the primary tasks of our project was to find out to what extent
statistically significant subgraphs and the ground truth communities are related
to each other. When we started experimenting with different datasets and
analyzing the results, one of our observations was that the algorithm for finding
statistically significant subgraphs had high recall but low precision. On further
analysis, we found that the reason was the large size of the communities produced
by the algorithm. Due to their large size, we were including all relevant nodes
of the communities (high recall), but we were also including many irrelevant
nodes along with the relevant ones (low precision).
The algorithm for finding the statistically significant subgraph [14] first
converts the given graph into a much smaller graph and then finds the most
statistically significant subgraph in this smaller graph. The reduction takes
place in two steps: the algorithm first creates a supergraph from the given graph
and then shrinks this supergraph until the number of nodes falls below a
threshold. The threshold determines how long the algorithm takes to find the most
significant subgraph: the higher the threshold, the larger the reduced graph and
the more time required. To report the community, we had to map these results (on
the reduced graph) back to the original graph. For the Facebook dataset, the
original graph has 221 nodes, the supergraph has 70 nodes, and the reduced graph
has 20 nodes. On this 20-node graph we found the 4-node subgraph that was most
significant among all subgraphs; mapping these 4 nodes back to the original graph
yields a community of 68 nodes. When the same algorithm is applied to very large
datasets (DBLP: 317080 nodes; Amazon: 334863 nodes), the number of nodes in the
resulting communities was very high (around 30,000), far larger than the ground
truth communities. We therefore conclude that this algorithm, in its current
form, cannot give useful results for very large datasets.
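The expansion step, mapping nodes of the reduced graph back to original-graph
nodes, amounts to transitively unfolding the containment recorded at each
shrinking pass. The two-level structure below is a hypothetical illustration of
this bookkeeping, not the paper's actual data structures:

```python
def expand(nodes, containment_levels):
    """Map reduced-graph nodes back to original-graph nodes.

    containment_levels: a list of dicts, one per reduction pass in the
    order the passes were applied; each dict maps a merged node to the
    set of lower-level nodes it absorbed. Unfolding walks the passes in
    reverse."""
    current = set(nodes)
    for level in reversed(containment_levels):
        current = set().union(*(level[v] for v in current))
    return current

# First pass merged original nodes 1,2 into supernode 's1' and 3 into
# 's2'; second pass merged both supernodes into reduced node 'r1'.
levels = [{'s1': {1, 2}, 's2': {3}}, {'r1': {'s1', 's2'}}]
print(expand({'r1'}, levels))
```

This also shows why small significant subgraphs on the reduced graph blow up
into large reported communities: every merged node drags all its constituents
back in.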
To evaluate the part of the algorithm that creates the supergraph, we built a
synthetic graph dataset: ten dense graphs connected by a small number of edges.
After running the supergraph-creation step, we found that about 17 of the 19
supernodes comprising more than one node drew all their constituent nodes from
the same community. However, on further observation, the number of constituent
nodes in each supernode was very small. This suggests that more work can be done
to improve the supergraph-creation algorithm as well. Ideally, the supergraph
would contain ten supernodes, each containing the vertices of one of the dense
subgraphs, which would then correspond to the ten communities.
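A benchmark of this shape can be generated as follows; the community count,
clique size, and inter-edge budget are illustrative parameters, not the exact
ones we used:

```python
import random

def synthetic_communities(num_comms=10, size=20, inter_edges=5, seed=0):
    """Build num_comms cliques of `size` nodes each (the planted dense
    communities), then connect them with a small number of random
    inter-community edges."""
    rng = random.Random(seed)
    graph = {}
    for c in range(num_comms):
        members = range(c * size, (c + 1) * size)
        for v in members:
            graph[v] = {u for u in members if u != v}
    # Sprinkle a few edges between different communities.
    for _ in range(inter_edges * num_comms):
        a, b = rng.sample(range(num_comms * size), 2)
        if a // size != b // size:  # skip intra-community pairs
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

Because node v belongs to community v // size by construction, checking whether
a supernode's constituents share a community is a one-line test on this dataset.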
Conclusions
We tried to eliminate the reduction part of the algorithm by cross-checking the
contents of the supernodes against the communities: if all the constituent nodes
of a supernode belong to the same community, there is no essential information
loss from the original graph to the supergraph. However, as we found, the number
of supernodes with this property was not very high either; in the Facebook
dataset, for instance, approximately 9 of 19 supernodes satisfied it. This led us
to two conclusions: either the labelling methods we used need to be improved, or
the supergraph-creation algorithm needs changes.
Future Work

In statistically significant subgraph mining, labelling techniques play a very
important role. We tried degree and PageRank as labels. Given the issues observed
with these labels, more work can be done on devising new labelling methods that
capture the essence of a community.
The algorithm for finding the significant subgraph requires the graph to be
shrunk to a small number of nodes, as the time taken on large graphs is very
high. If the algorithm itself can be made faster, the shrinking step, which led
to large communities and hence low precision alongside very high recall, could
be eliminated.
The supergraph-creation algorithm can be further improved so that it captures the
essence of the communities better, leading to a smaller number of supernodes
while ensuring that all constituent nodes of a supernode belong to the same
ground truth community.
References
1. Pothen, A., 1997, Graph Partitioning Algorithms with Applications to Scientific
Computing, Technical Report, Norfolk, VA, USA
2. Goldberg, A. V., and R. E. Tarjan, 1988, Journal of the ACM 35, 921.
3. Flake, G. W., S. Lawrence, C. Lee Giles, and F. M. Coetzee, 2002, IEEE Computer
35, 66.
4. Hlaoui, A., and S. Wang, 2004, in Neural Networks and Computational Intelligence,
pp. 158-163
5. Schenker, A., M. Last, H. Bunke, and A. Kandel, 2003, in IbPRIA 2003
6. Shi, J., and J. Malik, 1997, in CVPR 97: Proceedings of the 1997 Conference on
Computer Vision and Pattern Recognition
7. Ng, A. Y., M. I. Jordan, and Y. Weiss, 2001, in Advances in Neural Information
Processing Systems
8. Girvan, M., and M. E. J. Newman, 2002, Proc. Natl. Acad. Sci. USA 99 (12), 7821
9. Pinney, J. W., and D. R. Westhead, 2006, in Interdisciplinary Statistics and Bioinformatics (Leeds University Press, Leeds, UK), pp. 87-90
10. Palla, G., I. Derényi, I. Farkas, and T. Vicsek, 2005, Nature 435, 814.
11. J. Hu, X. Shen, Y. Shao, C. Bystroff, and M. J. Zaki. Mining protein contact
maps. In BIOKDD, 2002.
12. R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler,
R. M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple
species. In Proc Natl Acad Sci, 2005.
13. S. Kramer, L. D. Raedt, and C. Helma. Molecular feature mining in HIV data.
In KDD, 2001
14. A. Arora, M. Sachan, and A. Bhattacharya, Mining Statistically Significant
Connected Subgraphs in Vertex Labeled Graphs
15. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Trawling the web
for emerging cyber-communities http://www8.org/w8-papers/4a-search-mining/
trawling/trawling.html
16. M. Charikar, Greedy Approximation Algorithms for Finding Dense Components
in a Graph http://link.springer.com/chapter/10.1007/3-540-44436-X_10
17. Barnes, E. R., 1982, SIAM J. Alg. Discr. Meth. 3, 541.