Efficient Densest Subgraphs Discovery in Large Dynamic Graphs by Greedy Approximation
ABSTRACT Densest subgraph detection has become an important primitive in graph mining tasks when analyzing communities and detecting events in a wide range of application domains. Developing efficient densest subgraph mining approaches that can handle graphs that are both very large and continuously evolving is a challenging and practically crucial research issue. Although methods for large-scale graphs and methods for dynamic graphs have each been proposed, a promising method for graphs that are both large-scale and dynamically evolving is still lacking. In this paper, the problem is formulated and proved to be NP-Hard, and an incremental greedy approximation approach with running time O(m + n) is proposed. The approach finds the densest subgraph by heuristically merging local densest subgraphs in four steps. First, the edge flow of a dynamic graph is divided into several subgraphs within a given period T. Second, a local candidate set is generated by local denser subgraph discovery. Third, the global densest subgraph candidates are collected by heuristic merging. Last, the densest subgraphs are induced from the global densest subgraph candidates, under constraints, by a static densest subgraph discovery algorithm. This incremental approach scales up existing densest subgraph discovery algorithms, which focus mainly on small, static graphs, so that they can handle very large dynamic graphs. Comprehensive experiments on real-world networks with billions of nodes show improvements in both efficiency and accuracy: the approach reduces running time by about 25% on average and gives a more accurate estimate of the structure of a graph, with more compact subgraphs, than the static method. It also performs well on graphs of varying densities.
INDEX TERMS Densest subgraphs, incremental greedy approximation, heuristically merging, local candidate, global densest subgraph candidates.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 49367
T. Han: Efficient Densest Subgraphs Discovery in Large Dynamic Graphs by Greedy Approximation
many challenges hinder the discovery of the densest subgraphs in massive graphs, especially when the graphs are dynamic and evolve over time. First, the scale of graphs can be very large, making it hard to fit them into memory in their entirety for processing. Although processing massive graphs costs polynomial time in theory, their sheer size makes it computationally infeasible in practice [12], [13], [14], [15]. Second, the graphs are usually developing or evolving over time through link creations and deletions. It is very challenging for traditional densest subgraph mining algorithms to obtain the whole graph; therefore, they cannot process the graph completely.

To address the above-mentioned problems, an incremental approximation approach that discovers the densest subgraphs more efficiently is presented in this paper. First, it uses a sliding time window segmentation strategy to divide the vast-volume streaming graph distributed in the time windows into small graphs within a given period T. Second, the local densest subgraphs are detected by a density function after a pre-processing step that filters the least connected edges. Third, as the small graphs are processed, the approach incrementally merges the subgraphs to approximate the densest subgraphs by a greedy algorithm. Finally, the global densest subgraphs are detected by density functions.

The main contributions of this work are summarized as follows:
1) We give proof that the problem is NP-Hard.
2) We propose a novel incremental framework to efficiently discover the densest subgraphs by heuristically merging the local densest subgraphs. The merging algorithm greedily constructs the densest subgraphs candidate set to preserve the main dense parts of the graph by pruning the least connected parts and adding the most connected parts. The running time is O(m + n).
3) We have conducted in-depth evaluations of the proposed method on six large-scale networks. The experimental results demonstrate the effectiveness and efficiency of the greedy approximation method. When processing more edges and nodes, our approach performs better: it reduces running time by about 25% on average across four density definitions on the six diverse datasets, and the execution time declines drastically with all subgraph detection algorithms when the size of data reaches 0.6 billion, with decreases ranging from 300 s to 800 s. We compare the quality of subgraphs detected by our greedy approximation method and by the static method through three measures: Cross Common Fraction (CCF), Jaccard Index (JI), and Normalized Mutual Information (NMI). Our approach outperforms the static method by 6%, 7%, and 1% on average, respectively. The influence of sparsity on synthetic graphs with different densities is also discussed. Experiments show that this method is sparsity-friendly.

This paper is organized as follows. Section II discusses related work and motivations. In Section III, the densest subgraph problem (DSP) is defined, and the proof that the densest subgraph problem is NP-Hard is given. Section IV then presents a greedy approximation approach for DSP and a theoretical analysis of its running time. The proposed approach is demonstrated in Section V through experiments over six large network datasets. Finally, Section VI concludes the paper.

II. RELATED WORK AND MOTIVATIONS
DSP has been widely researched in data mining and theoretical computer science. There have been comprehensive surveys [16], [17], [18], [19] and tutorials [20] about the discovery of the densest subgraph. Investigations in data mining include those reported by Tsourakakis [1], [10], [12], [21], [22], [23], [24]. Studies in theoretical computer science can be found in [25], [26], [27], [28], [29], [30].

A. MAIN STRUCTURES
Average density is one of the main structures for graph mining. Balalau [12] considered limited overlapping as the concern in finding a set of dense subgraphs and devised an approximation and heuristic algorithm to solve the NP-Hard problem. Xie [8] gave comparative research on overlapping community detection. McGregor [7] proposed a one-pass approximation algorithm to find the densest subgraph where the graph was an unordered stream of edge deletions and insertions. Leskovec [31] empirically analyzed algorithms for finding communities in networks. Chen [32] put forward an algorithm to find multiple dense subgraphs in a certain sparse graph. The k-core is usually used in core decomposition. Bonchi [9] designed an algorithm to process uncertain graphs efficiently through core decomposition. It aimed to reduce the number of exact density computations by parameter-free approximation and an exact algorithm as well [10]. A k-clique contains k(k − 1)/2 edges, and its average density is the largest. The researchers [33], [34] identified top-k subgraphs with triangles that can represent the dense region of a large-scale graph in a parameter-free fashion in polynomial computing time. Ghasabeh [35] proposed a unified framework to compare and evaluate different algorithms for finding subgraphs in social networks. Valari [36] investigated the discovery of top-k dense subgraphs in both static and dynamic graphs.

Other structures have also been detected in many applications. A great many research works focus on reducing the size of huge-volume graphs while preserving desirable structures, such as connectivity [37], [38], [39], [40], backbone [41], distance [42], bi-simulation structure [43], and vocabulary subgraphs [44]. Although Koutra's paper managed to summarize six predefined structures (vocabulary subgraphs) in a million-node graph, it did not dedicate itself to finding dense subgraphs in large graphs. Moreover, these researchers preserved all the nodes and removed parts of the edges, so the dense parts of the original graph deteriorated.

B. DYNAMIC SETTINGS
The articles [45], [46] proposed a CONGEST-model-based dynamic approach for densest subgraph discovery in a
method is discussed. Then, the architecture and procedures of this method and the corresponding algorithms are given.

FIGURE 2. Four steps of Incremental Graph Approximation.

in G′_i exceeds the limitations defined by parameter γ; thus we accept all the nodes in G′_i because it is an optimal local solution, and the candidate set will be further processed. (iii) When |G′_i| < γ|G_i|, more nodes and edges, represented by G′′_i, must be added to the local candidate set to reach the size limit.

\[
S_i =
\begin{cases}
G'_i, & \text{if } |G'_i| \ge \gamma |G_i| \\
G'_i \cup G''_i, & \text{if } |G'_i \cup G''_i| = \gamma |G_i|
\end{cases}
\tag{3}
\]

3) INCREMENTAL APPROXIMATION
After the above two steps, the Local Candidate Sets S_1, ..., S_i, ..., S_j, ..., S_sn are established. The graph approximation algorithm assembles the local candidate sets of subgraphs into the Global Candidate Set (GCS). The detailed description is given in Algorithm 2, and the flowchart is presented in Figure 3.

4) DENSEST SUBGRAPHS DETECTION
This step aims at detecting the densest subgraphs from the simplified graph S by the density(S) functions listed in Table 2, including edge density, k-core, and k-clique. They illustrate "dense" from three aspects. The edge density defines the maximum average density of the graph and emphasizes the average density in subgraphs. The k-core finds subgraphs in which every node has at least k edges and focuses on the connection of subgraphs. The k-clique discovers a group of k nodes that directly connect to each other with k(k − 1)/2 edges and highlights the strongly connected components.

densest subgraph G∗ is detected from the global candidate set under the size constraint based on the density function.

C. THEORETICAL ANALYSIS
In this section, the running time is discussed. Let T(v, e) be the time required to find a minimum capacity cut in a graph with v nodes and e edges [48]. In the min-cut-based algorithm [48], there is only one loop, which is executed ⌈log((m + 1)n(n + 1))⌉ = O(log n) times. Inside the loop, finding the min-cut is the main step. The graph has n + 2 = O(n) nodes and 2m + 2n = O(m + n) edges, so the running time is O(T(n, n + m) log n). A previous algorithm [48] finds a minimum capacity cut in a graph with k nodes in O(k^3) steps. If this method is applied, a maximum density subgraph can be found in O(n^3 log n) time. A faster min-cut approach improves the running time of the maximum density approach; e.g., Sleator's algorithm [48] has T(v, e) = O(ve log v), which yields an O(n(n + m) log n log(n + m)) bound that is better for sparse graphs. A faster result [49], obtained by a greedy approximation algorithm, runs in O(m + n). Recent research [50] proposes an iterative peeling algorithm that outputs near-optimal and optimal solutions quickly by adding a few more passes to Charikar's greedy algorithm [49]. For Algorithm 1, there is only one loop, which is executed sn times. Inside the loop, den(G_i) is the main step, which takes O(n_i + m_i). For the graph G = G_1, ..., G_i, ..., G_sn, summing over the sn subgraphs gives a running time of O(m + n).

V. EXPERIMENTS
In this section, in-depth evaluations are conducted to demonstrate the efficiency, accuracy, and effectiveness of our incremental greedy approximation approach. The experiments are run on six large dynamic graphs with disordered edges and ground-truth communities. The data sets and experimental configurations are introduced first. Then, the comprehensive evaluation and its results are elaborated. After that, the findings from our experiments are summarized.
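For reference, the linear-time greedy peeling cited in the theoretical analysis (Charikar's greedy algorithm [49]) can be sketched as follows. This is a minimal Python sketch and not a reproduction of Algorithm 1 or Algorithm 2 of this paper. It assumes that edge density means |E(S)|/|S| on a simple undirected graph; the function name greedy_densest_subgraph and the toy edge list are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Greedy peeling for the density |E(S)| / |S| (assumed definition).

    Repeatedly removes a minimum-degree node and returns the densest
    intermediate node set together with its density.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                       # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    n = len(adj)
    if n == 0:
        return set(), 0.0
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    m = sum(degree.values()) // 2

    # Bucket nodes by current degree so all minimum lookups cost O(m + n) in total.
    buckets = [set() for _ in range(n)]
    for v, d in degree.items():
        buckets[d].add(v)

    removed, order = set(), []
    best_density, best_step = m / n, 0   # best_step = nodes removed at the optimum
    d_min = 0
    for step in range(1, n):
        while not buckets[d_min]:
            d_min += 1
        v = buckets[d_min].pop()
        removed.add(v)
        order.append(v)
        m -= degree[v]                   # degree[v] counts only remaining neighbours
        for w in adj[v]:
            if w not in removed:
                buckets[degree[w]].discard(w)
                degree[w] -= 1
                buckets[degree[w]].add(w)
        d_min = max(d_min - 1, 0)        # the minimum degree drops by at most one
        density = m / (n - step)
        if density > best_density:
            best_density, best_step = density, step

    return set(adj) - set(order[:best_step]), best_density


# Toy graph: a 4-clique on nodes 0..3 with a pendant path 3-4-5.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
nodes, density = greedy_densest_subgraph(toy_edges)
print(sorted(nodes), density)            # [0, 1, 2, 3] 1.5
```

On the toy graph, the peeling removes the two path nodes first, and the 4-clique (6 edges on 4 nodes) is the densest set encountered. The degree buckets are one common way to keep the total work linear in the number of nodes and edges.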
TABLE 3. Various real-world networks with ground-truth. CC stands for Clustering Coefficient.
TABLE 4. Comparisons of execution time performance. For average density, the threshold density is defined as the maximum edge density of a graph. For k-core, k is the maximum core number of a graph. For k-clique, k is set to 5 and 3. When k = 3, a 3-clique is a triangle, so the task amounts to counting triangles in the graph.
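The two graph quantities named in the Table 4 caption, the maximum core number and the number of 3-cliques (triangles), can be illustrated with a short Python sketch. This is only an illustration under the assumption of a simple undirected graph; it is not the implementation used in the experiments, and the helper names build_adjacency, max_core_number, and count_triangles are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected simple graph as a dict of neighbour sets (self-loops dropped)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def max_core_number(adj):
    """Largest k for which the graph has a non-empty k-core (simple peeling)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    k = 0
    while remaining:
        # Linear scan for the minimum degree; kept simple rather than optimal.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return k


def count_triangles(adj):
    """Number of 3-cliques; each triangle is counted once via a node ordering."""
    rank = {v: i for i, v in enumerate(adj)}
    total = 0
    for u in adj:
        for v in adj[u]:
            if rank[u] < rank[v]:
                total += sum(1 for w in adj[u] & adj[v] if rank[v] < rank[w])
    return total


# Toy graph: a 4-clique on nodes 0..3 with a pendant path 3-4-5.
toy = build_adjacency(
    [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
)
print(max_core_number(toy), count_triangles(toy))   # 3 4
```

In the toy graph, the 4-clique forms a 3-core and contains four triangles, while the pendant path contributes to neither quantity.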
[24] C. Ma, Y. Fang, R. Cheng, L. V. S. Lakshmanan, W. Zhang, and X. Lin, “On directed densest subgraph discovery,” ACM Trans. Database Syst., vol. 46, no. 4, pp. 1–45, Dec. 2021.
[25] R. Andersen and K. Chellapilla, “Finding dense subgraphs with size bounds,” in Proc. Int. Workshop Algorithms Models Web-Graph, Barcelona, Spain, Feb. 2009, pp. 25–37.
[26] S. Bhattacharya, M. Henzinger, D. Nanongkai, and E. C. Tsourakakis, “Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams,” 2015, arXiv:1504.02268.
[27] S. Khuller and B. Saha, “On finding dense subgraphs,” in Proc. Int. Colloq. Automata, Lang., Program., Rhodes, Greece, Jul. 2009, pp. 597–608.
[28] P. Bombina and B. Ames, “Convex optimization for the densest subgraph and densest submatrix problems,” Social Netw. Oper. Res. Forum, vol. 1, no. 3, p. 23, Sep. 2020.
[29] C. Ma, Y. Fang, R. Cheng, L. V. S. Lakshmanan, and X. Han, “A convex-programming approach for efficient directed densest subgraph discovery,” in Proc. Int. Conf. Manage. Data, Philadelphia, PA, USA, Jun. 2022, pp. 845–859.
[30] R. Dondi, M. M. Hosseinzadeh, G. Mauri, and I. Zoppis, “Top-k overlapping densest subgraphs: Approximation algorithms and computational complexity,” J. Combinat. Optim., vol. 41, no. 1, pp. 80–104, Jan. 2021.
[31] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, USA, Apr. 2010, pp. 631–640.
[32] J. Chen and Y. Saad, “Dense subgraph extraction with application to community detection,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 7, pp. 1216–1230, Jul. 2012.
[33] L. Qin, R.-H. Li, L. Chang, and C. Zhang, “Locally densest subgraph discovery,” in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Sydney, NSW, Australia, Aug. 2015, pp. 965–974.
[34] A. Konar and D. N. Sidiropoulos, “The triangle-densest-K-subgraph problem: Hardness, Lovász extension, and application to document summarization,” in Proc. AAAI Conf. Artif. Intell., Feb. 2022, pp. 4075–4082.
[35] A. Ghasabeh and M. S. Abadeh, “Community detection in social networks using a hybrid swarm intelligence approach,” Int. J. Knowl.-Based Intell. Eng. Syst., vol. 19, no. 4, pp. 255–267, Jan. 2016.
[36] E. Valari, M. Kontaki, and N. A. Papadopoulos, “Discovery of top-k dense subgraphs in dynamic graph collections,” in Proc. 24th Int. Conf. Sci. Stat. Database Manag., 2012, pp. 213–230.
[37] F. Zhou, S. Mahler, and H. Toivonen, Simplification of Networks by Edge Pruning. Berlin, Germany: Springer, 2012, pp. 179–198.
[38] R. Dondi, M. M. Hosseinzadeh, and P. H. Guzzi, “A novel algorithm for finding top-k weighted overlapping densest connected subgraphs in dual networks,” Appl. Netw. Sci., vol. 6, no. 1, p. 40, Dec. 2021.
[39] F. Bonchi, D. García-Soriano, A. Miyauchi, and C. E. Tsourakakis, “Finding densest k-connected subgraphs,” Discrete Appl. Math., vol. 305, pp. 34–47, Dec. 2021.
[40] A. Miyauchi and A. Takeda, “Robust densest subgraph discovery,” in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 1188–1193.
[41] S. Chawla, K. Garimella, A. Gionis, and D. Tsang, Discovering the Network Backbone from Traffic Activity Data. Cham, Switzerland: Springer, 2016, pp. 409–422.
[42] N. Ruan, R. Jin, and Y. Huang, “Distance preserving graph simplification,” in Proc. IEEE 11th Int. Conf. Data Mining, Vancouver, BC, Canada, Dec. 2011, pp. 1200–1205.
[43] G. H. L. Fletcher and M. Pechenizkiy, “On structure preserving sampling and approximate partitioning of graphs,” in Proc. 31st Annu. ACM Symp. Appl. Comput., New York, NY, USA, 2016, pp. 875–882.
[44] D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos, “VOG: Summarizing and understanding large graphs,” in Proc. SIAM Int. Conf. Data Mining, Philadelphia, PA, USA, Apr. 2014, pp. 91–99.
[45] X. Liu, T. Ge, and Y. Wu, “A stochastic approach to finding densest temporal subgraphs in dynamic graphs,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 7, pp. 3082–3094, Jul. 2022.
[46] A. D. Sarma, A. Lall, D. Nanongkai, and A. Trehan, “Dense subgraphs on dynamic networks,” in Proc. 26th Int. Conf. Distrib. Comput., Berlin, Germany, 2012, pp. 151–165.
[47] A. Angel, N. Sarkas, N. Koudas, and D. Srivastava, “Dense subgraph maintenance under streaming edge weight updates for real-time story identification,” Proc. VLDB Endowment, vol. 5, no. 6, pp. 574–585, Feb. 2012.
[48] A. V. Goldberg, “Finding a maximum density subgraph,” EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep., UCB/CSD-84-171, 1984.
[49] M. Charikar, “Greedy approximation algorithms for finding dense components in a graph,” in Proc. Int. Workshop Approximation Algorithms Combinat. Optim. Berlin, Germany: Springer, 2000, pp. 84–95.
[50] D. Boob, Y. Gao, R. Peng, S. Sawlani, C. Tsourakakis, D. Wang, and J. Wang, “Flowless: Extracting densest subgraphs without flow computations,” in Proc. Web Conf., Apr. 2020, pp. 573–583.
[51] J. Leskovec and A. Krevl. (Jun. 2014). SNAP Datasets: Stanford Large Network Dataset Collection. [Online]. Available: http://snap.stanford.edu/data
[52] M. Wang, C. Wang, J. X. Yu, and J. Zhang, “Community detection in social networks: An in-depth benchmarking study with a procedure-oriented framework,” Proc. VLDB Endowment, vol. 8, no. 10, pp. 998–1009, Jun. 2015.
[53] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Keters, R. Rousseau, and A. Vanhoutte, “Similarity measures in scientometric research: The Jaccard index versus Salton's cosine formula,” Inf. Process. Manage., vol. 25, no. 3, pp. 315–318, Jan. 1989.
[54] F. A. McDaid, D. Greene, and J. N. Hurley, “Normalized mutual information to evaluate overlapping community finding algorithms,” 2011, arXiv:1110.2515.
[55] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” J. Stat. Mech., Theory Exp., vol. 2005, no. 9, Sep. 2005, Art. no. P09008.
[56] S. Goswami, C. A. Murthy, and A. K. Das, “Sparsity measure of a network graph: Gini index,” Inf. Sci., vol. 462, pp. 16–39, Sep. 2018.

TAO HAN was born in Beijing, China. She received the M.S. degree in computer science and the Ph.D. degree in computer architecture from Beihang University, China, in 2011 and 2018, respectively. She is currently with the National Center for Public Credit Information, China. Her research interests include credit data management, big data, and data mining.