Received 22 April 2023, accepted 15 May 2023, date of publication 17 May 2023, date of current version 24 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3277197

Efficient Densest Subgraphs Discovery in Large Dynamic Graphs by Greedy Approximation
TAO HAN
National Center for Public Credit Information, Beijing 100045, China
e-mail: [email protected]

ABSTRACT Densest subgraph detection has become an important primitive in graph mining tasks when
analyzing communities and detecting events in a wide range of application domains. Currently, it is a
challenging and practically crucial research issue to develop efficient densest subgraphs mining approaches
that can handle both very large and continuously evolving graphs. Although large-scale or dynamic methods
have been proposed to find the densest subgraphs, there is still a lack of a promising method to deal with graphs
that are both large-scale and dynamically evolving. In this paper, the problem is formulated and proved to be NP-Hard,
and an incremental greedy approximation approach with O(m+n) running time is proposed. To find
the densest subgraph effectively by heuristically merging the local densest subgraphs, first, the edge flow
of a dynamic graph is divided into several subgraphs in a given period T. Second, a local candidate set is
generated by local denser subgraph discovery. Third, the global densest subgraph candidates are collected by
heuristic merging. Last, the densest subgraphs are induced from the global densest subgraph candidates,
under constraints, by a static densest subgraph discovery algorithm. This incremental approach enables us to
scale up the existing densest subgraph discovery algorithms, which focus mainly on small and static graphs,
so that they can handle very large dynamic graphs. Experiments on real-world networks with billions of nodes
for comprehensive evaluation show improvements in efficiency and accuracy: the approach reduces running time by about
25% on average and presents a more accurate estimation of the structure of a graph, with
more compact subgraphs, than the static method. It also performs well when dealing with graphs of varying
densities.

INDEX TERMS Densest subgraphs, incremental greedy approximation, heuristically merging, local candi-
date, global densest subgraph candidates.

I. INTRODUCTION
Massive graphs are commonly used to represent data with related information in a wide variety of domains. Examples include social networks (e.g., Twitter, Facebook), electronic commerce websites (e.g., Amazon), and bioinformatics (e.g., nucleus hierarchy). Nodes are usually used to describe entities, such as Twitter accounts, objects, videos, or pictures, and edges commonly represent relationships or behaviours, such as "follow", "purchase", "share" or "like". The Densest Subgraphs Problem (DSP for short) aims to discover a subset S of nodes that has the highest ratio of the number of edges to the number of nodes in S. It is a fundamental problem in graph mining [1] and is the basis for various applications, including community mining [2], picture feature selection [3], decision making [4], graph computing systems [5], and action recognition [6].

Researchers have investigated various features of the densest subgraphs, and there are several definitions of "dense parts". These definitions mainly concern the features of nodes and their degrees in subgraphs, such as average density [7], [8], k-core [9], [10], and k-clique [11]. Corresponding to the definitions, algorithms for finding the densest subgraph have been proposed. These algorithms can be roughly classified into the following three categories [11]: 1) heuristics without theoretical guarantees; 2) enumeration techniques that are usually applicable to small graphs; and 3) polynomial-time solvable algorithms with theoretical guarantees. However,

The associate editor coordinating the review of this manuscript and approving it for publication was Hocine Cherifi.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 49367
T. Han: Efficient Densest Subgraphs Discovery in Large Dynamic Graphs by Greedy Approximation

many challenges hinder the discovery of the densest subgraphs in massive graphs, especially when the graphs are dynamic and evolve over time. First, the scale of graphs can be very large, making it hard to fit them into memory in their entirety for processing. Although processing massive graphs costs polynomial time in theory, their sheer size makes it computationally infeasible in practice [12], [13], [14], [15]. Second, the graphs are usually developing or evolving over time with link creations and deletions. It is very challenging for traditional densest subgraph mining algorithms to obtain the whole graph, and they therefore cannot process the graph completely.

To address the above-mentioned problems, an incremental approximation approach to more efficiently discover the densest subgraphs is presented in this paper. It first uses a sliding time window segmentation strategy to divide the vast-volume streaming graph distributed in the time windows into small graphs within a given period T. Second, the local densest subgraphs are detected by a density function after a preprocessing step that filters the least connected edges. Third, when the small graphs are processed, the approach incrementally merges the subgraphs to approximate the densest subgraphs by a greedy algorithm. Finally, the global densest subgraphs are detected by density functions.

The main contributions of this work are summarized as follows:
1) We give a proof that the problem is NP-Hard.
2) We propose a novel incremental framework to efficiently discover the densest subgraphs by heuristically merging the local densest subgraphs. The merging algorithm greedily constructs the densest subgraphs candidate set to preserve the main dense parts of the graph by pruning the least connected parts and adding the most connected parts. The running time is O(m + n).
3) We have conducted in-depth evaluations of the proposed method on six large-scale networks. The experimental results demonstrate the effectiveness and efficiency of the greedy approximation method. When processing more edges and nodes, our approach performs more strongly: it reduces running time by about 25% on average across four density definitions on the six diverse datasets, and the execution time declines drastically with all subgraph detection algorithms when the size of data reaches 0.6 billion, with reductions ranging from 300 s to 800 s. We compare the quality of subgraphs detected by our greedy approximation method and the static method through three measures: Cross Common Fraction (CCF), Jaccard Index (JI) and Normalized Mutual Information (NMI). Our approach outperforms the static method by 6%, 7%, and 1% on average, respectively. The influence of sparsity on synthetic graphs with different densities is also discussed. Experiments show that this method is sparsity-friendly.

This paper is organized as follows. Section II discusses related work and motivations. In Section III, the densest subgraph problem (DSP) is defined, and the proof that the densest subgraph problem is NP-Hard is given. Section IV presents a greedy approximation approach for DSP and a theoretical analysis of its running time. The proposed approach is demonstrated in Section V through experiments over six large network datasets. Finally, Section VI concludes the paper.

II. RELATED WORK AND MOTIVATIONS
DSP has been widely researched in data mining and theoretical computer science. There have been comprehensive surveys [16], [17], [18], [19] and tutorials [20] about the discovery of the densest subgraph. Investigations in data mining include those reported by Tsourakakis [1], [10], [12], [21], [22], [23], [24]. Studies in theoretical computer science can be found in [25], [26], [27], [28], [29], [30].

A. MAIN STRUCTURES
Average density is one of the main structures for graph mining. Balalau [12] considered limited overlapping as the concern of finding a set of dense subgraphs and devised an approximation and heuristic algorithm to solve the NP-Hard problem. Xie [8] gave comparative research in overlapping community detection. Macgregor [7] proposed a one-pass approximation algorithm to find the densest subgraph where the graph was an unordered stream of edge deletions and insertions. Leskovec [31] empirically analyzed algorithms for finding communities in networks. Chen [32] put forward an algorithm to find multiple dense subgraphs in a certain sparse graph. K-core is usually used to find core decomposition. Bonchi [9] designed an algorithm to process uncertain graphs efficiently through core decomposition. It aimed to reduce the number of exact density computations by parameter-free approximation and by an exact algorithm as well [10]. A K-clique contains k(k − 1)/2 edges, and its average density is the largest. The researchers in [33], [34] identified top-k subgraphs with triangles that can represent the dense region of a large-scale graph in a parameter-free fashion in polynomial computing time. Ghasabeh [35] proposed a unified framework to compare and evaluate different algorithms for finding subgraphs in social networks. Valari [36] investigated the discovery of top-k dense subgraphs in both static and dynamic graphs.

Other structures have also been detected in many applications. A great many research works focus on reducing the size of huge-volume graphs with desirable structures, such as connectivity [37], [38], [39], [40], backbone [41], distance [42], bi-simulation structure [43] and vocabulary subgraphs [44]. Although Koutra's paper managed to summarize six predefined structures (vocabulary subgraphs) in a million-node graph, it did not dedicate itself to finding dense subgraphs in large graphs. Moreover, these researchers preserved all the nodes and removed parts of the edges, so that the dense parts of the original graph deteriorated.

B. DYNAMIC SETTINGS
The articles [45], [46] proposed a congest model-based dynamic approach for densest subgraph discovery in a


distributed fashion. Angel et al. [47] studied densest subgraph maintenance on weighted graphs to find interesting events posted on Twitter. The top-k densest subgraph problem in dynamic collections is studied in [36], where the densest subgraph is removed recursively or stays in the collections.

This paper presents an incremental approximation approach to more efficiently discover the densest subgraphs in large dynamic graphs, which greedily discovers local optima and gradually approximates the global densest subgraphs in the giant graph. The highlights of this method include a time window strategy for processing dynamic data, a greedy strategy for detecting the local densest subgraph, an incremental approximation strategy for constructing candidate sets, and a selection strategy for revealing the densest subgraph.

III. PROBLEM STATEMENT
In this section, the fundamental concepts of the densest subgraph problem (DSP) in sliding windows are defined first. Then, the challenges in solving this problem and the densest subgraphs merging problem in graph approximation are defined.

Let G(V, E) represent a graph, where each vertex v ∈ V denotes a user, and each edge e ∈ E denotes a friendship in a social network. The numbers of nodes and edges in the graph are |V| = n and |E| = m, respectively. Clusters, partitions, subgraphs, or communities all describe the intrinsic structures of the graph, reflecting the densest parts of the graph. They are built from a set of non-empty node subsets and corresponding edges. Subgraphs are represented by Subgraphs = {G1, ..., Gsn}, where sn is the number of subgraphs and

$$ \bigcup_{i=1}^{sn} G_i \subseteq G. $$

Fig. 1 illustrates the segmentation of the graph flow. Each Gi in Fig. 1 can be viewed as a subgraph induced from G by a sliding time window. The notations used throughout this paper are listed in Table 1.

FIGURE 1. Graph in time windows are segmented into subgraphs.

In this paper, we use three density functions listed in Table 2. Average density focuses on the average density of the subgraph, while k-clique selects subgraphs with the highest average density of a graph with k nodes. k-core is used to conduct core decomposition and find subgraphs, including nodes with at least k links. γ is a parameter to specify the percentage of nodes left. The densest subgraph discovery problem in sliding time windows is formulated as

$$ G^* = \max \operatorname{den}(G) \approx \max \operatorname{den}\Big( \bigcup_{i=1}^{sn} \operatorname{den}(G_i) \Big) \quad \text{subject to: } |G^*| \ge \gamma |G|. \qquad (1) $$

The streaming graph G is segmented into several subgraphs Gi according to a time period T. den(Gi) is applied to find local densest subgraphs in Gi. Then a heuristic merge algorithm (represented by $\bigcup_{i=1}^{sn}$) combines local densest subsets, and a global densest subgraph candidate set is constructed. Finally, the maximum density subgraphs G∗ induced from the global candidate set by the density function are the densest subgraph approximations of the original graph G. Note that G∗ needs to satisfy an edge size constraint |G∗| ≥ γ|G|.

We reduce the densest k subgraph problem (which is NP-hard [27]) to the densest at least k subgraph problem. Suppose we are given a graph G and a parameter λ, and we want to know whether a subgraph with size λ and density ≥ d exists. We construct a clique G′ of size n², where V(G) = n. Consider the graph formed by the union of G and G′, G ∪ G′, and ask for a subgraph with size at least n² + λ of maximum density. The solution S for G ∪ G′ satisfies two properties: 1) G′ ⊂ S and 2) E(S ∩ G) = λ.

From the above two properties, S ∩ G is the maximum density subgraph with size λ in G, since otherwise we could obtain a better result by replacing the size-λ subgraph S ∩ G with the densest λ subgraph. Thus, if the result S returned by the algorithm has density

$$ \delta(S) \ge \frac{\binom{n^2}{2} + \lambda d}{n^2 + \lambda}, $$

then we get the densest subgraph. Therefore, the densest at least k subgraph problem is also NP-hard.

Proof of two properties.
1) G′ ⊂ S: Suppose this is not satisfied. Let G′ − S = T ≠ φ, V(S) = r, and let the density of S be δ(S). When T is added to S, the additional edges (edges incident on nodes in T) are denoted by T′. So the new density of S ∪ T is

$$ \frac{\delta(S)\, r + E(T')}{r + V(T)}. $$

We can get δ(S)V(T) < E(T′) from the earlier setting on δ(S). So, we can obtain

$$ \frac{\delta(S)\, r + E(T')}{r + V(T)} > \frac{\delta(S)\, r + \delta(S)\, V(T)}{r + V(T)} = \delta(S). $$

The density of S ∪ T is larger than the density of S, which contradicts the assumption that δ(S) is the maximum density. Therefore, G′ must be contained in S.
2) E(S ∩ G) = λ: Let SG represent S ∩ G. Since E(S) ≥ n² + λ, E(SG) ≥ λ. Suppose, if possible, that E(SG) = λ′ > λ. Let δ(SG) denote the density of S ∩ G. Then the density of S is

$$ \frac{\delta(SG)\, \lambda' + E(G')}{\lambda' + V(G')} \le \frac{\lambda'(\lambda' - 1)/2 + E(G')}{\lambda' + V(G')} \le \frac{1 + E(G')}{\lambda' - 1 + V(G')}. $$

Hence the density of S is a decreasing function of E(SG). Thus we have E(SG) = λ.

IV. INCREMENTAL GRAPH APPROXIMATION FOR DENSEST SUBGRAPHS DETECTION IN DYNAMIC GRAPHS
This section will introduce our incremental graph approximation method in detail. First, the fundamental idea of this


method is discussed. Then, the architecture and procedures of this method and the corresponding algorithms are given.

TABLE 1. Notations used in this paper.

TABLE 2. Three popular metrics used to define graph density.

Algorithm 1 High-Level Algorithm of Incremental Densest Subgraph Detection
Input: G, γ
Output: G∗
LCS ← φ;
GCS ← φ;
removed_edges ← φ;
for i = 1 to sn do
    Gi ← (vi1, vi2), . . . , (vim, vin);
    G′i ← den(Gi);
    if |G′i| ≥ γ|Gi| then
        Si ← G′i;
    else
        Si ← G′i ∪ G′′i;
    LCS ← LCS ∪ Si;
GCS ← Approximation(LCS, removed_edges, γ);
G∗ ← den(GCS);

A. PRINCIPLES
In graph streams, a small graph Gi is first generated in a sliding time window T. Then, Gi is processed to construct the local densest subgraph candidate set Si by a density function. Third, a heuristic merge algorithm combines Si, i = 1, 2, . . . , sn, to form a global candidate set. Finally, the densest subgraph G∗ is detected from the global candidate set under the size constraint based on the density function.

B. ARCHITECTURE AND COMPONENTS
As illustrated in Fig. 2, the proposed approach includes four steps for solving DSP in continuous sliding windows: graph partition, local densest subgraphs generation, incremental approximation, and densest subgraph detection.

1) GRAPH PARTITION
In this step, we deal with a streaming graph with continuous edges. We first divide the period T equally into sn sliding time windows and then partition the graph into sn streaming subgraphs G1, . . . , Gi, . . . , Gsn, each associated with a sliding time window. We do not focus much on the partition algorithm, which is also an exciting and challenging problem in graph mining. We only apply an existing well-developed method to capture subgraphs individually.

2) LOCAL DENSEST SUBGRAPHS GENERATION
Within subgraph Gi, the local candidate set (LCS) is generated by discovering the local densest subgraphs Si with the constraint |Si| ≥ γ|Gi| through the function den(Gi). γ is a parameter to control the percentage of nodes left.

$$ LCS = S_i = \max \operatorname{den}(G_i) \quad \text{s.t. } |S_i| \ge \gamma |G_i| \qquad (2) $$

The constraint |Si| ≥ γ|Gi| makes LCS different from the classical DSP; the cases are given in Equation 3: (i) When the densest subgraph G′i in Gi satisfies |G′i| = γ|Gi|, then Si = G′i. (ii) When |G′i| > γ|Gi|, the node size


Algorithm 2 Incremental Approximation Algorithm
Input: LCS, γ
Output: GCS
GCS ← φ;
for Si in LCS do
    if i == 1 then
        S0 ← S1;
    else
        if Si ∩ S0 ≠ φ then
            if density(S0 ∩ Si) ≥ density(S0) then
                Si ← Si ∩ S0;
            else
                Si ← Si − (Si ∩ S0);
        S0 ← Si ∪ S0;
    regulate |S0| according to γ;
GCS ← S0;
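As a rough illustration of Algorithm 2, the merge loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: candidates are represented as node sets, `density` is taken to be the average density (induced edges divided by nodes), and the `regulate` hook stands in for the step "regulate |S0| according to γ", whose exact rule the listing does not spell out. The names `density`, `incremental_approximation`, `edges`, and `regulate` are hypothetical.

```python
def density(nodes, edges):
    """Average density |E(S)| / |S| of the subgraph induced by `nodes`.

    Assumption: `edges` is an iterable of (u, v) pairs of the whole graph.
    """
    if not nodes:
        return 0.0
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / len(nodes)


def incremental_approximation(lcs, edges, regulate=lambda s: s):
    """Merge local candidates S_1..S_sn into one global candidate set.

    `regulate` is a placeholder for the gamma-based size regulation; by
    default it leaves the candidate unchanged (an assumption of this sketch).
    """
    s0 = set()
    for i, si in enumerate(lcs):
        si = set(si)
        if i == 0:
            s0 = si                      # the first candidate seeds S0
        else:
            overlap = si & s0
            if overlap:
                if density(overlap, edges) >= density(s0, edges):
                    si = overlap         # keep only the dense shared part
                else:
                    si = si - overlap    # drop the sparse shared part
            s0 = si | s0                 # merge into the running candidate
        s0 = regulate(s0)
    return s0


# Usage: a 4-clique, a short tail, and an isolated edge.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (7, 8)]
gcs = incremental_approximation([{1, 2, 3, 4}, {3, 4, 5}, {7, 8}], edges)
```

With the default `regulate`, every candidate is eventually unioned into S0, so a concrete regulation rule (for example, trimming low-degree nodes down to a target size) would be needed to obtain a compact result.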

in G′i exceeds the limit defined by parameter γ; thus we accept all the nodes in G′i because it is an optimal local solution and the candidate set will be further processed. (iii) When |G′i| < γ|Gi|, more nodes and edges, represented by G′′i, are required to be added to the local candidate set to reach the size limit.

$$ S_i = \begin{cases} G'_i & \text{if } |G'_i| \ge \gamma |G_i| \\ G'_i \cup G''_i & \text{if } |G'_i \cup G''_i| = \gamma |G_i| \end{cases} \qquad (3) $$

FIGURE 2. Four steps of Incremental Graph Approximation.

3) INCREMENTAL APPROXIMATION
After the above two steps, the Local Candidate Set of S1, . . . , Si, . . . , Sj, . . . , Ssn is established. The graph approximation algorithm assembles the local candidate set of subgraphs into the Global Candidate Set (GCS). The detailed descriptions are in Algorithm 2 and the flowchart is presented in Fig. 3.

4) DENSEST SUBGRAPHS DETECTION
This step aims at detecting the densest subgraphs from the simplified graph S by the density(S) functions listed in Table 2, including edge density, k-core and k-clique. They illustrate "dense" from three aspects: The edge density defines the maximum average density of the graph and emphasizes the average density in subgraphs. The k-core finds subgraphs with nodes having at least k edges and focuses on the connection of subgraphs. The k-clique discovers a group of k nodes that directly connect to each other with k(k − 1)/2 edges and highlights the strongly connected components.

C. THEORETICAL ANALYSIS
In this section, the running time is discussed. Let T(v, e) be the time required for a minimum capacity cut in a graph with v nodes and e edges [48]. For the minimum-capacity-cut approach, there is only one loop, which is executed ⌈log((m + 1)n(n + 1))⌉ = O(log n) times. Inside the loop, finding the min-cut is the main step. The graph has n + 2 = O(n) nodes and 2m + 2n = O(m + n) edges, so the running time is O(T(n, n + m) log n). A previous algorithm [48] discovers a minimum capacity cut in a graph with k nodes in O(k³) steps. If this method is applied, a maximum density subgraph can be found in time O(n³ log n). A faster min-cut approach improves the running time of the maximum density approach; e.g., Sleator's algorithm [48] has T(v, e) = O(ve log v). This min-cut algorithm yields an O(n(n + m) log n log(n + m)) bound, which is better for sparse graphs. A faster result [49] by a greedy approximation algorithm runs in O(m + n). Recent research [50] proposes an iterative peeling algorithm that can output near-optimal and optimal solutions fast by adding a few more passes to Charikar's greedy algorithm [49]. For Algorithm 1, there is only one loop, which is executed sn times. Inside the loop, den(Gi) is the main step, which is O(ni + mi). For graph G = G1, . . . , Gi, . . . , Gsn, the per-window costs sum to a total running time of O(m + n).

V. EXPERIMENTS
In this section, in-depth evaluations are conducted to demonstrate the efficiency, accuracy, and effectiveness of our incremental greedy approximation approach. They are implemented on six large dynamic graphs with disordered edges and ground-truth communities. Data sets and experimental configurations are introduced first. Then, comprehensive evaluation and results are elaborated. After that, the findings from our experiments are summarized.
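Before turning to the experiments, the per-window loop analysed in Section IV-C (partition the edge stream, then run a greedy den(Gi) on each window) can be sketched in Python. This is a hedged illustration, not the paper's code: `greedy_peel` follows the general idea of Charikar-style peeling [49] (repeatedly remove a minimum-degree node and remember the densest intermediate subgraph), but it uses a binary heap with lazy deletion, so it runs in O((m + n) log n) rather than the O(m + n) obtainable with a bucket queue. The function names, the `(u, v, t)` edge format, and the fixed `window` length are assumptions made for this example, and node identifiers are assumed to be mutually comparable (e.g., integers).

```python
import heapq
from collections import defaultdict


def greedy_peel(edges):
    """Return (nodes, density) of the densest subgraph seen while peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    if not adj:
        return set(), 0.0
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)
    alive = set(adj)
    m = sum(degree.values()) // 2        # edges among alive nodes
    best_density = m / len(alive)
    removed = []                         # nodes in removal order
    best_removed = 0                     # removals done when the best was seen
    while len(alive) > 1:
        d, u = heapq.heappop(heap)
        if u not in alive or d != degree[u]:
            continue                     # stale heap entry, skip it
        alive.remove(u)
        removed.append(u)
        m -= degree[u]                   # degree[u] counts alive neighbours
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                heapq.heappush(heap, (degree[v], v))
        current = m / len(alive)
        if current > best_density:
            best_density = current
            best_removed = len(removed)
    return set(adj) - set(removed[:best_removed]), best_density


def partition_by_window(timed_edges, window):
    """Group (u, v, t) edges into consecutive time windows of length `window`."""
    buckets = defaultdict(list)
    for u, v, t in timed_edges:
        buckets[int(t // window)].append((u, v))
    return [buckets[k] for k in sorted(buckets)]


def local_candidates(timed_edges, window):
    """One local densest-subgraph candidate (a node set) per time window."""
    return [greedy_peel(g)[0] for g in partition_by_window(timed_edges, window)]
```

Because each edge falls into exactly one window, the peeling cost is paid once per edge across all windows, which is the intuition behind the additive running-time argument above.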


FIGURE 3. Flowchart of Incremental Approximation Algorithm.

A. EXPERIMENTAL SETTINGS
To evaluate the performance of our approach in a consistent way, we carry out experiments on six data sets from SNAP [51]: LiveJournal, Friendster, Youtube, Orkut, DBLP, and Amazon. These data sets (in Table 3) differ in density, number and scale of communities, clustering coefficient (CC for short), and the number of triangles. CC is the average clustering coefficient, a widely accepted metric to measure the quality of subgraphs.

We measure the running time performance of both our approach and the three state-of-the-art density function algorithms, including average density, k-core, and k-clique, corresponding to the three density definitions. For subgraphs discovered by average density, the maximum average density is set to density. For communities induced by k-core and k-clique, we set k to 3 and 5, respectively.

Three metrics are used for performance evaluation: Cross Common Fraction (CCF) [52], Jaccard Index (JI) [53] and Normalized Mutual Information (NMI) [54].

FIGURE 4. The processing time performance of four approaches with our approach and the static method on LiveJournal.

FIGURE 5. The processing time performance of four approaches with our approach and the static method on Friendster.

B. IMPROVEMENT IN THE COMPUTATIONAL EFFICIENCY
Across the six datasets, our greedy approximation approach dramatically improves on the static algorithm in the execution time of discovering the densest subgraphs with four algorithms. Table 4 presents part of our simulation results. It also verifies the scalability of our approach and the static method with size n increasing from 0.3 million to 0.6 billion. When processing more edges and nodes, our approach performs more strongly: it reduces running time by about 25% on average across four density definitions on the six diverse datasets, and the execution time declines drastically with all subgraph detection algorithms when the size of data reaches 0.6 billion, with reductions ranging from 300 s to 800 s. Among the density algorithms, the processing time required by k-core is much longer than that of the other algorithms. The k-clique algorithm shows a relatively stable result when the parameter ranges from 3 to 5. When k is set to 3, it requires many more calculations in theory than with the larger parameter 5. Table 4 also reveals that our approach performs well when discovering subgraphs with diverse parameter-sensitive definitions. It reduces execution time by at least 23% on average.

Our greedy approximation approach has largely reduced the execution times. This is graphically demonstrated in


Figs. 4-9, which show vivid comparisons of running times over six data sets between our approach (marked in blue) and the static algorithm (marked in red): the execution time marked in blue is always less than the red one.

TABLE 3. Various real-world networks with ground-truth. CC stands for Clustering Coefficient.

TABLE 4. Comparisons of execution time performance. In average density, the threshold density is defined as the maximum edge density of a graph. In k-core, k is the maximum core number of a graph. In k-clique, k is set to 5 and 3; when k = 3, 3-clique means counting triangles in the graph.

FIGURE 6. The processing time performance of four approaches with our approach and the static method on Orkut.

FIGURE 7. The processing time performance of four approaches with our approach and the static method on Youtube.

FIGURE 8. The processing time performance of four approaches with our approach and the static method on DBLP.

FIGURE 9. The processing time performance of four approaches with our approach and the static method on Amazon.

Overall, our greedy approximation approach reduces execution time by about 25% on average and performs better as the size of the graph grows rapidly. Our approach


is also applicable to state-of-the-art algorithms to improve efficiency in diverse real-world applications.

FIGURE 10. CCF.

FIGURE 11. JI.

FIGURE 12. NMI.

TABLE 5. The accuracy performance of k-core with our approach and the static method on six datasets.

C. IMPROVEMENT IN THE ACCURACY
In this section, we compare the quality of subgraphs detected by our greedy approximation method and the static method through three measures: Cross Common Fraction (CCF), Jaccard Index (JI) and Normalized Mutual Information (NMI). CCF [52] finds the maximal shared parts of the subgraphs between the induced one and the real one. Formally, it is defined as

$$ CCF = \frac{1}{2} \sum_{i=1}^{cn} \max_{j} |C_i \cap C'_j| + \frac{1}{2} \sum_{j=1}^{cn'} \max_{i} |C_i \cap C'_j| \qquad (4) $$

where cn and cn′ are the numbers of subgraphs from discovery and in the original graph, respectively, and Ci and C′j are the subgraphs induced by algorithms and in the real community, respectively.

Jaccard Index (JI) [53] is frequently applied to measure similarity by calculating classification results of node pairs, which could be clustered into the same subgraph or different subgraphs. It can be defined as

$$ JI = \frac{N_{ss}}{N_{ss} + N_{ds} + N_{sd}} $$

where Nss is the number of node pairs that are classified into the same subgraph both by the algorithm and in the original graph, Nsd stands for node pairs that are in the same subgraph but are divided into different subgraphs, and Nds vice versa.

Normalized Mutual Information (NMI) [54], [55] is based on information theory for evaluating accuracy. The score of NMI is defined as

$$ NMI = \frac{-2 \sum_{i,j} N_{i,j} \log \frac{N_{i,j} N_t}{N_{i.} N_{.j}}}{\sum_{i} N_{i.} \log \frac{N_{i.}}{N_t} + \sum_{j} N_{.j} \log \frac{N_{.j}}{N_t}} \qquad (5) $$

where N is the confusion matrix, and Ni,j is the shared part between a detected subgraph Si and a real one S′j. Ni. and N.j stand for the sums by row and column, respectively, and Nt = Σi Σj Ni,j.

Figs. 10-12 and Table 5 indicate that subgraphs derived from our approach with the k-core definition are more compact and denser. For Cross Common Fraction (CCF), subgraphs induced from our approach have 66% to 85% similarity with the original graph, whereas subgraphs detected by the static method have 59% to 80% similarity with the original graph. Our approach outperforms the static method by 6% on average. For Jaccard Index (JI), which considers whether node pairs that are in the same subgraph are classified into the same subgraph, subgraphs discovered by our approach have 67% to 84% accuracy in categorizing the node pairs into the same group, whereas the accuracy of the static method ranges from 54% to 77%. Our approach outperforms the static method by close to 7% on average. For Normalized Mutual Information, the scores show that our approach gets higher scores than the static method does. Subgraphs detected by our approach get scores ranging from 0.68 to 0.85 on six different datasets, while subgraphs detected by the static method get scores ranging from 0.55 to 0.77. Our approach performs better than the static method by 0.1 on average.
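To make the three accuracy measures above concrete, the following Python sketch computes them on toy inputs. It is an illustration under assumptions rather than the evaluation code used in the paper: `ccf` implements Equation 4 exactly as printed (an unnormalised count of shared nodes), while `pair_jaccard` and `nmi` assume non-overlapping communities given as one label per node. All function and parameter names are hypothetical.

```python
from collections import Counter
from math import log


def ccf(found, truth):
    """Equation 4 as printed: `found` and `truth` are non-empty lists of node sets."""
    a = sum(max(len(c & t) for t in truth) for c in found)
    b = sum(max(len(c & t) for c in found) for t in truth)
    return 0.5 * a + 0.5 * b


def pair_jaccard(pred, truth):
    """Pair-counting Jaccard index N_ss / (N_ss + N_sd + N_ds).

    `pred` and `truth` are equal-length label lists (one label per node).
    """
    n_ss = n_sd = n_ds = 0
    n = len(pred)
    for a in range(n):
        for b in range(a + 1, n):
            same_pred = pred[a] == pred[b]
            same_true = truth[a] == truth[b]
            if same_pred and same_true:
                n_ss += 1                # together in both
            elif same_true:
                n_sd += 1                # together in truth, split by algorithm
            elif same_pred:
                n_ds += 1                # split in truth, joined by algorithm
    total = n_ss + n_sd + n_ds
    return n_ss / total if total else 1.0


def nmi(pred, truth):
    """Equation 5 computed from the confusion matrix of two labelings."""
    n_t = len(pred)
    joint = Counter(zip(pred, truth))    # N_ij
    row = Counter(pred)                  # N_i.
    col = Counter(truth)                 # N_.j
    num = -2.0 * sum(n_ij * log(n_ij * n_t / (row[i] * col[j]))
                     for (i, j), n_ij in joint.items())
    den = (sum(c * log(c / n_t) for c in row.values())
           + sum(c * log(c / n_t) for c in col.values()))
    return num / den if den else 1.0     # both labelings trivial: treat as equal
```

For example, `nmi([0, 0, 1, 1], [0, 0, 1, 1])` evaluates to 1 (identical partitions), while two statistically independent labelings such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give 0. The quadratic loop in `pair_jaccard` is only suitable for small examples; a contingency-table formulation would be needed at the scale of the datasets in Table 3.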


Overall, the experimental results have shown that our greedy approximation approach with different density definitions presents a more accurate estimation of the structure of a graph, with more compact subgraphs, than the static method. When the graph size grows dramatically, our approach helps speed up the subgraph discovery processes and guarantees the quality of the derived subgraphs. The output subgraphs are also more compact when the graph size varies drastically.

D. INFLUENCE ON SPARSITY
The sensitivity of the proposed approach when applied to datasets with different network sparsity [56] is also a concern in practice. In this experiment, synthetic networks with various average degrees (degree increasing from 10 to 70 at intervals of 10; note that |E| = η|V| and degree ≈ 2η) are generated by gradually increasing the density based on LiveJournal.

The variations of the total time cost are shown in Fig. 13. The growth trend of the proposed approach is consistent with the original algorithm shown in Fig. 4. We also study the accuracy of the relevant algorithms on datasets with various densities and present results in Fig. 14.

We present CC (Fig. 14) and CCF (Fig. 15) as representatives of accuracy, respectively, and observe similar trends on other measures. Our approach produces the expected community with higher CC values as the density increases. From the perspective of CCF evaluation accuracy, our approach not only significantly outperforms the static algorithm but also shows more stable performance and is less sensitive to density. In addition, we find that the number of detected communities decreased as density increased.

An increase in network density generally results in a higher clustering coefficient. For most algorithms, it results in smaller detected communities, and in general, our approach outperforms static algorithms in overall performance and stability.

FIGURE 13. Overall time costs of varying degree.

FIGURE 14. CC of varying degree.

FIGURE 15. CCF of varying degree.

TABLE 6. Comparisons between our greedy approximation approach and the static method.

E. SUMMARY OF THE COMPARISONS WITH THE STATIC METHOD
A summary of comparisons between the greedy approximation approach presented in this paper and the static method for densest subgraph problem computation is tabulated in Table 6. It is seen from Table 6 that the static method is relatively easier to use than our approach: our approach requires sliding-window settings, whereas the static method needs no parameters to be set. Other comparison items are all in favour of our approach. Particularly, our approach is suitable for DSP computation with streaming data, but the static method is not directly applicable in this case. As the static method will miss useful information in the data sets, our approach gives further improvements in accuracy. Also, compared to the static method, our approach is more computationally efficient, leading to much-improved scalability. Our approach

uses existing DSP extraction techniques as underlying tools for DSP computation. Our approach incorporates DSP computation techniques, thus requiring dynamic updates and interactions with the DSP computation during the evolution of the sliding window for fixed or streaming data sets.
The main novelty and significance of this work are as follows. 1) A greedy approximation approach is presented for densest subgraph discovery in sliding windows over graphs that are both very large and highly evolving. 2) It helps understand and reveal the core structures of networks with fewer computations in a continuous fashion and thus reduces the total execution time; the running time is O(m + n). 3) The approach is designed to handle large-scale, continuously updated streaming data by processing continuous sliding windows over edge streams. In particular, it captures the densest subgraphs locally and constructs the global candidate set to reduce computational complexity by greedy approximation. 4) Besides efficiency and accuracy, the influence of sparsity on our approach is studied with real-world problems in mind, and the results show more stable performance.

VI. CONCLUSION
An incremental greedy approximation approach in sliding windows has been presented in this paper for finding the densest subgraphs in graphs that are both large-scale and highly dynamic. It helps reveal the densest communities of networks with less complicated computations in a continuous model and thus reduces the total processing time. The approach has been designed for handling large-scale dynamic streaming data by processing persistent sliding windows over edge streams. In particular, the approach preserves the densest subgraphs locally and constructs the global candidate set to reduce computational complexity.
Experiments have been carried out using this greedy approximation approach with four density definitions on six real-world graphs with ground truth. The results have demonstrated the efficiency and accuracy of the presented greedy approximation approach. The experiments also show that the approach is less sensitive to density than the static method. Therefore, the greedy approximation approach performs well in investigating large-scale networks.
The work in this paper opens the door to future research in DSP computing in the following three ways. From the perspective of intelligent systems, graph summarization and simplification based on new knowledge or features will bring a powerful evolution process for big stream data. From the perspective of graph visualization, the graphical representations of graph summarization and simplification, as well as their evolution, are significant for better understanding the dynamic features of application networks. From the perspective of processing time, distributed and parallel computing in a cloud cluster can use more CPU and storage resources to achieve faster DSP computing.

REFERENCES
[1] C. Tsourakakis, "The K-clique densest subgraph problem," in Proc. 24th Int. Conf. World Wide Web, Florence, Italy, May 2015, pp. 1122–1132.
[2] H. Hafidi, P. Ciblat, M. Ghogho, and A. Swami, "Graph-assisted Bayesian node classifiers," IEEE Access, vol. 11, pp. 23989–24002, 2023.
[3] M. Chen, Y. Wang, Y. Chen, H. Zhu, Y. Xie, and P. Guo, "Adaptive graph representation for clustering," IEEE Access, vol. 10, pp. 122981–122994, 2022.
[4] E. M. A. Butt, W. Mahmood, F. M. O. Tawfiq, Q. Xin, and M. Shoaib, "A study of complex dombi fuzzy graph with application in decision making problems," IEEE Access, vol. 10, pp. 102064–102075, 2022.
[5] T. A. Ayall, H. Liu, C. Zhou, A. M. Seid, F. B. Gereme, H. N. Abishu, and Y. H. Yacob, "Graph computing systems and partitioning techniques: A survey," IEEE Access, vol. 10, pp. 118523–118550, 2022.
[6] M. Rahevar, A. Ganatra, T. Saba, A. Rehman, and S. A. Bahaj, "Spatial–temporal dynamic graph attention network for skeleton-based action recognition," IEEE Access, vol. 11, pp. 21546–21553, 2023.
[7] A. McGregor, D. Tench, S. Vorotnikova, and T. H. Vu, "Densest subgraph in dynamic graph streams," in Proc. 40th Int. Symp. Math. Found. Comput. Sci., Italy, Aug. 2015, pp. 472–482.
[8] J. Xie, S. Kelley, and B. K. Szymanski, "Overlapping community detection in networks: The state-of-the-art and comparative study," ACM Comput. Surv., vol. 45, no. 4, p. 43, 2013.
[9] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich, "Core decomposition of uncertain graphs," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2014, pp. 1316–1325.
[10] N. Tatti and A. Gionis, "Density-friendly graph decomposition," in Proc. 24th Int. Conf. World Wide Web, New York, NY, USA, May 2015, pp. 1089–1099.
[11] M. Mitzenmacher, J. Pachocki, R. Peng, C. Tsourakakis, and S. C. Xu, "Scalable large near-clique detection in large-scale networks via sampling," in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Sydney, NSW, Australia, Aug. 2015, pp. 815–824.
[12] O. D. Balalau, F. Bonchi, T.-H.-H. Chan, F. Gullo, and M. Sozio, "Finding subgraphs with maximum total density and limited overlap," in Proc. 8th ACM Int. Conf. Web Search Data Mining, Shanghai, China, Feb. 2015, pp. 379–388.
[13] C. Ma, Y. Fang, R. Cheng, L. V. S. Lakshmanan, W. Zhang, and X. Lin, "Efficient algorithms for densest subgraph discovery on large directed graphs," in Proc. ACM SIGMOD Int. Conf. Manage. Data, Portland, OR, USA, Jun. 2020, pp. 1051–1066.
[14] Y. Fang, K. Yu, R. Cheng, L. V. S. Lakshmanan, and X. Lin, "Efficient algorithms for densest subgraph discovery," Proc. VLDB Endowment, vol. 12, no. 11, pp. 1719–1732, Jul. 2019.
[15] X. Liu, T. Ge, and Y. Wu, "Finding densest lasting subgraphs in dynamic graphs: A stochastic approach," in Proc. IEEE 35th Int. Conf. Data Eng. (ICDE), Macao, China, Apr. 2019, pp. 782–793.
[16] E. V. Lee, N. Ruan, R. Jin, and C. C. Aggarwal, "A survey of algorithms for dense subgraph discovery," in Managing and Mining Graph Data. Cham, Switzerland: Springer, 2010, pp. 303–336.
[17] A. McGregor, "Graph stream algorithms: A survey," ACM SIGMOD Rec., vol. 43, no. 1, pp. 9–20, May 2014.
[18] S. Harenberg, G. Bello, L. Gjeltema, S. Ranshous, J. Harlalka, R. Seay, K. Padmanabhan, and N. Samatova, "Community detection in large-scale networks: A survey and empirical evaluation," WIREs Comput. Statist., vol. 6, no. 6, pp. 426–439, Nov. 2014.
[19] C. Chekuri, K. Quanrud, and R. M. Torres, "Densest subgraph: Supermodularity, iterative peeling, and flow," in Proc. Annu. ACM-SIAM Symp. Discrete Algorithms (SODA), Alexandria, VA, USA, Jan. 2022, pp. 1531–1555.
[20] A. Gionis and C. E. Tsourakakis, "Dense subgraph discovery: KDD 2015 tutorial," in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Sydney, NSW, Australia, Aug. 2015, pp. 2313–2314.
[21] M. Sozio and A. Gionis, "The community-search problem and how to plan a successful cocktail party," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Washington, DC, USA, Jul. 2010, pp. 939–948.
[22] C. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli, "Denser than the densest subgraph: Extracting optimal quasi-cliques with quality guarantees," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Chicago, IL, USA, Aug. 2013, pp. 104–112.
[23] Y. Tao, "Technical perspective of efficient directed densest subgraph discovery," ACM SIGMOD Rec., vol. 50, no. 1, p. 32, Jun. 2021.
[24] C. Ma, Y. Fang, R. Cheng, L. V. S. Lakshmanan, W. Zhang, and X. Lin, "On directed densest subgraph discovery," ACM Trans. Database Syst., vol. 46, no. 4, pp. 1–45, Dec. 2021.
[25] R. Andersen and K. Chellapilla, "Finding dense subgraphs with size bounds," in Proc. Int. Workshop Algorithms Models Web-Graph, Barcelona, Spain, Feb. 2009, pp. 25–37.
[26] S. Bhattacharya, M. Henzinger, D. Nanongkai, and E. C. Tsourakakis, "Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams," 2015, arXiv:1504.02268.
[27] S. Khuller and B. Saha, "On finding dense subgraphs," in Proc. Int. Colloq. Automata, Lang., Program., Rhodes, Greece, Jul. 2009, pp. 597–608.
[28] P. Bombina and B. Ames, "Convex optimization for the densest subgraph and densest submatrix problems," Social Netw. Oper. Res. Forum, vol. 1, no. 3, p. 23, Sep. 2020.
[29] C. Ma, Y. Fang, R. Cheng, L. V. S. Lakshmanan, and X. Han, "A convex-programming approach for efficient directed densest subgraph discovery," in Proc. Int. Conf. Manage. Data, Philadelphia, PA, USA, Jun. 2022, pp. 845–859.
[30] R. Dondi, M. M. Hosseinzadeh, G. Mauri, and I. Zoppis, "Top-k overlapping densest subgraphs: Approximation algorithms and computational complexity," J. Combinat. Optim., vol. 41, no. 1, pp. 80–104, Jan. 2021.
[31] J. Leskovec, K. J. Lang, and M. Mahoney, "Empirical comparison of algorithms for network community detection," in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, USA, Apr. 2010, pp. 631–640.
[32] J. Chen and Y. Saad, "Dense subgraph extraction with application to community detection," IEEE Trans. Knowl. Data Eng., vol. 24, no. 7, pp. 1216–1230, Jul. 2012.
[33] L. Qin, R.-H. Li, L. Chang, and C. Zhang, "Locally densest subgraph discovery," in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Sydney, NSW, Australia, Aug. 2015, pp. 965–974.
[34] A. Konar and D. N. Sidiropoulos, "The triangle-densest-K-subgraph problem: Hardness, Lovász extension, and application to document summarization," in Proc. AAAI Conf. Artif. Intell., Feb. 2022, pp. 4075–4082.
[35] A. Ghasabeh and M. S. Abadeh, "Community detection in social networks using a hybrid swarm intelligence approach," Int. J. Knowl.-Based Intell. Eng. Syst., vol. 19, no. 4, pp. 255–267, Jan. 2016.
[36] E. Valari, M. Kontaki, and N. A. Papadopoulos, "Discovery of top-k dense subgraphs in dynamic graph collections," in Proc. 24th Int. Conf. Sci. Stat. Database Manag., 2012, pp. 213–230.
[37] F. Zhou, S. Mahler, and H. Toivonen, Simplification of Networks by Edge Pruning. Berlin, Germany: Springer, 2012, pp. 179–198.
[38] R. Dondi, M. M. Hosseinzadeh, and P. H. Guzzi, "A novel algorithm for finding top-k weighted overlapping densest connected subgraphs in dual networks," Appl. Netw. Sci., vol. 6, no. 1, p. 40, Dec. 2021.
[39] F. Bonchi, D. García-Soriano, A. Miyauchi, and C. E. Tsourakakis, "Finding densest k-connected subgraphs," Discrete Appl. Math., vol. 305, pp. 34–47, Dec. 2021.
[40] A. Miyauchi and A. Takeda, "Robust densest subgraph discovery," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 1188–1193.
[41] S. Chawla, K. Garimella, A. Gionis, and D. Tsang, Discovering the Network Backbone from Traffic Activity Data. Cham, Switzerland: Springer, 2016, pp. 409–422.
[42] N. Ruan, R. Jin, and Y. Huang, "Distance preserving graph simplification," in Proc. IEEE 11th Int. Conf. Data Mining, Vancouver, BC, Canada, Dec. 2011, pp. 1200–1205.
[43] G. H. L. Fletcher and M. Pechenizkiy, "On structure preserving sampling and approximate partitioning of graphs," in Proc. 31st Annu. ACM Symp. Appl. Comput., New York, NY, USA, 2016, pp. 875–882.
[44] D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos, "VOG: Summarizing and understanding large graphs," in Proc. SIAM Int. Conf. Data Mining, Philadelphia, PA, USA, Apr. 2014, pp. 91–99.
[45] X. Liu, T. Ge, and Y. Wu, "A stochastic approach to finding densest temporal subgraphs in dynamic graphs," IEEE Trans. Knowl. Data Eng., vol. 34, no. 7, pp. 3082–3094, Jul. 2022.
[46] A. D. Sarma, A. Lall, D. Nanongkai, and A. Trehan, "Dense subgraphs on dynamic networks," in Proc. 26th Int. Conf. Distrib. Comput., Berlin, Germany, 2012, pp. 151–165.
[47] A. Angel, N. Sarkas, N. Koudas, and D. Srivastava, "Dense subgraph maintenance under streaming edge weight updates for real-time story identification," Proc. VLDB Endowment, vol. 5, no. 6, pp. 574–585, Feb. 2012.
[48] A. V. Goldberg, "Finding a maximum density subgraph," EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep., UCB/CSD-84-171, 1984.
[49] M. Charikar, "Greedy approximation algorithms for finding dense components in a graph," in Proc. Int. Workshop Approximation Algorithms Combinat. Optim. Berlin, Germany: Springer, 2000, pp. 84–95.
[50] D. Boob, Y. Gao, R. Peng, S. Sawlani, C. Tsourakakis, D. Wang, and J. Wang, "Flowless: Extracting densest subgraphs without flow computations," in Proc. Web Conf., Apr. 2020, pp. 573–583.
[51] J. Leskovec and A. Krevl. (Jun. 2014). SNAP Datasets: Stanford Large Network Dataset Collection. [Online]. Available: http://snap.stanford.edu/data
[52] M. Wang, C. Wang, J. X. Yu, and J. Zhang, "Community detection in social networks: An in-depth benchmarking study with a procedure-oriented framework," Proc. VLDB Endowment, vol. 8, no. 10, pp. 998–1009, Jun. 2015.
[53] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Keters, R. Rousseau, and A. Vanhoutte, "Similarity measures in scientometric research: The Jaccard index versus Salton’s cosine formula," Inf. Process. Manage., vol. 25, no. 3, pp. 315–318, Jan. 1989.
[54] F. A. McDaid, D. Greene, and J. N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," 2011, arXiv:1110.2515.
[55] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," J. Stat. Mech., Theory Exp., vol. 2005, no. 9, Sep. 2005, Art. no. P09008.
[56] S. Goswami, C. A. Murthy, and A. K. Das, "Sparsity measure of a network graph: Gini index," Inf. Sci., vol. 462, pp. 16–39, Sep. 2018.

TAO HAN was born in Beijing, China. She received the M.S. degree in computer science and the Ph.D. degree in computer architecture from Beihang University, China, in 2011 and 2018, respectively. She is currently with the National Center for Public Credit Information, China. Her research interests include credit data management, big data, and data mining.