Evolutionary Clustering With DBSCAN

†Beijing Institute of System Engineering, Beijing, China
*School of Economics and Management, Beihang University, Beijing, China
†{dragonzyc, bodeng}@gmail.com, *[email protected]
Abstract—Clustering algorithms have long been used in the field of data mining. With the accumulation of online data sets, studies on cluster evolution have been carried out to decrease noise and maintain the continuity of clustering results. A number of evolutionary clustering algorithms have been proposed, such as evolutionary K-means and evolutionary spectral clustering, but none of them addresses the density-based clustering problem.

In this paper, we present an evolutionary clustering algorithm with DBSCAN (density-based spatial clustering of applications with noise), built on the temporal smoothness penalty framework. We evaluate our framework both on a random Gaussian dataset and on a classical DBSCAN dataset. Compared with other similar evolutionary clustering algorithms, such as evolutionary K-means, our method can not only resist noise but also distinguish clusters with arbitrary shapes during the evolution process.

Index Terms—Evolutionary Clustering, Density-Based Clustering, DBSCAN.

I. INTRODUCTION

In recent years, evolutionary clustering, which clusters data that change over time, has attracted a lot of attention. As the data evolve, such methods produce clustering results at each time step. Compared with traditional clustering algorithms, evolutionary clustering can not only fit the current data well but also prohibit deviation from the historical data. Because of its ability to trace long-term temporal drift while remaining robust to short-term noise [1], more and more applications have given evolutionary clustering an important place in their work.

Many researchers have studied evolutionary variants of clustering algorithms, such as K-means [1][5], spectral clustering [2]-[4], and hierarchical clustering [1][6], which are described in the next section. Similar to evolutionary clustering, incremental clustering [7]-[9] and streaming clustering [10][11] also take historical data into consideration. However, few of these studies address density-based clustering algorithms. Since data and the noise mingled with it both evolve over time, we need an approach that can tell them apart while tracing their variation. Density-based evolutionary clustering is put forward to solve this problem: because density-based algorithms cluster according to data density, they can easily characterize groups of arbitrary shapes and find the noise in the dataset.

Among density-based clustering algorithms, DBSCAN (density-based spatial clustering of applications with noise) [14], CLIQUE (clustering in quest) [15], and DENCLUE (density clustering) [16] are three representatives, of which DBSCAN is regarded as the classical method used most frequently. Our research therefore focuses mainly on the evolution of DBSCAN to demonstrate evolutionary clustering for density-based algorithms.

In this paper, we propose an evolutionary clustering algorithm with DBSCAN to solve the density-based clustering problem under the framework of temporal smoothness penalty [2]. Our proposed framework is not sensitive to noise and can adapt to a diversity of density-based algorithms.

In short, compared with traditional clustering algorithms, our main contributions are three-fold:

1. Our evolutionary approach is robust to the noise merged in the datasets and helps resist short-term variation.

2. Our evolutionary approach can automatically recognize clusters with arbitrary shapes, exceeding the ability of traditional algorithms.

3. Our evolutionary approach is not sensitive to variation in the cluster number and can prohibit deviation from the historical data.

The rest of the paper is organized as follows. Section II discusses related work on evolutionary clustering. The framework and the detailed algorithm of our evolutionary DBSCAN are both described in Section III. Section IV examines the results of our proposed evolutionary framework through experiments. Section V presents the conclusion and introduces our future work.

II. RELATED WORKS

In this section, we review related work on evolutionary clustering. Recent research is mainly conducted on the basis of the temporal smoothness penalty [1], which is also employed in our framework.
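To fix notation for what follows, the frameworks above share a common general form: the clustering at time step $t$ is chosen to trade off a snapshot cost, measuring how well the current data are fit, against a temporal cost, measuring deviation from the previous clustering. A sketch of this form, following the formulation of Chi et al. [2] (the concrete cost functions depend on the underlying algorithm and are given in Section III for our EDBSCAN):

$$\mathrm{Cost}(\mathcal{C}_t) = \alpha \cdot CS(\mathcal{C}_t, X_t) + (1-\alpha)\cdot CT(\mathcal{C}_t, \mathcal{C}_{t-1}), \qquad 0 \le \alpha \le 1,$$

where $X_t$ is the data arriving at time step $t$, $\mathcal{C}_t$ is the clustering produced at $t$, $CS$ measures how well $\mathcal{C}_t$ fits the current data, and $CT$ penalizes deviation from the previous clustering $\mathcal{C}_{t-1}$. The weight $\alpha$ is the parameter varied in the experiments of Section IV.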
Figure 3. The EDBSCAN on Gaussian Datasets (G)

At the beginning, the two U-shaped clusters are separated, with a large amount of noise in the dataset. As time steps forward, one cluster moves toward the other until the two become intertwined. The evolution process is depicted in Figure 2. For evaluation, we run our experiments with EPS=1 and MINPTS=55 in the evolutionary DBSCAN.

The G dataset helps us validate the basic function of our proposed evolutionary approach, while the U dataset verifies that our approach is capable of distinguishing data with different shapes. With these two datasets, we aim to evaluate our framework comprehensively. We emphasize that each dataset is generated 5 times and the measures are evaluated statistically over all the synthetic datasets.
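As a concrete illustration of the per-snapshot baseline, a static DBSCAN run on each time-step snapshot can be reproduced with scikit-learn. This is only a minimal sketch: make_u_snapshot is a hypothetical stand-in for the synthetic U-dataset generator, and the temporal smoothness penalty of EDBSCAN (Section III) is not applied here.

```python
# Minimal sketch: static DBSCAN applied independently to each snapshot.
# make_u_snapshot is a hypothetical stand-in for the paper's U-dataset
# generator; EDBSCAN's temporal smoothing (Section III) is NOT included.
import numpy as np
from sklearn.cluster import DBSCAN

def make_u_snapshot(t, rng, n_points=500, n_noise=100):
    """Two U-shaped clusters that drift toward each other as t grows (assumed)."""
    theta = rng.uniform(np.pi, 2 * np.pi, n_points)  # lower half-circles
    left = np.c_[5 * np.cos(theta) - 6 + t, 5 * np.sin(theta)]
    right = np.c_[5 * np.cos(theta) + 6 - t, 5 * np.sin(theta) + 2]
    noise = rng.uniform(-15, 15, (n_noise, 2))       # uniform background noise
    return np.vstack([left, right, noise])

rng = np.random.default_rng(0)
for t in range(6):                                   # six time steps
    X_t = make_u_snapshot(t, rng)
    # EPS=1, MINPTS=55, as in the experiments of Section IV.
    labels = DBSCAN(eps=1.0, min_samples=55).fit_predict(X_t)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"t={t}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```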
A. Measures

Here, we use the evaluation index mentioned by Junjie Wu et al. [12]. It evaluates cluster quality by comparing the agreement between the estimated labels and the true labels in the contingency matrix. Let $n_{ij}$ be the number of objects assigned to estimated cluster $i$ whose true class is $j$, with row sums $n_{i\cdot}$, column sums $n_{\cdot j}$, and $n$ objects in total. The definition of the index can be found in the following equation,

$$\frac{\sum_{i,j}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right]-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}} \qquad (4)$$
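The equation above is the standard contingency-matrix form of the adjusted Rand index (ARI), one of the measures treated in [12]. Assuming that reading, the index can be computed directly, for example with scikit-learn:

```python
# Minimal sketch: adjusted Rand index between the true labels and the labels
# estimated by the evolutionary clustering at one time step. DBSCAN marks
# noise as -1; keeping noise as its own label is one possible convention
# (an assumption, not specified in the paper).
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1, 1, -1]       # -1 = ground-truth noise
estimated   = [0, 0, 1, 1, 1, 1, 1, -1]       # labels from one EDBSCAN run
print(adjusted_rand_score(true_labels, estimated))  # 1.0 = perfect agreement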
(a) α = 0.25  (b) α = 0.5  (c) α = 0.75
Figure 5. A Comparison of EDBSCAN and EK-means on Gaussian Datasets (G)
Secondly, we compare our approach with the EK-means proposed by Chakrabarti et al. [1]. Since the parameter alpha determines the weight between the snapshot cost and the temporal cost, we vary the alpha value to test the robustness of our evolutionary framework.

Figure 5 shows the outcomes when alpha is 0.25, 0.5, and 0.75, respectively. Since the evolutionary results are not sensitive to the alpha values, we select only these three representative values to demonstrate our experiments. From the clustering results in the figures above, it is easy to observe that EDBSCAN achieves more favorable accuracy than EK-means before time step 4, when the clusters begin to amalgamate. As the cluster number is set manually in EK-means, EK-means naturally outperforms after time step 4, when the clusters have merged together.

From the experiments on the G datasets, we find that our evolutionary approach outperforms its counterparts in fitting the current data while prohibiting deviation from the historical data, especially at the times when the clusters start to amalgamate.

When it comes to the U dataset, we compare our EDBSCAN with EK-means, as we did on the G dataset. From the evolutionary clustering results with diverse alpha values in Figure 6, EDBSCAN turns out to be far more effective than EK-means, mainly because EK-means is sensitive to noise and also fails to distinguish clusters with different shapes. Moreover, as shown in Figure 4, the evolutionary methods also perform much better than their static and incremental counterparts on the U dataset.

Based on the simulations above on the U dataset, we conclude that our evolutionary approach can easily distinguish data with arbitrary shapes and resist the noise in the dataset.

V. CONCLUSION

In this paper, we propose an evolutionary clustering algorithm with DBSCAN to solve the density-based evolutionary clustering problem under the framework of temporal smoothness penalty. Through the experiments, we have learned that the proposed evolutionary approach shows an advantage on the synthetic datasets. Compared with other similar evolutionary clustering algorithms, such as evolutionary K-means, our method can not only resist noise but also distinguish clusters with arbitrary shapes during the evolution.
However, it is still necessary to take the parameters of the density-based algorithms into consideration, such as EPS and MINPTS in DBSCAN, which are usually set by users prior to clustering. Parameter selection is an important and sensitive problem for density-based clustering, especially in contexts where both the number of nodes and the density evolve over time. For the former, we have listed some solutions above; for the latter, we need to re-estimate the parameters to adapt to the variation.

Meanwhile, the running time of density-based evolutionary algorithms could theoretically be improved by incorporating techniques from cloud computing, such as the Hadoop framework [13]. All these problems are planned to be worked out in our future research.
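For the parameter re-estimation direction mentioned above, one possible approach (a hypothetical sketch, not part of the algorithm in this paper) is the classical k-distance heuristic from the original DBSCAN work [14]: sort each point's distance to its MINPTS-th nearest neighbor and pick EPS near the "knee" of that curve, recomputed at each snapshot.

```python
# Minimal sketch: re-estimating EPS per snapshot via the k-distance heuristic
# of [14]. The knee is picked naively by a max-gap rule; a production version
# would use a proper knee detector.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(X, min_pts):
    # Distance from each point to its min_pts-th nearest neighbor
    # (counting the point itself), sorted ascending.
    nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
    dists, _ = nn.kneighbors(X)
    kdist = np.sort(dists[:, -1])
    # Naive knee: index where consecutive k-distances jump the most.
    knee = int(np.argmax(np.diff(kdist)))
    return float(kdist[knee])
```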
REFERENCES

[1] Deepayan Chakrabarti, Ravi Kumar, Andrew Tomkins. Evolutionary clustering [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2006:554-560.
[2] Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, Belle L. Tseng. On evolutionary spectral clustering [J]. ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 4, 2009.
[3] Yun Chi, Xiaodan Song, Dengyong Zhou, et al. Evolutionary spectral clustering by incorporating temporal smoothness [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2007:153-162.
[4] Ulrike von Luxburg. A tutorial on spectral clustering [J]. Statistics and Computing, 17(4), 2007.
[5] Kevin S. Xu, Mark Kliger, Alfred O. Hero III. Adaptive evolutionary clustering [J]. CoRR abs/1104.1990, 2011.
[6] Kevin S. Xu, Mark Kliger, Alfred O. Hero III. Evolutionary spectral clustering with adaptive forgetting factor [C]. ICASSP, 2010:2174-2177.
[7] Charikar M, Chekuri C, Feder T, et al. Incremental clustering and dynamic information retrieval [C]. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997:626-635.
[8] Gupta C, Grossman R, et al. A single pass generalized incremental algorithm for clustering [C]. SIAM International Conference on Data Mining, Florida, 2004:147-153.
[9] Gomes R, Welling M, Perona P. Incremental learning of nonparametric Bayesian mixture models [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2008:1-8.
[10] Barbará D. Requirements for clustering data streams [J]. ACM SIGKDD Explorations Newsletter, 2002, 3(2):23-27.
[11] Gaber M, Zaslavsky A, Krishnaswamy S. Mining data streams: a review [J]. ACM SIGMOD Record, 2005, 34(2):18-26.
[12] Junjie Wu, Hui Xiong, Jian Chen. Adapting the right measures for K-means clustering [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2009:877-886.
[13] Bi-Ru Dai, I-Chang Lin. Efficient Map/Reduce-based DBSCAN algorithm with optimized data partition [C]. IEEE CLOUD, 2012:59-66.
[14] Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise [C]. Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), 1996:226-231.
[15] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications [C]. Proc. of ACM SIGMOD Conference on Management of Data, 1998:94-105.
[16] Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 1998.