
2013 Ninth International Conference on Natural Computation (ICNC)

Evolutionary Clustering with DBSCAN


Yuchao Zhang†, Hongfu Liu*, Bo Deng†

† Beijing Institute of System Engineering, Beijing, China
* School of Economics and Management, Beihang University, Beijing, China

{dragonzyc, bodeng}@gmail.com, * [email protected]

Abstract—Clustering algorithms have long been used in data mining. With the accumulation of online data sets, studies on cluster evolution have been carried out to suppress noise and maintain the continuity of clustering results over time. A number of evolutionary clustering algorithms have been proposed, such as evolutionary K-means and evolutionary spectral clustering, but none of them addresses the density-based clustering problem.

In this paper, we present an evolutionary clustering algorithm based on DBSCAN (density-based spatial clustering of applications with noise), built on the temporal smoothness penalty framework. We evaluate our framework on both a random Gaussian dataset and the classical DBSCAN dataset. Compared with similar evolutionary clustering algorithms, such as evolutionary K-means, our method not only resists noise but also distinguishes clusters of arbitrary shapes during the evolution process.

Index Terms—Evolutionary Clustering, Density-Based Clustering, DBSCAN.

I. INTRODUCTION

In recent years, evolutionary clustering, which clusters data that change over time, has attracted considerable attention. As the data evolve, such methods produce a clustering result at each time step. Compared with traditional clustering algorithms, evolutionary clustering not only fits the current data well but also prevents large deviations from the historical data. Because it traces temporal drift in the long term while remaining robust to noise in the short term [1], more and more applications give evolutionary clustering an important place in their work.

Many researchers have studied the evolutionary setting for particular algorithms, such as K-means [1][5], spectral clustering [2]-[4] and hierarchical clustering [1][6], which are described in the next section. Similar to evolutionary clustering, incremental clustering [7]-[9] and streaming clustering [10][11] also take historical data into consideration. However, few of these works address density-based clustering algorithms. Since both the data and the noise mingled with them evolve over time, we need an approach that can tell them apart while tracing their variation. Density-based evolutionary clustering is put forward to solve this problem: because density-based algorithms cluster according to data density, they can easily characterize groups of arbitrary shapes and identify the noise in the dataset.

Among density-based clustering algorithms, DBSCAN (density-based spatial clustering of applications with noise) [14], CLIQUE (clustering in quest) [15] and DENCLUE (density clustering) [16] are three representatives, of which DBSCAN is regarded as the classical method used most frequently. Our research therefore focuses on the evolution of DBSCAN to demonstrate evolutionary clustering for density-based algorithms.

In this paper, we propose an evolutionary clustering algorithm with DBSCAN to solve the density-based clustering problem under the framework of temporal smoothness penalty [2]. The proposed framework is not sensitive to noise and can be adapted to a variety of density-based algorithms.

In short, compared with traditional clustering algorithms, our main contributions are three-fold:

1. Our evolutionary approach is robust to the noise merged in the datasets and helps to resist short-term variation.

2. Our evolutionary approach automatically recognizes clusters with arbitrary shapes, exceeding what traditional approaches can do.

3. Our evolutionary approach is not sensitive to variation in the number of clusters and prevents large deviations from the historical data.

The rest of the paper is organized as follows. Section II discusses related work on evolutionary clustering. Section III describes the framework and the detailed algorithm of our evolutionary DBSCAN. Section IV examines the results of the proposed evolutionary framework through experiments. Section V presents the conclusion and introduces our future work.

II. RELATED WORKS

In this section, we review related work on evolutionary clustering. Recent research is mainly conducted on the basis of the temporal smoothness penalty [1], which is also employed in our framework.



In 2006, Chakrabarti et al. [1] first presented evolutionary versions of K-means and hierarchical clustering, initiating research on evolutionary clustering. Both algorithms are derived by incorporating a temporal smoothness penalty, which is discussed in the following sections. The evolutionary K-means algorithm tracks the variation of centroids across time steps to minimize the overall temporal penalty cost, while the evolutionary hierarchical algorithm searches for the optimal cost with a heuristic algorithm over a binary tree.

After that, Yun Chi et al. [2]-[4] put forward evolutionary spectral clustering, a family of algorithms related to K-means. Their algorithms are derived from two kinds of temporal smoothness penalty, namely the PCQ (preserving cluster quality) and PCM (preserving cluster membership) frameworks. In PCQ, the temporal penalty is measured on different data in adjacent intervals using the same clustering, while in PCM it penalizes the difference between the clustering results at different times. In our research, we adopt the latter as our basic framework.

Different from the temporal smoothness frameworks above, Kevin S. Xu et al. [5][6] proposed a new approach that traces evolving clusters using a combination of a proximity matrix and a noise matrix. It outperformed other evolutionary algorithms based on K-means, hierarchical clustering and spectral clustering by using an adaptive forgetting factor that recursively controls the weight of historical data at each time step. However, no experiments show that it works similarly well for density-based clustering.

In addition, incremental clustering [7]-[9] and streaming clustering [10][11] are two concepts that are often confused with evolutionary clustering. Since memory capacity tends to be far from sufficient for large-scale applications, incremental and streaming clustering algorithms were developed to cluster the accumulated online data in a single pass. These algorithms do not emphasize the variation of data over time, which is the main difference from evolutionary algorithms.

From the previous work on evolutionary clustering, we note that no prior research has discussed the evolutionary problem for density-based clustering algorithms, which is demonstrated in detail in the following sections.
III. THE PROPOSED ALGORITHMS WITH DBSCAN

In this section, we first give an overview of the framework of our density-based evolutionary clustering. We then discuss the evolutionary approach with DBSCAN for density-based clustering algorithms. Our method is designed to have a strong resistance to noise and to be capable of finding clusters with arbitrary shapes.

A. Overview

The evolutionary framework of our research is primarily based on the temporal smoothness theory put forward by Chakrabarti et al. [1]. Its main idea is to add a temporal smoothness penalty as a constraint to the cost function of the static clustering algorithm.

Here, we assume that $\bar{C}_t$ and $C_t$ are respectively the evolutionary and static clustering results at time step t. Given a time step t, the objective of evolutionary clustering is to find the optimal $\bar{C}_t$ that minimizes the following cost function:

$$Cost = \alpha \cdot snap(\bar{C}_t, C_t) + (1 - \alpha) \cdot temp(\bar{C}_t, \bar{C}_{t-1}), \qquad (1)$$

where the snapshot function snap() denotes the current cost between the evolutionary and static clustering results, and the temporal function temp() denotes the historical cost between time steps t and t-1. In principle, any k consecutive steps before time t could be used to define the temporal cost in temp(), instead of only the previous step. Moreover, as mentioned before, the cost function above can be instantiated either in the PCQ framework or in the PCM framework [2]-[4].

The parameter $\alpha$ in Equation 1 weights the snapshot cost against the temporal cost and can be set according to user preference. In particular, the evolutionary algorithm degenerates to incremental clustering when $\alpha$ equals 0, and to static clustering when $\alpha$ equals 1. Beyond the linear combination above, a non-linear weighting could also be used to balance the cost between snap() and temp().

In our density-based evolutionary clustering with DBSCAN, we need to redefine the snapshot and temporal functions to capture the variation of the clusters.
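As an illustration of how Equation 1 combines the two costs, the short Python sketch below assembles the weighted objective. The callables snap_cost and temp_cost are hypothetical placeholders for whatever snapshot and temporal measures a concrete algorithm defines (for DBSCAN they are instantiated by the neighbor-vector costs introduced below); the sketch is ours, not code from the paper.

```python
# A minimal sketch of the temporal smoothness cost in Equation 1.
# snap_cost and temp_cost are hypothetical callables supplied by the
# concrete algorithm (e.g., the neighbor-vector costs defined later).

def evolutionary_cost(evolved, static, previous, alpha, snap_cost, temp_cost):
    """Return alpha * snap(evolved, static) + (1 - alpha) * temp(evolved, previous)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 1 reduces to static clustering, alpha = 0 to incremental clustering.
    return alpha * snap_cost(evolved, static) + (1.0 - alpha) * temp_cost(evolved, previous)
```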
B. Evolutionary DBSCAN

DBSCAN, short for density-based spatial clustering of applications with noise, has become one of the classical clustering algorithms. It not only resists noise in the dataset but also determines the number of clusters automatically.

Core nodes, boundary nodes and noise nodes are the three basic components of DBSCAN. A given node becomes a core node only if the number of its neighbors within a limited radius exceeds a threshold. Nodes that lie within that radius of a core node but do not meet the threshold themselves are boundary nodes, which are mainly determined by the core nodes. The remaining nodes are noise nodes, which usually form a minority of the dataset. Core nodes and their boundary nodes that are density-reachable within a limited distance are eventually assigned to the same cluster, whereas noise nodes belong to no cluster and are neglected.

From the analysis above, we can infer that the core nodes are the determining factor in DBSCAN clustering. Accordingly, we trace the variation of the core nodes to precisely characterize the evolution of the different clusters.

The core nodes are determined by two main parameters, EPS and MINPTS, which are usually set according to the data density in DBSCAN clustering. To fully trace the variation, we use a neighbor vector to represent the core nodes, defined as follows.

(Definition) Each element of the neighbor vector is the number of neighbors within EPS of the corresponding node; the node is a core node only if this value is larger than MINPTS.
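To make the neighbor-vector definition concrete, a minimal NumPy sketch of the per-node neighbor count and the resulting core-node test is given below; the brute-force distance computation and the function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def neighbor_vector(points, eps):
    """Number of neighbors within eps for every node (the node itself excluded)."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # Euclidean distance matrix
    within = dist <= eps
    np.fill_diagonal(within, False)                   # do not count the node itself
    return within.sum(axis=1)

def core_mask(points, eps, minpts):
    """Boolean mask of core nodes: neighbor count larger than MINPTS."""
    return neighbor_vector(points, eps) > minpts
```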

Figure 1. One Sample of the Evolutionary Gaussian Datasets (G). The two clusters are: (1) separated; (2) beginning to merge; (3) merged; (4) totally merged.

Figure 2. One Sample of the Evolutionary Classical DBSCAN Datasets (U). The two clusters are: (1) separated; (2) beginning to twine; (3) twined together.

We formally define the vector $V^t$ as the numbers of neighbors of the nodes at time t, and the vector $\bar{V}^t$ as the corresponding numbers of neighbors for the evolutionary clustering. They can be written as

$$V^t = (v_1^t, v_2^t, \dots, v_n^t), \qquad \bar{V}^t = (\bar{v}_1^t, \bar{v}_2^t, \dots, \bar{v}_n^t),$$

where $v_i^t$ and $\bar{v}_i^t$ represent the static and evolutionary numbers of neighbors of node i at time t. Here, we assume that the number of nodes does not vary with time.

Given the neighbor vectors, the temporal smoothness cost function can be quantified as

$$Cost = \alpha \cdot \sum_{i=1}^{n} (\bar{v}_i^t - v_i^t)^2 + (1 - \alpha) \cdot \sum_{i=1}^{n} (\bar{v}_i^t - \bar{v}_i^{t-1})^2. \qquad (2)$$

In this setting, our goal is to find the optimal solution of the above function for the evolutionary DBSCAN. All the evolutionary approaches discussed in this paper are derived under the PCM framework mentioned before.

Based on the above observation, the value of $\bar{v}_i^t$ is optimal exactly when the first-order derivative of the objective function equals zero, which gives

$$\bar{v}_i^t = \alpha \cdot v_i^t + (1 - \alpha) \cdot \bar{v}_i^{t-1}. \qquad (3)$$

Apart from this derivative-based solution, other functions, such as exponential smoothing, could also be used to work out the optimal solution.
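A minimal sketch of the closed-form update in Equation 3 follows, assuming the static neighbor counts at time t and the evolutionary counts at time t-1 are already available (for instance from the neighbor_vector sketch above): the evolutionary vector is simply their convex combination, after which core nodes are re-identified.

```python
import numpy as np

def smooth_neighbor_vector(static_counts, prev_evolved_counts, alpha):
    """Equation 3: element-wise convex combination of current and historical counts."""
    static_counts = np.asarray(static_counts, dtype=float)
    prev_evolved_counts = np.asarray(prev_evolved_counts, dtype=float)
    return alpha * static_counts + (1.0 - alpha) * prev_evolved_counts

def evolutionary_core_mask(static_counts, prev_evolved_counts, alpha, minpts):
    """Core nodes judged on the smoothed (evolutionary) neighbor vector."""
    evolved = smooth_neighbor_vector(static_counts, prev_evolved_counts, alpha)
    return evolved > minpts, evolved
```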
C. Further Discussion

The evolutionary clustering with DBSCAN above is presented under the ideal assumption that the number of nodes does not vary with time. Since real-world data are rarely that stable, the arrival and departure of cluster nodes should be taken into consideration. Moreover, as the nodes change, the configuration of the parameters, such as EPS and MINPTS, may also need to be re-tuned in density-based algorithms.

Our evolutionary framework can easily handle the removal and entrance of cluster nodes. We simply drop the neighbor vectors of removed nodes in subsequent intervals, which amounts to deleting the corresponding rows of $\bar{V}$, and initialize the neighbor vectors of newly arrived nodes to zero, which amounts to adding rows to $\bar{V}$. Since the evolutionary neighbor vector is a linear combination of the snapshot and temporal terms, removals and arrivals at any time step do not disturb the evolution process.

As for the parameter configuration, our aim is an adaptive scheme that selects the parameters as the clusters merge or separate. Heuristic algorithms are capable of searching for such solutions recursively, and this could be realized with their help in the future; it is beyond the scope of this paper.
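The row bookkeeping described above can be sketched as follows; keep_idx (indices of surviving nodes) and n_new (number of arriving nodes) are illustrative names, not notation from the paper.

```python
import numpy as np

def carry_over_counts(prev_evolved_counts, keep_idx, n_new):
    """Drop counts of removed nodes and append zero-initialized counts for new nodes.

    prev_evolved_counts : evolutionary neighbor counts from time t-1
    keep_idx            : indices of nodes that are still present at time t
    n_new               : number of nodes that enter the dataset at time t
    """
    kept = np.asarray(prev_evolved_counts, dtype=float)[keep_idx]  # delete rows of removed nodes
    fresh = np.zeros(n_new)                                        # new nodes start with no history
    return np.concatenate([kept, fresh])
```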

Figure 3. The EDBSCAN on Gaussian Datasets (G).

Figure 4. The EDBSCAN on Classical DBSCAN Datasets (U).

IV. EXPERIMENTS

In this section, we first introduce the synthetic datasets and the measures used in our experiments. The performance of our evolutionary clustering approach with DBSCAN is then evaluated on these synthetic datasets.

We verify our framework on a random Gaussian dataset and the classical DBSCAN dataset, abbreviated as 'G' and 'U' respectively in our context. Both datasets are synthesized automatically so as to evaluate the performance of our evolutionary approaches. Moreover, we randomly generate each dataset 5 times and observe the variation of our approaches statistically.

The Gaussian dataset (G) simulates a cluster moving toward another cluster by one step per time interval. The initial centroids of the two clusters are (3, 3) and (12, 12); after 6 intervals the two clusters are merged together. One representative sample of this process is shown in Figure 1. The G dataset consists of 100 ordinary nodes and 30 noise nodes. Different from the synthetic data in [5][6], we add the noise nodes to test the robustness of our algorithms. In the experiments, we run the evolutionary DBSCAN clustering 10 times with EPS=3 and MINPTS=25.

The second synthetic dataset (U) is derived from the classical DBSCAN dataset, which has been widely used for testing density-based algorithms. The U dataset consists of 1200 ordinary nodes and 200 noise nodes. At the very beginning, the two U-shaped clusters are separated, together with considerable noise in the dataset. As time steps on, one of them moves toward the other until they are twined with each other. The evolution process is shown in Figure 2. For evaluation, we run our experiments with EPS=1 and MINPTS=55 in the evolutionary DBSCAN.

The G dataset helps us to verify the basic function of our proposed evolutionary approach, while the U dataset verifies that it can distinguish data with different shapes. Together, the two datasets allow an overall evaluation of our framework. We emphasize that both datasets are generated 5 times and the measures are evaluated statistically on all the synthetic datasets.
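For concreteness, the sketch below generates one Gaussian (G) sequence in the spirit of the description above: two clusters starting at (3, 3) and (12, 12), one moving toward the other over 6 intervals, plus uniformly scattered noise nodes. The cluster spread, the noise box and the per-cluster sizes are illustrative assumptions; the paper only fixes the centroids, the total node counts and the number of intervals.

```python
import numpy as np

def make_gaussian_sequence(n_steps=6, n_per_cluster=50, n_noise=30,
                           spread=1.0, noise_box=(0.0, 15.0), seed=0):
    """Yield (points, labels) for each time step; label -1 marks noise nodes."""
    rng = np.random.default_rng(seed)
    c_fixed = np.array([3.0, 3.0])
    c_moving = np.array([12.0, 12.0])
    step = (c_fixed - c_moving) / n_steps          # the moving cluster approaches the fixed one
    for t in range(n_steps + 1):
        center_b = c_moving + t * step
        a = rng.normal(c_fixed, spread, size=(n_per_cluster, 2))
        b = rng.normal(center_b, spread, size=(n_per_cluster, 2))
        noise = rng.uniform(*noise_box, size=(n_noise, 2))
        points = np.vstack([a, b, noise])
        labels = np.array([0] * n_per_cluster + [1] * n_per_cluster + [-1] * n_noise)
        yield points, labels
```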
A. Measures

We use the evaluation index described by Junjie Wu et al. [12]. It evaluates cluster quality by measuring the agreement between the estimated labels and the true labels through a contingency matrix. The index is defined as

$$R = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right] - \left[\sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}. \qquad (4)$$

In Equation 4, $n_{ij}$ is an entry of the contingency matrix, and $n_{i\cdot}$, $n_{\cdot j}$ and $n$ denote its row sums, column sums and total count respectively. As time steps on, the value of the index ranges between 0 and 1, and a higher value indicates a better clustering result for all the competitors.
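Equation 4 has the pairwise-counting form of the adjusted Rand index, so one way to compute it is directly from the contingency matrix, as sketched below; reading the index as the adjusted Rand index is our interpretation of the reconstructed formula, not a claim from [12].

```python
import numpy as np

def pair_count(x):
    """Number of pairs, i.e. the binomial coefficient C(x, 2), element-wise."""
    x = np.asarray(x, dtype=float)
    return x * (x - 1) / 2.0

def contingency_index(true_labels, pred_labels):
    """Equation 4 computed from the contingency matrix of two label vectors."""
    true_ids, true_inv = np.unique(true_labels, return_inverse=True)
    pred_ids, pred_inv = np.unique(pred_labels, return_inverse=True)
    table = np.zeros((len(true_ids), len(pred_ids)))
    np.add.at(table, (true_inv, pred_inv), 1)               # contingency matrix n_ij
    sum_ij = pair_count(table).sum()
    sum_rows = pair_count(table.sum(axis=1)).sum()           # row sums n_i.
    sum_cols = pair_count(table.sum(axis=0)).sum()           # column sums n_.j
    expected = sum_rows * sum_cols / pair_count(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_ij - expected) / (max_index - expected)
```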
B. Results

The clustering results are analyzed in this part. For brevity, we write EDBSCAN and EK-means for the evolutionary versions of the traditional DBSCAN and K-means clustering algorithms in the following. Our measures are statistics over the five generated synthetic datasets.

To justify our evolutionary framework, we first compare our method with its static and incremental counterparts. As discussed in the previous sections, our evolutionary approach reduces to static clustering when the parameter alpha equals 1 and to an incremental algorithm when alpha equals 0. The clustering results are shown in Figure 3, where each alpha value corresponds to a different approach.

Figure 3 depicts the evolutionary clustering results of EDBSCAN on the G datasets, which show a better performance than the counterparts. At time steps 3 and 4, when the two clusters begin to merge with each other, the static clustering (alpha=1) drops dramatically, while the evolutionary clustering keeps a relatively stable course; a similar drop is observed for the incremental clustering (alpha=0). As time steps on, all the algorithms tend to coincide once the clusters are totally merged. In addition, the results of EDBSCAN on the U datasets in Figure 4 show that EDBSCAN exceeds the other two algorithms, especially the static clustering.

Figure 5. A Comparison of EDBSCAN and EK-means on Gaussian Datasets (G): (a) α = 0.25; (b) α = 0.5; (c) α = 0.75.

Secondly, we compare our approach with the EK-means proposed by Chakrabarti et al. [1]. Since the parameter alpha determines the weight between the snapshot and temporal costs, we vary the alpha value to test the robustness of our evolutionary framework.

Figure 5 shows the outcomes when alpha is 0.25, 0.5 and 0.75 respectively. Since the evolutionary results are not sensitive to the alpha values, we only select these three representative values for the experiments. From the clustering results in the figure, it is easy to observe that EDBSCAN achieves a more favorable accuracy than EK-means before time step 4, when the clusters begin to amalgamate. As the cluster number is set manually in EK-means, it naturally performs better after time step 4, when the clusters have been merged together.

From the experiments on the G datasets, we find that our evolutionary approach has an advantage over its counterparts in fitting the current data while preventing deviation from the historical data, especially at the times when the clusters start to amalgamate.

For the U dataset, we again compare our EDBSCAN with EK-means, as we did on the G dataset. From the evolutionary clustering results with various alpha values in Figure 6, EDBSCAN turns out to be far more effective than EK-means, mainly because EK-means is sensitive to noise and also fails to distinguish clusters with different shapes. Moreover, as already shown in Figure 4, the evolutionary methods also perform much better than their static and incremental counterparts on the U dataset.

Based on the simulations on the U dataset, we conclude that our evolutionary approach can easily distinguish data with arbitrary shapes and resist the noise in the dataset.
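The alpha sweep behind Figures 5 and 6 can be organized by a small driver loop such as the sketch below; make_sequence, run_edbscan and measure are placeholders for the components sketched earlier in this paper's spirit, not code from the authors.

```python
def alpha_sweep(make_sequence, run_edbscan, measure, eps, minpts, alphas=(0.25, 0.5, 0.75)):
    """Evaluate an evolutionary clustering over a dataset sequence for several alpha values.

    make_sequence : callable returning a fresh iterable of (points, true_labels)
    run_edbscan   : callable (points, prev_counts, alpha, eps, minpts) -> (labels, counts)
    measure       : callable (true_labels, pred_labels) -> float, e.g. the index above
    """
    scores = {a: [] for a in alphas}
    for a in alphas:
        prev_counts = None                          # no history at the first time step
        for points, true_labels in make_sequence():
            labels, prev_counts = run_edbscan(points, prev_counts, a, eps, minpts)
            scores[a].append(measure(true_labels, labels))
    return scores
```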
V. CONCLUSION

In this paper, we propose an evolutionary clustering algorithm with DBSCAN to solve the density-based evolutionary clustering problem under the framework of temporal smoothness penalty. The experiments show that the proposed evolutionary approach has an advantage on the synthetic datasets. Compared with similar evolutionary clustering algorithms, such as evolutionary K-means, our method not only resists noise but also distinguishes clusters with arbitrary shapes during the evolution.

Figure 6. A Comparison of EDBSCAN and EK-means on Classical DBSCAN Datasets (U): panels (a)-(c) correspond to different alpha values.

However, it is still necessary to take the parameters of the density-based algorithms into consideration, such as EPS and MINPTS in DBSCAN, which are usually set by users prior to clustering. Parameter selection is an important and sensitive problem for density-based clustering, especially in contexts where the number of nodes and the density both evolve with time. For the former, we have outlined some solutions above; for the latter, the parameters need to be re-estimated to adapt to the variation.

Meanwhile, the running time of density-based evolutionary algorithms could in principle be improved by incorporating techniques from cloud computing, such as the Hadoop framework [13]. All these problems are planned to be worked out in our future research.

REFERENCES

[1] Deepayan Chakrabarti, Ravi Kumar, Andrew Tomkins. Evolutionary clustering [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2006: 554-560.
[2] Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, Belle L. Tseng. On evolutionary spectral clustering [J]. ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 4, 2009.
[3] Yun Chi, Xiaodan Song, Dengyong Zhou, et al. Evolutionary spectral clustering by incorporating temporal smoothness [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2007: 153-162.
[4] Ulrike von Luxburg. A tutorial on spectral clustering [J]. Statistics and Computing, 17(4), 2007.
[5] Kevin S. Xu, Mark Kliger, Alfred O. Hero III. Adaptive evolutionary clustering [J]. CoRR abs/1104.1990, 2011.
[6] Kevin S. Xu, Mark Kliger, Alfred O. Hero III. Evolutionary spectral clustering with adaptive forgetting factor [C]. ICASSP, 2010: 2174-2177.
[7] Charikar M, Chekuri C, Feder T, et al. Incremental clustering and dynamic information retrieval [C]. Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, 1997: 626-635.
[8] Gupta C., Grossman R., et al. A single pass generalized incremental algorithm for clustering [C]. SIAM Conference on Data Mining, Florida, 2004: 147-153.
[9] Gomes R, Welling M, Perona P. Incremental learning of nonparametric Bayesian mixture models [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2008: 1-8.
[10] Barbará D. Requirements for clustering data streams [J]. ACM SIGKDD Explorations Newsletter, 2002, 3(2): 23-27.
[11] Gaber M., Zaslavsky A., Krishnaswamy S. Mining data streams: a review [J]. ACM SIGMOD Record, 2005, 34(2): 18-26.
[12] Junjie Wu, Hui Xiong, Jian Chen. Adapting the right measures for K-means clustering [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2009: 877-886.
[13] Bi-Ru Dai, I-Chang Lin. Efficient Map/Reduce-based DBSCAN algorithm with optimized data partition [C]. IEEE CLOUD, 2012: 59-66.
[14] Ester M, Kriegel H-P, Sander J. A density-based algorithm for discovering clusters in large spatial databases with noise [C]. Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996: 122-128.
[15] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications [C]. Proc. of ACM SIGMOD Conference on Management of Data, 1998: 94-105.
[16] A. Hinneburg, D. A. Keim. An efficient approach to clustering in large multimedia databases with noise [C]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 1998.

