Problems of Fuzzy C-Means Clustering and Similar Algorithms With High Dimensional Data Sets
Abstract Fuzzy c-means clustering and its derivatives are very successful on many
clustering problems. However, fuzzy c-means clustering and similar algorithms have
problems with high dimensional data sets and a large number of prototypes. In particular, we discuss hard c-means, noise clustering, fuzzy c-means with polynomial
fuzzifier function and its noise variant. A special test data set that is optimal for clustering is used to show weaknesses of said clustering algorithms in high dimensions.
We also show that a high number of prototypes influences the clustering procedure
in a similar way to a high number of dimensions. Finally, we show that the negative
effects of high dimensional data sets can be reduced by adjusting the parameter of
the algorithms, i.e. the fuzzifier, depending on the number of dimensions.
1 Introduction
Clustering high dimensional data has many interesting applications, for example
clustering similar music files, semantic web applications, image recognition or biochemical problems. Many tools today are not designed to handle hundreds of dimensions or, as they might better be called in this context, degrees of freedom. Many
clustering approaches work quite well in low dimensions, but the fuzzy
c-means algorithm (FCM) [4, 2, 8, 10] in particular seems to fail in high dimensions. This paper
is dedicated to giving some insight into this problem and into the behaviour of FCM as
well as its derivatives in high dimensions.
Roland Winkler
German Aerospace Center Braunschweig e-mail: [email protected]
Frank Klawonn
Ostfalia, University of Applied Sciences e-mail: [email protected]
Rudolf Kruse
Otto-von-Guericke University Magdeburg e-mail: [email protected]
The algorithms that are analysed and compared in this paper are hard c-means
(HCM), fuzzy c-means (FCM), noise FCM (NFCM), FCM with polynomial fuzzifier function (PFCM) and PFCM with a noise cluster (PNFCM), which extends
PFCM in the same way as NFCM extends FCM. All these algorithms are prototype based and are gradient descent algorithms. Prior to this paper,
an analysis of FCM in high dimensions was presented in [12], which provides a more
extensive view of the high dimension problem but solely analyses the behaviour
of FCM. Not included in this paper is the extension by Gustafson and Kessel [7],
because this algorithm is already unstable in low dimensions. Also not included is
the competitive agglomeration FCM (CAFCM) [6], because this algorithm is not a gradient
descent algorithm in the strict sense.
A very good analysis of the influence of high dimensions on nearest neighbour search is given in [1]. The nearest neighbour approach cannot be applied directly to clustering problems, but the basic problem is similar and can thus be used
as a starting point for the analysis of the effects of high dimensional data on FCM
as it is presented in this paper.
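The distance concentration effect analysed in [1] is easy to observe empirically: for random points, the relative difference between the farthest and the nearest neighbour distance shrinks as the dimension grows. A small illustration in Python (the sample sizes and dimensions below are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for n_dims in (2, 10, 50, 200):
    X = rng.random((1000, n_dims))   # data objects, uniform in the unit hypercube
    q = rng.random(n_dims)           # a query point
    d = np.linalg.norm(X - q, axis=1)
    # relative contrast (d_max - d_min) / d_min shrinks as the dimension grows
    print(n_dims, (d.max() - d.min()) / d.min())
```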
We approach the curse of dimensionality for the above mentioned clustering algorithms because they seem very similar but perform very differently. The main
motivation lies more in observing the effects of high dimensionality than in producing a solution to the problem. First, we give a short introduction to the algorithms
and present a way to test them in a high dimensional environment in
the next section. In Section 3, the effects of a high dimensional data set are presented. A way to use the parameters of the algorithms to cope with high dimensions
is discussed in Section 4. We close this paper with some final remarks in Section 5,
followed by a list of references.
All the algorithms considered here are based on minimising an objective function of the form

J = Σ_{i=1}^{c} Σ_{j=1}^{m} f(u_ij) d_ij²    (1)

where u_ij is the membership degree of the j-th data object to the i-th prototype, d_ij is the distance between them, and f is the fuzzifier function that distinguishes the algorithms.

The fuzzifier function for FCM [4, 2] is a power function, f_FCM(u) = u^ω with ω ∈ R and ω > 1. In Figure 1, the prototypes are represented as filled circles; their
tails show the paths the prototypes took from their initial to their final locations.
The devastating effect of a high dimensional data set on FCM is obvious: the prototypes run straight into the centre of gravity of the data set, independently of their
initial locations, and therefore find no clusters at all. NFCM [3] is one of the two
algorithms considered in this paper that are able to detect noise. Its fuzzifier function is identical to that of FCM: f_NFCM = f_FCM. Apart from the fact that all data
objects have their highest membership value for the noise cluster, the behaviour of the
algorithm does not change compared to FCM. PFCM [9] is a mixture of HCM and
FCM, as the definition of its fuzzifier function shows:

f_PFCM(u) = ((1 - β)/(1 + β)) u² + (2β/(1 + β)) u.

This fuzzifier function creates an area of crisp membership values around each prototype, while outside of these areas fuzzy membership values are
assigned. The parameter β controls the size of the crisp areas: a low value of β
means a small area of crisp membership values.
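For concreteness, the two fuzzifier functions can be written down as follows (a minimal Python sketch using the symbols ω and β from the formulas above; the default values are arbitrary illustrations):

```python
def f_fcm(u, omega=2.0):
    """Standard FCM fuzzifier: f(u) = u**omega with omega > 1."""
    return u ** omega

def f_pfcm(u, beta=0.5):
    """Polynomial fuzzifier of PFCM:
    f(u) = (1 - beta)/(1 + beta) * u**2 + 2*beta/(1 + beta) * u.
    For beta -> 0 it reduces to the quadratic FCM fuzzifier (omega = 2),
    for beta -> 1 it approaches the identity, i.e. crisp HCM-like behaviour;
    a larger beta means a larger crisp region around each prototype."""
    return (1.0 - beta) / (1.0 + beta) * u ** 2 + 2.0 * beta / (1.0 + beta) * u
```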
An algorithm that fails on D cannot be expected to succeed on other data sets, because there is no easier data set than D, especially if additional problems of high dimensional data occur, like overlapping clusters or very unbalanced cluster
sizes.
As the example in Figure 1-right has shown, the prototypes end up in the centre of gravity for FCM and NFCM. To understand why this behaviour occurs
(and why it does not occur for PFCM), the clustering algorithms are tested in a rather
artificial way. The prototypes are all initialised in the centre of gravity (COG) and
then moved towards the data objects, ignoring the update procedure
of the clustering algorithms. Let α ∈ [0, 1] control the location of the prototypes:
with x_i ∈ R^n the i-th data object and cog(D) ∈ R^n the centre of gravity of
data set D, define y_i : [0, 1] → R^n with y_i(α) = α x_i + (1 - α) cog(D) and finally
d_ij(α) = d(y_i(α), x_j). Since the membership values are functions of the distance
values, and the objective function is a function of membership values and distance
values, the objective function can be plotted as a function of α.
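The curves in Figure 3 can be reproduced with a few lines of code. The sketch below assumes that each prototype is paired with one representative data object (e.g. the centre of one cluster) and uses the standard FCM membership update; the fuzzifier value ω = 2 and all names are illustrative choices, not taken from the paper.

```python
import numpy as np

def fcm_objective_along_alpha(X, targets, omega=2.0, steps=101):
    """Evaluate the FCM objective J(alpha) = sum_ij f(u_ij) * d_ij**2 while the
    prototypes move on straight lines from the centre of gravity (alpha = 0)
    towards their target data objects (alpha = 1); the memberships are set to
    the usual optimal FCM values for the current prototype positions."""
    cog = X.mean(axis=0)                      # centre of gravity of the data set
    alphas = np.linspace(0.0, 1.0, steps)
    J = np.empty(steps)
    for k, a in enumerate(alphas):
        Y = a * targets + (1.0 - a) * cog     # y_i(alpha) = alpha*x_i + (1-alpha)*cog(D)
        # squared distances d_ij^2 between prototypes y_i and data objects x_j
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) + 1e-12
        # optimal FCM memberships: u_ij ~ d_ij^(-2/(omega-1)), normalised per data object
        w = d2 ** (-1.0 / (omega - 1.0))
        u = w / w.sum(axis=0, keepdims=True)
        J[k] = ((u ** omega) * d2).sum()
    return alphas, J / J[0]                   # normalised to 1 at alpha = 0
```

Plotting J against α for increasing dimensions reproduces the qualitative picture discussed below: a local maximum between the centre of gravity (α = 0) and the clusters (α = 1).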
Fig. 3 Objective function plots for FCM (left) and NFCM (right)
However, for FCM and NFCM, these factors have a strong influence on the objective function. In Figure 3, the objective functions of these two algorithms are
plotted for a variety of dimensions, as functions of α. For convenience, the objective
function values are normalised to 1 at α = 0. The plots show a strong local maximum between α = 0.5 and α = 0.9. Winkler et al. showed in [12] that the number of
dimensions affects the height of this local maximum. The
number of prototypes, however, influences its location: the higher
the number of prototypes, the further to the right the local maximum is observed.
Since these are gradient descent algorithms, the prototypes will run into the centre
of gravity if they are initialised to the left of the local maximum, which is exactly what
is shown in Figure 1-right. Since the volume of an n-dimensional hypersphere
is proportional to the n-th power of its radius, it is almost hopeless to initialise a prototype
close enough to a cluster for the prototype to converge to that cluster. For this example, in 50 dimensions and with 100 prototypes, the radius within which a prototype converges to a cluster
is 0.3 times the feature space radius, which means the corresponding hypervolume is only 0.3^50 ≈ 7.2 · 10^-27
times the volume of the feature space.
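Because the covered fraction of the feature space is just the relative radius raised to the power of the dimension, the quoted number can be checked in one line:

```python
print(0.3 ** 50)  # ~7.18e-27: fraction of the feature space volume covered in 50 dimensions
```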
Fig. 4 Objective function plots for PFCM (left) and PNFCM (right)
As presented in Figure 4-left, PFCM does not create such a strong local maximum as FCM, and the local maximum that can be observed lies very far to the left. That is
the reason why PFCM can be applied successfully to a high dimensional data set.
The situation is quite different for PNFCM, see Figure 4-right. The fixed noise distance is
chosen appropriately for the size of the clusters, but the distance of the prototypes to
the clusters is much larger. Therefore, all data objects have membership value 0 for
the prototypes, which explains the constant objective function value.
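The effect can be illustrated with the membership update of Davé's noise clustering [3], in which the noise cluster acts like an additional prototype at a fixed distance δ to every data object. The following sketch uses the standard (NFCM-style) update with an assumed fuzzifier ω = 2 and illustrative numbers; with the polynomial fuzzifier of PNFCM, the memberships of far-away prototypes do not merely become small but exactly 0, which is what makes the objective function constant in Figure 4-right.

```python
import numpy as np

def noise_fcm_memberships(d2, delta, omega=2.0):
    """Membership degrees in noise clustering (NFCM-style update): the noise
    cluster behaves like an extra prototype at fixed squared distance delta**2
    to every data object. d2 has shape (c, m): squared distances of the c real
    prototypes to the m data objects."""
    w = d2 ** (-1.0 / (omega - 1.0))
    w_noise = (delta ** 2) ** (-1.0 / (omega - 1.0))
    denom = w.sum(axis=0) + w_noise
    return w / denom, w_noise / denom   # memberships to prototypes, to noise cluster

# If the prototypes are far away compared to the noise distance (d >> delta),
# almost all membership goes to the noise cluster:
d2 = np.array([[100.0 ** 2]])           # one prototype, one data object, d = 100
u, u_noise = noise_fcm_memberships(d2, delta=1.0)
print(u, u_noise)                        # u ~ 1e-4, u_noise ~ 0.9999
```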
Fig. 5 Objective function plots for FCM (left), NFCM (middle) and PNFCM (right) with dimension dependent parameters
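As a purely hypothetical sketch of such a dimension dependent parameter, the fuzzifier ω could, for instance, be moved towards 1 (i.e. towards crisper memberships) as the number of dimensions n grows; the functional form 1 + a/n and the constant a below are illustrative assumptions, not the adjustment rule evaluated in Table 1.

```python
def dimension_adjusted_fuzzifier(n_dims, a=2.0):
    """Hypothetical dimension-dependent FCM fuzzifier: omega shrinks towards 1
    (crisper memberships) as the dimension grows. The constant a and the
    functional form are illustrative assumptions only."""
    return 1.0 + a / n_dims
```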
To test the effect of the dimension dependent parameters, we apply each algorithm 100 times to T50; the results are presented
in Table 1 as mean and sample standard deviation (in parentheses). The found-clusters column is the most important one; the other two only measure the performance in
recognising noise data objects. The test clearly shows the improvement obtained by adjusting
the parameters according to the number of dimensions.
Algorithm    Found clusters    Noise recognition (see text)
HCM          42.35 (4.65)
FCM              0 (0)
NFCM             0 (0)
PFCM         90.38 (2.38)
PNFCM            0 (0)
Adjusted Parameter
FCM AP       88.09 (3.58)      (0)        0 (0)
NFCM AP      88.5  (3.37)      (0.58)     1136.0 (344.82)
PNFCM AP     92.7  (2.67)      (3.12)     96.0 (115.69)
Table 1 Performance overview on T50 with 100 data objects for each cluster and 1000 noise data
objects; each algorithm is applied 100 times. The mean value and (sample standard
deviation) are displayed.
5 Conclusions
The two algorithms HCM and FCM do not work properly in high dimensions. It
is therefore very odd that a combination of them, in the form of PFCM, works quite
well. We have shown that the reason for this effect is a very small local minimum of
PFCM, compared to FCM, at the COG. We have also shown that FCM, NFCM and PNFCM
can be tuned in such a way that their objective functions show a behaviour similar to PFCM
in our test, in which case the clustering result on the test data set
T50 is also similar. The question remains why this local minimum occurs. A possible explanation
is presented in [1, 5], which identify the effect of distance concentration as the
most problematic one for obtaining meaningful nearest neighbour searches. Further work
will be devoted to investigating this connection.
References
1. Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is nearest neighbor meaningful? In Database Theory - ICDT'99, volume 1540 of Lecture Notes in Computer Science, pages 217-235. Springer Berlin / Heidelberg, 1999.
2. James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
3. Rajesh N. Davé. Characterization and detection of noise in clustering. Pattern Recognition Letters, 12(11):657-664, 1991.
4. J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Cybernetics and Systems: An International Journal, 3(3):32-57, 1973.
5. Robert J. Durrant and Ata Kabán. When is 'nearest neighbour' meaningful: A converse theorem and implications. Journal of Complexity, 25(4):385-397, 2009.
6. Hichem Frigui and Raghu Krishnapuram. A robust clustering algorithm based on competitive agglomeration and soft rejection of outliers. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), page 550, 1996.
7. Donald E. Gustafson and William C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the 17th IEEE Conference on Decision and Control, pages 761-766, 1978.
8. F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis. John Wiley & Sons, Chichester, England, 1999.
9. Frank Klawonn and Frank Höppner. What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier. In Advances in Intelligent Data Analysis V (IDA 2003), volume 2810 of Lecture Notes in Computer Science, pages 254-264. Springer Berlin / Heidelberg, 2003.
10. Rudolf Kruse, Christian Döring, and Marie-Jeanne Lesot. Advances in Fuzzy Clustering and its Applications, chapter Fundamentals of Fuzzy Clustering, pages 3-30. John Wiley & Sons, 2007. ISBN: 978-0-470-02760-8.
11. Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Pol. Sci., Cl. III, 4:801-804, 1957.
12. Roland Winkler, Frank Klawonn, and Rudolf Kruse. Fuzzy c-means in high dimensional spaces. International Journal of Fuzzy System Applications (to appear), 2011.