Geodesic Distance
Abstract
Clustering is the process of examining data sets to determine whether they can be usefully described
in terms of a comparatively small number of clusters, that is, groups of items that are in some
ways similar to one another but distinct from items in other clusters. Clustering is a widely used
method of statistical data analysis in many domains, whether the data are linear or circular. In this
paper, we assess the effectiveness of several clustering techniques (average linkage, McQuitty
linkage, median linkage, and centroid linkage) combined with several internal validity indices
(KL, CH, Hartigan, SD), using several models applied to simulated circular data with various
sample sizes in the R environment. Additionally, external indices (Rand, FM) are used to determine
the degree of agreement between the partitions obtained from the clustering algorithms and the
true clusters of observations.
Keywords: Cluster analysis, Linkage clustering techniques, Internal validity indices, External
validity indices
1. Introduction
Clustering is the method of using a similarity measure to find naturally occurring groups, or
clusters, in multidimensional data [8, 9]. This technique has become very common in statistics
and in many other fields, such as bioinformatics [1], data mining [6], and the analysis of text,
multimedia, and social networks [5]. Numerous algorithms are available that attempt to select an
optimal clustering according to their own criteria, but none of them captures the true clusters of
all data structures with excellent performance [14]. In other words, clustering is the process of
grouping a collection of data items into distinct groups, or clusters, so that the items within each
cluster have a high degree of similarity to one another while differing greatly from items in other
clusters [11]. Many methods can be used to perform clustering; one of the most important in
statistics is hierarchical clustering, which comprises two types of methods, the agglomerative
method and the divisive method [15]. The agglomerative hierarchical method applied to bivariate
circular data is the primary focus of this paper. This method begins by letting each object represent
a separate cluster and then generates the cluster structure by repeatedly merging clusters [18]. We
generate sets of circular data from circular distributions and identify the clustering algorithms that
recover the true groups well; in this way the proposed approach can be evaluated. The paper
studies the effectiveness of various clustering techniques (average linkage, McQuitty linkage,
median linkage, and centroid linkage) under a distance measure suited to circular data, using
several models and several validity indices (KL, CH, Hartigan, SD). Using the R programming
language, we analyze multiple simulated data sets with different sample sizes. We then use a set
of external indices to measure the rate of agreement between the true clusters and the partitions
obtained from the clustering methods.
In this paper, four linkage techniques for hierarchical clustering are considered: average linkage,
McQuitty linkage, median linkage, and centroid linkage. Each linkage is built on a distance
function used to obtain a distance matrix, and the way this matrix is handled differs from one
linkage to another. Since the data used here are circular, a distance function appropriate for
circular data is required.
The agglomerative hierarchical method commences with a matrix that represents the distances
between points; how this matrix is updated depends on the chosen linkage method. Circular data
differ from linear data, so linear distance functions such as the Euclidean distance cannot be
applied to such data. In this paper, the distance function that we use with the clustering methods
is the geodesic distance. Let φ1, φ2, ..., φn be univariate circular data such that φi ∈ [0, 2π) on the
unit circle [11, 2]; then the geodesic distance is calculated as

d(φi, φj) = π − |π − |φi − φj||,    (1)

where i, j = 1, 2, ..., n.
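As a minimal R sketch of this distance (the helper name geo_dist is ours for illustration, not from
the paper's code), note how two angles that are far apart linearly can be close on the circle:

```r
# Geodesic (arc-length) distance between two angles on the unit circle,
# d(phi_i, phi_j) = pi - |pi - |phi_i - phi_j||, as in Equation (1).
geo_dist <- function(phi_i, phi_j) {
  pi - abs(pi - abs(phi_i - phi_j))
}

geo_dist(0.1, 6.2)   # about 0.183: the angles are close on the circle
abs(0.1 - 6.2)       # 6.1: the linear difference is misleading
```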
There are primarily three classical linkages in hierarchical clustering procedures: nearest neighbor
(single linkage), farthest neighbor (complete linkage), and average linkage. When two clusters are
combined, the distance matrix has to be updated according to the selected linkage strategy. For
example, consider a dataset {x1, x2, x3, x4, x5} and suppose that the distance between x2 and x4,
d(x2, x4), is the smallest. Then x2 and x4 are merged at the first stage of agglomeration, and x6 is
used to denote the newly formed cluster. Under single linkage, the distance from x6 to x1 is

d61 = min{d21, d41}.

Similarly, the distances between the other data points and the new cluster are updated. If we use
average linkage or complete linkage, the distance between x6 and x1 is calculated respectively as [10]:

d61 = (d21 + d41)/2 or d61 = max{d21, d41}.
In centroid linkage clustering, the distance between two clusters is the distance between their
cluster means (centroids). In median linkage clustering, each merged cluster is represented by the
midpoint (median point) of the two clusters that formed it, and the distance between clusters is the
distance between these representative points.
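In R, which is used throughout this paper, these linkage rules can be selected through the method
argument of hclust() applied to a precomputed distance object. The following is a minimal sketch;
the small distance matrix is the upper-left 3 × 3 block of Table 2, used purely for illustration:

```r
# Hierarchical clustering with the four linkage methods compared in this paper,
# applied to a precomputed (geodesic) distance matrix.
D <- as.dist(matrix(c(0.000, 0.258, 0.726,
                      0.258, 0.000, 0.688,
                      0.726, 0.688, 0.000), nrow = 3))

hc_average  <- hclust(D, method = "average")
hc_mcquitty <- hclust(D, method = "mcquitty")
hc_median   <- hclust(D, method = "median")
hc_centroid <- hclust(D, method = "centroid")

# cutree() extracts a partition with a chosen number of clusters q from a tree.
cutree(hc_average, k = 2)
```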
Simulated Data: To illustrate the working characteristics of the proposed approach, a simulation
study is conducted. Let (φ1, ψ1), (φ2, ψ2), ..., (φn, ψn) be bivariate circular data generated from
the bivariate von Mises distribution BvM(μ, Σ) and presented in Table 1. Based on the simulated
data in Table 1, Table 2 presents the geodesic distance matrix used by the hierarchical clustering
to group the objects. Figure 1 displays the dendrogram corresponding to the generated data.
Table 1: Data generated from BvM(μ, Σ).
Index 𝜑 𝜓 Index 𝜑 𝜓
1 0.459 4.618 6 0.229 3.861
2 0.200 4.610 7 0.447 4.587
3 0.249 3.924 8 5.774 4.828
4 6.154 4.669 9 6.078 4.228
5 0.033 4.161 10 0.059 3.703
Table 2: The geodesic distance matrix for the data in Table 1.
1 2 3 4 5 6 7 8 9 10
1 0
2 0.258 0
3 0.726 0.688 0
4 0.590 0.335 0.836 0
5 0.625 0.480 0.321 0.534 0
6 0.792 0.750 0.066 0.884 0.358 0
7 0.034 0.248 0.692 0.582 0.594 0.758 0
8 0.990 0.742 1.180 0.412 0.860 1.217 0.986 0
9 0.770 0.557 0.546 0.448 0.247 0.568 0.744 0.673 0
10 0.998 0.918 0.291 0.984 0.458 0.231 0.965 1.260 0.587 0
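As an illustrative sketch (not the paper's own code), the entries of Table 2 can be reproduced in R
by combining the coordinate-wise geodesic distances of Equation (1) as
d_ij = sqrt(d(φi, φj)² + d(ψi, ψj)²), which matches the tabulated values to three decimals; running
hclust() on the result then draws a dendrogram like Figure 1 (average linkage is chosen here only
for illustration):

```r
# Reproduce the geodesic distance matrix of Table 2 from the angles in Table 1
# and draw the corresponding dendrogram.
circ_dist <- function(a, b) pi - abs(pi - abs(a - b))  # Equation (1)

phi <- c(0.459, 0.200, 0.249, 6.154, 0.033, 0.229, 0.447, 5.774, 6.078, 0.059)
psi <- c(4.618, 4.610, 3.924, 4.669, 4.161, 3.861, 4.587, 4.828, 4.228, 3.703)

n <- length(phi)
D <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    D[i, j] <- sqrt(circ_dist(phi[i], phi[j])^2 + circ_dist(psi[i], psi[j])^2)
  }
}
round(D, 3)        # lower triangle matches Table 2

hc <- hclust(as.dist(D), method = "average")
plot(hc)           # dendrogram of the simulated objects (cf. Figure 1)
```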
The value of q that maximizes the CH index determines the optimal number of clusters [17].
• Hartigan index
The Hartigan index [7] is computed in the following way:
Har(q) = ( tr(W_q) / tr(W_{q+1}) − 1 )(n − q − 1)    (4)
where q ∈ {1, ..., n − 2}. The value of q at which the index attains its maximum is regarded as the
correct number of clusters present in the dataset.
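As a purely illustrative sketch of Equation (4) (it uses the ordinary within-cluster sums of squares
for tr(W_q); the circular setting of this paper would require the corresponding circular dispersion,
and the helper names are ours):

```r
# Hartigan index for a partition into q clusters versus one into q + 1 clusters.
# X: numeric matrix of observations; labels_q, labels_q1: cluster label vectors.
hartigan_index <- function(X, labels_q, labels_q1) {
  wss <- function(X, labels) {                 # tr(W): total within-cluster SS
    sum(sapply(split(as.data.frame(X), labels), function(g) {
      sum(scale(as.matrix(g), scale = FALSE)^2)
    }))
  }
  n <- nrow(X)
  q <- length(unique(labels_q))
  (wss(X, labels_q) / wss(X, labels_q1) - 1) * (n - q - 1)   # Equation (4)
}

# Example with the simulated angles and an hclust tree hc as above:
# hartigan_index(cbind(phi, psi), cutree(hc, 2), cutree(hc, 3))
```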
• SD index
The SD index is based on the average scattering of the clusters and the total separation between
clusters [19]; the value of q that minimizes SD gives the ideal number of clusters. It is calculated
using Equation (5) [4], where α is a weighting factor.
SD index(𝑞) = 𝛼𝑆𝑐𝑎𝑡(𝑞) + 𝐷𝑖𝑠(𝑞), (5)
where
Scat(q) = ( (1/q) ∑_{k=1}^{q} ‖σ^(k)‖ ) / ‖σ‖    (6)
where σ^(k) denotes the variance vector of cluster k and σ the variance vector of the whole data set,
and

Dis(q) = (D_max / D_min) ∑_{k=1}^{q} ( ∑_{z=1}^{q} d(c_k, c_z) )^{-1},

where
* D_max = max{ d(c_k, c_z) : k, z ∈ {1, 2, ..., q} } represents the maximum distance between
cluster centers;
* D_min = min{ d(c_k, c_z) : k, z ∈ {1, 2, ..., q}, k ≠ z } represents the minimum distance between
cluster centers.
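As a hedged sketch of how these internal indices can be computed in practice, the NbClust package
[4] implements the KL, CH, Hartigan, and SD indices (index names "kl", "ch", "hartigan", and
"sdindex"). The call below assumes the angle matrix X = cbind(phi, psi) and the geodesic distance
matrix D built above; how closely each index respects the circular geometry depends on how it
uses the raw data internally:

```r
# Suggest a number of clusters q with one internal index (here Hartigan),
# using average linkage on the precomputed geodesic dissimilarities.
library(NbClust)

X <- cbind(phi, psi)
res <- NbClust(data = X, diss = as.dist(D), distance = NULL,
               min.nc = 2, max.nc = 5, method = "average",
               index = "hartigan")
res$Best.nc          # suggested q and the corresponding index value
res$Best.partition   # cluster labels for the suggested q
```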
2.2 External Indices
The external indices are used to compare the clustering results with a previously established
partition [20]. To make this comparison, we use two external indices: the Rand index and the
Fowlkes-Mallows (FM) index.
For two partitions C(1) and C(2), define the following [21]:
a: the number of pairs of elements that are in the same cluster in both C(1) and C(2);
b: the number of pairs of elements that are in different clusters in both C(1) and C(2);
c: the number of pairs of elements that are in different clusters in C(1) but in the same cluster
in C(2);
d: the number of pairs of elements that are in the same cluster in C(1) but in different clusters
in C(2).
We now discuss the two external indices, namely the Rand index and the FM index.
• Rand Index
The Rand index was proposed by [16]. It measures the degree of similarity between the two
partitions C(1) and C(2) and is defined as follows:
RI = (a + b) / (a + b + c + d)    (7)

The value of this index lies in the interval [0, 1], where values near 0 indicate that the partitions
C(1) and C(2) are extremely dissimilar and the value 1 indicates that they are extremely similar.
• FM Index
The Fowlkes-Mallows (FM) index [6] is another technique used to measure the similarity between
two partitions. This index is given by

FM = √( T_k² / (P_k Q_k) ),   FM ∈ [0, 1],    (8)

where

T_k = ∑_{i=1}^{u} ∑_{j=1}^{v} n_ij² − m,   P_k = ∑_{i=1}^{u} ( ∑_{j=1}^{v} n_ij )² − m,
Q_k = ∑_{j=1}^{v} ( ∑_{i=1}^{u} n_ij )² − m.
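For concreteness, the sketch below (our own helper, not taken from the paper) computes both
indices directly from the pair counts a, b, c, d defined above, given two label vectors of equal
length; the FM expression used here is the pair-count form that is equivalent to Equation (8) when
m is the total number of objects:

```r
# Rand and Fowlkes-Mallows (FM) indices for two partitions of the same objects.
external_indices <- function(part1, part2) {
  pairs <- combn(length(part1), 2)            # all pairs of object indices
  same1 <- part1[pairs[1, ]] == part1[pairs[2, ]]
  same2 <- part2[pairs[1, ]] == part2[pairs[2, ]]
  a  <- sum(same1 & same2)     # same cluster in both partitions
  b  <- sum(!same1 & !same2)   # different clusters in both partitions
  cc <- sum(!same1 & same2)    # separate in C(1), same cluster in C(2)
  dd <- sum(same1 & !same2)    # same cluster in C(1), separate in C(2)
  rand <- (a + b) / (a + b + cc + dd)          # Equation (7)
  fm   <- a / sqrt((a + dd) * (a + cc))        # pair-count form of Equation (8)
  c(Rand = rand, FM = fm)
}

# Example: true labels versus labels obtained from, e.g., cutree()
external_indices(c(1, 1, 2, 2, 2), c(1, 1, 2, 2, 1))
```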
Model 2
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.590   0.463   Median, KL            0.748   0.329
Average, CH          0.590   0.463   Median, CH            0.792   0.329
Average, Hartigan    0.596   0.462   Median, Hartigan      0.748   0.328
Average, SD          0.596   0.462   Median, SD            0.345   0.383
McQuitty, KL         0.749   0.527   Centroid, KL          0.589   0.484
McQuitty, CH         0.749   0.527   Centroid, CH          0.589   0.484
McQuitty, Hartigan   0.749   0.527   Centroid, Hartigan    0.609   0.463
McQuitty, SD         0.749   0.527   Centroid, SD          0.609   0.464
Model 3
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.651   0.277   Median, KL            0.758   0.187
Average, CH          0.783   0.188   Median, CH            0.767   0.181
Average, Hartigan    0.651   0.278   Median, Hartigan      0.689   0.210
Average, SD          0.791   0.183   Median, SD            0.758   0.187
McQuitty, KL         0.812   0.186   Centroid, KL          0.761   0.204
McQuitty, CH         0.812   0.186   Centroid, CH          0.676   0.269
McQuitty, Hartigan   0.682   0.240   Centroid, Hartigan    0.676   0.269
McQuitty, SD         0.812   0.186   Centroid, SD          0.696   0.254
Model 4
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.728   0.515   Median, KL            0.823   0.331
Average, CH          0.728   0.515   Median, CH            0.846   0.355
Average, Hartigan    0.728   0.515   Median, Hartigan      0.846   0.355
Average, SD          0.728   0.515   Median, SD            0.112   0.330
McQuitty, KL         0.852   0.415   Centroid, KL          0.731   0.511
McQuitty, CH         0.852   0.415   Centroid, CH          0.731   0.511
McQuitty, Hartigan   0.775   0.404   Centroid, Hartigan    0.731   0.511
McQuitty, SD         0.677   0.364   Centroid, SD          0.731   0.511
Figure 6: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 1.
Figure 7: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 2.
Figure 8: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 3.
Figure 9: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 4.
5. Conclusion
The paper used several models with the recommended torus (geodesic) distance to assess the
effectiveness of clustering algorithms (average, McQuitty, median, centroid) with various validity
indices (KL, CH, Hartigan, SD) on diverse simulated datasets with varying sample sizes using the
R software. The external indices Rand and FM were used to measure the degree of agreement
between the true clusters of the data elements and the partitions produced by the clustering
techniques. From the results, we observed the following:
1. The results for Model 1 indicate that, according to the two external validity indices, the
partitions obtained by average linkage with the CH index agree quite well with the true
clusters (see Figure 2 and Table 3).
2. For Models 2 and 3, the results of the McQuitty clustering method under all internal
indices agree with the true clusters at a high rate (see Figure 3 and Table 3).
3. The results for Model 4 indicate a high degree of agreement with the true clusters for the
McQuitty clustering technique with the two indices KL and CH (see Figure 4 and
Table 3).
4. The box plots in Figures 6-9 support the same conclusions as points 1 to 3 above.
References
[1] Abushilah, S. F. H., 2019. Clustering methodology for bivariate circular data with application
to protein dihedral angles. Ph.D. thesis, University of Leeds.
[2] Ali, A. J., Abushilah, S. F., 2022. Distribution-free two-sample homogeneity test for circular
data based on geodesic distance. International Journal of Nonlinear Analysis and Applications
13 (1), 2703-2711. http://dx.doi.org/10.22075/ijnaa.2022.5992
[3] Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in
Statistics - Theory and Methods 3 (1), 1-27.
[4] Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A., 2014. NbClust: an R package for
determining the relevant number of clusters in a data set. Journal of Statistical Software 61, 1-36.
[5] Aggarwal, C. C., Reddy, C. K. (Eds.), 2013. Data Clustering: Algorithms and Applications.
Chapman & Hall/CRC.
[6] Fowlkes, E. B., Mallows, C. L., 1983. A method for comparing two hierarchical clusterings.
Journal of the American Statistical Association 78 (383), 553-569.
[6] Gupta, G. K., 2014. Introduction to Data Mining with Case Studies. PHI Learning Pvt. Ltd.
[7] Hartigan, J. A., 1975. Clustering Algorithms. John Wiley & Sons, New York.
[8] Jain, A. K., Duin, R. P. W., Mao, J., 2000. Statistical pattern recognition: a review. IEEE
Transactions on Pattern Analysis and Machine Intelligence 22 (1), 4-37.
[9] Jain, A. K., Murty, M. N., Flynn, P. J., 1999. Data clustering: a review. ACM Computing
Surveys 31 (3), 264-323.
[10] Ward, J. H., 1963. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association 58 (301), 236-244.
[11] Judd, D., McKinley, P., Jain, A. K., 1998. Large-scale parallel data clustering. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20 (8), 871-876.
[12] Mardia, K. V., Jupp, P. E., 2009. Directional Statistics, volume 494. John Wiley & Sons.
[13] Krzanowski, W. J., Lai, Y. T., 1988. A criterion for determining the number of groups in a
data set using sum-of-squares clustering. Biometrics 44 (1), 23-34.
[14] Landau, S., Leese, M., Stahl, D., Everitt, B. S., 2011. Cluster Analysis. John Wiley & Sons.
[15] Tan, P.-N., Steinbach, M., Kumar, V., 2005. Introduction to Data Mining. Addison-Wesley
Longman, USA.
[16] Rand, W. M., 1971. Objective criteria for the evaluation of clustering methods. Journal of
the American Statistical Association 66 (336), 846-850.
[16] Cornish, R., 2007. Statistics: Cluster Analysis. Mathematics Learning Support Centre.
[17] Saitta, S., Raphael, B., Smith, I. F. C., 2008. A comprehensive validity index for clustering.
Intelligent Data Analysis 12 (6), 529-548.
http://content.iospress.com/articles/intelligent-data-analysis/ida00346
[18] Maimon, O., Rokach, L., 2005. Data Mining and Knowledge Discovery Handbook. Springer.
[19] Halkidi, M., Vazirgiannis, M., Batistakis, Y., 2000. Quality scheme assessment in the
clustering process. In: Proceedings of PKDD, London, UK, pp. 265-276.
[20] Xu, R., Wunsch, D., 2009. Clustering. IEEE Press, Piscataway, NJ, USA.
[21] Zhao, Y., Karypis, G., 2002. Evaluation of hierarchical clustering algorithms for document
datasets. In: Proceedings of CIKM, pp. 515-524.