Geodesic Distance
Abstract
Clustering is the process of examining data sets to determine whether they can be usefully described
in terms of a comparatively small number of clusters, that is, groups of items that are in some
ways similar to one another but distinct from items in other clusters. Clustering is a widely used
method of statistical data analysis in many domains, whether the data are linear or circular. In this
paper, we assess the effectiveness of several clustering techniques (average linkage, McQuitty
linkage, median linkage, and centroid linkage) combined with several internal validity indices
(KL, CH, Hartigan, SD), using several models applied to simulated circular data with various
sample sizes in the R environment. Additionally, external indices (Rand, FM) are used to determine
the degree of agreement between the partitions obtained from the clustering algorithms and the
true clusters of observations.
Keywords: Cluster analysis, Linkage clustering techniques, Internal validity indices, External
validity indices
1. Introduction
Clustering is the method of using a similarity measure to find naturally occurring groups, or
clusters, in multidimensional data [8, 9]. This technique has become very common in statistics
and in many other fields, such as bioinformatics [1], data mining [6], and the analysis of text,
multimedia, and social networks [5]. Numerous algorithms are available that attempt to select an
optimal clustering according to their own criteria, but none of them captures the true clusters of
all data structures with excellent performance [14]. In other words, clustering is the process of
grouping a collection of data items into distinct groups, or clusters, so that the items within each
cluster have a high degree of similarity to one another while differing greatly from items in other
clusters [11]. Many methods can be used to perform clustering; one of the most important in
statistics is hierarchical clustering, which comprises two types of methods, the agglomerative
method and the divisive method [15]. The agglomerative hierarchical method applied to bivariate
circular data is the primary focus of this paper. This method begins by letting each object represent
a separate cluster and then generates the cluster structure by repeatedly merging clusters [18]. We
generate sets of circular data from circular distributions and identify the clustering algorithms that
recover the true groups well; in this way the proposed approach can be evaluated. The paper
studies the effectiveness of various clustering techniques (average linkage, McQuitty linkage,
median linkage, and centroid linkage) under a distance measure suited to circular data, using
several models and several validity indices (KL, CH, Hartigan, SD). Using the R programming
language, we analyze multiple simulated data sets with different sample sizes. We then use a set
of external indices to measure the rate of agreement between the true clusters and the partitions
obtained from the clustering methods.
In this paper, four linkage techniques for hierarchical clustering are considered: average linkage,
McQuitty linkage, median linkage, and centroid linkage. Each linkage is built on a distance
function used to obtain a distance matrix, and the way this matrix is handled differs from one
linkage to another. Since the data used here are circular, a distance function appropriate for
circular data is required.
The agglomerative hierarchical method commences with a matrix that represents the distances
between points; how this matrix is updated depends on the chosen linkage method. Circular data
differ from linear data, so linear distance functions such as the Euclidean distance cannot be
applied to such data. In this paper, the distance function that we use with the clustering methods
is the geodesic distance. Let φ1, φ2, ..., φn be univariate circular data such that φi ∈ [0, 2π) on the
unit circle [11, 2]; then the geodesic distance is calculated as

d(φi, φj) = π − |π − |φi − φj||,    (1)

where i, j = 1, 2, ..., n.
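As a minimal R sketch of this distance (the helper name geo_dist is ours for illustration, not from
the paper's code), note how two angles that are far apart linearly can be close on the circle:

```r
# Geodesic (arc-length) distance between two angles on the unit circle,
# d(phi_i, phi_j) = pi - |pi - |phi_i - phi_j||, as in Equation (1).
geo_dist <- function(phi_i, phi_j) {
  pi - abs(pi - abs(phi_i - phi_j))
}

geo_dist(0.1, 6.2)   # about 0.183: the angles are close on the circle
abs(0.1 - 6.2)       # 6.1: the linear difference is misleading
```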
There are primarily three classical linkages in hierarchical clustering procedures: nearest neighbor
(single linkage), farthest neighbor (complete linkage), and average linkage. When two clusters are
combined, the distance matrix has to be updated according to the selected linkage strategy. For
example, consider a dataset {x1, x2, x3, x4, x5} and suppose that the distance between x2 and x4,
d(x2, x4), is the smallest. Then x2 and x4 are merged at the first stage of agglomeration, and x6 is
used to denote the newly formed cluster. Under single linkage, the distance from x6 to x1 is

d61 = min{d21, d41}.

Similarly, the distances between the other data points and the new cluster are updated. If we use
average linkage or complete linkage, the distance between x6 and x1 is calculated respectively as [10]:

d61 = (d21 + d41)/2 or d61 = max{d21, d41}.
In centroid linkage clustering, the distance between two clusters is the distance between their
cluster means (centroids). In median linkage clustering, each merged cluster is represented by the
midpoint (median point) of the two clusters that formed it, and the distance between clusters is the
distance between these representative points.
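In R, which is used throughout this paper, these linkage rules can be selected through the method
argument of hclust() applied to a precomputed distance object. The following is a minimal sketch;
the small distance matrix is the upper-left 3 × 3 block of Table 2, used purely for illustration:

```r
# Hierarchical clustering with the four linkage methods compared in this paper,
# applied to a precomputed (geodesic) distance matrix.
D <- as.dist(matrix(c(0.000, 0.258, 0.726,
                      0.258, 0.000, 0.688,
                      0.726, 0.688, 0.000), nrow = 3))

hc_average  <- hclust(D, method = "average")
hc_mcquitty <- hclust(D, method = "mcquitty")
hc_median   <- hclust(D, method = "median")
hc_centroid <- hclust(D, method = "centroid")

# cutree() extracts a partition with a chosen number of clusters q from a tree.
cutree(hc_average, k = 2)
```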
Simulated Data: To illustrate the working characteristics of the proposed approach, a simulation
study is conducted. Let (φ1, ψ1), (φ2, ψ2), ..., (φn, ψn) be bivariate circular data generated from
the bivariate von Mises distribution BvM(μ, Σ) and presented in Table 1. Based on the simulated
data in Table 1, Table 2 presents the geodesic distance matrix used by the hierarchical clustering
to group the objects. Figure 1 displays the dendrogram corresponding to the generated data.
Table 1: Data generated from BvM(μ, Σ).
Index 𝜑 𝜓 Index 𝜑 𝜓
1 0.459 4.618 6 0.229 3.861
2 0.200 4.610 7 0.447 4.587
3 0.249 3.924 8 5.774 4.828
4 6.154 4.669 9 6.078 4.228
5 0.033 4.161 10 0.059 3.703
Table 2: The geodesic distance matrix for the data in Table 1.
1 2 3 4 5 6 7 8 9 10
1 0
2 0.258 0
3 0.726 0.688 0
4 0.590 0.335 0.836 0
5 0.625 0.480 0.321 0.534 0
6 0.792 0.750 0.066 0.884 0.358 0
7 0.034 0.248 0.692 0.582 0.594 0.758 0
8 0.990 0.742 1.180 0.412 0.860 1.217 0.986 0
9 0.770 0.557 0.546 0.448 0.247 0.568 0.744 0.673 0
10 0.998 0.918 0.291 0.984 0.458 0.231 0.965 1.260 0.587 0
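As an illustrative sketch (not the paper's own code), the entries of Table 2 can be reproduced in R
by combining the coordinate-wise geodesic distances of Equation (1) as
d_ij = sqrt(d(φi, φj)² + d(ψi, ψj)²), which matches the tabulated values to three decimals; running
hclust() on the result then draws a dendrogram like Figure 1 (average linkage is chosen here only
for illustration):

```r
# Reproduce the geodesic distance matrix of Table 2 from the angles in Table 1
# and draw the corresponding dendrogram.
circ_dist <- function(a, b) pi - abs(pi - abs(a - b))  # Equation (1)

phi <- c(0.459, 0.200, 0.249, 6.154, 0.033, 0.229, 0.447, 5.774, 6.078, 0.059)
psi <- c(4.618, 4.610, 3.924, 4.669, 4.161, 3.861, 4.587, 4.828, 4.228, 3.703)

n <- length(phi)
D <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    D[i, j] <- sqrt(circ_dist(phi[i], phi[j])^2 + circ_dist(psi[i], psi[j])^2)
  }
}
round(D, 3)        # lower triangle matches Table 2

hc <- hclust(as.dist(D), method = "average")
plot(hc)           # dendrogram of the simulated objects (cf. Figure 1)
```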
The value of q that maximizes the CH index determines the optimal number of clusters [17].
• Hartigan index
The Hartigan index [7] is computed in the following way:
Har(q) = ( tr(W_q) / tr(W_{q+1}) − 1 )(n − q − 1)    (4)
where q ∈ {1, ..., n − 2}. The value of q at which the index attains its maximum is regarded as the
correct number of clusters present in the dataset.
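As a purely illustrative sketch of Equation (4) (it uses the ordinary within-cluster sums of squares
for tr(W_q); the circular setting of this paper would require the corresponding circular dispersion,
and the helper names are ours):

```r
# Hartigan index for a partition into q clusters versus one into q + 1 clusters.
# X: numeric matrix of observations; labels_q, labels_q1: cluster label vectors.
hartigan_index <- function(X, labels_q, labels_q1) {
  wss <- function(X, labels) {                 # tr(W): total within-cluster SS
    sum(sapply(split(as.data.frame(X), labels), function(g) {
      sum(scale(as.matrix(g), scale = FALSE)^2)
    }))
  }
  n <- nrow(X)
  q <- length(unique(labels_q))
  (wss(X, labels_q) / wss(X, labels_q1) - 1) * (n - q - 1)   # Equation (4)
}

# Example with the simulated angles and an hclust tree hc as above:
# hartigan_index(cbind(phi, psi), cutree(hc, 2), cutree(hc, 3))
```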
• SD index
The SD index is based on the average scattering of the clusters and the total separation between
clusters [19]; the value of q that minimizes SD gives the ideal number of clusters. It is calculated
using Equation (5) [4], where α is a weighting factor.
SD index(𝑞) = 𝛼𝑆𝑐𝑎𝑡(𝑞) + 𝐷𝑖𝑠(𝑞), (5)
where
Scat(q) = ( (1/q) ∑_{k=1}^{q} ‖σ^(k)‖ ) / ‖σ‖    (6)
where σ^(k) denotes the variance vector of cluster k and σ the variance vector of the whole data set,
and

Dis(q) = (D_max / D_min) ∑_{k=1}^{q} ( ∑_{z=1}^{q} d(c_k, c_z) )^{-1},

where
* D_max = max{ d(c_k, c_z) : k, z ∈ {1, 2, ..., q} } represents the maximum distance between
cluster centers;
* D_min = min{ d(c_k, c_z) : k, z ∈ {1, 2, ..., q}, k ≠ z } represents the minimum distance between
cluster centers.
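As a hedged sketch of how these internal indices can be computed in practice, the NbClust package
[4] implements the KL, CH, Hartigan, and SD indices (index names "kl", "ch", "hartigan", and
"sdindex"). The call below assumes the angle matrix X = cbind(phi, psi) and the geodesic distance
matrix D built above; how closely each index respects the circular geometry depends on how it
uses the raw data internally:

```r
# Suggest a number of clusters q with one internal index (here Hartigan),
# using average linkage on the precomputed geodesic dissimilarities.
library(NbClust)

X <- cbind(phi, psi)
res <- NbClust(data = X, diss = as.dist(D), distance = NULL,
               min.nc = 2, max.nc = 5, method = "average",
               index = "hartigan")
res$Best.nc          # suggested q and the corresponding index value
res$Best.partition   # cluster labels for the suggested q
```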
2.2 External Indices
The external indices are used to compare the clustering results with a previously established
partition [20]. To make this comparison, we use two external indices: the Rand index and the
Fowlkes-Mallows (FM) index.
For two partitions C(1) and C(2), define the following [21]:
a: the number of pairs of elements that are in the same cluster in both C(1) and C(2);
b: the number of pairs of elements that are in different clusters in both C(1) and C(2);
c: the number of pairs of elements that are in different clusters in C(1) but in the same cluster
in C(2);
d: the number of pairs of elements that are in the same cluster in C(1) but in different clusters
in C(2).
We now discuss the two external indices, namely the Rand index and the FM index.
• Rand Index
The Rand index was proposed by [16]. It measures the degree of similarity between the two
partitions C(1) and C(2) and is defined as follows:
RI = (a + b) / (a + b + c + d)    (7)

The value of this index lies in the interval [0, 1], where values near 0 indicate that the partitions
C(1) and C(2) are extremely dissimilar and the value 1 indicates that they are extremely similar.
• FM Index
The Fowlkes-Mallows (FM) index [6] is another technique used to measure the similarity between
two partitions. This index is given by

FM = √( T_k² / (P_k Q_k) ),   FM ∈ [0, 1],    (8)

where

T_k = ∑_{i=1}^{u} ∑_{j=1}^{v} n_ij² − m,   P_k = ∑_{i=1}^{u} ( ∑_{j=1}^{v} n_ij )² − m,
Q_k = ∑_{j=1}^{v} ( ∑_{i=1}^{u} n_ij )² − m.
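For concreteness, the sketch below (our own helper, not taken from the paper) computes both
indices directly from the pair counts a, b, c, d defined above, given two label vectors of equal
length; the FM expression used here is the pair-count form that is equivalent to Equation (8) when
m is the total number of objects:

```r
# Rand and Fowlkes-Mallows (FM) indices for two partitions of the same objects.
external_indices <- function(part1, part2) {
  pairs <- combn(length(part1), 2)            # all pairs of object indices
  same1 <- part1[pairs[1, ]] == part1[pairs[2, ]]
  same2 <- part2[pairs[1, ]] == part2[pairs[2, ]]
  a  <- sum(same1 & same2)     # same cluster in both partitions
  b  <- sum(!same1 & !same2)   # different clusters in both partitions
  cc <- sum(!same1 & same2)    # separate in C(1), same cluster in C(2)
  dd <- sum(same1 & !same2)    # same cluster in C(1), separate in C(2)
  rand <- (a + b) / (a + b + cc + dd)          # Equation (7)
  fm   <- a / sqrt((a + dd) * (a + cc))        # pair-count form of Equation (8)
  c(Rand = rand, FM = fm)
}

# Example: true labels versus labels obtained from, e.g., cutree()
external_indices(c(1, 1, 2, 2, 2), c(1, 1, 2, 2, 1))
```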
Model 2
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.590   0.463   Median, KL            0.748   0.329
Average, CH          0.590   0.463   Median, CH            0.792   0.329
Average, Hartigan    0.596   0.462   Median, Hartigan      0.748   0.328
Average, SD          0.596   0.462   Median, SD            0.345   0.383
McQuitty, KL         0.749   0.527   Centroid, KL          0.589   0.484
McQuitty, CH         0.749   0.527   Centroid, CH          0.589   0.484
McQuitty, Hartigan   0.749   0.527   Centroid, Hartigan    0.609   0.463
McQuitty, SD         0.749   0.527   Centroid, SD          0.609   0.464
Model 3
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.651   0.277   Median, KL            0.758   0.187
Average, CH          0.783   0.188   Median, CH            0.767   0.181
Average, Hartigan    0.651   0.278   Median, Hartigan      0.689   0.210
Average, SD          0.791   0.183   Median, SD            0.758   0.187
McQuitty, KL         0.812   0.186   Centroid, KL          0.761   0.204
McQuitty, CH         0.812   0.186   Centroid, CH          0.676   0.269
McQuitty, Hartigan   0.682   0.240   Centroid, Hartigan    0.676   0.269
McQuitty, SD         0.812   0.186   Centroid, SD          0.696   0.254
Model 4
Linkage, index       Rand    FM      Linkage, index        Rand    FM
Average, KL          0.728   0.515   Median, KL            0.823   0.331
Average, CH          0.728   0.515   Median, CH            0.846   0.355
Average, Hartigan    0.728   0.515   Median, Hartigan      0.846   0.355
Average, SD          0.728   0.515   Median, SD            0.112   0.330
McQuitty, KL         0.852   0.415   Centroid, KL          0.731   0.511
McQuitty, CH         0.852   0.415   Centroid, CH          0.731   0.511
McQuitty, Hartigan   0.775   0.404   Centroid, Hartigan    0.731   0.511
McQuitty, SD         0.677   0.364   Centroid, SD          0.731   0.511
Figure 6: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 1.
Figure 7: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 2.
Figure 8: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 3.
Figure 9: The box plot displays the number of clusters that result from clustering algorithms together with
internal indices for Model 4.
5. Conclusion
The paper used several models with the recommended torus (geodesic) distance to assess the
effectiveness of clustering algorithms (average, McQuitty, median, centroid) with various validity
indices (KL, CH, Hartigan, SD) on diverse simulated datasets with varying sample sizes using the
R software. The external indices Rand and FM were used to measure the degree of agreement
between the true clusters of the data elements and the partitions produced by the clustering
techniques. From the results, we observed the following:
1. The results for Model 1 indicate that, according to the two external validity indices, the
partitions obtained by average linkage with the CH index agree quite well with the true
clusters (see Figure 2 and Table 3).
2. For Models 2 and 3, the results of the McQuitty clustering method under all internal
indices agree with the true clusters at a high rate (see Figure 3 and Table 3).
3. The results for Model 4 indicate a high degree of agreement with the true clusters for the
McQuitty clustering technique with the two indices KL and CH (see Figure 4 and
Table 3).
4. The box plots in Figures 6-9 support the same conclusions as points 1 to 3 above.
References
[1] Abushilah, S. F. H., 2019. Clustering methodology for bivariate circular data with application
to protein dihedral angles. Ph.D. thesis, University of Leeds.
[2] Ali, A. J., Abushilah, S. F., 2022. Distribution-free two-sample homogeneity test for circular
data based on geodesic distance. International Journal of Nonlinear Analysis and Applications
13 (1), 2703-2711. http://dx.doi.org/10.22075/ijnaa.2022.5992
[3] Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in
Statistics - Theory and Methods 3 (1), 1-27.
[4] Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A., 2014. NbClust: an R package for
determining the relevant number of clusters in a data set. Journal of Statistical Software 61, 1-36.
[5] Aggarwal, C. C., Reddy, C. K. (Eds.), 2013. Data Clustering: Algorithms and Applications.
Chapman & Hall/CRC.
[6] Fowlkes, E. B., Mallows, C. L., 1983. A method for comparing two hierarchical clusterings.
Journal of the American Statistical Association 78 (383), 553-569.
[6] Gupta, G. K., 2014. Introduction to Data Mining with Case Studies. PHI Learning Pvt. Ltd.
[7] Hartigan, J. A., 1975. Clustering Algorithms. John Wiley & Sons, New York.
[8] Jain, A. K., Duin, R. P. W., Mao, J., 2000. Statistical pattern recognition: a review. IEEE
Transactions on Pattern Analysis and Machine Intelligence 22 (1), 4-37.
[9] Jain, A. K., Murty, M. N., Flynn, P. J., 1999. Data clustering: a review. ACM Computing
Surveys 31 (3), 264-323.
[10] Ward, J. H., 1963. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association 58 (301), 236-244.
[11] Judd, D., McKinley, P., Jain, A. K., 1998. Large-scale parallel data clustering. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20 (8), 871-876.
[12] Mardia, K. V., Jupp, P. E., 2009. Directional Statistics, volume 494. John Wiley & Sons.
[13] Krzanowski, W. J., Lai, Y. T., 1988. A criterion for determining the number of groups in a
data set using sum-of-squares clustering. Biometrics 44 (1), 23-34.
[14] Landau, S., Leese, M., Stahl, D., Everitt, B. S., 2011. Cluster Analysis. John Wiley & Sons.
[15] Tan, P.-N., Steinbach, M., Kumar, V., 2005. Introduction to Data Mining. Addison-Wesley
Longman, USA.
[16] Rand, W. M., 1971. Objective criteria for the evaluation of clustering methods. Journal of
the American Statistical Association 66 (336), 846-850.
[16] Cornish, R., 2007. Statistics: Cluster Analysis. Mathematics Learning Support Centre.
[17] Saitta, S., Raphael, B., Smith, I. F. C., 2008. A comprehensive validity index for clustering.
Intelligent Data Analysis 12 (6), 529-548.
http://content.iospress.com/articles/intelligent-data-analysis/ida00346
[18] Maimon, O., Rokach, L., 2005. Data Mining and Knowledge Discovery Handbook. Springer.
[19] Halkidi, M., Vazirgiannis, M., Batistakis, Y., 2000. Quality scheme assessment in the
clustering process. In: Proceedings of PKDD, London, UK, pp. 265-276.
[20] Xu, R., Wunsch, D., 2009. Clustering. IEEE Press, Piscataway, NJ, USA.
[21] Zhao, Y., Karypis, G., 2002. Evaluation of hierarchical clustering algorithms for document
datasets. In: Proceedings of CIKM, pp. 515-524.