Article history:
Received 1 April 2021
Revised 9 June 2021
Accepted 2 July 2021
Available online 13 July 2021

Keywords:
Data clustering
Modified k-means
Outliers removal
Adapted Tukey's rule
New distance metric

Abstract
K-means is one of the ten most popular clustering algorithms. However, k-means performs poorly in the presence of outliers in real datasets. Besides, different distance metrics produce variations in data clustering accuracy. Improving the clustering accuracy of k-means remains an active topic among researchers of the data clustering community, from the perspectives of outliers removal and distance metrics. Herein, a novel modification of the k-means algorithm is proposed based on Tukey's rule in conjunction with a new distance metric. The standard Tukey rule is modified to remove the outliers adaptively by considering whether the data is distributed to the left, to the right, or evenly around the input data's mean value. The elimination of outliers is applied in the proposed modification of the k-means before calculating the centroids, to minimize the outliers' influences. Meanwhile, a new distance metric is proposed to assign each data point to the nearest cluster. In this research, the modified k-means significantly improves the clustering accuracy and the centroids' convergence. Moreover, the proposed distance metric's overall performance outperforms most of the literature distance metrics. The work presented in this manuscript demonstrates the significance of the proposed technique in improving the overall clustering accuracy up to 80.57% on nine standard multivariate datasets.
https://doi.org/10.1016/j.jksuci.2021.07.003
© 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
optimization techniques in terms of clustering accuracy positively overcomes numerous conventional clustering techniques. However, these algorithms still suffer from being trapped in local extrema (Zhang et al., 2011) and from generating redundant solutions (Benmessahel and Touahria, 2010). Moreover, many unsupervised clustering techniques have been proposed based on blind source separation, in which a set of observations is modelled as linear combinations to be statistically separated into independent sources or components, such as independent component analysis (ICA) (Safont et al., 2017), principal component analysis (PCA) (Kaya et al., 2017) and non-negative matrix factorization (NNMF) (Laxmi Lydia et al., 2020). Despite the encouraging performance of source-separation-based data clustering techniques on linear data, these algorithms perform poorly because the presence of outliers degrades the data linearity and the clustering accuracy (Lever et al., 2017).

Among the clustering algorithms, k-means (MacQueen, 1967) is one of the most popular partitioning algorithms, and it is used intensively due to its efficiency and simplicity of implementation (Xie et al., 2019). However, it performs poorly in the presence of outliers, which influence the clustering quality (Mousavi et al., 2020). According to Hawkins (Hawkins, 1980): "An outlier is an observation which deviates so much from other observations as to arouse suspicions that a different mechanism generated it". Naturally, real-world data is not always ideal and may include noisy data due to abnormal phenomena or a different measurement mechanism that produces outliers. Recently, outliers processing became a trending topic among the developers of data mining techniques (Gupta and Chandra, 2020). Therefore, it is worth considering the impacts of outliers when modifying the k-means algorithm.

Besides the influence of outliers, a different similarity metric leads to a different clustering form that may increase or decrease the k-means clustering accuracy (Gupta and Chandra, 2020). Among the distance metrics, the Euclidean distance is commonly used with k-means for data clustering. Also, cosine and correlation are the most well-known metrics for cluster differentiation. Clustering the same dataset based on these metrics may produce various clusterings, which highly depend on the distance model that fits the data domain (Bekhet and Ahmed, 2020). Evaluations of the impact of various similarity metrics can be found in Gupta and Chandra (2020), Gupta and Chandra (2020) and Singh et al. (2013), while selecting or developing a suitable distance measure is still an active field among researchers in the data clustering community.

In this paper, a modified k-means algorithm is proposed for improving data clustering; both online clustering and big data clustering are out of this work's scope. The modified algorithm aims to mitigate the outliers' influences during the centroid measurement for better clustering accuracy. In the proposed algorithm, the well-known Tukey's rule is adapted and applied, instead of a distance metric, to identify the boundaries of the outliers-free data. After that, each cluster's centroid is measured as the average of the upper and lower boundaries of the outliers-free data. The average measurement is applied to each attribute of the data points independently. This eliminates the influences of the outliers without removing the entire data point. On the other hand, a new distance metric is proposed in this research. Initially, the ratio is computed between every two corresponding values in the compared attributes. Then, the angle of each ratio is calculated using the arctangent. The difference between the produced angles is measured for each attribute in the compared vectors. Finally, the root mean square error (RMSE) of the angles' differences is computed as a similarity metric. The proposed method has improved the overall clustering accuracy up to 80.57% on nine standard multivariate datasets compared to the literature algorithms.

In this paper, the related work for improving the clustering accuracy of k-means from the perspectives of outliers removal and distance metrics is presented in Section 2. The discussion of the conventional k-means, various distance metrics and Tukey's rule for outliers removal is presented in Section 3. The proposed method, including the new distance metric, the adapted Tukey's rule and the modified k-means, is introduced in Section 4. Section 5 explains the results and discussion, while the challenges and limitations are addressed in Section 6. Finally, the conclusion and future work are summarized in Section 7.

2. Related work

Many studies dealt with improving the clustering accuracy of the k-means algorithm based on various techniques of outliers removal. In terms of outlier detection based on a distance metric, several studies have recognized outliers based on the distance between the data point and its closest centroid (Sarvani et al., 2019; Barai (Deb) and Dey, 2017). In these techniques, the data point with a larger distance to the nearest centroid is recognized as an outlier. Additionally, data points with both low density and large distance to their centroids are considered outliers, as presented in He et al. (2020). In a different approach, local search techniques (Gupta et al., 2017; Friggstad et al., 2019) are used to assist the k-means in outlier detection. The local search aims to remove a few data points from the data within the cluster to minimize the objective function. If the removed data points have minimized the objective function, then those data points are considered outliers and grouped in a separate cluster. In terms of preprocessing techniques, k-means++ is utilized as an additional filtering step in Im et al. (2020) to remove z data points as outliers before applying the conventional k-means. Despite the encouraging clustering results of these techniques, the clustering process was only performed on the remaining, outlier-free data. The outlier data are completely removed and not classified into any known cluster as collected initially.

In other studies, outlier detection is used as an advantage to separate an object from its background, such as in image processing (Tu et al., 2020, 2019). However, few studies dealt with mitigating the outliers' effects on the mean measurement while classifying all data points into known clusters as collected initially. In Olukanmi et al. (2017), k-means# is proposed to eliminate the outliers' influences on the clusters' centroids. The detected outliers are excluded from the mean measurement only, but they are involved later in the clustering process. Thus, the effect of the outliers on the centroid measurement is mitigated, which enhanced the clustering accuracy. Although the proposed technique outperformed the conventional k-means, the data point with N attributes was eliminated completely from the centroid measurement. In this case, the algorithm cannot recognize an outlier's presence in every attribute independently, because the single value of the distance metric represents the entire vector rather than the single attribute to be removed. Therefore, an empty cluster may occur in the case of the presence of at least one outlier in each data point.

Improvement of the clustering accuracy from the perspective of the distance metric is demonstrated in various studies. A probabilistic distance for ICA mixture models (PDI) is proposed in Safont et al. (2018). The distance measures the discordance between the probability densities of the data with respect to the parameters of each ICAMM model. The source separation of ICAMM is improved based on the PDI distance, especially after adjusting a threshold value, with good performance for detecting flaws and variations in electroencephalographic (EEG) data. The authors have suggested parallel computing for reducing the processing time.

In Meng et al. (2018), several orders of derivative information are measured between the compared vectors and added to the distance metric. The added information of the derivatives is useful
for capturing the differences between the compared functional data. However, this technique is computationally complex due to the calculation of several derivative orders of the functional data.

In terms of hybrid distance metrics, which are commonly used for improving clustering accuracy, a new distance metric named "direction-aware" is developed in Gu et al. (2017) to improve the clustering accuracy of k-means. The proposed distance combines the conventional Euclidean distance to handle the spatial similarity, while the cosine metric calculates the shape similarity. Compared to the original metrics, the hybridization of both metrics into a single one has improved the clustering purity. Moreover, a weighted sum of the Euclidean and Pearson distances is introduced in Immink and Weber (2015), using a weight for the summation of both the Euclidean and Pearson coefficients. The hybrid distance has significantly improved the similarity between the compared signals once noise is added.

The clustering accuracy can be improved as long as the outliers are removed before measuring the cluster's centroid, as discussed in this section. In this study, a well-known approach, Tukey's rule for outlier removal, is adapted to be used instead of the distance metric. Moreover, the hybridization approach of several distance metrics is followed due to its efficiency for improving the clustering accuracy, as discussed earlier. Combining the advantages of commonly used distance metrics such as Euclidean, cosine and correlation into a new single similarity metric can process the data from different perspectives. The capability of hybrid distance approaches for improving the clustering accuracy is high, especially after eliminating the influence of the outliers on the centroids of the data clusters. The theoretical background of the used techniques, such as k-means, Tukey's rule and the similarity metrics, is discussed in the next section.

3. Theoretical background

3.1. K-means clustering algorithm

The approach of k-means is based on spherical clusters in which the data points converge around the cluster's centroid. The k-means splits a set of data points X = {x_1, x_2, x_3, ..., x_N} into a known number k of clusters. Randomly, the k-means selects a set of k centroids C = {c_1, c_2, c_3, ..., c_k}, where k ≤ N. Thereafter, each data point x_i is assigned to the nearest cluster C_j based on the smallest Euclidean distance. The mean \mu_j of the data points within each cluster is computed to update the centroid in each iteration. The procedure is repeated until there are no changes in the centroid values or the maximum number of iterations is reached. The similarity of the data points within each cluster is kept high by minimizing the distance among them, as given in (1).

\arg\min_{C} \sum_{j=1}^{k} \sum_{x \in C_j} \left\| x - \mu_j \right\|^{2}    (1)

\mu_j = \frac{1}{N} \sum_{x \in C_j} x    (2)

where argmin is the objective function of the k-means, the symbol ‖·‖ denotes the Euclidean distance, and N denotes the number of data points within the j-th cluster.

3.2. Tukey's rule for outliers removal

Tukey's rule is one of the most robust techniques used for outlier detection in multivariate data (Huyghues-Beaufond et al., 2020). Tukey defines outliers as the data located outside the boundaries of the inner fence. To detect the outliers, let a set of data points X = {x_1, x_2, x_3, ..., x_N} include some outliers, where X ⊂ ℝ and x_1 ≤ x_2 ≤ x_3 ≤ ... ≤ x_N. In the beginning, the first quartile Q_1 and the third quartile Q_3 are computed as stated in (3) and (4), respectively. Thereafter, the inter-quartile range IQR is measured using (5). Finally, the lower bound lb and upper bound ub are given in (6) and (7), respectively. Therefore, any data point x_i that is lower than lb or greater than ub is considered an outlier.

Q_1 = x_i \mid i = \mathrm{round}((N + 1) \times 0.25)    (3)

Q_3 = x_i \mid i = \mathrm{round}((N + 1) \times 0.75)    (4)

IQR = Q_3 - Q_1    (5)

lb = Q_1 - r \times IQR    (6)

ub = Q_3 + r \times IQR    (7)

where i is the data point index, while N denotes the total number of data points. r is a predefined constant to adjust the inliers' boundaries. Tukey proposed the predefined constant r = 1.5; however, there is no statistical basis for the reasoning behind this value (Seo et al., 2006). Therefore, this value can be adapted based on the application.

3.3. Similarity measurements

The Euclidean distance is the most common similarity metric among various machine learning algorithms (Mesquita et al., 2017). For measuring the distance between two vectors, let A = {a_1, a_2, a_3, ..., a_N} and B = {b_1, b_2, b_3, ..., b_N} be two data points with N numeric attributes. The Euclidean distance d_E between A and B can be measured as given in (8).

d_E(A, B) = \sqrt{\sum_{i=1}^{N} \left| a_i - b_i \right|^{2}}    (8)

Several properties are satisfied by the Euclidean distance: it is non-negative, with d_E ≥ 0; once both A and B are completely similar, the distance d_E = 0; and the distance is a symmetric function, where d_E(A, B) = d_E(B, A) (Han et al., 2012). Despite its encouraging performance on categorical and numerical datasets, the performance of the Euclidean distance is relatively poor on mixed datasets (Hu et al., 2016).

On the other hand, the cosine metric computes the similarity between two vectors based on the inner product, as defined in (9) (Lo et al., 2018). A cosine distance d_C(A, B) = 1 indicates that the vectors have no similarity, while the similarity between the input vectors is higher as the cosine distance tends to 0.

d_C(A, B) = 1 - \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^{2}} \sqrt{\sum_{i=1}^{N} b_i^{2}}}    (9)

Typically, the cosine distance is utilized to measure text similarity in document classification (Manning et al., 2008). The cosine distance performs well on categorical and numerical datasets independently, and worst on mixed datasets (Hu et al., 2016).

In a different approach, the Pearson coefficient is utilized to compute the degree of correlation between the A and B data points on the straight regression line (Yin and Wang, 2020). The similarity metric is defined by subtracting the Pearson correlation coefficient from 1, as stated in (10).

d_R(A, B) = 1 - \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{\sqrt{\sum_{i=1}^{N} (a_i - \bar{A})^{2}} \sqrt{\sum_{i=1}^{N} (b_i - \bar{B})^{2}}}    (10)

where \bar{A} and \bar{B} are the means of the A and B vectors, respectively. The Pearson correlation coefficient is invariant to both data scaling and localization. Thus, it is not a distance metric, since the approach aims to capture variations in the data's shape rather than magnitude differences.
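To make these definitions concrete, the following minimal Python sketch (our illustration, not code from the paper; the function names and toy data are arbitrary) computes Tukey's fences of Eqs. (3)-(7) and the three classical metrics of Eqs. (8)-(10):

import numpy as np

def tukey_fences(x, r=1.5):
    # Inner fences of Eqs. (3)-(7); points outside [lb, ub] are outliers.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    q1 = x[int(round((n + 1) * 0.25)) - 1]  # Eq. (3), 1-based index
    q3 = x[int(round((n + 1) * 0.75)) - 1]  # Eq. (4)
    iqr = q3 - q1                           # Eq. (5)
    return q1 - r * iqr, q3 + r * iqr       # Eqs. (6)-(7)

def d_euclidean(a, b):  # Eq. (8)
    return np.sqrt(np.sum(np.abs(a - b) ** 2))

def d_cosine(a, b):     # Eq. (9)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def d_correlation(a, b):  # Eq. (10)
    ac, bc = a - a.mean(), b - b.mean()
    return 1.0 - np.dot(ac, bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])
print(d_euclidean(a, b), d_cosine(a, b), d_correlation(a, b))
print(tukey_fences([1, 2, 2, 3, 3, 3, 4, 4, 40]))  # 40 lies above the upper fence

Note that the proportional vectors a and b yield zero cosine and correlation distances but a non-zero Euclidean distance; this complementarity is exactly what the hybrid metrics discussed above try to exploit.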
4. Proposed method

4.1. Distance measurement

In this study, a new similarity metric is proposed to replace the Euclidean distance in the conventional k-means. The proposed similarity metric aims to take the advantages of various measurements, such as the Euclidean distance, the cosine metric and the correlation coefficient, since a different distance metric leads to a different clustering accuracy (Aggarwal et al., 2019).

Let A = {a_1, a_2, a_3, ..., a_N} and B = {b_1, b_2, b_3, ..., b_N} be two vectors with N numeric attributes. To measure the similarity between A and B, a normalization process is firstly applied to convert them into positive values between 1 and 90, as follows:

Y = \max(A, B)    (11)

Z = \min(A, B)    (12)

A = \frac{A - Z}{Y - Z} (r_{max} - r_{min}) + r_{min}    (13)

B = \frac{B - Z}{Y - Z} (r_{max} - r_{min}) + r_{min}    (14)

where Y and Z denote the maximum and minimum values among A and B, respectively, and r_max and r_min denote the maximum and minimum of the target scaling, which are 90 and 1, respectively.

The normalization process guarantees that the values of a_i and b_i are not equal to zero, avoiding division by zero in (15) and (16). After the normalization, the proportionality between the normalized vectors A and B is computed first, indicating the number of times one value is contained in the other. The proportionality can be computed by measuring the ratios between every two corresponding values, a_i/b_i and b_i/a_i, in the compared vectors A and B. In the case of fully similar vectors, both ratio coefficients equal 1, while different values are produced in the case of dissimilarity.

After that, the angles \alpha_i and \beta_i of both ratios are computed as given in (15) and (16). For similar vectors, both angles tend to 45°, while the angles of dissimilar values tend towards 0° or 90°, as shown in Fig. 1. Therefore, the distance between \alpha_i and \beta_i decreases while the difference between them is small, and vice versa. As a result, N angles' differences are generated for the compared vectors A and B using |\alpha_i - \beta_i|. Finally, the root mean square error (RMSE) of the angles' differences is computed as stated in (17). RMSE tends to 0 in the case of similar vectors and to a large value for dissimilar vectors. In this technique, RMSE is the last outcome of the distance measurement; it is not measured from the raw data directly, but is computed from the angles' differences of the ratio coefficients.

\alpha_i = \tan^{-1}\left(\frac{a_i}{b_i}\right), \quad b_i \neq 0    (15)

\beta_i = \tan^{-1}\left(\frac{b_i}{a_i}\right), \quad a_i \neq 0    (16)

d_P(A, B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| \alpha_i - \beta_i \right|^{2}}    (17)

The proposed distance d_P satisfies several mathematical properties: it is non-negative, with d_P ≥ 0; in the case where both A and B are completely similar, the distance d_P = 0; moreover, the proposed distance is a symmetric function, as d_P(A, B) = d_P(B, A). Besides, several advantages are combined while computing the proposed distance. The advantages of the correlation distance are used at the first stage by computing the proportionality between the compared vectors. Secondly, the compared vectors' angle metric is obtained using the arctangent function, which is similar to the cosine metric. Thirdly, the angles' difference |\alpha_i - \beta_i| is computed similarly to the Euclidean distance. Finally, RMSE is used to produce the last outcome of the proposed distance d_P. The value of d_P tends to 0 if there is a strong correlation between the compared vectors, since the entire set of points lies on the regression line. Thus, the proposed distance metric is computed to be used with the modified k-means, as discussed in Section 4.3.

Fig. 1. The relationship between the angles' differences |\alpha - \beta| and the proposed distance metric d_P.
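As a minimal sketch of Eqs. (11)-(17) (our reading of the equations, not the authors' implementation; names and toy vectors are illustrative), the proposed metric can be written in Python as:

import numpy as np

R_MIN, R_MAX = 1.0, 90.0  # target scaling range of Eqs. (13)-(14)

def normalize_pair(a, b):
    # Jointly rescale both vectors into [1, 90] (Eqs. (11)-(14)).
    y = max(a.max(), b.max())  # Eq. (11)
    z = min(a.min(), b.min())  # Eq. (12)
    a_n = (a - z) / (y - z) * (R_MAX - R_MIN) + R_MIN  # Eq. (13)
    b_n = (b - z) / (y - z) * (R_MAX - R_MIN) + R_MIN  # Eq. (14)
    return a_n, b_n

def d_proposed(a, b):
    # RMSE of the arctangent angles of the ratio coefficients (Eqs. (15)-(17)).
    a_n, b_n = normalize_pair(np.asarray(a, float), np.asarray(b, float))
    alpha = np.degrees(np.arctan(a_n / b_n))  # Eq. (15); 45 degrees when a_i = b_i
    beta = np.degrees(np.arctan(b_n / a_n))   # Eq. (16)
    return np.sqrt(np.mean(np.abs(alpha - beta) ** 2))  # Eq. (17)

a = np.array([1.0, 2.0, 3.0])
print(d_proposed(a, a))                          # 0.0 for identical vectors
print(d_proposed(a, np.array([9.0, 1.0, 5.0])))  # larger for dissimilar vectors

The normalization step assumes the joint maximum and minimum differ (y > z); identical vectors give alpha = beta = 45° for every attribute and hence d_P = 0, matching the non-negativity and symmetry properties stated above.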
4.2. Adapted Tukey's rule

The standard Tukey's rule, as discussed in Section 3.2, removes the outliers using Tukey's constant r = 1.5 in both the lower and upper boundaries equally. Meanwhile, the distribution of the remaining data with respect to its mean, including skewness, is not considered. Skewed data means that the data is not equally distributed along the left and right sides of the mean value. Therefore, computing the mean from the remaining data after applying Tukey's rule does not make sense when hard skewness exists, as the skewed data influences the centroid measurement of the k-means. Therefore, an adaptation of the standard Tukey's rule is proposed in this paper to provide a robust measurement of the k-means' centroids by mitigating the data skewness through outliers removal. This provides outliers-free data which does not influence the centroid measurement.

Typically, the skewness of the distributed data can be detected by counting the number of data points along the right and left sides of the mean value. Skewness exists if the number of data points is not balanced equally along the left and right sides of the mean value; otherwise, the data points are normally distributed. The Tukey constant r is divided into r_1 and r_2 to manage the amount of removed outliers at each side of the distributed data, where r_1 > r_2 (e.g., r_1 = 0.85 and r_2 = 0.65). In the case of skewed data, r_1 allows removing fewer data points due to its large value, which widens the data boundary at the skewness position. Conversely, r_2 allows removing a larger number of data points at the boundary away from the skewness position; the small value of r_2 narrows the data boundary far from the skewness position. On the other hand, the amount of removed data is set using the average of r_1 and r_2 in the case where the data is equally distributed along both sides of the mean value. The formulation of the adaptive process is given in (18) and (19).

lb = \begin{cases} Q_1 - r_1 \times IQR & \text{if } GM < LM \\ Q_1 - r_2 \times IQR & \text{if } GM > LM \\ Q_1 - \frac{r_1 + r_2}{2} \times IQR & \text{if } GM = LM \end{cases}    (18)

ub = \begin{cases} Q_3 + r_2 \times IQR & \text{if } GM < LM \\ Q_3 + r_1 \times IQR & \text{if } GM > LM \\ Q_3 + \frac{r_1 + r_2}{2} \times IQR & \text{if } GM = LM \end{cases}    (19)

where GM and LM are the numbers of data points that are greater and less than the mean value, respectively.

The modified Tukey's rule adaptively eliminates the outliers with respect to three data distribution cases: the data is distributed to the left, to the right, or evenly around the mean value. This is useful to decide where the hard and the light removal should be applied, instead of eliminating the outliers from both sides equally as in the standard Tukey's rule. The purpose of adapting Tukey's rule is to provide robust outliers-free boundaries. These boundaries, lb and ub, are used to compute the clusters' centroids of the modified k-means only; outliers removal is applied only during the centroid measurement. Therefore, none of the input data will be removed, and all of the data points will be clustered into the known groups as initially collected. A detailed discussion about the hybridization between the adapted Tukey's rule and k-means is introduced in the next section.

4.3. Modified k-means

In this paper, the modification of the k-means aims to mitigate the influences of the outliers while computing the clusters' centroids. Let X be a numerical N × M dataset, where N denotes the number of data points and M is the number of attributes. Algorithm 1 shows the pseudocode of the modified k-means, and the entire steps are discussed as follows:

1. Initialization: the modified k-means randomly selects k data points as data centroids C = {c_1, c_2, c_3, ..., c_k}, where k ≤ N and C ⊂ X.
2. Assignment: each data point x_i is assigned to the nearest cluster c_j based on the proposed distance metric as given in Eq. (17). At this step, k sets of data clusters are constructed initially.
3. Outliers' removal: the adapted Tukey's rule is applied to each cluster for removing the outliers in each attribute m independently. The adapted Tukey's rule is applied recursively to each attribute's remaining data until no more outliers are detected.
4. Centroid updating: the new centroid c_j of each attribute m within cluster j is computed as follows:

c_{j,m} = \frac{lb_m + ub_m}{2}    (20)

where lb_m and ub_m are the lower and upper boundaries obtained from (18) and (19).

5. Termination criteria: stop if all data centroids in C are unchanged or the maximum iteration is reached. Otherwise, repeat steps 2 to 5.

Applying the adapted Tukey's rule recursively has two objectives. The first is to obtain the outliers-free data's boundaries by mining the underlying patterns where the data within each cluster is concentrated; the recursive process helps remove the hidden outliers that other outliers may cover. The second is to remove the outliers from each attribute independently, instead of removing the entire observation as commonly done by clustering algorithms based on outliers' removal techniques. Processing each attribute independently is useful to avoid the occurrence of an empty cluster once at least one outlier is detected in each observation. A sketch of these steps is given after Algorithm 1.

Algorithm 1: The modified k-means clustering algorithm.
Input: X is an input dataset with N data points and M attributes. k is a predefined clusters' number. MaxIter is a maximum number of iterations.
Output: S is a set of k clusters belonging to the X dataset.
1. Begin
2. // Initialization
3. Select k random initial centroids C = {c_1, c_2, c_3, ..., c_k}, where C ⊂ X.
4. Repeat until MaxIter is met or C is not changed // Termination criteria
5.   S = Null
6.   // Assignment
7.   For i = 1 : N
8.     For j = 1 : k
9.       d(j) = d_P(x_i, c_j), compute the distance between x_i and c_j as given in (17).
10.    S_j = S_j ∪ {x_i}, assign x_i to the cluster set S_j for which d(j) is minimum.
11.   // Outliers' removal
12.   For j = 1 : k
13.     For m = 1 : M
14.       Repeat until no outlier is detected in S_j(:, m) // ":" denotes all values
15.         Compute lb and ub of S_j(:, m) as given in (18) and (19), respectively.
16.         For i = 1 : length(S_j(:, m))
17.           If S_j(i, m) < lb or S_j(i, m) > ub // check for an outlier
18.             Remove the outlier data point S_j(i, m).
19.       // Centroid updating
20.       c_j(m) = (lb + ub) / 2, update the centroid of each attribute as given in (20).
21. End
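A compact Python sketch of the outlier-removal and centroid-update steps of Algorithm 1 (Eqs. (18)-(20)) follows; this is our illustrative reading of the pseudocode, using the paper's experimental values r_1 = 0.85 and r_2 = 0.65, and it processes a single attribute of one cluster:

import numpy as np

def adapted_fences(x, r1=0.85, r2=0.65):
    # Skewness-aware bounds of Eqs. (18)-(19).
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    q1 = x[int(round((n + 1) * 0.25)) - 1]
    q3 = x[int(round((n + 1) * 0.75)) - 1]
    iqr = q3 - q1
    gm = int(np.sum(x > x.mean()))  # GM: points greater than the mean
    lm = int(np.sum(x < x.mean()))  # LM: points less than the mean
    if gm < lm:
        return q1 - r1 * iqr, q3 + r2 * iqr
    if gm > lm:
        return q1 - r2 * iqr, q3 + r1 * iqr
    r = (r1 + r2) / 2               # balanced case of Eqs. (18)-(19)
    return q1 - r * iqr, q3 + r * iqr

def attribute_centroid(values):
    # Recursively strip outliers from one attribute, then apply Eq. (20).
    v = np.asarray(values, dtype=float)
    while True:
        lb, ub = adapted_fences(v)
        inliers = v[(v >= lb) & (v <= ub)]
        if len(inliers) == len(v):  # no outlier detected: stop (line 14)
            return (lb + ub) / 2    # Eq. (20): centroid of this attribute
        v = inliers                 # repeat on the remaining data

# A skewed attribute whose moderate outlier is masked by a larger one:
print(attribute_centroid([5.0, 5.1, 5.2, 5.3, 5.4, 6.0, 9.0, 14.0]))

In a full implementation, attribute_centroid would be called for every attribute m of every cluster S_j, and the assignment step would use the proposed distance d_P of Eq. (17), exactly as in lines 7-20 of Algorithm 1.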
In the modified k-means, the centroid is measured from the average of lb and ub of the outliers-free data at the last iteration of the adapted Tukey's rule. This step replaces the traditional centroid measurement \mu_j of the conventional k-means. This replacement allows the modified k-means to quickly calculate the centroid from the outlier-free data's boundaries instead of computing the mean of the entire input data. On the other hand, the modified k-means allows the entire set of data points to be clustered based on the proposed distance metric; none of the data points is removed from the input data, as is usually done in conventional clustering algorithms based on outliers removal methods. The proposed method is validated on various commonly used datasets, as discussed in the next section.

5. Results and discussion

5.1. Datasets

The proposed method is validated on nine multivariate datasets obtained from the machine learning repository of the University of California, Irvine (UCI) (Dua et al., 2020). The used datasets have a diverse number of attributes, ranging from 4 to 13. Moreover, the number of clusters of the given datasets varies from 2 to 6. The content of the used datasets is described and summarized in Table 1.

5.2. Evaluation of the proposed method

5.2.1. Evaluation of the clustering accuracy of the modified k-means

The modified k-means is evaluated on the UCI datasets mentioned earlier. The given datasets are normalized by ranging their values from 1 to 90 to avoid division by zero in the equations given in (15) and (16). The obtained results are compared to common literature algorithms, namely k-means (KMN), the hierarchical clustering algorithm (HCA) and the farthest first algorithm (FFA). Moreover, a comparison has been introduced between the proposed algorithm and well-known evolutionary algorithms such as clustering-based PSO (CPSO), clustering-based GA (CGA) and static clustering-based MVO (SCMVO). Also, the results of the modified k-means are compared to k-means# (KMN#) and DBSCAN, which are algorithms robust to outliers' influences. In the experimental setup, the constants of the adapted Tukey's rule, r_1 and r_2, are set to 0.85 and 0.65, respectively, as given in Eqs. (18) and (19), where r_1 > r_2. These constants are empirically selected based on the best clustering accuracy and used globally for all datasets during this experiment. The utilized distance measurement of the proposed method is given in (17). The maximum number of iterations of the modified k-means is 50, while the accuracy results are obtained as the mean value over 100 runs. The clustering accuracy in each run is measured based on the clustering purity, which is the percentage of the number of data points that are correctly clustered, as given in (21).

\text{Purity} = \frac{1}{N} \sum_{i=1}^{k} \left| L_i \cap C_i \right|    (21)

where N denotes the total number of data points, L_i denotes the number of true data points within each cluster i, and C_i denotes the number of data points that are correctly clustered using a clustering algorithm.
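As an illustration, the majority-class reading of Eq. (21) can be computed in a few lines of Python (a sketch with made-up labels, not the authors' evaluation code):

import numpy as np

def purity(true_labels, pred_labels):
    # Eq. (21): fraction of points belonging to the dominant true class of their cluster.
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()  # |L_i ∩ C_i| for cluster c
    return total / len(true_labels)

print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6 ≈ 0.83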
Table 1
The description of the used datasets obtained from the UCI repository.

Dataset | No. of attributes | No. of classes | No. of data points | No. of data points in each class | Data type
Iris | 4 | 3 | 150 | 50, 50, 50 | Real
Glass | 9 | 6 | 214 | 70, 17, 9, 76, 29, 13 | Real
Balance | 4 | 3 | 625 | 49, 288, 288 | Real
Cancer | 8 | 2 | 699 | 458, 241 | Real
Wine | 13 | 3 | 178 | 59, 71, 48 | Real
Vertebral | 6 | 2 | 310 | 207, 100 | Real
Ecoli | 7 | 5 | 327 | 143, 77, 35, 20, 52 | Real
Blood | 4 | 2 | 748 | 570, 178 | Real
Seed | 7 | 3 | 210 | 70, 70, 70 | Real
Table 2
Accuracy results of the modified k-means compared to other literature clustering algorithms. Values are Mean ± Std over 100 runs; the bold values in all tables refer to the best performance. Compared algorithms: KMN (MacQueen, 1967), HCA (Hochbaum and Shmoys, 1985), FFA (Sharmila, 2013), CPSO (Jarboui et al., 2007), CGA (Maulik and Bandyopadhyay, 2000), SCMVO (Shukri et al., 2018), DBSCAN (Ester et al., 1996), KMN# (Olukanmi et al., 2017).

Dataset | KMN | HCA | FFA | CPSO | CGA | SCMVO | DBSCAN | KMN# | Proposed
Iris | 0.57 ± 0.24 | 0.89 ± 0.00 | 0.86 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.68 ± 0.00 | 0.85 ± 0.10 | 0.96 ± 0.00
Glass | 0.36 ± 0.08 | 0.46 ± 0.00 | 0.48 ± 0.00 | 0.45 ± 0.00 | 0.36 ± 0.00 | 0.52 ± 0.08 | 0.46 ± 0.00 | 0.49 ± 0.04 | 0.56 ± 0.03
Balance | 0.47 ± 0.17 | 0.63 ± 0.00 | 0.65 ± 0.00 | 0.37 ± 0.18 | 0.41 ± 0.18 | 0.49 ± 0.17 | 0.47 ± 0.00 | 0.55 ± 0.07 | 0.61 ± 0.07
Cancer | 0.95 ± 0.00 | 0.66 ± 0.00 | 0.84 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.93 ± 0.00 | 0.96 ± 0.03 | 0.97 ± 0.00
Wine | 0.90 ± 0.14 | 0.40 ± 0.00 | 0.70 ± 0.00 | 0.91 ± 0.12 | 0.95 ± 0.01 | 0.96 ± 0.00 | 0.61 ± 0.00 | 0.71 ± 0.04 | 0.98 ± 0.00
Vertebral | 0.68 ± 0.00 | 0.67 ± 0.00 | 0.68 ± 0.00 | 0.68 ± 0.02 | 0.70 ± 0.00 | 0.71 ± 0.00 | 0.67 ± 0.00 | 0.73 ± 0.04 | 0.78 ± 0.01
Ecoli | 0.66 ± 0.11 | 0.65 ± 0.00 | 0.60 ± 0.00 | 0.60 ± 0.11 | 0.62 ± 0.13 | 0.57 ± 0.13 | 0.43 ± 0.00 | 0.66 ± 0.09 | 0.71 ± 0.07
Blood | 0.53 ± 0.06 | 0.76 ± 0.00 | 0.76 ± 0.00 | 0.48 ± 0.00 | 0.48 ± 0.00 | 0.48 ± 0.01 | 0.78 ± 0.00 | 0.68 ± 0.03 | 0.76 ± 0.02
Seed | 0.79 ± 0.19 | 0.90 ± 0.00 | 0.67 ± 0.00 | 0.89 ± 0.05 | 0.90 ± 0.00 | 0.90 ± 0.00 | 0.68 ± 0.00 | 0.90 ± 0.04 | 0.94 ± 0.00
Overall accuracy | 0.66 ± 0.11 | 0.67 ± 0.00 | 0.69 ± 0.00 | 0.70 ± 0.05 | 0.70 ± 0.04 | 0.73 ± 0.04 | 0.63 ± 0.00 | 0.73 ± 0.05 | 0.81 ± 0.02
The proposed method overcomes most of the literature algorithms, as shown in Table 2. The modified k-means significantly outperformed the conventional k-means, k-means# and the evolutionary algorithms such as CPSO, CGA and SCMVO in all datasets, especially Glass, Balance, Vertebral, Ecoli and Blood. However, HCA overcomes the proposed algorithm in the Balance dataset with a 2.55% accuracy difference. Meanwhile, both HCA and FFA outperform the modified k-means in the Blood dataset with a 0.47% accuracy difference, which is a competitive performance. Moreover, DBSCAN achieves better performance in the Blood dataset compared to the proposed method, with a 2% accuracy difference.

The modified k-means achieves a significant performance on the most challenging datasets with a large number of classes, such as Glass and Ecoli with 6 and 5 classes, respectively. Although the initial centroids are randomly selected, the low standard deviation over 100 runs reflects the stability as well as the robustness of the proposed technique. Generally, the modified k-means achieved the best accuracy results in a total of 7 datasets and competitive accuracy results in the other 2 datasets. As a result, the modified k-means completely overcomes 5 well-known algorithms, with competitive results compared to the other three algorithms in the two remaining datasets.

5.2.2. Impact of outliers removal on the clustering accuracy

The impacts of both the standard and the adapted Tukey's rules on the modified k-means are discussed in Section 5.2.2.1. Moreover, the mechanism of outliers removal through the clustering process of both outlier techniques is addressed in Section 5.2.2.2. In the experimental setup, the adapted Tukey's rule parameters r_1 and r_2 are set to 0.85 and 0.65, respectively. In the standard Tukey's rule, Tukey's constant r = 1.5.

5.2.2.1. Clustering accuracy through the modified k-means' iterations. Fig. 2 shows the accuracy variations over the iterations of the modified k-means based on the adapted Tukey's rule. Each iteration has a valuable contribution to increasing the clustering accuracy on most datasets, as shown in Fig. 2. However, a loss in accuracy over the iterations can be seen on the Blood dataset, which influences the modified algorithm's performance compared to HCA and FFA. This behaviour of the proposed algorithm is due to the nature of the dataset itself, whose data could be clustered into up to 18 different classes (Lord et al., 2017). Therefore, the high variety of the included data points impacts the accuracy of the modified k-means.

The standard Tukey's rule influences the convergence process, as the number of iterations is increased significantly with less contribution to the accuracy compared to the adapted Tukey's rule, as shown in Fig. 3. The adapted Tukey's rule achieves the best performance in terms of both the iterations' number and the accuracy, as shown in Table 3. The adapted Tukey's rule positively increases the clustering accuracy while reducing the iterations' number needed for centroid convergence. This improvement is acquired because the adapted Tukey's rule adaptively eliminates the outliers based on the removal rules given in (18) and (19), which makes it quickly reach the boundaries of the outlier-free data where the data centroid should be measured. On the other hand, the modified k-means based on the standard Tukey's rule achieves the best accuracy performance compared to the literature clustering algorithms in four datasets, namely Cancer, Wine, Vertebral and Seed, as underlined in Table 3.

5.2.2.2. Mechanism of outliers removal. In terms of the outliers removal mechanism, Fig. 4 shows an example of the centroid measurement at each iteration during the outliers removal process. Fig. 4(a) illustrates a description of the boxplot parameters of the adapted Tukey's rule. In this example, the data of the Sepal length attribute of the Iris dataset is monitored in every single cluster during the outliers removal process. In all three clusters, the remaining data at the final iteration is completely outliers-free, as shown in Fig. 4(b), (c) and (d). It can be noted that at the first iteration some of the outliers were detected and removed from the Sepal length attribute in every cluster. Moreover, some covered outliers were detected in both the Iris Setosa and Iris Virginica clusters after applying the adapted Tukey's rule recursively to the remaining data. Therefore, the proposed recursive process of outliers removal is useful to eliminate this kind of outliers. Furthermore, the proposed technique for centroid measurement differs from the centroid of the conventional k-means at the final iteration of removing the outliers: the proposed centroid tends to be close to the mean of the outliers-free data and is aligned at the middle of the blue box where the data is concentrated. In terms of skewed data, it can be noted that a hard skewness is shown in the Iris Virginica cluster; in this cluster, the data median is shifted towards the lower boundary, as shown in Fig. 4(d). Therefore, the adapted Tukey's rule mitigates the skewness by eliminating the covered outliers at each iteration. As a result, the centroid is measured after the hard skewness is mitigated at the final iteration. Thus, the adapted Tukey's rule works to provide a robust measurement of the k-means' centroids by mitigating the data skewness through outliers removal. This provides outliers-free data which does not influence the centroid measurement.
Fig. 2. Evolution of the clustering accuracy through iterations of the modified k-means.
Fig. 3. Evolution of the clustering accuracy through iterations of the modified k-means based on the standard Tukey's rule.
Table 3
A comparison between the performance of the standard Tukey's rule and the adapted one in terms of iterations' number and clustering accuracy. The bold values in all tables refer to the best performance.
Compared to the adapted Tukey's rule, the standard Tukey's rule is not sensitive enough to detect the outliers within each cluster of the Iris dataset, as shown in Fig. 5. The large value of Tukey's constant, r = 1.5, maximizes both the lower and upper boundaries. As a result, the covered outliers are not detected, because they lie below ub and above lb. Therefore, the computed centroid in this case is similar to the one calculated by the conventional k-means, as illustrated in Fig. 5.

5.2.3. Evaluation of scarce data clustering

The Glass dataset provides a good example of scarce data clustering. In this dataset, six classes are given, in which the number of data points within each class is 70, 17, 9, 76, 29 and 13, respectively. It is clear that the second, third and fourth classes fall under scarce data clustering due to the small number of data points within these classes compared to the other ones. The proposed method correctly assigns 35.3%, 88.9% and 61.5% of the data points in the second, third and fourth clusters, respectively. Focusing on the third cluster, which includes only nine data points, eight of them are correctly assigned to this cluster, and only two data points are wrongly included from other clusters. Considering the clustering complexity of the Glass dataset as shown in Table 2, the proposed method successfully constructs an independent centroid that attracts most of the data points in the second cluster, while few data points are brought from other clusters. The performance of the modified k-means is encouraging with regard to scarce data clustering.

5.2.4. Evaluation of clustering run-time

In Table 4, the run-time comparison is addressed among the conventional k-means (KMN), the k-means based on the standard Tukey's rule (KMN-TTR) and the k-means based on the adapted Tukey's rule (KMN-ATR). The conventional k-means significantly outperformed both k-means variants based on the standard and adapted Tukey's rules in terms of processing time, as the additional steps of outlier removal in both algorithms add extra processing time to the procedure of the conventional algorithm. Compared to the standard Tukey's rule, the k-means based on the adapted Tukey's rule is the worst in processing time. In the adapted Tukey's rule, the small values of r_1 = 0.85 and r_2 = 0.65 raise the iterations' number of outliers removal. Meanwhile, the large value of the Tukey constant r = 1.5 minimizes the processing time, since no more outliers are detected and the stopping criteria are quickly met. The examples given in Figs. 4 and 5 clearly illustrate the difference in the iterations' number between both rules of outliers removal. Although the steps added by the adapted Tukey's rule raise the processing time, they significantly improve the clustering accuracy. In some critical applications, such as those with medical requirements, clustering accuracy is essential regardless of the computational time (Walker et al., 2020). Giving priority to the accuracy or to the time is highly dependent on the used algorithm for such an application.
Fig. 4. The centroid measurement of the sepal length attribute within each cluster of the Iris dataset during the outliers removal process using the adapted Tukey’s rule.
5.2.5. Evaluation of the proposed distance

The clustering accuracy of the proposed distance has outperformed most of the literature distance metrics in various datasets, as shown in Table 5. The significant performance of the proposed distance can be observed clearly on Iris, Glass, Wine, Vertebral, Blood and Seed. However, the cosine and correlation metrics overcome the proposed distance only in the Balance dataset, with 2.95% and 0.85% accuracy differences, respectively. Moreover, the Euclidean distance overcomes the proposed distance in the Ecoli dataset with a 3.52% accuracy difference. The proposed distance achieves a very competitive performance with the Euclidean distance in the Cancer dataset, with a 0.09% clustering accuracy difference.

On the other hand, the underlined accuracy results mean that the used distance metrics outperformed the entire set of clustering algorithms stated in Table 2. Therefore, the modified k-means can work adequately and achieve an acceptable accuracy rate once other distance metrics are used, such as Euclidean, cosine and correlation. The low standard deviation illustrates the stability of the modified k-means when working with the utilized distance metrics. In terms of the overall accuracy, the modified k-means achieves the best accuracy performance once various distance metrics are used.

6. Challenges and limitations

In this manuscript, several modifications are proposed for the k-means algorithm, based on the adapted Tukey's rule for outliers removal and on the new distance metric. Despite the significant improvement in clustering accuracy, several challenges face the modified k-means, such as real-time clustering, due to the outliers removal step that adds extra processing time to the procedure of the conventional k-means. Also, the proposed distance metric is limited to real values greater than zero, which requires additional processing time for data normalization. However, the trade-off between the clustering accuracy and the run-time highly depends on the application field. On the other hand, data clustering with a high variety of possible clusters, such as in the Blood dataset, is another challenge to the proposed algorithm; this variety in the number of clusters degrades the clustering accuracy of the modified k-means.
Fig. 5. The centroid measurement of the Sepal length within each cluster of the Iris dataset during the outliers removal process using the standard Tukey’s rule.
Table 4
Run-time comparison of KMN, KMN-TTR and KMN-ATR. The bold values in all tables refer to the best performance.

Method | Iris | Glass | Balance | Cancer | Wine | Vertebral | Ecoli | Blood | Seed
KMN | 0.0779 | 0.1707 | 0.2998 | 0.2617 | 0.0901 | 0.0668 | 0.2240 | 0.2775 | 0.1165
KMN-TTR | 0.1027 | 0.3014 | 0.2714 | 0.3036 | 0.1718 | 0.1618 | 0.3143 | 0.3228 | 0.1606
KMN-ATR | 0.1250 | 0.3448 | 0.3383 | 0.2977 | 0.2321 | 0.2043 | 0.3578 | 0.3342 | 0.1899
Table 5
A comparison between the accuracy of the proposed distance metric and various literature distance metrics using the modified k-means clustering algorithm. The bold values in all tables refer to the best performance.
7. Conclusion and future work

This work has demonstrated the significance of the proposed technique in improving the overall clustering accuracy up to 80.57% on nine UCI multivariate datasets. The proposed technique for centroid measurement based on the adapted Tukey's rule significantly improved the overall clustering accuracy by 5.72% compared to the standard rule. Moreover, the overall iteration number for centroid convergence is highly reduced, by 50%, once the adapted Tukey's rule is used. Despite the reduction in the iteration number, the adapted Tukey's rule has added extra processing time to the procedure of the conventional k-means algorithm compared to the standard Tukey's rule.

On the other hand, the developed distance metric properly aided the modified k-means in terms of the overall accuracy performance compared to the other distance metrics. Generally, the proposed distance metric outperformed the Euclidean, cosine and correlation metrics by 2.40%, 5.79% and 6.15% of the clustering accuracy, respectively. In future work, the adapted Tukey's rule and the new distance metric will be applied to various clustering algorithms, such as fuzzy c-means and k-medoids. Moreover, improving the run-time of the modified k-means for big data clustering will be investigated further by reducing the time complexity of Tukey's rule for outlier removal.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement

This research is partially supported by the Fundamental Research Grant Scheme (FRGS), Ministry of Higher Education, Malaysia, entitled 'Enhanced Differential Evolution Algorithm by Balancing the Exploitation and Exploration Search Behavior for Data Clustering', account number 203/PELECT/6071398.

References

Aggarwal, S., Agarwal, N., Jain, M., 2019. Performance analysis of uncertain k-means clustering algorithm using different distance metrics. Adv. Intell. Syst. Comput., 237–245. https://doi.org/10.1007/978-981-13-1132-1_19.
Aggarwal, C.C., Reddy, C.K., 2013. Data Clustering: Algorithms and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series.
Barai (Deb), A., Dey, L., 2017. Outlier detection and removal algorithm in k-means and hierarchical clustering. World J. Comput. Appl. Technol. 5 (2), 24–29.
Bekhet, S., Ahmed, A., 2020. Evaluation of similarity measures for video retrieval. Multimed. Tools Appl. 79 (9-10), 6265–6278. https://doi.org/10.1007/s11042-019-08539-4.
Benmessahel, B., Touahria, M., 2010. An improved combinatorial particle swarm optimization algorithm to database vertical partition. J. Emerg. Trends Comput. Inf. Sci. 2, 130–135. http://www.cisjournal.org (accessed 21 October 2020).
Bezdek, J.C., Coray, C., Gunderson, R., Watson, J., 1981. Detection and characterization of cluster substructure I. Linear structure: fuzzy c-lines. SIAM J. Appl. Math. 40 (2), 339–357. https://doi.org/10.1137/0140029.
Doroshenko, A., 2020. Analysis of the distribution of COVID-19 in Italy using clustering algorithms. In: 2020 IEEE Third Int. Conf. Data Stream Min. Process. IEEE, Lviv, Ukraine, pp. 325–328. https://doi.org/10.1109/DSMP47368.2020.9204202.
Dua, D., Graff, C., 2020. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. Knowl. Discov. Data Min.
Friggstad, Z., Khodamoradi, K., Rezapour, M., Salavatipour, M.R., 2019. Approximation schemes for clustering with outliers. ACM Trans. Algorithms 15 (2), 1–26. https://doi.org/10.1145/3301446.
Govindaraju, P., Achter, S., Ponsignon, T., Ehm, H., Meyer, M., 2018. Comparison of two clustering approaches to find demand patterns in semiconductor supply chain planning. In: IEEE Int. Conf. Autom. Sci. Eng. https://doi.org/10.1109/COASE.2018.8560535.
Gu, X., Angelov, P.P., Kangin, D., Principe, J.C., 2017. A new type of distance metric and its use for clustering. Evol. Syst. 8 (3), 167–177. https://doi.org/10.1007/s12530-017-9195-7.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 27 (2), 73–84. https://doi.org/10.1145/276305.276312.
Gupta, M.K., Chandra, P., 2020. A comprehensive survey of data mining. Int. J. Inf. Technol. 12 (4), 1243–1257. https://doi.org/10.1007/s41870-020-00427-7.
Gupta, M.K., Chandra, P., 2020. An empirical evaluation of K-means clustering algorithm using different distance/similarity metrics. Lect. Notes Electr. Eng. https://doi.org/10.1007/978-3-030-30577-2_79.
Gupta, S., Kumar, R., Lu, K., Moseley, B., Vassilvitskii, S., 2017. Local search methods for k-means with outliers. Proc. VLDB Endow. 10, 757–768. https://doi.org/10.14778/3067421.3067425.
Han, J., Kamber, M., Pei, J., 2012. Getting to know your data. In: Data Mining. Elsevier, pp. 39–82. https://doi.org/10.1016/B978-0-12-381479-1.00002-2.
Hawkins, D.M., 1980. Identification of Outliers. Springer Netherlands, Dordrecht. https://doi.org/10.1007/978-94-015-3994-4.
He, Q., Chen, Z., Ji, K., Wang, L., Ma, K., Zhao, C., Shi, Y., 2020. Cluster center initialization and outlier detection based on distance and density for the K-means algorithm. In: Adv. Intell. Syst. Comput., pp. 530–539. https://doi.org/10.1007/978-3-030-16657-1_49.
Hochbaum, D.S., Shmoys, D.B., 1985. A best possible heuristic for the k-center problem. Math. Oper. Res. 10 (2), 180–184. https://doi.org/10.1287/moor.10.2.180.
Hu, L.-Y., Huang, M.-W., Ke, S.-W., Tsai, C.-F., 2016. The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 5, 1304. https://doi.org/10.1186/s40064-016-2941-7.
Huyghues-Beaufond, N., Tindemans, S., Falugi, P., Sun, M., Strbac, G., 2020. Robust and automatic data cleansing method for short-term load forecasting of distribution feeders. Appl. Energy 261, 114405. https://doi.org/10.1016/j.apenergy.2019.114405.
Im, S., Qaem, M.M., Moseley, B., Sun, X., Zhou, R., 2020. Fast noise removal for k-means clustering. arXiv.
Immink, K.A.S., Weber, J.H., 2015. Hybrid minimum Pearson and Euclidean distance detection. IEEE Trans. Commun. 63 (9), 3290–3298. https://doi.org/10.1109/TCOMM.2015.2458319.
Jarboui, B., Cheikh, M., Siarry, P., Rebai, A., 2007. Combinatorial particle swarm optimization (CPSO) for partitional clustering problem. Appl. Math. Comput. 192 (2), 337–345. https://doi.org/10.1016/j.amc.2007.03.010.
Kaya, I.E., Pehlivanlı, A.Ç., Sekizkardeş, E.G., Ibrikci, T., 2017. PCA based clustering for brain tumor segmentation of T1w MRI images. Comput. Methods Programs Biomed. 140, 19–28. https://doi.org/10.1016/j.cmpb.2016.11.011.
Laxmi Lydia, E., Krishna Kumar, P., Shankar, K., Lakshmanaprabu, S.K., Vidhyavathi, R.M., Maseleno, A., 2020. Charismatic document clustering through novel k-means non-negative matrix factorization (KNMF) algorithm using key phrase extraction. Int. J. Parallel Prog. 48 (3), 496–514. https://doi.org/10.1007/s10766-018-0591-9.
Lever, J., Krzywinski, M., Altman, N., 2017. Points of Significance: Principal component analysis. Nat. Methods 14 (7), 641–642. https://doi.org/10.1038/nmeth.4346.
Lo, O., Buchanan, W.J., Griffiths, P., Macfarlane, R., 2018. Distance measurement methods for improved insider threat detection. Secur. Commun. Networks 2018, 1–18. https://doi.org/10.1155/2018/5906368.
Lord, E., Willems, M., Lapointe, F.-J., Makarenkov, V., 2017. Using the stability of objects to determine the number of clusters in datasets. Inf. Sci. (Ny) 393, 29–46. https://doi.org/10.1016/j.ins.2017.02.010.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. Fifth Berkeley Symp. Math. Stat. Probab.
Manning, C.D., Raghavan, P., Schutze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511809071.
Maulik, U., Bandyopadhyay, S., 2000. Genetic algorithm-based clustering technique. Pattern Recogn. 33 (9), 1455–1465. https://doi.org/10.1016/S0031-3203(99)00137-5.
Meng, Y., Liang, J., Cao, F., He, Y., 2018. A new distance with derivative information for functional k-means clustering algorithm. Inf. Sci. (Ny) 463–464, 166–185. https://doi.org/10.1016/j.ins.2018.06.035.
Mesquita, D.P.P., Gomes, J.P.P., Souza Junior, A.H., Nobre, J.S., 2017. Euclidean distance estimation in incomplete datasets. Neurocomputing 248, 11–18. https://doi.org/10.1016/j.neucom.2016.12.081.
Mousavi, S., Boroujeni, F.Z., Aryanmehr, S., 2020. Improving customer clustering by optimal selection of cluster centroids in K-means and K-medoids algorithms. J. Theor. Appl. Inf. Technol. 98, 3807–3814. http://www.jatit.org/volumes/Vol98No18/8Vol98No18.pdf.
Olukanmi, P.O., Twala, B., 2017. K-means-sharp: Modified centroid update for outlier-robust k-means clustering. In: 2017 Pattern Recognit. Assoc. South Africa Robot. Mechatronics. IEEE, pp. 14–19. https://doi.org/10.1109/RoboMech.2017.8261116.
Safont, G., Salazar, A., Vergara, L., 2017. Unsupervised learning of non-Gaussian mixtures with temporal dependencies. In: 2017 40th Int. Conf. Telecommun. Signal Process. (TSP). https://doi.org/10.1109/TSP.2017.8076014.
Safont, G., Salazar, A., Vergara, L., Gomez, E., Villanueva, V., 2018. Probabilistic distance for mixtures of independent component analyzers. IEEE Trans. Neural Networks Learn. Syst. 29 (4), 1161–1173. https://doi.org/10.1109/TNNLS.2017.2663843.
Sarvani, A., Venugopal, B., Devarakonda, N., 2019. Anomaly detection using K-means approach and outliers detection technique. In: Adv. Intell. Syst. Comput., pp. 375–385. https://doi.org/10.1007/978-981-13-0589-4_35.
Seo, S., Gary, P.D., Marsh, M., 2006. A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. University of Pittsburgh. http://d-scholarship.pitt.edu/7948/.
Sharmila, M.K., 2013. An optimized farthest first clustering algorithm. In: 2013 Nirma Univ. Int. Conf. Eng. IEEE, pp. 1–5. https://doi.org/10.1109/NUiCONE.2013.6780070.
Shrifan, N.H.M.M., Jawad, G.N., Isa, N.A.M., Akbar, M.F., 2021. Microwave nondestructive testing for defect detection in composites based on k-means clustering algorithm. IEEE Access 9, 4820–4828. https://doi.org/10.1109/ACCESS.2020.3048147.
Shukri, S., Faris, H., Aljarah, I., Mirjalili, S., Abraham, A., 2018. Evolutionary static and dynamic clustering algorithms based on multi-verse optimizer. Eng. Appl. Artif. Intell. 72, 54–66. https://doi.org/10.1016/j.engappai.2018.03.013.
Siddiqui, F.U., Isa, N.A.M., Yahya, A., 2013. Outlier rejection fuzzy c-means (ORFCM) algorithm for image segmentation. Turk. J. Electr. Eng. Comput. Sci. 21, 1801–1819. https://doi.org/10.3906/elk-1111-29.
Singh, A., Yadav, A., Rana, A., 2013. K-means with three different distance metrics. Int. J. Comput. Appl. 67 (10), 13–17. https://doi.org/10.5120/11430-6785.
Tu, B., Li, N., Liao, Z., Ou, X., Zhang, G., 2019. Hyperspectral anomaly detection via spatial density background purification. Remote Sens. 11 (22), 2618. https://doi.org/10.3390/rs11222618.
Tu, B., Yang, X., Li, N., Zhou, C., He, D., 2020. Hyperspectral anomaly detection via density peak clustering. Pattern Recogn. Lett. 129, 144–149. https://doi.org/10.1016/j.patrec.2019.11.022.
Uma Maheswari, R., 2020. An efficient cancer classification using mid value k-means and naïve bayes. J. Sci. Comput. Eng. Res., 1–6. https://doi.org/10.46379/jscer.2020.010101.
Walker, S., Khan, W., Katic, K., Maassen, W., Zeiler, W., 2020. Accuracy of different machine learning algorithms and added-value of predicting aggregated-level energy performance of commercial buildings. Energy Build. 209, 109705. https://doi.org/10.1016/j.enbuild.2019.109705.
Xiaowei Xu, Ester, M., Kriegel, H.-P., Sander, J., 1998. A distribution-based clustering algorithm for mining in large spatial databases. In: Proc. 14th Int. Conf. Data Eng. IEEE Comput. Soc., pp. 324–331. https://doi.org/10.1109/ICDE.1998.655795.
Xie, H., Zhang, L., Lim, C.P., Yu, Y., Liu, C., Liu, H., Walters, J., 2019. Improving K-means clustering with enhanced Firefly Algorithms. Appl. Soft Comput. 84, 105763. https://doi.org/10.1016/j.asoc.2019.105763.
Yin, S., Wang, T., 2020. An unknown protocol improved k-means clustering algorithm based on Pearson distance. J. Intell. Fuzzy Syst. 38 (4), 4901–4913. https://doi.org/10.3233/JIFS-191561.
Zhang, Y., Wu, L., Wang, S., Huo, Y., 2011. Chaotic artificial bee colony used for cluster analysis. In: Commun. Comput. Inf. Sci., pp. 205–211. https://doi.org/10.1007/978-3-642-18129-0_33.