Clustering Algorithms in an Educational Context: An Automatic Comparative Approach
2020.
Digital Object Identifier 10.1109/ACCESS.2020.3014948
ABSTRACT Despite an increasing consensus regarding the significance of properly identifying the most
suitable clustering method for a given problem, a surprising amount of educational research, including both
educational data mining (EDM) and learning analytics (LA), neglects this critical task. This shortcoming
could in many cases have a negative impact on the prediction power of both EDM- and LA-based
approaches. To address such issues, this work proposes an evaluation approach that automatically compares
several clustering methods using multiple internal and external performance measures on 9 real-world
educational datasets of different sizes, created from the University of Tartu’s Moodle system, to produce
two-way clustering. Moreover, to investigate the possible effect of normalization on the performance of the
clustering algorithms, this work performs the same experiment on a normalized version of the datasets. Since
such an exhaustive evaluation includes multiple criteria, the proposed approach employs a multiple criteria
decision-making method (i.e., TOPSIS) to rank the most suitable methods for each dataset. Our results
reveal that the proposed approach can automatically compare the performance of the clustering methods
and accordingly recommend the most suitable method for each dataset. Furthermore, our results show that
in both normalized and nonnormalized datasets of different sizes with 10 features, DBSCAN and k-medoids
are the best clustering methods, whereas agglomerative and spectral methods appear to be among the most
stable and highly performing clustering methods for such datasets with 15 features. Regarding datasets with
more than 15 features, OPTICS is among the top-ranked algorithms among the nonnormalized datasets, and
k-medoids is the best among the normalized datasets. Interestingly, our findings reveal that normalization
may have a negative effect on the performance of certain methods, e.g., spectral clustering and OPTICS;
however, it appears to have a mostly positive impact on the other clustering methods.
INDEX TERMS Educational context, clustering methods, multiple criteria decision-making, educational
data mining, learning analytics.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
146994 VOLUME 8, 2020
D. Hooshyar et al.: Clustering Algorithms in an Educational Context: An Automatic Comparative Approach
learning analytics (LA), and their applications—which usually aim to provide a helpful understanding of the learning process to both students and instructors by converting raw educational data into useful information—are some possible responses to this challenge [3]–[5]. To uncover hidden patterns and knowledge from educational data, researchers usually employ various EDM or LA approaches, e.g., supervised and unsupervised learning techniques (for a systematic review of EDM research, see [3]). Even though high prediction accuracy can be achieved using supervised learning methods, they are often inapplicable to educational data with no predefined class labels [6]. Unsupervised learning methods, henceforth referred to as clustering, not only are capable of discovering hidden, underlying patterns and structures in data without labels but also can be used to label unlabeled data for the possible use of supervised methods.

The main aim of clustering methods is to find and form groups of objects, or clusters, that are more similar to one another. The practical value of this capability is to enable personalized differentiation of learning processes, which has been key to improving learning outcomes (e.g., [7]). In other words, clustering methods can be regarded as the task of modeling the data in the form of a simplified set of properties, providing insightful explanations about important aspects of a dataset. Supervised methods are usually less demanding than clustering; however, clustering methods can provide more intuition when addressing complex data [8]. In the field of education, several clustering approaches have been applied to different variables, such as student motivation and behavior, time spent on learning tasks, and so on [9]. The performance of clustering methods can differ significantly for different types of data and applications because they usually operate in spaces with different dimensions and must deal with incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed for such reasons [10]–[16], making the selection of the most appropriate clustering method for a given dataset or problem a difficult challenge. Thus, comparing and recommending clustering algorithms in an educational context could be beneficial to both researchers and practitioners in the field. Similar to other disciplines, the evaluation of a learning method's performance is a crucial challenge in the education field. While a limited set of performance measures suffices to evaluate supervised methods, assessing clustering methods is more difficult owing to the nature of cluster analysis (e.g., [17]–[20]).

In cluster analysis, one key question that must be addressed is cluster validation, i.e., the evaluation of a clustering algorithm's quality. This task is intrinsically challenging due to the lack of objective measures. Basically, there exist different types of validation methods for clustering algorithms [21]; among them, internal and external validation are the most frequently used by researchers [22]. Essentially, these measures assess clustering methods from different viewpoints, and in practice, there is no clustering method that could possibly reach the best performance on all of these performance metrics for a given problem domain [23]. A number of studies revolve around developing performance measures for clustering methods with the aim of determining the appropriateness of the produced clusters [24], [25]. However, surprisingly, although there is an increasing consensus concerning the importance of properly identifying the best clustering method and subsequently interpreting the produced result for a given problem, a limited number of research studies [25]–[27], if any, have comprehensively considered both internal and external measurements for the evaluation of clustering methods in an educational context. In addition to the tedious process behind the experimentation and the data preprocessing, one main reason is that cluster evaluation normally involves multiple conflicting criteria (due to a large number of external and internal metrics). To address this issue—comprehensively evaluating the performance of different clustering methods on educational datasets (e.g., from Moodle)—this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the automatic evaluation of clustering methods in an educational context. Our approach, inspired by [23], involves empirically studying the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics (including three internal and multiple external metrics) are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM method, called TOPSIS, which takes the performance measures as inputs, to rank each clustering algorithm for each dataset as a means of validating our proposed evaluation approach. This approach makes it possible to find and recommend the most suitable clustering method for different educational datasets to practitioners, researchers, and decision makers. It should be noted, however, that the findings reported in this study are relevant and limited to educational datasets that contain students' (online) activity data; for other types of educational data, other results could be anticipated. The main contributions of this work can be summarized as follows:
1) To model a cluster evaluation task as a multiple criteria decision-making problem and accordingly propose a novel approach for the evaluation of clustering methods in an educational context.
2) To automatically find and recommend the most suitable clustering method for different educational datasets, for practitioners, researchers, and decision makers.
3) To systematically and comprehensively compare well-known clustering methods used in an educational context using multiple real-world educational datasets (with different sizes and performance measures).
4) To investigate the effect of normalization on the performance of clustering algorithms in an educational context (on datasets with different sizes and feature numbers).
Sections II and III of this paper review the related work and describe the materials and methods, respectively.
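The TOPSIS ranking step just described can be made concrete with a small sketch: each clustering method is a row of a decision matrix, each performance metric a column, and the method closest to the ideal solution (and farthest from the anti-ideal) ranks first. This is only an illustration under assumed inputs—the scores, weights, and method names below are hypothetical, not values from this study:

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives (rows) on criteria (columns) with TOPSIS.

    matrix  : (m, n) decision matrix of performance scores
    weights : (n,) criterion weights summing to 1
    benefit : (n,) True if higher is better for that criterion
    Returns closeness scores in [0, 1]; higher means a better rank.
    """
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the weights.
    v = m / np.linalg.norm(m, axis=0) * np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit)
    # The ideal solution takes the best value per column; the anti-ideal the worst.
    best = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_best = np.linalg.norm(v - best, axis=1)
    d_worst = np.linalg.norm(v - worst, axis=1)
    return d_worst / (d_best + d_worst)

# Hypothetical scores: rows = clustering methods, columns = three metrics
# (silhouette: higher is better, Davies-Bouldin: lower is better, ARI: higher is better).
scores = np.array([[0.62, 0.80, 0.55],   # e.g., k-means
                   [0.58, 0.70, 0.60],   # e.g., k-medoids
                   [0.45, 1.10, 0.40]])  # e.g., DBSCAN
closeness = topsis(scores, weights=[1/3, 1/3, 1/3], benefit=[True, False, True])
ranking = np.argsort(-closeness)  # indices of methods, best first
```

In this toy matrix the third method is worst on every criterion, so it coincides with the anti-ideal solution and receives closeness 0; the ranking then reflects the trade-offs between the remaining methods.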
Sections IV and V illustrate the experiments and results, and Section VI provides the conclusions and future directions.

II. RELATED WORK
A significant number of EDM and LA studies have applied clustering methods to educational data to provide instructors and students with useful insights into their learning processes. For example, in studies conducted by [28] and [29], different clustering methods, including k-means, were employed to cluster the e-learning behavior of students and to determine the impact of human characteristics on students' preferences, respectively. Similarly, using item response theory and k-means, [30] could successfully identify the learning ability of students in a collaborative-learning environment. In a different attempt, [31] employed expectation–maximization clustering to identify active and passive collaborators within a group of students at a European university. Moreover, [32] could determine the different behavior patterns adopted by students in online discussion forums using an agglomerative clustering method. According to the systematic review conducted by [3], clustering methods are generally applied in EDM, LA, and broad educational research for several reasons, including the analysis of students' motivations, attitudes, and behavior and the understanding of students' learning styles. Furthermore, that review highlights k-means as the clustering method most frequently used in EDM and LA research. However, although it deeply reviews 30 years of clustering methods in the educational context, it contains no indication of comprehensive comparative studies using educational datasets. A few studies do use educational datasets to compare clustering methods (e.g., [33]), but these mainly neglect systematically comparing well-known clustering methods, considering the application of multiple performance measures, considering datasets with different sizes and characteristics, studying datasets with different feature numbers (e.g., large and medium), investigating the possible effect of normalization on a given dataset, and so forth. Furthermore, among those few studies that evaluate the quality of clustering algorithms, there is no study that approaches such an evaluation task using a combination of multiple criteria.

On the other hand, there are several studies, in other disciplines, that revolve around systematically and comprehensively comparing clustering methods with the aim of providing guidelines and recommendations to researchers and practitioners [22], [23], [34]–[37]. In the study conducted by [23], for example, the authors used different performance metrics to assess clustering algorithms in the field of financial risk analysis, employing a decision-making algorithm to choose the best method on three different datasets. The results of their study confirmed that it is unlikely to find a single clustering method with the best performance on all performance measures. Additionally, they highlighted repeated-bisection as the most suitable clustering method in their study. In the context of the text-independent speaker verification task, a comparative study of clustering methods was conducted by [37]. The authors used three datasets of documents and six different clustering methods, including k-means, random swap, expectation–maximization, hierarchical clustering, self-organizing maps (SOM), and fuzzy c-means. According to their findings, with a small number of clusters, SOM and hierarchical methods perform weakly and have lower accuracy than the other methods. In the study conducted by [35], four performance measures were used to evaluate five clustering methods—namely, k-means, multivariate Gaussian mixture, hierarchical clustering, and spectral and nearest-neighbor methods—on 35 gene expression datasets. The results indicate that the multivariate Gaussian mixture method outperforms the other methods, that the spectral method is sensitive to the proximity measure used, and that k-means shows a good, stable performance, similar to the Gaussian mixture method.

Using six datasets to model protein interactions in the yeast Saccharomyces cerevisiae, [38] conducted a comparative study of clustering methods (i.e., Markov clustering, restricted neighborhood search clustering, super-paramagnetic clustering, and molecular complex detection) and found that restricted neighborhood search clustering is the most robust and has the highest performance with respect to variation in the choice of parameters used in the clustering methods. They moreover found that the other clustering methods behave more stably with regard to dataset alterations. Finally, [22] conducted a broad comparative study of nine clustering methods—including k-means, CLARA, hierarchical, expectation–maximization, hcmodel, spectral, subspace, OPTICS, and DBSCAN—using 400 different artificial datasets of different sizes. Their findings reveal that spectral methods tend to show good performance (compared to the others) under the default configuration of the clustering methods.

Evidently, despite an increasing consensus regarding the significance of properly identifying the clustering method that is best suited for a given problem, a surprising amount of educational research (including EDM and LA) neglects this critical task [25]–[27]. While several clustering techniques have been employed in an educational context, it remains controversial which technique would perform best for a given dataset [22]. As previously discussed, to the best of our knowledge, there are no comprehensive comparative studies that consider multiple performance measures (both internal and external), well-known clustering methods, and real-world educational datasets (with different sizes and features) to produce two-way clustering and accordingly provide researchers and practitioners in education disciplines with appropriate, much-needed guidelines. Such guidelines would eventually help to improve EDM and LA research and practice. To fill this gap, this study puts forward an approach for the evaluation of clustering methods in an educational context, possibly finding the most suitable clustering method for a dataset at hand. This approach could potentially help to ensure the suitability of a clustering method for a given dataset.
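The comparative studies surveyed above all rely on external measures that score a clustering against reference labels. As a simple, self-contained illustration (not code from any of the cited studies), two classic external measures—purity and the Rand index—can be written in a few lines; the toy label vectors below are made up:

```python
from collections import Counter
from itertools import combinations

def purity(labels_true, labels_pred):
    """Purity: fraction of points whose true class is the majority
    class of their predicted cluster (1.0 means perfectly pure)."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(labels_true)

def rand_index(labels_true, labels_pred):
    """Rand index: share of point pairs on which the two labelings
    agree (grouped together in both, or separated in both)."""
    idx_pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) ==
                (labels_pred[i] == labels_pred[j])
                for i, j in idx_pairs)
    return agree / len(idx_pairs)

# Toy example: ground truth vs. a clustering with one misplaced point.
truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]
p = purity(truth, pred)      # 5 of 6 points sit with their cluster's majority class
r = rand_index(truth, pred)
```

Both measures reach 1.0 only for a clustering that matches the reference partition, which is why comparative studies typically report several such measures side by side.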
Indeed, in many EDM or LA studies, the classes generated by clustering methods are later used as ground truth for classification methods, and wrongly clustered data can possibly affect the prediction power of the approach.

III. MATERIALS AND METHODS
Even though a diversity of clustering methods has been proposed and employed in the literature, some methods are used more often than others [3]. Moreover, several frequently used methods employ similar mathematical concepts—e.g., graph or spectral clustering similarity matrices—or are based on similar assumptions about the data—e.g., agglomerative and divisive—and thus, in typical usage scenarios, they are expected to produce similar performance. For such reasons, this study considers clustering methods from different families of methods. Several taxonomies have been proposed in the literature, breaking clustering methods down into different families. For example, according to the existing literature, [39] grouped clustering methods into partitioning and hierarchical methods; [23] broke clustering methods down into partitioning, grid-based, hierarchical, model-based, density-based, constraint-based, and frequent pattern-based methods; [40] classified clustering methods as partitioning, hierarchical, density-based, grid-based, and model-based approaches; [41] separated clustering methods into density-based, partitioning, and hierarchical methods; and finally, [22] added more categories and defined clustering methods as partition-based, linkage, model-based, spectral methods, methods based on subspaces, and density-based methods. It is apparent that different families of clustering methods have been formed by several researchers according to different criteria, such as objective functions and cluster structures [42], [43]. This study considers seven well-known methods commonly used in the area of education: expectation–maximization (EM) from the model-based family, k-means and k-medoids from the partitioning family, OPTICS and DBSCAN from the density-based family, spectral clustering from the family of spectral methods, and agglomerative clustering from the hierarchical family. These methods were selected according to the recent systematic review of clustering methods conducted in [3].

Regarding the performance metrics used for evaluating clustering methods, several studies have proposed different metrics (e.g., [44]), among them internal and external metrics. External validation metrics measure the similarity between the clustering method's results and the correct partitioning (predefined class labels), whereas internal validation metrics measure the appropriateness of a clustering structure according to intracluster similarity and intercluster dissimilarity, without using external information (class labels). Both internal and external validation metrics play an important role in choosing an optimal clustering method for a certain dataset [45]. The present study employs three internal metrics—namely, Dunn's Index (DI), Silhouette Index (SI), and Davies-Bouldin Index (DBI)—and 14 external metrics: Adjusted Mutual Information (AMI), Completeness Index (CI), Jaccard Index (JI), Purity Index (PI), Rand Index (RI), Adjusted Rand Index (ARI), Fowlkes-Mallows Index (FMI), F-measure (Fmea), Rogers-Tanimoto (RTI), Normalized Mutual Information (NMI), Mutual Information (MI), Variation of Information (VI), Geometric Accuracy (GA), and Overlapping Normalized Mutual Information (ONMI).

While some of the external metrics are closely related and belong to the same pair-counting category, they possess some differences; for example, they can behave partially with regard to the distribution of class sizes in a partition, and they can be biased toward the number of clusters [22]. Considering a vast number of (internal and external) performance metrics enables us to compare different clustering methods and possibly highlight the best method for a certain dataset regardless of the partiality of the methods [46].

A. CLUSTERING METHODS
This section presents a brief overview of the seven clustering methods used in this study.

K-means is an algorithm that has been broadly employed by researchers in educational research [3]. As input, this algorithm requires a distance metric and the number of clusters (k). In this method, k objects are selected at random as the centers of the clusters. Thereafter, using the minimum squared-error criterion, which measures the distance between a cluster center and an object, all of the objects are grouped into k clusters. In the final stage, each cluster's new mean is computed, and this process repeats until the center of each cluster remains unchanged. The time complexity of this algorithm is O(nkd), where d denotes the number of features, k is the number of clusters, and n is the number of objects. Key advantages of the k-means algorithm are its simple implementation, low computational cost, and good (or acceptable) results in many different practical situations. Nonetheless, it has several disadvantages: it requires the number of clusters to be specified in advance, which can strongly affect the final classification; it is unsuitable for nonconvex clusters; it clusters outliers; it scales poorly with the number of dimensions; and it depends on the initial starting conditions [47].

K-medoids, a modification of k-means, is among the algorithms that have been proposed to address the limitations of k-means. In this method, similar to k-means, the goal is to minimize the distance between the point specified as the cluster center and the data points in that cluster. Unlike k-means, k-medoids selects data points as cluster centers, chosen to have minimal average dissimilarity to the points labeled as belonging to the cluster (i.e., the most centrally situated point in the cluster). The time complexity of k-medoids is O(k(n − k)²). Compared to k-medoids, the k-means method appears to be less flexible in certain situations; for example, it cannot be used with certain similarity measures, such as absolute Pearson correlation, because the distance used must be consistent with the mean; moreover, it appears to be unsuitable for clustering nonspherical groups of objects because it relies on minimizing the
distances between the nonmedoid objects and the medoids; nonetheless, it has a lower computational time and requires less time to run than k-medoids [48].

Expectation–maximization (EM) is a model-based clustering method whose main goal is to define each data object's cluster membership according to a probability. This method is composed of two steps: the expectation step, which estimates the likelihood of each object belonging to a cluster, and the maximization step, which computes the parameters of the distributions so as to maximize the distributions' probabilities in the dataset [49]. The time complexity of this algorithm is O(dni), where i is the number of iterations. Among the advantages of EM are its stability and the low complexity of its implementation. Furthermore, it is regarded as particularly suitable for incomplete datasets. However, this method suffers from some issues: it mostly fails to find small clusters, the obtained clusters can depend strongly on the initial conditions, intractable expectation and maximization steps can be involved, and it is slow to converge.

Agglomerative clustering is a hierarchical method that considers the linkage between data points. In this method, which follows a bottom-up fashion, data objects are grouped into a tree. Basically, it initially places each data point in its own cluster and subsequently combines these clusters into larger clusters. This process iterates until a certain termination condition is reached or all data points gather together in one cluster [50]. In this method, clusters are agglomerated according to the similarity measure between two clusters. The time complexity of this algorithm is O(n²). Some advantages of this method are its low implementation complexity and the ease of deciding on the number of clusters (the dendrogram produced is more informative than the unstructured set of flat clusters produced by k-means). On the other hand, this method is less flexible, especially in undoing previous steps; in other words, once instances have been assigned to a cluster, they can no longer be moved around. Another weakness of this method is its unsuitability for large datasets. Similar to some previously mentioned algorithms, e.g., k-means, the final results of agglomerative methods are strongly affected by the initial seeds. Moreover, this method is vulnerable to outliers, and the final results can be affected by the order of the data.

Spectral clustering is another class of clustering methods that has been developed to address the issues of conventional clustering methods (namely, determining nonlinear discriminative hypersurfaces) [51]. It initially builds an affinity matrix—representing the data by a weighted graph—where the similarity between points m and n is indicated by the value in the m-th row and n-th column. Note that for a weighted-graph representation of the data, this method usually benefits from weighted kernel k-means, which is a generalization of the k-means method. Thereafter, to group the data based on a given criterion, it uses the eigenvalues and eigenvectors of the matrix. Various types of similarity matrices can be used in this method, such as the Laplacian matrix. This algorithm has a time complexity of O(ndki), where n is the number of data points, d is the dimensionality of each point, and i is the number of iterations required for k-means to converge. The overall time complexity of the spectral clustering algorithm is O(nzki), where z is the average number of rows in the similarity matrix. One main advantage of this method is that it does not impose a predetermined shape on the clusters, owing to its definition of an adjacency structure from the original dataset. On the other hand, one disadvantage is the demanding computation of the eigenvectors of the similarity matrix; in other words, it can become computationally expensive when addressing large datasets.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method, which means that in a set of points, it clusters together points with many adjacent neighbors (points that are tightly packed together) [52]. In this approach, points whose adjacent neighbors are far away are considered outliers (points that stand alone in low-density areas). Epsilon (ε) and the minimum number of points needed to create a cluster (η) are the inputs of this method. Initially, it randomly selects an unvisited starting data point. If a sufficient number of data points lie within the ε-neighborhood of this point, a cluster is formed; otherwise, the point is labeled as noise. All data points in the ε-neighborhood of a point that is found to be part of a cluster are also considered part of that cluster. Points initially labeled as noise may later be found within the ε-neighborhood of sufficiently many data points and thereby become members of a cluster. This procedure goes on until a cluster is completely formed. Thereafter, another point that has not been previously visited is chosen and processed similarly, forming the next cluster or set of noise points. DBSCAN has a time complexity of O(n²). DBSCAN performs well when addressing outliers within a dataset and is good at separating high-density clusters from low-density clusters. However, similar to other clustering methods, it suffers from some shortcomings; for example, it struggles with high-dimensional data or clusters with similar (or varying) densities.

OPTICS (Ordering Points To Identify the Clustering Structure) is also a density-based method, based on the maximal density-reachability concept [53]. Similar to DBSCAN's process, this method begins with a data point and grows its neighborhood. However, to solve the issue of DBSCAN—the detection of meaningful clusters in data with varying density—it uses a (linear) ordering in which spatially nearest points become neighbors in the ordering. Furthermore, it stores, for each point, a special distance representing the density that must be accepted for a cluster so that both points can belong to the same cluster. OPTICS has a time complexity of O(n log n). One advantage of this method is its capability of handling clusters with irregular shapes and large density variations; its disadvantages are a higher computational time and a longer running time than DBSCAN.
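The neighborhood-growing procedure that DBSCAN follows, as described above, can be sketched in a few dozen lines. This is an illustrative toy implementation (the points and parameter values are made up), not the implementation used in this study:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point
    (-1 marks noise, 0..k-1 mark clusters). Note that the
    eps-neighborhood includes the point itself (distance 0 <= eps),
    so min_pts counts the point as one of its own neighbors."""
    n = len(points)
    # Precompute pairwise distances and fixed eps-neighborhoods.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # not an unvisited core point (it may still join a cluster later)
        # Grow a new cluster from core point i by expanding neighborhoods.
        queue = [i]
        visited[i] = True
        while queue:
            j = queue.pop()
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:      # expand only through core points
                for k in neighbors[j]:
                    if not visited[k]:
                        visited[k] = True
                        queue.append(k)
        cluster += 1
    return labels

# Two tight groups plus one isolated point, which ends up as noise.
pts = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
                [10.0, 10.0], [10.0, 10.5], [10.5, 10.0],
                [50.0, 50.0]])
labels = dbscan(pts, eps=1.0, min_pts=3)
```

With these (assumed) inputs the two dense groups each form a cluster, while the far-away point never accumulates enough neighbors and keeps the noise label, mirroring the outlier handling described above.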
The Jaccard index is an external performance metric that measures the similarity between objects. It is defined by the corresponding equation in Table 2. In the equation, the number of pairs of objects that have the same label in C and are assigned to the same cluster in P is denoted by x, the number of pairs that have the same label but are assigned to different clusters is denoted by y, and the number of pairs within the same cluster that have different class labels is denoted by j. This index yields a value between 0 and 1, where 1 indicates that C and P are identical.

The F-measure is an external performance metric defined as the harmonic mean of the precision and recall, where the precision and recall indicate the percentage of a cluster with positive points and the true positive rate, respectively. This metric is defined using the corresponding equation listed in Table 2.

The Normalized Mutual Information is an external performance metric that normalizes the score of the mutual information in such a way that the results scale between 0 and 1, where 0 denotes no mutual information and 1 refers to perfect correlation. This metric is defined using the corresponding equation listed in Table 2. In the equation, P and C denote the class and cluster labels, respectively, H(P) and H(C) refer to the entropy of P and C, respectively, and MI(P, C) denotes the mutual information between P and C. The entropy is expressed as H(N) = −Σ_{i=1}^{N} P(n_i) log₂ P(n_i).

The Adjusted Mutual Information is an external performance metric that adjusts the mutual information (MI) score by considering chance normalization. In other words, it accounts for the fact that MI usually yields a higher score for a larger number of clusters, regardless of whether there is actually more information to share. This metric is defined using the corresponding equation listed in Table 2.

The Rogers and Tanimoto index is an external performance metric that computes the similarity coefficient of two clusterings of the same dataset from observations' comemberships (comembership refers to pairs of observations that are partitioned together). This metric is defined using the corresponding equation listed in Table 2. In the equation, the number of pairs of objects that have the same label and are assigned to the same cluster is denoted by w, and the number of such pairs assigned to different clusters is denoted by x. Moreover, the number of pairs in the same cluster that have different class labels is denoted by y, and the number of pairs that have different labels and were assigned to different clusters is denoted by z.

The Variation of Information index, which is similar to the mutual information index in that it is a straightforward linear expression involving the mutual information, is a distance measure between two clusterings. In this index, unlike the mutual information index, the triangle inequality is followed.

The Fowlkes-Mallows index is an external performance metric that is widely employed to measure the similarity between two formed clusterings; this similarity can be measured between a benchmark classification and a clustering or between two hierarchical clusterings. This metric is defined using the corresponding equation listed in Table 2.

The Mutual Information is an external performance metric that measures the similarity between two labelings of the same data. This metric is defined using the corresponding equation listed in Table 2.

The Completeness index is an external performance metric that indicates a clustering result's completeness. In other words, the completeness of a clustering result is satisfied when all of the objects of the same cluster are members of a given class label. This metric is defined using the corresponding equation listed in Table 2.

C. MULTIPLE CRITERIA DECISION-MAKING APPROACH FOR THE EVALUATION OF CLUSTERING METHODS
The performance of clustering methods can differ significantly for different types of data and applications because they usually operate in spaces with different dimensions and must address incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed for such reasons. Moreover, a large number of performance measures (internal and external) have been proposed for the evaluation of different clustering methods, which makes the selection of the most appropriate clustering method for a given dataset or problem a difficult challenge.

To address this challenge and comprehensively evaluate the performance of different clustering methods on educational datasets, this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the evaluation of clustering methods in an educational context. In our empirical approach, this study examines the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM
regarding the variation of information, and when changing
method (called TOPSIS, see the following section), which
from cluster Pi to Pj , the loss or gain of the information
takes the performance measures as inputs, to rank each clus-
is measured. This metric is defined using the corresponding
tering algorithm for each dataset as a means of validating our
equation listed in Table 2.
evaluation approach (some examples of TOPSIS applications
The Geometric accuracy is acquired by calculating the
can be found in [56], [57]). This makes it possible to find and
sensitivity’s geometrical mean and positive predictive value.
recommend the most suitable clustering method for different
This metric is basically concerned with balancing the sensi-
educational datasets that can serve the needs of practitioners,
tivity and predictive value that reflects two conflicting incli-
researchers, and decision makers (see Fig. 1).
nations of clustering. Usually, when objects with different
and similar complexity are partitioned in the same cluster, the
positive predictive value and the sensitivity are, respectively, 1) TOPSIS
decreased and increased. This metric is defined using the There exist several MCDM methods for evaluating various
corresponding equation listed in Table 2. contradictory criteria in decision making, among them tech-
The Overlapping Normalized Mutual Information is an niques for order preference by similarity to an ideal solu-
extension of the NMI, which is developed to handle the tion (called TOPSIS), which is the most frequently used
overlapping partitions. This metric is defined using the cor- method by many researchers in different disciplines [58].
responding equation listed in Table 2. Even though some studies considered using more than one
The Purity index is a straightforward and simple external MCDM method for their decision making (e.g., VIKOR and
performance metric for measuring the quality of clustering. DEA methods), they found that in many cases, some of these
In this index, the purity is calculated by assigning each methods do not produce reliable results in different prob-
cluster to the most frequent class in the cluster. Thereafter, lem domains and the output mostly depends on the applied
by counting the correctly assigned documents and dividing domain and the problem in hand [59]. However, a significant
by the number of objects, the accuracy of this assignment can number of studies, conducted in different domains, concluded
be measured. This metric is defined using the corresponding that the TOPSIS method is among the strongest decision-
equation listed in Table 2. making methods that can be applied to many different
domains and problems [23]. TOPSIS is therefore selected to each criterion. The normalized value that is weighted (vij ) is
evaluate clustering algorithms in our approach. computed as:
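Before turning to the ranking procedure, note that several of the external measures above translate directly to code. The following is a minimal NumPy sketch (not the study's implementation; the example labels are invented) of the entropy and NMI definitions:

```python
import numpy as np

def entropy(labels):
    """H(N) = -sum_i P(n_i) * log2 P(n_i), with P(n_i) estimated from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(P, C):
    """MI(P, C) computed from the joint counts of class labels P and cluster labels C."""
    n, mi = len(P), 0.0
    for a, na in zip(*np.unique(P, return_counts=True)):
        for b, nb in zip(*np.unique(C, return_counts=True)):
            nab = np.sum((P == a) & (C == b))
            if nab > 0:
                mi += (nab / n) * np.log2(nab * n / (na * nb))
    return mi

def nmi(P, C):
    """NMI: mutual information scaled into [0, 1]; normalized here by the
    geometric mean of the two entropies (one common convention)."""
    h = np.sqrt(entropy(P) * entropy(C))
    return mutual_information(P, C) / h if h > 0 else 1.0

# Invented labels: the clustering matches the classes exactly, up to renaming.
classes  = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([1, 1, 1, 0, 0, 0])
print(nmi(classes, clusters))   # → 1.0
```

As the perfect-match example shows, NMI is invariant to relabeling the clusters, which is exactly why it is usable as an external index.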
This technique, developed by [60], was first used to rank alternatives across many criteria. Basically, by minimizing the distance to the ideal solution and maximizing the distance to the negative-ideal solution, TOPSIS aims to find the best alternatives. Below is a step-by-step description of the TOPSIS process used in this study.

In the first step, the normalized decision matrix is computed using Equation (1). This transforms the dimensions of the various attributes into nondimensional attributes, making it possible to compare the attributes:

$r_{ij} = \dfrac{x_{ij}}{\sqrt{\sum_{i=1}^{n} x_{ij}^2}}$    (1)

In this equation, i and j index the alternatives (clustering methods here) and the criteria (evaluation metrics here), while $x_{ij}$ represents the evaluation metric of the i-th criterion $C_i$ for the alternative $A_j$.

In the second step, the weighted normalized decision matrix is computed by constructing a set of weights ($w_i$) for each criterion. The weighted normalized value ($v_{ij}$) is computed as:

$v_{ij} = w_i r_{ij}, \quad j = 1, \ldots, n; \; i = 1, \ldots, m$    (2)

where $w_i$ is the weight of the i-th criterion and $\sum_{i=1}^{m} w_i = 1$.

In the third step, both the ideal and negative-ideal solutions are found using Equations (3) and (4):

$A^{+} = \left\{ \left(\max_i v_{ij} \mid j \in I\right), \left(\min_i v_{ij} \mid j \in J\right) \mid i = 1, 2, \ldots, m \right\} = \left\{ v_1^{+}, v_2^{+}, \ldots, v_j^{+}, \ldots, v_n^{+} \right\}$    (3)

$A^{-} = \left\{ \left(\min_i v_{ij} \mid j \in I\right), \left(\max_i v_{ij} \mid j \in J\right) \mid i = 1, 2, \ldots, m \right\} = \left\{ v_1^{-}, v_2^{-}, \ldots, v_j^{-}, \ldots, v_n^{-} \right\}$    (4)

where

$I = \{ j = 1, 2, \ldots, n \mid j \text{ is associated with benefit criteria} \}$,
$J = \{ j = 1, 2, \ldots, n \mid j \text{ is associated with cost criteria} \}$.

This experiment, which aims to evaluate clustering methods, considers the variation of information and the DI as the cost criteria that require minimization and all other metrics as benefit criteria that require maximization.

In the fourth step, the separation measures from both the ideal and negative-ideal solutions are calculated using Equations (5) and (6):

$S_i^{+} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - v_j^{+}\right)^2}, \quad i = 1, 2, \ldots, m$    (5)

$S_i^{-} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - v_j^{-}\right)^2}, \quad i = 1, 2, \ldots, m$    (6)

In the fifth step, the ratio that measures the relative closeness to the ideal solution is computed using Equation (7):

$C_i^{*} = \dfrac{S_i^{-}}{S_i^{+} + S_i^{-}}, \quad 0 < C_i^{*} < 1, \; i = 1, 2, \ldots, m$, with $C_i^{*} = 1$ if $A_i = A^{+}$ and $C_i^{*} = 0$ if $A_i = A^{-}$    (7)

In the final step, the alternatives are ranked by maximizing the ratio $C_i^{*}$. The time complexity of TOPSIS has three stages, as follows: the complexity of the attribute normalization and weighting is O(n²); the complexity of finding the positive and negative ideal solutions and the distances to them is O(n); and the complexity of ranking the alternatives is O(1).
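The six steps above can be sketched compactly in NumPy. This is a minimal illustration under the benefit/cost convention just described, not the study's implementation; the toy decision matrix is invented, and the criterion weights are simply taken as equal:

```python
import numpy as np

def topsis(X, weights, cost_mask):
    """Rank m alternatives (rows) on n criteria (columns) with TOPSIS.
    X: (m, n) decision matrix; weights: length-n vector summing to 1;
    cost_mask: boolean length-n vector, True for cost criteria (minimized)."""
    # Step 1: vector-normalize each criterion column (Eq. 1).
    R = X / np.sqrt((X ** 2).sum(axis=0))
    # Step 2: apply the criterion weights (Eq. 2).
    V = R * weights
    # Step 3: ideal and negative-ideal solutions (Eqs. 3-4):
    # best value is the max for benefit criteria and the min for cost criteria.
    A_pos = np.where(cost_mask, V.min(axis=0), V.max(axis=0))
    A_neg = np.where(cost_mask, V.max(axis=0), V.min(axis=0))
    # Step 4: Euclidean separation from both solutions (Eqs. 5-6).
    S_pos = np.sqrt(((V - A_pos) ** 2).sum(axis=1))
    S_neg = np.sqrt(((V - A_neg) ** 2).sum(axis=1))
    # Step 5: relative closeness to the ideal solution (Eq. 7).
    C = S_neg / (S_pos + S_neg)
    # Final step: rank alternatives by decreasing closeness.
    return C, np.argsort(-C)

# Toy example: 3 clustering methods scored on 2 benefit metrics and 1 cost metric.
X = np.array([[0.9, 0.8, 0.2],
              [0.6, 0.7, 0.1],
              [0.3, 0.4, 0.9]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])
cost_mask = np.array([False, False, True])   # e.g., the variation of information
closeness, ranking = topsis(X, weights, cost_mask)
print(ranking)   # → [0 1 2]: method 0 ranked best
```

In this toy matrix, method 0 dominates both benefit criteria, so it ranks first despite its slightly higher cost score, while method 2 coincides with the negative-ideal solution and gets closeness 0.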
IV. EXPERIMENTS
This section first describes the datasets developed for this study and then presents our proposed approach in the form of an algorithm that automates the whole process of comparing and selecting the best algorithm for each dataset. Finally, the results from our experiment are discussed to validate our proposed evaluation approach.

A. DATASETS AND DATA PREPROCESSING
To properly compare clustering methods, several datasets with various sizes and numbers of features are required. Several studies in different fields have used artificial datasets to compare clustering methods. However, the results and guidelines from most of these studies are not applicable to real-world datasets because such datasets mostly lack the normal distribution of data samples, as characterized by certain features that are divided into certain classes. This experiment used students' activity data extracted from the Moodle system of the University of Tartu in Estonia. The blended courses used were taught in the Institute of Education, including ''Pedagogical Traineeship'', ''Continuous Pedagogical Traineeship [part 1]'', ''Continuous Pedagogical Traineeship [part 2]'', ''Basics of Learning'', ''Teaching and Reflection'', ''Teacher Identity and Leadership [part 1]'', and ''Teacher Identity and Leadership [part 2]''. To create our datasets, for each course, different types of data were extracted, including the number of times the course resource was viewed, the number of times the course modules were viewed, the number of times the course materials were downloaded, the number of times the feedback was viewed, the number of times the feedback was received, the number of times a forum discussion was viewed, the number of attempts at quizzes, the number of times a discussion was created in a forum, the number of times book chapters were viewed, the number of times a book list was viewed, the number of times an assignment was submitted, the number of times an assignment was viewed, the assignment grade, the quiz grades, the number of times a discussion was viewed in a forum, the final grade, the number of times a post was created in a forum, the number of times comments were viewed, the number of times posts were updated in a forum, and the number of posts.

Using the extracted data, this study created nine different datasets that have different student numbers and feature sizes. For example, the first dataset contains ten features with a small number of students, the second dataset includes ten features with a medium number of students, and the third dataset contains ten features with a large number of students. In addition to these nine datasets, a normalized version of each dataset was generated using the min-max normalization technique to investigate the possible effect of data normalization on the clustering algorithm performance. Basically, this normalization technique is used in this study because the attributes are mostly measured on different scales. To do so, the technique maps the minimum and maximum values in X to 0 and 1, respectively, allowing the entire range of values of X to be mapped to the range 0 to 1. Note that this study used more than one course with similar attributes to generate the datasets with a large number of students. Additionally, before applying the clustering methods, according to the guidelines provided by the handbook of EDM [61], the authors performed data cleaning and preprocessing to address the sparsity issue and reduce the effect of noise and outliers. Furthermore, a filter-based feature selection approach was used to determine the high-potential variables to form our datasets for the experiment (small, medium, and high numbers of features). The description of the datasets is given in Table 3.

B. THE PROPOSED EVALUATION APPROACH
Algorithm 1 illustrates our proposed generalizable approach to systemizing the process of comprehensively evaluating clustering algorithms and selecting the most suitable algorithm for a given problem. Fig. 1 also shows the framework of our proposed approach.
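The core loop of such an approach — normalize, cluster with several methods, and collect one row of performance measures per method — can be sketched with scikit-learn-style estimators. This is an illustrative skeleton rather than the authors' Algorithm 1: the synthetic data, the two candidate methods, and the two internal measures are placeholders (the full study compares seven methods on seventeen measures):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for one Moodle activity dataset (10 features).
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
X = MinMaxScaler().fit_transform(X)   # the min-max normalization step described above

# A subset of the candidate clustering methods.
methods = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Build the decision matrix: one row per method, one column per measure.
rows = []
for name, model in methods.items():
    labels = model.fit_predict(X)
    rows.append([silhouette_score(X, labels),        # benefit: higher is better
                 davies_bouldin_score(X, labels)])   # cost: lower is better
decision_matrix = np.array(rows)
print(decision_matrix.shape)   # → (2, 2)
```

The resulting matrix of alternatives-by-criteria is exactly the input that the TOPSIS procedure of the previous section ranks.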
TABLE 4. Values of the performance metrics for clustering algorithms in datasets SF-LS, MF-SS, and LF-LS.
V. RESULTS AND DISCUSSION
Table 4 shows the values of the performance metrics for the clustering algorithms in the datasets with a small number of features and a large number of students (SF-LS), a medium number of features and a small number of students (MF-SS), and large numbers of features and students (LF-LS). These three datasets are presented as a sample (datasets with small, medium, and large feature numbers), and the full list of performance measure results is provided in the Appendix (Tables 6 and 7). As is apparent, there is no clustering algorithm that could reach the best performance in all of the datasets. In addition to the methods that show limited performance in large datasets, some other methods, including agglomerative clustering, appear to perform poorly. These findings are in line with those reported, in disciplines other than education, by [34], [35], and [37], who concluded that hierarchical clustering methods (e.g., agglomerative) perform poorly on large datasets, leading to low accuracy in experiments.

Considering the performance of all of the algorithms on the different datasets, it can be said that spectral clustering is preferred in datasets with a small or medium number of features (15 or fewer), regardless of the number of students. However, it tends to perform poorly in datasets with a large number of features (more than 15). For datasets with a large number of features, OPTICS strongly outperforms the other methods and appears to be the most stable algorithm. Surprisingly, k-means, which has been frequently employed by researchers in EDM and LA research scenarios that mostly lacked comparisons of clustering algorithms, was never ranked the best on any dataset. These findings can serve to notify the educational research community to take such comparison tasks more seriously and possibly take more measures before instinctively choosing a clustering algorithm, e.g., k-means. In this regard, our findings are consistent with those reported by researchers in different disciplines, e.g., [62].

A. EFFECT OF NORMALIZATION ON THE PERFORMANCE OF CLUSTERING METHODS
According to Table 5 and Fig. 3a, for the normalized dataset with small numbers of features and students, DBSCAN outperforms the other algorithms, while spectral clustering tends to be the lowest performing method. For the normalized dataset with a small number of features and a medium number of students, OPTICS and agglomerative clustering tend to have the best and worst performance, respectively. In regard to the normalized dataset with a small number of features and a large number of students, k-medoids outperforms the other methods, while spectral clustering is among the lowest performing methods. Thus, depending on the number of students, in datasets with a small number of features, DBSCAN, OPTICS, and k-medoids appear to be better options than the other clustering methods. While spectral clustering and DBSCAN were strongly affected by normalization, the results for the other methods are fairly similar to those on the nonnormalized datasets, with a slight improvement in performance. In other words, all of the clustering algorithms improved their performance with normalization of the datasets, except for spectral clustering. More specifically, normalization worsens the performance of spectral clustering and improves the performance of all of the other methods.
For the normalized datasets with a medium number of features, regardless of the number of students, considering both the first and second ranks, our findings show that agglomerative and spectral clustering are the best and worst methods, respectively (see Fig. 3b). Regardless of the number of students, in datasets with a medium number of features, almost all of the clustering algorithms improved their performance with normalization of the datasets, except for spectral clustering. More specifically, normalization reduces the spectral performance and mostly improves the performance of the other methods. K-means, EM, and agglomerative clustering appear to benefit strongly from normalization. In nonnormalized datasets with a medium number of features, spectral clustering showed the best performance, while in normalized datasets with a medium number of features, agglomerative clustering ranked the best. Therefore, it is fair to say that spectral and agglomerative clustering are among the high-performing algorithms in nonnormalized and normalized datasets with a medium number of features (i.e., 15 features), respectively.

Finally, considering both the first- and second-ranked methods in normalized datasets with a large number of features (20 features here), regardless of the number of students, it can be said that the k-medoids algorithm has the highest performance and is the most stable clustering method (see Fig. 3c). Furthermore, in addition to the normalized dataset with a large number of features and a small number of students, OPTICS mostly does not perform well in the other normalized datasets with a large number of features. This finding also indicates a negative effect of normalization on the OPTICS method.

Interestingly, the spectral algorithm, which was negatively affected by normalization for both small and medium numbers of features, shows better performance with normalized datasets that have a large number of features. In contrast, OPTICS and DBSCAN appear to be mostly negatively affected by normalization when they address datasets with a large number of features. Last, the k-medoids approach tends to benefit strongly from the normalization of datasets with a large number of features. In nonnormalized datasets with a large number of features, OPTICS was shown to have the best performance, while in normalized datasets with a large number of features, k-medoids is ranked the best. Therefore, it is fair to say that OPTICS and k-medoids are among the highest performing algorithms in nonnormalized and normalized datasets with a large number of features (more than 15), respectively.

In brief, these findings suggest that normalization can have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to mostly have a positive impact on several other clustering methods. This result is in line with those reported by other researchers, e.g., [29], who state that spectral
FIGURE 2. Results of TOPSIS ranking for different datasets: (a) datasets with a small number of
features (i.e., 10), (b) datasets with a medium number of features (i.e., 15), (c) datasets with a large
number of features (i.e., 20).
FIGURE 3. Results of TOPSIS ranking for different normalized datasets: (a) normalized datasets with
a small number of features (i.e., 10), (b) normalized datasets with a medium number of features
(i.e., 15), (c) normalized datasets with a large number of features (i.e., 20).
TABLE 6. (Continued) Values of performance metrics for clustering algorithms in nonnormalized datasets.
clustering, in general, appears to be sensitive to different measures. On the other hand, methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids clustering benefits more from normalization in datasets with more than 15 features. Consequently, our findings reveal that in normalized datasets with more than 10 features, the agglomerative and k-medoids methods are preferred, while in datasets with 10 or fewer features, different clustering methods could be employed; in general, the differences between the performances of the clustering methods are not considerable.
VI. CONCLUSION AND FUTURE DIRECTIONS
While several clustering techniques have been employed in an educational context, it is still controversial which technique would perform the best for a given dataset. Even though a growing number of researchers highlight the importance of comparing and identifying the most suitable clustering methods for the problem at hand, many educational studies ignore this critical task. This limitation could potentially have a negative effect on the prediction power of both EDM and LA approaches. This study models this evaluation task as a multiple criteria decision-making problem and employs the TOPSIS method to rank the results produced by multiple internal and external performance measures when applied to several clustering algorithms. This approach helps to automatically recommend the best clustering method for each dataset. In this study, 18 different datasets (created from the University of Tartu's Moodle system) with different sizes were evaluated to not only find the most suitable clustering method for each dataset but also investigate the possible effect of using different sizes and types of educational datasets on the performance of clustering algorithms in two-way clustering. Furthermore, this study examined the effect of normalization on the quality of the clusters produced by different clustering methods.

Our findings reveal that in regard to datasets with 15 or fewer features, regardless of the number of students, spectral clustering appears to be more robust and outperforms the other clustering methods. However, it tends to perform poorly in datasets with a large number of features (more than 15). This finding implies that spectral clustering is sensitive to increases in the number of features and the data size. Concerning the most suitable clustering method for datasets with a large number of features, OPTICS strongly outperforms the other methods
TABLE 7. (Continued) Values of performance metrics for clustering algorithms in normalized datasets.
and appears to be the most stable algorithm. Unexpectedly, the k-means method, which has been frequently employed by researchers in educational research (both EDM and LA research), could not rank first in any of the datasets. These findings could motivate the educational community to pay more attention to the evaluation task when using clustering methods because using a low-performing method could possibly reduce the prediction power of their EDM and LA approaches.

With regard to the effect of normalization on the performance of clustering methods, this study found that normalization could have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to have a positive impact on all of the other clustering methods. More explicitly, the spectral algorithm, which was negatively affected by normalization for both small and medium numbers of features, shows better performance with normalized datasets that have a large number of features. Methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids benefits more from normalization in datasets with more than 15 features. In conclusion, it can be said that, in general, normalization can have a negative effect on the performance of certain methods, e.g., spectral and OPTICS; however, it mostly appears to have a positive impact on the other clustering methods.

A few limitations of this study could be regarded as directions for future research. For example, the proposed evaluation approach was applied to a rather small data size for two-way clustering. In future work, the proposed automatic approach could be applied to larger educational datasets with a larger value of k. Another direction for future work would be to employ more advanced or heuristic clustering methods that are capable of automatically reducing the effect of noise and outliers in large datasets, while addressing the issue of dimensionality and offering a good level of computation ([13], [14]). Finally, another interesting direction for future work would involve implementing similar experiments on more datasets from different educational means and sources, such as games or intelligent tutoring systems, to possibly compare with our findings from this study and provide wider recommendations and guidelines to researchers and practitioners in the educational community.

APPENDIX
See Tables 6 and 7.
Neural Inf. Process. Syst., 2003, pp. 463–470.
[27] L. Peel, D. B. Larremore, and A. Clauset, ‘‘The ground truth about
REFERENCES metadata and community detection in networks,’’ Sci. Adv., vol. 3, no. 5,
[1] D. R. Garrison and H. Kanuka, ‘‘Blended learning: Uncovering its transfor- May 2017, Art. no. e1602548.
mative potential in higher education,’’ Internet Higher Edu., vol. 7, no. 2, [28] P. D. Antonenko, S. Toy, and D. S. Niederhauser, ‘‘Using cluster analysis
pp. 95–105, Apr. 2004. for data mining in educational technology research,’’ Educ. Technol. Res.
[2] C. Romero, S. Ventura, and E. García, ‘‘Data mining in course management Develop., vol. 60, no. 3, pp. 383–398, Jun. 2012.
systems: Moodle case study and tutorial,’’ Comput. Edu., vol. 51, no. 1, [29] K. L. Eranki and K. M. Moudgalya, ‘‘Evaluation of Web based behavioral
pp. 368–384, Aug. 2008. interventions using spoken tutorials,’’ in Proc. IEEE 4th Int. Conf. Technol.
[3] A. Dutt, M. A. Ismail, and T. Herawan, ‘‘A systematic review on educa- Educ., Jul. 2012, pp. 38–45.
tional data mining,’’ IEEE Access, vol. 5, pp. 15991–16005, 2017. [30] W.-C. Chang, T.-H. Wang, and M.-F. Li, ‘‘Learning ability clustering in
[4] C. Romero and S. Ventura, ‘‘Educational data mining: A survey from 1995 collaborative learning,’’ J. Softw., vol. 5, no. 12, pp. 1363–1370, Dec. 2010.
to 2005,’’ Expert Syst. Appl., vol. 33, no. 1, pp. 135–146, Jul. 2007. [31] A. R. Anaya and J. G. Boticario, ‘‘Clustering learners according to their
[5] C. Anuradha, T. Velmurugan, and R. Anandavally, ‘‘Clustering algorithms collaboration,’’ in Proc. 13th Int. Conf. Comput. Supported Cooperat. Work
in educational data mining: A review,’’ Int. J. Power Control Comput., Design, 2009, pp. 540–545.
vol. 7, no. 1, pp. 47–52, 2015. [32] G. Cobo, D. García-Solórzano, E. Santamaria, J. A. Morán, J. Melenchón,
and C. Monzo, ‘‘Modeling students’ activity in online discussion forums:
[6] C. Romero, S. Ventura, M. Pechenizkiy, and R. S. Baker, Handbook of
A strategy based on time series and agglomerative hierarchical clustering,’’
Educational Data Mining. Boca Raton, FL, USA: CRC Press, 2010.
in Proc. EDM, 2011, pp. 253–258.
[7] M. Pedaste and T. Sarapuu, ‘‘Developing an effective support system for
[33] A. M. Navarro and P. Moreno-Ger, ‘‘Comparison of clustering algorithms
inquiry learning in a Web-based environment,’’ J. Comput. Assist. Learn.,
for learning analytics with educational datasets,’’ Int. J. Interact. Multime-
vol. 22, no. 1, pp. 47–62, Jan. 2006.
dia Artif. Intell., vol. 5, no. 2, pp. 9–16, 2018.
[8] P. Arabie and G. De Soete, Clustering and Classification. Singapore: World
[34] I. G. Costa, F. D. A. T. D. Carvalho, and M. C. P. D. Souto, ‘‘Comparative
Scientific, 1996.
analysis of clustering methods for gene expression time course data,’’
[9] D. Hooshyar, M. Pedaste, and Y. Yang, ‘‘Mining educational data to predict Genet. Mol. Biol., vol. 27, no. 4, pp. 623–631, 2004.
students’ performance through procrastination behavior,’’ Entropy, vol. 22,
[35] M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and
no. 1, p. 12, Dec. 2019.
A. Schliep, ‘‘Clustering cancer gene expression data: A comparative
[10] F. Camastra and A. Verri, ‘‘A novel kernel method for clustering,’’ IEEE study,’’ BMC Bioinf., vol. 9, no. 1, p. 497, Dec. 2008.
Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 801–805, May 2005. [36] Y. G. Jung, M. S. Kang, and J. Heo, ‘‘Clustering performance comparison
[11] L. Jing, M. K. Ng, and J. Z. Huang, ‘‘An entropy weighting k-means using K -means and expectation maximization algorithms,’’ Biotechnol.
algorithm for subspace clustering of high-dimensional sparse data,’’ IEEE Biotechnol. Equip., vol. 28, no. 1, pp. S44–S48, Nov. 2014.
Trans. Knowl. Data Eng., vol. 19, no. 8, pp. 1026–1041, Aug. 2007. [37] T. Kinnunen, I. Sidoroff, M. Tuononen, and P. Fränti, ‘‘Comparison of
[12] R. Suzuki and H. Shimodaira, ‘‘Pvclust: An R package for assessing the clustering methods: A case study of text-independent speaker modeling,’’
uncertainty in hierarchical clustering,’’ Bioinformatics, vol. 22, no. 12, Pattern Recognit. Lett., vol. 32, no. 13, pp. 1604–1617, Oct. 2011.
pp. 1540–1542, Jun. 2006. [38] S. Brohée and J. van Helden, ‘‘Evaluation of clustering algorithms for
[13] M. Sassi Hidri, M. A. Zoghlami, and R. Ben Ayed, ‘‘Speeding up the large- protein-protein interaction networks,’’ BMC Bioinf., vol. 7, no. 1, p. 488,
scale consensus fuzzy clustering for handling big data,’’ Fuzzy Sets Syst., Dec. 2006.
vol. 348, pp. 50–74, Oct. 2018. [39] N. Soni and A. Ganatra, ‘‘Categorization of several clustering algorithms
[14] M. A. Zoghlami, M. S. Hidri, and R. B. Ayed, ‘‘Consensus-driven cluster from different perspective: A review,’’ Int. J., to be published.
analysis: Top-down and bottom-up based split-and-merge classifiers,’’ Int. [40] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya,
J. Artif. Intell. Tools, vol. 26, no. 04, Aug. 2017, Art. no. 1750018. S. Foufou, and A. Bouras, ‘‘A survey of clustering algorithms for big data:
[15] J. Wang, C. Zhu, Y. Zhou, X. Zhu, Y. Wang, and W. Zhang, ‘‘From Taxonomy and empirical analysis,’’ IEEE Trans. Emerg. Topics Comput.,
partition-based clustering to density-based clustering: Fast find clusters vol. 2, no. 3, pp. 267–279, Sep. 2014, doi: 10.1109/TETC.2014.2330519.
with diverse shapes and densities in spatial databases,’’ IEEE Access, vol. 6, [41] P. Singh and A. Surya, ‘‘Performance analysis of clustering algorithms in
pp. 1718–1729, 2018.
[16] L. Wang, S. Ding, and H. Jia, ‘‘An improvement of spectral clustering via message passing and density sensitive similarity,’’ IEEE Access, vol. 7, pp. 101054–101062, 2019.
[17] R. Gelbard, O. Goldman, and I. Spiegler, ‘‘Investigating diversity of clustering methods: An empirical comparison,’’ Data Knowl. Eng., vol. 63, no. 1, pp. 155–166, Oct. 2007.
[18] U. Maulik and S. Bandyopadhyay, ‘‘Performance evaluation of some clustering algorithms and validity indices,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, Dec. 2002.
[19] J. O. McClain and V. R. Rao, ‘‘CLUSTISZ: A program to test for the quality of clustering of a set of objects,’’ J. Mark. Res., vol. 12, no. 4, pp. 456–460, 1975.
[20] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, vol. 382. Hoboken, NJ, USA: Wiley, 2007.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, ‘‘Data clustering: A review,’’ ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[22] M. Z. Rodriguez, C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio, L. D. F. Costa, and F. A. Rodrigues, ‘‘Clustering algorithms: A comparative approach,’’ PLoS ONE, vol. 14, no. 1, Jan. 2019, Art. no. e0210236.
data mining in Weka,’’ Int. J. Adv. Eng. Technol., vol. 7, no. 6, p. 1866, 2015.
[42] C. Fraley, ‘‘How many clusters? Which clustering method? Answers via model-based cluster analysis,’’ Comput. J., vol. 41, no. 8, pp. 578–588, Aug. 1998.
[43] A. K. Jain, A. Topchy, M. H. Law, and J. M. Buhmann, ‘‘Landscape of clustering algorithms,’’ in Proc. ICPR, vol. 1, 2004, pp. 260–263.
[44] J.-O. Palacio-Niño and F. Berzal, ‘‘Evaluation metrics for unsupervised learning algorithms,’’ 2019, arXiv:1905.05667. [Online]. Available: http://arxiv.org/abs/1905.05667
[45] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, ‘‘Understanding of internal clustering validation measures,’’ in Proc. IEEE Int. Conf. Data Mining, Dec. 2010, pp. 911–916.
[46] Y. Lei, J. C. Bezdek, S. Romano, N. X. Vinh, J. Chan, and J. Bailey, ‘‘Ground truth bias in external cluster validity indices,’’ Pattern Recognit., vol. 65, pp. 58–70, May 2017.
[47] A. K. Jain, ‘‘Data clustering: 50 years beyond K-means,’’ Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, Jun. 2010.
[48] H.-S. Park and C.-H. Jun, ‘‘A simple and fast algorithm for K-medoids clustering,’’ Expert Syst. Appl., vol. 36, no. 2, pp. 3336–3341, Mar. 2009.
[49] T. K. Moon, ‘‘The expectation-maximization algorithm,’’ IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47–60, 1996.
[50] W. H. E. Day and H. Edelsbrunner, ‘‘Efficient algorithms for agglomerative hierarchical clustering methods,’’ J. Classification, vol. 1, no. 1, pp. 7–24, Dec. 1984.
[51] U. von Luxburg, ‘‘A tutorial on spectral clustering,’’ Statist. Comput., vol. 17, no. 4, pp. 395–416, Dec. 2007.
[52] D. Birant and A. Kut, ‘‘ST-DBSCAN: An algorithm for clustering spatial–temporal data,’’ Data Knowl. Eng., vol. 60, no. 1, pp. 208–221, Jan. 2007.
[53] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, ‘‘OPTICS: Ordering points to identify the clustering structure,’’ ACM SIGMOD Rec., vol. 28, no. 2, pp. 49–60, 1999.
[54] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, ‘‘On clustering validation techniques,’’ J. Intell. Inf. Syst., vol. 17, nos. 2–3, pp. 107–145, 2001.
[55] E. Rendón, I. Abundez, A. Arizmendi, and E. M. Quiroz, ‘‘Internal versus external cluster validation indexes,’’ Int. J. Comput. Commun., vol. 5, no. 1, pp. 27–34, 2011.
[56] M. Alhabo and L. Zhang, ‘‘Multi-criteria handover using modified weighted TOPSIS methods for heterogeneous networks,’’ IEEE Access, vol. 6, pp. 40547–40558, 2018.
[57] Q. M. Ashraf, M. H. Habaebi, and M. R. Islam, ‘‘TOPSIS-based service arbitration for autonomic Internet of Things,’’ IEEE Access, vol. 4, pp. 1313–1320, 2016.
[58] G.-H. Tzeng and J.-J. Huang, Multiple Attribute Decision Making: Methods and Applications. Boca Raton, FL, USA: CRC Press, 2011.
[59] P. Wanke, C. Barros, and N. P. J. Macanda, ‘‘Predicting efficiency in Angolan banks: A two-stage TOPSIS and neural networks approach,’’ South Afr. J. Econ., vol. 84, no. 3, pp. 461–483, Sep. 2016.
[60] C.-L. Hwang and K. Yoon, ‘‘Methods for multiple attribute decision making,’’ in Multiple Attribute Decision Making. Cham, Switzerland: Springer, 1981, pp. 58–191.
[61] A. Peña-Ayala, Ed., Educational Data Mining: Applications and Trends, vol. 524. Cham, Switzerland: Springer, 2013.
[62] Z. Ansari, M. F. Azeem, W. Ahmed, and A. V. Babu, ‘‘Quantitative evaluation of performance and validity indices for clustering the Web navigational sessions,’’ World Comput. Sci. Inf. Technol., vol. 1, no. 5, pp. 217–226, Jun. 2011. [Online]. Available: https://arxiv.org/abs/1507.03340v1

YEONGWOOK YANG received the master’s degree in computer science education and the Ph.D. degree from the Department of Computer Science and Engineering, Korea University, Seoul, South Korea. He is currently a Senior Researcher with the University of Tartu, Tartu, Estonia. His research interests include information filtering, recommendation systems, educational data mining, and deep learning.

MARGUS PEDASTE (Senior Member, IEEE) received the master’s degree in biology education and the Ph.D. degree in biology and earth science education from the University of Tartu, Tartu, Estonia. He is currently a Professor of educational technology with the University of Tartu. His research interests include the effects of educational technologies, virtual and augmented reality, science education, and teacher education and agency.