Clustering Algorithms in an Educational Context: An Automatic Comparative Approach
2020.
Digital Object Identifier 10.1109/ACCESS.2020.3014948
ABSTRACT Despite an increasing consensus regarding the significance of properly identifying the most
suitable clustering method for a given problem, a surprising amount of educational research, including both
educational data mining (EDM) and learning analytics (LA), neglects this critical task. This shortcoming
could in many cases have a negative impact on the prediction power of both EDM- and LA-based
approaches. To address such issues, this work proposes an evaluation approach that automatically compares
several clustering methods using multiple internal and external performance measures on 9 real-world
educational datasets of different sizes, created from the University of Tartu’s Moodle system, to produce
two-way clustering. Moreover, to investigate the possible effect of normalization on the performance of the
clustering algorithms, this work performs the same experiment on a normalized version of the datasets. Since
such an exhaustive evaluation includes multiple criteria, the proposed approach employs a multiple criteria
decision-making method (i.e., TOPSIS) to rank the most suitable methods for each dataset. Our results
reveal that the proposed approach can automatically compare the performance of the clustering methods
and accordingly recommend the most suitable method for each dataset. Furthermore, our results show that
in both normalized and nonnormalized datasets of different sizes with 10 features, DBSCAN and k-medoids
are the best clustering methods, whereas agglomerative and spectral methods appear to be among the most
stable and highly performing clustering methods for such datasets with 15 features. Regarding datasets with
more than 15 features, OPTICS is among the top-ranked algorithms among the nonnormalized datasets, and
k-medoids is the best among the normalized datasets. Interestingly, our findings reveal that normalization
may have a negative effect on the performance of certain methods, e.g., spectral clustering and OPTICS;
however, it appears to have a mostly positive impact on the other clustering methods.
INDEX TERMS Educational context, clustering methods, multiple criteria decision-making, educational
data mining, learning analytics.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
146994 VOLUME 8, 2020
D. Hooshyar et al.: Clustering Algorithms in an Educational Context: An Automatic Comparative Approach
learning analytics (LA), and their applications—which usually aim to provide a helpful understanding of the learning process to both students and instructors by converting raw educational data into useful information—are some possible responses to this challenge [3]–[5]. To uncover hidden patterns and knowledge from educational data, researchers usually employ various EDM or LA approaches, e.g., supervised and unsupervised learning techniques (for a systematic review of EDM research, see [3]). Even though high prediction accuracy can be achieved using supervised learning methods, they are often inapplicable to educational data with no predefined class labels [6]. Unsupervised learning methods, henceforth referred to as clustering, not only are capable of discovering hidden, underlying patterns and structures in data without labels but also can be used to label unlabeled data for the possible use of supervised methods.

The main aim of clustering methods is to find and form groups of objects, or clusters, that are more similar to one another. The practical value of this capability is to enable personalized differentiation of learning processes, which has been key to improving learning outcomes (e.g., [7]). In other words, clustering methods can be regarded as the task of modeling the data in the form of a simplified set of properties, providing insightful explanations about important aspects of a dataset. Supervised methods are usually less demanding than clustering; however, clustering methods can provide more intuition when addressing complex data [8]. In the field of education, several clustering approaches have been applied to different variables, such as student motivation and behavior, time spent on learning tasks, and so on [9]. The performance of clustering methods can differ significantly for different types of data and applications because they usually operate in spaces with different dimensions and must deal with incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed for such reasons [10]–[16], making the selection of the most appropriate clustering method for a given dataset or problem a difficult challenge. Thus, comparing and recommending clustering algorithms in an educational context could be beneficial to both researchers and practitioners in the field. Similar to other disciplines, the evaluation of a learning method's performance is a crucial challenge in the education field. While a limited set of performance measures suffices to evaluate supervised methods, assessing clustering methods is more difficult owing to the nature of cluster analysis (e.g., [17]–[20]).

In cluster analysis, one key question that must be addressed is cluster validation, i.e., the evaluation of a clustering algorithm's quality. This task is intrinsically challenging due to the lack of objective measures. Basically, there exist different types of validation methods for clustering algorithms [21]; among them, internal and external validation are the most frequently used by researchers [22]. Essentially, these measures assess clustering methods from different viewpoints, and in practice, there is no clustering method that could possibly reach the best performance on all of these performance metrics for a given problem domain [23]. A number of studies revolve around developing performance measures for clustering methods with the aim of determining the appropriateness of the produced clusters [24], [25]. However, surprisingly, although there is an increasing consensus concerning the importance of properly identifying the best clustering method and subsequently interpreting the produced result for a given problem, a limited number of research studies [25]–[27], if any, have comprehensively considered both internal and external measurements for the evaluation of clustering methods in an educational context. In addition to the tedious process behind the experimentation and the data preprocessing, one main reason is that cluster evaluation normally involves multiple conflicting criteria (due to a large number of external and internal metrics). To address this issue—comprehensively evaluating the performance of different clustering methods on educational datasets (e.g., from Moodle)—this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the automatic evaluation of clustering methods in an educational context. Our approach, inspired by [23], involves empirically studying the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics (including three internal and multiple external metrics) are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM method, called TOPSIS, which takes the performance measures as inputs, to rank each clustering algorithm for each dataset as a means of validating our proposed evaluation approach. This approach makes it possible to find and recommend the most suitable clustering method for different educational datasets to practitioners, researchers, and decision makers. It should be noted, however, that the findings reported in this study are relevant and limited to educational datasets that contain students' (online) activity data; for other types of educational data, other results could be anticipated. The main contributions of this work can be summarized as follows:
1) To model a cluster evaluation task as a multiple criteria decision-making problem and accordingly propose a novel approach for the evaluation of clustering methods in an educational context.
2) To automatically find and recommend the most suitable clustering method for different educational datasets, for practitioners, researchers, and decision makers.
3) To systematically and comprehensively compare well-known clustering methods used in an educational context using multiple real-world educational datasets (with different sizes and performance measures).
4) To investigate the effect of normalization on the performance of clustering algorithms in an educational context (on datasets with different sizes and feature numbers).
Sections II and III of this paper review the related work and describe the materials and methods, respectively.
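The TOPSIS ranking step just described can be made concrete with a small sketch: each clustering method is a row of a decision matrix, each performance metric a column, and the method closest to the ideal solution (and farthest from the anti-ideal) ranks first. This is only an illustration under assumed inputs—the scores, weights, and method names below are hypothetical, not values from this study:

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives (rows) on criteria (columns) with TOPSIS.

    matrix  : (m, n) decision matrix of performance scores
    weights : (n,) criterion weights summing to 1
    benefit : (n,) True if higher is better for that criterion
    Returns closeness scores in [0, 1]; higher means a better rank.
    """
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the weights.
    v = m / np.linalg.norm(m, axis=0) * np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit)
    # The ideal solution takes the best value per column; the anti-ideal the worst.
    best = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_best = np.linalg.norm(v - best, axis=1)
    d_worst = np.linalg.norm(v - worst, axis=1)
    return d_worst / (d_best + d_worst)

# Hypothetical scores: rows = clustering methods, columns = three metrics
# (silhouette: higher is better, Davies-Bouldin: lower is better, ARI: higher is better).
scores = np.array([[0.62, 0.80, 0.55],   # e.g., k-means
                   [0.58, 0.70, 0.60],   # e.g., k-medoids
                   [0.45, 1.10, 0.40]])  # e.g., DBSCAN
closeness = topsis(scores, weights=[1/3, 1/3, 1/3], benefit=[True, False, True])
ranking = np.argsort(-closeness)  # indices of methods, best first
```

In this toy matrix the third method is worst on every criterion, so it coincides with the anti-ideal solution and receives closeness 0; the ranking then reflects the trade-offs between the remaining methods.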
Sections IV and V illustrate the experiments and results, and Section VI provides the conclusions and future directions.

II. RELATED WORK
A significant number of EDM and LA studies have applied clustering methods to educational data to provide instructors and students with useful insights into their learning processes. For example, in studies conducted by [28] and [29], different clustering methods, including k-means, were employed to cluster the e-learning behavior of students and to determine the impact of human characteristics on students' preferences, respectively. Similarly, using item response theory and k-means, [30] could successfully identify the learning ability of students in a collaborative-learning environment. In a different attempt, [31] employed expectation–maximization clustering to identify active and passive collaborators within a group of students at a European university. Moreover, [32] could determine the different behavior patterns adopted by students in online discussion forums using an agglomerative clustering method. According to the systematic review conducted by [3], clustering methods are generally applied in EDM, LA, and broad educational research for several reasons, including the analysis of students' motivations, attitudes, and behavior and the understanding of students' learning styles. Furthermore, that review highlights k-means as the clustering method most frequently used in EDM and LA research. However, although it deeply reviews 30 years of clustering methods in the educational context, it contains no indication of comprehensive comparative studies using educational datasets. A few studies do use educational datasets to compare clustering methods (e.g., [33]), but these mainly neglect systematically comparing well-known clustering methods, considering the application of multiple performance measures, considering datasets with different sizes and characteristics, studying datasets with different feature numbers (e.g., large and medium), investigating the possible effect of normalization on a given dataset, and so forth. Furthermore, among those few studies that evaluate the quality of clustering algorithms, there is no study that approaches such an evaluation task using a combination of multiple criteria.

On the other hand, there are several studies, in other disciplines, that revolve around systematically and comprehensively comparing clustering methods with the aim of providing guidelines and recommendations to researchers and practitioners [22], [23], [34]–[37]. In the study conducted by [23], for example, the authors used different performance metrics to assess clustering algorithms in the field of financial risk analysis, employing a decision-making algorithm to choose the best method on three different datasets. The results of their study confirmed that it is unlikely to find a single clustering method with the best performance on all performance measures. Additionally, they highlighted repeated-bisection as the most suitable clustering method in their study. In the context of the text-independent speaker verification task, a comparative study of clustering methods was conducted by [37]. The authors used three datasets of documents and six different clustering methods, including k-means, random swap, expectation–maximization, hierarchical clustering, self-organizing maps (SOM), and fuzzy c-means. According to their findings, with a small number of clusters, SOM and hierarchical methods perform weakly and have lower accuracy than the other methods. In the study conducted by [35], four performance measures were used to evaluate five clustering methods—namely, k-means, multivariate Gaussian mixture, hierarchical clustering, and spectral and nearest-neighbor methods—on 35 gene expression datasets. The results indicate that the multivariate Gaussian mixture method outperforms the other methods, that the spectral method is sensitive to the proximity measure used, and that k-means shows a good, stable performance, similar to the Gaussian mixture method.

Using six datasets to model protein interactions in the yeast Saccharomyces cerevisiae, [38] conducted a comparative study of clustering methods (i.e., Markov clustering, restricted neighborhood search clustering, super-paramagnetic clustering, and molecular complex detection) and found that restricted neighborhood search clustering is the most robust and has the highest performance with respect to variation in the choice of parameters used in the clustering methods. They moreover found that the other clustering methods behave more stably with regard to dataset alterations. Finally, [22] conducted a broad comparative study of nine clustering methods—including k-means, CLARA, hierarchical, expectation–maximization, hcmodel, spectral, subspace, OPTICS, and DBSCAN—using 400 different artificial datasets of different sizes. Their findings reveal that spectral methods tend to show good performance (compared to the others) under the default configuration of the clustering methods.

Evidently, despite an increasing consensus regarding the significance of properly identifying the clustering method that is best suited for a given problem, a surprising amount of educational research (including EDM and LA) neglects this critical task [25]–[27]. While several clustering techniques have been employed in an educational context, it remains controversial which technique would perform best for a given dataset [22]. As previously discussed, to the best of our knowledge, there are no comprehensive comparative studies that consider multiple performance measures (both internal and external), well-known clustering methods, and real-world educational datasets (with different sizes and features) to produce two-way clustering and accordingly provide researchers and practitioners in education disciplines with appropriate, much-needed guidelines. Such guidelines would eventually help to improve EDM and LA research and practice. To fill this gap, this study puts forward an approach for the evaluation of clustering methods in an educational context, possibly finding the most suitable clustering method for a dataset at hand. This approach could potentially help to ensure the suitability of a clustering method for a given dataset.
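The comparative studies surveyed above all rely on external measures that score a clustering against reference labels. As a simple, self-contained illustration (not code from any of the cited studies), two classic external measures—purity and the Rand index—can be written in a few lines; the toy label vectors below are made up:

```python
from collections import Counter
from itertools import combinations

def purity(labels_true, labels_pred):
    """Purity: fraction of points whose true class is the majority
    class of their predicted cluster (1.0 means perfectly pure)."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(labels_true)

def rand_index(labels_true, labels_pred):
    """Rand index: share of point pairs on which the two labelings
    agree (grouped together in both, or separated in both)."""
    idx_pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) ==
                (labels_pred[i] == labels_pred[j])
                for i, j in idx_pairs)
    return agree / len(idx_pairs)

# Toy example: ground truth vs. a clustering with one misplaced point.
truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]
p = purity(truth, pred)      # 5 of 6 points sit with their cluster's majority class
r = rand_index(truth, pred)
```

Both measures reach 1.0 only for a clustering that matches the reference partition, which is why comparative studies typically report several such measures side by side.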
Indeed, in many EDM or LA studies, the classes generated by clustering methods are later used as ground truth for classification methods, and wrongly clustered data can possibly affect the prediction power of the approach.

III. MATERIALS AND METHODS
Even though a diversity of clustering methods has been proposed and employed in the literature, some methods are used more often than others [3]. Moreover, several frequently used methods employ similar mathematical concepts—e.g., graph or spectral clustering similarity matrices—or are based on similar assumptions about the data—e.g., agglomerative and divisive—and thus, in typical usage scenarios, they are expected to produce similar performance. For such reasons, this study considers clustering methods from different families of methods. Several taxonomies have been proposed in the literature, breaking clustering methods down into different families. For example, according to the existing literature, [39] grouped clustering methods into partitioning and hierarchical methods; [23] broke clustering methods down into partitioning, grid-based, hierarchical, model-based, density-based, constraint-based, and frequent pattern-based methods; [40] classified clustering methods as partitioning, hierarchical, density-based, grid-based, and model-based approaches; [41] separated clustering methods into density-based, partitioning, and hierarchical methods; and finally, [22] added more categories and defined clustering methods as partition-based, linkage, model-based, spectral methods, methods based on subspaces, and density-based methods. It is apparent that different families of clustering methods have been formed by several researchers according to different criteria, such as objective functions and cluster structures [42], [43]. This study considers seven well-known methods commonly used in the area of education: expectation–maximization (EM) from the model-based family, k-means and k-medoids from the partitioning family, OPTICS and DBSCAN from the density-based family, spectral clustering from the family of spectral methods, and agglomerative clustering from the hierarchical family. These methods were selected according to the recent systematic review of clustering methods conducted in [3].

Regarding the performance metrics used for evaluating clustering methods, several studies have proposed different metrics (e.g., [44]), among them internal and external metrics. External validation metrics measure the similarity between the clustering method's results and the correct partitioning (predefined class labels), whereas internal validation metrics measure the appropriateness of a clustering structure according to intracluster similarity and intercluster dissimilarity, without using external information (class labels). Both internal and external validation metrics play an important role in choosing an optimal clustering method for a certain dataset [45]. The present study employs three internal metrics—namely, Dunn's Index (DI), Silhouette Index (SI), and Davies-Bouldin Index (DBI)—and 14 external metrics: Adjusted Mutual Information (AMI), Completeness Index (CI), Jaccard Index (JI), Purity Index (PI), Rand Index (RI), Adjusted Rand Index (ARI), Fowlkes-Mallows Index (FMI), F-measure (Fmea), Rogers-Tanimoto (RTI), Normalized Mutual Information (NMI), Mutual Information (MI), Variation of Information (VI), Geometric Accuracy (GA), and Overlapping Normalized Mutual Information (ONMI).

While some of the external metrics are closely related and belong to the same pair-counting category, they possess some differences; for example, they can behave partially with regard to the distribution of class sizes in a partition, and they can be biased toward the number of clusters [22]. Considering a vast number of (internal and external) performance metrics enables us to compare different clustering methods and possibly highlight the best method for a certain dataset regardless of the partiality of the methods [46].

A. CLUSTERING METHODS
This section presents a brief overview of the seven clustering methods used in this study.

K-means is an algorithm that has been broadly employed by researchers in educational research [3]. As input, this algorithm requires a distance metric and the number of clusters (k). In this method, k objects are selected at random as the centers of the clusters. Thereafter, using the minimum squared-error criterion, which measures the distance between a cluster center and an object, all of the objects are grouped into k clusters. In the final stage, each cluster's new mean is computed, and this process repeats until the center of each cluster remains unchanged. The time complexity of this algorithm is O(nkd), where d denotes the number of features, k is the number of clusters, and n is the number of objects. Key advantages of the k-means algorithm are its simple implementation, low computational cost, and good (or acceptable) results in many different practical situations. Nonetheless, it has several disadvantages: it requires the number of clusters to be specified in advance, which can strongly affect the final classification; it is unsuitable for nonconvex clusters; it clusters outliers; it scales poorly with the number of dimensions; and it depends on the initial starting conditions [47].

K-medoids, a modification of k-means, is among the algorithms that have been proposed to address the limitations of k-means. In this method, similar to k-means, the goal is to minimize the distance between the point specified as the cluster center and the data points in that cluster. Unlike k-means, k-medoids selects data points as cluster centers, chosen to have minimal average dissimilarity to the points labeled as belonging to the cluster (i.e., the most centrally situated point in the cluster). The time complexity of k-medoids is O(k(n − k)²). Compared to k-medoids, the k-means method appears to be less flexible in certain situations; for example, it cannot be used with certain similarity measures, such as absolute Pearson correlation, because the distance used must be consistent with the mean; moreover, it appears to be unsuitable for clustering nonspherical groups of objects because it relies on minimizing the
distances between the nonmedoid objects and the medoids; nonetheless, it has a lower computational time and requires less time to run than k-medoids [48].

Expectation–maximization (EM) is a model-based clustering method whose main goal is to define each data object's cluster membership according to a probability. This method is composed of two steps: the expectation step, which estimates the likelihood of each object belonging to a cluster, and the maximization step, which computes the parameters of the distributions so as to maximize the distributions' probabilities in the dataset [49]. The time complexity of this algorithm is O(dni), where i is the number of iterations. Among the advantages of EM are its stability and the low complexity of its implementation. Furthermore, it is regarded as particularly suitable for incomplete datasets. However, this method suffers from some issues: it mostly fails to find small clusters, the obtained clusters can depend strongly on the initial conditions, intractable expectation and maximization steps can be involved, and it is slow to converge.

Agglomerative clustering is a hierarchical method that considers the linkage between data points. In this method, which follows a bottom-up fashion, data objects are grouped into a tree. Basically, it initially places each data point in its own cluster and subsequently combines these clusters into larger clusters. This process iterates until a certain termination condition is reached or all data points gather together in one cluster [50]. In this method, clusters are agglomerated according to the similarity measure between two clusters. The time complexity of this algorithm is O(n²). Some advantages of this method are its low implementation complexity and the ease of deciding on the number of clusters (the dendrogram produced is more informative than the unstructured set of flat clusters produced by k-means). On the other hand, this method is less flexible, especially in undoing previous steps; in other words, once instances have been assigned to a cluster, they can no longer be moved around. Another weakness of this method is its unsuitability for large datasets. Similar to some previously mentioned algorithms, e.g., k-means, the final results of agglomerative methods are strongly affected by the initial seeds. Moreover, this method is vulnerable to outliers, and the final results can be affected by the order of the data.

Spectral clustering is another class of clustering methods that has been developed to address the issues of conventional clustering methods (namely, determining nonlinear discriminative hypersurfaces) [51]. It initially builds an affinity matrix—representing the data by a weighted graph—where the similarity between points m and n is indicated by the value in the m-th row and n-th column. Note that for a weighted-graph representation of the data, this method usually benefits from weighted kernel k-means, which is a generalization of the k-means method. Thereafter, to group the data based on a given criterion, it uses the eigenvalues and eigenvectors of the matrix. Various types of similarity matrices can be used in this method, such as the Laplacian matrix. This algorithm has a time complexity of O(ndki), where n is the number of data points, d is the dimensionality of each point, and i is the number of iterations required for k-means to converge. The overall time complexity of the spectral clustering algorithm is O(nzki), where z is the average number of rows in the similarity matrix. One main advantage of this method is that it does not impose a predetermined shape on the clusters, owing to its definition of an adjacency structure from the original dataset. On the other hand, one disadvantage is the demanding computation of the eigenvectors of the similarity matrix; in other words, it can become computationally expensive when addressing large datasets.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method, which means that in a set of points, it clusters together points with many adjacent neighbors (points that are tightly packed together) [52]. In this approach, points whose adjacent neighbors are far away are considered outliers (points that stand alone in low-density areas). Epsilon (ε) and the minimum number of points needed to create a cluster (η) are the inputs of this method. Initially, it randomly selects an unvisited starting data point. If a sufficient number of data points lie within the ε-neighborhood of this point, a cluster is formed; otherwise, the point is labeled as noise. All data points in the ε-neighborhood of a point that is found to be part of a cluster are also considered part of that cluster. Points initially labeled as noise may later be found within the ε-neighborhood of sufficiently many data points and thereby become members of a cluster. This procedure goes on until a cluster is completely formed. Thereafter, another point that has not been previously visited is chosen and processed similarly, forming the next cluster or set of noise points. DBSCAN has a time complexity of O(n²). DBSCAN performs well when addressing outliers within a dataset and is good at separating high-density clusters from low-density clusters. However, similar to other clustering methods, it suffers from some shortcomings; for example, it struggles with high-dimensional data or clusters with similar (or varying) densities.

OPTICS (Ordering Points To Identify the Clustering Structure) is also a density-based method, based on the maximal density-reachability concept [53]. Similar to DBSCAN's process, this method begins with a data point and grows its neighborhood. However, to solve the issue of DBSCAN—the detection of meaningful clusters in data with varying density—it uses a (linear) ordering in which spatially nearest points become neighbors in the ordering. Furthermore, it stores, for each point, a special distance representing the density that must be accepted for a cluster so that both points can belong to the same cluster. OPTICS has a time complexity of O(n log n). One advantage of this method is its capability of handling clusters with irregular shapes and large density variations; its disadvantages are a higher computational time and a longer running time than DBSCAN.
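The neighborhood-growing procedure that DBSCAN follows, as described above, can be sketched in a few dozen lines. This is an illustrative toy implementation (the points and parameter values are made up), not the implementation used in this study:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point
    (-1 marks noise, 0..k-1 mark clusters). Note that the
    eps-neighborhood includes the point itself (distance 0 <= eps),
    so min_pts counts the point as one of its own neighbors."""
    n = len(points)
    # Precompute pairwise distances and fixed eps-neighborhoods.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # not an unvisited core point (it may still join a cluster later)
        # Grow a new cluster from core point i by expanding neighborhoods.
        queue = [i]
        visited[i] = True
        while queue:
            j = queue.pop()
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:      # expand only through core points
                for k in neighbors[j]:
                    if not visited[k]:
                        visited[k] = True
                        queue.append(k)
        cluster += 1
    return labels

# Two tight groups plus one isolated point, which ends up as noise.
pts = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
                [10.0, 10.0], [10.0, 10.5], [10.5, 10.0],
                [50.0, 50.0]])
labels = dbscan(pts, eps=1.0, min_pts=3)
```

With these (assumed) inputs the two dense groups each form a cluster, while the far-away point never accumulates enough neighbors and keeps the noise label, mirroring the outlier handling described above.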
The Jaccard index is an external performance metric that measures the similarity between objects. It is defined by the corresponding equation in Table 2. In the equation, the number of pairs of objects that have the same label in C and are assigned to the same cluster in P is denoted by x, the number of pairs that have the same label but are assigned to different clusters is denoted by y, and the number of pairs within the same cluster that have different class labels is denoted by j. This index yields a value between 0 and 1, where 1 indicates that C and P are identical.

The F-measure is an external performance metric defined as the harmonic mean of the precision and recall, where the precision and recall indicate the percentage of a cluster with positive points and the true positive rate, respectively. This metric is defined using the corresponding equation listed in Table 2.

The Normalized Mutual Information is an external performance metric that normalizes the score of the mutual information in such a way that the results scale between 0 and 1, where 0 denotes no mutual information and 1 refers to perfect correlation. This metric is defined using the corresponding equation listed in Table 2. In the equation, P and C denote the class and cluster labels, respectively, H(P) and H(C) refer to the entropy of P and C, respectively, and MI(P, C) denotes the mutual information between P and C. The entropy is expressed as H(N) = −Σ_{i=1}^{N} P(n_i) log₂ P(n_i).

The Adjusted Mutual Information is an external performance metric that adjusts the mutual information (MI) score by considering chance normalization. In other words, it accounts for the fact that MI usually yields a higher score for a larger number of clusters, regardless of whether there is actually more information to share. This metric is defined using the corresponding equation listed in Table 2.

The Rogers and Tanimoto index is an external performance metric that computes the similarity coefficient of two clusterings of the same dataset from observations' comemberships (comembership refers to pairs of observations that are partitioned together). This metric is defined using the corresponding equation listed in Table 2. In the equation, the number of pairs of objects that have the same label and are assigned to the same cluster is denoted by w, and the number of such pairs assigned to different clusters is denoted by x. Moreover, the number of pairs in the same cluster that have different class labels is denoted by y, and the number of pairs that have different labels and were assigned to different clusters is denoted by z.

The Variation of Information index, which is similar to the mutual information index in that it is a straightforward linear expression involving the mutual information, is a distance measure between two clusterings. In this index, unlike the mutual information index, the triangle inequality is followed.

The Fowlkes-Mallows index is an external performance metric that is widely employed to measure the similarity between two formed clusterings; this similarity can be measured between a benchmark classification and a clustering or between two hierarchical clusterings. This metric is defined using the corresponding equation listed in Table 2.

The Mutual Information is an external performance metric that measures the similarity between two labelings of the same data. This metric is defined using the corresponding equation listed in Table 2.

The Completeness index is an external performance metric that indicates a clustering result's completeness. In other words, the completeness of a clustering result is satisfied when all of the objects of the same cluster are members of a given class label. This metric is defined using the corresponding equation listed in Table 2.

C. MULTIPLE CRITERIA DECISION-MAKING APPROACH FOR THE EVALUATION OF CLUSTERING METHODS
The performance of clustering methods can differ significantly for different types of data and applications because they usually operate in spaces with different dimensions and must address incomplete, noisy, and sampled data. Multiple clustering methods have therefore been developed for such reasons. Moreover, a large number of performance measures (internal and external) have been proposed for the evaluation of different clustering methods, which makes the selection of the most appropriate clustering method for a given dataset or problem a difficult challenge.

To address this challenge and comprehensively evaluate the performance of different clustering methods on educational datasets, this study models the cluster evaluation task as a multiple criteria decision-making (MCDM) problem and accordingly proposes a novel approach for the evaluation of clustering methods in an educational context. In our empirical approach, this study examines the performance of seven different clustering methods (from different families) on 18 educational datasets of different sizes to generate two-way clustering results. Thereafter, seventeen performance metrics are used to measure the performance of the produced clustering results. This study then employs a well-known MCDM
regarding the variation of information, and when changing
method (called TOPSIS, see the following section), which
from cluster Pi to Pj , the loss or gain of the information
takes the performance measures as inputs, to rank each clus-
is measured. This metric is defined using the corresponding
tering algorithm for each dataset as a means of validating our
equation listed in Table 2.
evaluation approach (some examples of TOPSIS applications
The Geometric accuracy is acquired by calculating the
can be found in [56], [57]). This makes it possible to find and
sensitivity’s geometrical mean and positive predictive value.
recommend the most suitable clustering method for different
This metric is basically concerned with balancing the sensi-
educational datasets that can serve the needs of practitioners,
tivity and predictive value that reflects two conflicting incli-
researchers, and decision makers (see Fig. 1).
nations of clustering. Usually, when objects with different
and similar complexity are partitioned in the same cluster, the
positive predictive value and the sensitivity are, respectively, 1) TOPSIS
decreased and increased. This metric is defined using the There exist several MCDM methods for evaluating various
corresponding equation listed in Table 2. contradictory criteria in decision making, among them tech-
The Overlapping Normalized Mutual Information is an niques for order preference by similarity to an ideal solu-
extension of the NMI, which is developed to handle the tion (called TOPSIS), which is the most frequently used
overlapping partitions. This metric is defined using the cor- method by many researchers in different disciplines [58].
responding equation listed in Table 2. Even though some studies considered using more than one
The Purity index is a straightforward and simple external MCDM method for their decision making (e.g., VIKOR and
performance metric for measuring the quality of clustering. DEA methods), they found that in many cases, some of these
In this index, the purity is calculated by assigning each methods do not produce reliable results in different prob-
cluster to the most frequent class in the cluster. Thereafter, lem domains and the output mostly depends on the applied
by counting the correctly assigned documents and dividing domain and the problem in hand [59]. However, a significant
by the number of objects, the accuracy of this assignment can number of studies, conducted in different domains, concluded
be measured. This metric is defined using the corresponding that the TOPSIS method is among the strongest decision-
equation listed in Table 2. making methods that can be applied to many different
domains and problems [23]. TOPSIS is therefore selected to each criterion. The normalized value that is weighted (vij ) is
evaluate clustering algorithms in our approach. computed as:
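Before turning to the ranking procedure, note that several of the external measures above translate directly to code. The following is a minimal NumPy sketch (not the study's implementation; the example labels are invented) of the entropy and NMI definitions:

```python
import numpy as np

def entropy(labels):
    """H(N) = -sum_i P(n_i) * log2 P(n_i), with P(n_i) estimated from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(P, C):
    """MI(P, C) computed from the joint counts of class labels P and cluster labels C."""
    n, mi = len(P), 0.0
    for a, na in zip(*np.unique(P, return_counts=True)):
        for b, nb in zip(*np.unique(C, return_counts=True)):
            nab = np.sum((P == a) & (C == b))
            if nab > 0:
                mi += (nab / n) * np.log2(nab * n / (na * nb))
    return mi

def nmi(P, C):
    """NMI: mutual information scaled into [0, 1]; normalized here by the
    geometric mean of the two entropies (one common convention)."""
    h = np.sqrt(entropy(P) * entropy(C))
    return mutual_information(P, C) / h if h > 0 else 1.0

# Invented labels: the clustering matches the classes exactly, up to renaming.
classes  = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([1, 1, 1, 0, 0, 0])
print(nmi(classes, clusters))   # → 1.0
```

As the perfect-match example shows, NMI is invariant to relabeling the clusters, which is exactly why it is usable as an external index.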
This technique, developed by [60], was first used to rank alternatives across many criteria. Basically, by minimizing the distance to the ideal solution and maximizing the distance to the negative-ideal solution, TOPSIS aims to find the best alternatives. Below is a step-by-step description of the TOPSIS process used in this study.

In the first step, the normalized decision matrix is computed using Equation (1). This transforms the dimensions of the various attributes into nondimensional attributes, making it possible to compare the attributes:

$r_{ij} = \dfrac{x_{ij}}{\sqrt{\sum_{i=1}^{n} x_{ij}^2}}$    (1)

In this equation, i and j index the alternatives (clustering methods here) and the criteria (evaluation metrics here), while $x_{ij}$ represents the evaluation metric of the i-th criterion $C_i$ for the alternative $A_j$.

In the second step, the weighted normalized decision matrix is computed by constructing a set of weights ($w_i$) for each criterion. The weighted normalized value ($v_{ij}$) is computed as:

$v_{ij} = w_i r_{ij}, \quad j = 1, \ldots, n; \; i = 1, \ldots, m$    (2)

where $w_i$ is the weight of the i-th criterion and $\sum_{i=1}^{m} w_i = 1$.

In the third step, both the ideal and negative-ideal solutions are found using Equations (3) and (4):

$A^{+} = \left\{ \left(\max_i v_{ij} \mid j \in I\right), \left(\min_i v_{ij} \mid j \in J\right) \mid i = 1, 2, \ldots, m \right\} = \left\{ v_1^{+}, v_2^{+}, \ldots, v_j^{+}, \ldots, v_n^{+} \right\}$    (3)

$A^{-} = \left\{ \left(\min_i v_{ij} \mid j \in I\right), \left(\max_i v_{ij} \mid j \in J\right) \mid i = 1, 2, \ldots, m \right\} = \left\{ v_1^{-}, v_2^{-}, \ldots, v_j^{-}, \ldots, v_n^{-} \right\}$    (4)

where

$I = \{ j = 1, 2, \ldots, n \mid j \text{ is associated with benefit criteria} \}$,
$J = \{ j = 1, 2, \ldots, n \mid j \text{ is associated with cost criteria} \}$.

This experiment, which aims to evaluate clustering methods, considers the variation of information and the DI as the cost criteria that require minimization and all other metrics as benefit criteria that require maximization.

In the fourth step, the separation measures from both the ideal and negative-ideal solutions are calculated using Equations (5) and (6):

$S_i^{+} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - v_j^{+}\right)^2}, \quad i = 1, 2, \ldots, m$    (5)

$S_i^{-} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - v_j^{-}\right)^2}, \quad i = 1, 2, \ldots, m$    (6)

In the fifth step, the ratio that measures the relative closeness to the ideal solution is computed using Equation (7):

$C_i^{*} = \dfrac{S_i^{-}}{S_i^{+} + S_i^{-}}, \quad 0 < C_i^{*} < 1, \; i = 1, 2, \ldots, m$, with $C_i^{*} = 1$ if $A_i = A^{+}$ and $C_i^{*} = 0$ if $A_i = A^{-}$    (7)

In the final step, the alternatives are ranked by maximizing the ratio $C_i^{*}$. The time complexity of TOPSIS has three stages, as follows: the complexity of the attribute normalization and weighting is O(n²); the complexity of finding the positive and negative ideal solutions and the distances to them is O(n); and the complexity of ranking the alternatives is O(1).
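The six steps above can be sketched compactly in NumPy. This is a minimal illustration under the benefit/cost convention just described, not the study's implementation; the toy decision matrix is invented, and the criterion weights are simply taken as equal:

```python
import numpy as np

def topsis(X, weights, cost_mask):
    """Rank m alternatives (rows) on n criteria (columns) with TOPSIS.
    X: (m, n) decision matrix; weights: length-n vector summing to 1;
    cost_mask: boolean length-n vector, True for cost criteria (minimized)."""
    # Step 1: vector-normalize each criterion column (Eq. 1).
    R = X / np.sqrt((X ** 2).sum(axis=0))
    # Step 2: apply the criterion weights (Eq. 2).
    V = R * weights
    # Step 3: ideal and negative-ideal solutions (Eqs. 3-4):
    # best value is the max for benefit criteria and the min for cost criteria.
    A_pos = np.where(cost_mask, V.min(axis=0), V.max(axis=0))
    A_neg = np.where(cost_mask, V.max(axis=0), V.min(axis=0))
    # Step 4: Euclidean separation from both solutions (Eqs. 5-6).
    S_pos = np.sqrt(((V - A_pos) ** 2).sum(axis=1))
    S_neg = np.sqrt(((V - A_neg) ** 2).sum(axis=1))
    # Step 5: relative closeness to the ideal solution (Eq. 7).
    C = S_neg / (S_pos + S_neg)
    # Final step: rank alternatives by decreasing closeness.
    return C, np.argsort(-C)

# Toy example: 3 clustering methods scored on 2 benefit metrics and 1 cost metric.
X = np.array([[0.9, 0.8, 0.2],
              [0.6, 0.7, 0.1],
              [0.3, 0.4, 0.9]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])
cost_mask = np.array([False, False, True])   # e.g., the variation of information
closeness, ranking = topsis(X, weights, cost_mask)
print(ranking)   # → [0 1 2]: method 0 ranked best
```

In this toy matrix, method 0 dominates both benefit criteria, so it ranks first despite its slightly higher cost score, while method 2 coincides with the negative-ideal solution and gets closeness 0.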
IV. EXPERIMENTS
This section first describes the datasets developed for this study and then presents our proposed approach in the form of an algorithm that automates the whole process of comparing and selecting the best algorithm for each dataset. Finally, the results from our experiment are discussed to validate our proposed evaluation approach.

A. DATASETS AND DATA PREPROCESSING
To properly compare clustering methods, several datasets with various sizes and numbers of features are required. Several studies in different fields have used artificial datasets to compare clustering methods. However, the results and guidelines from most of these studies are not applicable to real-world datasets because such datasets mostly lack the normal distribution of data samples, as characterized by certain features that are divided into certain classes. This experiment used students' activity data extracted from the Moodle system of the University of Tartu in Estonia. The blended courses used were taught in the Institute of Education, including ''Pedagogical Traineeship'', ''Continuous Pedagogical Traineeship [part 1]'', ''Continuous Pedagogical Traineeship [part 2]'', ''Basics of Learning'', ''Teaching and Reflection'', ''Teacher Identity and Leadership [part 1]'', and ''Teacher Identity and Leadership [part 2]''. To create our datasets, for each course, different types of data were extracted, including the number of times the course resource was viewed, the number of times the course modules were viewed, the number of times the course materials were downloaded, the number of times the feedback was viewed, the number of times the feedback was received, the number of times a forum discussion was viewed, the number of attempts at quizzes, the number of times a discussion was created in a forum, the number of times book chapters were viewed, the number of times a book list was viewed, the number of times an assignment was submitted, the number of times an assignment was viewed, the assignment grade, the quiz grades, the number of times a discussion was viewed in a forum, the final grade, the number of times a post was created in a forum, the number of times comments were viewed, the number of times posts were updated in a forum, and the number of posts.

Using the extracted data, this study created nine different datasets that have different student numbers and feature sizes. For example, the first dataset contains ten features with a small number of students, the second dataset includes ten features with a medium number of students, and the third dataset contains ten features with a large number of students. In addition to these nine datasets, a normalized version of each dataset was generated using the min-max normalization technique to investigate the possible effect of data normalization on the clustering algorithm performance. Basically, this normalization technique is used in this study because the attributes are mostly measured on different scales. To do so, the technique maps the minimum and maximum values in X to 0 and 1, respectively, allowing the entire range of values of X to be mapped to the range 0 to 1. Note that this study used more than one course with similar attributes to generate the datasets with a large number of students. Additionally, before applying the clustering methods, according to the guidelines provided by the handbook of EDM [61], the authors performed data cleaning and preprocessing to address the sparsity issue and reduce the effect of noise and outliers. Furthermore, a filter-based feature selection approach was used to determine the high-potential variables to form our datasets for the experiment (small, medium, and high numbers of features). The description of the datasets is given in Table 3.

B. THE PROPOSED EVALUATION APPROACH
Algorithm 1 illustrates our proposed generalizable approach to systemizing the process of comprehensively evaluating clustering algorithms and selecting the most suitable algorithm for a given problem. Fig. 1 also shows the framework of our proposed approach.
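The core loop of such an approach — normalize, cluster with several methods, and collect one row of performance measures per method — can be sketched with scikit-learn-style estimators. This is an illustrative skeleton rather than the authors' Algorithm 1: the synthetic data, the two candidate methods, and the two internal measures are placeholders (the full study compares seven methods on seventeen measures):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for one Moodle activity dataset (10 features).
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
X = MinMaxScaler().fit_transform(X)   # the min-max normalization step described above

# A subset of the candidate clustering methods.
methods = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Build the decision matrix: one row per method, one column per measure.
rows = []
for name, model in methods.items():
    labels = model.fit_predict(X)
    rows.append([silhouette_score(X, labels),        # benefit: higher is better
                 davies_bouldin_score(X, labels)])   # cost: lower is better
decision_matrix = np.array(rows)
print(decision_matrix.shape)   # → (2, 2)
```

The resulting matrix of alternatives-by-criteria is exactly the input that the TOPSIS procedure of the previous section ranks.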
TABLE 4. Values of the performance metrics for clustering algorithms in datasets SF-LS, MF-SS, and LF-LS.
V. RESULTS AND DISCUSSION
Table 4 shows the values of the performance metrics for the clustering algorithms in the datasets with a small number of features and a large number of students (SF-LS), a medium number of features and a small number of students (MF-SS), and large numbers of features and students (LF-LS). These three datasets are presented as a sample (datasets with small, medium, and large feature numbers), and the full list of performance measure results is provided in the Appendix (Tables 6 and 7). As is apparent, there is no clustering algorithm that could reach the best performance in all of the datasets. In addition to the methods that show limited performance in large datasets, some other methods, including agglomerative clustering, appear to perform poorly. These findings are in line with those reported, in disciplines other than education, by [34], [35], and [37], who concluded that hierarchical clustering methods (e.g., agglomerative) perform poorly on large datasets, leading to low accuracy in experiments.

Considering the performance of all of the algorithms on the different datasets, it can be said that spectral clustering is preferred in datasets with a small or medium number of features (15 or fewer), regardless of the number of students. However, it tends to perform poorly in datasets with a large number of features (more than 15). For datasets with a large number of features, OPTICS strongly outperforms the other methods and appears to be the most stable algorithm. Surprisingly, k-means, which has been frequently employed by researchers in EDM and LA research scenarios that mostly lacked comparisons of clustering algorithms, was never ranked the best on any dataset. These findings can serve to notify the educational research community to take such comparison tasks more seriously and possibly take more measures before instinctively choosing a clustering algorithm, e.g., k-means. In this regard, our findings are consistent with those reported by researchers in different disciplines, e.g., [62].

A. EFFECT OF NORMALIZATION ON THE PERFORMANCE OF CLUSTERING METHODS
According to Table 5 and Fig. 3a, for the normalized dataset with small numbers of features and students, DBSCAN outperforms the other algorithms, while spectral clustering tends to be the lowest performing method. For the normalized dataset with a small number of features and a medium number of students, OPTICS and agglomerative clustering tend to have the best and worst performance, respectively. In regard to the normalized dataset with a small number of features and a large number of students, k-medoids outperforms the other methods, while spectral clustering is among the lowest performing methods. Thus, depending on the number of students, in datasets with a small number of features, DBSCAN, OPTICS, and k-medoids appear to be better options than the other clustering methods. While spectral clustering and DBSCAN were strongly affected by normalization, the results for the other methods are fairly similar to those on the nonnormalized datasets, with a slight improvement in performance. In other words, all of the clustering algorithms improved their performance with normalization of the datasets, except for spectral clustering. More specifically, normalization worsens the performance of spectral clustering and improves the performance of all of the other methods.
For the normalized datasets with a medium number of features, regardless of the number of students, considering both the first and second ranks, our findings show that agglomerative and spectral clustering are the best and worst methods, respectively (see Fig. 3b). Regardless of the number of students, in datasets with a medium number of features, almost all of the clustering algorithms improved their performance with normalization of the datasets, except for spectral clustering. More specifically, normalization reduces the spectral performance and mostly improves the performance of the other methods. K-means, EM, and agglomerative clustering appear to benefit strongly from normalization. In nonnormalized datasets with a medium number of features, spectral clustering showed the best performance, while in normalized datasets with a medium number of features, agglomerative clustering ranked the best. Therefore, it is fair to say that spectral and agglomerative clustering are among the high-performing algorithms in nonnormalized and normalized datasets with a medium number of features (i.e., 15 features), respectively.

Finally, considering both the first- and second-ranked methods in normalized datasets with a large number of features (20 features here), regardless of the number of students, it can be said that the k-medoids algorithm has the highest performance and is the most stable clustering method (see Fig. 3c). Furthermore, in addition to the normalized dataset with a large number of features and a small number of students, OPTICS mostly does not perform well in the other normalized datasets with a large number of features. This finding also indicates a negative effect of normalization on the OPTICS method.

Interestingly, the spectral algorithm, which was negatively affected by normalization for both small and medium numbers of features, shows better performance with normalized datasets that have a large number of features. In contrast, OPTICS and DBSCAN appear to be mostly negatively affected by normalization when they address datasets with a large number of features. Last, the k-medoids approach tends to benefit strongly from the normalization of datasets with a large number of features. In nonnormalized datasets with a large number of features, OPTICS was shown to have the best performance, while in normalized datasets with a large number of features, k-medoids is ranked the best. Therefore, it is fair to say that OPTICS and k-medoids are among the highest performing algorithms in nonnormalized and normalized datasets with a large number of features (more than 15), respectively.

In brief, these findings suggest that normalization can have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to mostly have a positive impact on several other clustering methods. This result is in line with those reported by other researchers, e.g., [29], who state that spectral
FIGURE 2. Results of TOPSIS ranking for different datasets: (a) datasets with a small number of
features (i.e., 10), (b) datasets with a medium number of features (i.e., 15), (c) datasets with a large
number of features (i.e., 20).
FIGURE 3. Results of TOPSIS ranking for different normalized datasets: (a) normalized datasets with
a small number of features (i.e., 10), (b) normalized datasets with a medium number of features
(i.e., 15), (c) normalized datasets with a large number of features (i.e., 20).
TABLE 6. (Continued) Values of performance metrics for clustering algorithms in nonnormalized datasets.
clustering, in general, appears to be sensitive to different measures. On the other hand, methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids clustering benefits more from normalization in datasets with more than 15 features. Consequently, our findings reveal that in normalized datasets with more than 10 features, the agglomerative and k-medoids methods are preferred, while in datasets with 10 or fewer features, different clustering methods could be employed; in general, the differences between the performances of the clustering methods are not considerable.
VI. CONCLUSION AND FUTURE DIRECTIONS
While several clustering techniques have been employed in an educational context, it is still controversial which technique would perform the best for a given dataset. Even though a growing number of researchers highlight the importance of comparing and identifying the most suitable clustering methods for the problem at hand, many educational studies ignore this critical task. This limitation could potentially have a negative effect on the prediction power of both EDM and LA approaches. This study models this evaluation task as a multiple criteria decision-making problem and employs the TOPSIS method to rank the results produced by multiple internal and external performance measures when applied to several clustering algorithms. This approach helps to automatically recommend the best clustering method for each dataset. In this study, 18 different datasets (created from the University of Tartu's Moodle system) with different sizes were evaluated to not only find the most suitable clustering method for each dataset but also investigate the possible effect of using different sizes and types of educational datasets on the performance of clustering algorithms in two-way clustering. Furthermore, this study examined the effect of normalization on the quality of the clusters produced by different clustering methods.

Our findings reveal that in regard to datasets with 15 or fewer features, regardless of the number of students, spectral clustering appears to be more robust and outperforms the other clustering methods. However, it tends to perform poorly in datasets with a large number of features (more than 15). This finding implies that spectral clustering is sensitive to increases in the number of features and the data size. Concerning the most suitable clustering method for datasets with a large number of features, OPTICS strongly outperforms the other methods
TABLE 7. (Continued) Values of performance metrics for clustering algorithms in normalized datasets.
and appears to be the most stable algorithm. Unexpectedly, the k-means method, which has been frequently employed by researchers in educational research (both EDM and LA research), could not rank first in any of the datasets. These findings could motivate the educational community to pay more attention to the evaluation task when using clustering methods because using a low-performing method could possibly reduce the prediction power of their EDM and LA approaches.

With regard to the effect of normalization on the performance of clustering methods, this study found that normalization could have a negative effect on the performance of the spectral algorithm in datasets with 15 or fewer features; however, it appears to have a positive impact on all of the other clustering methods. More explicitly, the spectral algorithm, which was negatively affected by normalization for both small and medium numbers of features, shows better performance with normalized datasets that have a large number of features. Methods such as OPTICS and agglomerative clustering tend to benefit more from normalization in datasets with a smaller number of features (15 or fewer), whereas k-medoids benefits more from normalization in datasets with more than 15 features. In conclusion, it can be said that, in general, normalization can have a negative effect on the performance of certain methods, e.g., spectral and OPTICS; however, it mostly appears to have a positive impact on the other clustering methods.

A few limitations of this study could be regarded as directions for future research. For example, the proposed evaluation approach was applied to a rather small data size for two-way clustering. In future work, the proposed automatic approach could be applied to larger educational datasets with a larger value of k. Another direction for future work would be to employ more advanced or heuristic clustering methods that are capable of automatically reducing the effect of noise and outliers in large datasets, while addressing the issue of dimensionality and offering a good level of computation ([13], [14]). Finally, another interesting direction for future work would involve implementing similar experiments on more datasets from different educational means and sources, such as games or intelligent tutoring systems, to possibly compare with our findings from this study and provide wider recommendations and guidelines to researchers and practitioners in the educational community.

APPENDIX
See Tables 6 and 7.
Neural Inf. Process. Syst., 2003, pp. 463–470.
[27] L. Peel, D. B. Larremore, and A. Clauset, ‘‘The ground truth about
REFERENCES metadata and community detection in networks,’’ Sci. Adv., vol. 3, no. 5,
[1] D. R. Garrison and H. Kanuka, ‘‘Blended learning: Uncovering its transfor- May 2017, Art. no. e1602548.
mative potential in higher education,’’ Internet Higher Edu., vol. 7, no. 2, [28] P. D. Antonenko, S. Toy, and D. S. Niederhauser, ‘‘Using cluster analysis
pp. 95–105, Apr. 2004. for data mining in educational technology research,’’ Educ. Technol. Res.
[2] C. Romero, S. Ventura, and E. García, ‘‘Data mining in course management Develop., vol. 60, no. 3, pp. 383–398, Jun. 2012.
systems: Moodle case study and tutorial,’’ Comput. Edu., vol. 51, no. 1, [29] K. L. Eranki and K. M. Moudgalya, ‘‘Evaluation of Web based behavioral
pp. 368–384, Aug. 2008. interventions using spoken tutorials,’’ in Proc. IEEE 4th Int. Conf. Technol.
[3] A. Dutt, M. A. Ismail, and T. Herawan, ‘‘A systematic review on educa- Educ., Jul. 2012, pp. 38–45.
tional data mining,’’ IEEE Access, vol. 5, pp. 15991–16005, 2017. [30] W.-C. Chang, T.-H. Wang, and M.-F. Li, ‘‘Learning ability clustering in
[4] C. Romero and S. Ventura, ‘‘Educational data mining: A survey from 1995 collaborative learning,’’ J. Softw., vol. 5, no. 12, pp. 1363–1370, Dec. 2010.
to 2005,’’ Expert Syst. Appl., vol. 33, no. 1, pp. 135–146, Jul. 2007. [31] A. R. Anaya and J. G. Boticario, ‘‘Clustering learners according to their
[5] C. Anuradha, T. Velmurugan, and R. Anandavally, ‘‘Clustering algorithms collaboration,’’ in Proc. 13th Int. Conf. Comput. Supported Cooperat. Work
in educational data mining: A review,’’ Int. J. Power Control Comput., Design, 2009, pp. 540–545.
vol. 7, no. 1, pp. 47–52, 2015. [32] G. Cobo, D. García-Solórzano, E. Santamaria, J. A. Morán, J. Melenchón,
and C. Monzo, ‘‘Modeling students’ activity in online discussion forums:
[6] C. Romero, S. Ventura, M. Pechenizkiy, and R. S. Baker, Handbook of
A strategy based on time series and agglomerative hierarchical clustering,’’
Educational Data Mining. Boca Raton, FL, USA: CRC Press, 2010.
in Proc. EDM, 2011, pp. 253–258.
[7] M. Pedaste and T. Sarapuu, ‘‘Developing an effective support system for
[33] A. M. Navarro and P. Moreno-Ger, ‘‘Comparison of clustering algorithms
inquiry learning in a Web-based environment,’’ J. Comput. Assist. Learn.,
for learning analytics with educational datasets,’’ Int. J. Interact. Multime-
vol. 22, no. 1, pp. 47–62, Jan. 2006.
dia Artif. Intell., vol. 5, no. 2, pp. 9–16, 2018.
[8] P. Arabie and G. De Soete, Clustering and Classification. Singapore: World
[34] I. G. Costa, F. D. A. T. D. Carvalho, and M. C. P. D. Souto, ‘‘Comparative
Scientific, 1996.
analysis of clustering methods for gene expression time course data,’’
[9] D. Hooshyar, M. Pedaste, and Y. Yang, ‘‘Mining educational data to predict Genet. Mol. Biol., vol. 27, no. 4, pp. 623–631, 2004.
students’ performance through procrastination behavior,’’ Entropy, vol. 22,
[35] M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and
no. 1, p. 12, Dec. 2019.
A. Schliep, ‘‘Clustering cancer gene expression data: A comparative
[10] F. Camastra and A. Verri, ‘‘A novel kernel method for clustering,’’ IEEE study,’’ BMC Bioinf., vol. 9, no. 1, p. 497, Dec. 2008.
Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 801–805, May 2005. [36] Y. G. Jung, M. S. Kang, and J. Heo, ‘‘Clustering performance comparison
[11] L. Jing, M. K. Ng, and J. Z. Huang, ‘‘An entropy weighting k-means using K -means and expectation maximization algorithms,’’ Biotechnol.
algorithm for subspace clustering of high-dimensional sparse data,’’ IEEE Biotechnol. Equip., vol. 28, no. 1, pp. S44–S48, Nov. 2014.
Trans. Knowl. Data Eng., vol. 19, no. 8, pp. 1026–1041, Aug. 2007. [37] T. Kinnunen, I. Sidoroff, M. Tuononen, and P. Fränti, ‘‘Comparison of
[12] R. Suzuki and H. Shimodaira, ‘‘Pvclust: An R package for assessing the clustering methods: A case study of text-independent speaker modeling,’’
uncertainty in hierarchical clustering,’’ Bioinformatics, vol. 22, no. 12, Pattern Recognit. Lett., vol. 32, no. 13, pp. 1604–1617, Oct. 2011.
pp. 1540–1542, Jun. 2006. [38] S. Brohée and J. van Helden, ‘‘Evaluation of clustering algorithms for
[13] M. Sassi Hidri, M. A. Zoghlami, and R. Ben Ayed, ‘‘Speeding up the large- protein-protein interaction networks,’’ BMC Bioinf., vol. 7, no. 1, p. 488,
scale consensus fuzzy clustering for handling big data,’’ Fuzzy Sets Syst., Dec. 2006.
vol. 348, pp. 50–74, Oct. 2018. [39] N. Soni and A. Ganatra, ‘‘Categorization of several clustering algorithms
[14] M. A. Zoghlami, M. S. Hidri, and R. B. Ayed, ‘‘Consensus-driven cluster from different perspective: A review,’’ Int. J., to be published.
analysis: Top-down and bottom-up based split-and-merge classifiers,’’ Int. [40] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya,
J. Artif. Intell. Tools, vol. 26, no. 04, Aug. 2017, Art. no. 1750018. S. Foufou, and A. Bouras, ‘‘A survey of clustering algorithms for big data:
[15] J. Wang, C. Zhu, Y. Zhou, X. Zhu, Y. Wang, and W. Zhang, ‘‘From Taxonomy and empirical analysis,’’ IEEE Trans. Emerg. Topics Comput.,
partition-based clustering to density-based clustering: Fast find clusters vol. 2, no. 3, pp. 267–279, Sep. 2014, doi: 10.1109/TETC.2014.2330519.
with diverse shapes and densities in spatial databases,’’ IEEE Access, vol. 6, [41] P. Singh and A. Surya, ‘‘Performance analysis of clustering algorithms in
pp. 1718–1729, 2018.
[16] L. Wang, S. Ding, and H. Jia, ‘‘An improvement of spectral clustering via message passing and density sensitive similarity,’’ IEEE Access, vol. 7, pp. 101054–101062, 2019.
[17] R. Gelbard, O. Goldman, and I. Spiegler, ‘‘Investigating diversity of clustering methods: An empirical comparison,’’ Data Knowl. Eng., vol. 63, no. 1, pp. 155–166, Oct. 2007.
[18] U. Maulik and S. Bandyopadhyay, ‘‘Performance evaluation of some clustering algorithms and validity indices,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, Dec. 2002.
[19] J. O. McClain and V. R. Rao, ‘‘CLUSTISZ: A program to test for the quality of clustering of a set of objects,’’ J. Mark. Res., vol. 12, no. 4, pp. 456–460, 1975.
[20] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, vol. 382. Hoboken, NJ, USA: Wiley, 2007.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, ‘‘Data clustering: A review,’’ ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[22] M. Z. Rodriguez, C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio, L. D. F. Costa, and F. A. Rodrigues, ‘‘Clustering algorithms: A comparative approach,’’ PLoS ONE, vol. 14, no. 1, Jan. 2019, Art. no. e0210236.
data mining in Weka,’’ Int. J. Adv. Eng. Technol., vol. 7, no. 6, p. 1866, 2015.
[42] C. Fraley, ‘‘How many clusters? Which clustering method? Answers via model-based cluster analysis,’’ Comput. J., vol. 41, no. 8, pp. 578–588, Aug. 1998.
[43] A. K. Jain, A. Topchy, M. H. Law, and J. M. Buhmann, ‘‘Landscape of clustering algorithms,’’ in Proc. ICPR, vol. 1, 2004, pp. 260–263.
[44] J.-O. Palacio-Niño and F. Berzal, ‘‘Evaluation metrics for unsupervised learning algorithms,’’ 2019, arXiv:1905.05667. [Online]. Available: http://arxiv.org/abs/1905.05667
[45] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, ‘‘Understanding of internal clustering validation measures,’’ in Proc. IEEE Int. Conf. Data Mining, Dec. 2010, pp. 911–916.
[46] Y. Lei, J. C. Bezdek, S. Romano, N. X. Vinh, J. Chan, and J. Bailey, ‘‘Ground truth bias in external cluster validity indices,’’ Pattern Recognit., vol. 65, pp. 58–70, May 2017.
[47] A. K. Jain, ‘‘Data clustering: 50 years beyond K-means,’’ Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, Jun. 2010.
[48] H.-S. Park and C.-H. Jun, ‘‘A simple and fast algorithm for K-medoids clustering,’’ Expert Syst. Appl., vol. 36, no. 2, pp. 3336–3341, Mar. 2009.
[49] T. K. Moon, ‘‘The expectation-maximization algorithm,’’ IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47–60, 1996.
[50] W. H. E. Day and H. Edelsbrunner, ‘‘Efficient algorithms for agglomerative hierarchical clustering methods,’’ J. Classification, vol. 1, no. 1, pp. 7–24, Dec. 1984.
[51] U. von Luxburg, ‘‘A tutorial on spectral clustering,’’ Statist. Comput., vol. 17, no. 4, pp. 395–416, Dec. 2007.
[52] D. Birant and A. Kut, ‘‘ST-DBSCAN: An algorithm for clustering spatial–temporal data,’’ Data Knowl. Eng., vol. 60, no. 1, pp. 208–221, Jan. 2007.
[53] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, ‘‘OPTICS: Ordering points to identify the clustering structure,’’ ACM SIGMOD Rec., vol. 28, no. 2, pp. 49–60, 1999.
[54] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, ‘‘On clustering validation techniques,’’ J. Intell. Inf. Syst., vol. 17, nos. 2–3, pp. 107–145, 2001.
[55] E. Rendón, I. Abundez, A. Arizmendi, and E. M. Quiroz, ‘‘Internal versus external cluster validation indexes,’’ Int. J. Comput. Commun., vol. 5, no. 1, pp. 27–34, 2011.
[56] M. Alhabo and L. Zhang, ‘‘Multi-criteria handover using modified weighted TOPSIS methods for heterogeneous networks,’’ IEEE Access, vol. 6, pp. 40547–40558, 2018.
[57] Q. M. Ashraf, M. H. Habaebi, and M. R. Islam, ‘‘TOPSIS-based service arbitration for autonomic Internet of Things,’’ IEEE Access, vol. 4, pp. 1313–1320, 2016.
[58] G.-H. Tzeng and J.-J. Huang, Multiple Attribute Decision Making: Methods and Applications. Boca Raton, FL, USA: CRC Press, 2011.
[59] P. Wanke, C. Barros, and N. P. J. Macanda, ‘‘Predicting efficiency in Angolan banks: A two-stage TOPSIS and neural networks approach,’’ South Afr. J. Econ., vol. 84, no. 3, pp. 461–483, Sep. 2016.
[60] C.-L. Hwang and K. Yoon, ‘‘Methods for multiple attribute decision making,’’ in Multiple Attribute Decision Making. Cham, Switzerland: Springer, 1981, pp. 58–191.
[61] A. Peña-Ayala, Ed., Educational Data Mining: Applications and Trends, vol. 524. Cham, Switzerland: Springer, 2013.
[62] Z. Ansari, M. F. Azeem, W. Ahmed, and A. V. Babu, ‘‘Quantitative evaluation of performance and validity indices for clustering the Web navigational sessions,’’ World Comput. Sci. Inf. Technol., vol. 1, no. 5, pp. 217–226, Jun. 2011. [Online]. Available: https://arxiv.org/abs/1507.03340v1

YEONGWOOK YANG received the master’s degree in computer science education and the Ph.D. degree from the Department of Computer Science and Engineering, Korea University, Seoul, South Korea. He is currently a Senior Researcher with the University of Tartu, Tartu, Estonia. His research interests include information filtering, recommendation systems, educational data mining, and deep learning.

MARGUS PEDASTE (Senior Member, IEEE) received the master’s degree in biology education and the Ph.D. degree in biology and earth science education from the University of Tartu, Tartu, Estonia. He is currently a Professor of educational technology with the University of Tartu. His research interests include the effects of educational technologies, virtual and augmented reality, science education, and teacher education and agency.