Classifiers For Educational Data Mining
$P(A_1, \ldots, A_k \mid C) = \prod_{i=1}^{k} P(A_i \mid C)$. This Naive Bayes assumption can
be represented as a two-layer Bayesian network (Figure 3), with the class
variable C as the root node and all the other variables A_1, ..., A_k as leaf
nodes. Now we have to estimate only O(kv) probabilities per class. The use
of the MDL score function in model selection is also avoided, because the
model structure is fixed once we have decided the explanatory variables A_i.
In practice, the Naive Bayes assumption seldom holds, but naive Bayes
classifiers have still achieved good results. In fact, Domingos and
Pazzani (1997) have shown that the Naive Bayes assumption is only a sufficient
but not a necessary condition for the optimality of the naive Bayes classifier.
In addition, if we are only interested in the ranked order of the classes, it
does not matter if the estimated probabilities are biased.
Figure 3: A naive Bayes model with class attribute C and explanatory attributes A_1, ..., A_k.
As a consequence of the Naive Bayes assumption, the representative power of
the naive Bayes model is lower than that of decision trees. If the model uses
nominal data, it can recognize only linear class boundaries. When numeric
data is used, more complex (non-linear) boundaries can be represented.
Otherwise, the naive Bayes model has many advantages: it is very simple,
efficient, robust to noise, and easy to interpret. It is especially suitable
for small data sets, because it combines low complexity with a flexible
probabilistic model. The basic model suits only discrete data, so numeric
data should be discretized. Alternatively, we can learn a continuous
model by estimating densities instead of distributions. However, continuous
Bayesian networks assume some general form of distribution, typically the
normal distribution, which is often unrealistic. Usually, discretization is a
better solution, because it also simplifies the model, and the resulting classifier
is more robust to overfitting.
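To make this concrete, the following minimal sketch implements a naive Bayes classifier for discrete attributes. The class, the toy attributes (motivation level, submitted exercises), and the Laplace-style smoothing constant are our own illustrative assumptions, not part of the original text.

```python
from collections import defaultdict

class NaiveBayes:
    """Minimal naive Bayes classifier for discrete (nominal) attributes."""

    def fit(self, X, y, alpha=1.0):
        self.classes = sorted(set(y))
        self.alpha = alpha                      # smoothing against zero counts
        self.n = len(y)
        self.k = len(X[0])
        self.class_counts = {c: y.count(c) for c in self.classes}
        # Count tables for P(A_i = v | C), one per (class, attribute) pair
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = [set() for _ in range(self.k)]
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(c, i)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        best, best_p = None, 0.0
        for c in self.classes:
            # P(C) * prod_i P(A_i | C), the Naive Bayes factorization
            p = self.class_counts[c] / self.n
            for i, v in enumerate(row):
                num = self.counts[(c, i)][v] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[i])
                p *= num / den
            if best is None or p > best_p:
                best, best_p = c, p
        return best

# Hypothetical data: (motivation level, submitted exercises) -> course outcome
X = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "yes")]
y = ["pass", "pass", "fail", "pass"]
print(NaiveBayes().fit(X, y).predict(("low", "no")))  # -> "pass" (prior dominates this tiny sample)
```

With v values per attribute, these count tables hold exactly the O(kv) probabilities per class mentioned above.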
4.3 Neural networks
Artificial neural networks (see e.g. Duda et al. (2000)) are very popular in
pattern recognition, and justly so. According to a classical phrase (J. Denker,
quoted in Russell and Norvig (2002)[585]), they are "the second best way of
doing just about anything". Still, they can be problematic when applied
to educational technology, unless you have a lot of numeric data and know
exactly how to train the model.
Feed-forward neural networks (FFNNs) are the most widely used type of
neural network. The FFNN architecture consists of layers of nodes: one layer of
input nodes, one of output nodes, and at least one layer of hidden nodes.
On each hidden layer, the nodes are connected to the nodes of the previous and
next layers, and each edge is associated with an individual weight. The most common
model contains just one hidden layer. This is usually sufficient, because in
principle any function can be represented by a three-layer network, given suf-
ficiently many hidden nodes (Hecht-Nielsen, 1989). This implies that we can
also represent any kind of (non-linear) class boundaries. However, in practice
learning a highly non-linear network is very difficult or even impossible. For
linearly separable classes it is sufficient to use a perceptron, an FFNN with no
hidden layers.
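For the linearly separable case, a perceptron can be trained with the classical mistake-driven update rule. This is a minimal sketch; the learning rate, epoch budget, and toy data are arbitrary illustrative choices, not recommendations from the cited sources.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron for linearly separable classes; y must contain -1/+1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # Update only on misclassified (or boundary) points
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # converged: all training points correctly classified
            break
    return w, b

# Toy data: points roughly above/below the line x1 + x2 = 1
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([-1, 1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # should reproduce y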
The learning algorithm is an essential part of the neural network model.
Even if neural networks can represent any kind of classifier, we are seldom
able to learn the optimal model. The learning is computationally hard, and
the results depend on several open parameters, like the number of hidden lay-
ers, the number of hidden nodes on each layer, the initial weights, and the termina-
tion criterion. Especially the selection of the architecture (network topology)
and the termination criterion are critical, because neural networks are very
sensitive to overfitting. Unfortunately, there are no foolproof instructions,
and the parameters have to be defined by trial and error. However, there
are some general rules of thumb, which restrict the number of trials needed.
For example, Duda et al. (2000)[317] suggest using a three-layer network
as a default and adding layers only for serious reasons. For the stopping criterion
(deciding when the model is ready), a popular strategy is to use a separate
test set (Mitchell, 1997)[111].
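The separate-test-set stopping strategy might be sketched as follows. To keep the sketch short and runnable, a linear model trained by gradient descent stands in for the neural network; the split ratio, learning rate, and patience threshold are all hypothetical assumptions for illustration.

```python
import numpy as np

def train_with_early_stopping(X, y, lr=0.01, max_epochs=500, patience=10):
    """Gradient descent stopped when error on a held-out set stops improving."""
    # Hold out 25% of the data to monitor generalization
    split = int(0.75 * len(X))
    X_tr, y_tr, X_val, y_val = X[:split], y[:split], X[split:], y[split:]
    w = np.zeros(X.shape[1])
    best_err, best_w, waited = np.inf, w.copy(), 0
    for epoch in range(max_epochs):
        # One epoch of gradient descent on the mean squared error
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(X_tr)
        w -= lr * grad
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:
            best_err, best_w, waited = val_err, w.copy(), 0
        else:
            waited += 1
            if waited >= patience:  # held-out error stopped improving
                break
    return best_w  # roll back to the weights with the best held-out error

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
w = train_with_early_stopping(X, y)
```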
Feed-forward neural networks have several attractive features. They can
easily learn non-linear boundaries and, in principle, represent any kind of
classifier. If the original variables are not discriminatory, an FFNN transforms
them implicitly. In addition, FFNNs are robust to noise and can be updated
with new data.
The main disadvantage is that FFNNs need a lot of data, much more
than typical educational data sets contain. They are very sensitive to over-
fitting, and the problem is even more critical with small training sets. The
data should be numeric, and categorical data must somehow be quantized
before it can be used. However, this increases the model complexity, and the
results are sensitive to the quantization method used.
The neural network model is a black box, and it is hard for people to
understand the explanations for its outcomes. In addition, neural networks
are unstable and achieve good results only in good hands (Duin, 2000). Fi-
nally, we recall that finding an optimal FFNN is an NP-complete problem
(Blum and Rivest, 1988) and the learning algorithm can get stuck in a local
optimum. In any case, the training can be time consuming, especially if we want to
circumvent overfitting.
4.4 K-nearest neighbour classifiers
K-nearest neighbour classifiers (see e.g. Hand et al. (2002)[347–352]) repre-
sent a totally different approach to classification. They do not build any
explicit global model, but approximate it only locally and implicitly. The
main idea is to classify a new object by examining the class values of its K
most similar data points. The selected class can be either the most common
class among the neighbours or a class distribution in the neighbourhood.
The only learning task in K-nearest neighbour classifiers is to select two
important parameters: the number of neighbours K and the distance metric d.
An appropriate K value can be selected by trying different values and
validating the results on a separate test set. When data sets are small, a
good strategy is to use leave-one-out cross-validation. If K is fixed, then the
size of the neighbourhood varies: in sparse areas the nearest neighbours are
more remote than in dense areas. However, defining different Ks for different
areas is even more difficult. If K is very small, then the neighbourhood is
also small and the classification is based on just a few data points. As a
result, the classifier is unstable, because these few neighbours can vary a lot.
On the other hand, if K is very large, then the most likely class in the
neighbourhood can deviate much from the real class. For small-dimensional
data sets, a suitable K usually lies between 5 and 10. One solution is to weight
the neighbours by their distances. In this case, the neighbourhood can cover
all data points so far, and all neighbourhoods are equally large. The only
disadvantage is that the computation becomes slower.
Defining the distance metric d is another, even more critical problem.
Usually, the metrics take into account all attributes, even if some attributes
are irrelevant. It is then possible that the most similar neighbours become
remote and the wrong neighbours corrupt the classification. The problem
becomes more serious when more attributes are used and the attribute space
is sparse. When all points are far away, it is hard to distinguish real neighbours
from other points, and the predictions become inaccurate. As a solution, it
has been suggested (e.g. Hinneburg et al. (2000)) to give relevance weights
to attributes, but the relevant attributes can also vary from class to class.
In practice, appropriate feature selection can produce better results.
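Combining the two design decisions above, the number of neighbours K and the distance metric d, a distance-weighted K-nearest neighbour vote can be sketched as below; the Euclidean metric, the inverse-distance weights, and the toy attributes are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X, y, query, k=5):
    """Distance-weighted K-nearest neighbour vote (Euclidean metric assumed)."""
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbours get larger weights; epsilon avoids division by zero
        votes[y[i]] += 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

# Toy example: two numeric attributes (e.g. exercise points, exam points)
X = np.array([[10, 20], [12, 22], [30, 40], [32, 41]])
y = ["fail", "fail", "pass", "pass"]
print(knn_predict(X, y, np.array([29, 38])))  # -> "pass"
```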
The nearest neighbour classifiers have several advantages: there are
only two parameters to learn (or select), the classification accuracy can be
very good in some problems, and the classification is quite robust to noise
and missing values. Especially a weighted distance smooths the noise in at-
tribute values, and missing values can simply be skipped. Nearest neighbour
classifiers have very high representative power, because they can work with
any kind of class boundaries, given sufficient data.
The main disadvantage is the difficulty of selecting the distance function d. Edu-
cational data often consists of both numeric and categorical attributes, and the numeric
attributes can be on different scales. This means that we need a weighted distance
function, but also a large data set to learn the weights accurately. Irrelevant
attributes are also common in some educational data sets (e.g. questionnaire
data), and they should be removed first.
The lack of an explicit model can be either an advantage or a disadvan-
tage. If the model is very complex, it is often easier to approximate it only
locally. In addition, there is no need to update the classifier when new data
is added. However, such lazy methods are slower in classification
than model-based approaches. If the data set is large, we need some index to
find the nearest neighbours efficiently. It is also noteworthy that an explicit
model is useful for the human evaluators and designers of the system.
4.5 Support vector machines
Support vector machines (SVMs) (Vapnik, 1998) are an ideal method when
the class boundaries are non-linear but there is too little data to learn complex
non-linear models. The underlying idea is that when the data is mapped to
a higher dimension, the classes become linearly separable. In practice, the
mapping is done only implicitly, using kernel functions.
SVMs concentrate only on the class boundaries; points which are in any case
easily classified are skipped. The goal is to find the thickest hyperplane
(with the largest margin) which separates the classes. Often, better results
are achieved with soft margins, which allow some misclassified data points.
When the optimal margin has been determined, it is enough to save the support
vectors, i.e. the data points which define the class boundaries.
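As a hedged illustration of these ideas, here is how a soft-margin SVM with an implicit (RBF) kernel mapping might be fitted using scikit-learn; the library, the kernel, the parameter values, and the synthetic data are our own choices for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic non-linear problem: class depends on distance from the origin
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 0.5).astype(int)

# The RBF kernel maps the data implicitly to a higher dimension;
# smaller C means a softer margin (more misclassified points tolerated)
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)
print(len(model.support_vectors_))  # only these points define the boundary
print(model.predict([[0.1, 0.1], [0.9, 0.9]]))  # expected: [1 0]
```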
The main advantage of SVMs is that they always find the global optimum,
because there are no local optima in maximizing the margin. Another benefit
is that the accuracy does not depend on the dimensionality of the data, and the
method is very robust to overfitting. This is an important advantage when
the class boundary is non-linear: most other classification paradigms produce
overly complex models for non-linear boundaries.
However, SVMs have the same restrictions as neural networks: the data
should be continuous numerical (or quantized), the model is not easily in-
terpreted, and selecting the appropriate parameters (especially the kernel
function) can be difficult. Outliers can cause problems, because they are
used to define the class borders. Usually, this problem is alleviated by soft
margins.
4.6 Linear regression
Linear regression is not actually a classification method, but it works well
when all attributes are numeric. For example, passing a course depends on
the student's points, and the points can be predicted by linear regression.
In linear regression, it is assumed that the target attribute (e.g. total
points) is a linear function of the other, mutually independent attributes. How-
ever, the model is very flexible and can work well even if the actual de-
pendency is only approximately linear or the other attributes are weakly
correlated (e.g. Xycoon (2000-2006)). The reason is that linear regression
produces very simple models, which are not as prone to overfitting as more
complex models. However, the data should not contain large gaps (empty
areas), and the number of outliers should be small (Huber, 1981)[162].
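A least-squares fit of this kind takes only a few lines; the attributes below (exercise and exam points predicting total points), the data, and the pass threshold are a made-up example.

```python
import numpy as np

# Illustrative data: exercise points, exam points -> total points
X = np.array([[10.0, 40.0], [20.0, 50.0], [15.0, 30.0], [25.0, 60.0]])
y = np.array([55.0, 75.0, 48.0, 90.0])

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict total points for a new student, then threshold for pass/fail
new_student = np.array([1.0, 18.0, 45.0])
predicted_total = new_student @ coef
print("pass" if predicted_total >= 50 else "fail")
```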
4.7 Comparison
Selecting the most appropriate classification method for a given task is a
difficult problem, and no general answer can be given. In Table 1, we have
evaluated the main classification methods according to eight general criteria,
which are often relevant when educational data is classified.
The first criterion concerns the form of the class boundaries. Decision trees,
general Bayesian networks, FFNNs, nearest neighbour classifiers, and SVMs
can represent highly non-linear boundaries. The naive Bayes model using nomi-
nal data can represent only a subset of linear boundaries, but with numeric
data it can represent quite complex non-linear boundaries. Linear regression
is restricted to only linear boundaries, but it tolerates small deviations from
linearity. It should be noted that strong representative power is not
desirable if we have only little data and a simpler, linear model would suf-
fice. The reason is that complex, non-linear models are also more sensitive
to overfitting.
The second criterion, accuracy on small data sets, is crucial for the educa-
tional domain. An accurate classifier cannot be learnt if there is not enough
data. The sufficient amount of data depends on the model complexity. In
practice, we should favour simple models, like naive Bayes classifiers or linear
regression. Support vector machines can produce extremely good results, if
the model parameters are correctly selected. On the other hand, decision
trees, FFNNs, and nearest neighbour classifiers require much larger data sets
to work accurately. The accuracy of general Bayesian classifiers depends on
how complex a structure is used.
Table 1: Comparison of different classification paradigms. Sign + means that
the method supports the property, − that it does not. The abbreviations are
DT = decision tree, NB = naive Bayes classifier, GB = general Bayesian clas-
sifier, FFNN = feed-forward neural network, K-nn = K-nearest neighbour
classifier, SVM = support vector machine, and LR = linear regression.

                             DT   NB   GB   FFNN  K-nn  SVM  LR
Non-linear boundaries        +    (+)  +    +     +     +    −
Accuracy on small data sets  −    +    +/−  −     −     +    +
Works with incomplete data   −    +    +    +     +     −    −
Supports mixed variables     +    +    +    −     +     −    −
Natural interpretation       +    +    +    −     (+)   −    +
Efficient reasoning          +    +    +    +     −     +    +
Efficient learning           +/−  +    −    −     +/−   +    +
Efficient updating           −    +    +    +     +     −    +
The third criterion concerns whether the method can handle incomplete
data, i.e. noise (errors), outliers (which can be due to noise), and missing
values. Educational data is usually clean, but outliers and missing values oc-
cur frequently. Naive and general Bayesian classifiers, FFNNs, and nearest
neighbour models are especially robust to noise in the data. Bayesian classi-
fiers, nearest neighbour models, and some extensions of decision trees can
also handle missing values quite well. However, decision trees are generally
very sensitive to small changes, like noise, in the data. Linear regression can-
not handle missing attribute values at all, and serious outliers can corrupt
the whole model. SVMs are also sensitive to outliers.
The fourth criterion tells whether the method supports mixed variables,
i.e. both numeric and categorical. All methods can handle numeric attributes,
but categorical attributes are problematic for FFNNs, linear regression, and
SVMs.
Natural interpretation is also an important criterion, since all educational
models should be transparent to the learner (e.g. O'Shea et al. (1984)). All
the other paradigms except neural networks and SVMs offer more or less
understandable models. Especially decision trees and Bayesian networks have
a comprehensible visual representation.
The last criteria concern the computational efficiency of classification,
learning, and updating the model. The most important is efficient classifica-
tion, because the system should adapt to the learner's current situation im-
mediately. For example, if the system offers individual exercises for learners,
it should detect when easier or more challenging tasks are desired. The nearest
neighbour classifier is the only one which lacks this property. The efficiency
of learning is not as critical, because it is not done in real time. In some
methods the models can be efficiently updated, given new data. This is an
attractive feature, because often we can collect new data when the model is
already in use.
5 Conclusions
Classification has many applications in both traditional education and mod-
ern educational technology. The best results are achieved when classifiers
can be learnt from real data, but in the educational domain the data sets are
often too small for accurate learning.
In this chapter, we have discussed the main principles which affect clas-
sification accuracy. The most important concern is to select a sufficiently
powerful model, which captures the dependencies between the class attribute
and the other attributes, but which is sufficiently simple to avoid overfitting.
Both data preprocessing and the selected classification method affect this
goal. To help the reader, we have analyzed the suitability of different classi-
fication methods for typical educational data and problems.
References
Baker, R.S., A.T. Corbett, and K.R. Koedinger. 2004. Detecting student
misuse of intelligent tutoring systems. In Proceedings of the 7th interna-
tional conference on intelligent tutoring systems (ITS'04), 531–540. Springer
Verlag.
Barker, K., T. Trafalis, and T.R. Rhoads. 2004. Learning from student data.
In Proceedings of the 2004 IEEE Systems and Information Engineering
Design Symposium, 79–86. Charlottesville, VA: University of Virginia.
Blum, A., and R.L. Rivest. 1988. Training a 3-node neural network is NP-
complete. In Proceedings of the 1988 Workshop on Computational Learning
Theory (COLT), 9–18. MA, USA: MIT.
Bresfelean, V.P., M. Bresfelean, N. Ghisoiu, and C.-A. Comes. 2008. Deter-
mining students' academic failure profile founded on data mining methods.
In Proceedings of the 30th international conference on information tech-
nology interfaces (ITI 2008), 317–322.
Cocea, M., and S. Weibelzahl. 2006. Can log files analysis estimate learn-
ers' level of motivation? In Proceedings of Lernen - Wissensentdeckung -
Adaptivität (LWA2006), 32–35. Hildesheim.
———. 2007. Cross-system validation of engagement prediction from log
files. In Creating new learning experiences on a global scale, proceedings
of the second european conference on technology enhanced learning (EC-
TEL 2007), vol. 4753 of Lecture Notes in Computer Science, 14–25. Springer.
Damez, M., T.H. Dang, C. Marsala, and B. Bouchon-Meunier. 2005. Fuzzy
decision tree for user modeling from human-computer interactions. In
Proceedings of the 5th international conference on human system learning
(ICHSL'05), 287–302.
Dekker, G., M. Pechenizkiy, and J. Vleeshouwers. 2009. Predicting students'
drop out: A case study. In Educational data mining 2009: Proceedings
of the 2nd international conference on educational data mining (EDM'09),
41–50.
Desmarais, M.C., and X. Pu. 2005. A Bayesian student model without hidden
nodes and its comparison with item response theory. International Journal
of Artificial Intelligence in Education 15:291–323.
Domingos, P., and M. Pazzani. 1997. On the optimality of the simple
Bayesian classifier under zero-one loss. Machine Learning 29:103–130.
Duda, R.O., P.E. Hart, and D.G. Stork. 2000. Pattern classification. 2nd ed.
New York: Wiley-Interscience Publication.
Duin, R. 2000. Learned from neural networks. In Proceedings of the 6th
annual conference of the advanced school for computing and imaging (ASCI-
2000), 9–13. Advanced School for Computing and Imaging (ASCI).
Friedman, N., D. Geiger, and M. Goldszmidt. 1997. Bayesian network clas-
sifiers. Machine Learning 29(2–3):131–163.
Hämäläinen, W., T.H. Laine, and E. Sutinen. 2006. Data mining in per-
sonalizing distance education courses. In Data mining in e-learning, ed.
C. Romero and S. Ventura, 157–171. Southampton, UK: WitPress.
Hämäläinen, W., and M. Vinni. 2006. Comparison of machine learning meth-
ods for intelligent tutoring systems. In Proceedings of the 8th international
conference on intelligent tutoring systems, vol. 4053 of Lecture Notes in
Computer Science, 525–534. Springer-Verlag.
Han, Jiawei, and Micheline Kamber. 2006. Data mining: Concepts and tech-
niques. 2nd ed. Morgan Kaufmann.
Hand, D., H. Mannila, and P. Smyth. 2002. Principles of data mining. Cam-
bridge, Massachusetts, USA: MIT Press.
Hecht-Nielsen, R. 1989. Theory of the backpropagation neural network.
In Proceedings of the international joint conference on neural networks
(IJCNN), vol. 1, 593–605. IEEE.
Herzog, S. 2006. Estimating student retention and degree-completion time:
Decision trees and neural networks vis-à-vis regression. New Directions for
Institutional Research, 17–33.
Hinneburg, A., C.C. Aggarwal, and D.A. Keim. 2000. What is the nearest
neighbor in high dimensional spaces? In Proceedings of the 26th international
conference on very large data bases (VLDB 2000), 506–515. Morgan Kauf-
mann.
Huber, P.J. 1981. Robust statistics. Wiley Series in Probability and Mathe-
matical Statistics, New York: John Wiley & Sons.
Hurley, T., and S. Weibelzahl. 2007. Eliciting adaptation knowledge from
on-line tutors to increase motivation. In Proceedings of the 11th international
conference on user modeling (UM2007), vol. 4511 of Lecture Notes in Ar-
tificial Intelligence, 370–374. Berlin: Springer Verlag.
Hyafil, L., and R.L. Rivest. 1976. Constructing optimal binary decision trees
is NP-complete. Information Processing Letters 5(1).
Jain, A.K., P.W. Duin, and J. Mao. 2000. Statistical pattern recognition: a
review. IEEE Transactions on Pattern Analysis and Machine Intelligence
22(1):4–37.
Jolliffe, Ian T. 1986. Principal component analysis. Springer-Verlag.
Jonsson, A., J. Johns, H. Mehranian, I. Arroyo, B. Woolf, A.G. Barto,
D. Fisher, and S. Mahadevan. 2005. Evaluating the feasibility of learn-
ing student models from data. In Papers from the 2005 AAAI workshop on
educational data mining, 1–6. Menlo Park, CA: AAAI Press.
Jutten, Christian, and Jeanny Herault. 1991. An adaptive algorithm based
on neuromimetic architecture. Signal Processing 24:1–10.
Kotsiantis, S.B., C.J. Pierrakeas, and P.E. Pintelas. 2003. Preventing stu-
dent dropout in distance learning using machine learning techniques. In
Proceedings of the 7th international conference on knowledge-based intelligent
information and engineering systems (KES-2003), vol. 2774 of Lecture Notes
in Computer Science, 267–274. Springer-Verlag.
Lee, M.-G. 2001. Profiling students' adaptation styles in web-based learning.
Computers & Education 36:121–132.
Liu, C.-C. 2000. Knowledge discovery from web portfolios: tools for learning
performance assessment. Ph.D. thesis, Department of Computer Science
and Information Engineering, Yuan Ze University, Taiwan.
Ma, Y., B. Liu, C.K. Wong, P.S. Yu, and S.M. Lee. 2000. Targeting the
right students using data mining. In Proceedings of the sixth ACM SIGKDD
international conference on knowledge discovery and data mining (KDD'00),
457–464. New York, NY, USA: ACM Press.
Minaei-Bidgoli, B., D.A. Kashy, G. Kortemeyer, and W. Punch. 2003. Pre-
dicting student performance: an application of data mining methods with
an educational web-based system. In Proceedings of the 33rd frontiers in edu-
cation conference, T2A13–T2A18.
Mitchell, T.M. 1997. Machine learning. New York, NY, USA: McGraw-Hill
Companies.
Mühlenbrock, M. 2005. Automatic action analysis in an interactive learning
environment. In Proceedings of the workshop on usage analysis in learning
systems at AIED-2005, 73–80.
Nghe, N. Thai, P. Janecek, and P. Haddawy. 2007. A comparative analysis
of techniques for predicting academic performance. In Proceedings of the
37th ASEE/IEEE frontiers in education conference, T2G7–T2G12.
O'Shea, T., R. Bornat, B. du Boulay, and M. Eisenstadt. 1984. Tools for creat-
ing intelligent computer tutors. In Proceedings of the international NATO
symposium on artificial and human intelligence, 181–199. New York, NY,
USA: Elsevier North-Holland, Inc.
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: networks of
plausible inference. San Mateo, California: Morgan Kaufmann Publishers.
Quinlan, J.R. 1986. Induction of decision trees. Machine Learning 1(1):
81–106.
———. 1993. C4.5: programs for machine learning. Morgan Kaufmann.
Romero, C., S. Ventura, P.G. Espejo, and C. Hervás. 2008. Data mining
algorithms to classify students. In Educational data mining 2008: Proceed-
ings of the 1st international conference on educational data mining, 8–17.
Russell, S.J., and P. Norvig. 2002. Artificial intelligence: A modern approach.
2nd ed. Prentice Hall.
Superby, J.F., J-P. Vandamme, and N. Meskens. 2006. Determination of
factors influencing the achievement of the first-year university students
using data mining methods. In Proceedings of the workshop on educational
data mining at ITS'06, 37–44.
Valentini, G., and F. Masulli. 2002. Ensembles of learning machines, vol.
2486 of Lecture Notes in Computer Science, 3–22. Springer-Verlag. Invited
Review.
Vapnik, V.N. 1998. Statistical learning theory. John Wiley & Sons.
Vomlel, J. 2004. Bayesian networks in educational testing. International
Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12(Sup-
plementary Issue 1):83–100.
Witten, Ian H., and Eibe Frank. 2005. Data mining: Practical machine
learning tools and techniques. 2nd ed. San Francisco: Morgan Kaufmann.
Xycoon. 2000-2006. Linear regression techniques. In Statistics - Econometrics
- Forecasting (Online Econometrics Textbook), chap. II. Office for Research
Development and Education. Available at http://www.xycoon.com/. Re-
trieved 1.1.2006.
Zang, W., and F. Lin. 2003. Investigation of web-based teaching and learning
by boosting algorithms. In Proceedings of the IEEE International Conference
on Information Technology: Research and Education (ITRE 2003), 445–449.