Thong Le Mai, Phat Thanh Do, Minh Thanh Chung, Nam Thoai
Faculty of Computer Science & Engineering
Ho Chi Minh City University of Technology, Vietnam
Email: 1513293,1512400,ctminh,[email protected]
Abstract—Nowadays, Educational Data Mining (EDM) plays a very important role in higher education institutions. Plenty of algorithms have been employed to predict students' GPA in the next semester's courses. The results can be used to identify dropout students early or to help students choose the elective courses that are appropriate for them. The most widely used methods are based on machine learning; however, their accuracy can change from dataset to dataset. More importantly, the performance of a prediction model can be affected by the characteristics of the dataset to which it is applied. In this paper, we build a distributed platform on Spark to predict the missing grades of elective courses for undergraduate students. The paper compares several methods based on the combination of Collaborative Filtering and Matrix Factorization (namely Alternating Least Squares). We evaluate the performance of these algorithms using a dataset provided by Ho Chi Minh City University of Technology (HCMUT). The dataset consists of information about undergraduate students from 2006 to 2017. Given the characteristics of our dataset, the paper highlights that Alternating Least Squares with a non-negative constraint achieves better results than the other methods in comparison.

Index Terms—Educational Data Mining, Spark, prediction, student performance, distributed system, machine learning

1. Introduction

Educational Data Mining (EDM) is a research field that applies data mining techniques to analyze patterns in data from educational contexts [1]. Currently, many learning systems gather large amounts of educational data, such as Learning Management Systems (LMS) and Massive Open Online Courses (MOOCs). The applications and tasks in EDM can be divided into different categories, depending on their properties. Based on the literature review of 2013 [2], we consider two main groups in EDM: "Student Modeling" and "Decision Support Systems". Concretely, predicting student performance belongs to the "Student Modeling" group. The most widely used methods for this problem are regression and classification, but other methods such as clustering and feature selection have also been used.

Prediction has been one of the most dominant tasks in EDM since 1995 [3]. Related research usually exploits influential factors from the university's data, such as GPA, age, and sex, to build a prediction model. Many machine learning algorithms are used to solve these problems, including Decision Trees, Random Forests, Regression, and Neural Networks [4] [5] [6]. Other techniques, frequently based on Recommender Systems (e.g., Collaborative Filtering, Matrix Factorization), have also found considerable success [7] [8] [9] [10] [11]. Furthermore, some studies investigate whether different types of predictor variables beyond the explicit ratings, such as age, sex, online time, and response efficiency, improve the accuracy [6] [12]. However, the open problems, associated with the ease of collecting educational data today, are the scale of the prediction models and making the applied prediction methods aware of the characteristics of each dataset.

In this paper, we propose a distributed platform based on Spark to predict the scores of future courses for undergraduate students in our university. We obtain different prediction algorithms by combining techniques based on Recommender Systems: Collaborative Filtering (namely User-based Collaborative Filtering and Item-based Collaborative Filtering), Matrix Factorization (using Alternating Least Squares), and Non-negative Matrix Factorization (Alternating Least Squares with an added non-negative constraint). The original dataset is grouped by faculty: Industrial Maintenance, Chemical Engineering, Civil Engineering, and so on. For the evaluation, the paper applies our proposed methods to the data of four faculties: Computer Science and Engineering (MT), Environment and Natural Resources (MO), Chemical Engineering (HC), and Civil Engineering (XD). Our experiments highlight that the best prediction model depends on the characteristics of the dataset; the performance can change from dataset to dataset. Furthermore, the accuracy of each prediction model is also affected by how the dataset used to train it is divided or chosen. In some cases, the experiments show a difference when the prediction models are trained on the students' data of the whole university versus those of a single faculty.

The rest of this paper is organized as follows. Section 2 presents related work on methods for predicting student performance. Section 3 gives the background on the algorithms and techniques that are used. We describe the dataset provided by Ho Chi Minh City University of Technology in Section 4. Section 5 presents the proposed platform and methods. In Section 6, we present the experimental results, and we highlight the conclusions as well as future work in Section 7.
2. Related Work

In terms of Educational Data Mining (EDM), one of the most common tasks is filtering out information that can be used to predict student performance [3] [1]. Generally, many studies have been conducted in order to predict students' grades as well as to identify at-risk students. Currently, machine learning algorithms are used to solve the prediction problem in this field, and some of them are based on techniques frequently used in Recommender Systems [13].

Romero et al. have used classification algorithms such as Decision Trees, Rule Induction, and Neural Networks to predict students' final marks based on information in an e-learning system [5]. Random forest was employed in [4] to examine the statistical relationship between students' graduate-level performance and their undergraduate achievements. García et al. have applied association rule mining to discover interesting information from students' usage data in the form of IF-THEN recommendation rules, in order to build a system that helps teachers continually improve and maintain adaptive and non-adaptive e-learning courses [14]. Nurjanah et al. proposed a technique for recommending learning materials which combines content-based filtering and collaborative filtering [7], [8]. In detail, content-based filtering is first applied to filter out relevant materials; then, collaborative filtering is used to select good students. This technique aims to reduce the drawback of classic collaborative filtering, which recommends materials based only on the similarity between students and does not take students' competence into account. The resulting model achieved an MAE of 0.96 on a scale of 1 to 10 and 0.73 on a scale of 1 to 5. In 2017, Iqbal et al. applied and evaluated Collaborative Filtering, Matrix Factorization, and Restricted Boltzmann Machines (RBM) to predict student grades on a dataset consisting of 225 students and 24 courses, with 1736 available grades and 3664 missing grades (grades given on a scale of 0-4). They concluded that the RBM model gives the best result with an RMSE of 0.3, nearly half that of the second best model (Non-negative Matrix Factorization), which had an RMSE of 0.57 [9]. In addition, Conijn et al. analyzed 17 blended courses with 4,989 students, using 23 predictor variables collected from the Moodle Learning Management System (LMS) [6]. They found that the LMS data give a larger improvement in prediction when in-between assessment grades are unavailable than when such grades are available; thus, the LMS data in this dataset have substantially smaller predictive value compared to the midterm grades.

Concerning another technique, Thai-Nghe et al. used matrix factorization to predict student performance on the Knowledge Discovery and Data Mining Challenge 2010 dataset. They showed that matrix factorization could improve prediction results compared to traditional regression methods such as logistic/linear regression [10]. Furthermore, in their follow-up paper [11], they extended the research, using tensor-based factorization to take the temporal effect into account when predicting student performance. Feng et al. [12] have taken advantage of student-system interaction information that is normally not available in traditional practice tests, such as the time students take to answer questions and the time they take to correct an answer they got wrong. The addition of this information is shown to produce better predictions than traditional models which only use the correctness of test questions. Elbadrawy et al. attempted to use personalized multi-regression and matrix factorization to forecast students' grades on in-class assessments [15]. The results revealed that these methods can achieve a lower error rate than traditional methods.

Furthermore, a number of studies review the existing types of educational systems and the methods applied in EDM. Romero et al. have categorized EDM research into two large groups [16]:

• Statistics and visualization
• Web mining

Web mining is considered a prominent group of EDM because many methods revolve around the analysis of logs of student-computer interaction. [1] examined around three hundred papers published up to 2009, grouped by task/category such as recommendation, predicting performance, detecting behavior, analysis, and visualization. [3] investigated the major trends in EDM research. They found that 43% of the papers examined in [16] that were published between 1995 and 2005 centered around relationship mining methods. However, in 2008 and 2009, relationship mining slipped to fifth place with only 9% of the papers. On the other hand, prediction, which was in second place between 1995 and 2005, moved to the dominant position in 2008-2009.
3. Background

3.1. Spark cluster overview

Today, Spark [17] is used in a wide range of areas to analyze large-scale datasets. Figure 1 gives an overview of how Spark performs tasks on clusters. Applications or user-defined programs run as independent sets of processes, coordinated by the SparkContext object defined in the code (the Driver Program). The SparkContext is responsible for allocating resources across programs. Once connected, Spark acquires executors on worker nodes in the cluster, where the processes of the user's program run and store data. Next, Spark sends the program code (written in Java or Python) to the executors. Finally, the SparkContext sends tasks to the executors to run.
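As a minimal illustration of this driver/executor model, the sketch below creates a SparkSession (which wraps the SparkContext) and runs a trivial distributed job. The master URL and application name are assumptions for illustration, not the configuration used in this paper.

```python
# Minimal sketch of a Spark driver program; master URL and app name are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("grade-prediction-demo")       # assumed application name
         .master("spark://master-node:7077")     # assumed standalone cluster URL
         .getOrCreate())
sc = spark.sparkContext                          # the SparkContext described above

# The driver defines the job; the executors on the worker nodes run the tasks.
squares = sc.parallelize(range(10)).map(lambda x: x * x).collect()
print(squares)

spark.stop()
```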
3.2. Data mining techniques

3.2.1. Collaborative Filtering. Collaborative Filtering is a popular algorithm that is commonly used in recommender systems. A recommender system focuses on suggesting a set of items to users based on their existing ratings.
Item-based Collaborative Filtering: IBCF is used in the case where the set of courses rarely changes. In this algorithm, the student's grade in a course can be predicted by identifying similar courses that have been taken by the current student. Instead of identifying the most similar students as in UBCF, the IBCF algorithm determines the most similar courses from the set of courses that the current student has taken. The predictions are made by selecting and aggregating the grades of those courses. To predict a student's grade in a course:

• Firstly, the IBCF algorithm calculates the similarity matrix between the courses to determine how similar each course in the database is to the course that needs to be predicted.
• Then, the algorithm selects the most similar courses among those taken by the active student, using the k-nearest neighbors algorithm.
• Similar to UBCF, the prediction result is made by aggregating the grades of the most similar courses.

3.2.2. Matrix Factorization. To learn the matrices $U$ and $V$, we minimize the cost function

$$\sum_{(u,i)\in H} \left(r_{ui} - p_u q_i^T\right)^2 + \lambda\left(\lVert p_u\rVert^2 + \lVert q_i\rVert^2\right) \qquad (3)$$

where $H$ is the set of $(u, i)$ pairs for which the grade $r_{ui}$ is in the training set, and $\lambda$ is the regularization parameter.

Using gradient descent, for each observed grade $r_{ui}$ in the training set, the vectors $p_u$ and $q_i$ are updated as

$$p_u \leftarrow p_u + \gamma\left[\left(r_{ui} - p_u q_i^T\right) q_i - \lambda p_u\right]$$
$$q_i \leftarrow q_i + \gamma\left[\left(r_{ui} - p_u q_i^T\right) p_u - \lambda q_i\right] \qquad (4)$$

where $\gamma$ is the learning rate.
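To make update rule (4) concrete, the following is a minimal NumPy sketch of the stochastic gradient descent loop over observed grades. It is an illustration only, not the implementation used on our platform, and the rank, learning rate, regularization, and toy data are assumptions.

```python
# Minimal sketch of the SGD updates of Eq. (4); hyperparameters and data are assumptions.
import numpy as np

def sgd_mf(triples, n_students, n_courses, rank=8, gamma=0.01, lam=0.1, epochs=200):
    """triples: iterable of (u, i, r) with student index u, course index i, grade r."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_students, rank))   # student factors p_u
    Q = 0.1 * rng.standard_normal((n_courses, rank))    # course factors q_i
    for _ in range(epochs):
        for u, i, r in triples:
            p_old = P[u].copy()
            err = r - p_old @ Q[i]                       # r_ui - p_u q_i^T
            P[u] += gamma * (err * Q[i] - lam * p_old)   # update p_u
            Q[i] += gamma * (err * p_old - lam * Q[i])   # update q_i
    return P, Q

# Toy (student, course, grade) triples on a 0-10 grade scale.
data = [(0, 0, 9.0), (0, 1, 8.0), (1, 0, 8.0), (1, 1, 7.0), (1, 2, 7.5),
        (1, 3, 8.5), (2, 0, 7.5), (2, 1, 8.5), (2, 2, 7.5)]
P, Q = sgd_mf(data, n_students=3, n_courses=4)
print(P[0] @ Q[2])   # predicted grade of student 0 in course 2 (toy illustration)
```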
Alternating Least Squares (ALS): This method is one of the optimization techniques for the Singular-Value Decomposition (SVD) [21] approach. Recall Equation 3, in which both $p_u$ and $q_i$ are unknown and tied to each other in a multiplication, which makes the problem non-convex. The idea of ALS is that when we fix one of the unknown variables, either $p_u$ or $q_i$, the cost function becomes a quadratic problem. In each iteration, ALS first fixes $U$ (all vectors $p_u$) and solves for $V$; then it fixes $V$ (all vectors $q_i$) and solves for $U$. The process is repeated until convergence. In ALS, each $p_u$ is independent of the other $p_{u'}$ with $u' \neq u$, and each $q_i$ is independent of the other $q_{i'}$ with $i' \neq i$. Therefore, the algorithm can be massively parallelized.
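To make the alternation concrete, the sketch below solves the two regularized least-squares subproblems in turn with NumPy. It is a toy, single-machine illustration under assumed rank, regularization, and iteration count, not the distributed Spark implementation evaluated later in this paper.

```python
# Toy ALS sketch: alternately fix one factor matrix and solve for the other.
# Rank, lambda, and the number of iterations are assumptions for illustration.
import numpy as np

def als(R, mask, rank=8, lam=0.1, iters=15):
    """R: grade matrix (students x courses); mask: 1 where a grade is observed, else 0."""
    n_students, n_courses = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_students, rank))   # student factors
    Q = 0.1 * rng.standard_normal((n_courses, rank))    # course factors
    I = np.eye(rank)
    for _ in range(iters):
        # Fix Q and solve an independent ridge-regression problem for each p_u.
        for u in range(n_students):
            obs = mask[u] == 1
            P[u] = np.linalg.solve(Q[obs].T @ Q[obs] + lam * I, Q[obs].T @ R[u, obs])
        # Fix P and solve an independent ridge-regression problem for each q_i.
        for i in range(n_courses):
            obs = mask[:, i] == 1
            Q[i] = np.linalg.solve(P[obs].T @ P[obs] + lam * I, P[obs].T @ R[obs, i])
    return P, Q
```

Because every $p_u$ (and every $q_i$) is solved independently of the others, the two inner loops are exactly what a Spark implementation can distribute across the executors of the cluster.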
Non-negative Matrix Factorization: This is another type of matrix factorization in which a non-negativity constraint is added. Given a non-negative matrix $G$ that contains all observed grades, we need to find non-negative matrix factors $W$ and $H$ [22]:

$$G \approx W \times H \qquad (5)$$

• $W$ is a non-negative $m \times r$ matrix.
• $H$ is a non-negative $r \times n$ matrix.

With normal matrix factorization, we can obtain a negative affinity between a student $u$ and a latent factor $k$, which can be hard to interpret (e.g., the difficulty of a latent factor can be negative). Non-negative matrix factorization gives a better representation of the latent factors by guaranteeing non-negative values, so a course can be described as a collection of smaller knowledge domains. For example, a Computer Graphics course can be interpreted as a collection of 60% algebra + 20% math + 20% art + 0% literature.
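In practice, both the plain and the non-negative variants can be trained with the ALS estimator of Spark MLlib on (student, course, grade) triples. The sketch below is an assumed, minimal configuration rather than the exact setup of our platform; the data, column names, rank, and regularization are illustrative, and setting nonnegative=True switches from plain ALS to the non-negative constrained version.

```python
# Sketch: training ALS / non-negative ALS on (student, course, grade) triples in Spark MLlib.
# Data, column names, and hyperparameters are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("grade-prediction-als").getOrCreate()

grades = spark.createDataFrame(
    [(1511000, 0, 9.0), (1511000, 1, 8.0), (1512000, 0, 8.0), (1512000, 1, 7.0),
     (1512000, 2, 7.5), (1512000, 3, 8.5), (1513000, 0, 7.5), (1513000, 1, 8.5),
     (1513000, 2, 7.5)],
    ["student", "course", "grade"])

train, test = grades.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="student", itemCol="course", ratingCol="grade",
          rank=8, regParam=0.1, maxIter=10,
          nonnegative=True,               # False: plain ALS; True: non-negative constraint
          coldStartStrategy="drop")
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="grade",
                           predictionCol="prediction").evaluate(model.transform(test))
print("RMSE:", rmse)

# The learned course factor matrix (one row of hidden features per course); Section 5.4
# reuses these hidden features as the input of IBCF.
model.itemFactors.show(truncate=False)
```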
4. Dataset

The educational dataset is collected from Ho Chi Minh City University of Technology in Vietnam. The dataset contains data of 61271 undergraduate students with over three million records about students' grades. There are 35 columns in the dataset; however, it is reprocessed down to the influential fields. In the context of this work, we focus on predicting the average grades of some unfinished courses of the students. Therefore, the basic foundation of our proposed module is the relation between finished courses and unfinished courses. In particular, this also depends on the educational program of each university.

Table 1: The statistics of the dataset

Number of faculties            14
Number of courses              2389
Number of students             61271
Number of student's grades     2270045
Sparsity                       0.9845

In HCMUT, there are in total 14 faculties, 2389 courses, and 61271 students in the given dataset (from 2006 until 2017), as shown in Table 1. The undergraduate program is finished in 4 to 4.5 years, and the courses are divided into three groups: Basic Courses, Compulsory Courses, and Core & Elective Courses. The courses correspond to the year level at which students are studying. Figure 2 shows the program of undergraduate students in the Faculty of Computer Science & Engineering. This highlights that our module only predicts the scores of unfinished and elective courses. Fundamentally, the information from finished courses (such as basic and compulsory courses) is the foundation for the prediction of unfinished courses.

Besides the 35 columns of information in the dataset, we reprocess it into 4 main fields, as shown in Table 2. In the core of our prediction module, the proposed algorithms use these fields to train the prediction model. In detail, we have the student ID, the name of the faculty, the course ID, and the average grade computed from the component grades of each course.

Table 2: Educational dataset

Student    Faculty                            Course               Grade
29081892   Computer Science and Engineering   Data Mining          7.5
28193782   Chemical Engineering               Calculus             9.0
32876719   Mechanical Engineering             Physical Education   7.5
...        ...                                ...                  ...

Each record is the information about a student's grade in the corresponding course. The grades are on a scale from 0 to 10 and stored as doubles. The distribution of students' grades is shown in Figure 3; the most common range of undergraduate grades is from 5 to 8.5, and the sparsity of the dataset is 0.9845. The sparsity is calculated by the following formula:

$$S = 1 - \frac{G}{N \cdot C} \qquad (6)$$

where $G$, $N$, and $C$ are the total numbers of student grades, students, and courses, respectively. For the whole dataset, $S = 1 - 2270045/(61271 \cdot 2389) \approx 0.9845$. Furthermore, we also show an overview of all faculties of our university in the dataset, e.g., the number of courses, students, and grades, in Table 3.

5. Methods

Apache Spark [17] has a well-defined layered architecture in which all components are loosely coupled. This architecture can be further integrated with various extensions and libraries. The core of Spark is the base engine for large-scale parallel and distributed data processing. Our platform is built on top of this core, which allows the predictor to be modified and improved further for different demands.

In our platform, there are 4 main modules: preprocessing, training, deployment, and recommender. Figure 4 shows the workflow and an overview of our prediction platform; the white boxes are the inputs, outputs, or the definitions of states in the related modules. The paper aims at predicting students' scores for unfinished and elective courses in the third and fourth year. The main idea is the machine learning core with the insights from the educational data.
Figure 2: An example of the educational program in the Faculty of Computer Science & Engineering.

Table 3: Overview of the faculties in the dataset

Faculty                                  Notation   # courses   # students   # student's grades   Sparsity
Computer Science and Engineering         MT         168         5158         155574               0.8205
Industrial Maintenance                   BD         116         1958         54976                0.7580
Mechanical Engineering                   CK         435         9233         351539               0.9125
Geology & Petroleum Engineering          DC         207         2476         93516                0.8175
Electrical and Electronic Engineering    DD         325         9391         360546               0.8819
Transportation Engineering               GT         230         2323         88510                0.8343
Chemical Engineering                     HC         322         6117         222478               0.8870
Environment and Natural Resources        MO         177         2401         90633                0.7867
Energy Engineering                       PD         89          565          16912                0.6637
Industrial Management                    QL         137         3577         104514               0.7867
Applied Sciences                         UD         192         2099         78986                0.8040
Materials Technology                     VL         183         2910         109612               0.7942
Training Program of Excellent Engineers
in Vietnam (PFIEV)                       VP         309         1515         92040                0.8039
Civil Engineering                        XD         445         11691        450209               0.9135

Figure 3: Distribution of student's grades
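The prediction models of the following subsections consume the reprocessed (student, faculty, course, grade) records of Table 2. As a minimal sketch of what the preprocessing module could look like with the Spark DataFrame API, the snippet below loads a raw export, keeps the four fields, and filters one faculty; the file path and raw column names are assumptions, not the real schema of the HCMUT export.

```python
# Sketch of the preprocessing module; path and raw column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("grade-preprocessing").getOrCreate()

raw = spark.read.csv("hdfs:///edu/raw_grades.csv", header=True, inferSchema=True)

grades = (raw
          .select(F.col("student_id").alias("student"),
                  F.col("faculty_name").alias("faculty"),
                  F.col("course_id").alias("course"),
                  F.col("avg_grade").cast("double").alias("grade"))
          .where(F.col("grade").isNotNull())
          # Train per faculty, e.g., Computer Science and Engineering (MT).
          .where(F.col("faculty") == "Computer Science and Engineering"))

grades.show(5)
```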
5.3. User-based and Item-based Collaborative Filtering

Combined with the DataFrame API [23] supported in Spark, we implement prediction models based on the principles of UBCF and IBCF. To illustrate how this method works, consider the sample data in Table 4 (shown as a grade matrix in Table 5) and the task of predicting the grade of student 1511000 in course 2. With UBCF, the algorithm calculates the similarity between student 1511000 and the other students who have taken course 2 (using cosine similarity [19]), as shown in Table 6. Each student's feature vector contains the student's grades in [course 0, course 1, course 2, course 3].

Table 4: An example of the dataset

Student ID   Course ID   Grade
1511000      0           9
1511000      1           8
1512000      1           7
1512000      0           8
1512000      2           7.5
1512000      3           8.5
1513000      0           7.5
1513000      2           7.5
1513000      1           8.5
1511000      2           ?

Table 5: Representing the students' grades as a matrix

           0     1     2     3
1511000    9     8     ?     ?
1512000    8     7     7.5   8.5
1513000    7.5   8.5   7.5   ?

Table 6: Similarity score between student 1511000 and the other students

1511000's feature              [9, 8, 0, 0]
1512000's feature              [8, 7, 7.5, 8.5]
1513000's feature              [7.5, 8.5, 7.5, 0]
Similarity(1511000, 1512000)   0.68402
Similarity(1511000, 1513000)   0.82787

Finally, the UBCF algorithm predicts the grade of student 1511000 in course 2 by combining the grades of students 1512000 and 1513000 in course 2 with the corresponding similarities:

$$\text{predicted score} = \frac{0.68402 \cdot 7.5 + 0.82787 \cdot 7.5}{0.68402 + 0.82787} = 7.5$$
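The similarities in Table 6 and the 7.5 prediction can be reproduced with a few lines of plain Python. This is only a sketch of the UBCF computation on the toy data above, not the Spark implementation of our platform.

```python
# Sketch: reproducing Table 6 and the UBCF prediction for student 1511000 in course 2.
import numpy as np

# Feature vectors from Table 6 (missing grades encoded as 0).
features = {
    1511000: np.array([9.0, 8.0, 0.0, 0.0]),
    1512000: np.array([8.0, 7.0, 7.5, 8.5]),
    1513000: np.array([7.5, 8.5, 7.5, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target, course = 1511000, 2
neighbors = [s for s in features if s != target and features[s][course] > 0]
sims = {s: cosine(features[target], features[s]) for s in neighbors}
print(sims)  # approximately {1512000: 0.68402, 1513000: 0.82787}

# Similarity-weighted average of the neighbors' grades in course 2.
prediction = sum(sims[s] * features[s][course] for s in neighbors) / sum(sims.values())
print(round(prediction, 5))  # 7.5
```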
For IBCF, the algorithm calculates the similarity between the predicted course (course 2) and the other courses taken by student 1511000, instead of calculating the similarity of students. Each course's feature vector contains the grades of the students [1511000, 1512000, 1513000] in that course, as shown in Table 7.

Table 7: Similarity score between course 2 and the other courses

course 0's feature               [9, 8, 7.5]
course 1's feature               [8, 7, 8.5]
course 2's feature               [0, 7.5, 7.5]
Similarity(course 2, course 0)   0.77259
Similarity(course 2, course 1)   0.80526

Finally, the IBCF algorithm predicts the grade of student 1511000 in course 2 by combining the grades of student 1511000 in course 0 and course 1 with the corresponding similarities:

$$\text{predicted score} = \frac{0.77259 \cdot 9 + 0.80526 \cdot 8}{0.77259 + 0.80526} = 8.48960$$

5.4. Item-based Collaborative Filtering on the Course Factor Matrix of Matrix Factorization

The course factor matrices produced by the ALS algorithms can be used to calculate the similarity among courses. The course factor matrix represents the relation between the courses and the hidden features in the ALS algorithms, so we can use those hidden features as the inputs of IBCF.
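A small sketch of this idea, assuming an already fitted Spark ALS model (such as the one in the sketch of Section 3.2) and hypothetical course IDs: the rows of itemFactors are treated as the course feature vectors and compared with cosine similarity, replacing the raw-grade features of Table 7.

```python
# Sketch: IBCF on the ALS course factor matrix; `model` is an assumed, already fitted ALSModel.
import numpy as np

# Collect the course factor matrix: one hidden-feature vector per course.
factors = {row["id"]: np.array(row["features"]) for row in model.itemFactors.collect()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target_course = 2   # hypothetical ID of the course whose grade we want to predict
similarities = {c: cosine(factors[target_course], f)
                for c, f in factors.items() if c != target_course}

# These similarities take the place of the raw-grade similarities of Table 7; the
# prediction is again a similarity-weighted average of the student's known grades.
print(sorted(similarities.items(), key=lambda kv: -kv[1])[:5])
```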
6. Results

Table 9: The detailed information of the proposed algorithms

Name          Algorithm
Baseline      Taking the average of all visible grades of each student
IBCF          Item-based Collaborative Filtering
UBCF          User-based Collaborative Filtering
ALS           Alternating Least Squares
ALS NN        Alternating Least Squares with non-negative constraint
ALS NN IBCF   Item-based Collaborative Filtering on the non-negative ALS course factor matrix
ALS IBCF      Item-based Collaborative Filtering on the normal ALS course factor matrix

6.1. Testing environment

Our experiments are performed on the cluster named SuperNode-XP, which is a heterogeneous cluster with 24 compute nodes. Each node has 2 CPU sockets (Intel Xeon E5-2680 v3 @ 2.70 GHz), 2 Intel Xeon Phi 7120P (Knights Corner) cards, and 128 GB of RAM. The Spark cluster in this paper is built on 4 of these nodes: 1 master and 3 workers. In addition, the software stack consists of:

Table 8: Software specification of the testing environment
References

[2] K. Chrysafiadi and M. Virvou, "Student modeling approaches: A literature review for the last decade," Expert Systems with Applications, vol. 40, no. 11, pp. 4715–4729, 2013.
[3] R. S. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future visions," Journal of Educational Data Mining, vol. 1, pp. 601–618, Dec. 2009.
[4] J. Zimmermann, K. H. Brodersen, J.-P. Pellet, E. August, and J. Buhmann, "Predicting graduate-level performance from undergraduate achievements," Jul. 2011, pp. 357–358.
[5] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, "Data mining algorithms to classify students," 1st International Conference on Educational Data Mining, pp. 8–17, Jun. 2008.
[6] R. Conijn, C. Snijders, A. Kleingeld, and U. Matzat, "Predicting student performance from LMS data: A comparison of 17 blended courses using Moodle LMS," IEEE Transactions on Learning Technologies, 2017.
[7] D. Nurjanah, "Good and similar learners' recommendation in adaptive learning systems," Conference on Computer Supported Education, vol. 1, pp. 434–440, 2016.
[8] R. Turnip, D. Nurjanah, and D. Kusumo, "Hybrid recommender system for learning material using content-based filtering and collaborative filtering with good learners' rating," Nov. 2017, pp. 61–66.
[9] Z. Iqbal, J. Qadir, A. N. Mian, and F. Kamiran, "Machine learning based student grade prediction: A case study," Aug. 2017.
[10] N. Thai-Nghe, L. Drumond, A. Krohn-Grimberghe, and L. Schmidt-Thieme, "Recommender system for predicting student performance," Procedia Computer Science, vol. 1, pp. 2811–2819, 2010.
[22] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of the 13th International Conference on Neural Information Processing Systems, vol. 13, Jan. 2000, pp. 535–541.
[23] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi et al., "Spark SQL: Relational data processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015, pp. 1383–1394.