
Deep Model for Dropout Prediction in MOOCs

Wei Wang
LILY, Interdisciplinary Graduate School, Nanyang Technological University, Singapore
[email protected]

Han Yu
Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore
[email protected]

Chunyan Miao
School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected]

ABSTRACT
Dropout prediction research in MOOCs aims to predict whether students will drop out of their courses instead of completing them. Due to the high dropout rates in current MOOCs, this problem is of great importance. Current methods rely on features extracted by feature engineering, in which all features are extracted manually. This process is costly, time consuming, and not extensible to new datasets from different platforms or different courses with different characteristics. In this paper, we propose a model that can automatically extract features from the raw MOOC data. Our model is a deep neural network that combines Convolutional Neural Networks and Recurrent Neural Networks. Through extensive experiments on a public dataset, we show that the proposed model can achieve results comparable to feature engineering based methods.

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification; Neural networks; • Applied computing → E-learning;

KEYWORDS
Deep Learning, MOOCs, Dropout Prediction

ACM Reference Format:
Wei Wang, Han Yu, and Chunyan Miao. 2017. Deep Model for Dropout Prediction in MOOCs. In Proceedings of ICCSE'17, Beijing, China, July 6–9, 2017, 7 pages. https://doi.org/10.1145/3126973.3126990

1 INTRODUCTION
Massive Open Online Courses (MOOCs) [6, 20] have witnessed fast development in recent years. Through MOOCs, students around the world have the opportunity to take online courses via the Internet. This breaks the limitations of traditional classroom courses: in MOOCs, students have more flexibility in terms of when and where to take the courses. In this way, MOOCs have attracted more and more students and become increasingly popular.

Despite their rapid development and successes, there are still some problems within MOOCs. One prominent problem is the high dropout rate: most students taking online courses do not complete them and drop out halfway. This could be a potential factor hindering the development of MOOCs. Making effective predictions of whether a student will drop out is therefore of great value for MOOC platforms.

In the dropout prediction problem, the data we have are the raw activity records of students in the online course platform over a period of time. The prediction we need to make is whether these students will drop out of their courses in the future. This is a binary classification problem.

To solve this problem, several methods [1, 16, 21, 25, 27] have been proposed in recent years. Most of these methods follow the standard way of solving classification problems: in the first step, features are extracted from the raw activity records; in the second step, classification is performed via classification algorithms.

Although inspiring results have been achieved by these methods, some problems remain. One salient problem is the way features are extracted in the first step. In these methods, feature extraction is usually accomplished via feature engineering [19, 29]. Feature engineering is a process of manually extracting features from the raw data. This process is carried out in a heuristic manner, and all features are extracted by hand. The people who extract the features need to be familiar with the dataset and have the corresponding domain knowledge, and the extraction typically requires several iterations of feature extraction and testing. This makes the process very time consuming and inconsistent. Moreover, strategies for extracting features are specific to the characteristics of a dataset: strategies effective for one kind of dataset may not be effective for another, and for new kinds of datasets, new extraction strategies must be developed manually.

These drawbacks of feature engineering are especially serious for dropout prediction in MOOCs. In this problem, activity records coming from different platforms and different courses often have different characteristics in both format and content. Effective strategies for extracting features from one dataset may not be effective on another. In practice, we often encounter new datasets coming from new platforms or new courses, and in such cases we need to develop new feature extraction strategies.

Because of these drawbacks of feature engineering, approaches that can automatically extract features are required. One promising characteristic of deep learning models is that they can extract features automatically from the raw input [3]. As a type of deep learning
model, Convolutional Neural Networks (CNN) [17] are widely used for feature extraction in many areas.

The MOOC activity records are time series data, and this characteristic should also be considered when making dropout predictions. To make use of it, another type of deep learning model, the Recurrent Neural Network (RNN), is promising.

In this paper, we propose a deep learning model that combines CNN and RNN in a bottom-up manner. In our model, features are extracted automatically from the raw records by convolutional and pooling layers in the lower part of the model, while the characteristics of the time series data are captured by a recurrent layer in the upper part. As our model can be seen as a combination of CNN and RNN, we call it the ConRec Network. The advantage of our model is that features are extracted automatically, so no feature engineering is needed; this saves time and human effort in model training. To evaluate the effectiveness of the proposed model, we conduct experiments on a public dataset. Experimental results show that our model can achieve results comparable to those obtained by feature engineering based methods.

2 RELATED WORK
In this section, we give a brief summary of related work in the areas of MOOC dropout prediction and deep learning.

2.1 Dropout Prediction in MOOCs
A general definition of "dropout" in MOOCs is not having activity records in a period of time. In the current literature, the specific definition varies from paper to paper, according to the differences in the datasets and prediction purposes they adopt.

These works also adopted different classification algorithms and different approaches to extracting features. In [16], features were extracted from click-stream data, and a linear SVM was used to predict dropout in each week. In [1], features were also extracted from click-stream data, and an SVM with RBF kernel was used to predict dropout. In [27], logistic regression was used to predict dropout based on the last activity of a student in the corresponding course. Sinha et al. [25] only used video click-stream interaction data to extract features. Mi and Yeung [21] regarded dropout prediction as a sequential prediction problem and used an RNN for prediction under different definitions of dropout. Li et al. [18] converted dropout prediction into a semi-supervised learning problem and used multi-training to solve it.

In MOOCs, a problem similar to dropout prediction is completion prediction. Instead of predicting whether a student will drop out of a course, it aims to predict whether a student can complete a course and obtain the corresponding certificate. He et al. [13] used logistic regression to identify students who appear unable to complete the course. Qiu et al. [24] used a latent dynamic factor graph model for the prediction.

There are also some works that jointly considered activity records and completion of the course, and formulated prediction problems suitable for their problem settings [23, 31].

2.2 Deep Learning
Deep learning [2, 8, 11] is a subfield of machine learning. It has been successfully applied in many areas in recent years. In our model, we mainly utilize two types of deep learning models, CNN and RNN.

CNNs are a kind of neural network used to deal with data that have a grid-like topology [11]. They have been successfully used in many areas, such as image object recognition [17], video classification [15], natural language processing [4], speech recognition [7] and human activity recognition [32]. In a CNN, each convolutional layer only finds relationships among adjacent elements. As a CNN has a layer-wise architecture, lower layers can find local patterns, while upper layers can find patterns over the whole scale of the input. CNNs are often used as a tool for extracting features from raw data.

RNNs are a kind of neural network used to process sequential data. They have been successfully used in many areas, such as speech recognition [12], question answering [30] and machine translation [26]. In an RNN, the input is a sequence, such as a time series, a biological sequence, or a speech or language sequence. The RNN processes the elements of the input sequentially to acquire the information contained in the whole sequence.

By combining CNN and RNN into a new architecture, we propose a framework to address the MOOC dropout prediction problem.

3 PROPOSED MODEL

3.1 Problem Formulation
In an online course platform, there are many registered students and many online courses. Each student can take several different courses, and each course can be taken by many different students. In the platform, the courses usually last for several weeks. Students taking these courses leave activity records in the platform; each record is a time-stamped log recording information about an event in the platform.

[Figure 1: The Dropout Prediction Problem. A timeline from Day 1 to Day N forms the record period, followed by the prediction period.]

In this paper, we formulate the dropout prediction problem as follows: given the activity records of some students on some courses in a period of time (as shown in Figure 1, from Day 1 to Day N; we call it the "record period"), we aim to predict whether these students will drop out of the courses in a period in the future (as shown in Figure 1; we call it the "prediction period"). If a student has activity records on a course in the prediction period, we consider that this student has not dropped out of the corresponding course; otherwise, we consider that this student has dropped out of this course.
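This labeling rule can be made concrete with a minimal Python sketch (the record format and all names here are our own illustration, not part of the dataset specification):

```python
from datetime import datetime

def dropout_label(records, course_id, prediction_start, prediction_end):
    """Return 1 ("drop out") or 0 ("not drop out") for one student-course pair.

    `records` is assumed to be an iterable of (timestamp, course_id) pairs,
    with `timestamp` a datetime; the format is hypothetical.
    """
    for timestamp, cid in records:
        # Any activity on this course inside the prediction period means
        # the student has not dropped out of the course.
        if cid == course_id and prediction_start <= timestamp <= prediction_end:
            return 0
    return 1

# Example: one record inside the prediction period keeps the label at 0.
records = [(datetime(2014, 7, 3, 9, 41, 49), "C1")]
print(dropout_label(records, "C1",
                    datetime(2014, 7, 1), datetime(2014, 7, 10, 23, 59, 59)))  # 0
```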
3.2 Preprocessing of Input
3.2.1 Characteristics of the Raw Records. The dataset we have in this problem consists of raw activity records. These records are structured, time-stamped logs, ordered chronologically and consisting of different attributes. The attributes have different physical meanings and contain information on different aspects. For example, in one dataset, the records may contain attributes like student ID (denoting the student each record belongs to), time (denoting the time each event occurs), activity type (denoting the activity type of the event) and so forth. Each record documents one event and contains a specific value for each of the attributes.

The record period of a dataset usually lasts for several weeks. In this paper, for convenience of description, we divide the time span of the record period at several different scales. Each record describes an event in the platform at a certain time; we refer to the time in a record as a time point. Several adjacent time points form a time unit, several adjacent time units form a time slice, and all time slices together form the record period.

3.2.2 Converting to One-hot Vectors. The raw activity records in the dataset are in their original text format and cannot be used directly as inputs to our model. To use these data, we need to convert them into a format that can be processed by deep neural networks. In this paper, we convert them into vectors.

In these records, each attribute can take several different values. For each record, we convert the specific value of each attribute into a one-hot vector; the number of digits of the one-hot vector equals the number of different values the attribute can take. This method of conversion has been used by [14] in natural language processing: in their work, each word was converted into a one-hot vector, and the vectors belonging to the same sentence were concatenated into a long vector representing that sentence. In our model, as in the example shown in Figure 2 (the example comes from the data provided by KDD Cup 2015), we convert some of the attributes of a record into one-hot vectors and concatenate these vectors into a long vector representing the record.

[Figure 2: An Example of Converting a Raw Record to a One-hot Vector. A record with time 2014-06-14T09:41:49, source "browser" and event "problem" is converted attribute by attribute into the one-hot vectors 10 and 0000100, which are concatenated into 100000100.]

Each record in the dataset has a time attribute denoting the time the event occurred; the occurring time of each record corresponds to a time point in the record period. Not every time point in the record period has a corresponding record; there exists a large number of time points without records. For these time points, we pad zero vectors at the corresponding positions.
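As an illustration of this conversion (the helper functions and the value-to-index ordering are our own assumptions; the paper specifies the one-hot scheme but not the ordering), using the two-value "source" attribute and the seven-value "event" attribute of the dataset used later in Section 4:

```python
import numpy as np

# Vocabularies of two categorical attributes (see Section 4.2.3);
# the ordering of values within each vocabulary is an arbitrary choice.
SOURCES = ["browser", "server"]
EVENTS = ["problem", "video", "access", "wiki",
          "discussion", "navigate", "close page"]

def one_hot(value, vocabulary):
    """One digit per possible value; the digit of `value` is set to 1."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def record_to_vector(source, event):
    """Concatenate the per-attribute one-hot vectors into one long vector."""
    return np.concatenate([one_hot(source, SOURCES), one_hot(event, EVENTS)])

# The record of Figure 2 ("browser", "problem") becomes a 9-digit vector,
# e.g. [1 0] + [1 0 0 0 0 0 0] under this ordering.
print(record_to_vector("browser", "problem"))
```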
3.2.3 Combining Vectors. In the activity records, each time point can be accurate to the second, while the time span of the record period can last for several weeks. Compared with the time scale of a single time point, the time span of the record period is quite large. If we directly used the original vectors at every time point as input, the volume of the input would be very large. For this reason, it is necessary to combine some vectors and reduce the volume of the input. In our model, we combine all vectors belonging to the same time unit into one vector by adding them up element-wise. For example, adding the two vectors [0, 1, 0, 0, 1, 0, 1] and [0, 0, 1, 0, 1, 0, 0] gives [0, 1, 1, 0, 2, 0, 1].

After adding up, each time unit is represented by one vector. We then stack all vectors in the same time slice into a matrix, with each row being the vector of one time unit. After that, for each student taking one course, we have several matrices representing the records in the different time slices. We use them as the inputs to the model.
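A small sketch of this combination step (our own illustration; the shapes follow the implementation in Section 4.2.3, where a time unit is one hour and a time slice is one day):

```python
import numpy as np

def build_slice_matrix(unit_indexed_vectors, num_time_units, vector_len):
    """Build the matrix of one time slice.

    `unit_indexed_vectors` is assumed to be a list of (time_unit_index,
    record_vector) pairs for the records falling inside this time slice.
    Vectors in the same time unit are added up element-wise; time units
    with no records remain zero rows, which also realizes the zero
    padding described in Section 3.2.2.
    """
    matrix = np.zeros((num_time_units, vector_len))
    for unit_index, vector in unit_indexed_vectors:
        matrix[unit_index] += vector
    return matrix

# Example from the text: two record vectors in the same time unit.
a = np.array([0, 1, 0, 0, 1, 0, 1])
b = np.array([0, 0, 1, 0, 1, 0, 0])
print(build_slice_matrix([(0, a), (0, b)], 24, 7)[0])  # [0. 1. 1. 0. 2. 0. 1.]
```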
3.3 Overview of the Model
To solve the dropout prediction problem, we need to consider the characteristics of the input, which can be seen from a macro and a micro view, respectively.

In the macro view, the activities of a student at a certain time are affected by his/her states or activities in the past, and the effect of temporally closer activities is greater than that of temporally more distant ones. In our model, this characteristic is exploited by the RNN across different time slices.

In the micro view, each time slice is represented by a matrix, and features can be extracted from each of the matrices. To extract features, we use a CNN, the most widely adopted neural network for feature extraction.

In our model, we combine a CNN and an RNN to make use of the characteristics stated above. As our model is a combination of these two kinds of neural networks, we call it the ConRec Network. In the lower part of the model, convolutional and pooling layers extract features within each time slice. In the upper part, a recurrent layer combines the information across time slices, and the combined information is used to make the prediction.

3.4 Details of ConRec Network
The model contains an input layer, an output layer and six hidden layers. The structure of the model is shown in Figure 3. The first and third hidden layers are convolutional layers, the second and fourth hidden layers are pooling layers, the fifth hidden layer is a fully connected layer, and the sixth hidden layer is a recurrent layer. The nonlinear function applied in each hidden layer is the rectified linear unit (ReLU) function.

[Figure 3: The Proposed ConRec Network. From bottom to top: input layer; 1st hidden layer (convolutional); 2nd hidden layer (pooling); 3rd hidden layer (convolutional); 4th hidden layer (pooling); 5th hidden layer (fully connected); 6th hidden layer (recurrent); output layer.]

In this model, a student taking a course is treated as an instance; it contains activity records in some time slices. We assume the matrices of all time slices are of the same size, and that the number of matrices in each instance is also the same. Each instance contains T matrices representing the records in T time slices. Thus, for each instance, the input of the model is T matrices X^(1), X^(2), ..., X^(t), ..., X^(T), each of size Q^(0) × H^(0), where Q^(0) is the number of time units in each time slice and H^(0) is the length of the vectors at each time unit. As illustrated above, in the lower layers we extract features among adjacent elements.

3.4.1 Convolutional Layers. In this model, the first and the third hidden layers are convolutional layers. In these two layers, the convolution mode is valid convolution, i.e., it does not pad zero vectors at the boundary of the matrix when performing the convolution operation. We set the stride to 1. For each instance, the input is T × K^(p−1) matrices, each of size Q^(p−1) × H^(p−1). The output is T × K^(p) matrices, each of size Q^(p) × H^(p), where Q^(p) = Q^(p−1) − L^(p) + 1 and H^(p) = H^(p−1) − D^(p) + 1. The operations in these two layers are:

X^{(p),(k^{(p)}),(t)} = \mathrm{ReLU}\left( \sum_{k^{(p-1)}=1}^{K^{(p-1)}} X^{(p-1),(k^{(p-1)}),(t)} * W^{(p),(k^{(p)}),(k^{(p-1)})} + b^{(p),(k^{(p)})} \right), \quad \forall t = 1, 2, \dots, T, \ \forall k^{(p)} = 1, 2, \dots, K^{(p)}    (1)

where T is the number of time slices and p is the index of the layer. X^{(p−1),(k^{(p−1)}),(t)} is the matrix representing the (k^(p−1))th feature map of the (p−1)th layer at the t-th time slice; it is an input of the p-th layer. W^{(p),(k^{(p)}),(k^{(p−1)})} is a filter of the p-th layer: a matrix of size L^(p) × D^(p) that generates the (k^(p))th feature map of the p-th layer from the (k^(p−1))th feature map of the (p−1)th layer. b^{(p),(k^{(p)})} is the bias of the generated (k^(p))th feature map of the p-th layer; it is added to all elements of the corresponding feature map matrix. The operator "∗" denotes the convolution operation.
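For a single time slice, Eq. (1) can be written out directly. The following naive NumPy sketch is our own illustration, not the paper's Theano implementation; like most CNN implementations, it slides each filter in valid mode with stride 1 without flipping it:

```python
import numpy as np

def conv_layer(X_prev, W, b):
    """One convolutional layer following Eq. (1), for one time slice.

    X_prev: feature maps of layer p-1, shape (K_prev, Q_prev, H_prev)
    W:      filters, shape (K, K_prev, L, D)
    b:      biases, one per generated feature map, shape (K,)
    Returns K feature maps of shape (Q_prev - L + 1, H_prev - D + 1).
    """
    K, K_prev, L, D = W.shape
    _, Q_prev, H_prev = X_prev.shape
    Q, H = Q_prev - L + 1, H_prev - D + 1   # valid mode, stride 1
    out = np.zeros((K, Q, H))
    for k in range(K):
        for i in range(Q):
            for j in range(H):
                # Sum over all K_prev input feature maps, as in Eq. (1).
                out[k, i, j] = np.sum(X_prev[:, i:i + L, j:j + D] * W[k]) + b[k]
    return np.maximum(out, 0.0)             # ReLU nonlinearity
```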
where N is the number of training instances.
3.4.2 Pooling Layers. The second and the fourth hidden layers are pooling layers. In this model, we use max pooling. For each instance, the input is T × K^(p−1) matrices, each of size Q^(p−1) × H^(p−1). The output is T × K^(p) matrices, each of size Q^(p) × H^(p), where K^(p) = K^(p−1), Q^(p) = Q^(p−1)/L^(p) and H^(p) = H^(p−1)/D^(p). The operations in these two layers are:

X^{(p),(k^{(p)}),(t)}_{i,j} = \max_{1 \le l^{(p)} \le L^{(p)},\ 1 \le d^{(p)} \le D^{(p)}} X^{(p-1),(k^{(p-1)}),(t)}_{(i-1) \times L^{(p)} + l^{(p)},\ (j-1) \times D^{(p)} + d^{(p)}}, \quad \forall t = 1, 2, \dots, T, \ \forall k^{(p)} = k^{(p-1)} = 1, 2, \dots, K^{(p-1)}    (2)

In these two layers, the pooling areas do not overlap; they are of size L^(p) × D^(p).
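Eq. (2) amounts to non-overlapping max pooling, which can be sketched as follows (our own illustration; the map sizes are assumed divisible by the pooling sizes, as they are in the configuration of Section 4.2.3):

```python
import numpy as np

def max_pool(X_prev, L, D):
    """Non-overlapping L x D max pooling following Eq. (2), one time slice.

    X_prev: feature maps, shape (K, Q_prev, H_prev), with Q_prev and
    H_prev assumed divisible by L and D respectively.
    Returns feature maps of shape (K, Q_prev // L, H_prev // D).
    """
    K, Q_prev, H_prev = X_prev.shape
    # Cut each map into a (Q_prev//L) x (H_prev//D) grid of L x D blocks
    # and keep the maximum of each block.
    blocks = X_prev.reshape(K, Q_prev // L, L, H_prev // D, D)
    return blocks.max(axis=(2, 4))
```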
3.4.3 Fully Connected Layer. The fifth hidden layer is a fully connected layer. This layer combines the features extracted by the convolutional and pooling layers and generates one vector for each time slice. For each instance, the inputs are T × K^(4) matrices, each of size Q^(4) × H^(4); the outputs are T vectors, each of size M. In this layer, each input matrix is first flattened into a vector, represented as x^{(4),(k^{(4)}),(t)}. The operation in this layer is:

x^{(5),(t)} = \mathrm{ReLU}\left( \sum_{k^{(4)}=1}^{K^{(4)}} W^{(4),(5),(k^{(4)})} x^{(4),(k^{(4)}),(t)} + b^{(5),(t)} \right), \quad \forall t = 1, 2, \dots, T    (3)

where W^{(4),(5),(k^{(4)})} is a weight matrix generating the vector from the (k^(4))th feature map of the fourth layer, and b^{(5),(t)} is the bias.

3.4.4 Recurrent Layer. In the first five layers, we extract features within each time slice. In the sixth hidden layer, a recurrent layer, we combine the information across time slices. The operation in this layer is:

s^{(t)} = \mathrm{ReLU}\left( W^{(6)} x^{(5),(t)} + U^{(6)} s^{(t-1)} + b^{(6)} \right), \quad \forall t = 1, 2, \dots, T    (4)

where s^{(t)} is the hidden state at time t, W^{(6)} and U^{(6)} are weight matrices, and b^{(6)} is the bias.

This prediction problem is a binary classification problem. We denote the class "not drop out" as "0" and the class "drop out" as "1". The output of the model is

\hat{y} = \frac{1}{1 + e^{-(V s^{(T)} + c)}}    (5)

where s^{(T)} is the hidden state at time slice T, which combines the information of all previous time slices; V is the weight matrix and c is the bias. The output ŷ is a scalar between 0 and 1 denoting the probability that the instance belongs to class "1".

The loss function of the model is the mean negative log-likelihood over the training instances:

L = \frac{1}{N} \sum_{i=1}^{N} \left( -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \right)    (6)

where N is the number of training instances.
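The top of the network, Eqs. (3) to (6), can be sketched as follows. This is a NumPy illustration of ours, not the paper's Theano code; for Eq. (3) the per-feature-map weight matrices are packed into one matrix acting on the flattened, concatenated feature maps, which is an equivalent parameterization, and a single shared bias is used:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conrec_top(slice_features, W5, b5, W6, U6, b6, V, c):
    """Eqs. (3)-(5): fully connected layer, recurrent layer and output.

    slice_features: list of T arrays, the fourth-layer feature maps of
        each time slice, each of shape (K4, Q4, H4).
    W5: (M, K4*Q4*H4) packed fully connected weights, b5: (M,) bias.
    W6: (S, M), U6: (S, S), b6: (S,) recurrent layer parameters.
    V:  (S,) output weights, c: scalar output bias.
    Returns the predicted dropout probability y_hat.
    """
    s = np.zeros(U6.shape[0])            # initial hidden state s^(0)
    for X4 in slice_features:
        x5 = relu(W5 @ X4.ravel() + b5)  # Eq. (3) on flattened maps
        s = relu(W6 @ x5 + U6 @ s + b6)  # Eq. (4)
    return sigmoid(V @ s + c)            # Eq. (5), probability of class "1"

def loss(y, y_hat):
    """Eq. (6): mean negative log-likelihood over N training instances."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat))
```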

4 EXPERIMENTAL EVALUATION
In this section, we first introduce the dataset used in the experiments and then describe the experimental settings. Finally, we report the results of the experiments and discuss their implications.

4.1 Dataset
The dataset we used for the experiments comes from KDD Cup 2015 (http://kddcup2015.com). It contains information about 39 courses on the online course platform XuetangX. For each course, the record period is 30 days. The dataset includes information about which student enrolls in which course, the activity records of the students, and other information about the courses. Table 1 lists the attributes in the files ("enrollment_train.csv" and "enrollment_test.csv") containing information about enrollment records (in this dataset, a student taking a course is referred to as an enrollment record; it corresponds to an instance in our model and in the baseline methods). Table 2 lists the attributes in the files ("log_train.csv" and "log_test.csv") containing the activity records of the students.

Table 1: Attributes in Files Containing Information about Enrollments

Attribute       Meaning
Enrollment ID   ID of the enrollment record
User Name       Name of the student
Course ID       ID of the course

Table 2: Attributes in Files Containing Activity Records of the Enrollments

Attribute       Meaning
Enrollment ID   ID of the enrollment record
Time            Time when the event occurs
Source          Event source ("server" or "browser")
Event           Event type (there are 7 types: "problem", "video", "access", "wiki", "discussion", "navigate" and "close page")
Object          The object the student accesses or navigates to (only for "navigate" and "access" events)

In this dataset, only the enrollment records in the file "enrollment_train.csv" have ground truth (in the file "truth_train.csv") on whether a student drops out; the enrollment records in "enrollment_test.csv" have no corresponding ground truth. Thus, in this experiment, we only use the enrollment records from "enrollment_train.csv". This file contains 120,542 enrollment records, and the corresponding activity record file contains 8,157,277 records. Of these enrollment records, 95,581 dropped out and the other 24,961 did not.

In this experiment, we follow the standard way of splitting training and testing datasets, with a 4:1 proportion of training to testing instances. We randomly separate the enrollments into 96,434 training instances and 24,108 testing instances. In the training and testing sets, the proportions of dropout and non-dropout enrollment records are nearly the same.

4.2 Experimental Settings
4.2.1 Prediction Goal. In KDD Cup 2015, the goal is to predict whether the students will drop the courses in the following 10 days. In our experiment, we follow this definition of dropout and the same prediction goal.

4.2.2 Comparison Methods. In the experiments, we aim to evaluate the effectiveness of our model compared with feature engineering based methods. Thus, we select several classification methods as baselines and provide them with features extracted via feature engineering.

In the experiments, we extract 186 features from the raw records via feature engineering. These features mainly cover five aspects: (1) the number of different kinds of events, (2) the number of days that have different kinds of events, (3) the amount of operating time, (4) the time of the last event of an enrollment item, and (5) the number of registered students of the course in an enrollment item. All of these features are normalized by min-max scaling, using the maximum and minimum values of each feature in the training dataset.

The baseline methods are Linear SVM [16], SVM with RBF kernel [1], Logistic Regression [27], Decision Tree [22], AdaBoost [9], Gradient Tree Boosting [10], Random Forest [5] and Gaussian Naive Bayes [33]. For Linear SVM, Decision Tree and Random Forest, we perform probability calibration to obtain the probabilities of belonging to the class "dropout".
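The normalization described above, with the scaling statistics taken from the training split only, amounts to the following (a small sketch of ours):

```python
import numpy as np

def min_max_scale(train, test):
    """Min-max scale both splits with the per-feature minimum and maximum
    of the training set, so no test-set statistics leak into training."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (train - lo) / span, (test - lo) / span
```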
belonging to class “dropout”.

4.2.3 Implementation Details. In our model, we select three attributes from the dataset as input: the attributes "source" and "event" in the activity records, and "course ID" in the enrollment information records. In this implementation, the scale of a time point is one second. We set the time unit to one hour and the time slice to one day. For each time point, we convert the values of these three attributes into vectors and concatenate them to get a vector of size 48. Then, we add up the vectors within the same hour and combine the vectors of the same day into one matrix. The input for each enrollment item is thus 30 matrices, each of size 24 × 48.

In the first hidden layer, we use 20 filters and biases to generate 20 feature maps. The filters are all of size 5 × 5, so the output of this layer is 30 × 20 matrices of size 20 × 44. In the second hidden layer, the pooling size is 2 × 2, and the output is 30 × 20 matrices of size 10 × 22. In the third hidden layer, the filters are also of size 5 × 5, and the number of filters and biases is 50; the output is 30 × 50 matrices of size 6 × 18. In the fourth hidden layer, the pooling size is again 2 × 2, and the output is 30 × 50 matrices of size 3 × 9. The fifth hidden layer is the fully connected layer; it generates a vector for each day, so the output is 30 vectors of size 20. The sixth hidden layer is the recurrent layer; its input is the 30 vectors of size 20, and its hidden state vector is of size 50. The final output is a number between 0 and 1, indicating the probability that the student will drop out of the corresponding course. In this model, we denote "0" as "not dropout" and "1" as "dropout".

We implemented the model in Theano [28], a Python library for realizing deep learning methods. In the experiment, we use mini-batch stochastic gradient descent to train the parameters. We set the learning rate to 0.1 and the batch size to 20. The model is run for 15 epochs.
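The layer sizes stated above can be checked with the valid-convolution and pooling arithmetic of Section 3.4 (a quick sanity check of ours):

```python
# One time slice: 24 hours x 48-dimensional vectors.
q, h = 24, 48
q, h = q - 5 + 1, h - 5 + 1   # 1st layer, 5 x 5 valid conv -> (20, 44)
q, h = q // 2, h // 2         # 2nd layer, 2 x 2 max pooling -> (10, 22)
q, h = q - 5 + 1, h - 5 + 1   # 3rd layer, 5 x 5 valid conv -> (6, 18)
q, h = q // 2, h // 2         # 4th layer, 2 x 2 max pooling -> (3, 9)
print(q, h)                   # 3 9: the 3 x 9 maps fed to the FC layer
```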
4.2.4 Evaluation Metrics. The problem solved in this paper is a classification problem in which the numbers of instances belonging to the two classes differ greatly, a phenomenon called class imbalance. In this case, accuracy is not an appropriate evaluation metric. In this paper, we use Precision, Recall, F1-score [18] and the Area Under the Receiver Operating Characteristic Curve (AUC) score [21] as the evaluation metrics.
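These metrics can be computed, for instance, with scikit-learn (a sketch of ours; the paper does not state the decision threshold used for the thresholded metrics, so the 0.5 here is an assumption):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Precision, Recall and F1 for the "dropout" class (label 1), plus
    AUC computed from the raw predicted probabilities."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }
```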
4.3 Experimental Results and Analysis
The experimental results of our model and of the baseline methods are shown in Table 3. As the table shows, compared with the baseline classification methods (which all use features extracted via feature engineering), our model obtains comparable results on all four metrics. This illustrates the effectiveness of our model.

Table 3: Performances of Different Methods under Different Metrics (in %)

Method                        Precision   Recall   F1-score   AUC
Linear SVM [16]               88.91       96.34    92.48      87.83
SVM (with RBF kernel) [1]     88.06       97.01    92.32      84.51
Logistic Regression [27]      89.17       96.12    92.52      87.70
Decision Tree [22]            84.57       97.87    90.74      80.03
AdaBoost [9]                  89.39       95.59    92.39      87.96
Gradient Tree Boosting [10]   89.61       95.71    92.56      88.26
Random Forest [5]             88.86       96.21    92.39      86.71
Gaussian Naive Bayes [33]     89.89       92.52    91.18      77.40
ConRec Network                88.62       96.55    92.41      87.42

To evaluate the differences between these results more precisely, we calculate the relative differences between the performance of ConRec Network and that of each baseline method on the four metrics. We define "relative difference" as follows: for a performance value a from ConRec Network and a performance value b from a baseline method, the relative difference between them is (a − b)/b. This criterion takes the performance of the baseline method as the standard and measures the significance of the difference between the performance of ConRec Network and that of the baseline.
The values of the relative differences are shown in Table 4. The value in each cell is the relative difference between ConRec Network and the corresponding baseline method under the corresponding metric. Positive values mean ConRec Network outperforms the corresponding baseline method; negative values mean the opposite. The absolute value indicates the magnitude of the difference. From the table, we can see that our proposed ConRec Network achieves results comparable to most of the baseline methods under the four metrics, and performs much better than Decision Tree and Gaussian Naive Bayes.

Table 4: Relative Difference between ConRec Network and Various Baseline Methods under Different Metrics (in %)

Method                        Precision   Recall   F1-score   AUC
Linear SVM [16]               -0.32       0.21     -0.07      -0.46
SVM (with RBF kernel) [1]     0.63        -0.47    0.09       3.44
Logistic Regression [27]      -0.61       0.44     -0.11      -0.31
Decision Tree [22]            4.78        -1.34    1.84       9.23
AdaBoost [9]                  -0.86       1.00     0.02       -0.61
Gradient Tree Boosting [10]   -1.10       0.87     -0.16      -0.95
Random Forest [5]             -0.27       0.35     0.02       0.81
Gaussian Naive Bayes [33]     -1.41       4.35     1.34       12.94

From these results we can conclude: (1) for classification problems based on raw activity records, there exist efficient ways of automatically extracting features from the raw data; (2) the approach adopted by our method to extract features is effective; and (3) our proposed ConRec Network is an effective model for solving the dropout prediction problem in MOOCs.

5 CONCLUSION
In this paper, we propose a deep neural network for solving the dropout prediction problem in MOOCs. The key advantage of our model is that the features used in the model are automatically extracted from the raw records; no manual feature engineering is needed. In this way, the method saves a lot of time and human effort, and eliminates the potential inconsistency introduced by the manual process. Experimental results on a large-scale public dataset show that the proposed model can achieve performance comparable to approaches relying on feature engineering performed by experts.

In future work, we plan to evaluate the effectiveness of this model on other kinds of activity records and on other classification problems based on activity records. For the upper layers of the model, we plan to evaluate other types of RNN, such as LSTM and GRU. For the format of the input, we plan to explore other formats instead of one-hot vectors.

ACKNOWLEDGMENTS
This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its IDM Futures Funding Initiative; the Interdisciplinary Graduate School (IGS), NTU; the Lee Kuan Yew Post-Doctoral Fellowship Grant; and the NTU-PKU Joint Research Institute, a collaboration between Nanyang Technological University and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation. We would like to gratefully acknowledge the organizers of KDD Cup 2015 as well as XuetangX for making the datasets available.

REFERENCES
[1] Bussaba Amnueypornsakul, Suma Bhat, and Phakpoom Chinprutthiwong. 2014. Predicting attrition along the way: The UIUC model. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 55–59.
[2] Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
[4] Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
[5] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[6] John Daniel. 2012. Making sense of MOOCs: Musings in a maze of myth, paradox and possibility. Journal of Interactive Media in Education 2012, 3 (2012).
[7] Li Deng, Ossama Abdel-Hamid, and Dong Yu. 2013. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6669–6673.
[8] Li Deng, Dong Yu, et al. 2014. Deep learning: methods and applications. Foundations and Trends in Signal Processing 7, 3–4 (2014), 197–387.
[9] Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55, 1 (1997), 119–139.
[10] Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
[13] Jiazhen He, James Bailey, Benjamin I. P. Rubinstein, and Rui Zhang. 2015. Identifying at-risk students in massive open online courses. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 1749–1755.
[14] Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 103–112.
[15] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1725–1732.
[16] Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 60–65.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[18] Wentao Li, Min Gao, Hua Li, Qingyu Xiong, Junhao Wen, and Zhongfu Wu. 2016. Dropout prediction in MOOCs using behavior features and multi-view semi-supervised learning. In 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 3130–3137.
[19] Chen Lin, Helena Canhao, Timothy Miller, Dmitriy Dligach, Robert M. Plenge, Elizabeth W. Karlson, and Guergana K. Savova. 2012. Feature engineering and selection for rheumatoid arthritis disease activity classification using electronic medical records. In ICML Workshop on Machine Learning for Clinical Data Analysis.
[20] Tharindu Rekha Liyanagunawardena, Andrew Alexandar Adams, and Shirley Ann Williams. 2013. MOOCs: A systematic study of the published literature 2008–2012. The International Review of Research in Open and Distributed Learning 14, 3 (2013), 202–227.
[21] Fei Mi and Dit-Yan Yeung. 2015. Temporal models for predicting student dropout in massive open online courses. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 256–263.
[22] Sreerama K. Murthy. 1998. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery 2, 4 (1998), 345–389.
[23] Saurabh Nagrecha, John Z. Dillon, and Nitesh V. Chawla. 2017. MOOC dropout prediction: Lessons learned from making pipelines interpretable. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 351–359.
[24] Jiezhong Qiu, Jie Tang, Tracy Xiao Liu, Jie Gong, Chenhui Zhang, Qian Zhang, and Yufei Xue. 2016. Modeling and predicting learning behavior in MOOCs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 93–102.
[25] Tanmay Sinha, Patrick Jermann, Nan Li, and Pierre Dillenbourg. 2014. Your click decides your fate: Inferring information processing and attrition behavior from MOOC video clickstream interactions. In 2014 EMNLP Workshop on Modeling Large Scale Social Interaction in Massively Open Online Courses.
[26] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[27] Colin Taylor, Kalyan Veeramachaneni, and Una-May O'Reilly. 2014. Likely to stop? Predicting stopout in massive open online courses. arXiv preprint arXiv:1408.3382 (2014).
[28] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688 (May 2016). http://arxiv.org/abs/1605.02688
[29] Kalyan Veeramachaneni, Una-May O'Reilly, and Colin Taylor. 2014. Towards feature engineering at scale for data from massive open online courses. arXiv preprint arXiv:1407.5238 (2014).
[30] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698 (2015).
[31] Jacob Whitehill, Joseph Jay Williams, Glenn Lopez, Cody Austun Coleman, and Justin Reich. 2015. Beyond prediction: First steps toward automatic intervention in MOOC student stopout. In International Educational Data Mining Society.
[32] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-Fourth International Joint Conference on Artificial Intelligence. 3995–4001.
[33] Harry Zhang. 2004. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004). 562–567.
