0% found this document useful (0 votes)
1 views

Context-Aware Online Learning for Course

Uploaded by

chiezieeucharia
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

Context-Aware Online Learning for Course

Uploaded by

chiezieeucharia
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Context-Aware Online Learning for Course

Recommendation of MOOC Big Data


Yifan Hou, Member, IEEE, Pan Zhou, Member, IEEE, Ting Wang, Li Yu, Member, IEEE, Yuchong Hu, Member, IEEE,
Dapeng Wu, Fellow, IEEE

Abstract—The Massive Open Online Course (MOOC) has Coursera, edX, Udacity and so on [4]. However, due to the
expanded significantly in recent years. With the widespread rapid growth rate of users, the amount of needed courses has
of MOOC, the opportunity to study the fascinating courses
arXiv:1610.03147v2 [cs.LG] 16 Oct 2016

been expanding continuously. And according to the survey


for free has attracted numerous people of diverse educational
backgrounds all over the world. In the big data era, a key about the completion rate of MOOC [5], only 4% people
research topic for MOOC is how to mine the needed courses finish their chosen courses. Therefore, finding a preferable
in the massive course databases in cloud for each individual course resource and locating it in the massive data bank, e.g.,
student accurately and rapidly as the number of courses is cloud computing and storage platforms, would be a daunting
increasing fleetly. In this respect, the key challenge is how to “needle-in-a-haystack” problem.
realize personalized course recommendation as well as to reduce
the computing and storage costs for the tremendous course One key challenge in future MOOC course recommendation
data. In this paper, we propose a big data-supported, context- is processing tremendous data that bears the feature of volume,
aware online learning-based course recommender system that variety, velocity, variability and veracity [6] of big data.
could handle the dynamic and infinitely massive datasets, which Precisely, the recommender system for MOOC big data needs
recommends courses by using personalized context information to handle the dynamic changing and nearly infinite course
and historical statistics. The context-awareness takes the per-
sonal preferences into consideration, making the recommendation data with heterogeneous sources and prior unknown scale
suitable for people with different backgrounds. Besides, the effectively. Moreover, since the Internet and cloud computing
algorithm achieves the sublinear regret performance, which services are turning in the direction of supporting different
means it can gradually recommend the mostly preferred and users around the world, recommender systems are necessary
matched courses to students. In addition, our storage module is to consider the features of students, i.e. cultural difference, ge-
expanded to the distributed-connected storage nodes, where the
devised algorithm can handle massive course storage problems ographic disparity and education level, , one has his/her unique
from heterogeneous sources of course datasets. Comparing to preference in evaluating a course in MOOC. For example,
existing algorithms, our proposed algorithms achieve the linear someone pays more attention to the quality of exercises while
time complexity and space complexity. Experiment results verify the other one focuses on the classroom rhythm more. We use
the superiority of our algorithms when comparing with existing the concept of context to represent those mentioned features
ones in the MOOC big data setting.
as the students’ personalized information. The context space is
Index Terms—MOOC, big data, context bandit, course recom- encoded as a multidimensional space (dX dimensions), where
mendation, online learning dX is the number of features. As such, the recommendation
becomes student-specific, which could improve the recommen-
I. I NTRODUCTION dation accuracy. Hence, appending context information to the
MOOC is a concept first proposed in 2008 and known to the models for processing the online courses is ineluctable [8] [9].
world in 2012 [1] [2]. Not being accustomed to the traditional Previous context-aware algorithms such as [29] only per-
teaching model or being desirous to find a unique learning form well with the known scale of recommendation datasets.
style, a growing number of people have partiality for learning Specifically, the algorithm in [29] would rank all courses
on MOOCs. Advanced thoughts and novel ideas give great in MOOC as leaf nodes, then it clusters some relevance
vitality to MOOC, and over 15 million users have marked in courses together as their parent nodes based on the historical
Coursera [3] which is a platform of it. Course recommender information and current users’ features. The algorithm keeps
system helps students to find the requisite courses directly clustering the course nodes and building their parent nodes
in the course ocean of numerous MOOC platforms such like until the root node (bottom-up design). If there comes a new
course, all the nodes are changed and needed to compute again.
Yifan Hou, Pan Zhou and Li Yu are with School of Electronic Information As for the MOOC big data, since the number of courses keeps
and Communications, Huazhong University of Science and Technology,
Wuhan 430074, China. increasing and becoming fairly tremendous, algorithms in [29]
Ting Wang is with Computer Science and Engineering, Lehigh University, are prohibitive to be applied.
PA 18015, USA. Our main theme in this paper is recommending courses in
Yuchong Hu is with School of Computer Science and Technology,
Huazhong University of Science and Technology, Wuhan, 430074 China. tremendous datasets to students in real-time based on their
Dapeng Oliver Wu is with Department of Electrical and Computer Engi- preferences. The course data are stored in course cloud and
neering, University of Florida, Gainesville, FL 32611, USA. new courses can be loaded at any time. We devise a top-down
Contacting email: [email protected]
This work was supported by the National Science Foundation of China binary tree to denote and record the process of partitioning
under Grant 61231010, 61401169, 61529101 and CNS-1116970. course datasets, and every node in the tree is a set of courses.
Specifically, there is only one root course node including tion III formulates the recommendation problem and algorithm
all the courses in the binary tree at first. The course scores models. Section IV and Section V illustrate our algorithms and
feedback from students in marking system are denoted as bound their regret. Section VI analyzes the space complexity
rewards. Every time a course is recommended, a reward which of our algorithms and compares the theoretical results with
is used to improve the next recommending accuracy is fed existing works. In Section VII, we verify the algorithms
back from the student. The reward structure consists as a by experiment results and compare with relevant previous
unknown stochastic function of context features and course algorithms [29] [30]. Section VIII concludes the paper.
features at each recommendation, and our algorithm concerns
the expected reward of every node in the long run. Then the
course binary tree divides the current node into two child II. R ELATED W ORKS
nodes and selects one course randomly in the node with
the current best expected value. It omits most of courses in A plethora of previous works exist on recommending al-
the node that would not be selected to greatly improve the gorithms. As for MOOC, two major tactics to actualize the
learning performance. It also supports incoming new courses algorithms are filtering-based approaches and online learning
to the existing nodes as unselected items without changing the methods [12]. Apropos of filtering-based approaches, there
current built tree pattern. are some branches such like collaborative filtering [10] [13],
However, other challenges influencing on the recommending content-based filtering [14] and hybrid approaches [15] [16].
accuracy still remain. In practice, we observe that the number The collaborative filtering approach gathers the students’
of courses keeps increasing and the in-memory storage cost learning records together and then classifies them into groups
of one online course is about 1GB in average, which is based on the characteristics provided, recommending a course
fairly large. Therefore, how to store the tremendous course from the group’s learning records to new students [13] [10].
data and how to process the course data effectively become a Content-based filtering recommends a course to the student
challenge. Most previous works [21] [29] could only realize which is relevant to the learning records before [14]. Hybrid
the linear space complexity, however it’s not promising for approach is the combination of the two methods. The filtering-
MOOC big data. We propose a distributed storage scheme to based approaches can perform better at the beginning than
store the course data with many distributed-connected storage online learning algorithms. However, when the data come to
units in the course cloud. For example, the storage units may very large-scale or become stochastic, the filtering-based ap-
be divided based on the platforms of MOOC. On the one hand, proaches lose the accuracy and become incapable of utilizing
this method can make invoking process effectively with little the history records adequately. Meanwhile, not considering
extra costs on course recommendation. On the other hand, we the context makes the method unable to recommend courses
prove the space complexity can be bounded sublinearly under precisely by taking every student’s preference into account.
the optimal condition (the number of units satisfies certain Online learning can overcome the deficiencies of filtering-
relations) which is much better than [29]. based approaches. Most previous works of recommending
In summary, we propose an effective context-aware online courses utilize the adaptive learning [17] [18] [19]. In [17],
learning algorithm for course big data recommendation to offer the CML model was presented. This model combines the
courses to students in MOOCs. The main contributions are cloud, personalized course map and adaptive MOOC learning
listed as follows: system together, which is quite comprehensive for the course
recommendation with context-awareness. Nevertheless, as for
• The algorithm can accommodate to highly-dynamic in-
big data, the model is not efficient enough since these works
creasing course database environments, realizing the real
could not handle dynamic datasets and they may have a pro-
big data support by the course tree that could index nearly
hibitively high time cost with near “infinite” massive datasets.
infinite and dynamic changing datasets.
Similar works are widely distributed in [20]–[24] as contextual
• We consider context-awareness for personalized course
bandit problems. In these works, the systems know the rewards
recommendations, and devise an effective context parti-
of selected ones and record them every time, which means
tion scheme that greatly improves the learning rate and
the course feedback can be gathered from students after they
recommendation accuracy for different featured students.
receive the recommended courses. There is no work before that
• Our proposed distributed storage model stores data with
realizes contextual bandits with infinitely increasing datasets.
distributed units rather than single storage carrier, al-
Our work is motivated from [11] for big data support bandit
lowing the system to utilize the course data better and
theory, but [11] is not context-aware. We consider the context-
performing well with huge amount of data.
aware online learning for the first time with delicately devised
• Our algorithms enjoy superior time and space complexity.
context partition schemes for MOOC big data.
The time complexity is bounded linearly, which means
they achieve a higher learning rate than previous methods
in MOOC. For the space complexity, we prove that it is
III. P ROBLEM F ORMULATION
linear in the primary algorithm and could be sublinear in
distributed storage algorithm under the optimal condition. In this section, we present the system model, context model,
The reminder of the paper is organized as follows. Section II course model and the regret definition. Besides, we define
reviews related works and compares with our algorithms. Sec- some relevant notations and preliminary definitions.

2
Load the
course

Reward
Professor or online
learning platform

Reward
Course cloud with
recommender system Course
A student information

Reward
nationality with context
ᮽᵢ
The context-
awareness cloud
A payoff is recommend a
Education level: fed back course Context
elementary or Give a information
professor course to
learner Recommended
Fig. 2: Context and Course Space Relation Schema at slot t0
Students
course

Fig. 1: MOOC Course Recommendation and Feedback System horizontal axes in a space rectangular coordinate system. From
the schematic diagram in Fig. 2 at time slot t0 , the reward
varies in the context axis and course axis. To be more specific,
A. System Model for a determined student sk0 whose context xi0 is unchanged,
the reward rxi0 ,cj (t0 ) differs from courses cj shown in blue
Fig. 1 illustrates our model of operation. At first, the pro-
plane coordinate system. On the other hand, for a determined
fessors upload the course resources to the course cloud, where
course cj0 shown in crystal plane coordinate system, people
the uploaded courses are indexed by the set C = {c1 , c2 , ...}
with different context xi have different rewards rxi ,cj0 (t0 ) of
whose elements are vectors with dimension dC representing
courses.
the number of course features. As for the users, there are
consciously incoming students over time which are denoted
as S = {s1 , s2 , ...}. Then, the system collects context infor- B. Context Model for Individualization
mation of students. We denote the set of context information of The context space is a dX -dimensional space which means
students as X = {x1 , x2 , ..., xi , ..., }, where xi is the vector the context xi ∈ X is a vector with dX dimensions. The
in context space X . dX -dimensional vectors encode features such as ages, cultural
We use time slots t = 1, 2, ..., T to denote rounds. For backgrounds, nationalities, the educational level, etc., repre-
simplicity, we use st , xt , ct to denote the current incoming senting the characteristics of the student. We normalize every
student, the student context vector and the recommended dimension of context range from 0 to 1, e.g., educational
course at time t. In each time slot t, there are three running level ranges from [0, 1] denoting the educational level from
states: (1) a student st with an exclusive context vector xt the elementary to the expert in the related fields. With the
comes into our model; (2) the model recommends a course normalization in each dimension, we denote the context space
ct by randomly selecting one from the current course node to as X = [0, 1]dX , which is a unit hypercube. As for the differ-
the student st ; (3) the student st provides feedback due to the ence between two contexts, DX (xi , xj ) is used to delegate the
newly recommended course ct to the system. dissimilarity between context xi and xj . We use the Lipschitz
We assume the context sequence that generating the rewards condition to define the dissimilarity.
of courses follows an i.i.d. process, otherwise if there are
Assumption 1. There exists constant LX > 0 such that for
mixing within the sequence in practice, we could use the
all context xi , xj ∈ X , we have DX (xi ,xj ) ≤ LX ||xi − xj ||α ,
technique in [25] by using two i.i.d sequences to bound the
where || • || denotes the Euclidian norm in RdX .
mixing process without much performance difference. The
rxi ,cj (t) denotes the feedback reward from the student with Note that the Lipschitz constants LX are not required to be
context xi of course cj at time t. For the recommending known by our recommendation algorithms. They will only be
process, first there comes a student sk with context vector xi . used in quantifying the learning algorithms’ performance. As
Then the system recommends a course cj to the student sk for the parameter α, it’s referred to as similarity information
based on the historical reward information and context vector [21] and we assume that it’s known by the algorithms that
xi , after that the student sk gives a new reward rxi ,cj (t) to qualify the degree of similarity among courses. We present
the system. We define rxi ,cj (t) = f (xi , cj ) + εt , where εt the context dissimilarity mathematically with LX and α and
is a bounded noise with E[εt |(xi , cj )] = 0 and f (xi , cj ) is a they will appear in our regret bounds.
function of two variables (xi , cj ). Besides, we normalize the To illustrate the context information precisely, we define the
reward as rxi ,cj (t) ∈ [0, 1]. slicing number of context unit hypercube as nT , indicating the
Fig. 2 illustrates the relationship between context vector number of sets in the partition of the context space X . With
xi and course vector cj over reward. To better illustrate the the slicing number nT , each dimension can be divided into
relations, we degenerate the dimensions of them as dX = nT parts, and the context space is divided into (nT )dX parts
dC = 1. Practically, we have the reward axis with dimension where each part is a dX -dimensional hypercube with dimen-
1. Thus, we take the context vector and course details as two sions n1T × n1T × ... n1T . To have a better formulation, PT =

3
Center point ࡼ࢚ 澔
Selected tree, the region of two child nodes contains all the courses

Reward
Optimal
Context from their parent region, and they never intersect with each
dimension 3 Sub-hypercub node
Pt Pt Pt Pt Pt
other, Nh,i = Nh+1,2i−1 ∪ Nh+1,2i , Nh,i ∩ Nh,j = ∅ for any
Context

Pt
i 6= j. Thus, the C can be covered by the regions of Nh,i
h Pt
Context information at any depth C = ∪21 Nh,i . To better describe the regions, we
point Pt
define the diam(Nh,i ) to indicate the size of course regions,
Pt Pt Pt
Context
Courses
diam(Nh,i ) = supci ,cj ∈N Pt DC (ci , cj ) for any ci , cj ∈ Nh,i .
Optimal
࢔ࢀ =2 dimension 2 h,i
Course Pt
The dissimilarity DC (ci , cj ) between courses ci and cj can
be represented as the gap between course languages, course
Fig. 3: Context and Course Partition Model time length, course types and any others which indicate the
Pt
discrepancy. We denote the size of regions diam(Nh,i ) with
Pt
the largest dissimilarity in the course dataset Nh,i for any
{P1 ,P2 ,...,P(nT )dX } is used to denote the sliced chronological context xi ∈ Pt . Note that the diam is based on the dissimilar-
sub-hypercubes, and we use Pt to denote the sub-hypercube ity, and that can be adjusted by selecting different mappings.
selected at time t. As illustrated in Fig. 3, we let dX = 3 and For our analysis, we make some reasonable assumptions as
nT = 2. We divide every axis into 2 parts and the number d
follows. We define the set M = {mP1 , mP2 , ...m(nT ) X }
of sub-hypercubes is (nT )dX = 8. For the simplicity, we use
as the parameter to bound the size of regions of nodes in
the center point xPt in the sub-hypercube Pt to represent the
context sub-hypercube Pt , where all the elements in M satisfy
specific contexts xt at time t. With this model of context, we
mPt ∈ (0, 1). For simplicity, we take m as the maximum in
divide the different users into (nT )dX types. For simplicity,
M, which means m = max{mPt |mPt ∈ M}.
when Pt is used in the upper right of the notation, it means
Pt
that the notation is in the sub-hypercube Pt which is selected Assumption 2. For any region Nh,i , there exists constant θ ≥
k1 Pt
at time t, and the subscript “∗” means the optimal solution 1, k1 and m, where we can get θ (m)h ≤ diam(Nh,i ) ≤
over that notation. h
k1 (m) .
With Assumption 2 we can bound the size of regions with
C. Course Set Model for Recommendation
k1 (m)h , which accounts for the maximum possible variation
We model the set of courses as a dC -dimensional space, of the reward over Nh,i Pt
. Due to the properties of binary tree,
where dC is a constant to denote the number of all courses the number of regions increases exponentially with the depth
features e.g. language, professional level, provided school in C. rising, where using the exponential decreasing term k1 (m)h
We set every course in C as a dC dimensional vector, and for to bound the size of regions is reasonable. We use the mean
the newly added dimensions of courses, the value is set as 0. reward f (xi , cj ) to handle the model. Based on the concept
Similar to the context, we define the dissimilarity of courses as of the region and reward, we denote the courses in Nh,i Pt
as
Pt
DC (ci , cj ) to indicate the farthest relativity between the two Pt
ct (h, i) at time t in the context sub-hypercube Pt . Since
courses ci cj belonging to any the context vectors xi ∈ Pt there are tremendous courses and it is nearly impossible to
at time t, where the context vector xt belongs to the context find two courses with equal reward, for each context sub-
sub-hypercube Pt . hypercube Pt , there is only one overall optimal course defined
Pt
Definition 1. Let DC xt
over C be a non-negative mapping as Pt as cPt ∗ = arg maxcj ∈C f (rcPjt ) and each region Nh,i
2 Pt xi Pt ∗
(C → R): DC (ci , cj ) = supxi ∈Pt DC (ci , cj ) , where has a local optimal course defined as Pt as c (h, i) =
Pt xt
DC (ci , cj ) = DC (ci , cj ) = 0 when i = j. arg maxcj ∈N Pt f (rcPjt ), where we let f (•) be the mean value,
h,i

We assume that the two courses which are more relevant i.e., f (rxi ,cj ) = E[f (xi , cj ) + εt ] = f (xi , cj ) and rcPjt means
have the smaller dissimilarity between them. For example, the rxt ,cj in Pt .
the courses taught both in English have closer dissimilarity
than the courses with different languages when concerning D. The Regret of Learning Algorithm
the language feature of course. Simply, the regret R(T ) indicates the loss of reward in the
As for the course model, we use the binary tree whose nodes recommending procedure due to the unknown dynamics. As
are associated with subsets of X to index the course dataset. for our tree model, the regret R(T ) is based on the regions of
We denote the nodes of courses as the selected tree nodes Nh,iPt
. In other words, the regret R(T )
Pt
{Nh,i |1 ≤ i ≤ 2h ; h = 0, 1...; ∀ Pt ∈ PT }. is calculated by the accumulated reward difference between
Pt
Let Nh,i denote the nodes in the depth h and ranked i from recommended courses ct and the optimal course cPt ∗ with
left to right in context sub-hypercube Pt which is selected context xt over reward in the context sub-hypercube Pt at
at time t, where the ranked number i of nodes at depth h time t, thus we define the regret as
Pt T
T 
is restricted by 1 ≤ i ≤ 2h . We let Nh,i ∈ X represent
f (rcPPtt ∗ ) − E
P P Pt
Pt
R(T ) = rxt ,ct (t) , (1)
the course region associated with the node Nh,i . The region t=1 t=1
of root node N0,1 of the binary course tree is a set of the where rcPPtt ∗ is the reward of optimal course in Pt and rxPtt,ct is
Pt
whole courses N0,1 = C. And with the exploration of the the reward of course ct with context xt in Pt . Regret shows the

4
convergence rate of the optimal recommended option. When where k2 is a parameter used to control the exploration-
γ
the regret is sublinear R(T ) = O(T ) where 0 < γ < 1, exploitation tradeoff. And we define the Estimation as the
Pt
the algorithm will finally converge to the best course towards estimated reward value of the node Nh,i based on the Bound,
the student. In the following section we will propose our
n o
Pt Pt Pt Pt
Eh,i (t) = min Bh,i (t), max{Eh+1,2i−1 (t), Eh+1,2i (t)} .(3)
algorithms with sublinear regret.
Pt
The role of Eh,i (t) is to put a tight, optimistic, high-probability
Pt Pt
IV. R EFORMATIONAL H IERARCHICAL T REE upper bound for the reward over the region Nh,i of node Nh,i
in context sub-hypercube Pt at time t. It’s obvious that for
In this section we propose our main online learning algo- Pt Pt Pt
the leaf course nodes Nh,i we have Eh,i (t) = Bh,i (t) and for
rithm to mine courses in MOOC big data. Pt Pt
other nodes Nh,i we have Eh,i (t) ≤ Bh,i (t). Pt

In this algorithm we first find the arrived students’ context


A. Algorithm of Course Recommendation sub-hypercube xt ∈ Pt from the context space and replace
the original context with the center point xPt in that sub-
hypercube Pt (line 2-5). Then the algorithm finds one course
Algorithm 1 Reformational Hierarchical Trees (RHT) Pt Pt
region Nh,i whose Eh,i (t) is highest in the set ΓPt and
Require: The constant k1 and m, the student’s context xt and Pt
walks to the region Nh,i with the route ΩPt , selecting one
time T . course ct from that region and recommending it for the reward
Auxiliary function: Exploration and Bound Updating rcPtt from student st (line 7-10). As illustrated in Fig. 4, the
Initialization: Context sub-hypercubes belonging to PT algorithm walks upon the nodes with the bold arrow and the
Pt
The explored nodes set ΓPt = {N0,1 } set ΩPt = {N0,1 Pt Pt
, N1,2 Pt
, N2,4 Pt
, N3,7 Pt
, N4,13 Pt
}, and the node N4,13
Pt Pt
Upper bound of region N0,1 over reward E1,i = ∞ for i = has the highest Estimation value in ΓPt . When the reward
1, 2. Pt
feeds back, the algorithm refreshes Eh,i (t) of regions of the
Pt
1: for t = 1, 2, ...T do current tree based on Bh,i (t) and rewards rcPtt (t) (line 11-19).
2: for dt = 0, 1, 2...dX do Specifically, the algorithm refreshes the value of Estimation
3: Find the context interval in dt dimension from the leaf nodes to the root node by (3) (line 13-18). Since
4: end for exploring is a top-down process, after we refresh the upper
5: Get the context sub-hypercube Pt bound of reward in course regions, we update the Estimation
Pt Pt
6: Initialize the current region Nh,i ← N0,1 value from bottom to the top based on the Bound with (3).
Pt
7: Build the path set of regions ΩPt ← Nh,i
8: Call Exploration (Γ ) Pt Algorithm 2 Exploration
Pt
9: Select a course ct from the region Nh,i randomly and 1:
Pt
for all Nh,i ∈ ΓPt do
recommend to the student st 2:
Pt Pt
if Eh+1,2i−1 > Eh+1,2i then
10: Get the reward rxt ,ct 3: T emp = 1
11: for all Pt ∈ PT do 4:
Pt
else if Eh+1,2i−1 Pt
< Eh+1,2i then
12: Call Bound Updating (ΩPt ) 5: T emp = 0
13: ΩPtemp ← Ω
t Pt
6: else
Pt Pt
14: for Ωtemp 6= N0,1 do 7: T emp ∼ Bernoulli(0.5)
Pt
15: Nh,i ← one leaf of ΩPt 8: end if
Refresh the value of Estimation according to (3) Pt Pt
16: 9: Nh,i ← Nh+1,2i−T emp
10: Select the better region of child node into the path set
Pt
17: Delete the Nh,i from ΩP t
temp ΩPt ← ΩPt ∪ Nh,i Pt

18: end for 11: end for


19: end for 12: Add better region of child node into the path set
Pt
20: end for ΓPt ← Nh,i ∪ ΓPt

The algorithm is called Reformational Hierarchical Trees


(RHT) and the pseudocode is given in Algorithm 1. We use Algorithm 3 Bound Updating
the explored nodes set ΓPt = {NhPtt,it |t ∈ 1, 2...T } to denote Pt
1: for all Nh,i ∈ ΩPt do
all the regions whose courses have been recommended in Pt Pt
and the path set ΩPt = {Nh,i Pt Pt
, Nh−1, , N Pt Pt
...N0,1 } 2: Refresh selected times Th,i ++
d 2i e h−2,d 2i2 e 3: Refresh the average reward according to (4)
to show the explored path in Pt . Besides, we introduce some 4: Refresh the Bound value on the path according to (2)
new notations, the Bound and the Estimation. 5: end for
Pt
We define the Bound Bh,i (t) as the upper bound reward 6:
Pt
Eh+1,2i−1 Pt
= ∞, Eh+1,2i =∞
Pt
value of the node Nh,i in the depth of h ranked i of the context
sub-hypercube Pt , q
√ Algorithm 2 shows the exploration process in RHT. When
Pt
Bh,i (t) = µ̂P Pt h
h,i (t)+ k2 ln T /Th,i (t)+k1 (m) +LX(
t dX α
nT ) ,
(2) we turn to explore new course regions, the model prefers to

5
Courses B. Regret Analyze of RHT
Depth 0 Nodes with
ࡺࡼ૙ǡ૚࢚ highest E now According to the definition of regret in (1), all suboptimal
ࡺࡼ૚ǡ૛࢚
Depth 1 courses which have been selected bring regret. We consider the
Current path regret in one sub-hypercube Pt and get the sum of it at last.
Depth 2
ࡺࡼ૛ǡ૝࢚ Since the regret is the difference between the recommended
ࡺࡼ૜ǡૠ࢚
Depth 3 courses and the best course over reward, we need to define
Optimal path
the best course regions at first. We define the best regions
ࡼ࢚ Pt Pt ∗
ࡺ૝ǡ૚૜ as Nh,i ∗ which contain the best course c in depth h and
After many Optimal nodes
h

rounds optimally ranked ih in context sub-hypercube Pt at time t. To
Depth λ澳 illustrate the regret with regions better, we define the best path
Depth

Optimal course t∗
as `P Pt
h,i = {Nh0 ,i∗ |cPt ∗ ∈ NhP0t,i∗ 0 f or h0 = 1, 2, ...h}.
h0 h
The path is the aggregation of the optimal regions whose
Fig. 4: Algorithm Demonstration depth ranges from 1 to h. To represent the regret precisely, we
need to define the minimum suboptimality  gap which indicates
Pt
the dissimilarity DC cPt ∗ (h, i), cPt ∗ between the optimal
course in that region and the overall optimal course cPt ∗ to
select the regions with higher Estimation value. Note that better describe the model.
based on (3), the parent nodes of the node with the highest
Definition 2. The Minimum Suboptimality Gap is
Estimation value also have highest value of the Estimation Pt
Pt DC(h,i) = f (rcPPtt ∗ ) − f (rcPPtt ∗ (h,i) ),
in their depth, which means for all nodes Nh,i ∈ ΩPt , we can
Pt Pt 0 h
get that Eh,i = max{Eh,i0 |1 ≤ i ≤ 2 }, thus Algorithm 2 and the Context Gap is √
Pt
can find the node with highest Estimation value. After the DX = max{DX (xP Pt
i , xj )} = LX ( nT ) .
t dX α

new regions being chosen, they will be taken in the sets ΓPt Pt
The minimum suboptimality gap of Nh,i is the expected
and ΩPt for the next calculation.
reward defference between overall optimal course and the best
Pt
Pt
In Algorithm 3, we define C(Nh,i Pt
) as the set of node Nh,i one in Nh,i , and the context gap is the difference between
and its descendants, the original point and center point in context sub-hypercube
Pt
C(Nh,i Pt
) = Nh,i Pt
∪ C(Nh+1,2i−1 Pt
) ∪ C(Nh+1,2i ). Pt . As for the context gap, we take the upper bound of it as
max{DX (xP Pt
i , xj )} to bound the regret.
t

And we define NhPtt,it as the node selected by the algorithm


at time t. Then we define Th,i Pt
(t) = t I{NhPtt,it ∈ C(Nh,i
P Pt
)} Assumption 3. For all courses cj , ck ∈ C given the same
as the times that the algorithm has passed by the node Nh,i Pt
, context vector xt , they satisfy
which is equal to the number of selected descendants of Nh,i Pt f (rxt ,ck )−f (rxt ,cj ) ≤ max{f (rcPPtt ∗ )−f (rxt ,ck ), DC Pt
(cj , ck )},
since each node will only be selected once. We use Bh,i Pt
(t) which means
f (rcPPtt ∗ ) − f (rxt ,cj ) ≤ f (rcPPtt ∗ ) − f (rxt ,ck )
in (2) to indicate the upper bound of highest reward. The first
term µ̂P + max{f (rcPPtt ∗ ) − f (rxt ,ck ), DC
Pt
(cj , ck )}.
h,i (t) is the average rewards, and they come from the
t

students’ payoffs as defined Assumption 3 bounds the difference based on dissimilarity


P P
Pt
µ̂h,i (t) =
t (t)−1)µ̂ t (t−1)+r Pt
(Th,i h,i xt ,ct (t)
. (4) between the optimal course cPt ∗ and course cj in context sub-
P
q
t (t)
Th,i hypercube Pt with two terms: (1) the difference between cPt ∗
The second one Pt
k2 ln T /Th,i (t) indicates the uncertainty and ck ; (2) dissimilarity between cj and ck . Taking cj , ck
arising from the randomness of the rewards based on the aver- with appropriate values, we could get some useful conclusions
age value. And the third term k1 (m)h is the maximum possible presented in the following lemma. After the definitions and as-
variation of the reward over the region Nh,i Pt
. As for the sumptions, we can find a measurement to divide all the regions
last term, since we substitute the sub-hypercube center point into two kinds for our following proof. Based on √ the Definition
max{DX (xP Pt 2, we let the set φPt to be the 2[k1 (m)h +LX ( ndTX )α ]-optimal
i , xj )} =
t
for the previous context, we utilize

max {LX ||xi − xj ||α } = LX ( ndTX )α to denote the deviation  in the depth h,
regions
h √ i
Pt
in the context sub-hypercube Pt . φPt = Nh,i f (rcPPtt ∗ )−f (rcPPtt ∗(h,i) ) ≤ 2 k1 (m)h+LX ( ndTX )α .

Note that the we only know a part of courses in nodes, Note that we call the regions in set φPt as optimal regions
uploading new courses into the cloud would not change the and those out of it as suboptimal regions.P
Besides, we divide
Estimation value and the Bound value (this two is irrelevant the set by depth h which means φPt = h φP h , where φh
t Pt
Pt
with course number), thus the algorithm could hold the past denote the regions in the depth h which are in the set φ .
path and explored nodes without recalculating the tree. Based We define the regret when one region is selected above.
on this feature, our model can handle the dynamic increasing Since for every region the algorithm chooses only once, we
dataset effectively. However as for [29], the leaf node is one can bound the regret after we determine how many regions the
single course, which means the added courses will change the algorithm has selected in the recommending process. Based
t∗
whole structure of the course tree. on definition of `Ph,i and Definition 2, we assume that the

6
t∗
suboptimal regions are divorced from `P
h,i in depth k (in Fig.
The whole Packing ball with
regions in radius Ra
4 the depth k = 2). Since we do not know in time T how depth h
many times this context sub-hypercube Pt has been selected, The optimal
regions in depth h
we use context time T Pt to represent
P the total times in Pt . Course in Course in
The sum of T Pt is the total time Pt T Pt = T . suboptimal regions optimal regions
To get the upper bound of the number of suboptimal regions,
we introduce Lemma 1 and Lemma 2. Fig. 5: Distributed Storage based on Binary Tree in Cloud
Pt
Lemma 1. Nodes Nh,i are suboptimal, and in the depth
k (1 ≤ k ≤ h − 1) the path is out of the best path. For
Pt times so the probability is equal to 1, and the sum of them
any integer q, we can get the expect times of the region Nh,i Pt
is equal to q. In the second term, since the Th,i (n) > q, the
and it’s descendants in Pt are
Pt
TP
h i terms when n ≤ q are zero and with the help of inequation
Pt Pt
E[Th,i (T Pt)] ≤ q+ P Bh,i Pt
(n) > f (rcPPtt ∗) and Th,i (n) > q (7) we can get the conclusion.
n=q+1 We determine the threshold of the selected times of the
i Pt
nodes in C(Nh,i ) by Lemma 1. However, from Lemma 1 we
h
Pt Pt
or Bk,i∗ (n) ≤ f (r Pt ∗) for k ∈ {q+1, ..., n−1} .
c Pt
k
decompose the E[Th,i ] with the sum of events, which means
Pt
Proof: We assume that the path is out of the best in the we cannot get the upper bound of E[Th,i ] directly, thus we
depth of k. Since the selected path is out of the optimal path Pt
introduce Lemma 2 to bound E[Th,i ] with the deviation of
in depth k and the algorithm select the regions with higher contexts and courses based on Lemma 1.
Pt Pt
Estimation value, we can know that Ek,i ∗ (n) ≤ Ek,i (n),
k Pt
k
where the first Estimation value is for the best path region Lemma 2. For the suboptimal regions Nh,i , if q satisfies
4k2 ln T
and the second one is for the region selected in the depth of k. q≥  √ α 2
 ,
Pt Pt Pt
DC(h,i) −k1 (m)h −LX (
dX
) (8)
According to (3), we can know that Ek,i k
(n) ≤ Ek+1,i k+1
(n), nT

Pt Pt
then we could get that Ek,i∗ (n) ≤ Ek,ik (n) ≤ Eh,i Pt
(n) ≤ Then for all T Pt ≥ 1, we can get the expected times that node
k Pt
Pt
Bh,i Pt
(n). We define {NhPtt,it ∈ C(Nhi,i )} as the event that the Nh,i has been selected as
Pt 4k2 ln T
algorithm passes from the root node by the node Nh,i Pt
. Ob- E[Th,i (T Pt )] ≤  √ α 2
 + M,
d
k1 (m)h +LX ( n X ) (9)
Pt Pt Pt
viously, we can get that {Nht ,it ∈ C(Nhi,i )} ⊂ {Bh,i (n) ≥ T

Pt Pt where the M is a constant less than 5.


Ek,i ∗ (n)}. So we can bound the time when C(Nh,i ) has been
k
selected as Proof: See appendix A.
Pt PT Pt Pt Pt
E[Th,i (T Pt )] ≤ t=1 P {Bh,i (n) ≥ Ek,i ∗ (n)}. (5) We use the deviation of context and course to represent
k

We divide the set {Bh,i Pt


(n) ≥ Ek,iPt Pt played times in this lemma. Practically speaking, we find
∗ (n)} into {Bh,i (n) > Pt
Pt Pt Pt
k a upper bound for the times of suboptimal regions E[Th,i ],
f (rcPt ∗ )} ∪ {f (rcPt ∗ ) ≥ Ek,i∗ (n)}. According to the (3) once which means we can determine one region’s regret during the
k
again,
n we can get o process. But this is not sufficient to bound the whole regret,
f (rcPPtt ∗ ) ≥ Ek,i
Pt
∗ (n) what we also have to know is the number of optimal regions.
k o(6)
As mentioned above, we divide the regions into two parts
n o n
⊂ f (rcPPtt ∗ ) ≥ Bk,i Pt
∗ (n) ∪ f (rcPPtt ∗ ) ≥ Ek+1,i
Pt
∗ (n) .
k k+1 based on the course model as ΓPt = φPt ∪ (φPt )c , where (•)c
From (6) we find that the set {f (rcPPtt ∗ ) ≥ Ek,i Pt
∗ (n)} can means the complementary set. For the convenience, we use
k
be divided into two parts, and we notice that {f (rcPPtt ∗ ) ≥ the sets of depth to hillustrateithehregion sets i
Pt
(n)} is similar to {f (rcPPtt ∗ ) ≥ Ek,i Pt P Pt Pt c
Ek+1,i ∗ (n)}, thus we ΓPt =
P

k+1 k h φh ∪ h (φh ) . (10)
can keep dividing the set until the depth comes to k. Hence,  
wen obtain We define the packing number as κP h
t Pt
∪Nh,i , Ra to show
o n o
Pt
Bh,i Pt
(n) ≥ Ek,i ∗ (n) ⊂ Bh,iPt
(n) > f (rcPPtt ∗ ) the minimum number of packing balls whose radius is Ra
Pt
k
n−1
n o (7) covering optimal regions composed of ∪Nh,i , where K is the
∪ f (rcPPtt ∗ ) ≥ Bj,iPt
∗ (n) . constant of the whole space size, Ra is the packing balls’
j=k+1 j
radius and d0 is the dimension of the packing ball.
We introduce an integer q to divide (5) further. As for any q,
we have Assumption 4. We assume that there exists a constant K0 ,
i TPPt
that for all the regions of nodes in the depth of h, we can get
h n o
Pt Pt Pt Pt Pt
E Th,i (T ) = P Bh,i (n) ≥ Ek,i ∗ (n), Th,i (n) ≤ q
n=1 k the packing number 
Pt

TP
κP Pt
∪Nh,i K0
n o
+ Pt
P Bh,i Pt
(n) ≥ Ek,i Pt
∗ (n), Th,i (n) > q h
t
, Ra = 0 . (11)
[Ra]d
k
n=1
Pt
TPnh i From this assumption, we could make sure that all the
Pt
≤ q+ P Bh,i Pt
(n) > f (rcPPtt ∗) and Th,i (n)> q Pt
courses in the regions {∪Nh,i } can be covered by the packing
n=q+1
h
Pt Pt
io ball whose radius is Ra. And as for the optimal nodes regions,
or Bj,i ∗(n)≤f (r Pt ∗) for j∈{q+1, ...,n−1}
c
. we could use the packing balls and the radius to bound the
j
In the inequation, we let the event in first term happens all the regret of them. In Fig. 5, we take the dimension of packing

7
ball the same as the course regions as 2 thus we can illustrate Theorem 1. From the lemma above, regret of RHT is 
all the courses by dots in black square (plane). We use the red dX dX +α(dC +2) α
E[R(T )] = O LX dX +α(dC +3) T dX +α(dC +3) (ln T ) dX +α(dC +3) .
dot to denote the courses in the optimal regions and black dot
to denote the courses in the suboptimal regions in depth h. As
Proof: We bound the regret with (14). For E[R1 (T )], the
shown, we could use the number of packing balls to cover all
regret is generated from the optimal course regions whose
the courses in the course regions, which means the number of
courses have been recommended. We use the maximum times
optimal regions in depth h can be bounded with the number
T Pt to bound the number of optimal regions in ΓP t
1 . Since all
of packing balls with the constant K0 and θ. Pt
the regions in Γ1 is optimal, from Assumption 3 if we take cj
With Assumption 4, we introduce Lemma 3 to bound the Pt
as the worst course in region Nh,i which has the lowest mean
number of optimal regions in depth h with the number of Pt ∗
reward and ck = c (h, i), then we can bound the regret of
packing balls. these nodes as
P h √ i
E[R1 (T )] ≤ 4 k1 (m)H + LX ( ndTX )α T Pt
Lemma 3. In h the same context√
sub-hypercube Pt , the num- Pth (15)
i √ i
h dX α
ber of the 2 k1 (m) + LX ( nT ) -optimal regions can be = 4 k1 (m)H + LX ( ndTX )α T .
bounded as As for the second term whose depth is from 1 to H, with
h √ i−dC
φP
h
t
≤ K k1 (m) h
+ LX ( n
dX α
T
) . (12) Lemma 3 and the fact that each regions in ΓP t
2 is just played
at most once, we can get
Proof: From Assumption 2 we can bound the region with H h √ αi
h
4 k1 (m) + LX ( ndTX ) φP
PP
diam(Nh,iPt
) ≥ kθ1 (m)h . As for context deviation we still use E[R2 (T )] ≤ h
t

√ Pt h=1
the bound with LX ( ndTX )α . Since the course number is can be 4K(nT )dX
H h
h
√ αi
4 k1 (m) + LX ( ndTX ) .
P
≤ d
huge such that we cannot know the data exactly, the dimension [k1 (m)H ] C h=0
0
of course cannot  be determined. There exists a constant
√ d , From Lemma 3 we canhknow the number of optimal regions
√ −dC
Pt Pt Pt Pt k1 dX α
h
i
φh ≤ κh ∪{Nh,i ∈ φh }, θ (m) + LX ( nT ) in depth h are φP h dX α
h ≤ K k1 (m) +LX ( nT ) , and the
t

 √ −d0 (13) dX
≤ K0 kθ1 (m)h + LX ( ndTX )α . number of the context sub-hypercubes is (nT ) . Thus the
last inequation can be derived.
Obviously, we know that θ > 10 which means we can simplify
 √ −d When it comes to the last term, we notice that the top
K0 kθ1 (m)h + LX ( ndTX )α further, regions in ΓP Pt
3 are the child regions of the regions in Γ2 , since
t

Pt
 √ −d0  √ −d0 all the regions in Γ2 is the parent regions of the suboptimal
K0 kθ1 (m)h +LX ( ndTX )α ≤ K0 kθ1 (m)h + LθX ( ndTX )α regions. And as for the upper bound of course node k1 (m)h ,

0
√ −d0 the region of child node is smaller than that of parent node,
= K0 θd k1 (m)h +LX ( ndTX )α . which means with the depth increasing, the course gap will
0
Then we take K = K0 θd to get the conclusion. The | • | be smaller than before. Hence we can get that the number of
represents the number of elements in the set and we take the top regions in ΓP Pt
3 is less than twice of Γ2 . Due to the fact
t

minimal d0 as the dimension of course dC . that the child nodes has smaller diam than their parent nodes,
we could find that the course deviation of suboptimal
√ i region
Since we bound the number of suboptimal regions and opti- Pt
h
mal regions, we can bound the regret with attained conclusion Nh,i can be bounded as 4 k1 (m)h−1 + LX ( ndTX )α . And the
above. For simplicity, we divide the regret into three parts regret bound is
H h √
according to ΓPt = ΓP Pt Pt
1 ∪ Γ1 ∪ Γ1 , where E[Ri (T )] is the
t
E[R3 (T )] ≤
PP h−1
4 k1 (m) +LX ( ndTX )
αi P
Pt
Th,i (TPt)
Pt
expected regret of the set Γi (i = 1, 2, 3). Then, we can get Pt h=1 Pt ∈Γ t
Nh,i
P
3
E[R(T )] = E[R1 (T )] + E[R2 (T )] + E[R3 (T )], (14)
(
P K(nT )dX ln T
32k2 
Pt Pt
where Γ1 contains the descendants of φH (H is a constant ≤ √
d

h [k1 (m)h ]dC +1 k1 (m)h +LX ( n X )α
depth to be determined later), ΓP t Pt
2 contains the regions φh the
T

Pt
 )
depth from 1 to H and Γ3 contains descendants of regions d
8M K(nT )dX k1 (m)h +LX ( n X )α
T
in (φP t c Pt
h ) (0 ≤ h ≤ H). Note that top regions in Γ3 is the
+ m[k1 (m)h ]dC
.
Pt
child of regions in φH .
P Pt Note that the bound of E[R2 (T )] is the infinitesimal of higher
Due to the fact that T = T , when all the contexts order of the bound of E[R3 (T )] mathematically, thus we focus
xt are in the same context sub-hypercube Pt , the regret is more on the first term and the last term since the decisive
the smallest. And we consider the situation that time T is factors of regret is the first one and last one. We notice that
distributed uniformly. Under this condition each context sub- with the depth increasing, E[R1 (T )] decreases but E[R3 (T )]
hypercube has the least training data, so the sum of deviation increases. When we let this two terms to be equal, we can get
towards course is the largest. In this extreme situation, all the the regret as follows.
context sub-hypercube has the same times T Pt . After we know E[R1 (T )] isnbounded by
the regret in selecting one region, the times when a region has h √i o
dX α
been selected and the number of chosen regions, we can bound O 4 k1 (m)H + LX ( nT ) T . (16)
the whole regret in Theorem 1. As for E[R3 (T )], we notice that the constant M is the

8
4k2 ln T
infinitesimal of higher order of  √
dX
α 2
 , which
k1 (m)h +LX ( nT )
means we can ignore the influence of the constant M . There- ࢆ૝
ࢆ૜ Virtual node
fore, the bound of E[R3 (T )] is determined by the first term ࢆ૚ ࢆ૛
Course node
and it can be shown as When z=2
d=3
 Distributed
P 32k2 
K(nT ) dX
ln T storage units
O √
d

d +1
h [k1 (m)h ] C k1 (m)h +LX ( n X )α
T
  (17) Fig. 6: Distributed Storage based on Binary Tree in Cloud
ln T (nT )dX
=O H dC +2
.
[k1 (m) ]
As for a context sub-hypercube Pt , all the regions which have
been played bring√two kinds of regret: the regret contributed by regions are empty Z∅ = {Zd+1 , Zd+2 , ...Z2z } be the virtual
context gap LX ( ndTX )α and the regret contributed by course nodes, which means there is no course in that distributed
units {Zj = ∅|j = d + 1, d + 2, ...2z } for any context sub-
region gap k1 (m)H . To √optimize the upper bound of regret,
hypercube. Fig. 6 illustrates the condition when there are 3
we take k1 (m)H = LX ( ndTX )α . Under that condition we let
storage platforms (Coursera, edX and Udacity). We can get
O(E[R1 (T )]) = O(E[R3 (T )]) to get
ln T (nT )dX
the number of distributed units as d = 3 and the depth is
[k1 (m)H ]dC +2
= k1 (m)H T, (18) z = 2 (21 ≤ 3 ≤ 22 ), and the set Z = {Z1 , Z2 , Z3 } and the
α
where nT = lnTT
d
X +α(dC +3)
. For the simplicity, we use set Z∅ = {Z4 }.
γ = ddX
X +α(dC +2)
+α(dC +3) and we use the constant M2 to denote the
Algorithm 4 Distributed Course Recommendation Tree
E[R2 (T )] in E[R3 (T )]. Then we can get the regret as
α[2dX +α(dC +3)] dX Require: The constants k1 and m, the parameter of the storage
2[d +α(dC +3)] d +α(dC +3)
E[R(T )] = 8dX X LXX T γ (ln T )1−γ (19) unit z, the student’s context xt and time T .
α(dC +2)(γ−1)
+32k2 KM2 (dX ) 2 (d
(LX ) C +3)γ−(d C +2) γ
T (ln T ) 1−γ Auxiliary function: Exploration and Bound Updating
 dX dX +α(dC +2) α
 Initialization: For all context sub-hypercubes belonging to PT
Pt Pt Pt
= O LX dX +α(dC +3) T dX +α(dC +3) (ln T ) dX +α(dC +3) . ΓPt = {Nz,1 , Nz,2 ...Nz,2 z}
Pt
Ez,i = ∞ f or i = 1, 2...2z
E[R(T )]
Remark 1: From (19) we can make sure limT →∞ T = 1: for t =1,2,...T do
0, which means the algorithm can find the optimal courses 2: for dt = 0, 1, 2...dX do
for the students finally. Note that the tree exists actually, 3: Find the context interval in dt dimension
we store the tree in the cloud and during the recommending 4: end for
process. Since the dataset is fairly large in the future, using 5: Get the context sub-hypercube Pt
the distributed storage method to solve storage problems is 6: xt ← center point of Pt
inescapable. 7: for j=1,2...2z − 1 do
Pt Pt
8: if Nz,j < Nz,j+1 then
Pt Pt
V. D ISTRIBUTIVELY S TORED C OURSE T REE 9: N z,j = N z,j+1
10: end if
A. Distributed Algorithm for Multiple Course Storage 11: end for
Pt Pt Pt
In practice, there are many MOOC platforms e.g. Coursera, 12: Nh,i ← Nz,j , ΩPt ← Nh,i
edX, Udacity, and the course resources are stored in their 13: Same to Algorithm 1 from line 8 to line 19
respective databases. Thus course recommendation towards 14: end for
heterogeneous sources in the course cloud needs to be handled
by a system that supports distributed-connected storage nodes, In Algorithm 4, we still find the context sub-hypercube
where the storage nodes are in the same cloud with different at first (line 2-6). Then since there are 2z distributed units,
zones. In this section, we turn to present a new algorithm we first identify these top regions (line 7-12). Based on
called Distributed Storage Reformational Hierarchical Trees the attained information, the algorithm can start to find the
(DSRHT), which can handle the heterogeneous sources of course by utilizing the Bound and Estimation the same as
course datasets and improve the storage condition by mapping Algorithm 1 (line 13). For the virtual nodes, we set the Bound
them into distributed units in the course cloud. value of them as 0. As for the tree partition, the difference is
We denote the distributed storage units whose number is d that we leave the course regions whose depth is less than z
as Z = {Z1 , Z2 , ...Zd }, where Zi could be a MOOC learning out to cut down the storage cost. In the complexity section we
platform. We bound the number of distributed units Zd with will prove that the storage can be bounded sublinearly under
2z−1 < d ≤ 2z to fit with the binary tree mode, where z is the optimal condition.
the depth of the tree and 2z is the number of regions in that
depth. Note that the number of distributed units is determined
by the practical situation, thus in every context sub-hypercube B. Regret Analyze of DSRHT
Pt the number of elements in set Z is the same as d. Since In this subsection we prove the regret result in DSRHT can
Zd is not always equal to 2z , we let the storage units whose be bounded sublinearly. Now, again, we divide the regions

9
contrast to get the regret upper bound separately by ΓPt = gorithm, since it explores one region in one round, it’s obvious
ΓP t Pt Pt Pt
1 + Γ2 + Γ3 + Γ4 , where E[Ri (T )] is the expected regret to know the space complexity is linear E[S(T )] = O(T ).
of the set Γi (i = 1, 2, 3, 4). ΓP
Pt t
1 means the regions and Theorem 3. In the optimal condition, we take the number of
their descendants in set φP H whose depth is H(H > z); Γ2
t Pt
 dX +αdC
is the set whose regions are in set φP t
(z < h ≤ H); Γ Pt storage units satisfied 2z = lnTT dX +α(dC +3) , then we can
h 3
contains the regions and their descendants in set (φP t c get the space complexity
h ) (z < dX +αdC
 
Pt 3α dX +α(dC +3)
h ≤ H); and for Γ4 , they are the regions at depth z which E[S(T )] = O T dX +α(d C +3) T dX +α(dC +3) −(ln T ) dX +αdC
.
will be selected twice each based on the Algorithm 1. The
depth H (z < H) is a constant to be selected later. Proof: Every round t has to explore a new leaf region. To
Theorem 2. The regret of the distributively stored algorithm get the optimal result,
 dX we suppose the depth is as deepest as we
+αdC

dX +α(dC +3)
ln( lnTT )
is   can choose z = ln 2 . Under the condition
dX dX +α(dC +2) α
E[R(T )] = O LX dX +α(dC +3) T dX +α(dC +3) (ln T ) dX +α(dC +3) ,  dX +αdC
that t < 2z+1 , we have S1 (T ) ≤ 2z = lnTT dX +α(dC +2) ,
if the number of distributed units satisfies when the time t ≥ 2z+1 , after one round there is one unplayed
dX +αdC

d ≤ 2z ≤ lnTT dX +α(dC +3) .



(20) region being selected,d so+αd the second part is S2 (T ) ≤ T −
X C
T
z+1
 d +α(d
2 = T − 2 ln T X C +2) . Thus we can get the storage
Proof: (Sketch) Detailed proof is given in Appendix B.
complexity
For the first third term, the regret upper bound is the less than 
 d d+α(d
X +αdC

Pt T
the result in Theorem 1, since the regret of node Nh,i will be E[S(T )] = O T − ln T X C +2)
. (23)
larger as far as the increasing depth h.
When it comes to the fourth term, we notice that since the
Remark 3: Since the value of z is changeable, appropriate
depth of z is bounded, and the worst situation happens when
value can make the space complexity sublinear. From (23), if
the number of distributed units is the maximum (2z ).
the data dimension is fairly large, the space complexity will
  be relative small. However, the large database and tremendous
distributed units will make the algorithm learning too slow.
 
z 4k ln T
E[R4 (T )] ≤ (2 − 1)  2
√ 2 + M
 k1 (m)z +LX ( ndX )α  Thus taking an appropriate parameter is crucial.
T
 (21) Besides, we compare our algorithms with some similar
 dX +αdC 
T dX+α(dC+3)  4k2 ln T

works which all use the tree partition. In table I we catego-
≤ ln T √ 2 +M .
 k1 (m)z +LX ( ndX )α  rize these algorithms based on the following characteristics:
T
context-awareness, big data-oriented, time complexity, space
For the value of nT determined by the first third term nT =
T
 α
dX +α(dC +3)
complexity and regret. As for the context-awareness and big
ln T . we have data-oriented, our two algorithms both take them into consid-
 
 d d+α(d
X +αdC
2
E[R4 (T )] = O ln T T X C +3)
ln T (nT ) eration, and ACR [29] and HCT [30] only take one respect
 d +α(d +2)
α
 (22) each. For the time complexity, we can find  that the ACR [29]
X C 2
=O T dX +α(dC +3)
(ln T ) dX +α(dC +3)
. is polynomial in T with O T + K E T but others are linear
with time O (T ln T ). When it comes to space complexity,
From Theorem 1, we minimize the regret by mak- our algorithm RHT and algorithm ACR [29] can bound it
ing context gap√ and course region gap equal too, i.e., linearly, and the HCT [30] reduces it to sublinear. For our
H α
k1 (m) = LX ( ndTX ) . For the simplicity we take the con- DSRHT, we can also realize the sublinear space complexity
stant k2 = 2, and the slicing number can be derived αby setting under the optimal condition. The four algorithms all realize the
O(E[R1 (T )]) = O(E[R3 (T )]) as nT = lnTT dX +α(dC +3) .

sublinear
 regret, and our two algorithms  can bound the regret
Remark 2: Note that if there is only one distributed unit (z = with O T ddX X
+dC +2
+d C +3
(ln T ) X C
d +d
1
+3
by setting α = 1 to make
0), the regret E[R4 (T )] = 0, thus we can get the conclusion of
Theorem 1. Compared to the RHT algorithm, we notice that sure fair comparison with ACR [29] and HCT [30]. To sum
the regret upper bound is the same. Since this algorithm starts up, our algorithms not only consider the context-awareness but
at the depth of z, it need to explore all the nodes in depth z also are big data-oriented. Besides, their time complexity and
first. Thus it performs not as well as RHT in the beginning. space complexity are promising.
However, the algorithm can fit the practical problem better
since there are many MOOC platforms in practice. VII. N UMERICAL R ESULTS
In this section, we present: (1) the source of data-set; (2) the
sum of regret are sublinear and the average regret converges to
VI. S TORAGE C OMPLEXITY
0 finally; (3) we compare the regret bounds of our algorithms
The storage problem has been existing in big data analytics with other similar works; (4) distributed storage method can
for a long time, so how to use the distributed storage scheme reduce the space complexity. Fig. 5 illustrates the MOOC
to handle the problem matters a lot. In this section, we analyze operation pattern in edX [27]. The right side is the teaching
the two algorithms’ space complexity mathematically. We use window and learning resources, and the left includes lessons
S(T ) to represent the storage space complexity. For RHT al- content, homepage, forums and other function options.

10
TABLE I: Theoretical Comparison

Algorithm Context Big data-oriented Time Complexity Space Complexity Regret


P   dI +dC +1 
E
O T 2 + KE T
 dI +dC +2
ACR [29] Yes No O l=0 Kl + T O T ln T
   
d 2 d+1 1
HCT [30] No Yes O(T ln T ) O T d+2 (ln T ) d+2 O T d+2 (ln T ) d+2
 d +d +2 1

X C
RHT Yes Yes O(T ln T ) O(T ) O T dX +dC +3 (ln T ) dX +dC +3
 d +d +2 1

X C
DSRHT Yes Yes O(T ln T ) O (T − 2z ) O T dX +dC +3 (ln T ) dX +dC +3

Including
A. Description of the Database Lessons Homework,Video
We take the database which contains feedback information content files and so on

and course details from the edX [27] and the intermedi-
ary website of MOOC [7]. In those platforms, the context Forums :
dimensions contain nationality, gender, age and the highest Getting
rewards
education level, therefore we take dX = 4. As for the course Video
dimensions, they comprise starting time, language, profes- window
sional level, provided school, course and program proportion,
whether it’s self-paced, subordinative subject etc. Thus we take
the course dimension as 10. For the feedback system, we can
Fig. 7: MOOC Learning Model
acquire reward information from review plates and forums.
Thoroughly, the reward is produced from two aspects, which
are the marking system and the comments from forums.
For the users, when a novel field comes into vogue, tremendous numbers of people get access to it within seconds. The data we obtained cover 2 × 10^5 students using MOOCs on those platforms, and the average number of courses a student comments on is around 30. Our algorithm focuses on the group of students in the same context sub-hypercube rather than on individuals; thus, when users arrive later with context information and historical records, we simply treat them as new training data without distinguishing them. However, the number of users is limited. Even though generating a course is time-consuming, the number of courses is unlimited and education runs through the whole development of human beings. Our algorithm pays more attention to the future, highly inflated MOOC curriculum resources, and the existing data bank is not tremendous enough to demonstrate the superiority of our algorithm, since MOOC is a new field in education.

We find 11352 courses on those platforms, including plenty of finished courses. The number of courses doubles every year; based on this trend, the quantity will grow more than forty thousand fold within 20 years. To give consideration to both the accuracy and the scale of the data sources, we replicate the original sources forty-five thousand times to satisfy the number requirement. Thus we extend the 11352 course records to around 5 × 10^8 to simulate the explosive data size of courses in 2030.

B. Experimental Setup

As for our algorithm, the final number of training data is over 6 × 10^6 and the number of courses is about 5 × 10^8. Note that we focus more on the comparison than on showing the superiority of our algorithms, thus we take the statistical course data to better illustrate the comparison. The compared works are introduced as follows.
• Adaptive Clustering Recommendation Algorithm (ACR) [29]: The algorithm injects contextual factors and is capable of adapting to more students; however, when the course database is fairly large, the ergodic process in this model cannot handle the dataset well.
• High Confidence Tree algorithm (HCT) [30]: The algorithm supports an unlimited dataset however large it is, but there is only one student in its recommendation model since it does not take context into consideration.
• Our model considers both the scale of courses and the users' context, thus it can better suit the future MOOC situation. In DSRHT we sacrifice some immediate interest to obtain better long-term performance.
To verify the conclusions practically, we divide the experiment into the following three steps:
1) Step 1: In this step we compare our RHT algorithm with the two previous works, ACR [29] and HCT [30], under different sizes of training data. We input over 6 × 10^6 training data, including context information and feedback records in the reward space described in the database section, into the three models, and then the models start to recommend the courses stored in the cloud. In consideration of HCT not supporting context, we normalize all the context information to the same value (the center point of the unit context hypercube). Since the reward distribution is stochastic, we run the simulation 10 times and take the average values so that the interference of random factors is restrained. Then the two regret tendency diagrams are plotted to evaluate the algorithms' performances.
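The two context operations used in Step 1, namely normalizing every context to the center of the unit hypercube for the HCT baseline and grouping students by context sub-hypercube, can be sketched as follows. The helper names and the partition granularity are our own illustrative choices.

def center_context(d_x=4):
    """Context fed to the HCT baseline: the center of the unit context hypercube."""
    return [0.5] * d_x

def sub_hypercube_index(context, splits_per_dim=4):
    """Index of the context sub-hypercube that a normalized context falls into."""
    return tuple(min(int(x * splits_per_dim), splits_per_dim - 1) for x in context)

# nationality, gender, age and education level, each scaled into [0, 1]
student = [0.32, 1.0, 0.27, 0.75]
print(center_context())                  # [0.5, 0.5, 0.5, 0.5]
print(sub_hypercube_index(student))      # (1, 3, 1, 3)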

Fig. 8: Comparison of Regret (RHT). Regret (×10^5) versus arrival data (×10^6) for ACR, HCT and RHT.

Fig. 9: Comparison of Average Regret (RHT). Average regret versus arrival data (×10^6) for ACR, HCT and RHT.

Fig. 10: Comparison of Regret with Different z. Regret (×10^5) versus arrival data (×10^6) for z = 0, z = 10 and z = 20.

Fig. 11: Comparison of Average Regret with Different z. Average regret versus arrival data (×10^6) for z = 0, z = 10 and z = 20.
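The curves in Figs. 8-11 are running statistics over the arrival stream: cumulative regret and its per-arrival average. A minimal sketch of how such counters can be maintained during training is given below; it is our own illustration, with best_reward and obtained_reward standing in for the per-arrival optimal and achieved rewards.

class RegretTracker:
    """Track cumulative regret and average regret over the arrival stream."""
    def __init__(self):
        self.t = 0
        self.cum_regret = 0.0

    def update(self, best_reward, obtained_reward):
        self.t += 1
        self.cum_regret += best_reward - obtained_reward

    def average_regret(self):
        return self.cum_regret / self.t if self.t else 0.0

tracker = RegretTracker()
for best, got in [(1.0, 0.4), (1.0, 0.7), (0.9, 0.85)]:
    tracker.update(best, got)
print(tracker.cum_regret, tracker.average_regret())   # 0.95 0.316...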

2) Step 2: We use the DSRHT algorithm to simulate the results. The RHT algorithm can be seen as a degraded DSRHT with z = 0, and we compare the DSRHT algorithm under different parameters z. Without loss of generality, we take z = 0, z = 10 and z = ln((T / ln T)^{(dX + αdC)/(dX + α(dC + 3))}) / ln 2 ≈ 20 (a short numerical sketch of this choice is given after the step list). Then we plot the regret against z to analyze the constant optimal parameter.
3) Step 3: We record the storage data to analyze the space complexity of the four algorithms. First we upload 517.68 TB of course indexing information to our university high-performance computing platform, whose GPU reaches 18.46 TFlops and whose SSD cache is 1.25 TB. Then we implement and run the four algorithms successively. In the process of training, we record the regret six times, and at the end of training we record the space usage of the tree, which represents the training cost. As for DSRHT, we use virtual partitions in the school servers to simulate the distributively stored course data. Specifically, we re-upload the course data to the school servers in 1024 virtual partitions, and then perform the DSRHT algorithm.
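For reference, the depth parameter in Step 2 can be evaluated numerically. The sketch below (our own illustration) plugs dX = 4 and dC = 10 into the formula; the values assumed for α and for T are our own choices, and with α = 1 and T of the order of the extended course set (5 × 10^8) the formula lands near the z ≈ 20 used above. It also notes that z = 10 corresponds to 2^10 = 1024 subtrees, which is consistent with the 1024 virtual partitions used in Step 3.

import math

def depth_parameter(T, d_x=4, d_c=10, alpha=1.0):
    """z = log2((T / ln T)^((d_x + alpha*d_c) / (d_x + alpha*(d_c + 3))))."""
    exponent = (d_x + alpha * d_c) / (d_x + alpha * (d_c + 3))
    return exponent * math.log(T / math.log(T), 2)

print(round(depth_parameter(T=5e8)))   # about 20 under the assumptions above
print(2 ** 10)                         # 1024 subtrees when z = 10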
C. Results and Analysis

We analyze our algorithms from two angles: comparing them with the other two works, and comparing DSRHT with itself under different parameters z. In each direction, we compare the regret first and then analyze the average regret. We then discuss the accuracies based on the average regret. At last we compare the storage conditions of the different algorithms.

TABLE II: Average Accuracies of RHT

Number (×10^6) | ACR [29] | HCT [30] | RHT
1 | 65.43% | 81.02% | 85.34%
2 | 78.62% | 82.13% | 87.62%
3 | 83.23% | 82.76% | 89.92%
4 | 86.28% | 83.01% | 90.45%
5 | 88.19% | 83.22% | 91.09%
6 | 88.79% | 83.98% | 91.87%

TABLE III: Average Accuracies of DSRHT

Number (×10^6) | z = 0 | z = 10 | z = 20
1 | 85.34% | 82.67% | 51.10%
2 | 87.62% | 86.98% | 72.94%
3 | 89.92% | 90.49% | 81.37%
4 | 90.45% | 91.50% | 85.79%
5 | 91.09% | 92.03% | 88.33%
6 | 91.87% | 92.89% | 89.04%

TABLE IV: Average Storage Cost

 | ACR [29] | HCT [30] | RHT | DSRHT (z = 10)
Storage Cost (TB) | 12573 | 2762 | 4123 | 2132
Storage Ratio | 24.287 | 5.335 | 7.964 | 4.118
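The storage ratio row of Table IV is simply the storage cost divided by the 517.68 TB of raw course indexing information mentioned in Step 3; a quick check (our own illustration) reproduces the reported ratios.

COURSE_SPACE_TB = 517.68
storage_cost_tb = {"ACR": 12573, "HCT": 2762, "RHT": 4123, "DSRHT (z=10)": 2132}
for name, cost in storage_cost_tb.items():
    # prints 24.287, 5.335, 7.964 and 4.118 respectively
    print(name, round(cost / COURSE_SPACE_TB, 3))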
In Fig. 8 and Fig. 9 we compare the RHT algorithm with ACR and HCT. From Fig. 8 (the regret diagram), we can see that our method is better than the other two, with less regret from the beginning. The HCT algorithm performs better than ACR at the start; as time goes on, ACR's regret becomes lower than HCT's. From Fig. 9 (the average regret diagram), HCT's average regret is less than that of ACR at first, and the results also show that ACR performs slightly better than HCT in the end.

Table II records the average accuracies, that is, the total rewards divided by the number of training data (denoted by "Number"). We find that as time increases, the performance of all three algorithms improves. Our algorithm has the highest accuracy throughout the learning period. ACR does not perform well when the process starts, with an accuracy of 65.43%, worse than that of HCT. Finally, ACR converges to 88.79% while HCT is still at 83.98%. As for our algorithm, it reaches 91.87%, which is much better than HCT.

Fig. 10 and Fig. 11 analyze the DSRHT algorithm using different parameters z of 0, 10 and 20. From the diagrams we find that, compared with z = 0, the setting z = 10 is not as good at the beginning but outperforms it in the long run. However, when z = 20, the algorithm takes a lot of time before it starts to recommend courses precisely; even though its accuracy finally approaches that of the other two settings at the end, the effect is not as good as we expected.

Table III illustrates the accuracy more precisely. When the training number is less than 4 × 10^6, the setting z = 20 is the worst of the three. After that, it comes close to catching up with the RHT setting, reaching 88.33% against 91.09%. Thus we can see that, when selecting the number of distributed storage partitions, one cannot pursue quantity alone; whether the choice makes sense in practice matters as well.

As for the storage analysis, we use the detailed information of the courses to represent the course data, and the whole course storage is 517.68 TB. To gain more intuition, we use the ratio of the actual space occupied to the course space occupied to denote the storage ratio. From Table IV we know that the ACR [29] algorithm is not suitable for real big data, since its storage ratio reaches 24.287. The HCT [30] algorithm performs well in space complexity, better than RHT. As for DSRHT, the storage ratio is 4.118, which is less than HCT's and nearly half of RHT's.

VIII. CONCLUSION

This paper has presented the RHT and DSRHT algorithms for course recommendation in MOOC big data. Considering the individualization in recommender systems, we introduce context-awareness into our algorithms. They are suitable for the tremendously huge and changeable datasets of future MOOCs. Meanwhile, they achieve linear time and space complexity, and can achieve sublinear space complexity in the optimal condition. Furthermore, we use distributed storage to relieve the storing pressure and make our approach more suitable for big data. Experiment results verify the superior performance of RHT and DSRHT when compared with existing related algorithms.

APPENDIX A: PROOF OF LEMMA 2

Proof: For the first term in Lemma 1, we take c_j, c_k ∈ N_{h,i}^{P_t} and c_k = c^{P_t*} for all contexts x_i ∈ X; then we can get that

f(r_{c^{P_t*}}^{P_t}) − f(r_{c_j}^{P_t}) ≤ diam(N_{h,i}^{P_t}) + L_X(√d_X / n_T)^α ≤ k_1(m)^h + L_X(√d_X / n_T)^α,   (24)

where c^{P_t*} is the best course, whose reward is the highest in the context sub-hypercube P_t. We denote the event that the path goes through the region N_{h,i}^{P_t} as {N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}}; therefore,

P{ B_{h,i}^{P_t}(T^{P_t}) ≤ f(r_{c*}^{P_t}) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
= P{ μ̂_{h,i}^{P_t}(T^{P_t}) + √(k_2 ln T / T_{h,i}^{P_t}(T^{P_t})) + k_1(m)^h + L_X(√d_X / n_T)^α ≤ f(r_{c*}^{P_t}) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
= P{ [ μ̂_{h,i}^{P_t}(T^{P_t}) + k_1(m)^h + L_X(√d_X / n_T)^α − f(r_{c*}^{P_t}) ] T_{h,i}^{P_t}(T^{P_t}) ≤ −√(k_2 (ln T) T_{h,i}^{P_t}(T^{P_t})) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
= P{ Σ_{n=1}^{T^{P_t}} [ r_{c_n}^{P_t}(n) − f(r_{c_n}^{P_t}) ] I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} + Σ_{n=1}^{T^{P_t}} [ f(r_{c_n}^{P_t}) + k_1(m)^h + L_X(√d_X / n_T)^α − f(r_{c*}^{P_t}) ] I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} ≤ −√(k_2 (ln T) T_{h,i}^{P_t}(T^{P_t})) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
≤ P{ Σ_{n=1}^{T^{P_t}} ( r_{c_n}^{P_t}(n) − f(r_{c_n}^{P_t}) ) I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} ≤ −√(k_2 (ln T) T_{h,i}^{P_t}(T^{P_t})) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }.

The last inequality is based on expression (24): since the second term is positive, we drop it to obtain the last expression.

For convenience of illustration, we pick the rounds n for which I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} equals 1, and we use r̃_c^{P_t} to indicate the r_{c_n}^{P_t} observed when I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} holds. Thus,

P{ Σ_{n=1}^{T^{P_t}} ( r_{c_n}^{P_t}(n) − f(r_{c_n}^{P_t}) ) I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} ≤ −√(k_2 (ln T) T_{h,i}^{P_t}(T^{P_t})) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
= P{ Σ_{n=1}^{T_{h,i}^{P_t}(T^{P_t})} ( r̃_c^{P_t} − r̃_{c_n}^{P_t} ) ≤ −√(k_2 (ln T) T_{h,i}^{P_t}(T^{P_t})) and T_{h,i}^{P_t}(T^{P_t}) ≥ 1 }
≤ Σ_{n=1}^{T^{P_t}} P{ Σ_{j=1}^{n} ( f(r̃_c^{P_t}) − f(r̃_{c_j}^{P_t}) ) ≤ −√(k_2 (ln T) n) }.

We consider the situations n = 1, 2, ..., T_{h,i}^{P_t}(T^{P_t}) and the fact that T_{h,i}^{P_t}(T^{P_t}) ≤ T^{P_t}. Besides, the last inequality uses the union bound and loosens the threshold:

Σ_{n=1}^{T^{P_t}} P{ Σ_{j=1}^{n} ( f(r̃_c^{P_t}) − f(r̃_{c_j}^{P_t}) ) ≤ −√(k_2 (ln T) n) } ≤ Σ_{n=1}^{T^{P_t}} exp(−2 k_2 ln T) ≤ (T^{P_t})^{−2k_2+1}.   (25)

Note that the sum of time T represents the contextual sum of time, since the number of courses in a context sub-hypercube is stochastic; for convenience, we use T as the sum of time. With the help of the Hoeffding-Azuma inequality [26], we get the conclusion, and we take the constant k_2 ≥ 1.

With the help of the assumption on the range of q, we can get

( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α ) / 2 ≥ √( k_2 ln T / q ).   (26)

Thus,

P{ B_{h,i}^{P_t}(T^{P_t}) > f(r_{c^{P_t*}}^{P_t}) and T_{h,i}^{P_t}(T^{P_t}) ≥ q }
= P{ μ̂_{h,i}^{P_t}(T^{P_t}) + √(k_2 ln T / T_{h,i}^{P_t}(T^{P_t})) + k_1(m)^h + L_X(√d_X / n_T)^α > f(r_{c^{P_t*}(h,i)}^{P_t}) + D_{C(h,i)}^{P_t} and T_{h,i}^{P_t}(T^{P_t}) ≥ q }
≤ P{ μ̂_{h,i}^{P_t}(T^{P_t}) + √(k_2 ln T / q) + k_1(m)^h + L_X(√d_X / n_T)^α > f(r_{c^{P_t*}(h,i)}^{P_t}) + D_{C(h,i)}^{P_t} and T_{h,i}^{P_t}(T^{P_t}) ≥ q }
= P{ μ̂_{h,i}^{P_t}(T^{P_t}) − f(r_{c^{P_t*}(h,i)}^{P_t}) > ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α ) / 2 and T_{h,i}^{P_t}(T^{P_t}) ≥ q }.

When we multiply both sides by T_{h,i}^{P_t}(T^{P_t}), we get the inequality below:

P{ μ̂_{h,i}^{P_t}(T^{P_t}) − f(r_{c^{P_t*}(h,i)}^{P_t}) > ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α ) / 2 and T_{h,i}^{P_t}(T^{P_t}) ≥ q }
= P{ Σ_{n=1}^{T^{P_t}} ( r_n^{P_t}(n) − f(r_{h,i}^{P_t}) ) I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} > [ ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α ) / 2 ] T_{h,i}^{P_t}(T^{P_t}) and T_{h,i}^{P_t}(T^{P_t}) ≥ q }.

With the union bound and the Hoeffding-Azuma inequality [26], we can get that

P{ Σ_{n=1}^{T^{P_t}} ( r_n^{P_t}(n) − f(r_{c_n}^{P_t}) ) I{N_{h,i}^{P_t} ∈ ℓ_{H,I}^{P_t}} > [ ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α ) / 2 ] T_{h,i}^{P_t}(T^{P_t}) and T_{h,i}^{P_t}(T^{P_t}) ≥ q } ≤ (T^{P_t})^{−2k_2+1}.

According to Lemma 1 and the prerequisite in Lemma 2, we select the upper bound of q as 4 k_2 ln T / ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α )^2 + 1. Thus,

E[ T_{h,i}^{P_t}(T^{P_t}) ] ≤ Σ_{n=q+1}^{T^{P_t}} P[ B_{h,i}^{P_t}(n) > f(r_{c^{P_t*}}^{P_t}) and T_{h,i}^{P_t}(n) > q, or B_{j,i_{h_0}}^{P_t}(n) ≤ f(r_{c^{P_t*}}) for j ∈ {q+1, ..., n−1} ] + 4 k_2 ln T / ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α )^2 + 1
≤ 4 k_2 ln T / ( D_{C(h,i)}^{P_t} − k_1(m)^h − L_X(√d_X / n_T)^α )^2 + Σ_{n=q+1}^{T^{P_t}} [ (T^{P_t})^{−2k_2+1} + n^{−2k_2+2} ].

Since

1 + Σ_{n=q+1}^{T^{P_t}} [ (T^{P_t})^{−2k_2+1} + n^{−2k_2+2} ] ≤ 4 ≤ M,   (27)

we can get the conclusion of Lemma 2.

APPENDIX B: PROOF OF THEOREM 2

Proof: Based on the segmentation, the regret can be presented as

E[R(T)] = E[R_1(T)] + E[R_2(T)] + E[R_3(T)] + E[R_4(T)].

For E[R_1(T)], since it is the same as in Algorithm 1, we can get the first term as

E[R_1(T)] ≤ 4 [ k_1(m)^H + L_X(√d_X / n_T)^α ] T.   (28)

The depth runs from z to H, revealing that H > z. To satisfy this, we suppose 2^H ≥ (T / ln T)^{(d_X+αd_C)/(d_X+α(d_C+2))}. Since the exploration process starts from depth z, the depth we can select satisfies the inequality above. Thus the second term's regret bound is

E[R_2(T)] ≤ Σ_{h=z}^{H} 4 [ k_1(m)^h + L_X(√d_X / n_T)^α ] φ_h^{P_t} ≤ ( 4K(n_T)^{d_X} / [k_1(m)^h]^{d_C} ) Σ_{h=z}^{H} 4 [ k_1(m)^h + L_X(√d_X / n_T)^α ].   (29)

We choose the context sub-hypercube whose regret bound is the biggest to continue the inequality (29). As for the third term, the regret bound is

E[R_3(T)] ≤ Σ_{h=z}^{H} Σ_{N_{h,i}^{P_t} ∈ Γ_3^{P_t}} 4 [ k_1(m)^h + L_X(√d_X / n_T)^α ] (φ_{h−1}^{P_t})^c.

We notice that the regions in Γ_3^{P_t} are the child regions of Γ_2^{P_t}. To be more specific, in the binary tree the child regions are more numerous than the parent regions but fewer than twice as many, thus the number of top regions in Γ_3^{P_t} is less than twice that of Γ_2^{P_t}:

Σ_{h=z}^{H} Σ_{N_{h,i}^{P_t} ∈ Γ_3^{P_t}} 2 · 4 [ k_1(m)^h + L_X(√d_X / n_T)^α ] (φ_{h−1}^{P_t})^c
≤ Σ_{h=z}^{H} { 32 k_2 K(n_T)^{d_X} ln T / ( [k_1(m)^h]^{d_C+1} [ k_1(m)^h + L_X(√d_X / n_T)^α ] ) + 8 M K(n_T)^{d_X} [ k_1(m)^h + L_X(√d_X / n_T)^α ] / ( m [k_1(m)^h]^{d_C} ) }.

From the upper bounds of the regrets E[R_1(T)], E[R_2(T)] and E[R_3(T)], we can see that the three upper bounds are the same as those of the RHT algorithm.

REFERENCES
[1] L. Pappano, "The Year of the MOOC," The New York Times, 2014.

[2] T. Lewin, “Universities Abroad Join Partnerships on the Web,” New York
Times, 2013.
[3] Coursera, https://www.coursera.org/.
[4] A. Brown, “MOOCs make their move,” The Bent, vol. 104, no. 2, pp.
13-17, 2013.
[5] D. Glance, “Universities are still standing. The MOOC revolution that
never happened,” The Conversation, www.theconversation.com/au, July
15, 2014a.
[6] M. Hilbert, “Big data for development: a review of promises and
challenges,” Development Policy Review, vol. 34, no. 1 pp. 135-174, 2016.
[7] Guoke MOOC, http://mooc.guokr.com/
[8] G. Paquette, A. Miara, “Managing open educational resources on the
web of data,” International Journal of Advanced Computer Science and
Applications (IJACSA), vol. 5, no. 8, 2014.
[9] G. Paquette, O. Mario, D. Rogozan, M. Lonard, “Competency-based per-
sonalization for Massive Online Learning,” Smart Learning Environments,
vol. 2, no. 1, pp. 1-19, 2015.
[10] C. G. Brinton, M. Chiang, “MOOC performance prediction via click-
stream data and social learning networks,” IEEE Conference on Computer
Communications (INFOCOM), pp. 2299-2307, 2015.
[11] S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari, “X-armed bandits,”
Journal of Machine Learning Research pp. 1655-1695, 2011.
[12] G. Adomavicius, A. Tuzhilin, “Toward the next generation of recom-
mender systems: a survey of the state-of-the-art and possible extensions,”
IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6,
pp. 734-749, 2005.
[13] D. Yanhui, W. Dequan, Z. Yongxin, et al. “A group recommender
system for online course study,” International Conference on Information
Technology in Medicine and Education, pp. 318-320, 2015.
[14] M. J. Pazzani, D. Billsus, “Content-based recommendation over a cus-
tomer network for ubiquitous shopping,” IEEE Transactions on Services
Computing, vol. 2, no. 2, pp. 140-151, 2009.
[15] R. Burke, “Hybrid recommender systems: Survey and experiments,”
User Modeling and User-adapted Interaction, vol. 12, no. 4, pp. 325-
341, 2007.
[16] K. Yoshii, M. Goto, K. Komatani, T. Ogata, H. G. Okuno, “An efficient
hybrid music recommender system using an incrementally trainable
probabilistic generative model,” IEEE Transactions on Audio, Speech,
Language Processing, vol. 16, no. 2, pp. 435-447, 2008.
[17] L. Yanhong, Z. Bo, G. Jianhou, “Make adaptive learning of the MOOC:
The CML model,” International Conference on Computer Science and
Education (ICCSE), pp. 1001-1004, 2015.
[18] A. Alzaghoul, E. Tovar, “A proposed framework for an adaptive learning
of Massive Open Online Courses (MOOCs),” International Conference
on Remote Engineering and Virtual Instrumentation, pp. 127-132, 2016.
[19] C. Cherkaoui, A. Qazdar, A. Battou, A. Mezouary, A. Bakki, D.
Mamass, A. Qazdar, B. Er-Raha, “A model of adaptation in online
learning environments (LMSs and MOOCs),” International Conference
on Intelligent Systems: Theories and Applications (SITA), 2015, pp. 1-6.
[20] E. Hazan, N. Megiddo, “Online learning with prior knowledge,” Inter-
national Conference on Computational Learning Theory, Springer Berlin
Heidelberg, pp. 499-513, 2007.
[21] A. Slivkins, “Contextual bandits with similarity information,” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 2533-2568, 2014.
[22] J. Langford, T. Zhang, “The epoch-greedy algorithm for multi-armed
bandits with side information,” Advances in neural information processing
systems, pp. 817-842, 2008.
[23] W. Chu, L. Li, L. Reyzin, R. E. Schapire, “Contextual bandits with
linear payoff functions,” AISTATS, vol. 15, pp. 208-214, 2011.
[24] T. Lu, D. Pál, M. Pál, “Contextual multi-armed bandits,” International
Conference on Artificial Intelligence and Statistics (AISTATS), pp. 485-
492, 2010.
[25] C. Tekin, M. van der Schaar, “Distributed online big data classification
using context information,” IEEE Annual Allerton Conference: Commu-
nication, Control, and Computing, pp. 1435-1442, 2013.
[26] W. Hoeffding, “Probability inequalities for sums of bounded random
variables,” Journal of the American Statistical Association, vol. 58, no.
301, pp. 13-30, 1963.
[27] edX, https://www.edx.org/
[28] J. P. Berrut, L. N. Trefethen, “Barycentric lagrange interpolation,” Siam
Review, vol. 46, no. 3, pp. 501-517, 2004.
[29] L. Song, C. Tekin, M. van der Schaar, “Online learning in large-
scale contextual recommender systems,” IEEE Transactions on Services
Computing, vol. 9, no. 3, pp. 433-445, 2014.
[30] M. G. Azar, A. Lazaric, E. Brunskill, “Online Stochastic Optimization
under Correlated Bandit Feedback,” Proc. Int. Conf. on Machine Learning
(ICML), Beijing, pp. 1557-1565, 2014.
