Context-Aware Online Learning for Course
Abstract—The Massive Open Online Course (MOOC) has expanded significantly in recent years. With the widespread adoption of MOOCs, the opportunity to study fascinating courses [...]

[...] Coursera, edX, Udacity and so on [4]. However, due to the rapid growth rate of users, the amount of needed courses has [...]
arXiv:1610.03147v2 [cs.LG] 16 Oct 2016
Fig. 2: Context and Course Space Relation Schema at slot t0
Fig. 1: MOOC Course Recommendation and Feedback System

A. System Model

Fig. 1 illustrates our model of operation. At first, the professors upload the course resources to the course cloud, where the uploaded courses are indexed by the set C = {c1, c2, ...}, whose elements are vectors with dimension dC representing the number of course features. As for the users, there are continuously incoming students over time, denoted as S = {s1, s2, ...}. Then, the system collects context information of the students. We denote the set of context information of students as X = {x1, x2, ..., xi, ...}, where xi is a vector in the context space X.

We use time slots t = 1, 2, ..., T to denote rounds. For simplicity, we use st, xt, ct to denote the current incoming student, the student context vector and the recommended course at time t. In each time slot t, there are three running states: (1) a student st with an exclusive context vector xt comes into our model; (2) the model recommends a course ct, by randomly selecting one from the current course node, to the student st; (3) the student st provides feedback on the newly recommended course ct to the system.

We assume the context sequence generating the rewards of courses follows an i.i.d. process; otherwise, if there is mixing within the sequence in practice, we could use the technique in [25] of using two i.i.d. sequences to bound the mixing process without much performance difference. rxi,cj(t) denotes the feedback reward from the student with context xi for course cj at time t. In the recommending process, first a student sk with context vector xi arrives. Then the system recommends a course cj to the student sk based on the historical reward information and the context vector xi; after that, the student sk gives a new reward rxi,cj(t) to the system. We define rxi,cj(t) = f(xi, cj) + εt, where εt is a bounded noise with E[εt | (xi, cj)] = 0 and f(xi, cj) is a function of the two variables (xi, cj). Besides, we normalize the reward as rxi,cj(t) ∈ [0, 1].

Fig. 2 illustrates the relationship between the context vector xi and the course vector cj over reward. To better illustrate the relations, we degenerate their dimensions to dX = dC = 1. Practically, we have the reward axis with dimension 1. Thus, we take the context vector and the course details as two horizontal axes in a space rectangular coordinate system. From the schematic diagram in Fig. 2 at time slot t0, the reward varies along the context axis and the course axis. To be more specific, for a determined student sk0 whose context xi0 is unchanged, the reward rxi0,cj(t0) differs across courses cj, shown in the blue plane coordinate system. On the other hand, for a determined course cj0, shown in the crystal plane coordinate system, people with different contexts xi have different rewards rxi,cj0(t0).

B. Context Model for Individualization

The context space is a dX-dimensional space, which means the context xi ∈ X is a vector with dX dimensions. The dX-dimensional vectors encode features such as age, cultural background, nationality, educational level, etc., representing the characteristics of the student. We normalize every dimension of context to range from 0 to 1; e.g., educational level ranges over [0, 1], denoting the educational level from elementary to expert in the related fields. With the normalization in each dimension, we denote the context space as X = [0, 1]^dX, which is a unit hypercube. As for the difference between two contexts, DX(xi, xj) is used to delegate the dissimilarity between contexts xi and xj. We use the Lipschitz condition to define the dissimilarity.

Assumption 1. There exists a constant LX > 0 such that for all contexts xi, xj ∈ X, we have DX(xi, xj) ≤ LX ||xi − xj||^α, where ||•|| denotes the Euclidean norm in R^dX.

Note that the Lipschitz constant LX is not required to be known by our recommendation algorithms. It will only be used in quantifying the learning algorithms' performance. As for the parameter α, it is referred to as similarity information [21], and we assume that it is known by the algorithms that qualify the degree of similarity among courses. We present the context dissimilarity mathematically with LX and α, and they will appear in our regret bounds.

To illustrate the context information precisely, we define the slicing number of the context unit hypercube as nT, indicating the number of sets in the partition of the context space X. With the slicing number nT, each dimension can be divided into nT parts, and the context space is divided into (nT)^dX parts, where each part is a dX-dimensional hypercube with dimensions 1/nT × 1/nT × ... × 1/nT. To have a better formulation, PT =
{P1, P2, ..., P_{(nT)^dX}} is used to denote the sliced chronological sub-hypercubes, and we use Pt to denote the sub-hypercube selected at time t. As illustrated in Fig. 3, we let dX = 3 and nT = 2. We divide every axis into 2 parts, and the number of sub-hypercubes is (nT)^dX = 8. For simplicity, we use the center point xPt of the sub-hypercube Pt to represent the specific contexts xt at time t. With this model of context, we divide the different users into (nT)^dX types. For simplicity, when Pt is used in the upper right of a notation, it means that the notation is in the sub-hypercube Pt which is selected at time t, and the subscript "∗" means the optimal solution over that notation.

Fig. 3: Context and Course Partition Model

C. Course Set Model for Recommendation

We model the set of courses as a dC-dimensional space, where dC is a constant denoting the number of all course features, e.g., language, professional level, provided school in C. We set every course in C as a dC-dimensional vector, and for the newly added dimensions of courses, the value is set as 0. Similar to the context, we define the dissimilarity of courses as DC^{Pt}(ci, cj) to indicate the farthest relativity between two courses ci, cj belonging to any of the context vectors xi ∈ Pt at time t, where the context vector xt belongs to the context sub-hypercube Pt.

Definition 1. Let DC^{Pt} over C be a non-negative mapping (C² → R): DC^{Pt}(ci, cj) = sup_{xi ∈ Pt} DC^{xi}(ci, cj), where DC^{Pt}(ci, cj) = DC^{xt}(ci, cj) = 0 when i = j.

We assume that two courses which are more relevant have the smaller dissimilarity between them. For example, courses both taught in English have closer dissimilarity than courses with different languages when concerning the language feature of a course.

As for the course model, we use a binary tree whose nodes are associated with subsets of X to index the course dataset. We denote the nodes of courses as {N^{Pt}_{h,i} | 1 ≤ i ≤ 2^h; h = 0, 1, ...; ∀ Pt ∈ PT}. Let N^{Pt}_{h,i} denote the node at depth h ranked i from left to right in the context sub-hypercube Pt which is selected at time t, where the rank i of nodes at depth h is restricted by 1 ≤ i ≤ 2^h. We let N^{Pt}_{h,i} ∈ X represent the course region associated with the node N^{Pt}_{h,i}. The region of the root node N^{Pt}_{0,1} of the binary course tree is the set of the whole courses, N^{Pt}_{0,1} = C. And with the exploration of the selected tree, the region of the two child nodes contains all the courses from their parent region, and they never intersect with each other: N^{Pt}_{h,i} = N^{Pt}_{h+1,2i−1} ∪ N^{Pt}_{h+1,2i}, and N^{Pt}_{h,i} ∩ N^{Pt}_{h,j} = ∅ for any i ≠ j. Thus, C can be covered by the regions of N^{Pt}_{h,i} at any depth: C = ∪_{i=1}^{2^h} N^{Pt}_{h,i}. To better describe the regions, we define diam(N^{Pt}_{h,i}) to indicate the size of course regions: diam(N^{Pt}_{h,i}) = sup_{ci,cj ∈ N^{Pt}_{h,i}} DC(ci, cj) for any ci, cj ∈ N^{Pt}_{h,i}.

The dissimilarity DC(ci, cj) between courses ci and cj can be represented as the gap between course languages, course time length, course types and any other features which indicate the discrepancy. We denote the size of a region diam(N^{Pt}_{h,i}) by the largest dissimilarity in the course dataset N^{Pt}_{h,i} for any context xi ∈ Pt. Note that the diam is based on the dissimilarity, and it can be adjusted by selecting different mappings. For our analysis, we make some reasonable assumptions as follows. We define the set M = {m^{P1}, m^{P2}, ..., m^{P_{(nT)^dX}}} as the parameters to bound the size of the regions of nodes in the context sub-hypercube Pt, where all the elements in M satisfy m^{Pt} ∈ (0, 1). For simplicity, we take m as the maximum in M, which means m = max{m^{Pt} | m^{Pt} ∈ M}.

Assumption 2. For any region N^{Pt}_{h,i}, there exist constants θ ≥ 1, k1 and m such that (k1/θ)(m)^h ≤ diam(N^{Pt}_{h,i}) ≤ k1(m)^h.

With Assumption 2 we can bound the size of the regions with k1(m)^h, which accounts for the maximum possible variation of the reward over N^{Pt}_{h,i}. Due to the properties of the binary tree, the number of regions increases exponentially as the depth rises, so using the exponentially decreasing term k1(m)^h to bound the size of the regions is reasonable. We use the mean reward f(xi, cj) to handle the model. Based on the concept of region and reward, we denote the courses in N^{Pt}_{h,i} as c^{Pt}(h, i) at time t in the context sub-hypercube Pt. Since there are tremendous courses and it is nearly impossible to find two courses with equal reward, for each context sub-hypercube Pt there is only one overall optimal course, defined as c^{Pt∗} = arg max_{cj ∈ C} f(r^{Pt}_{cj}), and each region N^{Pt}_{h,i} has a local optimal course, defined as c^{Pt∗}(h, i) = arg max_{cj ∈ N^{Pt}_{h,i}} f(r^{Pt}_{cj}), where we let f(•) be the mean value, i.e., f(r_{xi,cj}) = E[f(xi, cj) + εt] = f(xi, cj), and r^{Pt}_{cj} means r_{xt,cj} in Pt.

D. The Regret of Learning Algorithm

Simply, the regret R(T) indicates the loss of reward in the recommending procedure due to the unknown dynamics. As for our tree model, the regret R(T) is based on the regions of the selected tree nodes N^{Pt}_{h,i}. In other words, the regret R(T) is calculated by the accumulated reward difference between the recommended courses ct and the optimal course c^{Pt∗} with context xt over reward in the context sub-hypercube Pt at time t; thus we define the regret as

R(T) = Σ_{t=1}^{T} f(r^{Pt}_{c^{Pt∗}}) − E[ Σ_{t=1}^{T} r^{Pt}_{xt,ct}(t) ],  (1)

where r^{Pt}_{c^{Pt∗}} is the reward of the optimal course in Pt and r^{Pt}_{xt,ct} is the reward of course ct with context xt in Pt. Regret shows the convergence rate of the optimal recommended option. When the regret is sublinear, R(T) = O(T^γ) with 0 < γ < 1, the algorithm will finally converge to the best course for the student. In the following section we will propose our algorithms with sublinear regret.

IV. REFORMATIONAL HIERARCHICAL TREE

In this section we propose our main online learning algorithm to mine courses in MOOC big data.

[...] where k2 is a parameter used to control the exploration-exploitation tradeoff. And we define the Estimation as the estimated reward value of the node N^{Pt}_{h,i} based on the Bound,

E^{Pt}_{h,i}(t) = min{ B^{Pt}_{h,i}(t), max{ E^{Pt}_{h+1,2i−1}(t), E^{Pt}_{h+1,2i}(t) } }.  (3)

The role of E^{Pt}_{h,i}(t) is to put a tight, optimistic, high-probability upper bound on the reward over the region N^{Pt}_{h,i} of node N^{Pt}_{h,i} in the context sub-hypercube Pt at time t. It is obvious that for the leaf course nodes N^{Pt}_{h,i} we have E^{Pt}_{h,i}(t) = B^{Pt}_{h,i}(t), and for other nodes N^{Pt}_{h,i} we have E^{Pt}_{h,i}(t) ≤ B^{Pt}_{h,i}(t).
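To make the Estimation recursion in (3) concrete, here is a minimal Python sketch. It is an illustration under assumed data structures, not the paper's implementation: nodes are keyed by (depth, rank), `bound` holds the B-values (whose defining equation is not part of this excerpt), and E is computed bottom-up so that leaves get E = B while internal nodes get E = min{B, max of the children's E}.

```python
# Hypothetical sketch of the Estimation recursion in (3):
#   E[h,i] = min( B[h,i], max(E[h+1,2i-1], E[h+1,2i]) )
# Nodes are keyed by (depth h, rank i), with ranks 1..2**h as in the paper.

def estimation(bound, depth):
    """Compute E-values bottom-up from B-values.

    bound: dict mapping (h, i) -> B-value for every explored node.
    depth: maximum explored depth (leaves live at this depth).
    """
    E = {}
    for h in range(depth, -1, -1):
        for i in range(1, 2 ** h + 1):
            if (h, i) not in bound:
                continue
            left, right = (h + 1, 2 * i - 1), (h + 1, 2 * i)
            children = [E[c] for c in (left, right) if c in E]
            if children:  # internal node: tighten B by the children's E
                E[(h, i)] = min(bound[(h, i)], max(children))
            else:         # leaf node: E equals B
                E[(h, i)] = bound[(h, i)]
    return E
```

On a toy tree with a root and two leaves, this reproduces the two properties stated above: E equals B at the leaves, and E ≤ B at the internal nodes.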
Fig. 4: Algorithm Demonstration

[...] select the regions with higher Estimation value. Note that, based on (3), the parent nodes of the node with the highest Estimation value also have the highest Estimation value in their depth, which means that for all nodes N^{Pt}_{h,i} ∈ Ω^{Pt} we can get E^{Pt}_{h,i} = max{ E^{Pt}_{h,i'} | 1 ≤ i' ≤ 2^h }; thus Algorithm 2 can find the node with the highest Estimation value. After the new regions are chosen, they will be taken into the sets Γ^{Pt} and Ω^{Pt} for the next calculation.

In Algorithm 3, we define C(N^{Pt}_{h,i}) as the set of the node N^{Pt}_{h,i} and its descendants,

C(N^{Pt}_{h,i}) = N^{Pt}_{h,i} ∪ C(N^{Pt}_{h+1,2i−1}) ∪ C(N^{Pt}_{h+1,2i}).

Note that we only know a part of the courses in the nodes; uploading new courses into the cloud would not change the Estimation value and the Bound value (these two are irrelevant to the course number), thus the algorithm can hold the past path and explored nodes without recalculating the tree. Based on this feature, our model can handle a dynamically increasing dataset effectively. However, as for [29], the leaf node is one single course, which means that added courses will change the whole structure of the course tree.

B. Regret Analysis of RHT

According to the definition of regret in (1), all suboptimal courses which have been selected bring regret. We consider the regret in one sub-hypercube Pt and get the sum of it at last. Since the regret is the difference between the recommended courses and the best course over reward, we need to define the best course regions first. We define the best regions as N^{Pt}_{h,i*_h}, which contain the best course c^{Pt∗} at depth h with optimal rank i*_h in the context sub-hypercube Pt at time t. To illustrate the regret with regions better, we define the best path as ℓ^{Pt∗}_{h,i} = { N^{Pt}_{h',i*_{h'}} | c^{Pt∗} ∈ N^{Pt}_{h',i*_{h'}} for h' = 1, 2, ..., h }. The path is the aggregation of the optimal regions whose depth ranges from 1 to h. To represent the regret precisely, we need to define the minimum suboptimality gap, which indicates the dissimilarity DC^{Pt}(c^{Pt∗}(h, i), c^{Pt∗}) between the optimal course in that region and the overall optimal course c^{Pt∗}, to better describe the model.

Definition 2. The Minimum Suboptimality Gap is

DC^{Pt}(h,i) = f(r^{Pt}_{c^{Pt∗}}) − f(r^{Pt}_{c^{Pt∗}(h,i)}),

and the Context Gap is

DX^{Pt} = max{ DX(x^{Pt}_i, x^{Pt}_j) } = LX (√dX / nT)^α.

The minimum suboptimality gap of N^{Pt}_{h,i} is the expected reward difference between the overall optimal course and the best one in N^{Pt}_{h,i}, and the context gap is the difference between the original point and the center point in the context sub-hypercube Pt. As for the context gap, we take its upper bound max{DX(x^{Pt}_i, x^{Pt}_j)} to bound the regret.

Note that we call the regions in set φ^{Pt} optimal regions and those out of it suboptimal regions. Besides, we divide the set by depth h, which means φ^{Pt} = ∪_h φ^{Pt}_h, where φ^{Pt}_h denotes the regions at depth h which are in the set φ^{Pt}.

We defined the regret when one region is selected above. Since the algorithm chooses every region only once, we can bound the regret after we determine how many regions the algorithm has selected in the recommending process. Based on the definition of ℓ^{Pt∗}_{h,i} and Definition 2, we assume that the suboptimal regions are divorced from ℓ^{Pt∗}_{h,i} at depth k (in Fig. 4 the depth k = 2). Since we do not know within time T how many times the context sub-hypercube Pt has been selected, we use the context time T^{Pt} to represent the total times in Pt. The sum of T^{Pt} is the total time: Σ_{Pt} T^{Pt} = T.

To get the upper bound of the number of suboptimal regions, we introduce Lemma 1 and Lemma 2.

Fig. 5: Distributed Storage based on Binary Tree in Cloud

Lemma 1. Nodes N^{Pt}_{h,i} are suboptimal, and at depth k (1 ≤ k ≤ h − 1) the path is out of the best path. For any integer q, the expected times of the region N^{Pt}_{h,i} and its descendants in Pt are

E[T^{Pt}_{h,i}(T^{Pt})] ≤ q + Σ_{n=q+1}^{T^{Pt}} P[ { B^{Pt}_{h,i}(n) > f(r^{Pt}_{c^{Pt∗}}) and T^{Pt}_{h,i}(n) > q } or { B^{Pt}_{k,i*_k}(n) ≤ f(r^{Pt}_{c^{Pt∗}}) for k ∈ {q+1, ..., n−1} } ].

Proof: We assume that the path is out of the best at depth k. Since the selected path is out of the optimal path at depth k and the algorithm selects the regions with higher Estimation value, we know that E^{Pt}_{k,i*_k}(n) ≤ E^{Pt}_{k,i_k}(n), where the first Estimation value is for the best path region and the second one is for the region selected at depth k. According to (3), we know that E^{Pt}_{k,i_k}(n) ≤ E^{Pt}_{k+1,i_{k+1}}(n); then we get E^{Pt}_{k,i*_k}(n) ≤ E^{Pt}_{k,i_k}(n) ≤ E^{Pt}_{h,i}(n) ≤ B^{Pt}_{h,i}(n). We define { N^{Pt}_{h_t,i_t} ∈ C(N^{Pt}_{h,i}) } as the event that the algorithm passes from the root node by the node N^{Pt}_{h,i}. Obviously, we can get that { N^{Pt}_{h_t,i_t} ∈ C(N^{Pt}_{h,i}) } ⊂ { B^{Pt}_{h,i}(n) ≥ [...]

[...] times, so the probability is equal to 1, and the sum of them is equal to q. In the second term, since T^{Pt}_{h,i}(n) > q, the terms when n ≤ q are zero, and with the help of inequation (7) we can get the conclusion.

We determine the threshold of the selected times of the nodes in C(N^{Pt}_{h,i}) by Lemma 1. However, in Lemma 1 we decompose E[T^{Pt}_{h,i}] into a sum of events, which means we cannot get the upper bound of E[T^{Pt}_{h,i}] directly; thus we introduce Lemma 2 to bound E[T^{Pt}_{h,i}] with the deviation of contexts and courses based on Lemma 1.

Lemma 2. For the suboptimal regions N^{Pt}_{h,i}, if q satisfies

q ≥ 4 k2 ln T / ( DC^{Pt}(h,i) − k1(m)^h − LX (√dX / nT)^α )²,  (8)

then for all T^{Pt} ≥ 1, the expected times that node N^{Pt}_{h,i} has been selected are

E[T^{Pt}_{h,i}(T^{Pt})] ≤ 4 k2 ln T / ( k1(m)^h + LX (√dX / nT)^α )² + M,  (9)
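As an illustration only, the right-hand side of the bound in (9) can be evaluated numerically to see how the expected selection count of a suboptimal node behaves as its depth h grows. All constants below are assumed placeholder values, not values taken from the paper:

```python
import math

# Illustrative evaluation of the right-hand side of (9):
#   E[T_{h,i}(T^{Pt})] <= 4*k2*ln(T) / (k1*m**h + L_X*(sqrt(d_X)/n_T)**alpha)**2 + M
# All constants below are assumed placeholders, not values fitted by the paper.

def selection_bound(h, T, k1=1.0, k2=1.0, m=0.5, L_X=1.0,
                    d_X=4, n_T=10, alpha=1.0, M=1.0):
    """Upper bound (9) on the expected number of times a suboptimal
    node at depth h is selected, for a horizon of T rounds."""
    region_gap = k1 * m ** h                              # course-region size k1*(m)^h
    context_gap = L_X * (math.sqrt(d_X) / n_T) ** alpha   # context gap of Definition 2
    return 4 * k2 * math.log(T) / (region_gap + context_gap) ** 2 + M

# Deeper suboptimal regions have smaller course gaps k1*(m)^h, so the
# denominator shrinks and the admissible selection count grows with h:
bounds = [selection_bound(h, T=10**6) for h in range(1, 6)]
```

This matches the qualitative reading of (9): smaller (deeper) regions are harder to distinguish from the optimum and may be pulled more often before being discarded.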
[...] ball the same as the course regions as 2; thus we can illustrate all the courses by dots in the black square (plane). We use red dots to denote the courses in the optimal regions and black dots to denote the courses in the suboptimal regions at depth h. As shown, we could use a number of packing balls to cover all the courses in the course regions, which means the number of optimal regions at depth h can be bounded by the number of packing balls with the constants K0 and θ.

With Assumption 4, we introduce Lemma 3 to bound the number of optimal regions at depth h with the number of packing balls.

Lemma 3. In the same context sub-hypercube Pt, the number of the 2[k1(m)^h + LX(√dX/nT)^α]-optimal regions can be bounded as

|φ^{Pt}_h| ≤ K [ k1(m)^h + LX (√dX / nT)^α ]^{−dC}.  (12)

Proof: From Assumption 2 we can bound the region with diam(N^{Pt}_{h,i}) ≥ (k1/θ)(m)^h. As for the context deviation, we still use the bound LX(√dX/nT)^α. Since the course number is so huge that we cannot know the data exactly, the dimension of the course cannot be determined. There exists a constant d′ such that

|φ^{Pt}_h| ≤ κ_h [ (k1/θ)(m)^h + LX (√dX / nT)^α ]^{−d′} ≤ K0 [ (k1/θ)(m)^h + LX (√dX / nT)^α ]^{−d′}.  (13)

Obviously, we know that θ ≥ 1, which means we can simplify K0[(k1/θ)(m)^h + LX(√dX/nT)^α]^{−d′} further:

K0 [ (k1/θ)(m)^h + LX (√dX / nT)^α ]^{−d′} ≤ K0 [ (k1/θ)(m)^h + (LX/θ)(√dX / nT)^α ]^{−d′} = K0 θ^{d′} [ k1(m)^h + LX (√dX / nT)^α ]^{−d′}.

Then we take K = K0 θ^{d′} to get the conclusion. The | • | represents the number of elements in the set, and we take the minimal d′ as the dimension of the course, dC.

Since we have bounded the number of suboptimal regions and optimal regions, we can bound the regret with the conclusions attained above. For simplicity, we divide the regret into three parts according to Γ^{Pt} = Γ^{Pt}_1 ∪ Γ^{Pt}_2 ∪ Γ^{Pt}_3, where E[Ri(T)] is the expected regret of the set Γ^{Pt}_i (i = 1, 2, 3). Then, we can get

E[R(T)] = E[R1(T)] + E[R2(T)] + E[R3(T)],  (14)

where Γ^{Pt}_1 contains the descendants of φ^{Pt}_H (H is a constant depth to be determined later), Γ^{Pt}_2 contains the regions φ^{Pt}_h with depth from 1 to H, and Γ^{Pt}_3 contains the descendants of the regions in (φ^{Pt}_h)^c (0 ≤ h ≤ H). Note that the top regions in Γ^{Pt}_3 are the children of the regions in φ^{Pt}_H.

Due to the fact that Σ T^{Pt} = T, when all the contexts xt are in the same context sub-hypercube Pt, the regret is the smallest. And we consider the situation that time T is distributed uniformly. Under this condition each context sub-hypercube has the least training data, so the sum of deviation towards courses is the largest. In this extreme situation, all the context sub-hypercubes have the same times T^{Pt}. After we know the regret in selecting one region, the times a region has been selected and the number of chosen regions, we can bound the whole regret in Theorem 1.

Theorem 1. From the lemmas above, the regret of RHT is

E[R(T)] = O( LX^{dX/(dX+α(dC+3))} T^{(dX+α(dC+2))/(dX+α(dC+3))} (ln T)^{α/(dX+α(dC+3))} ).

Proof: We bound the regret with (14). For E[R1(T)], the regret is generated from the optimal course regions whose courses have been recommended. We use the maximum times T^{Pt} to bound the number of optimal regions in Γ^{Pt}_1. Since all the regions in Γ^{Pt}_1 are optimal, from Assumption 3, if we take cj as the worst course in region N^{Pt}_{h,i}, which has the lowest mean reward, and ck = c^{Pt∗}(h, i), then we can bound the regret of these nodes as

E[R1(T)] ≤ Σ_{Pt} 4[ k1(m)^H + LX (√dX / nT)^α ] T^{Pt} = 4[ k1(m)^H + LX (√dX / nT)^α ] T.  (15)

As for the second term, whose depth is from 1 to H, with Lemma 3 and the fact that each region in Γ^{Pt}_2 is played at most once, we can get

E[R2(T)] ≤ Σ_{Pt} Σ_{h=1}^{H} 4[ k1(m)^h + LX (√dX / nT)^α ] |φ^{Pt}_h| ≤ ( 4K(nT)^{dX} / [k1(m)^H]^{dC} ) Σ_{h=0}^{H} [ k1(m)^h + LX (√dX / nT)^α ].

From Lemma 3 we know that the number of optimal regions at depth h is |φ^{Pt}_h| ≤ K[k1(m)^h + LX(√dX/nT)^α]^{−dC}, and the number of the context sub-hypercubes is (nT)^{dX}. Thus the last inequation can be derived.

When it comes to the last term, we notice that the top regions in Γ^{Pt}_3 are the child regions of the regions in Γ^{Pt}_2, since all the regions in Γ^{Pt}_2 are the parent regions of the suboptimal regions. And as for the upper bound of the course node k1(m)^h, the region of a child node is smaller than that of its parent node, which means that as the depth increases, the course gap becomes smaller than before. Hence we can get that the number of top regions in Γ^{Pt}_3 is less than twice that of Γ^{Pt}_2. Due to the fact that the child nodes have smaller diam than their parent nodes, we find that the course deviation of a suboptimal region N^{Pt}_{h,i} can be bounded as 4[k1(m)^{h−1} + LX(√dX/nT)^α]. And the regret bound is

E[R3(T)] ≤ Σ_{Pt} Σ_{h=1}^{H} 4[ k1(m)^{h−1} + LX (√dX / nT)^α ] Σ_{N^{Pt}_{h,i} ∈ Γ^{Pt}_3} T^{Pt}_{h,i}(T^{Pt})
 ≤ Σ_h { 32 k2 K (nT)^{dX} ln T / ( [k1(m)^h]^{dC+1} [ k1(m)^h + LX (√dX / nT)^α ] ) + 8 M K (nT)^{dX} [ k1(m)^h + LX (√dX / nT)^α ] / ( m [k1(m)^h]^{dC} ) }.

Note that the bound of E[R2(T)] is an infinitesimal of higher order than the bound of E[R3(T)] mathematically; thus we focus more on the first term and the last term, since the decisive factors of the regret are the first one and the last one. We notice that as the depth increases, E[R1(T)] decreases but E[R3(T)] increases. When we let these two terms be equal, we can get the regret as follows. E[R1(T)] is bounded by

O{ 4[ k1(m)^H + LX (√dX / nT)^α ] T }.  (16)

As for E[R3(T)], we notice that the constant M is the
infinitesimal of higher order of 4 k2 ln T / ( k1(m)^h + LX (√dX / nT)^α )², which means we can ignore the influence of the constant M. Therefore, the bound of E[R3(T)] is determined by the first term, and it can be shown as

O( Σ_h 32 k2 K (nT)^{dX} ln T / ( [k1(m)^h]^{dC+1} [ k1(m)^h + LX (√dX / nT)^α ] ) ) = O( ln T (nT)^{dX} / [k1(m)^H]^{dC+2} ).  (17)

As for a context sub-hypercube Pt, all the regions which have been played bring two kinds of regret: the regret contributed by the context gap LX(√dX/nT)^α and the regret contributed by the course region gap k1(m)^H. To optimize the upper bound of the regret, we take k1(m)^H = LX(√dX/nT)^α. Under that condition we let O(E[R1(T)]) = O(E[R3(T)]) to get

ln T (nT)^{dX} / [k1(m)^H]^{dC+2} = k1(m)^H T,  (18)

where nT = (T / ln T)^{α/(dX+α(dC+3))}. For simplicity, we use γ = (dX+α(dC+2))/(dX+α(dC+3)), and we use the constant M2 to denote the E[R2(T)] in E[R3(T)]. Then we can get the regret as

E[R(T)] = 8 dX^{α[2dX+α(dC+3)] / (2[dX+α(dC+3)])} LX^{dX/(dX+α(dC+3))} T^γ (ln T)^{1−γ} + 32 k2 K M2 (dX)^{α(dC+2)(γ−1)/2} (LX)^{(dC+3)γ−(dC+2)} T^γ (ln T)^{1−γ}  (19)
 = O( LX^{dX/(dX+α(dC+3))} T^{(dX+α(dC+2))/(dX+α(dC+3))} (ln T)^{α/(dX+α(dC+3))} ).

Remark 1: From (19) we can make sure that lim_{T→∞} E[R(T)]/T = 0, which means the algorithm can finally find the optimal courses for the students. Note that the tree actually exists; we store the tree in the cloud during the recommending process. Since the dataset will be fairly large in the future, using the distributed storage method to solve storage problems is inescapable.

V. DISTRIBUTIVELY STORED COURSE TREE

A. Distributed Algorithm for Multiple Course Storage

In practice, there are many MOOC platforms, e.g., Coursera, edX, Udacity, and the course resources are stored in their respective databases. Thus course recommendation towards heterogeneous sources in the course cloud needs to be handled by a system that supports distributed-connected storage nodes, where the storage nodes are in the same cloud with different zones. In this section, we turn to present a new algorithm called Distributed Storage Reformational Hierarchical Trees (DSRHT), which can handle the heterogeneous sources of course datasets and improve the storage condition by mapping them into distributed units in the course cloud.

We denote the distributed storage units, whose number is d, as Z = {Z1, Z2, ..., Zd}, where Zi could be a MOOC learning platform. We bound the number of distributed units d by 2^{z−1} < d ≤ 2^z to fit the binary tree mode, where z is the depth of the tree and 2^z is the number of regions at that depth. Note that the number of distributed units is determined by the practical situation; thus in every context sub-hypercube Pt the number of elements in set Z is the same, d. Since d is not always equal to 2^z, we let the storage units whose regions are empty, Z∅ = {Z_{d+1}, Z_{d+2}, ..., Z_{2^z}}, be the virtual nodes, which means there is no course in those distributed units, {Zj = ∅ | j = d+1, d+2, ..., 2^z}, for any context sub-hypercube. Fig. 6 illustrates the condition when there are 3 storage platforms (Coursera, edX and Udacity). We get the number of distributed units as d = 3 and the depth as z = 2 (2^1 < 3 ≤ 2^2), with the set Z = {Z1, Z2, Z3} and the set Z∅ = {Z4}.

Fig. 6: Distributed Storage based on Binary Tree in Cloud

Algorithm 4 Distributed Course Recommendation Tree
Require: The constants k1 and m, the parameter of the storage unit z, the student's context xt and time T.
Auxiliary function: Exploration and Bound Updating
Initialization: For all context sub-hypercubes belonging to PT: Γ^{Pt} = {N^{Pt}_{z,1}, N^{Pt}_{z,2}, ..., N^{Pt}_{z,2^z}}, E^{Pt}_{z,i} = ∞ for i = 1, 2, ..., 2^z
1: for t = 1, 2, ..., T do
2:   for dt = 0, 1, 2, ..., dX do
3:     Find the context interval in the dt dimension
4:   end for
5:   Get the context sub-hypercube Pt
6:   xt ← center point of Pt
7:   for j = 1, 2, ..., 2^z − 1 do
8:     if N^{Pt}_{z,j} < N^{Pt}_{z,j+1} then
9:       N^{Pt}_{z,j} = N^{Pt}_{z,j+1}
10:    end if
11:  end for
12:  N^{Pt}_{h,i} ← N^{Pt}_{z,j}, Ω^{Pt} ← N^{Pt}_{h,i}
13:  Same as Algorithm 1 from line 8 to line 19
14: end for

In Algorithm 4, we still find the context sub-hypercube first (lines 2-6). Then, since there are 2^z distributed units, we first identify these top regions (lines 7-12). Based on the attained information, the algorithm can start to find the course by utilizing the Bound and Estimation, the same as Algorithm 1 (line 13). For the virtual nodes, we set their Bound value as 0. As for the tree partition, the difference is that we leave out the course regions whose depth is less than z to cut down the storage cost. In the complexity section we will prove that the storage can be bounded sublinearly under the optimal condition.

B. Regret Analysis of DSRHT

In this subsection we prove that the regret of DSRHT can be bounded sublinearly. Now, again, we divide the regions in
contrast to get the regret upper bound separately by Γ^{Pt} = Γ^{Pt}_1 + Γ^{Pt}_2 + Γ^{Pt}_3 + Γ^{Pt}_4, where E[Ri(T)] is the expected regret of the set Γ^{Pt}_i (i = 1, 2, 3, 4). Γ^{Pt}_1 means the regions and their descendants in set φ^{Pt}_H whose depth is H (H > z); Γ^{Pt}_2 is the set whose regions are in set φ^{Pt}_h (z < h ≤ H); Γ^{Pt}_3 contains the regions and their descendants in set (φ^{Pt}_h)^c (z < h ≤ H); and as for Γ^{Pt}_4, they are the regions at depth z, which will each be selected twice based on Algorithm 1. The depth H (z < H) is a constant to be selected later.

Theorem 2. The regret of the distributively stored algorithm is

E[R(T)] = O( LX^{dX/(dX+α(dC+3))} T^{(dX+α(dC+2))/(dX+α(dC+3))} (ln T)^{α/(dX+α(dC+3))} ),

if the number of distributed units satisfies [...]

[...] algorithm, since it explores one region in one round, it is obvious that the space complexity is linear: E[S(T)] = O(T).

Theorem 3. In the optimal condition, we take the number of storage units satisfying 2^z = (T / ln T)^{(dX+αdC)/(dX+α(dC+3))}; then we can get the space complexity

E[S(T)] = O( T^{3α/(dX+α(dC+3))} ( T^{(dX+αdC)/(dX+α(dC+3))} − (ln T)^{(dX+αdC)/(dX+α(dC+3))} ) ).

Proof: Every round t has to explore a new leaf region. To get the optimal result, we suppose the depth is as deep as we can choose: z = ln( (T/ln T)^{(dX+αdC)/(dX+α(dC+3))} ) / ln 2. Under the condition that t < 2^{z+1}, we have S1(T) ≤ 2^z = (T/ln T)^{(dX+αdC)/(dX+α(dC+3))}; when the time t ≥ 2^{z+1}, after one round there is one unplayed
10
TABLE I: Theoretical Comparison
Including
A. Description of the Database Lessons Homework,Video
We take the database which contains feedback information content files and so on
and course details from the edX [27] and the intermedi-
ary website of MOOC [7]. In those platforms, the context Forums :
dimensions contain nationality, gender, age and the highest Getting
rewards
education level, therefore we take dX = 4. As for the course Video
dimensions, they comprise starting time, language, profes- window
sional level, provided school, course and program proportion,
whether it’s self-paced, subordinative subject etc. Thus we take
the course dimension as 10. For the feedback system, we can
Fig. 7: MOOC Learning Model
acquire reward information from review plates and forums.
Thoroughly, the reward is produced from two aspects, which
are the marking system and the comments from forums.
For the users, when a novel field comes into vogue, tremendous numbers of people will get access to it within seconds. The data we collected include 2 × 10^5 students using MOOC on those platforms, and the average number of courses a student comments on is around 30. Our algorithm focuses on the group of students in the same context sub-hypercube rather than on individuals; thus, when users later arrive with context information and historical records, we simply treat them as new training data without distinguishing them. However, while the number of users is limited, the number of courses is unlimited, even though generating a course is time-costing, and education runs through the whole development of humankind. Our algorithm pays more attention to the future, highly inflated MOOC curriculum resources, and the existing data bank is not large enough to demonstrate the superiority of our algorithm, since MOOC is a new field in education.

We find 11352 courses on those platforms, including plenty of finished courses. The number of courses doubles every year; based on this trend, the quantity will grow more than forty thousand times within 20 years. To give consideration to both accuracy and scale of the data sources, we copy the original sources forty-five thousand times to satisfy the number requirement. Thus we extend the 11352 courses to around 5 × 10^8 to simulate the explosive data size of courses in 2030.

B. Experimental Setup

For our algorithm, the final number of training data is over 6 × 10^6 and the number of courses is about 5 × 10^8. Note that we focus more on the comparison than on showing the superiority of our algorithms in isolation, thus we take the statistical course data to better illustrate the comparison. The compared works are introduced as follows.

• Adaptive Clustering Recommendation algorithm (ACR) [29]: the algorithm injects contextual factors and can adapt to more students; however, when the course database is fairly large, the ergodic process in this model cannot handle the dataset well.
• High Confidence Tree algorithm (HCT) [30]: the algorithm supports a dataset of unlimited size, but there is effectively only one student in the recommendation model, since it does not take context into consideration.
• Our model considers both the scale of courses and the users' context, and thus better suits the future MOOC situation. In DSRHT we sacrifice some immediate interest to get better long-term performance.

To verify the conclusions practically, we divide the experiment into the following three steps.

1) Step 1: In this step we compare our RHT algorithm with the two previous works, ACR [29] and HCT [30], under different sizes of training data. We input over 6 × 10^6 training data, including the context information and feedback records in the reward space described in the database description, into the three models, and the models then start to recommend the courses stored in the cloud. Since HCT does not support context, we normalize all the context information to the same value (the center point of the unit context hypercube). Since the reward distribution is stochastic, we run the simulation 10 times and take average values, so that the interference of random factors is restrained. The two regret tendency diagrams are then plotted to evaluate the algorithms' performances.
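The data-scaling arithmetic in the database description (copying the 11352 courses forty-five thousand times to reach roughly 5 × 10^8) can be checked directly; this is plain arithmetic on the quoted numbers, nothing beyond them:

```python
# Scaling the crawled course set to the simulated 2030 size.
courses = 11_352
copies = 45_000                  # "copy the original sources forty-five thousand times"
total = courses * copies

print(total)                     # 510,840,000, i.e. about 5 x 10^8
assert 4.9e8 < total < 5.2e8
```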
Fig. 8: Comparison of Regret (RHT) — regret vs. number of arrival data (×10^6) for ACR, HCT and RHT.

Fig. 9: Comparison of Average Regret (RHT) — average regret vs. number of arrival data (×10^6) for ACR, HCT and RHT.

Fig. 10: Comparison of Regret with Different z — regret vs. number of arrival data (×10^6) for z = 0, z = 10 and z = 20.

Fig. 11: Comparison of Average Regret with Different z — average regret vs. number of arrival data (×10^6) for z = 0, z = 10 and z = 20.
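The regret curves in Figs. 8–11 plot cumulative regret against the number of arrivals, averaged over repeated runs. A minimal sketch of how such a curve is produced is shown below; it uses a simple epsilon-greedy stand-in with made-up reward means, purely for illustration — the paper's RHT/DSRHT algorithms are tree-based context algorithms, not epsilon-greedy:

```python
import random

def run_once(n_arrivals, means, explore=0.05, seed=0):
    """Cumulative regret of a simple epsilon-greedy recommender against
    the best fixed course (illustrative stand-in only)."""
    rng = random.Random(seed)
    best = max(means)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    regret, curve = 0.0, []
    for t in range(n_arrivals):
        if t < len(means):
            a = t                                  # try each course once first
        elif rng.random() < explore:
            a = rng.randrange(len(means))          # explore
        else:                                      # exploit best empirical mean
            a = max(range(len(means)), key=lambda i: sums[i] / counts[i])
        counts[a] += 1
        sums[a] += means[a] + rng.uniform(-0.1, 0.1)  # stochastic reward
        regret += best - means[a]                  # expected regret of the choice
        curve.append(regret)
    return curve

# Average over 10 runs, as in Step 1, to restrain random factors.
curves = [run_once(5000, [0.5, 0.7, 0.9], seed=s) for s in range(10)]
avg_final = sum(c[-1] for c in curves) / len(curves)
print(avg_final)
```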
2) Step 2: We use the DSRHT algorithm to simulate the results. The RHT algorithm can be seen as a degraded DSRHT with z = 0, and we compare the DSRHT algorithm with different parameters z. Without loss of generality, we take z = 0, z = 10 and

$$ z = \frac{d_X + \alpha d_C}{d_X + \alpha(d_C + 3)} \cdot \frac{\ln(T/\ln T)}{\ln 2} \approx 20. $$

Then we plot the regret against z to analyze the optimal constant parameter.

3) Step 3: We record the storage data to analyze the space complexity of those four algorithms. First we upload 517.68 TB of course-indexing information to our university high-performance computing platform, whose GPU reaches 18.46 TFlops and whose SSD cache is 1.25 TB. Then we implement and perform the four algorithms successively. In the process of training, we record the regret six times, and at the end of training we record the space usage of the tree, which represents the training cost. As for the DSRHT, we use virtual partitions in the school servers to simulate the distributively stored course data. Specifically, we re-upload the course data to the school servers in 1024 virtual partitions, and then perform the DSRHT algorithm.

C. Results and Analysis

We analyze our algorithm from two different angles: comparing with the other two works, and comparing with itself under different parameters z. In each direction we compare the regret first and then analyze the average regret; we then discuss the accuracies based on the average regret. At last we compare the storage conditions of the different algorithms.

In Fig. 8 and Fig. 9 we compare the RHT algorithm with ACR and HCT.
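Before turning to the results, the Step-2 cutoff depth z can be reproduced numerically. The sketch below assumes α = 1 and takes T as the course-data size (about 5 × 10^8); both readings are our assumptions about how the formula is instantiated:

```python
import math

# Parameters: d_X = 4 context dimensions, d_C = 10 course dimensions
# (from the database description); alpha = 1 and T ~ 5e8 are assumed.
d_X, d_C, alpha, T = 4, 10, 1.0, 5e8

z = (d_X + alpha * d_C) / (d_X + alpha * (d_C + 3)) \
    * math.log(T / math.log(T)) / math.log(2)

print(round(z))  # -> 20
```

With these values the formula indeed lands near the z ≈ 20 used in Step 2.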
TABLE II: Average Accuracies of RHT

Num (×10^6)   ACR [29]   HCT [30]   RHT
1             65.43%     81.02%     85.34%
2             78.62%     82.13%     87.62%
3             83.23%     82.76%     89.92%
4             86.28%     83.01%     90.45%
5             88.19%     83.22%     91.09%
6             88.79%     83.98%     91.87%

TABLE III: Average Accuracies of DSRHT (DS factor z)

Num (×10^6)   z = 0      z = 10     z = 20
1             85.34%     82.67%     51.10%
2             87.62%     86.98%     72.94%
3             89.92%     90.49%     81.37%
4             90.45%     91.50%     85.79%
5             91.09%     92.03%     88.33%
6             91.87%     92.89%     89.04%
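The trends discussed in the results analysis can be checked mechanically against the table entries; the lists below are simply the rows of Tables II and III:

```python
# Accuracies (%) per number of training data (Num x 10^6 = 1..6).
acr = [65.43, 78.62, 83.23, 86.28, 88.19, 88.79]
hct = [81.02, 82.13, 82.76, 83.01, 83.22, 83.98]
rht = [85.34, 87.62, 89.92, 90.45, 91.09, 91.87]   # = DSRHT with z = 0
z10 = [82.67, 86.98, 90.49, 91.50, 92.03, 92.89]
z20 = [51.10, 72.94, 81.37, 85.79, 88.33, 89.04]

# Every algorithm improves as the training data grows.
for col in (acr, hct, rht, z10, z20):
    assert all(a < b for a, b in zip(col, col[1:]))

# RHT has the highest accuracy of the three compared works at every row.
assert all(r > max(a, h) for a, h, r in zip(acr, hct, rht))

# z = 10 overtakes z = 0 from the third row on; z = 20 stays lowest throughout.
assert all(z10[i] > rht[i] for i in range(2, 6))
assert all(z20[i] < min(rht[i], z10[i]) for i in range(6))
```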
TABLE IV: Average Storage Cost

                    ACR [29]   HCT [30]   RHT     DSRHT (z = 10)
Storage Cost (TB)   12573      2762       4123    2132
Storage Ratio       24.287     5.335      7.964   4.118

... big data. Experimental results verify the superior performance of RHT and DSRHT compared with existing related algorithms.
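The storage ratios in Table IV are the storage costs divided by the 517.68 TB of course storage described in Step 3, which can be verified directly:

```python
course_storage_tb = 517.68                    # total course data uploaded in Step 3
cost_tb = {"ACR": 12573, "HCT": 2762, "RHT": 4123, "DSRHT": 2132}

ratios = {k: round(v / course_storage_tb, 3) for k, v in cost_tb.items()}
print(ratios)
# matches Table IV: ACR 24.287, HCT 5.335, RHT 7.964, DSRHT 4.118
```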
From Fig. 8 (the regret diagram) we can see that our method is better than the other two, having lower regret from the beginning. The HCT algorithm performs better than ACR when it starts; as time goes on, ACR's regret comes to be lower than HCT's. From Fig. 9 (the average-regret diagram), HCT's average regret is less than that of ACR at first, and the results also show that ACR finally performs slightly better than HCT.

Table II records the average accuracies, i.e. the total reward divided by the number of training data (denoted by "Num"). We find that as time increases, the performance of all three algorithms improves. Our algorithm has the highest accuracy throughout the learning period. ACR does not perform well when the process starts, with an accuracy of 65.43%, worse than that of HCT; finally ACR converges to 88.79% while HCT is still at 83.98%. Our algorithm reaches 91.87%, which is much better than HCT.

Fig. 10 and Fig. 11 analyze the DSRHT algorithm with different values of z. The z = 20 setting is the worst of the three conditions at the start; after that, it comes close to catching the RHT (91.09%) with 88.33%. Thus we can see that, in selecting the distributed-storage number, one cannot pursue quantity only; whether it makes sense in practice matters as well.

As for the storage analysis, we use the detailed information of courses to represent the course data, and the whole course storage is 517.68 TB. To get more intuition, we use the ratio of the actual space occupied to the course space occupied as the storage ratio. From Table IV we know that the ACR [29] algorithm is not suitable for real big data, since its storage ratio reaches 24.287. The HCT [30] algorithm performs well in space complexity, better than RHT. As for DSRHT, the storage ratio is 4.118, which is less than HCT and nearly half of RHT.

APPENDIX A
PROOF OF LEMMA 2

Proof: For the first term in Lemma 1, we take $c_j, c_k \in N^{P_t}_{h,i}$ with $c_k = c^{P_t*}$ for all contexts $x_i \in \mathcal{X}$; then we can get

$$ f(r^{P_t}_{c^{P_t*}}) - f(r^{P_t}_{c_j}) \le \operatorname{diam}(N^{P_t}_{h,i}) + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \le k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}, \tag{24} $$

where $c^{P_t*}$ is the best course, whose reward is highest in the context sub-hypercube $P_t$. We denote the event that the path goes through the region $N^{P_t}_{h,i}$ by $\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\}$; therefore,

$$ \begin{aligned}
&P\Big\{ B^{P_t}_{h,i}(T^{P_t}) \le f(r^{P_t}_{c^{P_t*}}) \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\} \\
&= P\Big\{ \hat{\mu}^{P_t}_{h,i}(T^{P_t}) + \sqrt{k_2 \ln T / T^{P_t}_{h,i}(T^{P_t})} + k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \le f(r^{P_t}_{c^{P_t*}}) \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\} \\
&= P\Big\{ \big[\hat{\mu}^{P_t}_{h,i}(T^{P_t}) + k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} - f(r^{P_t}_{c^{P_t*}})\big]\, T^{P_t}_{h,i}(T^{P_t}) \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(T^{P_t})} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\} \\
&\le P\Big\{ \sum_{n=1}^{T^{P_t}_{h,i}(T^{P_t})} \big[ r^{P_t}_{c_n}(n) - f(r^{P_t}_{c_n}) \big] \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(T^{P_t})} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\}.
\end{aligned} $$

The last inequation is based on expression (24): since the second term is positive, we drop it to get the last expression.

For the convenience of illustration, we pick the $n$ at which $I\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\}$ equals 1, and we use $\tilde{r}^{P_t}_{c}$ to indicate the $r^{P_t}_{c_n}$ observed when $I\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\}$ occurs. Thus,

$$ \begin{aligned}
&P\Big\{ \sum_{n=1}^{T^{P_t}} \big[ r^{P_t}_{c_n}(n) - f(r^{P_t}_{c_n}) \big] I\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\} \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(T^{P_t})} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\} \\
&\le P\Big\{ \sum_{n=1}^{T^{P_t}} \big[ \tilde{r}^{P_t}_{c_n}(n) - f(\tilde{r}^{P_t}_{c_n}) \big] \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(T^{P_t})} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge 1 \Big\}.
\end{aligned} $$
We consider the situation when $n = 1, 2, \dots, T^{P_t}_{h,i}(T^{P_t})$, together with the fact that $T^{P_t}_{h,i}(T^{P_t}) \le T^{P_t}$. Besides, the last inequation uses the union bound and loosens the threshold:

$$ \begin{aligned}
&\sum_{n=1}^{T^{P_t}} P\Big\{ \sum_{j=1}^{n} \big[ f(\tilde{r}^{P_t}_{c}) - f(\tilde{r}^{P_t}_{c_j}) \big] \le -\sqrt{k_2(\ln T)\,n} \ \text{ or } \ B^{h'}_{j,i}(n) \le f(r^{P_t}_{c^{P_t*}}) \text{ for } j \in \{q+1, \dots, n-1\} \Big\} \\
&\le \sum_{n=1}^{T^{P_t}} \exp(-2k_2 \ln T) \le (T^{P_t})^{-2k_2+1}.
\end{aligned} \tag{25} $$

Note that the sum of time $T$ here represents the contextual sum of time, since the number of courses in the context sub-hypercube is stochastic; for convenience we use $T$ as the sum of time. With the help of the Hoeffding–Azuma inequality [26] we get the conclusion, taking the constant $k_2 \ge 1$.

With the help of the assumption on the range of $q$, we can get

$$ \frac{D_{C(h,i)} - k_1(m)^h - L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}}{2} \ge \sqrt{\frac{k_2 \ln T}{q}}. \tag{26} $$

Thus,

$$ \begin{aligned}
&P\Big\{ B^{P_t}_{h,i}(T^{P_t}) > f(r^{P_t}_{c^{P_t*}}) \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge q \Big\} \\
&= P\Big\{ \hat{\mu}^{P_t}_{h,i}(T^{P_t}) + \sqrt{k_2 \ln T / T^{P_t}_{h,i}(T^{P_t})} + k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} > f(r^{P_t}_{c^{P_t*}(h,i)}) + D_{C(h,i)} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge q \Big\} \\
&\le P\Big\{ \hat{\mu}^{P_t}_{h,i}(T^{P_t}) + \sqrt{k_2 \ln T / q} + k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} > f(r^{P_t}_{c^{P_t*}(h,i)}) + D_{C(h,i)} \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge q \Big\} \\
&= P\Big\{ \hat{\mu}^{P_t}_{h,i}(T^{P_t}) - f(r^{P_t}_{c^{P_t*}(h,i)}) > \tfrac{1}{2}\big[D_{C(h,i)} - k_1(m)^h - L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big] \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge q \Big\}.
\end{aligned} $$

When we multiply both sides by $T^{P_t}_{h,i}(T^{P_t})$, we can get the inequations below. With the union bound and the Hoeffding–Azuma inequality [26], we can get

$$ P\Big\{ \sum_{n=1}^{T^{P_t}} \big[ r^{P_t}_{n}(n) - f(r^{P_t}_{c_n}) \big] I\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\} > \tfrac{1}{2}\big[D_{C(h,i)} - k_1(m)^h - L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]\, T^{P_t}_{h,i}(T^{P_t}) \text{ and } T^{P_t}_{h,i}(T^{P_t}) \ge q \Big\} \le (T^{P_t})^{-2k_2+1}. $$

According to Lemma 1 and the prerequisite in Lemma 2, we select the upper bound of $q$ as
$\dfrac{4k_2 \ln T}{\big[D_{C(h,i)} - k_1(m)^h - L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]^2} + 1$. Thus,

$$ \begin{aligned}
E\big[T^{P_t}_{h,i}(T^{P_t})\big] &\le \sum_{n=q+1}^{T^{P_t}} P\Big\{ B^{P_t}_{h,i}(n) > f(r^{P_t}_{c^{P_t*}}) \text{ and } T^{P_t}_{h,i}(n) > q \Big\} \\
&\le \frac{4k_2 \ln T}{\big[D_{C(h,i)} - k_1(m)^h - L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]^2} + 1 + \sum_{n=q+1}^{T^{P_t}} \big[(T^{P_t})^{-2k_2+1} + n^{-2k_2+2}\big],
\end{aligned} $$

and since

$$ 1 + \sum_{n=q+1}^{T^{P_t}} \big[(T^{P_t})^{-2k_2+1} + n^{-2k_2+2}\big] \le 4 \le M, \tag{27} $$

we can get the conclusion of Lemma 2.

APPENDIX B
PROOF OF THEOREM 2

Proof: Based on the segmentation, the regret can be presented as

$$ E[R(T)] = E[R_1(T)] + E[R_2(T)] + E[R_3(T)] + E[R_4(T)]. $$

For $E[R_1(T)]$, since it is the same as in Algorithm 1, we can get the first term as

$$ E[R_1(T)] \le 4\big[k_1(m)^H + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]\sqrt{T}. \tag{28} $$

The depth ranges from $z$ to $H$, revealing that $H > z$. To satisfy this, we suppose $2^H \ge (T/\ln T)^{\frac{d_X+\alpha d_C}{d_X+\alpha(d_C+2)}}$. Since the exploration process starts from depth $z$, the depths we can select satisfy the inequation above. Thus the second term's regret bound is

$$ \begin{aligned}
E[R_2(T)] &\le \sum_{h=z}^{H} 4\big[k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]\, \varphi^{P_t}_h \\
&\le \sum_{h=z}^{H} \frac{4K(n_T)^{d_X}}{[k_1(m)^h]^{d_C}}\, 4\big[k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big].
\end{aligned} \tag{29} $$

We choose the context sub-hypercube whose regret bound is biggest to continue inequation (29). As for the third term, the regret bound is

$$ \begin{aligned}
E[R_3(T)] &\le \sum_{h=z}^{H} \sum_{N^{P_t}_{h,i} \in \Gamma^{P_t}_3} 4\big[k_1(m)^{h-1} + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]\, (\varphi^{P_t}_h)^c \\
&\le \frac{32k_2 K(n_T)^{d_X} \ln T}{[k_1(m)^h]^{d_C+1} \big[k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]} + \frac{8M K(n_T)^{d_X} \big[k_1(m)^h + L_X\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\big]}{m[k_1(m)^h]^{d_C}}.
\end{aligned} $$

From the upper bounds of $E[R_1(T)]$, $E[R_2(T)]$ and $E[R_3(T)]$, we can see that these three upper bounds are the same as for the algorithm RHT.

REFERENCES

[1] L. Pappano, “The Year of the MOOC,” The New York Times, 2014.
[2] T. Lewin, “Universities Abroad Join Partnerships on the Web,” New York
Times, 2013.
[3] Coursera, https://www.coursera.org/.
[4] A. Brown, “MOOCs make their move,” The Bent, vol. 104, no. 2, pp.
13-17, 2013.
[5] D. Glance, “Universities are still standing. The MOOC revolution that
never happened,” The Conversation, www.theconversation.com/au, July
15, 2014a.
[6] M. Hilbert, “Big data for development: a review of promises and
challenges,” Development Policy Review, vol. 34, no. 1, pp. 135-174, 2016.
[7] Guoke MOOC, http://mooc.guokr.com/.
[8] G. Paquette, A. Miara, “Managing open educational resources on the
web of data,” International Journal of Advanced Computer Science and
Applications (IJACSA), vol. 5, no. 8, 2014.
[9] G. Paquette, O. Mario, D. Rogozan, M. Lonard, “Competency-based per-
sonalization for Massive Online Learning,” Smart Learning Environments,
vol. 2, no. 1, pp. 1-19, 2015.
[10] C. G. Brinton, M. Chiang, “MOOC performance prediction via click-
stream data and social learning networks,” IEEE Conference on Computer
Communications (INFOCOM), pp. 2299-2307, 2015.
[11] S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari, “X-armed bandits,”
Journal of Machine Learning Research, pp. 1655-1695, 2011.
[12] G. Adomavicius, A. Tuzhilin, “Toward the next generation of recom-
mender systems: a survey of the state-of-the-art and possible extensions,”
IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6,
pp. 734-749, 2005.
[13] D. Yanhui, W. Dequan, Z. Yongxin, et al. “A group recommender
system for online course study,” International Conference on Information
Technology in Medicine and Education, pp. 318-320, 2015.
[14] M. J. Pazzani, D. Billsus, “Content-based recommendation over a cus-
tomer network for ubiquitous shopping,” IEEE Transactions on Services
Computing, vol. 2, no. 2, pp. 140-151, 2009.
[15] R. Burke, “Hybrid recommender systems: Survey and experiments,”
User Modeling and User-adapted Interaction, vol. 12, no. 4, pp. 325-
341, 2007.
[16] K. Yoshii, M. Goto, K. Komatani, T. Ogata, H. G. Okuno, “An efficient
hybrid music recommender system using an incrementally trainable
probabilistic generative model,” IEEE Transactions on Audio, Speech,
Language Processing, vol. 16, no. 2, pp. 435-447, 2008.
[17] L. Yanhong, Z. Bo, G. Jianhou, “Make adaptive learning of the MOOC:
The CML model,” International Conference on Computer Science and
Education (ICCSE), pp. 1001-1004, 2015.
[18] A. Alzaghoul, E. Tovar, “A proposed framework for an adaptive learning
of Massive Open Online Courses (MOOCs),” International Conference
on Remote Engineering and Virtual Instrumentation, pp. 127-132, 2016.
[19] C. Cherkaoui, A. Qazdar, A. Battou, A. Mezouary, A. Bakki, D.
Mamass, A. Qazdar, B. Er-Raha, “A model of adaptation in online
learning environments (LMSs and MOOCs),” International Conference
on Intelligent Systems: Theories and Applications (SITA), 2015, pp. 1-6.
[20] E. Hazan, N. Megiddo, “Online learning with prior knowledge,” Inter-
national Conference on Computational Learning Theory, Springer Berlin
Heidelberg, pp. 499-513, 2007.
[21] A. Slivkins, “Contextual bandits with similarity information,” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 2533-2568, 2014.
[22] J. Langford, T. Zhang, “The epoch-greedy algorithm for multi-armed
bandits with side information,” Advances in neural information processing
systems, pp. 817-842, 2008.
[23] W. Chu, L. Li, L. Reyzin, R. E. Schapire, “Contextual bandits with
linear payoff functions,” AISTATS, vol. 15, pp. 208-214, 2011.
[24] T. Lu, D. Pál, M. Pál, “Contextual multi-armed bandits,” International
Conference on Artificial Intelligence and Statistics (AISTATS), pp. 485-
492, 2010.
[25] C. Tekin, M. van der Schaar, “Distributed online big data classification
using context information,” IEEE Annual Allerton Conference: Commu-
nication, Control, and Computing, pp. 1435-1442, 2013.
[26] W. Hoeffding, “Probability inequalities for sums of bounded random
variables,” Journal of the American Statistical Association, vol. 58, no.
301, pp. 13-30, 1963.
[27] edX, https://www.edx.org/.
[28] J. P. Berrut, L. N. Trefethen, “Barycentric lagrange interpolation,” Siam
Review, vol. 46, no. 3, pp. 501-517, 2004.
[29] L. Song, C. Tekin, M. van der Schaar, “Online learning in large-
scale contextual recommender systems,” IEEE Transactions on Services
Computing, vol. 9, no. 3, pp. 433-445, 2014.
[30] M. G. Azar, A. Lazaric, E. Brunskill, “Online Stochastic Optimization
under Correlated Bandit Feedback,” Proc. Int. Conf. on Machine Learning
(ICML), Beijing, pp. 1557-1565, 2014.