A modified attention mechanism powered by Bayesian Network for user activity analysis and prediction
1. Introduction
Machine Learning (ML) models have been used successfully in many situations where the input data are large and stable.
Researchers have also examined how ML models can succeed regardless of the size of the data [1,2].
The main purpose of this work is to improve the performance of the forecasting component of the solution presented in our earlier
work [3] in the presence of relatively small and irregular datasets. The solution in [3] consisted of a smart classifier powered
by a hybrid algorithm that combines a Hidden Markov Model (HMM) with a Long Short-Term Memory (LSTM) neural network,
with the purpose of providing consistent performance when detecting patterns in the click-stream data generated by users
interacting with a Learning Management System (LMS). Click-stream data refers to a detailed log of how participants navigate
through an online platform during a working session. The overall idea in [3], and also in this work, can be described as follows:
the click-stream data are gathered from a collecting system (the LMS in this case) and processed dynamically by the hybrid smart
classifier, which generates alert flags (i.e., positive, negative, or just a warning). The flags activate an alert assistant that provides
informed feedback to the student, or to an assistant who can then provide timely, meaningful tips and recommendations to the student.
This process is sketched in Fig. 1. This work focuses on the algorithm that could potentially power the alert assistant, although the
assistant itself is not developed here.
This classifier is meant to work during a live session on an LMS that is actively collecting data from the student–LMS interaction (or
user–LMS interaction in general), and its purpose is to assess the student's interaction behavior in quasi-real time in order to provide
timely, useful assistance. The main challenges in such scenarios are the highly erratic incoming data and, on the other hand, the scarcity
of data in the early stages of the interaction. Both issues negatively impact the ability of any forecasting method to deliver reliable
outputs.
Finding patterns in data generated by humans often represents a challenge. In [4], the information available to the user of
a search engine is encoded into a state vector, and distributed representations are used to propose an alternative click model. In
their approach, the concepts that are useful for modeling user behavior form the components of the state vector. The state vector
is initialized with a query and then iteratively updated with the search engine results, using information on interactions.
Their method is particularly interesting because of its innovative feature engineering, which provides a very handy tool.
Their approach partly inspired our work, in particular the feature engineering we carry out using a Bayesian
Network (BN) [5–7].
Click-stream data have been used in a range of ways, including recommendations for online shopping, movie suggestions, and others.
Using them for feature engineering can be seen as one of the most efficient ways to take advantage of them. Markov-chain-based
methods are extensively applied with very good results [6,8–14]. For example, in [11] the authors use a Hidden Markov Model
to model time series by applying transition and emission probabilities. Recently, in [15], a novel
Markov modulated marked point process (M3PP) model was developed with the goal of detecting, from click-stream data, users at risk
of exiting without a purchase. Their M3PP approach covers both the sequence of pages visited and the temporal dynamics
between them, i.e., the time spent on pages, and thus examines click-stream data in a comprehensive manner. These works also discuss
the challenges of Hidden Markov Models, such as model learning, evaluation, and parameter estimation, and present forward–backward
algorithms as solutions to these issues.
To improve prediction in cases where the data show high variation, this work focuses on introducing a measurable
score that captures some of the hidden dynamics of the data-generation process, in the hope of improving prediction
accuracy and enhancing the overall performance of the pattern-detection task. As a result, the main contribution of this work is the
introduction of a methodology to customize the attention mechanism of a Transformer model using probabilistic inferential methods,
in a manner that incorporates knowledge extracted from click-stream data containing both categorical and numerical variables.
In this case, a Bayesian Network (BN) was used as the probabilistic model of the click-stream data flow. The proposed
methodology could potentially be applied beyond the scope of the LMS case study presented in this research. The solution
presented is a model to process and classify patterns in multivariate time series with non-homogeneous data; in this work, we
refer to data containing both numerical and non-numerical variables as non-homogeneous data. This solution has the potential to
serve as a base for the development of a tool for an LMS. Another potential use of the hybrid model presented here is
to process multivariate time series generated by multi-sensor systems recording non-homogeneous data.
In summary, the problem addressed concerns how a hybrid solution combining a Bayesian Network (BN) with a
Transformer model can improve performance in categorical sequence prediction problems for stochastic data. The research
question that drives this stage of the work is:
RQ: How can data processing with a Bayesian Network (BN) enhance the performance of a Transformer-based sequence-to-sequence
forecasting engine?
This paper presents a review of related findings in the literature in Section 2. This is followed by a detailed description of
our model in Section 3, and by the construction of our baseline and alternative models, including the customization of the attention
mechanism, in Section 4. An assessment of the experimental results from the baseline and alternative models is presented in Section 5.
The paper concludes with overall insights and directions for future work in Section 6.
2. Related work
Regarding prediction problems for user behavior, and regardless of the specific area, the landscape of model performance is
quite diverse. As discussed above, Borisov et al. [4] present a click model based on user behavior which relies on what they
call ‘‘distributed representation’’; essentially, they represent each document by its associated click patterns that are observed in a
search engine and use an LSTM approach to predict user behavior.
In the recent work [16], a hybrid deep autoencoder framework is presented, which uses the autoencoder scheme for anomaly
detection. The model has a multi-channel convolutional neural network (CNN) [17] as the encoder and an LSTM network as
the decoder. One benefit of the proposed solution is that the framework can be applied to data with mixed-type
attributes. In addition, it generalizes well to multivariate datasets that can have a large number of features, which is an advantage
over proximity-based methods. Markov-chain methods remain a very robust approach. Mizera, Pang and Yuan examine a
two-state Markov chain approach; they show that splitting the state space into two disjoint sub-spaces and approximating steady-state
probabilities helps avoid the problem of generating biased results when biological systems are modeled using large probabilistic
Boolean networks (PBNs) [13]. In another approach, non-stationary data are transformed into time series in [18] with the use of
discrete wavelet analysis.
In the context of natural language processing (NLP), [19] propose the Hierarchical Attention Network (HAN), which captures
two basic insights into the structure of documents. Instead of filtering for sequences and the order of tokens with context, they apply
context to uncover if and when a sequence of tokens can be considered relevant. Recently, [20] developed a Transformer-based
approach to forecasting multivariate time series data; as a novelty, they take advantage of the self-attention mechanism to
capture complex dependencies of various lengths in time series data. Wu et al. [20] also suggest that their model can be
generalized to discover the hidden relationship between two arbitrary points in spatio-temporal space. Huynh et al. [21] use an
innovative shortcut to analyze sequential pattern lattices, which avoids repeated data copies during execution.
Some data analytics on LMSs have been reported in related works, such as [22–25]. Most of these studies focus on analyzing user
behavior and detecting patterns from interaction with the LMS in a post-processing fashion, mainly to establish correlations with the
final outcomes of the students' academic activities. The approach presented here is primarily focused on quasi-real-time analytics,
that is, processing the click-stream data as they are actively generated.
From this close look at the related literature, the lessons can be summarized as follows. First, LSTM and GRU
seem to be the preferred RNN variants when addressing time-series forecasting. Second, when considering multivariate correlated
sequential data, hybrid approaches are often adopted; these methods generally use some type of hybrid model to select the sequential
data features that are likely to be affected by correlation in the data, which is key when the data to be processed mix categorical
and numerical variables. Finally, the review shows that applying probabilistic inference models facilitates the interpretability of model
outputs, while deep learning techniques deliver a more efficient search for better performance.
In this research, we address the prediction problem for categorical sequential data and examine how click-stream data can improve
performance when the target variable is highly stochastic. The solution explored in this paper
is a model that utilizes and builds upon existing state-of-the-art methods. More specifically, it combines a Bayesian
Network [5] with a modified Sequence-to-Sequence model [26] based on an Encoder–Decoder architecture [27,28] powered by
Long Short-Term Memory (LSTM) [29] and Gated Recurrent Unit (GRU) [30] neural networks.
This work focuses on building a model to predict categorical sequential data with highly random stochastic components. In the
solution presented, click-stream data are used to improve the performance of the model; the click-stream data are associated with
every stage (time step) of the sequential data.
The Transformer model, an evolution of encoder–decoder models based on LSTM/GRU RNNs, is one of the most successful
approaches for sequential data prediction. The key feature of Transformer models is the attention mechanism introduced in encoder–
decoder architectures [27,28]. The role of the attention mechanism is to capture short- and long-range dependencies in the portion
of input data to be processed next and to understand the context of those dependencies, which leads to the next prediction output. In
other words, the attention mechanism actively decides when and how to adjust the weights of the relevant dependencies
in portions of the data during training in order to improve the forecasting output. In this picture, therefore, the context refers to
the dependencies that can play a key role in a given portion of the data and that may be specific only to that portion.
The Transformer model performs very well on Natural Language Processing (NLP) problems, especially in applications such as
language translation, language modeling, and audio-to-text conversion [31]. Despite its unquestionable success in many
applications, the Transformer model still has some limitations.
On the one hand, the computational cost of Transformer models grows quickly because of the additional weight
parameters. As has been noted in the context of sequence-to-sequence models (e.g., RNN-based encoder–decoders) [26],
attention is a very effective mechanism because the context vector carries information from the encoder that can be relevant
to the prediction. The attention mechanism works by combining the encoder's output at every time step with the decoder's output
at time step 𝑡 to generate the context vector for that particular time step. When the input data are long sequences,
this procedure greatly increases training time [28].
On the other hand, in problems such as NLP it is relatively easy to extract some meaning from portions of the data, which makes it
possible to establish a mapping towards a meaningful output; an ideal mapping tool in that case is a dictionary. In other
cases, such as DNA sequencing problems, the helping tool comes from biochemical interactions [32]. When the prediction problem
deals with categorical sequences that evolve as a stochastic variable, finding a suitable mapping tool is difficult. This
can be the case when the target variable depends on personal preferences and no ground truth is available [3].
Our approach in this work takes another direction. We propose to use click-stream data as a source of context for the portion
of data to be processed. The click-stream data will feed the modified attention mechanism in our model by creating a scoring index,
based on the joint probability of a BN model [33], associated with every time step of the categorical target variable. The
evolution of this Bayesian scoring in each sub-sequence fed to the network provides additional information that can help
the sequence-to-sequence model learn a meaningful context and improve its predictive capability.
The methodology can be summarized as follows: the click-stream data associated with the categorical target variable are used to build a
Bayesian Network (BN). This BN computes a scoring value for each state of every sub-sequence of the input sequence. The length of
each sub-sequence is a hyper-parameter of the model, to be optimized so that an appropriate balance between accuracy and speed
is achieved as part of the overall performance of the model.
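As a minimal illustrative sketch (in Python, the language of the implementation described later), the windowing into fixed-length input/target sub-sequences could look as follows; the helper name, the toy activity labels, and the length of three are assumptions for illustration, not values taken from the experiments.

```python
def make_subsequences(activities, length):
    """Split the ordered activity sequence into (input, target) sub-sequence pairs
    of a fixed length; the length is a hyper-parameter of the model."""
    pairs = []
    for i in range(len(activities) - 2 * length + 1):
        x = activities[i:i + length]                 # input sub-sequence
        y = activities[i + length:i + 2 * length]    # next sub-sequence to predict
        pairs.append((x, y))
    return pairs

# Illustrative call with a toy activity log and sub-sequences of three steps
pairs = make_subsequences(
    ["reading", "quiz", "forum", "reading", "quiz", "quiz", "forum", "reading"], 3)
```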
The Transformer model with the modified attention mechanism uses the Bayesian scoring values to better analyze the context of every
single state in a sub-sequence; the context here refers to the transition probabilities within each sub-sequence. The calculated
Bayesian scoring indexes are added to the attention layer as a modulator over the encoder output and the decoder output before
the two are combined in the training process.
Finally, a new sub-sequence is predicted. At this stage of the work, the aim is to build a forecasting engine suitable for
sequential categorical data. The algorithm is powered by a Bayesian Network model and a recurrent neural network. When the
categorical sequence data are generated by an underlying stochastic process, an extra guide is needed to help the forecasting model
understand the ‘‘context’’, so that a given input sub-sequence can be used to predict the next sub-sequence.
In the proposed approach, the BN, fed by click-stream data, modifies the attention mechanism of a Transformer architecture.
Including the click-stream data can improve performance in prediction problems that involve categorical sequential random variables.
As mentioned before, the goal of this work is to assist the task of predicting a new subsequence of the target data by modeling it,
in combination with other correlated features in the click-stream data, as a BN-driven process. The proposed algorithm computes
a probability score related to the underlying BN process and uses it to customize the attention mechanism of a forecasting Transformer
model. Fig. 2 shows a high-level diagram of the whole solution.
The data used in this work consist of click-stream data from an experiment in an LMS in which 115 students were engaged in
six lab sessions performing several academic activities [34].
The detailed log data, as shown in Fig. 3, include, for example: activity, labeled based on the title of the web page
being browsed by the student; idle_time, the duration between the start and end time of a given activity, in milliseconds;
mClick_L, the number of left mouse clicks during a given activity; and KyS, the number of keystrokes during a given activity, among
other click data.
The solution builds upon and extends existing work. It is composed of a Bayesian Network [5], with state-to-state
transitions governed by an underlying Markov-chain process [33], and an RNN-based (GRU [30] and LSTM [29]) Transformer
model.
$$P(X_1, \ldots, X_n) = \prod_{i} P\big(X_i \mid P_k(X_i)\big) \qquad (1)$$
where $P_k(X_i)$ denotes the set of parent nodes of $X_i$ in the network.
At each time step, the user is in one of the activities accessible to her/him, and the number of keystrokes (KyS), left mouse
clicks (mCL), and other click data are recorded along with the time spent in the activity. The set {A}, that is, the whole sequence of
categorical values, is considered a latent variable of the model, while the click-stream data represent the observable data. A BN is
built using the sequential data for the activities and the click-stream data associated with them. The purpose of the BN component
in the solution is to compute a scoring index for each activity chosen by the user from the set of available activities (the state space).
The index depends on the Bayesian joint probability P(A_t, KyS, mCL).
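A toy sketch of this factorization is given below; the three-activity state space, the binary discretization of the click counts, and all probability values are illustrative assumptions, not quantities estimated from the dataset.

```python
import numpy as np

activities = ["reading", "quiz", "forum"]      # illustrative state space

# Conditional probability tables of the BN, following the factorization of Eq. (1):
# P(A_t | A_{t-1}), P(KyS_t | A_t), P(mCL_t | A_t), with KyS/mCL discretized as low/high
P_A = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1],
                [0.3, 0.2, 0.5]])              # rows: A_{t-1}, cols: A_t
P_KyS = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])                 # rows: A_t, cols: low/high keystrokes
P_mCL = np.array([[0.6, 0.4],
                  [0.4, 0.6],
                  [0.8, 0.2]])                 # rows: A_t, cols: low/high left clicks

def joint(prev_a, a, kys, mcl):
    """P(A_t = a, KyS_t = kys, mCL_t = mcl | A_{t-1} = prev_a), per Eq. (1)."""
    return P_A[prev_a, a] * P_KyS[a, kys] * P_mCL[a, mcl]

# e.g. a 'quiz' step with high keystrokes and low clicks following a 'reading' step
p = joint(activities.index("reading"), activities.index("quiz"), kys=1, mcl=0)
```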
The goal is to build an encoder–decoder algorithm powered by a BN built from the click-stream data. The graph network
is used to compute a score associated with every time stamp of the sequential data, and this score is then inserted into the
attention mechanism of the corresponding Transformer model.
4.1. Architecture
In our previous work, a solution was developed to predict a new sequence of online activity for users, e.g., students working online.
First, a Markov-chain model reads the entire activity history of one student and trains an Inspector module. At the same time, a
Predictor module, in that case a Long Short-Term Memory (LSTM) model, also reads the student's history of online activity and
predicts the new sub-sequence, which is the next sequence of steps. Second, the trained Inspector module evaluates both the
last sub-sequence and the predicted new sequence. Finally, the trained Inspector module emits a label to conclude the process.
In this work, as presented in Fig. 2, the model first loads the data provided by the pre-processing module and prepares them
into ordered sequences of online activities with additional data (e.g., time spent in activities, mouse clicks, etc.). The
recorded click-stream data are used to build a BN that computes a scoring, and this Bayesian scoring helps to analyze the context for the
prediction process. The score feeds the attention mechanism of a Transformer model in the Predictor module [3]. The focus is on
improving the performance of the Predictor module that creates the new sub-sequence. In this approach, the Predictor module is
designed as a Sequence-to-Sequence model with an Encoder–Decoder architecture. Its improvement is carried out in several steps. In the
first version of the Sequence-to-Sequence model, both the Encoder and the Decoder are feed-forward LSTMs. In the second version
of the model, the Encoder has bidirectional LSTM layers and the Decoder has feed-forward LSTM layers. In the third version, the
Sequence-to-Sequence model is built as a Transformer model, with the multiplicative attention (Luong attention) layer applied in the
Encoder-to-Decoder transition. Finally, in the fourth and last version, the model is a hybrid ensemble containing a BN model
coupled to a Transformer; the BN is added to the Transformer model by modifying its attention mechanism. This represents a substantial
change to the standard treatment of the attention mechanism in Transformers found in the literature, and a new approach for
analyzing the relevant context when forecasting sequential data that contain both categorical and numerical values.
In this approach, the BN is a model that represents how to estimate the next activity based on the previous one. The activities
are considered as latent or hidden variables, and the estimation process is based on scores that are computed through the observable
variables, which are the click-stream data and the time length of each activity. In this work, the purpose of the BN is to estimate a Bayesian
score to be used in the attention mechanism (AM) of the Transformer model (TM). The Bayesian score is included in the computation
of the alignment score, before the context is estimated. The BN diagram in Fig. 5 shows this process. The BN model computes
scores, that is, probabilities, for each incoming sequence, which are then used to compute the alignment score. Here, the Bayesian
Behavior Scoring is computed as
$$S_{BN}(A_t) = P(A_t \mid A_{t-1}) \cdot \exp\!\big[\,\alpha \cdot P(mCL_t \mid A_t) + \beta \cdot P(KyS_t \mid A_t) - \delta \cdot mMov_t\,\big] \qquad (2)$$
where mCL is the number of left mouse clicks, KyS is the number of keystrokes, A is the activity (state), and mMov is a proxy for the
elapsed time in the activity. The hyper-parameters α, β, and δ in Eq. (2) are meant to adjust the BN model. Each row of click-stream
data, as presented in Fig. 6, shows the observable variables, such as left mouse clicks, keystrokes, and mouse movement. These
observable indicators are processed by the BN, the Bayesian scoring is transferred to the Transformer model, and then the output
sequences are generated. The ensemble model is implemented in Python using the following libraries: Pandas, NumPy, Sklearn,
Math, Matplotlib, Keras (TensorFlow backend), and Statsmodels.
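A minimal sketch of the scoring of Eq. (2) is given below; the function name, the input probabilities, and the hyper-parameter values are illustrative assumptions, with the ‘0-1-0’ configuration shown as a usage example.

```python
import numpy as np

def bayesian_behavior_score(p_trans, p_mcl, p_kys, m_mov,
                            alpha=1.0, beta=1.0, delta=0.01):
    """Bayesian Behavior Scoring S_BN(A_t) of Eq. (2).

    p_trans : P(A_t | A_{t-1}), the BN/Markov-chain transition probability
    p_mcl   : P(mCL_t | A_t), probability of the observed left-click count
    p_kys   : P(KyS_t | A_t), probability of the observed keystroke count
    m_mov   : proxy for the elapsed time in the activity
    alpha, beta, delta : hyper-parameters weighting (or switching off) each term
    """
    return p_trans * np.exp(alpha * p_mcl + beta * p_kys - delta * m_mov)

# '0-1-0' configuration: only the keystroke term is kept (illustrative inputs)
s = bayesian_behavior_score(0.35, 0.2, 0.6, 120.0, alpha=0.0, beta=1.0, delta=0.0)
```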
This model consists of an Encoder–Decoder algorithm that performs the forecasting task in a sequence-to-sequence fashion. We
build several different approaches to the Sequence-to-Sequence model. The model breaks the series of activities down into chains of
short sub-sequences and predicts the upcoming sub-sequences based on the recorded click-stream data. Each of these models
is based on an LSTM/GRU RNN architecture.
Four versions of the Encoder–Decoder architecture are presented:
• The first model: an Encoder–Decoder architecture where the Encoder and the Decoder both feature a stacked LSTM/GRU with
two LSTM/GRU layers plus Dropout layers (a minimal Keras sketch of this variant is given after the list).
• The second model: also an Encoder–Decoder, where the Encoder has up to two bidirectional composite LSTM/GRU layers
and the Decoder has regular single layers.
• The third model: a Transformer model where the Encoder has two LSTM/GRU plus dropout layers, and the Decoder also has
two LSTM/GRU layers. The multiplicative attention (Luong attention) layer is applied in the encoder-to-decoder transition.
• The fourth model: a Transformer similar to the third model, but using the additive attention mechanism (Bahdanau attention).
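The following is a minimal Keras sketch of the first variant (a stacked LSTM Encoder–Decoder with dropout); the vocabulary size, sub-sequence length, hidden size, and dropout rate are illustrative assumptions rather than the settings used in the experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_states = 20      # size of the one-hot activity vocabulary (assumption)
seq_len = 5        # sub-sequence length hyper-parameter (assumption)
latent_dim = 64    # hidden size (assumption)

# Encoder: two stacked LSTM layers with dropout; final states initialise the decoder
enc_in = keras.Input(shape=(seq_len, n_states))
x = layers.LSTM(latent_dim, return_sequences=True)(enc_in)
x = layers.Dropout(0.2)(x)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(x)

# Decoder: two stacked LSTM layers, softmax over the activity vocabulary at each step
dec_in = keras.Input(shape=(seq_len, n_states))
y = layers.LSTM(latent_dim, return_sequences=True)(dec_in,
                                                   initial_state=[state_h, state_c])
y = layers.Dropout(0.2)(y)
y = layers.LSTM(latent_dim, return_sequences=True)(y)
dec_out = layers.TimeDistributed(layers.Dense(n_states, activation="softmax"))(y)

model = keras.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```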
Now, the next model consists of a hybrid ensemble containing a BN model coupled to a Transformer. The architecture details
of the modified Transformer are shown in Fig. 7. The BN is built using the click-stream data (i.e., keystrokes, left mouse clicks, and
the time the user spends in each activity). A Bayesian score is computed with this BN, which estimates the probability that the user
chooses to go from activity A to activity B, given the current click data.
The vector representation of each target step value is multiplied by the computed Bayesian score of its actual step value.
When the data enter the Transformer architecture, for each input sequence its Bayesian-scored copy is also fed to the
input layer of the RNN ($S_{BN}[A_t]$ is computed using Eq. (2)). The alignment score used to determine the context is computed as in the
Luong (dot) mechanism,
$$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top}\,\bar{h}_s \qquad (3)$$
where $h_t$ is the decoder hidden state at target step $t$ and $\bar{h}_s$ is the encoder hidden state at source step $s$, both carrying the
Bayesian scoring applied at the input layer. The attention weights are the softmax of these scores over the source steps,
$$\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \qquad (4)$$
A linear layer computes the vector of contexts before the data reach the output layer. The modified attention mechanism is
adjusted through the hyper-parameters α, β, and δ introduced to compute the Bayesian score. If a method other than the
dot product is used, the expression for the alignment score is modified accordingly [12,28,35]. Essentially, the modification
to the attention mechanism proposed here consists in applying a multivariate treatment to a highly random variable; this treatment
is based on a new correlated variable that accounts for the underlying Bayesian process driving the model (see Fig. 7).
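A minimal NumPy sketch of the idea follows. It is an illustration under the assumption that the Bayesian scores simply rescale the encoder states before the Luong dot alignment; the exact way the score enters the alignment is governed by the hyper-parameters and the attention method chosen, so the function below is a sketch rather than the definitive implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bn_modulated_luong_attention(enc_outputs, dec_state, s_bn):
    """Luong-style dot attention with encoder states modulated by the
    Bayesian Behavior Scores S_BN(A_t) of Eq. (2).

    enc_outputs : (T, d) encoder hidden states, one per input time step
    dec_state   : (d,)   current decoder hidden state
    s_bn        : (T,)   Bayesian scores for the input sub-sequence
    """
    scored_enc = enc_outputs * s_bn[:, None]   # modulate each encoder state
    align = scored_enc @ dec_state             # Luong dot alignment scores, shape (T,)
    weights = softmax(align)                   # attention weights
    context = weights @ scored_enc             # context vector, shape (d,)
    return context, weights

# Illustrative call: 5-step sub-sequence, hidden size 8
ctx, w = bn_modulated_luong_attention(np.random.rand(5, 8),
                                      np.random.rand(8),
                                      np.array([0.4, 0.9, 0.2, 0.7, 0.5]))
```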
5. Evaluation
The experiments described in this section evaluate the performance of the predictor and compare the model results to
a baseline model developed in our previous work [3], which is based on a stacked LSTM architecture. The predictor in this work
is an Encoder–Decoder, RNN-powered architecture. The performance of the BN as a tool to modify the attention mechanism in
Transformers is also tested. As described in the previous section, the experiments apply to four model architectures: (1)
Model-1: a feed-forward LSTM/GRU-based sequence-to-sequence model; (2) Model-2: a Bi(LSTM/GRU)-powered model; (3) Model-3: a
Transformer model featuring the Luong attention mechanism; and (4) Model-4: another Transformer model with its attention mechanism
modified by a BN.
One important challenge faced by this work is that the solution has to be capable of processing incoming data continuously
while, at the same time, continuously delivering reasonable outputs in its role as a detection artifact. Another aspect
considered in the design is the ability to self-adjust hyper-parameters to better cope with time efficiency in the whole
pattern-detection process. Considering that the predictor module (the BN-Transformer) has to process a collection of extra features, the
computational complexity will certainly increase, albeit only linearly with regard to the number of extra features.
Table 1a
BN configuration for the Transformer (GRU). The columns under Model-4 give the BN click-stream configuration (mCL-KyS-mMov) of the modified Transformer.

                      Baseline   Model-3   Model-4 (modified Transformer)
                      (LSTM)     (GRU)     0-0-0  1-0-0  0-1-0  0-0-1  1-1-0  1-0-1  0-1-1  1-1-1
Accuracy              31%        40%       43%    34%    44%    38%    37%    36%    37%    39%
Right Content         49%        43%       45%    41%    49%    42%    40%    41%    43%    42%
1st State Accuracy    32%        45%       49%    43%    51%    42%    39%    38%    35%    43%
Table 1b
BN configuration for the Transformer (LSTM). The columns under Model-4 give the BN configuration (mCL-KyS-mMov) of the modified Transformer.

                      Baseline   Model-3   Model-4 (modified Transformer)
                      (LSTM)     (LSTM)    0-0-0  0-1-0  1-1-1
Accuracy              31%        34%       36%    37%    37%
Right Content         49%        36%       39%    41%    42%
1st State Accuracy    32%        36%       43%    39%    42%
Table 2
Predictive capability of the model architectures (averages over all sessions).

                      Baseline   Model-1   Model-1   Model-2   Model-3   Model-3   Model-4   Model-4
                      (LSTM)     (LSTM)    (GRU)     (LSTM)    (LSTM)    (GRU)     (LSTM)    (GRU)
Accuracy              31%        26%       33%       18%       34%       40%       37%       44%
Right Content         49%        29%       37%       20%       36%       43%       41%       49%
1st State Accuracy    32%        29%       42%       28%       36%       45%       39%       51%
Tables 1a to 3b show the results of evaluating the performance of the BN-Transformer predictor in terms of accuracy, right content
(the percentage of the real states from the real sequence that can be found in the predicted sequence), and first-state accuracy (the
percentage of cases in which the first predicted state matches the real first state of the incoming sequence). Details of the dataset
used in the experiment are described in Section 3.2.
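The following sketch reflects our reading of these per-sequence scores; the function name and the example sequences are illustrative.

```python
def sequence_metrics(predicted, real):
    """Accuracy (position-wise match), right content (share of real states found
    anywhere in the prediction), and first-state accuracy for one sub-sequence."""
    accuracy = sum(p == r for p, r in zip(predicted, real)) / len(real)
    right_content = sum(r in predicted for r in real) / len(real)
    first_state = 1.0 if predicted[0] == real[0] else 0.0
    return accuracy, right_content, first_state

# e.g. the prediction gets the first state and the content right, but not the order
scores = sequence_metrics(["quiz", "forum", "reading"], ["quiz", "reading", "forum"])
```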
The first set of experiments attempts to determine the best configuration of the click-stream data for the composition of the BN.
In other words, in terms of Eq. (2), the task of these experiments is to determine which set of parameter coefficients
should be selected. This refers to the Bayesian probabilities associated with our observed variables: left mouse clicks (mCL),
keystrokes (KyS), and mouse movements (mMov). Each column of Table 1a named with a three-character combination of zeroes
and ones represents the three parameters α-β-δ in Eq. (2), which correspond to the three observed variables (mCL-KyS-mMov).
In Table 1a, Model-4, our BN-Transformer model (GRU), is compared with our baseline model from previous work (a stacked
LSTM) and with the standard Transformer (GRU). The configuration where only the keystroke variable is considered, that is,
column ‘0-1-0’, shows the best performance of all BN click-stream configurations. When the leading ‘0-1-0’ configuration is compared
to the baseline model for accuracy and first-state accuracy, the improvement is significant, with more than 10 percentage points
above baseline. For the right-content score, the leading configuration matches the baseline performance. Comparing the leading
configuration to the standard Transformer model (Model-3), our experiments show the superior performance of our model in all
three comparative scores. Our second-best BN click-stream configuration, ‘0-0-0’, also shows better performance for accuracy and
first-state accuracy when compared with both the baseline and Model-3. It should be noted from Eq. (2) that this configuration still
benefits from the Markov-chain transition probability of the BN.
In Table 1b, the analysis is reduced to three configurations, selected as follows: no BN parameters selected (α-β-δ all zeroes),
all BN parameters selected (all ones), and the best performer from Table 1a (‘0-1-0’). From Table 1b,
it is clear that the proposed model offers better performance than both the baseline and Model-3 for accuracy and
first-state accuracy, while the baseline model still behaves better for the right-content score.
The four model architectures, Model-1 to Model-4, and the corresponding experimental results are listed in Table 2. For Model-4,
Table 2 shows the results for the best configuration of BN stream data found in our experiments.
The third set of experiments, shown in Table 2, tests the accuracy and content of the model results. In terms of accuracy,
the GRU versions of all proposed models show better results overall than the baseline case (33% matching reality). The GRU version of
Model-4 in particular shows that the inclusion of a BN notably improves the results (44%). This effect is also noted in the GRU version of
Model-3, a regular Transformer (40%). Right-content performance of the baseline model, a stacked LSTM, is still competitive,
and our best runner is the only option that reaches the same standard (49% of right contents in the predicted states). Finally, in
predicting the first state of the incoming subsequence, our best performer, the GRU version of Model-4, outperforms all other
approaches with the highest score of 51% accuracy, compared to 32% in the baseline case, and more than 6 percentage points
above the regular Transformer (the GRU version of Model-3, at 45%).
Table 3a
Data resampled from one session. The last three columns give the BN configuration (mCL-KyS-mMov) of the modified Transformer.

(GRU)                 Baseline   Transformer   0-0-0   0-1-0   1-1-1
Accuracy              40%        45%           45%     50%     33%
Right Content         43%        49%           51%     52%     40%
1st State Accuracy    45%        57%           62%     65%     43%

(LSTM)
Accuracy              34%        39%           40%     44%     34%
Right Content         36%        41%           43%     46%     39%
1st State Accuracy    36%        51%           51%     59%     46%
Table 3b
Data resampled from all sessions. The last three columns give the BN configuration (mCL-KyS-mMov) of the modified Transformer.

(GRU)                 Baseline   Transformer   0-0-0   0-1-0   1-1-1
Accuracy              40%        23%           36%     34%     30%
Right Content         43%        24%           38%     38%     35%
1st State Accuracy    45%        33%           57%     52%     41%

(LSTM)
Accuracy              34%        28%           33%     31%     20%
Right Content         36%        29%           33%     34%     24%
1st State Accuracy    36%        39%           46%     43%     26%
A complementary experiment is carried out to examine the improvement in the performance of the proposed BN-Transformer
model with respect to the regular Transformer. A data resampling is performed to generate new datasets for a comparison of the
proposed approach with the standard Transformer.
For data generation, each new dataset was built by combining the data of two different students that had been randomly selected
from the same lab session. In order to retain the intrinsic activity-selection order of the tasks assigned in each session, every
pair of datasets is combined section by section while keeping the time-step ordering. Table 3a shows the performance comparison
between the BN-Transformer approach presented here and the standard Transformer approach for two of the configurations used.
The performance of the standard Transformer model with the original data (Model-3 in Table 2) is used as the baseline.
In addition, another collection of datasets was prepared in a similar fashion by mixing data from two randomly selected students,
but this time using their data across all lab sessions, again maintaining the time order among the mixed data. Once again, the
performance comparison presented in Table 3b shows that the better results come from the Transformer coupled to the BN, but in this
case only for the GRU architecture. It can be observed, however, that the overall performance of all models featuring an LSTM
architecture drops with respect to the performance of the baseline model (Model-3). This could indicate that the GRU architecture
does a better job of handling the large amount of noise introduced by mixing data from all lab sessions.
The outcome of this experiment confirms that the BN-Transformer approach introduced in this work represents an improvement
with respect to the standard Transformer in this particular forecasting problem.
Finally, the trade-off between running time and performance is discussed in more detail. All experiments were carried out on a
MacBook Pro (2.3 GHz Intel Core i7, 16 GB RAM). The chart in Fig. 8 shows the trade-off between the
additional time required for the model to run and the performance gains from the upgrade applied to the predictor component of
the smart-classifier built in [3].
The additional running-time rate $Q_{\Delta T}$ was estimated as
$$Q_{\Delta T} = \frac{\Delta T_{Model_i} - \Delta T_{Baseline}}{\Delta T_{Baseline}} \qquad (5)$$
and for the gain rates in Accuracy, Right Content, and 1st State Accuracy, $Q_{Score}$, we used
$$Q_{Score} = \frac{Score_{Model_i} - Score_{Baseline}}{Score_{Baseline}} \qquad (6)$$
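Both ratios are plain relative changes with respect to the baseline. For example, using the Table 2 values, the 1st State Accuracy gain of the GRU version of Model-4 is (0.51 − 0.32)/0.32 ≈ 0.59, i.e., the 59% figure quoted below; a one-line helper (the name is ours) makes this explicit.

```python
def relative_gain(model_value, baseline_value):
    """Relative change vs. the baseline, as in Eqs. (5) and (6)."""
    return (model_value - baseline_value) / baseline_value

q_accuracy = relative_gain(0.44, 0.31)       # ≈ 0.42 (42% over baseline)
q_first_state = relative_gain(0.51, 0.32)    # ≈ 0.59 (59% over baseline)
```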
When processing data from about one hundred students, the typical running time of each iteration of the smart-classifier in its
baseline version is roughly 6 to 8 min. The graph shows that increasing the running time of each iteration by a factor of 2 to 3.5 leads to an
improvement in performance reaching up to 42% and 59% over the baseline scores for Accuracy and 1st State Accuracy, respectively.
In the classification task, the model looks at subsequences of three to five consecutive activities, and according to the recorded
data, a typical student spends around 10 min on average in each activity. Each iteration of the smart-classifier has a running time
of up to 0.3 min when processing the data of an individual student. In that scenario, considering that the upgraded predictor
increases the running time by 3.5 times, this represents roughly a 1.1-min increase, which is a considerably small delay compared
to the average time per activity of about 10 min.
5.4. Discussion
In summary, the proposed hybrid BN-Transformer approach improves the overall performance of the smart-classifier, while
keeping the running time within a reasonable range [3]. It is worth mentioning that the significant improvement achieved for the
predictor module in particular, and for the overall model in terms of performance, is a promising step towards future improvements
and refinement of click-stream data processing in general. However, there are still some challenges to be addressed.
Potential long-term correlations in the target variable are continuously overridden by the BN scoring, which keeps no memory
beyond a couple of time periods preceding the current stage. This limitation could be addressed by a different approach, for
example, by using reinforcement learning instead of guidance through the BN. In that case, however, the learning process would be more
time-consuming, which would be difficult to accommodate in the limited time window that individual LMS sessions typically
offer.
At this time, the results obtained in the experiments provide a few key insights and guidelines for future work. By using a hybrid
model combining a BN and a Transformer, we were able to modify the attention mechanism of the Transformer model. This results in a
better capacity for context analysis and thus a better forecast. We learned that this approach is very useful when a
highly erratic and small dataset is the only choice available, which is often the case when trying to provide advice in the early stages
and in quasi-real time. This is a common scenario when analyzing log data from a Learning Management System (LMS) in quasi-real
time.
The results discussed in this paper show that the approach followed in this research improves the Predictor module of the solution
presented in our previous work. As discussed in the Evaluation section, we were able to obtain better results for the key
metrics: accuracy, right content, and first-state accuracy. The improved Predictor is able to provide a consistently better
forecast of the first state of the next subsequence of activities, which better enables the model to produce a timely response. This
is a key feature of the solution that we want to continue prioritizing in future work. We wish to highlight that the use of a BN to help
the system understand the context in a dynamic data-generation environment can substantially improve performance at the early
running stages of a helper monitoring algorithm.
The insights delivered by the solution could provide valuable help for both students and professors in quasi-real-time LMS settings.
Future work could attempt to connect the patterns detected from click-stream data to other outputs of students' interactions with
the LMS.
A key approach presented in this work is the modification of the Transformer's attention mechanism. In future work, this could be
applicable to a wider range of problems in which extra features in the data provide insights about the underlying
process that drives the evolution of the target variable. Although the possibility of modifying the way alignment
scores are computed in Transformer models could be anticipated from the work of one of the attention pioneers [28], with this
work we provide a concrete example that can be extended to other problems in customized manners.
In future work, a reinforcement learning approach to boost the performance of an RNN-based forecasting engine will be explored.
The focus could be on evaluating the click-stream data as a proxy to infer user intentions, and on its contribution to improving the
entire detection algorithm, that is, the smart-classifier. A possible improvement could be to individually optimize a neural network
architecture for each session; this could significantly improve the predictive capacity of the model and, consequently, the overall
performance of the detection task. Finally, exploring other combinations of deep RNN architectures coupled to Bayesian Networks
and other statistical methods, such as genetic algorithms, could help to better interpret the outcomes of hybrid deep-learning models.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and Carleton
University, Canada. This paper is based on, and forms part of, the thesis of Alexis Amezaga Hechavarria, supervised by M. Omair Shafiq
at Carleton University, 2020–2021.
References
[1] Y. Matsubara, Y. Sakurai, Dynamic modeling and forecasting of time-evolving data streams, in: Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, in: KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 458–468,
http://dx.doi.org/10.1145/3292500.3330947.
[2] M. Tan, C. dos Santos, B. Xiang, B. Zhou, LSTM-Based deep learning models for non-factoid answer selection, 2015, ArXiv E-Prints, arXiv:1511.04108.
[3] A. Amezaga Hechavarria, M.O. Shafiq, Modeling and predicting online learning activities of students: an hmm-lstm based hybrid solution, in: 2021 20th
IEEE International Conference on Machine Learning and Applications (ICMLA), 2021, pp. 682–687, http://dx.doi.org/10.1109/ICMLA52953.2021.00114.
[4] A. Borisov, I. Markov, M. de Rijke, P. Serdyukov, A neural click model for web search, in: Proceedings of the 25th International Conference on World
Wide Web, International World Wide Web Conferences Steering Committee, 2016, http://dx.doi.org/10.1145/2872427.2883033.
[5] O. Pourret, Bayesian Networks : A Practical Guide to Applications, John Wiley, Chichester, England Hoboken, NJ, 2008.
[6] J.C. Chang, Predictive Bayesian selection of multistep Markov chains, applied to the detection of the hot hand and other statistical dependencies in free
throws, R. Soc. Open Sci. 6 (3) (2019) 182174, http://dx.doi.org/10.1098/rsos.182174.
[7] J. Unnikrishnan, Oil and Gas Processing Equipment: Risk Assessment with Bayesian Networks, CRC Press, Boca Raton, ISBN: 9780429287800, 2021.
[8] S. Jha, K. Tan, R.A. Maxion, Markov chains, classifiers, and intrusion detection, in: Proceedings of the 14th IEEE Workshop on Computer Security
Foundations, in: CSFW ’01, IEEE Computer Society, USA, 2001, p. 206, http://dx.doi.org/10.5555/872752.873519.
[9] R. Begleiter, R. El-Yaniv, G. Yona, On prediction using variable order Markov models, J. Artif. Int. Res. (ISSN: 1076-9757) 22 (1) (2004) 385–421.
[10] V. Tran, D. Maxwell, N. Fuhr, L. Azzopardi, Personalised search time prediction using Markov chains, in: Proceedings of the ACM SIGIR International
Conference on Theory of Information Retrieval, ACM, 2017, http://dx.doi.org/10.1145/3121050.3121085.
[11] M. Hanif, F. Sami, M. Hyder, M.I. Ch, Hidden Markov model for time series prediction, J. Asian Sci. Res. 7 (5) (2017) 196–205, http://dx.doi.org/10.
18488/journal.2.2017.75.196.205.
[12] Y. Zhou, L. Wang, R. Zhong, Y. Tan, A Markov chain based demand prediction model for stations in bike sharing systems, Math. Probl. Eng. 2018 (2018)
1–8, Hindawi Limited, http://dx.doi.org/10.1155/2018/8028714.
[13] A. Mizera, J. Pang, Q. Yuan, Reviving the two-state Markov chain approach, IEEE/ACM Trans. Comput. Biol. Bioinform. 15 (5) (2018) 1525–1537,
http://dx.doi.org/10.1109/tcbb.2017.2704592.
[14] C. Wu, F. Wu, T. Qi, Y. Huang, User modeling with click preference and reading satisfaction for news recommendation, in: Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2020,
http://dx.doi.org/10.24963/ijcai.2020/418.
[15] T. Hatt, S. Feuerriegel, Early detection of user exits from clickstream data: A Markov modulated marked point process model, in: Proceedings of the Web
Conference 2020, ACM, 2020, http://dx.doi.org/10.1145/3366423.3380238.
[16] Y. Karadayı, M.N. Aydin, A.S. Öğrenci, A hybrid deep learning framework for unsupervised anomaly detection in multivariate spatio-temporal data, Appl.
Sci. 10 (15) (2020) 5191, http://dx.doi.org/10.3390/app10155191.
[17] P. Kim, Convolutional neural network, in: MATLAB Deep Learning, Springer, 2017, pp. 121–147.
[18] A. Jamshed, B. Mallick, P. Kumar, Deep learning-based sequential pattern mining for progressive database, Soft Comput. 24 (22) (2020) 17233–17246,
http://dx.doi.org/10.1007/s00500-020-05015-2.
[19] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics,
San Diego, California, 2016, pp. 1480–1489, http://dx.doi.org/10.18653/v1/N16-1174.
[20] N. Wu, B. Green, X. Ben, S. O’Banion, Deep transformer models for time series forecasting: The influenza prevalence case, 2020, ArXiv E-Prints,
arXiv:2001.08317.
[21] H.M. Huynh, L.T. Nguyen, B. Vo, U. Yun, Z.K. Oplatková, T.-P. Hong, Efficient algorithms for mining clickstream patterns using pseudo-idlists, Future
Gener. Comput. Syst. 107 (2020) 18–30, http://dx.doi.org/10.1016/j.future.2020.01.034.
[22] S. Graf, Kinshuk, Analysing the behaviour of students in learning management systems with respect to learning styles, in: Advances in Semantic Media
Adaptation and Personalization, Springer Berlin Heidelberg, 2008, pp. 53–73, http://dx.doi.org/10.1007/978-3-540-76361_3.
[23] A. Maratea, A. Petrosino, M. Manzo, User click modeling on a learning management system, Int. J. Hum. Cap. Inf. Technol. Prof. 8 (4) (2017) 38–49,
http://dx.doi.org/10.4018/ijhcitp.2017100104.
[24] M. Cantabella, R. Martínez-España, B. Ayuso, J.A. Yáñez, A. Muñoz, Analysis of student behavior in learning management systems through a big data
framework, Future Gener. Comput. Syst. 90 (2019) 262–272, http://dx.doi.org/10.1016/j.future.2018.08.003.
[25] D.A. Filvà, M.A. Forment, F.J. García-Peñalvo, D.F. Escudero, M.J. Casañ, Clickstream for learning analytics to assess students’ behavior with Scratch, Future
Gener. Comput. Syst. 93 (2019) 673–686, http://dx.doi.org/10.1016/j.future.2018.10.057.
[26] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the 27th International Conference on Neural
Information Processing Systems - Volume 2, in: NIPS’14, MIT Press, Cambridge, MA, USA, 2014, pp. 3104–3112.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS’17, Curran Associates
Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.
[28] T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1412–1421, http://dx.doi.org/10.18653/
v1/D15-1166.
[29] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[30] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014, arXiv preprint
arXiv:1412.3555.
[31] L. Wu, S. Li, C.-J. Hsieh, J. Sharpnack, SSE-PT: Sequential recommendation via personalized transformer, in: Fourteenth ACM Conference on Recommender
Systems, ACM, 2020, http://dx.doi.org/10.1145/3383313.3412258.
[32] S. Shadab, M.T.A. Khan, N.A. Neezi, S. Adilina, S. Shatabda, DeepDBP: Deep neural networks for identification of DNA-binding proteins, Inform. Med.
Unlocked 19 (2020) 100318, http://dx.doi.org/10.1016/j.imu.2020.100318.
[33] R. van de Schoot, S. Depaoli, R. King, B. Kramer, K. Märtens, M.G. Tadesse, M. Vannucci, A. Gelman, D. Veen, J. Willemsen, C. Yau, Bayesian statistics
and modelling, Nat. Rev. Methods Primers 1 (1) (2021) http://dx.doi.org/10.1038/s43586-020-00001-2.
[34] M. Vahdat, L. Oneto, D. Anguita, M. Funk, M. Rauterberg, A learning analytics approach to correlate the academic achievements of students with interaction
data from an educational simulator, in: Design for Teaching and Learning in a Networked World, Springer International Publishing, 2015, pp. 352–366,
http://dx.doi.org/10.1007/978-3-319-24258-3.
[35] M. Du, F. Li, G. Zheng, V. Srikumar, DeepLog: Anomaly detection and diagnosis from system logs through deep learning, in: Proceedings of the 2017
ACM SIGSAC Conference on Computer and Communications Security, in: CCS ’17, Association for Computing Machinery, New York, NY, USA, 2017, pp.
1285–1298, http://dx.doi.org/10.1145/3133956.3134015.
Alexis Amezaga-Hechavarria is currently pursuing a Ph.D. in Information Technology at Carleton University. His current research interests include Data Modeling,
Big Data Analytics, Machine Learning and Deep Learning combined with Probabilistic Inferential Models. He holds a Master in Information Technology with
Data Science specialization from Carleton University (2021). He received a Master’s degree in Physics in 1999 from the Faculty of Physics of the University of
Havana. From 2006 to 2014, he worked as an Assistant Professor at the Institute of Physics and Mathematics, Austral University of Chile; during this period,
his research interest focused on Computational Materials Science.
M. Omair Shafiq is an Associate Professor at the School of Information Technology, Carleton University since August 2016. His research interests include Data
Modeling, Big Data Analytics, Machine Learning and Deep Learning, Human-centered Artificial Intelligence, Web and Social Media Analytics. He has received
awards such as Teaching Excellence Award from the Faculty of Engineering and Design (FED), Carleton University, 2021, New Faculty Excellence in Teaching
Award, from Associate Vice-President (Teaching and Learning), Carleton University, 2019, NSERC Postdoctoral Fellowship Award, 2015–2016 competition,
MITACS Elevate Postdoctoral Fellowship Award, 2015–2016 competition, Vanier Scholarship, 2012, Alberta Innovates – Technology Futures (iCore) Graduate
Student Award (Ph.D.), 2010 and 2011, and the J. B. Hynes Research Innovation Award, University of Calgary, Alberta, Canada, 2011. He has published several
peer-reviewed research papers in journals, conferences, and workshops, served in technical program committees of several conferences and workshops, as well
as co-organized conferences and workshops.