Potent Real-Time Recommendations Using Multimodel Contextual Reinforcement Learning
Manuscript received October 12, 2020; revised May 9, 2021 and July 16, 2021; accepted July 21, 2021. Date of publication August 5, 2021; date of current version April 1, 2022. (Anubha Kabra, Anu Agarwal, and Anil Singh Parihar contributed equally to this work.) (Corresponding author: Anil Singh Parihar.)
Anubha Kabra was with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India. She is now with Adobe Systems, Noida 201304, Uttar Pradesh, India (e-mail: [email protected]).
Anu Agarwal was with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India. She is now with the Samsung Semiconductors India Research and Development Centre, Bengaluru 560037, Karnataka, India (e-mail: [email protected]).
Anil Singh Parihar is with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSS.2021.3100291

NOMENCLATURE
Q      Item model.
R      Reward function.
a      Action.
s      State.
r      Reward.
U      User set.
E      Experience buffer.
i_h    User-specific item history.

I. INTRODUCTION
IN TODAY'S fast-paced commerce, efficient recommendations to user bases have become a priority for most industries. From media visuals to advertising and e-commerce, the recommendation of items holds a central place. Hence, to keep pace with a continuously changing environment, user base, and market inclinations, a real-time, competent, and personalized system is necessary. For instance, Amazon, Facebook, Google, and most other big and small firms earn a large share of their revenue from recommending advertisements [1]. Retaining customers by showing them the right recommendation at the right time becomes the ultimate goal in a competitive marketplace. Over time, a plenitude of ideas has been proposed to solve the real-time item recommendation problem. A variety of static techniques, such as modifications of collaborative filtering (CF) [2]–[4] and certain hybrid methodologies [5], have been proposed. However, these are static and do not perform efficiently in real-time scenarios. Static recommendation strategies have a fixed training and testing set, which restricts them from capturing changes in user behavior over time: the recommendations shown to users during testing depend solely on the information learned during training. For instance, if a user prefers type A items during training but shifts to type B later, the model might still recommend type A items because it never learned the change. It ignores the stochastic nature of the user. Fig. 1 reflects the dynamic nature of the two real-time
datasets used in our experimentation. We plotted the occurrence of five random items against time for each dataset. The occurrence of an item clearly varies over the timeline; hence, training on a fixed segment of data cannot guarantee that the same trend will hold during testing. In Fig. 1, for the TalkingData dataset, if the training period covers items 1, 3, and 4 and the trend shifts toward items 2 and 5 after training, the items recommended to users during testing would still be 1, 3, and 4 instead of 2 and 5.

Content-based recommendation is a separate class of recommendation algorithms that leverages user and item features [6], [7]. A segment of these algorithms is built on top of deep learning networks [8], [9], since they can capture non-linear user-item interactions more accurately. However, many content-based techniques [6], [9] suffer from the cold-start problem. This problem occurs because insufficient information is available to make initial recommendations when new users enter the system; thus, random recommendations are shown to the user. Solving the cold-start problem is crucial given the continuously changing user base in a real-time environment. In another line of research, reinforcement learning (RL) has been gaining popularity in gaming [10], [11], automation, and robotics [12], [13]. The general norm in an RL-based problem is to have one or more agents in an environment, according to which they take actions. The goal is to take actions in the environment that maximize the long-term reward. The task of generating recommendations can be viewed as an RL problem, the purpose being to maximize customer retention and provide a quality user experience [14]–[16].

Another major challenge observed in the recommender system literature is the neglect of user bases that do not always prefer mainstream or popular content [15]. The preferences of such users can be captured through item-specific segregation of the user space. RL techniques include a module that decides whether to explore or exploit, and effective exploration can advance the working of an RL model. However, state-of-the-art exploration techniques may result in high randomness [17], which adversely affects the model, and some exploration strategies take a long time to learn a trend [18]. In a fast-paced, real-time scenario, this negatively affects the recommendations.

Hence, the current state-of-the-art techniques face one or more of the following challenges.
1) They fail on the dynamic front and hence fail to give real-time recommendations.
2) They suffer from the cold-start problem.
3) They neglect unconventional item choices.
4) They fail to perform optimal yet effective exploration.

To tackle all the aforementioned challenges in delivering dynamic and personalized recommendations, we propose the Multimodel Contextual Reinforcement Learning (MMCR) framework. To address the issue of model stationarity, MMCR uses RL-based Deep Q-Networks. Its advantages are twofold. First, the use of deep neural networks enables efficient representation of all underlying non-linearities present between dataset features. Second, RL leverages the sequential nature of the problem and grasps any observable change in user preferences via continuous updates and feedback.

To address the problem of new users and users with unconventional choices, MMCR has a provision for separate models corresponding to every item in the itemset. This enables each model to identify the characteristics of its audience properly. Thus, whenever a new user or a user with unconventional choices is encountered, the multimodel framework is able to provide relevant recommendations based on user characteristics. It also ensures that popular items do not have an unfair edge over unpopular ones. Furthermore, we propose a novel contextual cluster strategy for exploration. It works on the principle that random actions should be explored only in an unforeseen situation. Clusters of user-context embeddings are formed to make the implementation of this principle feasible and efficient. As our knowledge of the environment gradually increases, the number of random items in the recommendation list decreases.

In this article, any subsequent mention of the term environment refers to the complete set of users and items available in the dataset. The system refers to our MMCR framework, and the term model corresponds to the per-item neural network. State refers to the embedding generated using user features, session-context features, and the user's item history. The terms arm and action, used interchangeably in the text, indicate items in the itemset. When a user requests recommendations from the system, his/her specific features along with session-context features serve as the input. This input is further processed by the state generator (SG) module to obtain a state embedding. Based on this state, the system selects the k most relevant items and presents them to the user as output to obtain feedback. However, only the clicked item obtains actual feedback from the user in the form of a positive reward. All obtained feedback is stored in the system memory to be replayed on the respective models after fixed intervals.

We summarize our novelty and key contributions as follows.
1) We propose a novel multimodel framework that maintains an independent agent per item. Such a framework gives this architecture an edge over traditional recommendation techniques by integrating users with unconventional item choices.
2) We introduce a novel Contextual Cluster Exploration (CCE) strategy, which provides an adaptive exploration rate by forming clusters based on the existing knowledge of the environment. CCE considerably outperformed traditional exploration strategies.
3) We propose a first-of-its-kind method of incorporating the user's interaction history with the system into the input state. This enables the model to learn effectively from the user's previous interactions.
4) We leverage the benefits of both user and item features to recommend relevant items.
5) We demonstrate the effectiveness of our framework through thorough experimentation on two publicly available real-world datasets.

The rest of this article is structured as follows. Section II covers the background and related work, Section III gives the problem definition, and Section IV focuses on the proposed work and our pipeline for the MMCR technique. Section V presents comprehensive experimentation and results evaluation. Section VI concludes this article and discusses the scope of future work in this domain.
II. BACKGROUND AND RELATED WORK
This section is divided into three subsections based on the recommendation strategy used. The first subsection summarizes the prior art that treats recommendation as a supervised learning problem, the second lists the prominent strategies based on single-agent RL, and the last describes the literature that treats it as a multi-agent RL problem.

A. Traditional Recommendation Strategies
A significant amount of research has been done to solve the problem of generating recommendations that maximize the cumulative gain. The initial literature on recommendation describes systems that are purely based on item popularity [19]. A critical class of recommenders that followed was content-based filtering [6], [7], which takes into account the correlations between user and item features. Another major breakthrough was achieved with the advent of CF [3], [4], [20], which can be further categorized into user-based, item-based, and hybrid filtering. In user-based filtering, recommendations are made based on the popular choices of the most similar users [21]. Item-based filtering [22] shows items that are highly similar to the ones the user has previously selected. Other variants, such as Matrix Factorization [23], Factorization Machines [24], [25], and logistic regression models, have also given a performance boost to CF-based methods. Deep neural networks have been incorporated into recommenders to efficiently model complex non-linearities [26]. Wang et al. [27] use denoising auto-encoders on top of a Matrix Factorization model to extract useful features from the available side information. The key feature that distinguishes our model from these techniques is our framework's ability to learn at every step of training as well as testing. Hence, it can learn all the dynamic trends from the incoming data, unlike traditional techniques whose learning is static and limited to training-data patterns.

B. Single-Agent RL-Based Recommendation
In a generic single-agent RL setting, the agent observes a state s_t from the environment at timestep t. It uses a policy π to obtain the best action for the given state s_t. The action is pushed to the environment, and a reward r_t is obtained, leading to the next state s_{t+1}. The objective is to tune the parameters θ of the policy π to obtain the maximum cumulative expected reward. RL frameworks can be broadly divided into value-based and policy-based methods. Both methods involve the system in a current state s_t, and γ refers to the discount factor.

1) Value-Based RL: Value-based RL involves the estimation of an action-value function. This function determines the expected sum of discounted future rewards for a specific action a_t taken in the current state s_t. The Q-value function (Q) described above can be expressed mathematically using the Bellman equation [28] as follows:

$Q(s) = \mathbb{E}\big[\,r_{t+1} + \gamma\, Q(s_{t+1}) \mid s_t = s\,\big]$.  (1)

Deep Q-Networks (DQN) [31] use deep neural networks to estimate the Q-function, and Osband et al. [29] proposed deep exploration for DQN via bootstrapping. The loss function (L) minimized by DQN is

$L(s_t, a_t) = \big(\,r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s_t, a_t; \theta)\,\big)^2$.  (2)

Here, θ' refers to the parameters of the target network and θ to the parameters of the online network. The target network is used to generate the Q-value predictions for a certain fixed number of iterations; after these iterations, the target network's weights are updated to θ.

Several attempts have been made to leverage Q-Learning [30] and DQN frameworks for effective recommendations [31], [32]. Zheng et al. [15] use DQN as a base model with a Dueling Bandit Gradient Descent strategy for exploration. User profiling has also been cast as a value-based RL problem for generating recommendations [33]. Zhao et al. [34] leverage long-term and session-based knowledge using adversarial training and RL for recommending movies.

2) Policy-Based RL: Policy-based RL is especially helpful in the case of continuous state and action representations, which cannot be dealt with efficiently using Q-Learning or DQN. It uses gradient ascent to optimize the parameters for maximizing the expected reward. In policy-based methods such as actor-critic [35], the actor accepts the state as input and outputs a continuous action value. The action is further passed to the critic network along with the input state vector to generate the Q-value for the state-action pair. The gradient of the objective J for actor-critic is

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R\Big]$  (3)

where R is the sum of gamma-discounted rewards. Furthermore, the system generates a list of recommendations consisting of all top-scoring items. Several recommendation strategies [14], [36]–[38] use actor-critic as the base RL model with modifications in state representation, exploration functions, etc., to provide better results.
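As a concrete illustration of the update in (2), the following minimal NumPy sketch computes the squared TD error for a single transition. The q_online/q_target callables (returning one Q-value per action) and all parameter names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def dqn_td_loss(q_online, q_target, theta, theta_target,
                s_t, a_t, r_t, s_next, gamma=0.99):
    """Squared TD error from (2):
    (r_t + gamma * max_a' Q(s_{t+1}, a'; theta') - Q(s_t, a_t; theta))^2."""
    bootstrap = r_t + gamma * np.max(q_target(theta_target, s_next))  # frozen target network theta'
    prediction = q_online(theta, s_t)[a_t]                            # online network theta
    return (bootstrap - prediction) ** 2
```

As described above, the target parameters θ' stay frozen for a fixed number of iterations and are then synchronized with the online parameters θ.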
C. Multi-Agent RL-Based Recommendation
A multi-agent RL strategy consists of multiple RL agents trying to collectively maximize the discounted cumulative long-term reward. However, with multiple agents acting simultaneously [39]–[41], each agent has only a partial view of the environment. The most common approach followed in multi-agent systems is decentralized execution with standard Q-value function evaluation [42]. However, this involves a significant credit-assignment issue for each agent. The global action performed is usually the combined result of the execution performed by all agents together; hence, the global action receives a global reward that cannot be linked to any one agent in particular [43]. Huang et al. [44] compare the global reward with the reward obtained by taking a default outcome for all other agents except the one for which the reward is being calculated. Independent Q-learning [45] models the multi-agent setting by learning wholly separate and independent DQN networks, with the environment as the only source of interaction between them.

However, one problem that remains is the convergence and stability of the system. Since all other agents are updating their weights simultaneously, no single agent can consider its environment stationary, and hence it ends up chasing an ever-moving target. Moreover, most of the available literature on multi-agent systems deals only with game-playing scenarios and not with recommendation systems. Our approach uses a separate agent corresponding to each available action (which is feasible in a recommendation-system scenario, as there is only a discrete number of items to choose from) in an attempt to overcome the issues mentioned above.

Fig. 1. Occurrences of five random items over time for the TalkingData and Globo datasets, respectively.

III. PROBLEM DEFINITION
Whenever a user i with user features u_i and context features c_i makes a recommendation request to the system, our model generates a list of the top-k most relevant recommendations for that user. These recommendations are based on the results obtained from N RL agents, where N is the fixed number of items in the available itemset. The notations used throughout this article are summarized in the Nomenclature section.

IV. PROPOSED WORK
In this section, we elaborate extensively on the segments of our proposed architecture. The novelty of the proposed work is as follows.
1) We develop a novel multimodel architecture for recommendation systems using a per-item agent strategy with independent credit assignment.
2) The SG module is a first-of-its-kind technique to incorporate the user interaction history and user features into the input. This enables the model to learn effectively from the user's previous interactions.
3) CCE is a new method to perform enriched exploration by relying on the present knowledge of the environment to make smart and less randomized choices.

We first explicate the SG module, the novel exploration strategy (CCE), and multi-agent RL to define the concepts used in our framework. We then elucidate our model architecture, which incorporates all the aforementioned segments.

A. SG Module
This module leverages the extra information present in commonly available user-interaction logs, such as user, item, and context features. Fig. 2 shows its overall working. Its functioning can be broadly divided into two parts.
1) Maintaining the User's Interaction History: Simple aggregation of all the item labels that the user has previously clicked not only consumes excessive storage but also provides very little insight into user behavior. Instead, we propose to maintain the interaction history in terms of item-feature vectors. Hence, whenever a new item a is to be added to a user's history vector h_u, the vector is updated as follows:

$h_u = \alpha h_u + a$.  (4)

Here, α is an age multiplier used to decrease the impact of old feature values as they move further into the past and to prevent the feature values from exploding.
2) Generating State Embeddings: An autoencoder is used at this step to generate uniform embeddings. The user and context features from the input request are concatenated with the interaction-history feature vector i_h to form $v = [u_t, c_t, i_h]$. This concatenated vector v is fed as input to the autoencoder, and the output from the autoencoder is the required state embedding s_t.
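A minimal sketch of the SG module's two steps, assuming dense NumPy feature vectors, an α in (0, 1), and an `encode` callable standing in for the trained autoencoder's encoder; all names and the default α are illustrative assumptions.

```python
import numpy as np

def update_history(h_u, item_feats, alpha=0.9):
    """Eq. (4): age-discount the running history vector, then add the newly clicked item's features."""
    return alpha * h_u + item_feats

def state_embedding(encode, user_feats, context_feats, h_u):
    """Concatenate user, session-context, and interaction-history vectors,
    then compress them with the autoencoder's encoder to obtain the state embedding s_t."""
    v = np.concatenate([user_feats, context_feats, h_u])
    return encode(v)
```

Because α < 1, older clicks decay geometrically, which keeps the history vector bounded while still emphasizing recent behavior.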
B. CCE: Contextual-Cluster Exploration Strategy
For deciding between exploration and exploitation, we propose a novel strategy called CCE. It is represented diagrammatically in Fig. 3 (a Python sketch of this procedure is given at the end of this section). As described in Algorithm 1, we first initialize the centroids of n experience clusters randomly; the default value of n is 10. For every input request at time t, the distance of the generated state embedding s_t from each current centroid is calculated using the Euclidean distance. If the minimum distance d_min is greater than the decided threshold d_thresh, we explore; otherwise, we exploit. During exploration, k random items are shown to the user; otherwise, the k best items are chosen. After showing the items, the state embedding for the current user is updated according to the items clicked by him/her (s_t'). The distance of s_t' from all existing clusters is then calculated. If the minimum distance of s_t' is smaller than d_thresh, s_t' is assigned to the nearest cluster; otherwise, a new cluster is formed. However, if the total number of clusters exceeds c_thresh (default: 15), all existing user state embeddings are re-clustered. A clear representation of the same is shown in Fig. 1(b).

Algorithm 1: Context Clustering Exploration
Initialize: c_thresh; n random experience clusters; define the centroid c_i of each cluster randomly
for each user state embedding s_t do
    d_min = MAX_VAL, c_sel = None
    for i = 1 to n cluster centroids do
        d = Euclidean_dist(s_t, c_i)
        if d < d_min then d_min = d
    end
    if d_min > d_thresh then
        EXPLORE: recommend k random items to the user
    else
        EXPLOIT: recommend the k best items to the user
    end
    After showing the user the k items:
    for j in the k items do
        if Reward(j) = 1 then s_t' = updated state embedding
        else s_t' = original state embedding
    end
    d_min = MAX_VAL
    for i = 1 to n cluster centroids do
        d = Euclidean_dist(s_t', c_i)
        if d < d_min then d_min = d, c_sel = c_i
    end
    if d_min > d_thresh then
        assign a new cluster to this state embedding, c_count += 1
        if c_count > c_thresh then reassign all the clusters
    else
        add state embedding s_t' to c_sel
    end
end

C. Multi-Agent Reinforcement
Instead of letting a single agent handle the execution of all incoming recommendation requests, we divide the execution task among multiple homogeneous DQN-based agents. Unlike other multi-agent strategies, our framework differentiates the functioning of these agents in terms of the information they act upon. Each agent is responsible for learning the user behavior or characteristics corresponding to a specific item/action. This not only makes these models independent of each other but also addresses one of the main issues faced by other multi-agent recommendation strategies, i.e., instability [46]. Other strategies are prone to instability because one agent cannot assume the environment to be stationary if other agents are being updated simultaneously. In our case, since every agent acts on action-segregated information, the agents tend not to be affected by weight updates in other models. Another major problem in current systems is local credit assignment for a global reward [43]. When multiple agents cumulatively score the candidate items and finally present a list, it is difficult to assign credit for the obtained reward. However, since our framework assigns a separate agent to every action, the credit assignment is entirely straightforward. Although multi-agent systems are highly prevalent in applications such as game playing, their usage in recommender systems is still rare.

A significant advantage of our MMCR framework over other common single-agent recommendation strategies [14], [15] is that it gives relevant recommendations even for sparsely active users. Each model pays attention to only specific information and naturally does a better job of identifying which user and context combination suits the given item best.

Moreover, MMCR does not neglect relatively unpopular items. As observed in several real-life scenarios, some items are highly popular among users in general, while others are liked by a rather specific set of users. In traditional recommender systems [5], [9], [15], however, the unpopular items might never get recommended to the right user. MMCR has the ability to learn the preferred user characteristics for a particular item, as it maintains a separate model for each item. In this framework, we have opted for an item-centric approach rather than the prevalent user-centric model. Thus, even if an item appears less frequently than others, its chances of being recommended to the right user are not hurt.

D. Model Framework
Our Multimodel Contextual Reinforcement (MMCR) framework combines user, item, and user-item context features along with the user's positive interaction history to create a personalized and interactive recommender system. Instead of a single agent, the task of generating recommendations is spread over multiple RL agents. A detailed depiction of the proposed framework is given in Fig. 4. The step-by-step procedure for generating recommendations, following Algorithm 2, is as follows.
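Returning to the exploration strategy of Section IV-B, a compact Python sketch of the explore/exploit test and cluster bookkeeping in Algorithm 1 is given below. It assumes Euclidean distance and the defaults quoted above (n = 10 initial clusters, c_thresh = 15); the distance threshold, the re-clustering routine, and all identifiers are assumptions, since the paper does not fix them.

```python
import numpy as np

class ContextualClusterExploration:
    def __init__(self, dim, n_init=10, d_thresh=1.0, c_thresh=15, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = [c for c in rng.normal(size=(n_init, dim))]  # randomly initialized centroids
        self.members = [[] for _ in self.centroids]                   # embeddings assigned to each cluster
        self.d_thresh = d_thresh
        self.c_thresh = c_thresh

    def should_explore(self, s_t):
        """Explore only when s_t is far from every known centroid, i.e., an unforeseen situation."""
        d_min = min(np.linalg.norm(s_t - c) for c in self.centroids)
        return d_min > self.d_thresh

    def update(self, s_t_prime):
        """Assign the post-feedback embedding to its nearest cluster or open a new one;
        re-cluster once the cluster count exceeds c_thresh."""
        dists = [np.linalg.norm(s_t_prime - c) for c in self.centroids]
        nearest = int(np.argmin(dists))
        if dists[nearest] <= self.d_thresh:
            self.members[nearest].append(s_t_prime)
        else:
            self.centroids.append(np.array(s_t_prime, copy=True))
            self.members.append([s_t_prime])
            if len(self.centroids) > self.c_thresh:
                self._recluster()

    def _recluster(self):
        # Placeholder re-clustering: move each centroid to the mean of its members.
        # The paper leaves the re-clustering routine open, so this is only an assumption.
        self.centroids = [np.mean(m, axis=0) if m else c
                          for c, m in zip(self.centroids, self.members)]
```

As knowledge of the environment grows, fewer state embeddings fall outside the threshold, so exploration decays naturally without a hand-tuned schedule.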
TABLE I: Details About the Datasets Used. Here # Represents the Count.

features of users as input. This approach, however, fails to model the complex non-linearities of the data features.
2) W&D: Wide & Deep [9] is a widely used state-of-the-art deep learning model combining memorization (through a logistic regression on wide combinations of categorical features) and generalization (through a deep neural network embedding of the raw features) to predict the click label.
3) FM [24], [25]: Factorization Machine is a combination of factorization models and support vector machines (SVMs). It is advantageous for parameter estimation in sparse settings³ (a brief scoring sketch follows below).

³https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.png

2) Single Agent RL:
1) DRN [15]: The Deep Reinforcement Learning Framework for News Recommendation is a strategy that takes both user and item features into consideration. It uses double deep Q-networks (DDQNs) as a base framework along with a Dueling Bandit Gradient Descent-based exploration strategy.
2) DRR [14]: This stands for Deep RL-based Recommendation. A sequential regulation framework with an actor-critic RL-based pipeline is used for mapping relations between the users and the recommendation system. Moreover, explicit user-item interactions are used while modeling.
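For reference, the FM baseline scores an example with the standard second-order model from [24], ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σ_{i<j} ⟨vᵢ, vⱼ⟩ xᵢxⱼ. The NumPy sketch below uses the usual O(kn) reformulation; the dense-vector representation and the function name are assumptions made for brevity.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order Factorization Machine prediction for one feature vector x.
    w0: global bias, w: linear weights of shape (n,), V: latent factors of shape (n, k).
    Uses sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]."""
    linear = w0 + w @ x
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise
```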
Algorithm 2: Overall Methodology
Input: user features (u), context features (c), user-specific item history (i_h), item features (a), E = [ ]
for i ∈ (1 to N) do initialize DQN model Q_i
for session i = 1 : data.length do
    arms = [ ]
    right_arm = l_i
    U_i = (u_0, u_1, ..., u_m)
    C_i = (c_0, c_1, ..., c_m)
    i_h = [ ]
    s_i = State_Generator(U_i, C_i, i_h)
    if Explore_Module(s_i) == True then
        EXPLORATION:
        for k in K do
            arms.append(random(arm_list))
        end
    else
        EXPLOITATION:
        for j in Items do
            scores.add(Q_j.predict(s_i))
        end
        sort scores in descending order
        arms.add(top k items in scores)
    end
    Update i_h with right_arm
    s_i' = State_Generator(U_i, C_i, i_h)
    for m in arms do
        if m == l_i then
            reward = 1
            Q_m.fit([s_i, m, reward, s_i'])
            E.add([s_i, m, reward, s_i'])
        else
            reward = 0
            if Random(num) < thresh then
                E.add([s_i, m, reward, s_i'])
                Q_m.fit([s_i, m, reward, s_i'])
            end
        end
    end
    if E.length % Batch_size == 0 then
        for each experience t in E do
            fit the model of t's arm on t
        end
        E = [ ]
    end
end
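A condensed Python sketch of one pass of Algorithm 2 follows, assuming one DQN-style model per item exposing predict(state) and fit(transition), a state_generator callable standing in for the SG module, and a cce object with the should_explore/update interface sketched earlier. Every identifier and default value here is illustrative, not taken from the paper's code.

```python
import numpy as np

def run_session(user_feats, context_feats, history, item_feats, clicked_item,
                models, cce, state_generator, k=5, alpha=0.9,
                neg_keep_prob=0.1, buffer=None, batch_size=64):
    """One pass of Algorithm 2: build the state, pick k arms, apply the click feedback,
    and update the per-item models plus the experience buffer."""
    buffer = [] if buffer is None else buffer
    s = state_generator(user_feats, context_feats, history)

    if cce.should_explore(s):                                   # EXPLORATION: k random arms
        arms = np.random.choice(len(models), size=k, replace=False)
    else:                                                       # EXPLOITATION: top-k per-item Q-scores
        scores = np.array([m.predict(s) for m in models])
        arms = np.argsort(scores)[::-1][:k]

    # Only the clicked item yields a positive reward; eq. (4) updates the history.
    history = alpha * history + item_feats[clicked_item]
    s_next = state_generator(user_feats, context_feats, history)

    for m in arms:
        reward = 1 if m == clicked_item else 0
        if reward == 1 or np.random.rand() < neg_keep_prob:     # keep all positives, subsample negatives
            transition = (s, m, reward, s_next)
            models[m].fit(transition)
            buffer.append(transition)

    if len(buffer) >= batch_size:                               # periodic experience replay
        for s_b, m_b, r_b, sn_b in buffer:
            models[m_b].fit((s_b, m_b, r_b, sn_b))
        buffer.clear()

    cce.update(s_next)
    return history, buffer
```

Keeping every positive transition while subsampling negatives mirrors the thresholded negative storage in Algorithm 2 and keeps the per-item buffers from being dominated by non-clicks.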
3) Exploration Strategies: Exploration constitutes a major part of any multi-arm bandit-based recommender system. We combined our proposed model with various existing exploration strategies to compare their performance against CCE.
1) Epsilon-Greedy [47], [48]: This is the most basic strategy, which explores arms at random if a randomly generated number has a value less than a chosen parameter ε and prefers exploitation otherwise.
2) UCB [48], [49]: This leverages the uncertainty in the action-value pair. A confidence interval is formed for each item using a suitable distribution, which determines the confidence that the actual action-value of the item lies within this bound. Results have been compared with the Linear and Bootstrapped versions of UCB.
3) TS [47], [49]: Thompson Sampling essentially involves selecting an arm with a probability proportional to it being the best arm. Results have been compared with the Linear and Bootstrapped versions of TS.
4) Adaptive Greedy (AG) [49]: This is a variant of classic ε-greedy; however, during execution, ε is varied in a controlled manner. A threshold value for the probability scores is generated and used to decide between exploration and exploitation. The threshold changes when the number of trials exceeds a chosen window size, so as to adapt the model to give better predictions. Two variants of this strategy, namely AAG and AGT, have been compared here.
5) AE [49]: This strategy uses active learning as its base concept. It tries to select an observation depending upon the amount of leverage it would give the model if its label were known.

C. Evaluation
1) Simulated Online Environment: The evaluation of all the methodologies was done in an online simulation. We adhered to the following guidelines.
1) The data were processed in increasing order of timestamp. This maintains the time-dependent pattern of the data and mirrors how users would enter the system in a real-time scenario.
2) Although the traditional methods perform separate training, our framework adopts learning while testing. Immediate feedback and experience replay provide continuous learning. This is elaborated in Section IV-D.

The simulation takes each state (depicting a user entering the system) as an entry into the system. The system predicts the items for each state, which are then compared with the actual label (right arm). The right_arm is not known to the system before the comparison, similar to a real-world scenario. For example, if a user enters a news website, k news items will be recommended to her. She can decide to choose one or none. If any of the items are clicked, it counts as a positive reward; otherwise, the reward is negative.

2) Online Evaluation Metrics: The following metrics were used for the evaluation of all the techniques compared in this article.
1) CTR: Click-Through Rate is an effective measure for assessing how well recommender systems understand user click behavior. It is calculated for each item as

$\mathrm{CTR}_i = \dfrac{\text{Total number of clicks}}{\text{Total number of impressions}}$.  (6)

Further, the average of the CTR values of all items is calculated to obtain the final CTR value:

$\mathrm{CTR} = \dfrac{\sum_{i=0}^{N} \mathrm{CTR}_i}{N}$.  (7)

2) Relative CTR: This measures how a recommendation strategy performs on a given dataset
or platform as compared to other strategies applied to the same dataset. The Relative CTR for the i-th strategy can be calculated as

$\text{Relative CTR} = \dfrac{\mathrm{CTR}_i}{\text{CTR of other strategies}}$.  (8)

3) CMR: Cumulative Mean Reward is a useful measure for continuously observing and analyzing the performance of recommender systems over time. At any given point, if R is the number of rows evaluated so far, CMR is calculated as

$\mathrm{CMR}_R = \dfrac{\sum_{i=0}^{R} \mathrm{Reward}_i}{R}$.  (9)

3) Offline Evaluation Metrics: The following popular recommender-system evaluation measures have been compared for the offline evaluation in this article.
1) Precision at k: Precision is defined as the fraction of the total number of predicted recommendations (i.e., k) that are relevant to the user.

Fig. 5. CMR trends of RL-based techniques for k = 1. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.

D. Results
The proposed technique is compared with the state-of-the-art recommendation strategies as well as the various exploration strategies mentioned in Section V-B. The exploration strategies are implemented on our base pipeline. Results are obtained in both online and offline settings. Tables IV and V show the performance of CCE as an exploration strategy over the rest; CCE largely outperformed the mentioned state-of-the-art techniques. Furthermore, the performance of MMCR in comparison to state-of-the-art RL and non-RL-based strategies is shown in Tables II and III. For consistency with the non-RL-based strategies, we only show results on the test set (i.e., 30% of the entire dataset). To compare the RL-based techniques over time, we plot the CMR graphs shown in Figs. 5–7. The subsequent paragraphs highlight the intuition behind the observed trends.

1) Context-Based Exploration: MMCR focuses on user-feature- and context-based exploration. It can be observed from Tables IV and V that CCE performs best, irrespective of the number of items recommended (k), for both datasets. CCE tends to shift gradually from exploration to exploitation as our knowledge of the environment increases. The positive effect of this shift can be seen in the significantly higher CTR, Precision, and Recall values for CCE as compared to the other exploration strategies. Being a context-based strategy (Section IV-D), it deeply narrows the possibility of random items being shown to the users. AAG and AE give comparable performance to CCE. These exploration strategies change the amount of exploration, generally reducing it as the model learns over time; however, AAG and AE do not use contextual information for this decay in exploration. The ε-greedy, UCB, and TS-based explorations do not do justice, as ε-greedy relies on a consistent degree of randomness during exploration, while UCB and TS require multiple instances of the same item to be shown before the exploration stabilizes. In a brief period, these two techniques can hurt performance.
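The online and offline metrics reported in Tables II–V reduce to a few simple ratios; a sketch of (6), (7), (9), and precision at k is given below, assuming per-item click/impression counts and a chronologically ordered reward log (function names are illustrative).

```python
import numpy as np

def item_ctr(clicks, impressions):
    """Eq. (6): click-through rate of a single item."""
    return clicks / impressions

def overall_ctr(clicks_per_item, impressions_per_item):
    """Eq. (7): average of the per-item CTR values."""
    rates = [c / i for c, i in zip(clicks_per_item, impressions_per_item) if i > 0]
    return sum(rates) / len(rates)

def cumulative_mean_reward(rewards):
    """Eq. (9): running mean of rewards, one value per evaluated row, used for the CMR curves."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards) / np.arange(1, len(rewards) + 1)

def precision_at_k(recommended, relevant, k):
    """Fraction of the k recommended items that the user actually found relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k
```

Relative CTR in (8) is then simply the ratio of two such CTR values computed for different strategies on the same data.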
TABLE II: Online and Offline Evaluation for the TalkingData Dataset. Best Results Are Highlighted in Bold.
TABLE III: Online and Offline Evaluation for the Globo Dataset. Best Results Are Highlighted in Bold.
TABLE IV: Variation of Exploration Strategies on the TalkingData Dataset. Best Results Are Highlighted in Bold.
TABLE V: Variation of Exploration Strategies on the Globo Dataset. Best Results Are Highlighted in Bold.
2) Multi-Agent Versus Single-Agent: When observing CMR in Figs. 5(a), 6(a) and (b), and 7(a), it is evident that DRN, DRR, and MMCR have similar curves in the beginning. Over time, the performance of MMCR supersedes the other RL-based strategies. This trend can be credited to the item-based, independent multi-agent pipeline of MMCR. Single-agent techniques eventually give less weightage to sparsely used items. This is a disadvantage for recommendation, as some items may have an acquired user set; e.g., a TV-show recommender has generic shows as well as shows with a limited user base. In a single-model technique such as DRR or DRN, the latter type of show may get sidelined because it is not frequently chosen. With independent arm-based modeling, this problem is solved, as the user type is easily recognized by the model of that particular item, and thus the item is correctly recommended. Overall, MMCR tends to
outperform DRN and DRR across both datasets for all values of k (see Figs. 5–7). The same trend can also be reaffirmed from Tables II and III for the online as well as offline metrics.

3) User-Item Weighted Interaction History: The state embedding representation used by MMCR (Section IV-A) uses item information fruitfully by building embeddings from the user-item interaction history. DRN and DRR use item information but do not give adequate weightage to the user-item interaction history. The item history of a particular user can give deep insight into user behavior. MMCR ensures that newer item features carry greater weight in the interaction history than older ones. Tables II and III, as well as the CMR graphs in Figs. 5–7, verify our claim with significantly improved results throughout for MMCR. In Figs. 5(a), 6(a), and 7(a), the initial CMR curves show that DRN and DRR are competitive with the MMCR technique; in contrast, in Figs. 5(b), 6(b), and 7(b), the MMCR technique has an edge from the beginning. Nevertheless, over time, MMCR captures the trends far better than DRR and DRN. All the techniques eventually achieve stability in the CMR trends.

Although the current RL-based techniques such as DRN and DRR give improved results over non-RL-based strategies such as W&D and FM (see Tables II and III), they are surpassed by MMCR. The non-RL techniques fail to perform on real-time data, as the learning during training may not be sufficient to recognize the trends during testing. MMCR has three novel properties, namely the item-based multi-agent framework, the non-conventional state embedding representation, and the novel CCE exploration (see Section IV-D), which significantly improve its performance.

Fig. 6. CMR trends of RL-based techniques for k = 3. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.
Fig. 7. CMR trends of RL-based techniques for k = 5. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.
VI. CONCLUSION AND FUTURE WORK
In this article, we propose an item-based multi-agent recommender system with the capability to tackle real-time dynamic recommendations with ease. This is especially beneficial for sparsely yet periodically occurring item sets that may be ignored in a single-agent strategy. The technique is based on a contextual multi-arm bandit with a novel exploration strategy called CCE, which surpasses state-of-the-art exploration strategies by using effective clustering with minimal randomness. Moreover, using user and item information with an inclination to learn more from recent item trends through the user-item interaction history has increased the potential for relevance in recommendations. The continuous updates and experience replay ensure that current trends are learned and keep being reinforced throughout. Our extensive experimentation has shown that these properties significantly improve results on both offline and online evaluation metrics. For future work, there is scope for improving scalability by reducing the number of models in the multi-agent RL setup or by incorporating these properties into a single-agent model. Moreover, new clustering techniques can be tested for the CCE exploration strategy to advance it further.

REFERENCES
[1] J. Clement. (2021). Selected Online Companies Ranked by Total Digital Advertising Revenue From 2012 to 2020. [Online]. Available: https://www.statista.com/statistics/205352/digital-advertising-revenue-of-leading-online-companies/
[2] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: Scalable online collaborative filtering," in Proc. 16th Int. Conf. World Wide Web (WWW), 2007, pp. 271–280.
[3] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl, "Item-based collaborative filtering recommendation algorithms," in Proc. 10th Int. Conf. World Wide Web (WWW), 2001, pp. 285–295.
[4] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, "Collaborative filtering recommender systems," in The Adaptive Web. Berlin, Germany: Springer, 2007, pp. 291–324.
[5] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proc. 15th Int. Conf. Intell. User Interfaces (IUI), 2010, pp. 31–40.
[6] M. Kompan and M. Bieliková, "Content-based news recommendation," in Proc. Int. Conf. Electron. Commerce Web Technol. Cham, Switzerland: Springer, 2010, pp. 61–72.
[7] M. J. Pazzani and D. Billsus, "Content-based recommendation systems," in The Adaptive Web. Berlin, Germany: Springer, 2007, pp. 325–341.
[8] L. Zheng, V. Noroozi, and P. S. Yu, "Joint deep modeling of users and items using reviews for recommendation," in Proc. 10th ACM Int. Conf. Web Search Data Mining, Feb. 2017, pp. 425–434.
[9] H.-T. Cheng et al., "Wide & deep learning for recommender systems," in Proc. 1st Workshop Deep Learn. Recommender Syst., 2016, pp. 7–10.
[10] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[11] G. Tesauro, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58–68, Mar. 1995.
[12] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, 2013.
[13] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, 2015.
[14] F. Liu et al., "Deep reinforcement learning based recommendation with explicit user-item interactions modeling," 2018, arXiv:1810.12027. [Online]. Available: http://arxiv.org/abs/1810.12027
[15] G. Zheng et al., "DRN: A deep reinforcement learning framework for news recommendation," in Proc. World Wide Web Conf. (WWW), 2018, pp. 167–176.
[16] A. Kabra, A. Agarwal, and A. S. Parihar, "Cluster-based deep contextual reinforcement learning for top-k recommendations," in Proc. Int. Conf. Comput. Commun. Syst. (I3CS), vol. 170. Shillong, India: Springer, 2021, p. 125.
[17] M. Tokic, "Adaptive ε-greedy exploration in reinforcement learning based on value differences," in Proc. Annu. Conf. Artif. Intell. Ho Chi Minh City, Vietnam: Springer, 2010, pp. 203–210.
[18] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," J. Mach. Learn. Res., vol. 3, pp. 397–422, Nov. 2003.
[19] H. Steck, "Item popularity and recommendation accuracy," in Proc. 5th ACM Conf. Recommender Syst. (RecSys), 2011, pp. 125–132.
[20] Z. Cui et al., "Personalized recommendation system based on collaborative filtering for IoT scenarios," IEEE Trans. Services Comput., vol. 13, no. 4, pp. 685–695, Jul. 2020.
[21] F. Ge, "A user-based collaborative filtering recommendation algorithm based on folksonomy smoothing," in Advances in Computer Science and Education Applications. Berlin, Germany: Springer-Verlag, 2011, pp. 514–518.
[22] M. Deshpande and G. Karypis, "Item-based top-N recommendation algorithms," ACM Trans. Inf. Syst., vol. 22, no. 1, pp. 143–177, 2004.
[23] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Comput., vol. 42, no. 8, pp. 30–37, Aug. 2009.
[24] S. Rendle, "Factorization machines," in Proc. IEEE Int. Conf. Data Mining, Dec. 2010, pp. 995–1000.
[25] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, "Fast context-aware recommendations with factorization machines," in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. Inf. (SIGIR), 2011, pp. 635–644.
[26] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proc. 26th Int. Conf. World Wide Web, 2017, pp. 173–182.
[27] K. Wang, L. Xu, L. Huang, C.-D. Wang, and J.-H. Lai, "SDDRS: Stacked discriminative denoising auto-encoder based recommender system," Cognit. Syst. Res., vol. 55, pp. 164–174, Jun. 2019.
[28] R. Bellman, Dynamic Programming, vol. 707. New York, NY, USA: Courier Corporation, 2013.
[29] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4026–4034.
[30] N. Taghipour and A. Kardan, "A hybrid web recommender system based on Q-learning," in Proc. ACM Symp. Appl. Comput. (SAC), 2008, pp. 1164–1168.
[31] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[32] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, "Recommendations with negative feedback via pairwise deep reinforcement learning," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 1040–1048.
[33] H. Liang, "DRprofiling: Deep reinforcement user profiling for recommendations in heterogenous information networks," IEEE Trans. Knowl. Data Eng., early access, May 29, 2020, doi: 10.1109/TKDE.2020.2998695.
[34] W. Zhao et al., "Leveraging long and short-term information in content-aware movie recommendation via adversarial training," IEEE Trans. Cybern., vol. 50, no. 11, pp. 4680–4693, Nov. 2020.
[35] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 1291–1307, Nov. 2012.
[36] L. Wang, W. Zhang, X. He, and H. Zha, "Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 2447–2456.
[37] X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for page-wise recommendations," in Proc. 12th ACM Conf. Recommender Syst., Sep. 2018, pp. 95–103.
[38] X. Zhao, L. Zhang, L. Xia, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for list-wise recommendations," 2017, arXiv:1801.00209. [Online]. Available: http://arxiv.org/abs/1801.00209
[39] F. Buccafurri, D. Rosaci, G. M. L. Sarnè, and L. Palopoli, "Modeling cooperation in multi-agent communities," Cognit. Syst. Res., vol. 5, no. 3, pp. 171–190, Sep. 2004.
[40] P. Tommasino, D. Caligiore, M. Mirolli, and G. Baldassarre, "A reinforcement learning architecture that transfers knowledge between skills when solving multiple tasks," IEEE Trans. Cognit. Develop. Syst., vol. 11, no. 2, pp. 292–317, Jun. 2019.
[41] Y. Rizk, M. Awad, and E. W. Tunstel, "Decision making in multiagent systems: A survey," IEEE Trans. Cognit. Develop. Syst., vol. 10, no. 3, pp. 514–529, Sep. 2018.
[42] J. Feng et al., "Learning to collaborate: Multi-scenario ranking via multi-agent reinforcement learning," in Proc. World Wide Web Conf. (WWW), 2018, pp. 1939–1948.
[43] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–9.
[44] H. Huang, Q. Zhang, and X. Huang, "Mention recommendation for Twitter with end-to-end memory network," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 1872–1878.
[45] A. Tampuu et al., "Multiagent cooperation and competition with deep reinforcement learning," PLoS ONE, vol. 12, no. 4, Apr. 2017, Art. no. e0172395.
[46] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 2, pp. 156–172, Mar. 2008.
[47] S. Agrawal and N. Goyal, "Thompson sampling for contextual bandits with linear payoffs," in Proc. Int. Conf. Mach. Learn., 2013, pp. 127–135.
[48] J. Vermorel and M. Mohri, "Multi-armed bandit algorithms and empirical evaluation," in Proc. Eur. Conf. Mach. Learn. Cham, Switzerland: Springer, 2005, pp. 437–448.
[49] D. Cortes, "Adapting multi-armed bandits policies to contextual bandits scenarios," 2018, arXiv:1811.04383. [Online]. Available: http://arxiv.org/abs/1811.04383

Anu Agarwal received the B.Tech. degree in computer science and engineering from Delhi Technological University, Delhi, India, in 2020.
She has worked with the Machine Learning Laboratory, Delhi Technological University. She is currently working as a Senior Engineer with the Advanced Sensor Algorithm Development Team, Samsung Semiconductors India Research and Development Centre, Bengaluru, Karnataka, India. Her current research interests include reinforcement learning, 3-D scene reconstruction, neural network optimization, computer vision, and deep learning.