Potent Real-Time Recommendations Using Multimodel Contextual Reinforcement Learning
Manuscript received October 12, 2020; revised May 9, 2021 and July 16, 2021; accepted July 21, 2021. Date of publication August 5, 2021; date of current version April 1, 2022. (Anubha Kabra, Anu Agarwal, and Anil Singh Parihar contributed equally to this work.) (Corresponding author: Anil Singh Parihar.)
Anubha Kabra was with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India. She is now with Adobe Systems, Noida 201304, Uttar Pradesh, India (e-mail: [email protected]).
Anu Agarwal was with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India. She is now with the Samsung Semiconductors India Research and Development Centre, Bengaluru 560037, Karnataka, India (e-mail: [email protected]).
Anil Singh Parihar is with the Machine Learning Research Laboratory, Department of Computer Science and Engineering, Delhi Technological University, New Delhi 110042, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSS.2021.3100291

NOMENCLATURE
Q      Item model.
R      Reward function.
a      Action.
s      State.
r      Reward.
U      User set.
E      Experience buffer.
i_h    User-specific item history.

I. INTRODUCTION
IN TODAY'S fast-paced commerce, efficient recommendations to user bases have become a priority for most industries. From media visuals to advertising and e-commerce, the recommendation of items holds a central place. Hence, to keep pace with a continuously changing environment, user base, and market inclinations, a real-time, competent, and personalized system is necessary. For instance, Amazon, Facebook, Google, and most other big and small firms earn a large share of their revenue from recommending advertisements [1]. Retaining customers by showing them the right recommendation at the right time becomes the ultimate goal in a competitive marketplace. Over time, a plenitude of ideas has been proposed to solve the real-time item recommendation problem. A variety of static techniques, such as modifications of collaborative filtering (CF) [2]–[4] and certain hybrid methodologies [5], have been proposed. However, these are static and do not perform efficiently in real-time scenarios. Static recommendation strategies have a fixed training and testing set, which restricts them from capturing changes in user behavior over time: the recommendations shown to users during testing depend solely on the information learned during training. For instance, if a user prefers type A items during training but shifts to type B later, the model might still recommend type A items because it never learned the change. It ignores the stochastic nature of the user. Fig. 1 reflects the dynamic nature of the two real-time
datasets used in our experimentation. We plotted the occurrence of five random items against time for each dataset. The occurrence of an item clearly varies over the timeline; hence, training on a fixed segment of data cannot guarantee that the same trend will hold during testing. In Fig. 1, for the TalkingData dataset, if the training period covers items 1, 3, and 4 and the trend shifts toward items 2 and 5 after training, the items recommended to users during testing would still be 1, 3, and 4 instead of 2 and 5.

Content-based recommendation is a separate class of recommendation algorithms that leverages user and item features [6], [7]. A segment of these algorithms is built on top of deep learning networks [8], [9], since they can capture non-linear user-item interactions more accurately. However, many content-based techniques [6], [9] suffer from the cold-start problem. This problem occurs because insufficient information is available to make initial recommendations when new users enter the system; thus, random recommendations are shown to the user. Solving the cold-start problem is crucial given the continuously changing user base in a real-time environment. In another line of research, reinforcement learning (RL) has been gaining popularity in gaming [10], [11], automation, and robotics [12], [13]. The general norm in an RL-based problem is to have one or more agents in an environment, according to which they take actions. The goal is to take actions in the environment that maximize the long-term reward. The task of generating recommendations can be viewed as an RL problem, the purpose being to maximize customer retention and provide a quality user experience [14]–[16].

Another major challenge observed in the recommender system literature is the neglect of user bases that do not always prefer mainstream or popular content [15]. The preferences of such users can be captured through item-specific segregation of the user space. RL techniques include a module that decides whether to explore or exploit, and effective exploration can advance the working of an RL model. However, state-of-the-art exploration techniques may result in high randomness [17], which adversely affects the model, and some exploration strategies take a long time to learn a trend [18]. In a fast-paced, real-time scenario, this negatively affects the recommendations.

Hence, the current state-of-the-art techniques face one or more of the following challenges.
1) They fail on the dynamic front and hence fail to give real-time recommendations.
2) They suffer from the cold-start problem.
3) They neglect unconventional item choices.
4) They fail to perform optimal yet effective exploration.

To tackle all the aforementioned challenges in delivering dynamic and personalized recommendations, we propose the Multimodel Contextual Reinforcement Learning (MMCR) framework. To address the issue of model stationarity, MMCR uses RL-based Deep Q-Networks. Its advantages are twofold. First, the use of deep neural networks enables efficient representation of all underlying non-linearities present between dataset features. Second, RL leverages the sequential nature of the problem and grasps any observable change in user preferences via continuous updates and feedback.

To address the problem of new users and users with unconventional choices, MMCR has a provision for separate models corresponding to every item in the itemset. This enables each model to identify the characteristics of its audience properly. Thus, whenever a new user or a user with unconventional choices is encountered, the multimodel framework is able to provide relevant recommendations based on user characteristics. It also ensures that popular items do not have an unfair edge over unpopular ones. Furthermore, we propose a novel contextual cluster strategy for exploration. It works on the principle that random actions should be explored only in an unforeseen situation. Clusters of user-context embeddings are formed to make the implementation of this principle feasible and efficient. As our knowledge of the environment gradually increases, the number of random items in the recommendation list decreases.

In this article, any subsequent mention of the term environment refers to the complete set of users and items available in the dataset. The system refers to our MMCR framework, and the term model corresponds to the per-item neural network. State refers to the embedding generated using user features, session-context features, and the user's item history. The terms arm and action, used interchangeably in the text, indicate items in the itemset. When a user requests recommendations from the system, his/her specific features along with session-context features serve as the input. This input is further processed by the state generator (SG) module to obtain a state embedding. Based on this state, the system selects the k most relevant items and presents them to the user as output to obtain feedback. However, only the clicked item obtains actual feedback from the user in the form of a positive reward. All obtained feedback is stored in the system memory to be replayed on the respective models after fixed intervals.

We summarize our novelty and key contributions as follows.
1) We propose a novel multimodel framework that maintains an independent agent per item. Such a framework gives this architecture an edge over traditional recommendation techniques by integrating users with unconventional item choices.
2) We introduce a novel Contextual Cluster Exploration (CCE) strategy, which provides an adaptive exploration rate by forming clusters based on the existing knowledge of the environment. CCE considerably outperformed traditional exploration strategies.
3) We propose a first-of-its-kind method of incorporating the user's interaction history with the system into the input state. This enables the model to learn effectively from the user's previous interactions.
4) We leverage the benefits of both user and item features to recommend relevant items.
5) We demonstrate the effectiveness of our framework through thorough experimentation on two publicly available real-world datasets.

The rest of this article is structured as follows. Section II covers the background and related work, Section III gives the problem definition, and Section IV focuses on the proposed work and our pipeline for the MMCR technique. Section V presents comprehensive experimentation and results evaluation. Section VI concludes this article and discusses the scope of future work in this domain.
II. BACKGROUND AND RELATED WORK
This section is divided into three subsections based on the recommendation strategy used. The first subsection summarizes the prior art that treats recommendation as a supervised learning problem, the second lists the prominent strategies based on single-agent RL, and the last describes the literature that treats it as a multi-agent RL problem.

A. Traditional Recommendation Strategies
A significant amount of research has been done to solve the problem of generating recommendations that maximize the cumulative gain. The initial literature on recommendation describes systems that are purely based on item popularity [19]. A critical class of recommenders that followed was content-based filtering [6], [7], which takes into account the correlations between user and item features. Another major breakthrough was achieved with the advent of CF [3], [4], [20], which can be further categorized into user-based, item-based, and hybrid filtering. In user-based filtering, recommendations are made based on the popular choices of the most similar users [21]. Item-based filtering [22] shows items that are highly similar to the ones the user has previously selected. Other variants, such as Matrix Factorization [23], Factorization Machines [24], [25], and logistic regression models, have also given a performance boost to CF-based methods. Deep neural networks have been incorporated into recommenders to efficiently model complex non-linearities [26]. Wang et al. [27] use denoising auto-encoders on top of a Matrix Factorization model to extract useful features from the available side information. The key feature that distinguishes our model from these techniques is our framework's ability to learn at every step of training as well as testing. Hence, it can learn all the dynamic trends from the incoming data, unlike traditional techniques whose learning is static and limited to training-data patterns.

B. Single-Agent RL-Based Recommendation
In a generic single-agent RL setting, the agent observes a state s_t from the environment at timestep t. It uses a policy π to obtain the best action for the given state s_t. The action is pushed to the environment, and a reward r_t is obtained, leading to the next state s_{t+1}. The objective is to tune the parameters θ of the policy π to obtain the maximum cumulative expected reward. RL frameworks can be broadly divided into value-based and policy-based methods. Both methods involve the system in a current state s_t, and γ refers to the discount factor.

1) Value-Based RL: Value-based RL involves the estimation of an action-value function. This function determines the expected sum of discounted future rewards for a specific action a_t taken in the current state s_t. The Q-value function (Q) described above can be expressed mathematically using the Bellman equation [28] as follows:

$Q(s) = \mathbb{E}\big[\,r_{t+1} + \gamma\, Q(s_{t+1}) \mid s_t = s\,\big]$.  (1)

Deep Q-Networks (DQN) [31] use deep neural networks to estimate the Q-function, and Osband et al. [29] proposed deep exploration for DQN via bootstrapping. The loss function (L) minimized by DQN is

$L(s_t, a_t) = \big(\,r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s_t, a_t; \theta)\,\big)^2$.  (2)

Here, θ' refers to the parameters of the target network and θ to the parameters of the online network. The target network is used to generate the Q-value predictions for a certain fixed number of iterations; after these iterations, the target network's weights are updated to θ.

Several attempts have been made to leverage Q-Learning [30] and DQN frameworks for effective recommendations [31], [32]. Zheng et al. [15] use DQN as a base model with a Dueling Bandit Gradient Descent strategy for exploration. User profiling has also been cast as a value-based RL problem for generating recommendations [33]. Zhao et al. [34] leverage long-term and session-based knowledge using adversarial training and RL for recommending movies.

2) Policy-Based RL: Policy-based RL is especially helpful in the case of continuous state and action representations, which cannot be dealt with efficiently using Q-Learning or DQN. It uses gradient ascent to optimize the parameters for maximizing the expected reward. In policy-based methods such as actor-critic [35], the actor accepts the state as input and outputs a continuous action value. The action is further passed to the critic network along with the input state vector to generate the Q-value for the state-action pair. The gradient of the objective J for actor-critic is

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R\Big]$  (3)

where R is the sum of gamma-discounted rewards. Furthermore, the system generates a list of recommendations consisting of all top-scoring items. Several recommendation strategies [14], [36]–[38] use actor-critic as the base RL model with modifications in state representation, exploration functions, etc., to provide better results.
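As a concrete illustration of the update in (2), the following minimal NumPy sketch computes the squared TD error for a single transition. The q_online/q_target callables (returning one Q-value per action) and all parameter names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def dqn_td_loss(q_online, q_target, theta, theta_target,
                s_t, a_t, r_t, s_next, gamma=0.99):
    """Squared TD error from (2):
    (r_t + gamma * max_a' Q(s_{t+1}, a'; theta') - Q(s_t, a_t; theta))^2."""
    bootstrap = r_t + gamma * np.max(q_target(theta_target, s_next))  # frozen target network theta'
    prediction = q_online(theta, s_t)[a_t]                            # online network theta
    return (bootstrap - prediction) ** 2
```

As described above, the target parameters θ' stay frozen for a fixed number of iterations and are then synchronized with the online parameters θ.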
C. Multi-Agent RL-Based Recommendation
A multi-agent RL strategy consists of multiple RL agents trying to collectively maximize the discounted cumulative long-term reward. However, with multiple agents acting simultaneously [39]–[41], each agent has only a partial view of the environment. The most common approach followed in multi-agent systems is decentralized execution with standard Q-value function evaluation [42]. However, this involves a significant credit-assignment issue for each agent. The global action performed is usually the combined result of the execution performed by all agents together; hence, the global action receives a global reward that cannot be linked to any one agent in particular [43]. Huang et al. [44] compare the global reward with the reward obtained by taking a default outcome for all other agents except the one for which the reward is being calculated. Independent Q-learning [45] models the multi-agent setting by learning wholly separate and independent DQN networks, with the environment as the only source of interaction between them.

However, one problem that remains is the convergence and stability of the system. Since all other agents are updating their weights simultaneously, no single agent can consider its environment stationary, and hence it ends up chasing an ever-moving target. Moreover, most of the available literature on multi-agent systems deals only with game-playing scenarios and not with recommendation systems. Our approach uses a separate agent corresponding to each available action (which is feasible in a recommendation-system scenario, as there is only a discrete number of items to choose from) in an attempt to overcome the issues mentioned above.

Fig. 1. Occurrences of five random items over time for the TalkingData and Globo datasets, respectively.

III. PROBLEM DEFINITION
Whenever a user i with user features u_i and context features c_i makes a recommendation request to the system, our model generates a list of the top-k most relevant recommendations for that user. These recommendations are based on the results obtained from N RL agents, where N is the fixed number of items in the available itemset. The notations used throughout this article are summarized in the Nomenclature section.

IV. PROPOSED WORK
In this section, we elaborate extensively on the segments of our proposed architecture. The novelty of the proposed work is as follows.
1) We develop a novel multimodel architecture for recommendation systems using a per-item agent strategy with independent credit assignment.
2) The SG module is a first-of-its-kind technique to incorporate the user interaction history and user features into the input. This enables the model to learn effectively from the user's previous interactions.
3) CCE is a new method to perform enriched exploration by relying on the present knowledge of the environment to make smart and less randomized choices.

We first explicate the SG module, the novel exploration strategy (CCE), and multi-agent RL to define the concepts used in our framework. We then elucidate our model architecture, which incorporates all the aforementioned segments.

A. SG Module
This module leverages the extra information present in commonly available user-interaction logs, such as user, item, and context features. Fig. 2 shows its overall working. Its functioning can be broadly divided into two parts.
1) Maintaining the User's Interaction History: Simple aggregation of all the item labels that the user has previously clicked not only consumes excessive storage but also provides very little insight into user behavior. Instead, we propose to maintain the interaction history in terms of item-feature vectors. Hence, whenever a new item a is to be added to a user's history vector h_u, the vector is updated as follows:

$h_u = \alpha h_u + a$.  (4)

Here, α is an age multiplier used to decrease the impact of old feature values as they move further into the past and to prevent the feature values from exploding.
2) Generating State Embeddings: An autoencoder is used at this step to generate uniform embeddings. The user and context features from the input request are concatenated with the interaction-history feature vector i_h to form $v = [u_t, c_t, i_h]$. This concatenated vector v is fed as input to the autoencoder, and the output from the autoencoder is the required state embedding s_t.
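A minimal sketch of the SG module's two steps, assuming dense NumPy feature vectors, an α in (0, 1), and an `encode` callable standing in for the trained autoencoder's encoder; all names and the default α are illustrative assumptions.

```python
import numpy as np

def update_history(h_u, item_feats, alpha=0.9):
    """Eq. (4): age-discount the running history vector, then add the newly clicked item's features."""
    return alpha * h_u + item_feats

def state_embedding(encode, user_feats, context_feats, h_u):
    """Concatenate user, session-context, and interaction-history vectors,
    then compress them with the autoencoder's encoder to obtain the state embedding s_t."""
    v = np.concatenate([user_feats, context_feats, h_u])
    return encode(v)
```

Because α < 1, older clicks decay geometrically, which keeps the history vector bounded while still emphasizing recent behavior.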
B. CCE: Contextual-Cluster Exploration Strategy
For deciding between exploration and exploitation, we propose a novel strategy called CCE. It is represented diagrammatically in Fig. 3 (a Python sketch of this procedure is given at the end of this section). As described in Algorithm 1, we first initialize the centroids of n experience clusters randomly; the default value of n is 10. For every input request at time t, the distance of the generated state embedding s_t from each current centroid is calculated using the Euclidean distance. If the minimum distance d_min is greater than the decided threshold d_thresh, we explore; otherwise, we exploit. During exploration, k random items are shown to the user; otherwise, the k best items are chosen. After showing the items, the state embedding for the current user is updated according to the items clicked by him/her (s_t'). The distance of s_t' from all existing clusters is then calculated. If the minimum distance of s_t' is smaller than d_thresh, s_t' is assigned to the nearest cluster; otherwise, a new cluster is formed. However, if the total number of clusters exceeds c_thresh (default: 15), all existing user state embeddings are re-clustered. A clear representation of the same is shown in Fig. 1(b).

Algorithm 1: Context Clustering Exploration
Initialize: c_thresh; n random experience clusters; define the centroid c_i of each cluster randomly
for each user state embedding s_t do
    d_min = MAX_VAL, c_sel = None
    for i = 1 to n cluster centroids do
        d = Euclidean_dist(s_t, c_i)
        if d < d_min then d_min = d
    end
    if d_min > d_thresh then
        EXPLORE: recommend k random items to the user
    else
        EXPLOIT: recommend the k best items to the user
    end
    After showing the user the k items:
    for j in the k items do
        if Reward(j) = 1 then s_t' = updated state embedding
        else s_t' = original state embedding
    end
    d_min = MAX_VAL
    for i = 1 to n cluster centroids do
        d = Euclidean_dist(s_t', c_i)
        if d < d_min then d_min = d, c_sel = c_i
    end
    if d_min > d_thresh then
        assign a new cluster to this state embedding, c_count += 1
        if c_count > c_thresh then reassign all the clusters
    else
        add state embedding s_t' to c_sel
    end
end

C. Multi-Agent Reinforcement
Instead of letting a single agent handle the execution of all incoming recommendation requests, we divide the execution task among multiple homogeneous DQN-based agents. Unlike other multi-agent strategies, our framework differentiates the functioning of these agents in terms of the information they act upon. Each agent is responsible for learning the user behavior or characteristics corresponding to a specific item/action. This not only makes these models independent of each other but also addresses one of the main issues faced by other multi-agent recommendation strategies, i.e., instability [46]. Other strategies are prone to instability because one agent cannot assume the environment to be stationary if other agents are being updated simultaneously. In our case, since every agent acts on action-segregated information, the agents tend not to be affected by weight updates in other models. Another major problem in current systems is local credit assignment for a global reward [43]. When multiple agents cumulatively score the candidate items and finally present a list, it is difficult to assign credit for the obtained reward. However, since our framework assigns a separate agent to every action, the credit assignment is entirely straightforward. Although multi-agent systems are highly prevalent in applications such as game playing, their usage in recommender systems is still rare.

A significant advantage of our MMCR framework over other common single-agent recommendation strategies [14], [15] is that it gives relevant recommendations even for sparsely active users. Each model pays attention to only specific information and naturally does a better job of identifying which user and context combination suits the given item best.

Moreover, MMCR does not neglect relatively unpopular items. As observed in several real-life scenarios, some items are highly popular among users in general, while others are liked by a rather specific set of users. In traditional recommender systems [5], [9], [15], however, the unpopular items might never get recommended to the right user. MMCR has the ability to learn the preferred user characteristics for a particular item, as it maintains a separate model for each item. In this framework, we have opted for an item-centric approach rather than the prevalent user-centric model. Thus, even if an item appears less frequently than others, its chances of being recommended to the right user are not hurt.

D. Model Framework
Our Multimodel Contextual Reinforcement (MMCR) framework combines user, item, and user-item context features along with the user's positive interaction history to create a personalized and interactive recommender system. Instead of a single agent, the task of generating recommendations is spread over multiple RL agents. A detailed depiction of the proposed framework is given in Fig. 4. The step-by-step procedure for generating recommendations, following Algorithm 2, is as follows.
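Returning to the exploration strategy of Section IV-B, a compact Python sketch of the explore/exploit test and cluster bookkeeping in Algorithm 1 is given below. It assumes Euclidean distance and the defaults quoted above (n = 10 initial clusters, c_thresh = 15); the distance threshold, the re-clustering routine, and all identifiers are assumptions, since the paper does not fix them.

```python
import numpy as np

class ContextualClusterExploration:
    def __init__(self, dim, n_init=10, d_thresh=1.0, c_thresh=15, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = [c for c in rng.normal(size=(n_init, dim))]  # randomly initialized centroids
        self.members = [[] for _ in self.centroids]                   # embeddings assigned to each cluster
        self.d_thresh = d_thresh
        self.c_thresh = c_thresh

    def should_explore(self, s_t):
        """Explore only when s_t is far from every known centroid, i.e., an unforeseen situation."""
        d_min = min(np.linalg.norm(s_t - c) for c in self.centroids)
        return d_min > self.d_thresh

    def update(self, s_t_prime):
        """Assign the post-feedback embedding to its nearest cluster or open a new one;
        re-cluster once the cluster count exceeds c_thresh."""
        dists = [np.linalg.norm(s_t_prime - c) for c in self.centroids]
        nearest = int(np.argmin(dists))
        if dists[nearest] <= self.d_thresh:
            self.members[nearest].append(s_t_prime)
        else:
            self.centroids.append(np.array(s_t_prime, copy=True))
            self.members.append([s_t_prime])
            if len(self.centroids) > self.c_thresh:
                self._recluster()

    def _recluster(self):
        # Placeholder re-clustering: move each centroid to the mean of its members.
        # The paper leaves the re-clustering routine open, so this is only an assumption.
        self.centroids = [np.mean(m, axis=0) if m else c
                          for c, m in zip(self.centroids, self.members)]
```

As knowledge of the environment grows, fewer state embeddings fall outside the threshold, so exploration decays naturally without a hand-tuned schedule.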
TABLE I: Details About the Datasets Used. Here # Represents the Count.

features of users as input. This approach, however, fails to model the complex non-linearities of the data features.
2) W&D: Wide & Deep [9] is a widely used state-of-the-art deep learning model combining memorization (through a logistic regression on wide combinations of categorical features) and generalization (through a deep neural network embedding of the raw features) to predict the click label.
3) FM [24], [25]: Factorization Machine is a combination of factorization models and support vector machines (SVMs). It is advantageous for parameter estimation in sparse settings³ (a brief scoring sketch follows below).

³https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.png

2) Single Agent RL:
1) DRN [15]: The Deep Reinforcement Learning Framework for News Recommendation is a strategy that takes both user and item features into consideration. It uses double deep Q-networks (DDQNs) as a base framework along with a Dueling Bandit Gradient Descent-based exploration strategy.
2) DRR [14]: This stands for Deep RL-based Recommendation. A sequential regulation framework with an actor-critic RL-based pipeline is used for mapping relations between the users and the recommendation system. Moreover, explicit user-item interactions are used while modeling.
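For reference, the FM baseline scores an example with the standard second-order model from [24], ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σ_{i<j} ⟨vᵢ, vⱼ⟩ xᵢxⱼ. The NumPy sketch below uses the usual O(kn) reformulation; the dense-vector representation and the function name are assumptions made for brevity.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order Factorization Machine prediction for one feature vector x.
    w0: global bias, w: linear weights of shape (n,), V: latent factors of shape (n, k).
    Uses sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]."""
    linear = w0 + w @ x
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise
```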
Algorithm 2: Overall Methodology
Input: user features (u), context features (c), user-specific item history (i_h), item features (a), E = [ ]
for i ∈ (1 to N) do initialize DQN model Q_i
for session i = 1 : data.length do
    arms = [ ]
    right_arm = l_i
    U_i = (u_0, u_1, ..., u_m)
    C_i = (c_0, c_1, ..., c_m)
    i_h = [ ]
    s_i = State_Generator(U_i, C_i, i_h)
    if Explore_Module(s_i) == True then
        EXPLORATION:
        for k in K do
            arms.append(random(arm_list))
        end
    else
        EXPLOITATION:
        for j in Items do
            scores.add(Q_j.predict(s_i))
        end
        sort scores in descending order
        arms.add(top k items in scores)
    end
    Update i_h with right_arm
    s_i' = State_Generator(U_i, C_i, i_h)
    for m in arms do
        if m == l_i then
            reward = 1
            Q_m.fit([s_i, m, reward, s_i'])
            E.add([s_i, m, reward, s_i'])
        else
            reward = 0
            if Random(num) < thresh then
                E.add([s_i, m, reward, s_i'])
                Q_m.fit([s_i, m, reward, s_i'])
            end
        end
    end
    if E.length % Batch_size == 0 then
        for each experience t in E do
            fit the model of t's arm on t
        end
        E = [ ]
    end
end
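A condensed Python sketch of one pass of Algorithm 2 follows, assuming one DQN-style model per item exposing predict(state) and fit(transition), a state_generator callable standing in for the SG module, and a cce object with the should_explore/update interface sketched earlier. Every identifier and default value here is illustrative, not taken from the paper's code.

```python
import numpy as np

def run_session(user_feats, context_feats, history, item_feats, clicked_item,
                models, cce, state_generator, k=5, alpha=0.9,
                neg_keep_prob=0.1, buffer=None, batch_size=64):
    """One pass of Algorithm 2: build the state, pick k arms, apply the click feedback,
    and update the per-item models plus the experience buffer."""
    buffer = [] if buffer is None else buffer
    s = state_generator(user_feats, context_feats, history)

    if cce.should_explore(s):                                   # EXPLORATION: k random arms
        arms = np.random.choice(len(models), size=k, replace=False)
    else:                                                       # EXPLOITATION: top-k per-item Q-scores
        scores = np.array([m.predict(s) for m in models])
        arms = np.argsort(scores)[::-1][:k]

    # Only the clicked item yields a positive reward; eq. (4) updates the history.
    history = alpha * history + item_feats[clicked_item]
    s_next = state_generator(user_feats, context_feats, history)

    for m in arms:
        reward = 1 if m == clicked_item else 0
        if reward == 1 or np.random.rand() < neg_keep_prob:     # keep all positives, subsample negatives
            transition = (s, m, reward, s_next)
            models[m].fit(transition)
            buffer.append(transition)

    if len(buffer) >= batch_size:                               # periodic experience replay
        for s_b, m_b, r_b, sn_b in buffer:
            models[m_b].fit((s_b, m_b, r_b, sn_b))
        buffer.clear()

    cce.update(s_next)
    return history, buffer
```

Keeping every positive transition while subsampling negatives mirrors the thresholded negative storage in Algorithm 2 and keeps the per-item buffers from being dominated by non-clicks.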
3) Exploration Strategies: Exploration constitutes a major part of any multi-arm bandit-based recommender system. We combined our proposed model with various existing exploration strategies to compare their performance against CCE.
1) Epsilon-Greedy [47], [48]: This is the most basic strategy, which explores arms at random if a randomly generated number has a value less than a chosen parameter ε and prefers exploitation otherwise.
2) UCB [48], [49]: This leverages the uncertainty in the action-value pair. A confidence interval is formed for each item using a suitable distribution, which determines the confidence that the actual action-value of the item lies within this bound. Results have been compared with the Linear and Bootstrapped versions of UCB.
3) TS [47], [49]: Thompson Sampling essentially involves selecting an arm with a probability proportional to it being the best arm. Results have been compared with the Linear and Bootstrapped versions of TS.
4) Adaptive Greedy (AG) [49]: This is a variant of classic ε-greedy; however, during execution, ε is varied in a controlled manner. A threshold value for the probability scores is generated and used to decide between exploration and exploitation. The threshold changes when the number of trials exceeds a chosen window size, so as to adapt the model to give better predictions. Two variants of this strategy, namely AAG and AGT, have been compared here.
5) AE [49]: This strategy uses active learning as its base concept. It tries to select an observation depending upon the amount of leverage it would give the model if its label were known.

C. Evaluation
1) Simulated Online Environment: The evaluation of all the methodologies was done in an online simulation. We adhered to the following guidelines.
1) The data were processed in increasing order of timestamp. This maintains the time-dependent pattern of the data and mirrors how users would enter the system in a real-time scenario.
2) Although the traditional methods perform separate training, our framework adopts learning while testing. Immediate feedback and experience replay provide continuous learning. This is elaborated in Section IV-D.

The simulation takes each state (depicting a user entering the system) as an entry into the system. The system predicts the items for each state, which are then compared with the actual label (right arm). The right_arm is not known to the system before the comparison, similar to a real-world scenario. For example, if a user enters a news website, k news items will be recommended to her. She can decide to choose one or none. If any of the items are clicked, it counts as a positive reward; otherwise, the reward is negative.

2) Online Evaluation Metrics: The following metrics were used for the evaluation of all the techniques compared in this article.
1) CTR: Click-Through Rate is an effective measure for assessing how well recommender systems understand user click behavior. It is calculated for each item as

$\mathrm{CTR}_i = \dfrac{\text{Total number of clicks}}{\text{Total number of impressions}}$.  (6)

Further, the average of the CTR values of all items is calculated to obtain the final CTR value:

$\mathrm{CTR} = \dfrac{\sum_{i=0}^{N} \mathrm{CTR}_i}{N}$.  (7)

2) Relative CTR: This measures how a recommendation strategy performs on a given dataset
or platform as compared to other strategies applied to the same dataset. The Relative CTR for the i-th strategy can be calculated as

$\text{Relative CTR} = \dfrac{\mathrm{CTR}_i}{\text{CTR of other strategies}}$.  (8)

3) CMR: Cumulative Mean Reward is a useful measure for continuously observing and analyzing the performance of recommender systems over time. At any given point, if R is the number of rows evaluated so far, CMR is calculated as

$\mathrm{CMR}_R = \dfrac{\sum_{i=0}^{R} \mathrm{Reward}_i}{R}$.  (9)

3) Offline Evaluation Metrics: The following popular recommender-system evaluation measures have been compared for the offline evaluation in this article.
1) Precision at k: Precision is defined as the fraction of the total number of predicted recommendations (i.e., k) that are relevant to the user.

Fig. 5. CMR trends of RL-based techniques for k = 1. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.

D. Results
The proposed technique is compared with the state-of-the-art recommendation strategies as well as the various exploration strategies mentioned in Section V-B. The exploration strategies are implemented on our base pipeline. Results are obtained in both online and offline settings. Tables IV and V show the performance of CCE as an exploration strategy over the rest; CCE largely outperformed the mentioned state-of-the-art techniques. Furthermore, the performance of MMCR in comparison to state-of-the-art RL and non-RL-based strategies is shown in Tables II and III. For consistency with the non-RL-based strategies, we only show results on the test set (i.e., 30% of the entire dataset). To compare the RL-based techniques over time, we plot the CMR graphs shown in Figs. 5–7. The subsequent paragraphs highlight the intuition behind the observed trends.

1) Context-Based Exploration: MMCR focuses on user-feature- and context-based exploration. It can be observed from Tables IV and V that CCE performs best, irrespective of the number of items recommended (k), for both datasets. CCE tends to shift gradually from exploration to exploitation as our knowledge of the environment increases. The positive effect of this shift can be seen in the significantly higher CTR, Precision, and Recall values for CCE as compared to the other exploration strategies. Being a context-based strategy (Section IV-D), it deeply narrows the possibility of random items being shown to the users. AAG and AE give comparable performance to CCE. These exploration strategies change the amount of exploration, generally reducing it as the model learns over time; however, AAG and AE do not use contextual information for this decay in exploration. The ε-greedy, UCB, and TS-based explorations do not do justice, as ε-greedy relies on a consistent degree of randomness during exploration, while UCB and TS require multiple instances of the same item to be shown before the exploration stabilizes. In a brief period, these two techniques can hurt performance.
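The online and offline metrics reported in Tables II–V reduce to a few simple ratios; a sketch of (6), (7), (9), and precision at k is given below, assuming per-item click/impression counts and a chronologically ordered reward log (function names are illustrative).

```python
import numpy as np

def item_ctr(clicks, impressions):
    """Eq. (6): click-through rate of a single item."""
    return clicks / impressions

def overall_ctr(clicks_per_item, impressions_per_item):
    """Eq. (7): average of the per-item CTR values."""
    rates = [c / i for c, i in zip(clicks_per_item, impressions_per_item) if i > 0]
    return sum(rates) / len(rates)

def cumulative_mean_reward(rewards):
    """Eq. (9): running mean of rewards, one value per evaluated row, used for the CMR curves."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards) / np.arange(1, len(rewards) + 1)

def precision_at_k(recommended, relevant, k):
    """Fraction of the k recommended items that the user actually found relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k
```

Relative CTR in (8) is then simply the ratio of two such CTR values computed for different strategies on the same data.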
TABLE II: Online and Offline Evaluation for the TalkingData Dataset. Best Results Are Highlighted in Bold.
TABLE III: Online and Offline Evaluation for the Globo Dataset. Best Results Are Highlighted in Bold.
TABLE IV: Variation of Exploration Strategies on the TalkingData Dataset. Best Results Are Highlighted in Bold.
TABLE V: Variation of Exploration Strategies on the Globo Dataset. Best Results Are Highlighted in Bold.
2) Multi-Agent Versus Single-Agent: When observing CMR in Figs. 5(a), 6(a) and (b), and 7(a), it is evident that DRN, DRR, and MMCR have similar curves in the beginning. Over time, the performance of MMCR supersedes the other RL-based strategies. This trend can be credited to the item-based, independent multi-agent pipeline of MMCR. Single-agent techniques eventually give less weightage to sparsely used items. This is a disadvantage for recommendation, as some items may have an acquired user set; e.g., a TV-show recommender has generic shows as well as shows with a limited user base. In a single-model technique such as DRR or DRN, the latter type of show may get sidelined because it is not frequently chosen. With independent arm-based modeling, this problem is solved, as the user type is easily recognized by the model of that particular item, and thus the item is correctly recommended. Overall, MMCR tends to
outperform DRN and DRR across both datasets for all values of k (see Figs. 5–7). The same trend can also be reaffirmed from Tables II and III for the online as well as offline metrics.

3) User-Item Weighted Interaction History: The state embedding representation used by MMCR (Section IV-A) uses item information fruitfully by building embeddings from the user-item interaction history. DRN and DRR use item information but do not give adequate weightage to the user-item interaction history. The item history of a particular user can give deep insight into user behavior. MMCR ensures that newer item features carry greater weight in the interaction history than older ones. Tables II and III, as well as the CMR graphs in Figs. 5–7, verify our claim with significantly improved results throughout for MMCR. In Figs. 5(a), 6(a), and 7(a), the initial CMR curves show that DRN and DRR are competitive with the MMCR technique; in contrast, in Figs. 5(b), 6(b), and 7(b), the MMCR technique has an edge from the beginning. Nevertheless, over time, MMCR captures the trends far better than DRR and DRN. All the techniques eventually achieve stability in the CMR trends.

Although the current RL-based techniques such as DRN and DRR give improved results over non-RL-based strategies such as W&D and FM (see Tables II and III), they are surpassed by MMCR. The non-RL techniques fail to perform on real-time data, as the learning during training may not be sufficient to recognize the trends during testing. MMCR has three novel properties, namely the item-based multi-agent framework, the non-conventional state embedding representation, and the novel CCE exploration (see Section IV-D), which significantly improve its performance.

Fig. 6. CMR trends of RL-based techniques for k = 3. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.
Fig. 7. CMR trends of RL-based techniques for k = 5. The y-axis signifies CMR and the x-axis the number of experiences. The red, blue, and green lines stand for MMCR, DRN, and DRR, respectively. (a) TalkingData SDK. (b) Globo.
VI. CONCLUSION AND FUTURE WORK
In this article, we propose an item-based multi-agent recommender system with the capability to tackle real-time dynamic recommendations with ease. This is especially beneficial for sparsely yet periodically occurring item sets that may be ignored in a single-agent strategy. The technique is based on a contextual multi-arm bandit with a novel exploration strategy called CCE, which surpasses state-of-the-art exploration strategies by using effective clustering with minimal randomness. Moreover, using user and item information with an inclination to learn more from recent item trends through the user-item interaction history has increased the potential for relevance in recommendations. The continuous updates and experience replay ensure that current trends are learned and keep being reinforced throughout. Our extensive experimentation has shown that these properties significantly improve results on both offline and online evaluation metrics. For future work, there is scope for improving scalability by reducing the number of models in the multi-agent RL setup or by incorporating these properties into a single-agent model. Moreover, new clustering techniques can be tested for the CCE exploration strategy to advance it further.

REFERENCES
[1] J. Clement. (2021). Selected Online Companies Ranked by Total Digital Advertising Revenue From 2012 to 2020. [Online]. Available: https://www.statista.com/statistics/205352/digital-advertising-revenue-of-leading-online-companies/
[2] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: Scalable online collaborative filtering," in Proc. 16th Int. Conf. World Wide Web (WWW), 2007, pp. 271–280.
[3] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl, "Item-based collaborative filtering recommendation algorithms," in Proc. 10th Int. Conf. World Wide Web (WWW), 2001, pp. 285–295.
[4] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, "Collaborative filtering recommender systems," in The Adaptive Web. Berlin, Germany: Springer, 2007, pp. 291–324.
[5] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proc. 15th Int. Conf. Intell. User Interfaces (IUI), 2010, pp. 31–40.
[6] M. Kompan and M. Bieliková, "Content-based news recommendation," in Proc. Int. Conf. Electron. Commerce Web Technol. Cham, Switzerland: Springer, 2010, pp. 61–72.
[7] M. J. Pazzani and D. Billsus, "Content-based recommendation systems," in The Adaptive Web. Berlin, Germany: Springer, 2007, pp. 325–341.
[8] L. Zheng, V. Noroozi, and P. S. Yu, "Joint deep modeling of users and items using reviews for recommendation," in Proc. 10th ACM Int. Conf. Web Search Data Mining, Feb. 2017, pp. 425–434.
[9] H.-T. Cheng et al., "Wide & deep learning for recommender systems," in Proc. 1st Workshop Deep Learn. Recommender Syst., 2016, pp. 7–10.
[10] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[11] G. Tesauro, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58–68, Mar. 1995.
[12] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, 2013.
[13] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, 2015.
[14] F. Liu et al., "Deep reinforcement learning based recommendation with explicit user-item interactions modeling," 2018, arXiv:1810.12027. [Online]. Available: http://arxiv.org/abs/1810.12027
[15] G. Zheng et al., "DRN: A deep reinforcement learning framework for news recommendation," in Proc. World Wide Web Conf. (WWW), 2018, pp. 167–176.
[16] A. Kabra, A. Agarwal, and A. S. Parihar, "Cluster-based deep contextual reinforcement learning for top-k recommendations," in Proc. Int. Conf. Comput. Commun. Syst. (I3CS), vol. 170. Shillong, India: Springer, 2021, p. 125.
[17] M. Tokic, "Adaptive ε-greedy exploration in reinforcement learning based on value differences," in Proc. Annu. Conf. Artif. Intell. Ho Chi Minh City, Vietnam: Springer, 2010, pp. 203–210.
[18] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," J. Mach. Learn. Res., vol. 3, pp. 397–422, Nov. 2003.
[19] H. Steck, "Item popularity and recommendation accuracy," in Proc. 5th ACM Conf. Recommender Syst. (RecSys), 2011, pp. 125–132.
[20] Z. Cui et al., "Personalized recommendation system based on collaborative filtering for IoT scenarios," IEEE Trans. Services Comput., vol. 13, no. 4, pp. 685–695, Jul. 2020.
[21] F. Ge, "A user-based collaborative filtering recommendation algorithm based on folksonomy smoothing," in Advances in Computer Science and Education Applications. Berlin, Germany: Springer-Verlag, 2011, pp. 514–518.
[22] M. Deshpande and G. Karypis, "Item-based top-N recommendation algorithms," ACM Trans. Inf. Syst., vol. 22, no. 1, pp. 143–177, 2004.
[23] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Comput., vol. 42, no. 8, pp. 30–37, Aug. 2009.
[24] S. Rendle, "Factorization machines," in Proc. IEEE Int. Conf. Data Mining, Dec. 2010, pp. 995–1000.
[25] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, "Fast context-aware recommendations with factorization machines," in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. Inf. (SIGIR), 2011, pp. 635–644.
[26] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proc. 26th Int. Conf. World Wide Web, 2017, pp. 173–182.
[27] K. Wang, L. Xu, L. Huang, C.-D. Wang, and J.-H. Lai, "SDDRS: Stacked discriminative denoising auto-encoder based recommender system," Cognit. Syst. Res., vol. 55, pp. 164–174, Jun. 2019.
[28] R. Bellman, Dynamic Programming, vol. 707. New York, NY, USA: Courier Corporation, 2013.
[29] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4026–4034.
[30] N. Taghipour and A. Kardan, "A hybrid web recommender system based on Q-learning," in Proc. ACM Symp. Appl. Comput. (SAC), 2008, pp. 1164–1168.
[31] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[32] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, "Recommendations with negative feedback via pairwise deep reinforcement learning," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 1040–1048.
[33] H. Liang, "DRprofiling: Deep reinforcement user profiling for recommendations in heterogenous information networks," IEEE Trans. Knowl. Data Eng., early access, May 29, 2020, doi: 10.1109/TKDE.2020.2998695.
[34] W. Zhao et al., "Leveraging long and short-term information in content-aware movie recommendation via adversarial training," IEEE Trans. Cybern., vol. 50, no. 11, pp. 4680–4693, Nov. 2020.
[35] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 1291–1307, Nov. 2012.
[36] L. Wang, W. Zhang, X. He, and H. Zha, "Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 2447–2456.
[37] X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for page-wise recommendations," in Proc. 12th ACM Conf. Recommender Syst., Sep. 2018, pp. 95–103.
[38] X. Zhao, L. Zhang, L. Xia, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for list-wise recommendations," 2017, arXiv:1801.00209. [Online]. Available: http://arxiv.org/abs/1801.00209
[39] F. Buccafurri, D. Rosaci, G. M. L. Sarnè, and L. Palopoli, "Modeling cooperation in multi-agent communities," Cognit. Syst. Res., vol. 5, no. 3, pp. 171–190, Sep. 2004.
[40] P. Tommasino, D. Caligiore, M. Mirolli, and G. Baldassarre, "A reinforcement learning architecture that transfers knowledge between skills when solving multiple tasks," IEEE Trans. Cognit. Develop. Syst., vol. 11, no. 2, pp. 292–317, Jun. 2019.
[41] Y. Rizk, M. Awad, and E. W. Tunstel, "Decision making in multiagent systems: A survey," IEEE Trans. Cognit. Develop. Syst., vol. 10, no. 3, pp. 514–529, Sep. 2018.
[42] J. Feng et al., "Learning to collaborate: Multi-scenario ranking via multi-agent reinforcement learning," in Proc. World Wide Web Conf. (WWW), 2018, pp. 1939–1948.
[43] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–9.
[44] H. Huang, Q. Zhang, and X. Huang, "Mention recommendation for Twitter with end-to-end memory network," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 1872–1878.
[45] A. Tampuu et al., "Multiagent cooperation and competition with deep reinforcement learning," PLoS ONE, vol. 12, no. 4, Apr. 2017, Art. no. e0172395.
[46] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 2, pp. 156–172, Mar. 2008.
[47] S. Agrawal and N. Goyal, "Thompson sampling for contextual bandits with linear payoffs," in Proc. Int. Conf. Mach. Learn., 2013, pp. 127–135.
[48] J. Vermorel and M. Mohri, "Multi-armed bandit algorithms and empirical evaluation," in Proc. Eur. Conf. Mach. Learn. Cham, Switzerland: Springer, 2005, pp. 437–448.
[49] D. Cortes, "Adapting multi-armed bandits policies to contextual bandits scenarios," 2018, arXiv:1811.04383. [Online]. Available: http://arxiv.org/abs/1811.04383

Anu Agarwal received the B.Tech. degree in computer science and engineering from Delhi Technological University, Delhi, India, in 2020.
She has worked with the Machine Learning Laboratory, Delhi Technological University. She is currently working as a Senior Engineer with the Advanced Sensor Algorithm Development Team, Samsung Semiconductors India Research and Development Centre, Bengaluru, Karnataka, India. Her current research interests include reinforcement learning, 3-D scene reconstruction, neural network optimization, computer vision, and deep learning.