0% found this document useful (0 votes)

72 views

A Short Survey On Memory Based RL

This document provides a survey of memory-based reinforcement learning methods. It begins with background on reinforcement learning and issues with sample efficiency in deep RL. It then discusses how incorporating external memory modules, inspired by human memory, can help RL agents make faster decisions based on past high-reward experiences. The document reviews different memory architectures proposed for RL and how they are applied. It aims to provide insight into recent advances in memory-based RL.

Uploaded by

cnt dvs

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

72 views

A Short Survey On Memory Based RL

Uploaded by

cnt dvs

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

A S HORT S URVEY ON

M EMORY BASED R EINFORCEMENT L EARNING

Dhruv Ramani
National Institute of Technology, Warangal
[email protected]

A BSTRACT
arXiv:1904.06736v1 [cs.AI] 14 Apr 2019

Reinforcement learning(RL) is a branch of machine learning which is employed

to solve various sequential decision making problems without proper supervi-
sion. Due to the recent advancement of deep learning, the newly proposed Deep-
RL algorithms have been able to perform extremely well in sophisticated high-
dimensional environments. However, even after successes in many domains, one
of the major challenge in these approaches is the high magnitude of interactions
with the environment required for efficient decision making. Seeking inspira-
tion from the brain, this problem can be solved by incorporating instance based
learning by biasing the decision making on the memories of high rewarding expe-
riences. This paper reviews various recent reinforcement learning methods which
incorporate external memory to solve decision making and a survey of them is
presented. We provide an overview of the different methods - along with their
advantages and disadvantages, applications and the standard experimentation set-
tings used for memory based models. This review hopes to be a helpful resource to
provide key insight of the recent advances in the field and provide help in further
future development of it.

1 I NTRODUCTION
Reinforcement Learning (RL) (38) involves an agent to learn to take actions based on it’s current
situation to maximize a long term reward objective. The actions it takes aren’t labeled for training,
and the agent has to learn which action to take by trying them all out and settling for the best one.
The action it takes has an affect on both the reward it gets immediately and it’s long term future.
This originates from animal learning in psychology and hence it can mimic human learning ability
to select actions that maximize long-term profit in their interactions with the environment. This is
the reason why RL has been widely used in robotics and autonomous systems.
The recent advancement of deep learning has had a significant impact on many areas in machine
learning, improving the state-of-the-art in tasks such as object detection (28), speech recognition
(6), and language translation (37). The most important property of deep learning is that deep neu-
ral networks can automatically find compact low-dimensional representations of high-dimensional
data. Deep learning has similarly accelerated progress in RL, with the use of deep neural networks
for reinforcement learning tasks, hence defining the field of “deep reinforcement learning”. Deep
reinforcement learning involves usage of deep neural networks as a function approximator, thus
overcoming the curse of dimensionality. This makes it a promising approach to solving complex
real-world problems.
However, even after the rapid advancements in deep reinforcement learning, the standard architec-
tures are still found to be very sample inefficient and slow. In a setting where the agent learns to play
arcade games (22), deep reinforcement learning systems require millions of interactions with game
emulator, amounting to hundreds of hours of game play to achieve human-level performance, which
in turn seems pretty inhuman. Addressing the problems in a greater detail, we find that the slow-
gradient based update of neural networks requires these algorithms to incur large number of steps to
generalize, improve and assimilate the information for policy improvement. For environments with
sparse reward signal, modelling the policy with a neural network becomes even more challenging.
The low frequency of guiding signals or rewards can be seen as a form of class imbalance where

1
low-reward samples outnumber high-reward samples. Standard reward propagation approaches such
as Q-Learning (45) cause reward information to be propagated one step at a time through history.
However, this flow of information can be fairly efficient if updates happen in reverse order in which
the transitions occur. Also, approaches like DQN(22) involve random sampling of experience from
the replay memory, to train on uncorrelated mini-batches - requiring the usage of a target network
which further slows down this propagation.
In order to tackle these shortcomings of the current deep reinforcement learning algorithms, a good
idea would be to make decisions based on the experiences which provided high rewards in the
past. This involves the agent to be less reflexive according to their immediate perception and make
decisions based on the memories it has gathered over an extended time interval. Neither neural
network weights, nor activations support storage and retrieval of experiences as the weights change
too slowly to store samples as individual experiences. Usage of some variant of recurrent neural
networks is common in the partially observable setting (8), however these have trouble learning over
long sequences. Also, these approaches do not involve storage of observations as discrete entities,
hence the comparison of the new observation with a detailed instance of a rare highly-rewarding
past observation becomes very unclear.
Seeking inspiration from the rapid complementary approach of decision making in the brain (11),
there have been various recent attempts that try to incorporate external memory modules which
enrich the quality of decisions made by the agent (as per the returns accumulated), and make the
learning process faster. In the brain, this form of fast learning is supported by the hippocampus
and related medial temporal lobe structures (1; 41). Incorporation of this instance based learning
strategy serves as a faster rough approximation over a slow generalized decision making system.
The recent advancement of memory networks (47; 35) has attracted a growing amount of interest
in the research community to solve the challenging task of designing deep reinforcement learning
agents with external memory. In this paper, we provide a review of novel methods which solve
this problem. We cover different memory architectures which have been proposed to aid decision
making of reinforcement learning agents, different ways in which these proposed architectures are
used to solve different sub-problems within reinforcement learning, the environments proposed to
test these methods and the applications of these methods.

2 BACKGROUND
In this paper we consider a sequential decision making setup, in which an agent interacts with an
environment E over discrete time steps. We model the reinforcement learning problem using a
Markov Decision Process (MDP), unless otherwise specified. A MDP is defined as a tuple, M =
(S, A, R, P, s0 ), where S = {1, ..., sn } is the state space, A = {1, ..., am } is the action space,
and s0 is the initial state distribution. At each time step t = 1, ..., T within an episode, the agent
observes a state st ∈ S, takes action at ∈ A, receives a reward rt ∈ R(st , at ) and transitions to a
new state st+1 ∼ P(st , at ). A reward signal r ∈ R is a scalar defining the desirability of an event.
A policy π is a mapping from a state s ∈ S to an action a ∈ A.
The agent seeks to maximize the expected discounted return, which is defined as
∞
X
Gt = γ T −t rT (1)
T =t

In this formulation, γ ∈ [0, 1] is a discount factor that trades-off the importance of immediate and
future rewards. The state value function Vπ (s) provides an estimate of the expected amount of
the return the agent can accumulate over the future when following a policy π, starting from any
particular state s. The action value function Qπ (s, a) provides the expected return the agent gets
when it starts from any particular state s, takes an action a and continues to follow the policy π.
Vπ (s) = Eπ [Gt |st = s] (2)
Qπ (s, a) = Eπ [Gt |st = s, at = a] (3)

Q∗ (s, a) termed as the optimal action value function is the expected value of taking an action a in
state s and then following the optimal policy. In a value based agent, the agent tries to learn an
approximation of this and carry out planning and control by acting greedily on it. Q-learning (45) is

2
an off-policy control method to find the optimal policy. Q-learning (45) uses temporal differences
to estimate the value of Q∗ (s, a). In Q-learning, the agent maintains a table of Q[S, A] and Q(s, a)
represents the current estimate of Q∗ (s, a). This can be learnt by iteratively updating the value using
the following :

Q(s, a) ← Q(s, a) + α[r + max

0
Q(s0 , a0 ) − Q(s, a)] (4)
a

2.1 D EEP Q-N ETWORK

In all the above mentioned equations, the value functions are stored in a tabular form. Because of the
memory constraints, lack of generalization and lookup costs, learning to approximate these function
is preferred over the tabular setting. Function approximation is a way for generalization when the
state and/or action spaces are large or continuous. Function approximation aims to generalize from
examples of a function to construct an approximate of the entire function. With the coming of deep
learning renaissance, these functions are approximated using neural networks. Deep Q-Network
(22) provides the basis of incorporating neural networks to approximate the action value function
using Q-learning (45) for optimization. In this, the model is parameterized by weights and biases
collectively denoted as θ. Q-values are estimated online by querying the output nodes of the network
after performing a forward pass given a state input. Each output unit denotes a separate action. The
Q-values are denoted as Q(s, a|θ) . Instead of updating individual Q-values, updates are now made
to the parameters of the network to minimize a differentiable loss function :

2
L(s, a|θi ) = r + γ max
0
Q(s0 , a0 |θi ) − Q(s, a|θi ) (5)
a

θi+1 = θi + α∇θ L(θi ) (6)

the neural network model naturally generalizes beyond the states and actions it has been trained
on. However, because the same network is generating the next state target Q-values that are used in
updating its current Q-values, such updates can oscillate or diverge (42). To tackle this problem and
to ensure that the neural network doesn’t get biased, various techniques are incorporated. Experience
Replay is performed in which experiences et = (st , at , rt , st+1 ) are recorded in a replay memory
D and then sampled uniformly at training time. This is done to promote generalization. A separate
target network Q̂ provides update targets to the main network, decoupling the feedback resulting
from the network generating its own targets. Q̂ is identical to the main network except its parameters
θ− are updated to match θ every 10,000 iterations.
At each training iteration i, an experience et = (st , at , rt , st+1 ) is sampled uniformly from the
replay memory D. The loss of the network is determined as follows:
2
Li (θi ) = E(st ,at ,rt ,st+1 )∼D yi − Q(st , at ; θi ) (7)

where yi = rt + γ maxa0 Q̂(st+1 , a0 ; θ− ) is the stale update target given by the target network Q̂.
The actions are taken by acting -greedily on Q.
The Deep Q-Network has formed the basis of growth for deep reinforcement learning and various
modifications have been proposed to make tackle the problems more efficiently. Some of these
methods are Double DQN, Prioritized Experience Replay etc.

2.2 P OLICY G RADIENT A LGORITHMS

The policy π is defined as a stochastic or a deterministic function that maps states to the correspond-
ing action which the agent should take . In a policy based agent, the agent tries to learn the policy
instead of acting greedily on a value function. Again, in the current times - the policy is represented
using a neural network and policy optimization is used to find an optimal mapping. Considering first
the vanilla policy gradient algorithm (39), a stochastic, parameterized policy πθ is to be optimized.
Since we don’t know the ground truth labels for the action to take, we go back to maximizing the

3
expected return J(πθ ) = ET ∼πθ [R(T )]. The policy is optimized using gradient ascent, to maximize
the expected return.

θk+1 = θk + α∇θ J (πθ )|θk (8)

where the gradient estimator is of the form

∇θ J(πθ ) = Et [∇θ log πθ (at |st ) J(πθ )] (9)

Majority of the implementations which use automatic differentiation software, optimize the network
using an objective whose gradient is equal to the policy gradient estimator, which is obtained by
differentiating

LP G (θ) = Et [log πθ (at |st ) J(πθ )] (10)

In order to reduce the variance in the sample estimate for the policy gradient, the return can be
replaced by the advantage function Aπ (st , at ) = Qπ (st , at ) − V π (st ) which describes how much
better or worse it is than other actions on average.
Among the most popular methods to optimize policies is the Actor-Critic algorithm (14), which
learns both a policy and a state-value function, and the value function is used for bootstrapping,
i.e., updating a state from subsequent estimates, to reduce variance and accelerate learning. Some
of the other commonly used algorithms for policy optimization are TRPO (30), PPO (31) and the
deterministic (32) counterparts .

3 E PISODIC M EMORY
Episodic control, introduced by involves learning successful policies based on the episodic memory,
which is a key component of human life. In the brain, this form of learning is supported by the
hippocampus and related medial temporal lobe structures. Hippocampal learning is thought to be
instance-based (20; 36), in contrast to the cortical system which represents generalised statistical
summaries of the input distribution (21). It is found that humans utilize multiple learning, memory
and decision making systems in order to efficiently carry out the task in different situations. For
eg. - when information of the environment is available, the best strategy is model-based planning
associated with prefrontal cortex (4). However, when there aren’t enough resources to plan, a less
intensive decision making system needs to be employed. The common go-to among the RL litera-
ture is model-free decision making systems. However, as pointed out earlier - these methods require
very high amount of repeated interactions with environment, so there’s surely a scope of improve-
ment (when is there not?). This is where episodic control systems come into play and increase the
efficiency of model-free control systems by involvement of an external memory system which can
represent the hippocampus and inculcate instance based learning in agents. An important obser-
vation that supports this approach is that, single experiences with high returns can have prolonged
impact on future decision making in humans. For eg. Vasco Da Gamma’s discovery of the sea
route to India had an almost immediate and long-lasting effect for the Portuguese. Considering a
less extreme eg. - like recalling the plot of a movie as it unfolds, we realize that even in our day
to day lives episodic memory plays a crucial role. In reinforcement learning, the experiences of an
agent are termed as observations - and in episodic control, the agent leverages the information and
advantages of past observations in order to facilitate the decision making progress. For RL agents,
(17) proposed “episodic control”, which is given as

... each time the subject experiences a reward that is considered large enough
(larger than expected a priori) it stores the specific sequence of State- action pairs
leading up to this reward, and tries to follow such a sequence whenever it stumbles
upon a State included in it. If multiple successful sequences are available for the
same State, the one that yielded maximal reward is followed.

Majority of the work which involve episodic memory are based on this. In the next sections, we
will go through various models and methods which involve episodic memory for decision making.

4
Majority of the research which involves external memory in RL tasks involves episodic control. Due
to this and the biological backdrop to it, this approach for decision making seems promising.

4 M EMORY M ODULES
In this section, we go through various papers which have proposed a new memory module/method
which can be used for episodic control. All of the methods presented below propose a novel method
to incorporate memory into the reinforcement learning setting. These modules can be used to solve
various different kind of problems in reinforcement learning.

4.1 M ODEL F REE E PISODIC C ONTROL

This (2) can be considered one of the first few works which involved external memory in the re-
inforcement learning scene. It considers a deterministic environment, given the near deterministic
situations in the real world and the specialised learning mechanisms in the brain which exploit this
structure. The episodic model which represents the hippocampus for instance based learning here
is represented by a non-parametric growing table which is indexed by state-actions pair. It is repre-
sented by QEC (s, a) - and is used to rapidly record and replay the sequence of actions that so far
yielded the highest return from a given start State. Size of the table is limited by removing the least
recently updated entry once the maximum size of the table is reached.
At the end of each episode, the table is updated in the following way :

if (st , at ) 6∈ QEC ,

Rt
QEC (st , at ) ← (11)
max QEC (st , at ), Rt otherwise

Thus, the values stored in QEC (s, a) do not correspond to estimates of the expected return, rather
they are estimates of the highest potential return for a given State and action.
The value is estimated in the following way :

( P
1 k
[EC (s, a) = k i=1 QEC (s(i) , a) if (s, a) 6∈ QEC ,
Q EC (12)
Q (s, a) otherwise

For the states which have never been visited, QEC (s, a) is approximated by taking average of the
K-nearest states.
The main algorithm is given as follows :

Algorithm 1 Model-Free Episodic Control.

for each episode do
for t = 1, 2, 3, . . . , T do
Receive observation ot from environment.
Let st = φ(ot ).
Estimate return for each action a via (equation 12)
Let at = arg maxa Q [ EC (s , a)
t
Take action at , receive reward rt+1
end for
for t = T, T − 1, . . . , 1 do
Update QEC (st , at ) using Rt according to (equation 11).
end for
end for

Here, φ is a feature mapping which maps the observation ot to the state st . In this paper, φ is
represented as a projection to smaller-dimensional space ie. φ : x → Ax, where A ∈ RF ×D and

5
F D where D is the dimensionality of the observation and A is a random matrix drawn from
a standard Gaussian. φ can also be represented as a latent-variable probabilistic models such as a
variational autoencoder (13).
This approach had been the Arcade Learning Environment (Atari) (22), and a first-person 3-
dimensional environment, and it was found that it provided faster results as compared to standard
function approximators.

4.2 N EURAL E PISODIC C ONTROL

This (27) was the first end-to-end architecture which involved using memory in RL and was entirely
differentiable. The agent in this method consists of three components - a DQN (22) inspired convo-
lutional neural network that processes pixel images s and brings it down to an embedding space h.
This is then used to index a set of memory modules (one per action). These readouts from the action
memories are converted to Q(s, a) using a final network. Figure 1 shows the architecture during a
single pass.
The memory architecture in this paper is defined as a Differential Neural Dictionary (DND) - in
which each action a ∈ A has a simple memory module Ma = (Ka , Va ), where Ka and Va are
dynamically sized arrays of vectors, each containing the same number of vectors. The keys here
correspond to the embedding h which represent the state s and the value correspond to the respec-
tive Q(s, a) values. For each key hi , the lookup from the dictionary is performed as a weighted
summation of the P-nearest corresponding values.

p
X
o= w i vi , (13)
i

where vi is the ith element of the array Va and

X
wi = k(h, hi )/ k(h, hj ), (14)
j

Here, k is represented as a Gaussian or an inverse kernel. In the experiments performed, k is taken

as :
1
k(h, hi ) = . (15)
kh − hi k22 + δ
The writes to the DND are append only, and if the key already exits it’s updated. The values are
updated by performing standard Q-learning (45) updates with N -step Q-learning (26).
N
X −1
Q(N ) (st , a) = γ j rt+j + γ N max
0
Q(st+N , a0 ) (16)
a
j=0

Qi ← Qi + α(Q(N ) (s, a) − Qi ) (17)

where α is the learning rate of the Q update.
The algorithm for NEC is given at 2.

4.3 MASKED EXPERIENCE MEMORY

This architecture proposed in (19) provides a new memory module M which uses mask vectors in
the read operation which provides distributed weightage to the past memories based on the current
observations for control. The write operation appends the last observation into the fixed size memory
store, while the oldest memory is dropped from the store. The read operation compares the current
observation with all the previously written observations in the memory store and returns a vector
calculated as a weighted sum of all memories.

N
X
R= bi Mi (18)
i=1

6
Figure 1: Neural Episodic Control - Architecture of episodic memory module for a single action
a. Pixels representing the current state enter through a convolutional neural network on the bottom
left and an estimate of Q(s, a) exits top right. Gradients flow through the entire architecture.

Algorithm 2 Neural Episodic Control

D: replay memory.
Ma : a DND for each action a.
N : horizon for N -step Q estimate.
for each episode do
for t = 1, 2, . . . , T do
Receive observation st from environment with embedding h.
Estimate Q(st , a) for each action a via equation 13 from Ma
at ← -greedy policy based on Q(st , a)
Take action at , receive reward rt+1
Append (h, Q(N ) (st , at )) to Mat .
Append (st , at , Q(N ) (st , at )) to D.
Train on a random minibatch from D.
end for
end for

N is the number of D dimension vectors in the memory module and R is called the read vector. R
is defined as the weighted (b) summation over memories similarity (Qi ) to the current read vector.
exp(Qi )
bi = P N (19)
j=1 exp(Qj )

Qi is defined as the masked summation of the euclidean distance between the current observa-
tion/read key s and the memory elements :
D
X
Qi = exp(z) ad (s − Mid )2 (20)
d=1

exp(wd )
ad = PD (21)
k=1 exp(wk )

the mask weight vector w and attention distribution sharpness parameter z are trained by gradient de-
scent. The architecture used is a LSTM (10) based actor-critic (14) networks. Before every episode,
the memory is cleared. At every time-step, the current observation, the one-hot representation of
the last action taken, the reward received and the memory readout (from the current observation) is
concatenated and passed as an input to both, the actor-LSTM and the critic-LSTM. The standard
policy-gradient training procedure is followed. The key contribution of this architecture is the usage
of attention (43) mechanism which causes the agent to learn which memory segments to focus on.
This paper also proposed novel tasks to test memory-based frameworks on (which are discussed
later).

7
4.4 I NTEGRATING E PISODIC M EMORY WITH R ESERVOIR S AMPLING

This (49) method introduces an end-to-end trainable episodic memory module. Instead of assigning
credit to the recorded state by explicitly backpropogating through time, the set of states from the
memory are drawn from a distribution over all n-subsets of visited states which are parameterized
by weights. To draw from such a distribution without maintaining all visited states in memory, a
reservoir sampling technique is used.
The model is based on advantage actor critic (14) architecture, consisting of separate value and
policy networks. In addition to that, there’s an external memory module M consisting of n past
visited states (St0 , ..., Stn−1 ) with associated important weights (wt0 , ..., wtn−1 . Other than this,
other trainable networks include a query network (q), write network (w). The state St is given
separately to the query, write, value and networks at each time step. The query network outputs a
vector of size equal to the input state size which is used to choose a past state from the memory,
which is taken as an input by the policy. The write network assigns a weight to each new state
determining how likely it is to stay in memory. The policy network assigns probabilities to each
action conditioned on current state and recalled state. The value network estimates expected return
(value) from the current state.

t (q(S )|S )
ti
exp( T )
mt = Q(Sti |Mt ) = P (q(S )|S
(22)
n−1 t tj )
j=0 exp( T )

The model is trained using stochastic gradient descent, using standard RL loss functions for actor-
critic (14) method. The query network is trained on the loss −δt log(Q(mt |St )), by freezing the
other networks. With the write network, the weights w(St ) learned by the network are used in a
reservoir sampling algorithm such that the probability of a particular state St being in the memory at
given future time is proportional to associated weights w(St ) and to obtain estimates of the gradient
of the return with respect to the weight.
For sampling from the memory, a distribution with respect to the weights is learnt. The expected
return is calculated based on this distribution and the policy and other networks are trained using
policy gradient algorithm. However, sampling the samples from this distributions involve usage of a
reservoir sampling (44) algorithm which allows a steady distribution even while the contents of the
memory change. The sampling algorithm is too big to covered in this review, and can be read in the
paper.

4.5 N EURAL M AP

The proposed Neural Map (24) memory module was specifically designed for episodic decision
making in 3D environments, under a partial observable setting. This method provides a lot of weigh-
tage on the area where the agent is currently located for the storage of memory and decision making.
The write operator for the memory is selectively limited to affect the part of the neural map which
represents agent’s current position. If the position of the agent is given by (x, y) with x ∈ R and
y ∈ R and the neural map M is a C × H × W feature block, where C is the feature dimension,
H is the vertical extent of the map and W is the horizontal extent. There exists a coordinate nor-
malization function ψ(x, y) which maps every unique (x, y) to (x0 , y 0 ), where x0 ∈ {0, . . . , W } and
y 0 ∈ {0, . . . , H}.
Let st be the current state embedding, Mt be the current neural map, and (xt , yt ) be the current
position of the agent within the neural map. The Neural Map is defined by the following set of

8
Figure 2: (33) Neural Map - The method involves operation on the external memory Mt based on
the current position of the agent within the environment.

equations:

rt = read(Mt ) (23)
ct = context(Mt , st , rt ) (24)
(x ,yt )
t (xt ,yt )
wt+1 = write(st , rt , ct , Mt ) (25)
(xt ,yt )
Mt+1 = update(Mt , wt+1 ) (26)
(xt ,yt )
ot = [rt , ct , wt+1 ] (27)
πt (a|s) = Softmax(f (ot )), (28)
(x ,y )
where wt t t represents the feature at position (xt , yt ) at time t, [x1 , . . . , xk ] represents a con-
catenation operation, and ot is the output of the neural map at time t which is then processed by
another deep network f to get the policy outputs πt (a|s). The global read operation produces a C-
dimensional feature vector rt by passing the neural map Mt through a deep convolutional network,
hence summarizing the current instance of the memory. The context read operation is used to check
for certain features in the map and is given as :

qt = W [st , rt ] (29)
(x,y) (x,y)
at = qt · Mt (30)
(x,y)
at
(x,y) e
αt =P (w,z)
(31)
(w,z) eat
(x,y) (x,y)
X
ct = αt Mt , (32)
(x,y)

Where st is the current state embedding, rt is the current global read vector and they first produce
(x,y)
a query vector qt . The inner product of the query vector and each feature Mt in the neural map
(x,y)
is then taken to get scores at at all positions (x, y). Soft attention (43) is appplied over these
scores to get the context vector ct . Based on the representations calculated above and the current
coordinates of the agent, the write operation is performed using a deep neural network f which gives
a new C-dimensional write candidate vector at the current position (xt , yt )
(x ,yt )
t (xt ,yt )
wt+1 = f ([st , rt , ct , Mt ]) (33)

9
This write vector is used to update the memory in the following way :
(
(xt ,yt )
(a,b) wt+1 , for (a, b) = (xt , yt )
Mt+1 = (a,b) (34)
Mt , for (a, b) 6= (xt , yt )

For experimentation, the agent had been trained on a 3D Maze environment, where the only obser-
vation given was the current forward view of the agent. The memory represents a 2D map of the
whole maze. The visualized activations for the mapping provided key insights about the rewarding
trajectories which the agent should take.

4.6 M EMORY, RL, AND I NFERENCE N ETWORK (MERLIN)

This (46) seminal paper combined external memory systems, reinforcement learning and variational
inference (13) over states into a unified system based on concepts from psychology and neuro-
science - predictive sensory coding, the hippocampal representation theory of Gluck and Myers(5),
and the temporal context model and successor representation. Information from various sensory
input modalities (image It , egocentric velocity vt , previous reward rt−1 and action at−1 , and a text
instruction Tt ) are taken as observation ot and are encoded to et = enc(ot ). All of these encoders
were taken as ResNet (9) modules.

Figure 3: MERLIN - The architecture consists of two sub-networks, one for policy prediction and
other for observation inference. Both use a common external memory Mt and are modelled by a
recurrent neural network.

Based on variational inference, which treats inference as an optimization problem - the model con-
siders a prior distribution which predicts the next state variable conditioned on a history maintained
in memory of the previous state variables and actions: p(zt |z1 , a1 , . . . , zt−1 , at−1 ). The posterior
distribution corrects this prior based on the new observations ot to form a better estimate of the state
variable: q(zt |z1 , a1 , . . . , zt−1 , at−1 , ot ). The mean and log standard deviation of the prior distribu-
tion p are concatenated with the embedding and passed through a network to form an intermediate
variable nt , which is added to the prior to make a Gaussian posterior distribution q, from which
the state variable zt is sampled. This is inserted into row t of the memory matrix Mt and passed
to the recurrent network ht . The memory is represented using a Differentiable Neural Computer
(DNC) (7) and the recurrent network is represented by a deep LSTM (10). The recurrent network
has several read heads each with a key kt , which is used to find matching items mt in memory. The
state variable is passed as input to the read-only policy and is passed through decoders that pro-
duce reconstructed input data (Iˆt , vˆt , rt−1 ˆ and at−1
ˆ ) and the return prediction R̂t . These decoders
were taken as dual of the respective encoders with transposed convolutions wherever required. The
MBP is to be optimized to produce predictions that are consistent with the probabilities of observed
sensory sequences from the environment: Pr(o1 , o2 , . . . ). This objective can be intractable, hence
based on the standard variational inference procedure, the MBP is trained instead to optimise the

10
variational lower bound (VLB) loss, which acts as a tractable surrogate.
t
X
log p(o0:t , R0:t ) ≥ Eq(z0:τ −1 |o0:τ −1 ) Eq(zτ |z0:τ −1 ,o0:τ ) log p(oτ , Rτ |zτ )
τ =0

− DKL [q(zτ |z0:τ −1 , o0:τ )||p(zτ |z0:τ −1 , a0:τ −1 )] . (35)

This loss consists of a reconstruction loss and a KL divergence between p and q. To implement the
reconstruction term, several decoder networks take zt as input, and each one transforms back into the
space of a sensory modality. The difference between the decoder outputs and the ground truth data
is the loss term. The KL divergence between the prior and posterior probability distributions ensures
that the predictive prior is consistent with the posterior produced after observing new sensory inputs.
The policy, which has read-only access to the memory, is the only part of the system that is trained
conventionally according to standard policy gradient algorithms. To emphasise the independence of
the policy from the MBP, the gradients are blocked from the policy loss into the MBP.
MERLIN also excelled at solving one-shot navigation problems from raw sensory input in randomly
generated, partially observed 3D environments. It had also effectively learned the ability to locate
a goal in a novel environment map and quickly return to it. Even though, it wasn’t explicitly pro-
grammed, MERLIN showed evidence of hierarchical goal-directed behaviour, which was detected
from the MBP’s read operations.

4.7 M EMORY AUGMENTED C ONTROL N ETWORK

This model (12) was specifically created to target partially observable environment with sparse re-
wards, both being a common and tricky problems in reinforcement learning because of lack of imme-
diate feedback leading to hard to model navigation policies. Similar to the options framework (34)
in hierarchical reinforcement learning, this architecture breaks the planning problem into two levels.
At a lower level, a planning module computes optimal policies using a feature rich representation of
the locally observed environment. The higher level policy is used to augment the neural memory to
produce an optimal policy for the global environment. In the set of experiments, the agent operates
in an unknown environment and must remain safe by avoiding collisions. Let m ∈ {−1, 0}n be
a hidden labeling of the states into free (0) and occupied (−1). The agent has access to a sensor
that reveals the labeling of nearby states through an observations zt = H(st )m ∈ {−1, 0}n , where
H(s) ∈ Rn×n captures the local field of view of the agent at state s. The agent’s task is to reach a
goal region S goal ⊂ S, which is assumed obstacle-free, i.e., m[s] = 0 for all s ∈ S goal . The informa-
tion available to the agent at time t to compute its action at is ht := (s0:t , z0:t , a0:t−1 , S goal ) ∈ H,
where H is the set of possible sequences of observations, states, and actions. A policy µ : S → A is
to be learnt such that the agent is able to reach the goal state without any obstacles in between. The
partial observability requires consideration of memory in order to learn µ successfully.
A partially observable markov decision process based on the history space H is defined by
M(H, A, T , r, γ), where γ ∈ (0, 1] is a discount factor, T : H × A → H is a deterministic
transition function, and r : H → R is the reward function, defined as follows:
T (ht , at ) = (ht , st+1 = f (st , at ), zt+1 = H(st+1 )m, at ) (36)
r(ht , at ) = zt [st ] (37)
At any instance, the agent observes a small part of the environment (local observed space). This
approach computes optimal policies for these locally observed spaces and then uses these to compute
a policy optimal in the global space. The read and write operators on the memory Mt are defined as
reit = Mt> wtread,i (38)
Mt = Mt−1 (1 − wtW e>
t ) + wtW vt> (39)

where wtread,i are the read weights, wtW are write weights, et is an erase vector and vt is a write
vector. The write vector and the erase vector are emitted by the controller. At a lower level, plan-
ning is done in a local space given by z 0 within the boundaries of our locally observed environ-
ment space. This setting can be formulated as a fully observable markov decision process given by

11
Figure 4: MACN - The architecture uses convolutional layers to extract features from the environ-
ment. The value maps are generated with these features. The controller network uses the value
maps and low level features to emit read and write heads in addition to doing its own planning
computation.

Mt (S, A, f, r, γ) and planning in this is done by calculating an optimal policy for this local space
given by πl∗ . Let Π = [πl1 , πl2 , πl3 , πl4 , . . . , πln ] be the list of optimal policies calculated from such
consecutive observation spaces [z0 , z1 , . . . zT ]. These two are mapped by training a convolutional
neural network using standard policy gradient approaches.
Hence, in this model a value iteration network (40) is used to learn the value maps of the observation
space z. These value maps are used as keys for the differential memory, and are found to perform
better planning than just standard CNN embeddings. These local value maps (used to calculate local
policies) are concatenated with a low level feature representation of the environment and sent to
a controller network. The controller network interfaces with the memory through an access mod-
ule (another network layer) and emits read heads, write heads and access heads. In addition, the
controller network also performs its own computation for planning. The output from the controller
network and the access module are concatenated and sent through a linear layer to produce an action.
This entire architecture is then trained end to end.

5 U SAGE OF E PISODIC M EMORY

The memory architectures presented above have been used to carry out various different tasks in
reinforcement learning. Majority of the literature surveyed involves usage of external memory sys-
tems to make the existing deep-learning based algorithms more efficient, in terms of the number of
interactions, enhancing reward propagation and having strong priors from the past for decision mak-
ing. In this section, we go through various algorithms which don’t propose a new memory module -
however, are heavily depend upon the usage of episodic memory to accomplish the respective task.

5.1 E PISODIC BACKWARD U PDATE

This method (16) involves episodic memory to train a Deep Q-Network (22) with backward updates
through time. It is based on the observation that whenever we observe an event, we scan through our
memory in a backward manner and recognize the relationships between the current observation and
the past experiences (17). The simple backward update of the Q value function in the tabular setting
- which first generates the entire episode and performs backward update as defined in algorithm 3 is
very unstable if applied to deep reinforcement learning settings.

12
Algorithm 3 Simple Episodic Backward Update (single episode, tabular)
Initialize the Q- table Q ∈ RS×A with zero matrix.
Q(s, a) = 0 for all state action pairs (s, a) ∈ S × A.
Experience an episode E = {(s1 , a1 , r1 , s2 ), . . . , (sT , aT , rT , sT +1 )}
for t = T to 1 do
Q(st , at ) ← rt + γ maxa0 Q(st+1 , a0 )
end for

Hence, this algorithm is modified (4) to perform backward updates in the deep-learning setting by us-
ing episodic memory. All the transitions within the samples epidsode E = {S, A, R, S 0 } are used,
where E is denoted as a set of four length-T vectors: S = {S1 , S2 , . . . ST }; A = {A1 , A2 , . . . AT };
R = {R1 , R2 , . . . RT } and S 0 = {S2 , S3 , . . . ST +1 }. Episodic memory based module is used as
temporary target - Q̃ and it is initialized to store all the target Q-values of S 0 for all valid actions. Q̃
is an |A| × T matrix which stores the target Q-values of all states S 0 for all valid actions. Therefore,
the j-th column of Q̃ is a column vector that contains Q̂ Sj+1 , a; θ − for all valid actions a, where
Q̂ is the target Q-function parameterized by θ− .

Algorithm 4 Episodic Backward Update

Initialize replay memory D to capacity N
Initialize on-line action-value function Q with random weights θ
Initialize target action-value function Q̂ with random weights θ −
for episode = 1 to M do
for t = 1 to Terminal do
With probability select a random action at
Otherwise select at = arg maxa Q (st , a; θ)
Execute action at , observe reward rt and next state st+1
Store transition (st , at , rt , st+1 ) in D
Sample a random episode E = {S, A, R, S 0 } from D, set T = length(E)
Generate temporary target Q table, Q̃ = Q̂ S 0 , ·; θ −
Initialize target vector y = zeros(T )
yT ← RT
for k = T − 1 to 1 do
Q̃ [Ak+1 , k] ← βyk+1 + (1 − β)Q̃ [Ak+1 , k]
yk ← Rk + γ maxa∈A Q̃ [a, k]
end for
2
Perform a gradient descent step on (y − Q (S, A; θ)) with respect to θ
Every C steps reset Q̂ = Q
end for
end for

The target vector y is used to train the network by minimizing the loss between each Q (Sj , Aj ; θ)
and yj for all j from 1 to T . Adopting the backward update idea, one element Q̃ [Ak+1 , k] in the
k-th column of the Q̃ is replaced using the next transition’s target yk+1 . Then yk is estimated as the
maximum value of the newly modified k-th column of Q̃. This procedure is repeated in a recursive
manner and the backward update is finally applied to Deep Q-networks (22). This algorithm has
been tested 2D maze environment and the Arcade Learning Environment (22) and provides a novel
way to perform backward updates on deep architectures using episodic memory.

5.2 E PISODIC M EMORY D EEP Q-N ETWORKS

This method (18) tries to improve the efficiency of DQN (22) by incorporating episodic memory
- mimicking the competitive and cooperative relationship between Striattum and Hippocampus in
the brain. This approach combines the generalization strength of DQN (22) and the fast converging
property of episodic memory, by distilling the information in the memory to the parametric model.

13
The DQN function is paramterized by θ and is represented by Qθ , while the episodic memory targets
are represented by H, given by :

H(st , at ) = max Ri (st , at ), i ∈ (1, 2, · · · , E) (40)

where E represents the number of episodes that the agent has experienced, and Ri (s, a) represents
future return when taking action a under state s in i-th episode. H is a growing table indexed by
state-action pairs (s, a) and is implemented in a way similar to (). The loss function given below is
minimized to train Qθ :

L = α(rt + γ max
0
Qθ (st+1 , a0 ) − Qθ )2 + β(H − Qθ )2 (41)
a

Though straight-forward, this paper has reported various advantages over vanilla-DQN. Because the
memory stores optimal rewards, the reward propagation through the network is faster, compensating
the disadvantage of slow-learning resulted by single step reward update. Introducing the memory
module also makes this DQN highly sample efficient. This architecture had been tested on the Atari
suite (22) and had significantly outperformed the original model.

5.3 E PISODIC C URIOSITY THROUGH R EACHABILITY

This paper (29) specifically address the environments with sparse rewards. This is unsurprisingly
common, most of the environments provide no or negative rewards for non-final states, and a pos-
itive reward for the final state. Due to very infrequent supervision signal, a common trend among
researchers is to introduce rewards which are internal to the agent and is thus called Intrinsic Moti-
vation (3) or Curiosity Driven Learning (25).
This paper provides an internal reward when it reaches specific states which is non-final, and is
considered as novel as per the method. The states which require effort to reach (based on the number
of environment steps taken to reach it are considered as novel. To estimate this, a neural network
C(E(oi ), E(oj )) is trained to predict the number of steps that separate two observations. This is
binarized, hence the network predicts a value close to 0 if the number of steps that separate them is
less than k, which is a hyper-parameter. E(o) is also a neural network, which brings the observation
o to a lower embedding. Among the two observations (oi , oj ) - one is the current observation and
the second is a roll-out from an episodic memory bank which stores the past observations. If the
predicted number of steps between these two observations is greater than a certain threshold, the
agent rewards itself with a bonus, and adds this observation to the episodic memory. The episodic
memory M of size K stores the embeddings of the past observations. At every time step, the
current observation o goes through the embedding network producing the embedding vector e =
E(o). This embedding vector is compared with the stored embeddings in the memory buffer M =
e1 , . . . , e|M | via the comparator network C where |M | is the current number of elements in memory.
This comparator network fills the reachability buffer with values
ci = C(ei , e), i = 1, . . . |M | (42)
Then the similarity score between the memory buffer and the current embedding is computed as
C(M, e) = F (c1 , ..., c|M | ) ∈ [0, 1] (43)
The internal reward, called the curiosity bonus is calculated as :
b = B(M, e) = α(β − C(M, e)) (44)
where α and β are hyperparameters. After the bonus computation, the observation embedding is
added to memory if the bonus b is larger than a novelty threshold bnovelty .

6 S TANDARD T ESTING E NVIRONMENTS

Through various methods reviewed above, almost all of them shared common environments in which
these memory based modules had been tested. The first standard testing environment is the Arcade

14
Learning Environment (Atari) (22). The Arcade Learning Environment is a suite of arcade games
originally developed for the Atari-2600 console. These games are relatively simple visually but
require complex and precise policies to achieve high expected reward, and form an interesting set of
tasks as they contain diverse challenges such as sparse rewards and vastly different magnitudes of
scores across games. This is a good choice as it acts like a baseline, as the most common algorithms
like DQN (22) and A3C (23) have been applied in this domain.
The second common environment is a 2D and 3D maze-based environment, where memory is a
crucial part for optimal planning. These mazes generally involve an agent to successfully navigate
a 2D grid world populated with obstacles at random positions. The task is made harder by having
random start and goal positions. These normally involve a partially observable markov decision
process as the agent has only a single source of observation - the scene in front of it, while has no
idea about the full map of the maze.

Figure 5: 2D Maze Environment - The left side (a) represents the fully observable maze while the
right side (b) represents the agent observations.

Generally, the 2D-mazes are generated using random generator. Hence, test set therefore repre-
sents maze geometries that have never been seen during training, and measure the agent’s ability to
generalize to new environments. For testing in more complicated 3D environments, the 2D maze
environment is implemented in 3D using the ViZDoom (48) environment and a random maze gener-
ator. In this environment, the indicator is a torch of either red or green color that is always at a fixed
location in view of the player’s starting state. The goals are red/green towers that are randomly po-
sitioned throughout the maze. Other than these, some of the more recent approaches involve testing
on the memory game of Concentration. The game is played with a deck of cards in which each card
face appears twice. At the start of each game, the cards are arranged face down on a flat surface.
A player’s turn consists of turning over any two of the cards. If their faces are found to match, the
player wins those two cards and removes them from the table, then plays again. If the two cards do
not match, the player turns them face down again, then play passes to the next player. The game
proceeds until all cards have been matched and removed from the table. The winning strategy is
to remember the locations of the cards as their faces are revealed, then use those memories to find
matching pairs. The Concentration game is converted to a reinforcement learning environment using
the Omniglot (15) dataset.
All of the above mentioned environments provide a benchmark for new upcoming reinforcement
learning algorithms which involve external memory. Each of these environments tackle different
aspect such as speed of learning, partial observability, long scale decision making etc. which shows
the importance of memory in the future advancement of reinforcement learning.

7 C ONCLUSIONS
In this paper we have presented a brief survey on memory based reinforcement learning. We focused
on different memory modules and methods which enable episodic memory to be used for learning

15
how to control and plan for an agent. We cover different methods which use these modules for
different reinforcement learning based problems and give their advantages and disadvantages. We
provide a brief but detailed insights to each of these methods and cover the common testing envi-
ronments which are normally used. This paper had been written to promote the idea of usage of
external memory in reinforcement learning and provide insights on how these methods have been
based on/adapted from the learning procedures which occur in the brain. This paper hopes to be a
useful resource to provide a detailed overview of the field and to help in the future development of
it.

R EFERENCES
[1] Per Andersen, Richard Morris, David Amaral, John O’Keefe, Tim Bliss, et al. The hippocam-
pus book. Oxford university press, 2007.
[2] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z
Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv
preprint arXiv:1606.04460, 2016.
[3] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated rein-
forcement learning. In Advances in neural information processing systems, pp. 1281–1288,
2005.
[4] Nathaniel D Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between pre-
frontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):
1704, 2005.
[5] Mark A Gluck and Catherine E Myers. Hippocampal mediation of stimulus representation: A
computational theory. Hippocampus, 3(4):491–516, 1993.
[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep
recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and
signal processing, pp. 6645–6649. IEEE, 2013.
[7] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John
Agapiou, et al. Hybrid computing using a neural network with dynamic external memory.
Nature, 538(7626):471, 2016.
[8] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps.
In 2015 AAAI Fall Symposium Series, 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recogni-
tion, pp. 770–778, 2016.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9
(8):1735–1780, 1997.
[11] Steven W Kennerley and Mark E Walton. Decision making and reward in frontal cortex:
complementary evidence from neurophysiological and neuropsychological studies. Behavioral
neuroscience, 125(3):297, 2011.
[12] Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Vijay Kumar, and
Daniel D Lee. Memory augmented control networks. arXiv preprint arXiv:1709.05706, 2017.
[13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[14] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural informa-
tion processing systems, pp. 1008–1014, 2000.
[15] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept
learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

16
[16] Su Young Lee, Sungik Choi, and Sae-Young Chung. Sample-efficient deep reinforcement
learning via episodic backward update. arXiv preprint arXiv:1805.12375, 2018.
[17] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In
Advances in neural information processing systems, pp. 889–896, 2008.
[18] Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep q-
networks. arXiv preprint arXiv:1805.07603, 2018.
[19] Ricky Loynd, Matthew Hausknecht, Lihong Li, and Li Deng. Now i remember! episodic
memory for reinforcement learning, 2018. URL https://openreview.net/forum?
id=SJxE3jlA-.
[20] David Marr, David Willshaw, and Bruce McNaughton. Simple memory: a theory for archicor-
tex. In From the Retina to the Neocortex, pp. 59–128. Springer, 1991.
[21] James L McClelland, Bruce L McNaughton, and Randall C O’reilly. Why there are comple-
mentary learning systems in the hippocampus and neocortex: insights from the successes and
failures of connectionist models of learning and memory. Psychological review, 102(3):419,
1995.
[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602, 2013.
[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce-
ment learning. In International conference on machine learning, pp. 1928–1937, 2016.
[24] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep rein-
forcement learning. arXiv preprint arXiv:1702.08360, 2017.
[25] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven explo-
ration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
[26] Jing Peng and Ronald J Williams. Incremental multi-step q-learning. In Machine Learning
Proceedings 1994, pp. 226–232. Elsevier, 1994.
[27] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol
Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2827–
2836. JMLR. org, 2017.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. In Advances in neural information processing
systems, pp. 91–99, 2015.
[29] Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Tim-
othy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint
arXiv:1810.02274, 2018.
[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International Conference on Machine Learning, pp. 1889–1897,
2015.
[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[32] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In ICML, 2014.
[33] Andreas Stöckel. University of waterloo - cs885 ”reinforcement learning” paper
presentation, 2018. URL https://cs.uwaterloo.ca/˜ppoupart/teaching/
cs885-spring18/slides/cs885-lecture20a.pdf.

17
[34] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International
Symposium on abstraction, reformulation, and approximation, pp. 212–223. Springer, 2002.
[35] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In
Advances in neural information processing systems, pp. 2440–2448, 2015.
[36] Robert J Sutherland and Jerry W Rudy. Configural association theory: The role of the hip-
pocampal formation in learning, memory, and amnesia. Psychobiology, 17(2):129–144, 1989.
[37] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
[38] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
[39] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradi-
ent methods for reinforcement learning with function approximation. In Advances in neural
information processing systems, pp. 1057–1063, 2000.
[40] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration net-
works. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.
[41] Richard F Thompson. The neurobiology of learning and memory. Science, 233(4767):941–
947, 1986.
[42] JN Tsitsiklis and B Van Roy. An analysis of temporal-difference learning with function approx-
imationtechnical. Report LIDS-P-2322). Laboratory for Information and Decision Systems,
Massachusetts Institute of Technology, 1996.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural informa-
tion processing systems, pp. 5998–6008, 2017.
[44] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical
Software (TOMS), 11(1):37–57, 1985.
[45] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292,
1992.
[46] Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-
Barwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, et al. Unsupervised pre-
dictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.
[47] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint
arXiv:1410.3916, 2014.
[48] Marek Wydmuch, Michał Kempka, and Wojciech Jaśkowski. Vizdoom competitions: Playing
doom from pixels. IEEE Transactions on Games, 2018.
[49] Kenny J. Young, Shuo Yang, and Richard S. Sutton. Integrating episodic memory into a re-
inforcement learning agent using reservoir sampling, 2018. URL https://openreview.
net/forum?id=ByJDAIe0b.

1 s2.0 S0925231220303337 Main
No ratings yet
1 s2.0 S0925231220303337 Main
12 pages
Supervised Pretraining Can Learn In-Context Reinforcement Learning
No ratings yet
Supervised Pretraining Can Learn In-Context Reinforcement Learning
27 pages
Yuming Li Pin Ni Victor Chang
No ratings yet
Yuming Li Pin Ni Victor Chang
19 pages
Google REST
No ratings yet
Google REST
19 pages
DeepRL
No ratings yet
DeepRL
10 pages
Deep Super Learner: A Deep Ensemble For Classification Problems
No ratings yet
Deep Super Learner: A Deep Ensemble For Classification Problems
12 pages
Collective Intelligence Deep Learning
No ratings yet
Collective Intelligence Deep Learning
23 pages
Deep Exploration Via Bootstrapped DQN: Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
No ratings yet
Deep Exploration Via Bootstrapped DQN: Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
18 pages
1805.07917v2
No ratings yet
1805.07917v2
15 pages
1 s2.0 S1566253522000288 Main
No ratings yet
1 s2.0 S1566253522000288 Main
22 pages
L A P F: Earning To CT by Redicting The Uture
No ratings yet
L A P F: Earning To CT by Redicting The Uture
14 pages
A Hybrid Multi-Task Learning Approach for Optimizing Deep Reinforcement Learning Agents
No ratings yet
A Hybrid Multi-Task Learning Approach for Optimizing Deep Reinforcement Learning Agents
23 pages
Large-Scale Retrieval For Reinforcement Learning: These Authors Contributed Equally To This Work
No ratings yet
Large-Scale Retrieval For Reinforcement Learning: These Authors Contributed Equally To This Work
16 pages
Zhang2020 Article InterpretablePolicyDerivationF
No ratings yet
Zhang2020 Article InterpretablePolicyDerivationF
13 pages
Mlt Kcs055 2022 23 Aktu Qpaper Sol
No ratings yet
Mlt Kcs055 2022 23 Aktu Qpaper Sol
23 pages
Distral Original Paper
No ratings yet
Distral Original Paper
13 pages
Evaluation of Deep Reinforcement Learning Algorithms for Autonomous Driving
No ratings yet
Evaluation of Deep Reinforcement Learning Algorithms for Autonomous Driving
7 pages
Model-Driven Deep Learning
No ratings yet
Model-Driven Deep Learning
3 pages
Genetic Algorithm and Machine Learning
No ratings yet
Genetic Algorithm and Machine Learning
2 pages
Knowpc: Knowledge-Driven Programmatic Reinforcement Learning For Zero-Shot Coordination
No ratings yet
Knowpc: Knowledge-Driven Programmatic Reinforcement Learning For Zero-Shot Coordination
31 pages
Pplication of Deep Reinforcement Learning For Ndian Stock Trading Automation
No ratings yet
Pplication of Deep Reinforcement Learning For Ndian Stock Trading Automation
9 pages
Sullivan and Longo - 2023 - Explaining Deep Q-Learning Experience Replay With
No ratings yet
Sullivan and Longo - 2023 - Explaining Deep Q-Learning Experience Replay With
23 pages
2206.04282v1
No ratings yet
2206.04282v1
56 pages
Cognitive_RL
No ratings yet
Cognitive_RL
12 pages
基于深度强化学习的组合优化研究进展
No ratings yet
基于深度强化学习的组合优化研究进展
17 pages
Reinforcement and Imitation Learning Via Interactive No-Regret Learning
No ratings yet
Reinforcement and Imitation Learning Via Interactive No-Regret Learning
14 pages
Bayesian Deep Reinforcement Learning Via Deep Kernel Learning
No ratings yet
Bayesian Deep Reinforcement Learning Via Deep Kernel Learning
8 pages
On Transforming Reinforcement Learning With Transformers: The Development Trajectory
No ratings yet
On Transforming Reinforcement Learning With Transformers: The Development Trajectory
26 pages
A Survey of Demonstration Learning
No ratings yet
A Survey of Demonstration Learning
30 pages
Asynchronous Methods For Deep Reinforcement Learning
No ratings yet
Asynchronous Methods For Deep Reinforcement Learning
28 pages
Evolution Strategies As A Scalable Alternative To Reinforcement Learning
No ratings yet
Evolution Strategies As A Scalable Alternative To Reinforcement Learning
13 pages
Retrieval-Augmented Embodied Agents: Yichen Zhu, Zhicai Ou, Xiaofeng Mou, Jian Tang Midea Group, AI Lab
No ratings yet
Retrieval-Augmented Embodied Agents: Yichen Zhu, Zhicai Ou, Xiaofeng Mou, Jian Tang Midea Group, AI Lab
11 pages
A Concise Introduction To Reinforcement Learning: February 2018
No ratings yet
A Concise Introduction To Reinforcement Learning: February 2018
12 pages
Learning by Playing - Solving Sparse Reward Tasks From Scratch
No ratings yet
Learning by Playing - Solving Sparse Reward Tasks From Scratch
18 pages
The History Began From AlexNet - A Comprehensive Survey On Deep Learning Approaches
No ratings yet
The History Began From AlexNet - A Comprehensive Survey On Deep Learning Approaches
39 pages
1611.01606v1
No ratings yet
1611.01606v1
13 pages
Unit 4
No ratings yet
Unit 4
8 pages
Cost Effective Transfer of Reinforcement Learn - 2024 - Expert Systems With Appl
No ratings yet
Cost Effective Transfer of Reinforcement Learn - 2024 - Expert Systems With Appl
15 pages
Towards Adapting Reinforcement Learning Agents To New Tasks: Insights From Q-Values
No ratings yet
Towards Adapting Reinforcement Learning Agents To New Tasks: Insights From Q-Values
10 pages
Natural_Language-Based_HumanMachine_Collaborative_Learning_Games_Algorithm_Based_on_Deep_Rein-Forcement_Learning
No ratings yet
Natural_Language-Based_HumanMachine_Collaborative_Learning_Games_Algorithm_Based_on_Deep_Rein-Forcement_Learning
13 pages
A Review on Optimization Algorithm for Deep Learning Method in Bioinformatics Field (1)
No ratings yet
A Review on Optimization Algorithm for Deep Learning Method in Bioinformatics Field (1)
5 pages
26286-Article Text-30349-1-2-20230626
No ratings yet
26286-Article Text-30349-1-2-20230626
9 pages
Paper 1
No ratings yet
Paper 1
7 pages
Alexnet Paper
No ratings yet
Alexnet Paper
39 pages
10.1007@s00521 018 3712 X PDF
No ratings yet
10.1007@s00521 018 3712 X PDF
13 pages
Deep Learning Computers Today
No ratings yet
Deep Learning Computers Today
13 pages
Fidelity-Induced Interpretable Policy Extraction For Reinforcement Learning
No ratings yet
Fidelity-Induced Interpretable Policy Extraction For Reinforcement Learning
11 pages
A Survey On Image Data Augmentation For Deep Learn
No ratings yet
A Survey On Image Data Augmentation For Deep Learn
49 pages
Hyperparameter Tuning For Deep Reinforcement Learning Applications
No ratings yet
Hyperparameter Tuning For Deep Reinforcement Learning Applications
12 pages
Parameters_optimization_of_deep_learning_models_using_Particle_swarm_optimization
No ratings yet
Parameters_optimization_of_deep_learning_models_using_Particle_swarm_optimization
6 pages
1604.06057v2
No ratings yet
1604.06057v2
14 pages
MOCODA: Model-Based Counterfactual Data Augmentation
No ratings yet
MOCODA: Model-Based Counterfactual Data Augmentation
23 pages
Rule-based Reinforcement Learning augmented by External Knowledge
No ratings yet
Rule-based Reinforcement Learning augmented by External Knowledge
7 pages
A Guide To Machine Learning For Biologists
No ratings yet
A Guide To Machine Learning For Biologists
16 pages
An Introduction To Intertask Transfer For Reinforcement Learning
No ratings yet
An Introduction To Intertask Transfer For Reinforcement Learning
20 pages
Deep Reinforcement Learning in Large Discrete Action Spaces
No ratings yet
Deep Reinforcement Learning in Large Discrete Action Spaces
11 pages
2305.17473v4
No ratings yet
2305.17473v4
62 pages
Deep Reinforcement Learning: An Essential Guide
From Everand
Deep Reinforcement Learning: An Essential Guide
Robert Johnson
No ratings yet
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
From Everand
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
Nietsnie Trebla
No ratings yet
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
A Practical Guide and a Constructivist Rationale for Inquiry Based Learning
No ratings yet
A Practical Guide and a Constructivist Rationale for Inquiry Based Learning
14 pages
Capitalism Can Not Survive The Singularity - by Brett King - Aug, 2022 - Medium
No ratings yet
Capitalism Can Not Survive The Singularity - by Brett King - Aug, 2022 - Medium
8 pages
Diagnostic Models
No ratings yet
Diagnostic Models
3 pages
A. Positive Factors
No ratings yet
A. Positive Factors
5 pages
P7A Master of Professional Engineering
No ratings yet
P7A Master of Professional Engineering
19 pages
VI Semester Exam Date Sheet
No ratings yet
VI Semester Exam Date Sheet
27 pages
2447-Article Text-4594-1-10-20210420
No ratings yet
2447-Article Text-4594-1-10-20210420
3 pages
AF5122
No ratings yet
AF5122
3 pages
Enverga University Schedule of Classes, S.Y. 2015 - 2016 2 Year Engineering, 1 Semester G2A1 (CE, GE)
No ratings yet
Enverga University Schedule of Classes, S.Y. 2015 - 2016 2 Year Engineering, 1 Semester G2A1 (CE, GE)
1 page
Melcs Health Complete
100% (1)
Melcs Health Complete
43 pages
4th Quarter - Lesson 3 Human Person in Society
No ratings yet
4th Quarter - Lesson 3 Human Person in Society
31 pages
Dyslexia
No ratings yet
Dyslexia
21 pages
Beliefs and Values Examination Guide
No ratings yet
Beliefs and Values Examination Guide
2 pages
A Review On Sentiment Analysis Methodologies Practices and Applications
No ratings yet
A Review On Sentiment Analysis Methodologies Practices and Applications
9 pages
Balimbing Street, General Santos City
No ratings yet
Balimbing Street, General Santos City
12 pages
Jass
No ratings yet
Jass
7 pages
Greisman, Ritzer, 1981
No ratings yet
Greisman, Ritzer, 1981
22 pages
Thames 2024 Course Brochure
No ratings yet
Thames 2024 Course Brochure
2 pages
PR1 M2 Types of and Approaches To Research
No ratings yet
PR1 M2 Types of and Approaches To Research
20 pages
Hand Test PPT Report
No ratings yet
Hand Test PPT Report
43 pages
19 Assessing Model Accuracy
No ratings yet
19 Assessing Model Accuracy
16 pages
Presentation On Philosophies
No ratings yet
Presentation On Philosophies
17 pages
Collection of Data
No ratings yet
Collection of Data
10 pages
Math100 Posttest
No ratings yet
Math100 Posttest
101 pages
2.0 Chapter 2 Human Relations
No ratings yet
2.0 Chapter 2 Human Relations
44 pages
Reading and Writing Social Research
No ratings yet
Reading and Writing Social Research
10 pages
L7 India Intern
No ratings yet
L7 India Intern
1 page
Portfolio
No ratings yet
Portfolio
12 pages
Edu 214 NV Computer Standards
No ratings yet
Edu 214 NV Computer Standards
149 pages
Table of Contents021
No ratings yet
Table of Contents021
8 pages

A Short Survey On Memory Based RL

Uploaded by

A Short Survey On Memory Based RL

Uploaded by

A S HORT S URVEY ON

M EMORY BASED R EINFORCEMENT L EARNING

Reinforcement learning(RL) is a branch of machine learning which is employed

Q(s, a) ← Q(s, a) + α[r + max

2.1 D EEP Q-N ETWORK

θi+1 = θi + α∇θ L(θi ) (6)

2.2 P OLICY G RADIENT A LGORITHMS

θk+1 = θk + α∇θ J (πθ )|θk (8)

∇θ J(πθ ) = Et [∇θ log πθ (at |st ) J(πθ )] (9)

LP G (θ) = Et [log πθ (at |st ) J(πθ )] (10)

4.1 M ODEL F REE E PISODIC C ONTROL

Algorithm 1 Model-Free Episodic Control.

4.2 N EURAL E PISODIC C ONTROL

where vi is the ith element of the array Va and

Here, k is represented as a Gaussian or an inverse kernel. In the experiments performed, k is taken

Qi ← Qi + α(Q(N ) (s, a) − Qi ) (17)

4.3 MASKED EXPERIENCE MEMORY

Algorithm 2 Neural Episodic Control

4.6 M EMORY, RL, AND I NFERENCE N ETWORK (MERLIN)

4.7 M EMORY AUGMENTED C ONTROL N ETWORK

5 U SAGE OF E PISODIC M EMORY

5.1 E PISODIC BACKWARD U PDATE

Algorithm 4 Episodic Backward Update

5.2 E PISODIC M EMORY D EEP Q-N ETWORKS

H(st , at ) = max Ri (st , at ), i ∈ (1, 2, · · · , E) (40)

5.3 E PISODIC C URIOSITY THROUGH R EACHABILITY

6 S TANDARD T ESTING E NVIRONMENTS

You might also like