


A Concise Introduction to Reinforcement Learning

Ahmad Hammoudeh
RLR¹, Data Science Programme²
AIC¹, Princess Sumaya University for Technology²
Amman, Jordan
[email protected]

Abstract— This paper aims to introduce, review, and summarize several works and research papers on Reinforcement Learning.

Reinforcement learning is an area of Artificial Intelligence; it has emerged as an effective tool for building artificially intelligent systems and solving sequential decision-making problems. Reinforcement learning has achieved many impressive breakthroughs in recent years and has surpassed human level in many fields; it is able to play and win various games. Historically, reinforcement learning was effective in solving some control system problems. Nowadays, it has a growing range of applications.

This work includes an introduction to reinforcement learning which demonstrates the intuition behind reinforcement learning in addition to the main concepts. After that, the remarkable successes of reinforcement learning are highlighted. Then, methods and details for solving reinforcement learning problems are summarized. Thereafter, works from a wide collection of reinforcement learning applications are reviewed. Finally, the prospects and the challenges of reinforcement learning are discussed.

Keywords— Reinforcement Learning; Artificial Intelligence; Machine Learning

I. INTRODUCTION

Artificial Intelligence (AI) has turned out to be a hot topic in recent years. Many articles, books, and movies have been produced around questions like "can machines think?", "can AI exceed human intelligence?", "will machines replace humans?", "how dangerous is AI?", and "what distinguishes humans from AI?", in addition to the enslavement problem [1]. Researchers and scientific committees are not away from these questions; Alan M. Turing discussed some of them [2], and the Turing test was formulated to test a machine's ability to show intelligent behavior indistinguishable from that of a human [3]. However, some of these questions remain debatable among the most prominent CEOs (e.g., Mark Zuckerberg and Elon Musk) as well as the leading AI researchers. Discussing such questions entails a comprehensive understanding of reinforcement learning (RL); I refer the reader to Stuart J. Russell's work in [1].

The upraise of Artificial Intelligence is associated with Deep Learning achievements in recent years. Deep Learning is basically a set of multiple layers of neural networks connected to each other. While deep learning algorithms are the same as what was used in the late 1980's [4], deep learning progress is driven by the development of computational power and the tremendous increase of both generated and collected data [5]. Shifting from the CPU (Central Processing Unit) to the GPU (Graphics Processing Unit) [6] and later to the TPU (Tensor Processing Unit) [7] accelerated processing speed and opened the door for more successes. However, computational capabilities are bounded by Moore's law [8], which may slow down building strong AI systems [9].

Reinforcement learning is learning through interaction with an environment by taking different actions and experiencing many failures and successes while trying to maximize the received rewards. The agent is not told which action to take. Reinforcement learning is similar to natural learning processes, where a teacher or a supervisor is not available and the learning process evolves with trial and error. It differs from supervised learning, in which an agent needs to be told what the correct action is for every position it encounters [1], [9].

Fig. 1 Reinforcement Learning faces [10]

Reinforcement learning overlaps with multiple fields: computer science, engineering, neuroscience, mathematics, psychology, and economics. Figure 1 demonstrates these intersections [10].

Reinforcement learning is different from the other branches of machine learning, both supervised learning and unsupervised learning.

Supervised learning is the most widely studied and researched branch of machine learning. In supervised learning the machine learns from a training set of labeled data provided by an external teacher or supervisor who determines the correct action that the system should take for each example. The system's task is to generalize its responses so that it acts correctly in cases that are not available in the training examples. The performance of a supervised learning system improves by increasing the number of training examples. Some examples of supervised learning problems include classification, object detection, image captioning, regression, and labeling. Although this type of learning is important, it is not adequate for interactive environments since it is unrealistic to obtain labeled data that are both representative and correct. In interactive environments, learning will be more efficient if the system can learn from its own experience [9].

Unsupervised learning is oriented around finding structure hidden in a set of unlabeled data. Some examples of unsupervised learning include clustering, feature learning, dimensionality reduction, and density estimation. Even though reinforcement learning may seem like a kind of unsupervised learning since it does not learn from labeled data, it is different; reinforcement learning tries to maximize rewards rather than find hidden structure [9].

Reinforcement learning is considered a third paradigm of machine learning, alongside unsupervised learning and supervised learning. However, the door is open for other paradigms [9].

Fig. 2 Machine Learning branches [10]

The rest of this section discusses the standard model and the basic components of a reinforcement learning system. In the next section some remarkable reinforcement learning successes are highlighted. Then the ideas of the Markov Decision Process and the Bellman optimality equations, which represent the core for formulating reinforcement learning problems, are discussed. After that, some algorithms for solving the reinforcement learning problem are reviewed, starting with the general approaches such as standard tabular methods and approximate solution methods, followed by the Monte Carlo method and the Temporal-Difference method, and ending with policy-based methods and the deep Q-network method. In the end, some reinforcement learning applications are highlighted.

A. Standard Model

The main components of a reinforcement learning system are: policy, reward signal, value function, and model [9], [10].

Fig. 3 Reinforcement Learning standard diagram

• The policy (π) is the way that the agent (something that perceives and acts in an environment [1]) behaves under certain circumstances. Simply, the policy maps states into actions. It can be a lookup table, a function, or it may involve a search process. Finding the optimal policy is the core goal of the reinforcement learning process [9], [10].

• The reward signal (R) indicates how good or bad an event is, and it defines the goal of the problem, where the agent's objective is to maximize the total received reward. Accordingly, the reward is the main factor for updating the policy. A reward may be immediate or delayed; for delayed signals the agent needs to determine which actions are more relevant to a delayed reward [9], [10].

• The value function is a prediction of the total future rewards; it is used to evaluate the states and select between actions accordingly [9], [10].

The state-value function V(s) is the expected return when starting from a state s [9], [10]:

V(s) = E[ G_t | S_t = s ]      (1)



Where the return G_t is the total reward R from time-step t; it is the sum of the immediate reward and the discounted future rewards [9], [10]:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}      (2)

The discount γ represents the degradation factor of future rewards when they are evaluated at present; γ ranges between 0 and 1. However, the use of a discount is sometimes controversial.

The action-value function q(s, a) is the expected return when starting from a state s and taking an action a. Equation 3 shows the mathematical formulation [9], [10]:

q(s, a) = E[ G_t | S_t = s, A_t = a ]      (3)

• The model of the environment allows predictions to be made about the behavior of the environment. However, the model is an optional element of reinforcement learning; methods that use models and planning are called model-based methods. On the other side are model-free methods, where the agent does not have a model of the environment. Model-free methods are explicitly trial-and-error learners [9], [10].
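To make the interplay of these components concrete, the following minimal Python sketch runs one episode of the agent-environment loop and computes the discounted return of equation (2). The Gym-style `env` object and the `policy` function are assumed placeholders for illustration, not part of the original paper.

```python
# Minimal sketch of the agent-environment loop and the discounted return G_t.
# `env` is assumed to expose Gym-style reset()/step() methods, and `policy`
# stands in for pi(s); both are illustrative assumptions.
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)               # pi: state -> action
        state, reward, done, _ = env.step(action)
        rewards.append(reward)               # reward signal R
        if done:
            break
    # Discounted return from time-step 0 (equation 2):
    # G_0 = R_1 + gamma * R_2 + gamma^2 * R_3 + ...
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Example usage with a trivial constant policy over a discrete action set:
# print(run_episode(env, lambda s: 0))
```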

II. REINFORCEMENT LEARNING REMARKABLE SUCCESSES

Reinforcement learning is not a new idea; however, the rise of reinforcement learning and most of its achievements are recent.

DeepMind achieved human-level performance on Atari games through a combination of reinforcement learning and deep learning: the agent learned to play 49 different games from self-play and the game score, using the same algorithm without tuning to a particular game, by mapping raw screen pixels to a deep Q-network. DeepMind published details of a program that can play Atari at a professional level in 2013 [11]. Later, in the beginning of 2014, Google acquired DeepMind [12].

Fig. 4 Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider [11]

Gerald Tesauro from IBM developed a program that plays backgammon using reinforcement learning, and the program was able to surpass human players [13]. However, scaling this success to more complicated games proved to be difficult until 2016, when DeepMind trained the AlphaGo program using reinforcement learning and succeeded in defeating Go's world champion, Lee Sedol. Go is an ancient Chinese game which had been much harder for computers to master [14].

One of the noticeable achievements in the field of reinforcement learning is AlphaZero, which achieved a superhuman level in more than one game: Chess, Shogi (Japanese chess), and Go, in 24 hours of self-play. The breakthrough is that it reuses the same hyperparameters for all games without game-specific tuning [15].

In the field of control, Andrew Ng and Pieter Abbeel succeeded in flying a helicopter automatically in 2006. They had a pilot fly the helicopter to help find a model of the helicopter dynamics and a reward function. Then they used reinforcement learning to find an optimized controller for the resulting model and reward function [16], [17].

Fig. 5 Reinforcement learning was implemented to make strategic decisions in Jeopardy! (IBM's Watson 2011) [18]



III. MARKOV DECISION PROCESS (MDP)

The Markov Decision Process problem consists of sequential decisions in which an action (A) has to be taken in each state (S) that is visited by the agent. A Markov decision process can be represented as a sequence of states, actions, and rewards as shown in the following line [10]:

… , S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, …

The main property of a Markov process is that the future is independent of the past given the present. Equation 4 formulates the Markov property: the probability of the next state S_{t+1} depends only on the current state S_t, regardless of the past states S_1, S_2, …, S_{t-1} [10]:

P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, S_2, …, S_t ]      (4)

A. Bellman optimality equation

MDPs are well studied in control theory, and their bases go back to the pioneering work of Richard Bellman. Bellman's main contribution was to show that solving an MDP using dynamic programming (DP) reduces the computational burden effectively [19], [20].

The goal of an agent is to take actions that maximize the total reward; basically this is an optimality problem where the reinforcement learning agent tries to take actions that lead to maximum rewards, which can be represented by the action-value function q(s, a) [9], [10].

The optimal action-value function q*(s, a) is the maximum action-value function over all policies:

q*(s, a) = max_π q_π(s, a)      (5)

The Bellman optimality equation (equation 7) takes advantage of the recursive form of the action-value function q(s, a), which can be rewritten as below (equation 6) [9], [10]:

q(s, a) = E[ R_{t+1} + γ q(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]      (6)

Bellman optimality equation:

q*(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} q*(s', a')      (7)

Where:
R(s, a) is the immediate reward given the current state s and the current action a;
q*(s', a') is the action-value function given the state s' (next possible state) and the action a' (next possible action);      (8)
P(s'|s, a) = P[ S_{t+1} = s' | S_t = s, A_t = a ] is the probability of a transition to state s' given the current state s and the current action a.      (9)

The Bellman optimality equation is non-linear and in general no closed-form solution is available. However, many iterative solution methods have been reported, such as value iteration, policy iteration, Q-learning [21], and SARSA [22].

Classical methods of DP such as policy iteration and value iteration may break down when it comes to large-scale and complex MDPs; there are two barriers that may hinder an MDP's scalability: 1) the curse of modeling, which is the difficulty of computing the transition probabilities, and 2) the curse of dimensionality, which arises when manipulating the elements of the MDP becomes challenging. However, adaptive function approximations [23] and learning-based methods [24] act well in finding near-optimal solutions for large-scale MDPs.
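As an illustration of the iterative solvers mentioned above, the following sketch repeatedly applies the Bellman optimality backup of equation (7) to an action-value table (value iteration) for a small finite MDP. The dense arrays P and R are assumed, illustrative inputs rather than anything specified in the paper.

```python
import numpy as np

# Minimal value-iteration sketch for a small finite MDP, applying the Bellman
# optimality backup of equation (7) until the action values stop changing.
# P[s, a, s'] are transition probabilities and R[s, a] immediate rewards,
# both assumed to be given as dense arrays (illustrative placeholders).
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, n_actions, _ = P.shape
    q = np.zeros((n_states, n_actions))
    while True:
        # q*(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' q*(s',a')
        q_new = R + gamma * P @ q.max(axis=1)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

# A greedy policy can then be read off the converged action values:
# policy = value_iteration(P, R).argmax(axis=1)
```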
IV. REINFORCEMENT LEARNING ALGORITHMS

An optimal policy for reinforcement learning can be found using exact solution methods or approximate solution methods (function approximation).

A. Tabular methods

Q-learning is a simple reinforcement learning algorithm that learns long-term optimal behavior without any model of the environment [25]. The idea behind the Q-learning algorithm is that the current value of our estimate of Q can improve our estimated solution (bootstrapping) [21]:

Q(S, A) ← Q(S, A) + α δ,   where δ = R + γ max_a Q(S', a) − Q(S, A)      (10)

where α is the learning rate and δ is the temporal-difference (TD) error.

Q-learning is off-policy; it evaluates one policy (the target policy) while following another policy (the behavior policy). However, finding an optimal action-value function q* under an arbitrary behavior policy can be achieved using policy iteration. Q converges to the optimal action-value function q*, and its greedy policy converges to an optimal policy under an appropriate choice of the learning rate over time [21].
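A minimal sketch of the tabular update in equation (10), driven by an ε-greedy behavior policy, is given below. The Gym-style environment, the discrete action list, and the hyperparameter values are assumptions for illustration only.

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning with an epsilon-greedy behavior policy,
# applying the update of equation (10). The Gym-style `env`, its discrete
# `actions` list, and the hyperparameter values are illustrative assumptions.
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    q = defaultdict(float)                       # Q(s, a), zero-initialized
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # TD error bootstraps on the greedy (target-policy) value
            best_next = max(q[(next_state, a)] for a in actions)
            delta = reward + gamma * best_next * (not done) - q[(state, action)]
            q[(state, action)] += alpha * delta
            state = next_state
    return q
```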



Policy iteration requires policy improvement and policy evaluation. The latter enhances the action-value function estimate by minimizing temporal-difference (TD) errors in experienced paths by following the policy. As the estimate develops, the policy can normally be improved by selecting greedy actions according to the latest value function Q. Instead of performing the previous steps separately (as in traditional policy iteration), the process can be accelerated by allowing for interleaved steps, as in generalized policy iteration [26].

B. Approximate solution methods

Tabular methods represent exact value functions and exact policies in tables. As the scale and the complexity of the environment increase, the required computational power dramatically increases. However, approximation is a powerful concept that absorbs the hidden-state problem and the scalability problem [9].

An action-value function is represented as a parametrized function approximator Q̂(s, a, θ) with parameter θ. The approximator can be a linear weighting of features or a deep neural network with weights as the parameter θ [10].

The approximate function parameter θ can be updated using the semi-gradient SARSA update shown in equation 11:

θ ← θ + α [ R + γ Q̂(S', A', θ) − Q̂(S, A, θ) ] ∇_θ Q̂(S, A, θ)      (11)

SARSA is an on-policy method where the behavior policy is the same as the target policy; it evaluates and follows a single policy. The SARSA algorithm approximates q_π, and therefore the policy should not be totally greedy; it should be near greedy (ε-greedy), where the policy acts greedily most of the time but there is a low probability of choosing random actions [22].

In general, on-policy methods perform better than off-policy methods but they find worse policies. An off-policy method finds a better policy (the target policy) but it does not follow it, therefore the performance is worse [9].

Given on-policy methods with a linear function approximator, many published works reported guaranteed convergence for prediction problems [27]–[29] and guaranteed non-divergence for control problems [30].

Function approximation can work well with off-policy methods such as Q-learning; a semi-gradient Q-learning update for the parameters of the approximate function is shown in equation 12 [25]:

θ ← θ + α [ R + γ max_a Q̂(S', a, θ) − Q̂(S, A, θ) ] ∇_θ Q̂(S, A, θ)      (12)
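The following sketch shows a single semi-gradient update for a linear approximator, covering both the SARSA target of equation (11) and the Q-learning target of equation (12). The feature function `phi` and the transition values are assumed inputs used only for illustration.

```python
import numpy as np

# Sketch of one semi-gradient update step for a linear action-value
# approximator Q(s, a; theta) = theta . phi(s, a), as in equations (11)-(12).
# The feature function `phi` and the transition (s, a, r, s', a') are assumed.
def semi_gradient_step(theta, phi, s, a, r, s_next, a_next, actions,
                       alpha=0.01, gamma=0.99, off_policy=False):
    q_sa = theta @ phi(s, a)
    if off_policy:
        # Q-learning target (equation 12): bootstrap on the greedy action
        target = r + gamma * max(theta @ phi(s_next, b) for b in actions)
    else:
        # SARSA target (equation 11): bootstrap on the action actually taken
        target = r + gamma * (theta @ phi(s_next, a_next))
    # For a linear approximator, the gradient of Q w.r.t. theta is phi(s, a)
    return theta + alpha * (target - q_sa) * phi(s, a)
```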
However, Q-learning with function approximation may encounter the problem of instability. The main cause of instability is not learning or sampling, since dynamic programming suffers from divergence with function approximation; neither greedification, exploration, nor control is the main cause, since policy evaluation alone can produce instability; furthermore, the complexity of the function approximator is not the root cause, since linear function approximation can diverge [9].

When combining function approximation, bootstrapping, and off-policy learning, the risk of instability and divergence arises [28]. However, choosing only two out of the three is challenging: function approximation is significant for generalization and scalability, bootstrapping is important for computational and data efficiency, and off-policy learning helps in finding a better policy by freeing the behavior policy from the target policy [9].

Reported attempts at achieving stability and surviving the deadly triad (combining function approximation, bootstrapping, and off-policy learning) suggest that experience replay [31] and more stable targets like Double Q-learning [32] can help. The use of least-squares methods such as off-policy LSTD(λ) survives the triad easily [33], [34]; however, their computational costs scale with the square of the number of parameters. The true-gradient methods such as Gradient-TD [35], [36] and proximal-gradient-TD [37] work towards the robust convergence properties of stochastic gradient descent.

C. Monte Carlo Method and Temporal-Difference learning method

Both the Monte Carlo (MC) method and the Temporal-Difference (TD) learning method are model-free, with no knowledge of MDP transitions or rewards; they learn directly from episodes of experience, i.e., the MC return can be estimated by averaging the return from multiple rollouts. In contrast to Monte Carlo, which learns from complete episodes by sampling, Temporal-Difference learning learns from incomplete episodes by sampling and bootstrapping. However, it is possible to get the best of both the Monte Carlo method and TD learning, as in the TD(λ) algorithm, where λ interpolates between Temporal-Difference (λ = 0) and MC (λ = 1) [10].
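A small sketch contrasting the two update styles is given below: the Monte Carlo update waits for the complete episode and uses the sampled return, while the TD(0) update bootstraps on the current value estimate after every step. The episode data and the value table are assumed inputs.

```python
# Sketch contrasting Monte Carlo and TD(0) updates of a state-value table V.
# `episode` is an assumed list of (state, reward, next_state) transitions and
# V is assumed to be a collections.defaultdict(float).

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    # Monte Carlo: after the complete episode, move V(s) toward the actual
    # sampled return G_t from each visited state.
    g = 0.0
    for state, reward, _ in reversed(episode):
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])

def td0_update(V, episode, alpha=0.1, gamma=0.99):
    # TD(0): update after every step, bootstrapping on the current estimate
    # of the next state's value instead of waiting for the full return.
    for state, reward, next_state in episode:
        target = reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])
```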
D. Policy based RL

Policy-based reinforcement learning methods search for an optimal policy directly, with no value function; hence they are effective in continuous and high-dimensional spaces and have better convergence properties, but typically they converge to a local optimum rather than a global optimum. Although they can learn stochastic policies, policy evaluation has high variance [38].

A policy-based method is an optimization problem where a parameterized policy is updated to maximize the return, using either gradient-free optimization techniques (e.g., Hill Climbing [39], Nelder-Mead [40], compressed network search [41], Genetic Algorithms [42]) or gradient-based optimization techniques (Gradient Descent, Conjugate Gradient, Quasi-Newton) [43], [44].
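As a concrete sketch of a gradient-based policy method, the code below performs a REINFORCE-style update in the spirit of [45] for a softmax policy with linear preferences; the feature function and the collected episode are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

# Minimal REINFORCE-style sketch: a softmax policy over discrete actions with
# linear preferences theta . phi(s, a), updated along the policy gradient
# weighted by the sampled return. `phi` and the episode data are assumptions.
def softmax_probs(theta, phi, s, actions):
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, phi, episode, actions, alpha=0.01, gamma=0.99):
    # episode: list of (state, action_index, reward) collected with the policy
    g, returns = 0.0, []
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    for (s, a_idx, _), g_t in zip(episode, returns):
        probs = softmax_probs(theta, phi, s, actions)
        # grad log pi(a|s) for a softmax-linear policy:
        grad_log = phi(s, actions[a_idx]) - sum(
            p * phi(s, b) for p, b in zip(probs, actions))
        theta = theta + alpha * g_t * grad_log
    return theta
```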



Fig. 6 A slice through the space of reinforcement learning methods, showing the most important dimensions. At the extremes of these two dimensions are: dynamic programming, exhaustive search, TD learning and Monte Carlo [9]

Neural networks are trained to estimate policies successfully using both gradient-based [45]–[48] and gradient-free [49]–[51] methods. Gradient-free methods perform well with a small number of parameters and can optimize non-differentiable policies, while gradient-based methods dominate deep reinforcement learning algorithms [52].

Actor-critic methods implement generalized policy iteration, alternating between a policy improvement and a policy evaluation. The policy is called an actor because it selects actions, and the estimated value function is called a critic because it critiques the actor's actions. Actor-critic methods trade off the variance reduction of policy gradients with the bias introduced by value function methods [53]–[55]. Selected remarkable actor-critic works can be seen in [48], [56], [57].

E. DQN

The Deep Q-Network (DQN) algorithm combines training deep neural networks with reinforcement learning. DQN was shown to work directly from raw visual inputs on a wide variety of environments and succeeded in reaching superhuman level on Atari 2600 games [11], [58].

DQN escapes the fundamental instability problem of combining function approximation and reinforcement learning by the use of two techniques: experience replay [59] and target networks [60]. Both experience replay and target networks have been used in subsequent deep reinforcement learning works [61]–[63].

The experience replay memory stores cyclic transitions of the form (current state, current action, next state, reward), which enables the reinforcement learning agent to train on and sample from previously experienced interactions. This efficient utilization of previous experience, by learning from it multiple times, leads to better convergence behavior when training a function approximator, in addition to a massive reduction in the amount of interaction needed with the environment, resulting in reduced variance of the learning updates [64], [65].

While the original DQN algorithm of DeepMind used uniform sampling [58], in a later work DeepMind reported a more efficient learning algorithm that prioritizes samples based on TD errors [64]. Even though experience replay is typically known as a model-free technique, it is actually a simple model [65].

The second stabilizing technique is the target network; it breaks correlations between the Q-network and the target by freezing the policy network for a period of time. Instead of updating the TD error based on vacillating estimates of the Q-values, the policy network takes advantage of the fixed target network [60].

An MDP assumes that the agent has full insight into the current state, which is not realistic. Instead, a Partially Observable Markov Decision Process (POMDP) assumes that the agent receives partial observations of the state, and actions are taken based on a belief over the current state [66]. Utilizing recurrent neural networks (RNNs) [67], [68] is common with POMDPs due to the sequential nature of the problem and the fact that recurrent neural networks integrate information through time and replicate DQN's performance on partially observed features. Moreover, the deep recurrent Q-network (DRQN) can adapt to changes of observations better than DQN [69], [70].
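A minimal sketch of the two stabilizing techniques described above, a uniformly sampled replay memory and a periodically refreshed target network, is given below. The `QNet`-style interface (copy/predict/train) is a hypothetical placeholder and not DeepMind's implementation.

```python
import random
from collections import deque

# Sketch of DQN's two stabilizing ingredients: a replay memory sampled
# uniformly, and a target network that is only refreshed every few steps.
# The q-network object is assumed to expose copy()/predict()/train() methods;
# it stands in for whatever function approximator is actually used.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions evicted cyclically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)

def dqn_training_step(q_net, target_net, buffer, step, gamma=0.99, sync_every=1000):
    for state, action, reward, next_state, done in buffer.sample():
        # Bootstrapped target comes from the frozen target network
        target = reward
        if not done:
            target += gamma * max(target_net.predict(next_state))
        q_net.train(state, action, target)     # one gradient step toward target
    if step % sync_every == 0:
        target_net = q_net.copy()              # periodically refresh the target
    return target_net
```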
V. APPLICATIONS

Reinforcement learning has been applied to various fields such as electric power systems, healthcare, finance, robotics, marketing, natural language processing, transportation systems, and games.

Games offer an excellent environment for a reinforcement learning agent, since the agent can explore different trials in a virtual world where the cost of exploration is affordable [71]. Some impressive successes of reinforcement learning in games are discussed in section II. More examples are reported in [72]–[75], in addition to a survey of reinforcement learning in video games [76]. In the following sub-sections, more applications are discussed.

A. Hyperparameters Selection for Neural Networks

Determining a neural network architecture and selecting its hyperparameters passes through an iterative process of trial and evaluation; hence it can be formulated as a reinforcement learning problem.

Zoph designed a recurrent neural network (RNN) that generates the hyperparameters of neural networks. The RNN was trained with reinforcement learning by searching varying hyperparameter spaces to maximize the accuracy, where the accuracy of the generated model on a validation set is considered the reward signal. This approach achieved competitive results compared with state-of-the-art methods [77].
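The following sketch frames hyperparameter selection as a reinforcement learning loop in the spirit described above, with validation accuracy as the reward. The search space, the trivial random controller, and `train_and_validate` are illustrative assumptions; Zoph's controller is an RNN trained with a policy-gradient method, not a random sampler.

```python
import random

# Sketch of hyperparameter search framed as RL: a controller proposes a
# configuration (the action), the validation accuracy of the trained model is
# the reward, and the best-scoring configuration is kept.
# `train_and_validate` is a hypothetical stand-in for training the candidate.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [64, 128, 256],
    "num_layers": [1, 2, 3],
}

def search(train_and_validate, iterations=50):
    best_config, best_reward = None, float("-inf")
    for _ in range(iterations):
        # Trivial random controller used only for illustration.
        config = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        reward = train_and_validate(config)    # validation accuracy as reward
        if reward > best_reward:
            best_config, best_reward = config, reward
    return best_config, best_reward
```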



The concept was extended later to discovering optimization methods for deep neural networks [78], and it achieved better results than standard optimization methods such as Stochastic Gradient Descent (SGD) [4], Stochastic Gradient Descent with Momentum [4], Root Mean Square Propagation (RMSProp) [79], and adaptive moment estimation (Adam) [80].

B. Intelligent Transportation Systems

Intelligent Transportation Systems take advantage of the latest information technologies for handling traffic and facilitating transport networks [81].

Adaptive traffic signal control (ATSC) can lessen traffic congestion by dynamically adjusting signal timing plans in response to traffic fluctuations. A multi-agent reinforcement learning approach was proposed in [82] to solve the ATSC problem, where each controller (agent) is responsible for the control of the traffic lights around a single traffic junction. Multi-agent reinforcement learning combines game theory with single-agent reinforcement learning.

Some of the challenges that face the multi-agent reinforcement learning approach are the exploration-exploitation tradeoff, the curse of dimensionality, stability, and nonstationarity. The nonstationarity challenge of multi-agent learning emerges since multiple agents are learning simultaneously, so every single agent has a varying learning problem in which the agent's optimal policy changes as the policies of the other agents change [83].

Van der Pol and Oliehoek developed the previous work on the traffic light control problem by eliminating the simplifying assumptions and introducing a new reward function [84].

C. Natural Language Processing

Reinforcement learning has been utilized in various natural language processing tasks such as text generation [85], [86], machine translation [87]–[89], and conversation systems. In the following subsection, dialogue systems are discussed.

1) Dialogue Systems

Dialogue systems are programs that interact in natural language. Generally, they are classified into two categories: chatbots and task-oriented dialog agents. Task-oriented dialog agents interact through short conversations in a specific domain to help complete specific tasks. On the other hand, chatbots are designed to handle generic conversations and imitate human-to-human interactions [90].

The history of dialogue system development is categorized in generations: the first generation is a template-based system centered on rules designed by human experts, and the second generation is a data-driven system with some "light" machine learning techniques. Currently the third generation is data-driven with deep neural networks. Lately, combining reinforcement learning with the third generation has started to contribute to the development of dialogue systems [91].

The typical pipeline of task-oriented dialogue systems consists of several modules [92], [93]:

a) natural language understanding (NLU): assigns the user responses to a semantic representation.
b) the dialog state tracker (DST): accumulates the user input of the turn along with the dialogue history and determines the current state of the dialogue.
c) the dialogue policy: selects the next action based on the dialogue state.
d) natural language generation (NLG): the process of constructing computer systems that generate understandable natural language texts from underlying representations of information [94].

Fig. 7 Typical pipeline of task oriented dialogue systems [93]

There are two approaches for designing dialogue systems: the modular approach and the end-to-end approach. The difference between them is that the latter replaces some of the independent modules in figure 7 with a single end-to-end model [93].

Reinforcement learning is common in learning the dialogue policy in the modular approach [95]–[97]. In the end-to-end approach, using reinforcement learning contributes to enhancing the dialogue system performance by optimizing the system response [93], [98]–[101].

The Alexa Prize competition is a 2.5-million-dollar competition where university teams compete to build conversation bots that interact with users via text and sound [102]. Although the winner of the 2017 competition (Sounding Board by the University of Washington) did not use reinforcement learning [103], MILABOT, which achieved the highest average user score of 3.15 out of 5 in the competition, has utilized reinforcement learning. MILABOT was developed at the University of Montreal and it consists of a collection of 22 response models including natural language generation, template-based models, bag-of-words retrieval models, and sequence-to-sequence neural networks. When a set of possible responses is generated, reinforcement learning is used to find the optimal policy to select between the candidate responses. The optimal policy needs to balance between long-term and short-term user satisfaction [104].
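A minimal sketch of the response-selection step described above is given below: each candidate reply is treated as an action and scored by a learned value model, with a small amount of exploration. `score_model` and `featurize` are hypothetical placeholders, not MILABOT's actual components.

```python
import random

# Sketch of RL-based response selection: each candidate reply is an action,
# a learned scorer estimates its long-term value for the dialogue, and the
# policy picks among candidates (with occasional exploration).
# `score_model` and `featurize` are hypothetical placeholders.
def select_response(dialogue_history, candidates, score_model, featurize,
                    epsilon=0.05):
    if random.random() < epsilon:
        return random.choice(candidates)        # occasional exploration
    scored = [
        (score_model(featurize(dialogue_history, reply)), reply)
        for reply in candidates
    ]
    return max(scored, key=lambda pair: pair[0])[1]   # greedy by estimated value
```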



A sample of works that applied reinforcement learning to natural language processing includes DeepMind's work on using reinforcement learning to compute representations of natural language sentences by learning tree-structured neural networks [105].

VI. PROSPECTS AND CHALLENGES

RL has vast unexplored territories as well as many unanswered questions in the explored ones; some of the typical challenges in RL include evaluative feedback, non-stationarity, and delayed rewards [9].

Achieving general AI requires multi-task learning [106], where an agent is able to perform many different types of tasks rather than specializing in a few similar tasks. Multi-task learning is one of the challenges that RL aims to solve.

Computational power plays a significant role in driving progress in reinforcement learning research. In addition to developing efficient algorithms, parallel processing by modern hardware leads to increased throughput. For example, the final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs [14]. However, limited computational power may hinder some techniques such as tabular methods, exhaustive search, and Monte Carlo.

Yann LeCun [71] pointed out that pure model-free RL requires many trials to learn anything. While trial and error is acceptable in simulated environments, it may not be the case in the real world. Hence, we need model-based (see [1]) reinforcement learning with the ability to predict by simulating the world, which would reach a compact policy with no run-time optimization. Nonetheless, while simulators and model-based reinforcement learning may be a clue for safe exploration, effective exploration remains a challenge; considering the chance that an agent succeeds in assembling a car from scratch with random behavior, the odds seem not to be in the agent's favor.

VII. CONCLUSION

Reinforcement learning is a huge topic, with a long history, a wide range of applications, an elegant theoretical core, distinguished successes, novel algorithms, and many open problems.

More research in artificial intelligence is looking for general principles of learning, decision making, and search, in conjunction with integrating a wide range of domain knowledge. Research on reinforcement learning is a driving force toward simpler and fewer general principles of artificial intelligence [9].

REFERENCES

[1] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Pearson, 2009.
[2] A. M. Turing, "Computing machinery and intelligence," in Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer, 1950, pp. 23–65.
[3] S. M. Shieber, The Turing Test: Verbal Behavior as the Hallmark of Intelligence. MIT Press, 2004.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 1st ed. Cambridge, MA: MIT Press, 2016.
[6] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU Computing," Proc. IEEE, vol. 96, pp. 879–899, 2008.
[7] K. Sato, C. Young, and D. Patterson, "An in-depth look at Google's first Tensor Processing Unit (TPU)." [Online]. Available: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu. [Accessed: 25-Dec-2017].
[8] R. R. Schaller, "Moore's law: past, present and future," IEEE Spectr., vol. 34, no. 6, pp. 52–59, 1997.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2017.
[10] D. Silver, "Deep reinforcement learning," in International Conference on Machine Learning (ICML), 2016.
[11] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning," in Conference on Neural Information Processing Systems, 2013, pp. 1–9.
[12] DeepMind, "About Us." [Online]. Available: https://deepmind.com/about/. [Accessed: 14-Jan-2018].
[13] G. Tesauro, "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play," Neural Comput., vol. 6, no. 2, pp. 215–219, 1994.
[14] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[15] D. Silver et al., "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm," London, 2017.
[16] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, "An application of reinforcement learning to aerobatic helicopter flight," Adv. Neural Inf. Process. Syst., vol. 19, p. 1, 2007.
[17] A. Y. Ng et al., "Autonomous inverted helicopter flight via reinforcement learning," Springer Tracts Adv. Robot., vol. 21, pp. 363–372, 2006.
[18] G. Tesauro, D. C. Gondek, J. Lenchner, J. Fan, and J. M. Prager, "Analysis of Watson's strategies for playing Jeopardy!," J. Artif. Intell. Res., vol. 47, pp. 205–251, 2013.
[19] R. Bellman, "On the Theory of Dynamic Programming," Proc. Natl. Acad. Sci. U. S. A., vol. 38, no. 8, pp. 716–719, 1952.
[20] R. E. Bellman and S. E. Dreyfus, "Applied Dynamic Programming," Ann. Math. Stat., vol. 33, no. 2, pp. 719–726, 1962.



[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3–4, pp. 279–292, 1992.
[22] G. A. Rummery and M. Niranjan, "On-line Q-learning using Connectionist Systems," University of Cambridge, 1994.
[23] P. J. Werbos, "Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research," IEEE Trans. Syst. Man Cybern., vol. 17, no. 1, pp. 7–20, 1987.
[24] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834–846, 1983.
[25] C. J. C. H. Watkins, "Learning from delayed rewards," University of Cambridge, 1989.
[26] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A Brief Survey of Deep Reinforcement Learning," IEEE Signal Process. Mag., Special Issue on Deep Learning for Image Understanding, pp. 1–14, 2017.
[27] P. Dayan, "The Convergence of TD(λ) for General λ," Mach. Learn., vol. 8, no. 3, pp. 341–362, 1992.
[28] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Automat. Contr., vol. 42, no. 5, pp. 674–690, 1997.
[29] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences," Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
[30] G. J. Gordon, "Stable Function Approximation in Dynamic Programming," Proc. 12th Int. Conf. Mach. Learn., pp. 261–268, 1995.
[31] L. Lin, "Reinforcement Learning for Robots Using Neural Networks," technical report, pp. 1–155, 1993.
[32] H. Van Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1–9.
[33] H. Yu, "Convergence of Least Squares Temporal Difference Methods Under General Conditions," in International Conference on Machine Learning, 2010, pp. 1207–1214.
[34] A. R. Mahmood and R. S. Sutton, "Off-policy learning based on weighted importance sampling with linear computational complexity," in Proc. Thirty-First Conf. Uncertainty in Artificial Intelligence (UAI), Amsterdam, Netherlands, 2015, pp. 552–561.
[35] H. R. Maei, "Gradient Temporal-Difference Learning Algorithms," Mach. Learn., 2011.
[36] S. J. Bradtke and A. G. Barto, "Linear Least-Squares algorithms for temporal difference learning," Mach. Learn., vol. 22, no. 1–3, pp. 33–57, 1996.
[37] H. R. Maei and R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces," in Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10), 2010.
[38] P. Abbeel and J. Schulman, "Deep Reinforcement Learning through Policy Optimization," in Neural Information Processing Systems, 2016.
[39] M. Minsky, "Steps toward Artificial Intelligence," Proc. IRE, vol. 49, no. 1, pp. 8–30, 1961.
[40] J. A. Nelder and R. Mead, "A Simplex Method for Function Minimization," Comput. J., vol. 7, no. 4, pp. 308–313, 1965.
[41] F. Gomez, J. Koutník, and J. Schmidhuber, "Compressed network complexity search," in Lecture Notes in Computer Science, 2012, vol. 7491 LNCS, no. PART 1, pp. 316–326.
[42] A. Fraser, "Simulation of Genetic Systems by Automatic Digital Computers I. Introduction," Aust. J. Biol. Sci., vol. 10, no. 4, p. 484, 1957.
[43] M. P. Deisenroth, "A Survey on Policy Search for Robotics," Found. Trends Robot., vol. 2, no. 1–2, pp. 1–142, 2011.
[44] T. Salimans, J. Ho, X. Chen, and I. Sutskever, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning," 2017.
[45] R. J. Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.
[46] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber, "Recurrent policy gradients," Log. J. IGPL, vol. 18, no. 5, pp. 620–634, 2009.
[47] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," in International Conference on Machine Learning, 2015.
[48] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-Dimensional Continuous Control using Generalized Advantage Estimation," in International Conference on Learning Representations, 2016.
[49] G. Cuccu, M. Luciw, J. Schmidhuber, and F. Gomez, "Intrinsically motivated neuroevolution for vision-based reinforcement learning," in IEEE International Conference on Development and Learning (ICDL), 2011.
[50] F. Gomez and J. Schmidhuber, "Evolving Modular Fast-Weight Networks for Control," in ICANN, 2005.
[51] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, "Evolving large-scale neural networks for vision-based reinforcement learning," in Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation (GECCO '13), 2013, p. 1061.
[52] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[53] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," Adv. Neural Inf. Process. Syst. 12, pp. 1057–1063, 1999.
[54] J. Peters and S. Schaal, "Natural Actor-Critic," Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, 2008.
[55] V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[56] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016.
[57] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, "Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic," in International Conference on Learning Representations, 2017.
[58] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[59] L. J. Lin, "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching," Mach. Learn., vol. 8, no. 3, pp. 293–321, 1992.



[60] T. P. Lillicrap et al., "Continuous Control with Deep Reinforcement Learning," in International Conference on Learning Representations, 2016.
[61] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous Deep Q-Learning with Model-Based Acceleration," in International Conference on Learning Representations, 2016.
[62] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in International Conference on Learning Representations (ICLR), 2016, pp. 1–14.
[63] Z. Wang et al., "Sample Efficient Actor-Critic with Experience Replay," in International Conference on Learning Representations, 2017.
[64] V. Mnih et al., "Prioritized Experience Replay," Int. Conf. Mach. Learn., vol. 4, no. 7540, p. 14, 2015.
[65] H. van Seijen and R. S. Sutton, "A deeper look at planning as learning from replay," in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, pp. 2314–2322, 2015.
[66] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artif. Intell., vol. 101, no. 1–2, pp. 99–134, 1998.
[67] L. R. Medsker and L. C. Jain, Recurrent Neural Networks: Design and Applications. CRC Press, 2001.
[68] M. Boden, "A guide to recurrent neural networks and backpropagation," Electr. Eng., no. 2, pp. 1–10, 2001.
[69] N. Heess, J. Hunt, T. Lillicrap, and D. Silver, "Memory-Based Control with Recurrent Neural Networks," in Conference on Neural Information Processing Systems, 2015.
[70] M. Hausknecht and P. Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs," in AAAI Fall Symposium Series, 2017.
[71] Y. LeCun, "How Could Machines Learn as Efficiently as Animals and Humans?," 2017.
[72] S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, "A survey of real-time strategy game AI research and competition in StarCraft," IEEE Transactions on Computational Intelligence and AI in Games, vol. 5, no. 4, pp. 293–311, 2013.
[73] Y. Wu and Y. Tian, "Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning," Int. Conf. Learn. Represent., pp. 1–9, 2017.
[74] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, "ViZDoom: A Doom-based AI research platform for visual reinforcement learning," in IEEE Conference on Computational Intelligence and Games (CIG), 2017.
[75] E. Perot, M. Jaritz, M. Toromanoff, and R. De Charette, "End-to-End Driving in a Realistic Racing Game with Deep Reinforcement Learning," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2017, vol. 2017-July, pp. 474–475.
[76] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, "Deep Learning for Video Game Playing," arXiv, pp. 1–17, 2017.
[77] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[78] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, "Neural optimizer search with reinforcement learning," in 34th International Conference on Machine Learning (ICML), 2017.
[79] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, 2012.
[80] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization," Int. Conf. Learn. Represent., pp. 1–15, 2015.
[81] A. L. C. Bazzan and F. Klügl, Introduction to Intelligent Systems in Traffic and Transportation, vol. 7, no. 3, 2013.
[82] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto," IEEE Trans. Intell. Transp. Syst., vol. 14, no. 3, pp. 1140–1150, 2013.
[83] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 38, no. 2, pp. 156–172, 2008.
[84] E. Van Der Pol and F. A. Oliehoek, "Coordinated Deep Reinforcement Learners for Traffic Light Control," in NIPS'16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
[85] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," in International Conference on Learning Representations (ICLR), 2016.
[86] D. Bahdanau et al., "An actor-critic algorithm for sequence prediction," in International Conference on Learning Representations (ICLR), 2017.
[87] A. Sokolov, J. Kreutzer, C. Lo, and S. Riezler, "Learning Structured Predictors from Bandit Feedback for Interactive NLP," in Association for Computational Linguistics (ACL), 2016, pp. 1610–1620.
[88] A. C. Grissom II, J. Boyd-Graber, H. He, J. Morgan, and H. Daume III, "Don't Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation," in Empirical Methods in Natural Language Processing, 2014, pp. 1342–1352.
[89] Y. Wu et al., "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," arXiv e-prints, pp. 1–23, 2016.
[90] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Draft). New Jersey: Prentice-Hall, 2008.
[91] L. Deng, "Three Generations of Spoken Dialogue Systems (Bots)," in AI Frontiers, 2017.
[92] S. Young, "Using POMDPs for dialog management," in Spoken Language Technology Workshop, IEEE, 2006, pp. 8–13.
[93] T. Zhao and M. Eskenazi, "Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning," in Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), 2016.
[94] Z. Xie, "Neural Text Generation: A Practical Guide," 2017.
[95] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Comput. Speech Lang., vol. 21, no. 2, pp. 393–422, 2007.
[96] S. Lee and M. Eskenazi, "POMDP-based Let's Go system for spoken dialog challenge," in 2012 IEEE Workshop on Spoken Language Technology (SLT 2012) - Proceedings, 2012, pp. 61–66.



[97] K. Georgila and D. Traum, "Reinforcement learning of argumentation dialogue policies in negotiation," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2011, pp. 2073–2076.
[98] J. Li, W. Monroe, A. Ritter, and D. Jurafsky, "Deep Reinforcement Learning for Dialogue Generation," in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
[99] T.-H. Wen et al., "A network-based end-to-end trainable task-oriented dialogue system," arXiv:1604.04562, 2016.
[100] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models," in AAAI, 2016, p. 8.
[101] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Adv. Neural Inf. Process. Syst., pp. 3104–3112, 2014.
[102] A. Ram et al., "Conversational AI: The Science Behind the Alexa Prize," in Alexa Prize Proceedings, https://developer.amazon.com/alexaprize/proceedings, 2018.
[103] H. Fang et al., "Sounding Board – University of Washington's Alexa Prize Submission," in Alexa Prize Competition, 2017.
[104] I. V. Serban et al., "A Deep Reinforcement Learning Chatbot," preprint arXiv:1801.06700v1, 20 Jan 2018.
[105] D. Yogatama, P. Blunsom, C. Dyer, E. Grefenstette, and W. Ling, "Learning to compose words into sentences with reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[106] R. Caruana, "Multitask Learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
