A Concise Introduction To Reinforcement Learning (February 2018)
Ahmad Hammoudeh
RLR, Data Science programme
Amman, Jordan
[email protected]
Abstract— This paper aims to introduce, review, and summarize several works and research papers on Reinforcement Learning.

Reinforcement learning is an area of Artificial Intelligence; it has emerged as an effective tool for building artificially intelligent systems and solving sequential decision-making problems. Reinforcement learning has achieved many impressive breakthroughs in recent years and has surpassed human level in many fields; it is able to play and win various games. Historically, reinforcement learning was effective in solving some control system problems. Nowadays, it has a growing range of applications.

This work includes an introduction to reinforcement learning which demonstrates the intuition behind reinforcement learning in addition to its main concepts. After that, the remarkable successes of reinforcement learning are highlighted. Then, the methods and details for solving reinforcement learning problems are summarized. Thereafter, a wide collection of works on various reinforcement learning applications is reviewed. Finally, the prospects and challenges of reinforcement learning are discussed.

Keywords— Reinforcement Learning; Artificial Intelligence; Machine learning

I. INTRODUCTION

The rise of Artificial Intelligence is associated with the achievements of Deep Learning in recent years. Deep Learning is essentially a set of multiple layers of neural networks connected to each other. While deep learning algorithms are essentially the same as those used in the late 1980s [4], deep learning progress has been driven by the development of computational power and the tremendous increase of both generated and collected data [5]. Shifting from the CPU (Central Processing Unit) to the GPU (Graphics Processing Unit) [6] and later to the TPU (Tensor Processing Unit) [7] accelerated processing speed and opened the door for more successes. However, computational capabilities are bounded by Moore's law [8], which may slow down the building of strong AI systems [9].

Reinforcement learning is learning through interaction with an environment by taking different actions and experiencing many failures and successes while trying to maximize the received rewards; the agent is not told which action to take. Reinforcement learning is similar to natural learning processes, where a teacher or supervisor is not available and learning evolves through trial and error. It differs from supervised learning, in which an agent needs to be told what the correct action is for every position it encounters [1], [9].
Among the remarkable successes of reinforcement learning, DeepMind's program learned to play 49 different games from self-play and the game score, using the same algorithm without tuning to a particular game. It mapped raw screen pixels to a Deep Q-Network. DeepMind published the details of a program that plays Atari at a professional level in 2013 [11]. Later, at the beginning of 2014, Google acquired DeepMind [12].

Gerald Tesauro from IBM developed a program that plays backgammon using reinforcement learning. Tesauro's program was able to surpass human players [13]. However, scaling this success to more complicated games proved to be difficult until 2016, when DeepMind trained the AlphaGo program using reinforcement learning and succeeded in defeating Go's world champion, Lee Sedol. Go is an ancient Chinese game which had been much harder for computers to master [14].

One of the notable achievements in the field of reinforcement learning is AlphaZero, which achieved a superhuman level in more than one game: chess, Shogi (Japanese chess), and Go, within 24 hours of self-play. The breakthrough is that it reuses the same hyperparameters for all games without game-specific tuning [15].

In the field of control, Andrew Ng and Pieter Abbeel succeeded in flying a helicopter automatically in 2006. They had a pilot fly the helicopter to help find a model of the helicopter dynamics and a reward function. They then used reinforcement learning to find an optimized controller for the resulting model and reward function [16], [17].

Formally, the return G_t is the total reward R accumulated from time-step t; it is the sum of the immediate reward and the discounted future rewards [9], [10]:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...        (1)

or, equivalently,

G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}        (2)

The discount γ represents the degradation factor of future rewards when they are evaluated at present; γ ranges between 0 and 1. However, the use of a discount is sometimes controversial.

The action-value function q(s, a) is the expected return when starting from a state s and taking an action a. Equation 3 shows the mathematical formulation [9], [10]:

q(s, a) = E[ G_t | S_t = s, A_t = a ]        (3)
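To make the return and the discount concrete, the following minimal Python sketch (an illustration added here, not code from the paper) computes the discounted return of Equations 1 and 2 for a short, hypothetical reward sequence, using the identity G_t = R_{t+1} + γ G_{t+1}:

    def discounted_return(rewards, gamma):
        # Computes G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite episode.
        g = 0.0
        for r in reversed(rewards):          # work backwards through the episode
            g = r + gamma * g                # recursive identity: G_t = R_{t+1} + gamma * G_{t+1}
        return g

    rewards = [1.0, 0.0, 0.0, 5.0]           # hypothetical rewards R_{t+1}, ..., R_{t+4}
    print(discounted_return(rewards, 0.9))   # 1 + 0.9**3 * 5 = 4.645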
The model of the environment allows predictions to be made about how the environment will behave. However, a model is an optional element of reinforcement learning: methods that use models and planning are called model-based methods. On the other side are model-free methods, in which the agent does not have a model of the environment; model-free methods are explicitly trial-and-error learners [9], [10].
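The following toy sketch (an illustration with assumed transition probabilities, rewards, and value estimates, not an algorithm from the paper) contrasts the two settings: with a model, an update can be computed as an expectation over known transitions, whereas a model-free learner must update from sampled experience.

    import random

    # Assumed toy model: from state "s", action "a" leads to two successor states
    # with known probabilities and rewards; v holds current state-value estimates.
    P = {("s", "a"): [(0.8, "s1", 1.0), (0.2, "s2", 0.0)]}   # (probability, next state, reward)
    v = {"s1": 2.0, "s2": 0.5}
    gamma = 0.9

    # Model-based: a planning backup uses the model to compute an exact expectation.
    q_planned = sum(p * (r + gamma * v[s2]) for p, s2, r in P[("s", "a")])

    # Model-free: without the model, the agent samples one transition and learns from it.
    probs = [p for p, _, _ in P[("s", "a")]]
    _, s2, r = random.choices(P[("s", "a")], weights=probs)[0]
    q_sampled = r + gamma * v[s2]                            # a noisy, trial-and-error estimate

    print(q_planned, q_sampled)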
The main property of a Markov process is that the future is independent of the past given the present. Equation 4 formulates the Markov property: the probability of the next state depends only on the current state, regardless of the past states S_1, S_2, ..., S_{t-1} [10]:

P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, S_2, ..., S_t ]        (4)

The Bellman optimality equation (Equation 7) takes advantage of the recursive form of the action-value function q(s, a), which can be rewritten as below (Equation 6) [9], [10]:

q(s, a) = E[ R_{t+1} + γ q(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]        (6)

The Bellman optimality equation is non-linear, and in general no closed-form solution is available. However, many iterative solution methods have been reported, such as value iteration, policy iteration, Q-learning [21], and SARSA [22].

A. Tabular methods

Q-learning is a simple reinforcement learning algorithm that learns long-term optimal behavior without any model of the environment [25]. The idea behind the Q-learning algorithm is that the current estimate of Q can be used to improve the estimated solution (bootstrapping) [21]:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]        (10)

where α is the learning rate and γ is the discount factor.
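A minimal tabular Q-learning sketch is given below to illustrate the update in Equation 10. The environment interface (env.reset(), env.step(a), env.actions) and the hyperparameter values are assumptions made for this illustration, not details taken from the paper.

    from collections import defaultdict
    import random

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular Q-learning following the update rule in Equation 10.
        Q = defaultdict(float)                    # Q[(state, action)], initialised to zero
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy behaviour: explore occasionally, otherwise act greedily.
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)
                # Bootstrapped target: reward plus the discounted greedy value of the next state.
                target = r + gamma * max(Q[(s_next, act)] for act in env.actions) * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q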
1) Dialogue Systems

Dialogue systems are programs that interact using natural language. Generally, they are classified into two categories: chatbots and task-oriented dialog agents. Task-oriented dialog agents interact through short conversations in a specific domain to help complete specific tasks. Chatbots, on the other hand, are designed to handle generic conversations and imitate human-to-human interactions [90].

The history of dialogue system development can be categorized into generations: the first generation was template-based systems built on rules designed by human experts; the second generation was data-driven systems with some "light" machine learning techniques; the current third generation is data-driven with deep neural networks. Lately, combining reinforcement learning with the third generation has started to contribute to the development of dialogue systems [91].

Fig. 7 Typical pipeline of task-oriented dialogue systems [93]

The Alexa Prize is a 2.5-million-dollar competition in which university teams compete to build conversation bots that interact with users via text and sound [102]. Although the winner of the 2017 competition (Sounding Board, by the University of Washington) did not use reinforcement learning [103], MILABOT, which achieved the highest average user score in the competition (3.15 out of 5), did utilize reinforcement learning. MILABOT was developed at the University of Montreal and consists of a collection of 22 response models, including natural language generation, template-based models, bag-of-words retrieval models, and sequence-to-sequence neural networks. When a set of possible responses has been generated, reinforcement learning is used to find the optimal policy for selecting between the candidate responses. The optimal policy
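As a rough illustration of selecting between candidate responses with a learned policy (a simplified sketch with hypothetical features and a linear scorer; it is not MILABOT's actual implementation), the agent can score each candidate with an estimated action value and pick the best one, keeping some exploration during training:

    import random

    # Simplified sketch of reinforcement-learning-based response selection (illustrative only;
    # the feature vectors and the linear scoring model are assumptions, not MILABOT's code).
    def select_response(candidates, weights, epsilon=0.05):
        # candidates: list of (response_text, feature_vector) produced by the response models.
        def score(features):                                  # learned action-value estimate
            return sum(w * f for w, f in zip(weights, features))
        if random.random() < epsilon:                         # occasional exploration during training
            return random.choice(candidates)[0]
        return max(candidates, key=lambda c: score(c[1]))[0]  # greedy selection otherwise

    candidates = [("Hello! How can I help?", [0.2, 0.9]), ("I do not understand.", [0.7, 0.1])]
    print(select_response(candidates, weights=[0.5, 1.0]))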
REFERENCES

[30] G. J. Gordon, “Stable Function Approximation in Dynamic Programming,” in Proc. 12th Int. Conf. Mach. Learn., 1995, pp. 261–268.
[31] L. Lin, “Reinforcement Learning for Robots Using Neural Networks,” Technical report, Carnegie Mellon University, pp. 1–155, 1993.
[32] H. van Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1–9.
[33] H. Yu, “Convergence of Least Squares Temporal Difference Methods Under General Conditions,” in International Conference on Machine Learning, 2010, pp. 1207–1214.
[34] A. R. Mahmood and R. S. Sutton, “Off-policy learning based on weighted importance sampling with linear computational complexity,” in Proc. Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), Amsterdam, Netherlands, 2015, pp. 552–561.
[35] H. R. Maei, “Gradient Temporal-Difference Learning Algorithms,” PhD thesis, University of Alberta, 2011.
[36] S. J. Bradtke and A. G. Barto, “Linear Least-Squares algorithms for temporal difference learning,” Mach. Learn., vol. 22, no. 1–3, pp. 33–57, 1996.
[37] H. R. Maei and R. S. Sutton, “GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces,” in Proc. 3rd Conference on Artificial General Intelligence (AGI-10), 2010.
[38] P. Abbeel and J. Schulman, “Deep Reinforcement Learning through Policy Optimization,” in Neural Information Processing Systems, 2016.
[39] M. Minsky, “Steps toward Artificial Intelligence,” Proc. IRE, vol. 49, no. 1, pp. 8–30, 1961.
[40] J. A. Nelder and R. Mead, “A Simplex Method for Function Minimization,” Comput. J., vol. 7, no. 4, pp. 308–313, 1965.
[49] G. Cuccu, M. Luciw, J. Schmidhuber, and F. Gomez, “Intrinsically motivated neuroevolution for vision-based reinforcement learning,” in IEEE International Conference on Development and Learning (ICDL), 2011.
[50] F. Gomez and J. Schmidhuber, “Evolving Modular Fast-Weight Networks for Control,” in ICANN, 2005.
[51] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Proc. 15th Annual Conference on Genetic and Evolutionary Computation (GECCO ’13), 2013, p. 1061.
[52] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[53] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” in Adv. Neural Inf. Process. Syst. 12, 1999, pp. 1057–1063.
[54] J. Peters and S. Schaal, “Natural Actor-Critic,” Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, 2008.
[55] V. R. Konda and J. N. Tsitsiklis, “Actor-Critic Algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[56] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016.
[57] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic,” in International Conference on Learning Representations, 2017.
[58] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[59] L. J. Lin, “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching,” Mach. Learn., vol. 8, no. 3, pp. 293–321, 1992.
[60] T. P. Lillicrap et al., “Continuous Control with Deep Reinforcement Learning,” in International Conference on Learning Representations, 2016.
[61] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-Based Acceleration,” in International Conference on Learning Representations, 2016.
[62] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016, pp. 1–14.
[63] Z. Wang et al., “Sample Efficient Actor-Critic with Experience Replay,” in International Conference on Learning Representations, 2017.
[64] V. Mnih et al., “Prioritized Experience Replay,” Int. Conf. Mach. Learn., vol. 4, no. 7540, p. 14, 2015.
[65] H. van Seijen and R. S. Sutton, “A deeper look at planning as learning from replay,” in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, 2015, pp. 2314–2322.
[66] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1–2, pp. 99–134, 1998.
[67] L. C. Medsker and L. R. Jain, Recurrent Neural Networks: Design and Applications. CRC Press, 2001.
[68] M. Boden, “A guide to recurrent neural networks and backpropagation,” Electr. Eng., no. 2, pp. 1–10, 2001.
[69] N. Heess, J. Hunt, T. Lillicrap, and D. Silver, “Memory-Based Control with Recurrent Neural Networks,” in Conference on Neural Information Processing Systems, 2015.
[70] M. Hausknecht and P. Stone, “Deep Recurrent Q-Learning for Partially Observable MDPs,” in AAAI Fall Symposium Series, 2017.
[71] Y. LeCun, “How Could Machines Learn as Efficiently as Animals and Humans?,” 2017.
[72] S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, “A survey of real-time strategy game AI research and competition in StarCraft,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 5, no. 4, pp. 293–311, 2013.
[73] Y. Wu and Y. Tian, “Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning,” in Int. Conf. Learn. Represent., 2017, pp. 1–9.
[74] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “ViZDoom: A Doom-based AI research platform for visual reinforcement learning,” in IEEE Conference on Computational Intelligence and Games (CIG), 2017.
[75] E. Perot, M. Jaritz, M. Toromanoff, and R. De Charette, “End-to-End Driving in a Realistic Racing Game with Deep Reinforcement Learning,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 474–475.
[76] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep Learning for Video Game Playing,” arXiv, pp. 1–17, 2017.
[77] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
[78] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, “Neural optimizer search with reinforcement learning,” in 34th International Conference on Machine Learning (ICML), 2017.
[79] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[80] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Int. Conf. Learn. Represent., 2015, pp. 1–15.
[81] A. L. C. Bazzan and F. Klügl, Introduction to Intelligent Systems in Traffic and Transportation, vol. 7, no. 3, 2013.
[82] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 3, pp. 1140–1150, 2013.
[83] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 38, no. 2, pp. 156–172, 2008.
[84] E. Van Der Pol and F. A. Oliehoek, “Coordinated Deep Reinforcement Learners for Traffic Light Control,” in NIPS’16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
[85] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
[86] D. Bahdanau et al., “An actor-critic algorithm for sequence prediction,” in International Conference on Learning Representations (ICLR), 2017.
[87] A. Sokolov, J. Kreutzer, C. Lo, and S. Riezler, “Learning Structured Predictors from Bandit Feedback for Interactive NLP,” in Association for Computational Linguistics (ACL), 2016, pp. 1610–1620.
[88] A. C. Grissom II, J. Boyd-Graber, H. He, J. Morgan, and H. Daume III, “Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation,” in Empirical Methods in Natural Language Processing, 2014, pp. 1342–1352.
[89] Y. Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” arXiv e-prints, pp. 1–23, 2016.
[90] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Draft). New Jersey: Prentice-Hall, 2008.
[91] L. Deng, “Three Generations of Spoken Dialogue Systems (Bots),” in AI Frontiers, 2017.
[92] S. Young, “Using POMDPs for dialog management,” in IEEE Spoken Language Technology Workshop, 2006, pp. 8–13.
[93] T. Zhao and M. Eskenazi, “Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning,” in Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), 2016.
[94] Z. Xie, “Neural Text Generation: A Practical Guide,” 2017.
[95] J. D. Williams and S. Young, “Partially observable Markov decision processes for spoken dialog systems,” Comput. Speech Lang., vol. 21, no. 2, pp. 393–422, 2007.
[96] S. Lee and M. Eskenazi, “POMDP-based Let’s Go system for spoken dialog challenge,” in 2012 IEEE Workshop on Spoken Language Technology (SLT), 2012, pp. 61–66.
[102] A. Ram et al., “Conversational AI: The Science Behind the Alexa Prize,” in Alexa Prize Proceedings, https://developer.amazon.com/alexaprize/proceedings, 2018.
[106] R. Caruana, “Multitask Learning,” Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.