
Letter | Published: 25 February 2015

Human-level control through deep reinforcement learning


Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis

Nature 518, 529–533 (2015)


Abstract

The theory of reinforcement learning provides a normative account [1], deeply rooted in psychological [2] and neuroscientific [3] perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems [4,5], the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms [3]. While reinforcement learning agents have achieved some successes in a variety of domains [6,7,8], their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks [9,10,11] to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games [12]. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
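
To make the abstract's central object concrete, here is a minimal sketch of a deep Q-network in PyTorch: a convolutional network that maps a stack of four preprocessed 84 x 84 game frames to one estimated action value per joystick action. The layer sizes follow the architecture reported in the paper; the framework choice and everything surrounding the network are illustrative assumptions, not the authors' code.

```python
# Minimal PyTorch sketch of a deep Q-network (DQN). Layer sizes follow the
# paper's reported architecture; all surrounding choices are illustrative.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), raw pixel intensities in [0, 255]
        return self.net(frames / 255.0)
```

Given a state, acting greedily then amounts to an argmax over the network's outputs; during training the paper's agent instead follows an ε-greedy policy so that it keeps exploring.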


References

1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998).
2. Thorndike, E. L. Animal Intelligence: Experimental Studies (Macmillan, 1911).
3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005).
5. Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
6. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
7. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009).
8. Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008).
9. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009).
10. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012).
11. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
12. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
13. Legg, S. & Hutter, M. Universal intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007).
14. Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005).
15. Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell. 864–871 (2012).
16. McClelland, J. L., Rumelhart, D. E. & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986).
17. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
18. Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat's striate cortex. J. Physiol. 165, 559–568 (1963).
19. Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
20. Tsitsiklis, J. N. & Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997).
21. McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995).
22. O'Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010).
23. Lin, L.-J. Reinforcement Learning for Robots Using Neural Networks. Technical report, DTIC Document (1993).
24. Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005).
25. Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010).
27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009).
28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002).
29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012).
30. Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993).
31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1998).

Acknowledgements

We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.

Author information

Volodymyr Mnih, Koray Kavukcuoglu and David Silver: These authors contributed
equally to this work.

Authors and Affiliations


Google DeepMind, 5 New Street Square, London EC4A 3TW, UK,
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K.
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane
Legg & Demis Hassabis

Contributions
V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem
and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the
algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing
platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S.,
A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Corresponding authors

Correspondence to Koray Kavukcuoglu or Demis Hassabis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Extended data figures and tables

Extended Data Figure 1 Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders.

The plot was generated by running the t-SNE algorithm [25] on the last hidden layer
representation assigned by DQN to game states experienced during a combination of
human (30 min) and agent (2 h) play. The fact that there is similar structure in the
two-dimensional embeddings corresponding to the DQN representation of states
experienced during human play (orange points) and DQN play (blue points) suggests
that the representations learned by DQN do indeed generalize to data generated from
policies other than its own. The presence in the t-SNE embedding of overlapping
clusters of points corresponding to the network representation of states experienced
during human and agent play shows that the DQN agent also follows sequences of
states similar to those found in human play. Screenshots corresponding to selected
states are shown (human: orange border; DQN: blue border).
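
As a rough illustration of how a figure like this can be reproduced (a sketch under stated assumptions, not the authors' pipeline), the snippet below projects saved hidden-layer activations to two dimensions with scikit-learn's t-SNE and colours each point by whether a human or the agent generated the state. The file names, array shapes and perplexity value are assumptions.

```python
# Hedged sketch: 2-D t-SNE embedding of last-hidden-layer activations,
# coloured by whether the state came from human or agent play. File names
# and shapes are assumptions; activation collection is not shown.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

hidden = np.load("hidden_activations.npy")       # (n_states, 512) activations
is_human = np.load("is_human.npy").astype(bool)  # True where a human played

embedding = TSNE(n_components=2, perplexity=30).fit_transform(hidden)

plt.scatter(*embedding[~is_human].T, s=2, c="tab:blue", label="DQN play")
plt.scatter(*embedding[is_human].T, s=2, c="tab:orange", label="human play")
plt.legend()
plt.show()
```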

Extended Data Figure 2 Visualization of learned value functions on two games, Breakout and Pong.
a, A visualization of the learned value function on the game Breakout. At time points 1
and 2, the state value is predicted to be ∼17 and the agent is clearing the bricks at the
lowest level. Each of the peaks in the value function curve corresponds to a reward
obtained by clearing a brick. At time point 3, the agent is about to break through to
the top level of bricks and the value increases to ∼21 in anticipation of breaking out
and clearing a large set of bricks. At point 4, the value is above 23 and the agent has
broken through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value function
on the game Pong. At time point 1, the ball is moving towards the paddle controlled
by the agent on the right side of the screen and the values of all actions are around 0.7,
reflecting the expected value of this state based on previous experience. At time
point 2, the agent starts moving the paddle towards the ball and the value of the ‘up’
action stays high while the value of the ‘down’ action falls to −0.9. This reflects the fact
that pressing ‘down’ would lead to the agent losing the ball and incurring a reward of
−1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward
keeps increasing until time point 4, when the ball reaches the left edge of the screen
and the value of all actions reflects that the agent is about to receive a reward of 1. Note that the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, it is not shown during the game). With permission from Atari Interactive, Inc.
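
As a hedged sketch of how such value traces can be plotted, the snippet below evaluates a trained network on every recorded state of an episode and plots the greedy value V(s) = max_a Q(s, a) against time. It assumes a trained `DQN` instance `q_net` (see the sketch following the abstract) and a list `episode_states` of preprocessed frame stacks; both names are illustrative.

```python
# Hedged sketch: plot the predicted state value V(s) = max_a Q(s, a) at each
# timestep of one recorded episode. `q_net` and `episode_states` are assumed.
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def value_trace(q_net, episode_states):
    values = []
    for state in episode_states:               # state: (4, 84, 84) frame stack
        q = q_net(state.unsqueeze(0).float())  # add batch dim -> (1, n_actions)
        values.append(q.max().item())          # greedy value V(s) = max_a Q(s, a)
    return values

plt.plot(value_trace(q_net, episode_states))
plt.xlabel("timestep")
plt.ylabel("predicted value V(s)")
plt.show()
```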

Extended Data Table 1 List of hyperparameters and their values

Extended Data Table 2 Comparison of game scores obtained by DQN agents with methods from the literature [12,15] and a professional human games tester

Extended Data Table 3 The effects of replay and separating the target Q-network
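
The two mechanisms this ablation isolates, experience replay and a separate target network, can be wired together as in the sketch below (reusing the `DQN` class from the sketch after the abstract). The buffer size, optimizer and learning rate are illustrative placeholders rather than the paper's exact settings, and transitions are assumed to be stored as tuples of tensors.

```python
# Hedged sketch of DQN's two stabilizers: an experience-replay buffer and a
# periodically synchronized target network. Hyperparameters are placeholders.
import random
from collections import deque

import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)  # holds (s, a, r, s_next, done) tensor tuples
q_net, target_net = DQN(n_actions=4), DQN(n_actions=4)
target_net.load_state_dict(q_net.state_dict())  # start in sync
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
gamma = 0.99  # discount factor

def train_step(batch_size: int = 32) -> None:
    batch = random.sample(replay, batch_size)  # break temporal correlations
    s, a, r, s_next, done = map(torch.stack, zip(*batch))
    with torch.no_grad():  # TD targets come from the frozen target network
        y = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, y)  # Huber loss, akin to the paper's error clipping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every fixed number of steps, copy the online weights into the target
# network: target_net.load_state_dict(q_net.state_dict())
```

Without these two ingredients, the same update would bootstrap from rapidly shifting estimates on strongly correlated samples, which is the degradation this table quantifies.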

Extended Data Table 4 Comparison of DQN performance with a linear function approximator

Supplementary information

Supplementary Information
This file contains a Supplementary Discussion. (PDF 110 kb)

Performance of DQN in the Game Space Invaders

This video shows the performance of the DQN agent while playing the game of Space Invaders. The DQN agent successfully clears the enemy ships on the screen while the enemy ships move down and sideways with gradually increasing speed. (MOV 5106 kb)

Demonstration of Learning Progress in the Game Breakout

This video shows the improvement in the performance of DQN over training (i.e. after 100, 200, 400 and 600 episodes). After 600 episodes DQN finds and exploits the optimal strategy in this game, which is to make a tunnel around the side, and then allow the ball to hit blocks by bouncing behind the wall. Note: the score is displayed at the top left of the screen (maximum for clearing one screen is 448 points), the number of lives remaining is shown in the middle (starting with 5 lives), and the “1” on the top right indicates this is a 1-player game. (MOV 1500 kb)

PowerPoint slides

PowerPoint slide for Fig. 1

PowerPoint slide for Fig. 2

PowerPoint slide for Fig. 3

PowerPoint slide for Fig. 4


About this article

Cite this article


Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236

Received: 10 July 2014
Accepted: 16 January 2015
Published: 25 February 2015
Issue Date: 26 February 2015
DOI: https://doi.org/10.1038/nature14236

Subjects: Computer science

This article is cited by

Comparing explanations in RL
Britt Davis Pierson, Dustin Arendt ... Matthew E. Taylor

Neural Computing and Applications (2024)

Multi-objective intelligent clustering routing schema for internet of things enabled wireless sensor networks using deep reinforcement learning
Walid K. Ghamry & Suzan Shukry

Cluster Computing (2024)

Nature ISSN 1476-4687 (online); ISSN 0028-0836 (print)
