Michael Hu

The Art of Reinforcement Learning


Fundamentals, Mathematics, and Implementations
with Python
Michael Hu
Shanghai, Shanghai, China

ISBN 978-1-4842-9605-9 e-ISBN 978-1-4842-9606-6


https://doi.org/10.1007/978-1-4842-9606-6

© Michael Hu 2023

Apress Standard

The use of general descriptive names, registered names, trademarks,
service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors, and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress
Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
To my beloved family,
This book is dedicated to each of you, who have been a constant source of
love and support throughout my writing journey.
To my hardworking parents, whose tireless efforts in raising us have been
truly remarkable. Thank you for nurturing my dreams and instilling in
me a love for knowledge. Your unwavering dedication has played a
pivotal role in my accomplishments.
To my sisters and their children, your presence and love have brought
immense joy and inspiration to my life. I am grateful for the laughter and
shared moments that have sparked my creativity.
And to my loving wife, your consistent support and understanding have
been my guiding light. Thank you for standing by me through the highs
and lows, and for being my biggest cheerleader.
—Michael Hu
Preface
Reinforcement learning (RL) is a highly promising yet challenging
subfield of artificial intelligence (AI) that plays a crucial role in shaping
the future of intelligent systems. From robotics and autonomous agents
to recommendation systems and strategic decision-making, RL enables
machines to learn and adapt through interactions with their
environment. Its remarkable success stories include RL agents
achieving human-level performance in video games and even
surpassing world champions in strategic board games like Go. These
achievements highlight the immense potential of RL in solving complex
problems and pushing the boundaries of AI.
What sets RL apart from other AI subfields is its fundamental
approach: agents learn by interacting with the environment, mirroring
how humans acquire knowledge. However, RL poses challenges that
distinguish it from other AI disciplines. Unlike methods that rely on
precollected training data, RL agents generate their own training
samples. These agents are not explicitly instructed on how to achieve a
goal; instead, they receive state representations of the environment and
a reward signal, forcing them to explore and discover optimal strategies
on their own. Moreover, RL involves complex mathematics that
underpin the formulation and solution of RL problems.
While numerous books on RL exist, they typically fall into two
categories. The first category emphasizes the fundamentals and
mathematics of RL, serving as reference material for researchers and
university students. However, these books often lack implementation
details. The second category focuses on practical hands-on coding of RL
algorithms, neglecting the underlying theory and mathematics. This
apparent gap between theory and implementation prompted us to
create this book, aiming to strike a balance by equally emphasizing
fundamentals, mathematics, and the implementation of successful RL
algorithms.
This book is designed to be accessible and informative for a diverse
audience. It is targeted toward researchers, university students, and
practitioners seeking a comprehensive understanding of RL. By
following a structured approach, the book equips readers with the
necessary knowledge and tools to apply RL techniques effectively in
various domains.
The book is divided into four parts, each building upon the previous
one. Part I focuses on the fundamentals and mathematics of RL, which
form the foundation for almost all discussed algorithms. We begin by
solving simple RL problems using tabular methods. Chapter 2, the
cornerstone of this part, explores Markov decision processes (MDPs)
and the associated value functions, which are recurring concepts
throughout the book. Chapters 3 to 5 delve deeper into these
fundamental concepts by discussing how to use dynamic programming
(DP), Monte Carlo methods, and temporal difference (TD) learning
methods to solve small MDPs.
Part II tackles the challenge of solving large-scale RL problems that
render tabular methods infeasible due to their complexity (e.g., large or
infinite state spaces). Here, we shift our focus to value function
approximation, with particular emphasis on leveraging (deep) neural
networks. Chapter 6 provides a brief introduction to linear value
function approximation, while Chap. 7 delves into the renowned Deep
Q-Network (DQN) algorithm. In Chap. 8, we discuss enhancements to
the DQN algorithm.
Part III explores policy-based methods as an alternative approach to
solving RL problems. While Parts I and II primarily focus on value-
based methods (learning the value function), Part III concentrates on
learning the policy directly. We delve into the theory behind policy
gradient methods and the REINFORCE algorithm in Chap. 9.
Additionally, we explore Actor-Critic algorithms, which combine policy-
based and value-based approaches, in Chap. 10. Furthermore, Chap. 11
covers advanced policy-based algorithms, including surrogate objective
functions and the renowned Proximal Policy Optimization (PPO)
algorithm.
The final part of the book addresses advanced RL topics. Chapter 12
discusses how distributed RL can enhance agent performance, while
Chap. 13 explores the challenges of hard-to-explore RL problems and
presents curiosity-driven exploration as a potential solution. In the
concluding chapter, Chap. 14, we delve into model-based RL by
providing a comprehensive examination of the famous AlphaZero
algorithm.
Unlike a typical hands-on coding handbook, this book does not
primarily focus on coding exercises. Instead, we dedicate our resources
and time to explaining the fundamentals and core ideas behind each
algorithm. Nevertheless, we provide complete source code for all
examples and algorithms discussed in the book. Our code
implementations are done from scratch, without relying on third-party
RL libraries, except for essential tools like Python, OpenAI Gym, Numpy,
and the PyTorch deep learning framework. While third-party RL
libraries expedite the implementation process in real-world scenarios,
we believe coding each algorithm independently is the best approach
for learning RL fundamentals and mastering the various RL algorithms.
Throughout the book, we employ mathematical notations and
equations, which some readers may perceive as heavy. However, we
prioritize intuition over rigorous proofs, making the material accessible
to a broader audience. A foundational understanding of calculus at a
basic college level, minimal familiarity with linear algebra, and
elementary knowledge of probability and statistics are sufficient to
embark on this journey. We strive to ensure that interested readers
from diverse backgrounds can benefit from the book’s content.
We assume that readers have programming experience in Python
since all the source code is written in this language. While we briefly
cover the basics of deep learning in Chap. 7, including neural networks
and their workings, we recommend some prior familiarity with
machine learning, specifically deep learning concepts such as training a
deep neural network. However, beyond the introductory coverage,
readers can explore additional resources and materials to expand their
knowledge of deep learning.
This book draws inspiration from Reinforcement Learning: An
Introduction by Richard S. Sutton and Andrew G. Barto, a renowned RL
publication. Additionally, it is influenced by prestigious university RL
courses, particularly the mathematical style and notation derived from
Professor Emma Brunskill’s RL course at Stanford University. Although
our approach may differ slightly from Sutton and Barto’s work, we
strive to provide simpler explanations. Additionally, we have derived
some examples from Professor David Silver’s RL course at University
College London, which offers a comprehensive resource for
understanding the fundamentals presented in Part I. We would like to
express our gratitude to Professor Dimitri P. Bertsekas for his
invaluable guidance and inspiration in the field of optimal control and
reinforcement learning. Furthermore, the content of this book
incorporates valuable insights from research papers published by
various organizations and individual researchers.
In conclusion, this book aims to bridge the gap between the
fundamental concepts, mathematics, and practical implementation of
RL algorithms. By striking a balance between theory and
implementation, we provide readers with a comprehensive
understanding of RL, empowering them to apply these techniques in
various domains. We present the necessary mathematics and offer
complete source code for implementation to help readers gain a deep
understanding of RL principles. We hope this book serves as a valuable
resource for readers seeking to explore the fundamentals, mathematics,
and practical aspects of RL algorithms. We must acknowledge that
despite careful editing from our editors and multiple rounds of review,
we cannot guarantee that the book's content is error-free. Your feedback and
corrections are invaluable to us. Please do not hesitate to contact us
with any concerns or suggestions for improvement.

Source Code
You can download the source code used in this book from github.com/apress/art-of-reinforcement-learning.
Michael Hu
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub (https://github.com/Apress).
For more detailed information, please visit https://www.apress.com/gp/services/source-code.
Contents
Part I Foundation
1 Introduction
1.1 AI Breakthrough in Games
1.2 What Is Reinforcement Learning
1.3 Agent-Environment in Reinforcement Learning
1.4 Examples of Reinforcement Learning
1.5 Common Terms in Reinforcement Learning
1.6 Why Study Reinforcement Learning
1.7 The Challenges in Reinforcement Learning
1.8 Summary
References
2 Markov Decision Processes
2.1 Overview of MDP
2.2 Model Reinforcement Learning Problem Using MDP
2.3 Markov Process or Markov Chain
2.4 Markov Reward Process
2.5 Markov Decision Process
2.6 Alternative Bellman Equations for Value Functions
2.7 Optimal Policy and Optimal Value Functions
2.8 Summary
References
3 Dynamic Programming
3.1 Use DP to Solve MRP Problem
3.2 Policy Evaluation
3.3 Policy Improvement
3.4 Policy Iteration
3.5 General Policy Iteration
3.6 Value Iteration
3.7 Summary
References
4 Monte Carlo Methods
4.1 Monte Carlo Policy Evaluation
4.2 Incremental Update
4.3 Exploration vs. Exploitation
4.4 Monte Carlo Control (Policy Improvement)
4.5 Summary
References
5 Temporal Difference Learning
5.1 Temporal Difference Learning
5.2 Temporal Difference Policy Evaluation
5.3 Simplified 𝜖-Greedy Policy for Exploration
5.4 TD Control—SARSA
5.5 On-Policy vs. Off-Policy
5.6 Q-Learning
5.7 Double Q-Learning
5.8 N-Step Bootstrapping
5.9 Summary
References
Part II Value Function Approximation
6 Linear Value Function Approximation
6.1 The Challenge of Large-Scale MDPs
6.2 Value Function Approximation
6.3 Stochastic Gradient Descent
6.4 Linear Value Function Approximation
6.5 Summary
References
7 Nonlinear Value Function Approximation
7.1 Neural Networks
7.2 Training Neural Networks
7.3 Policy Evaluation with Neural Networks
7.4 Naive Deep Q-Learning
7.5 Deep Q-Learning with Experience Replay and Target Network
7.6 DQN for Atari Games
7.7 Summary
References
8 Improvements to DQN
8.1 DQN with Double Q-Learning
8.2 Prioritized Experience Replay
8.3 Advantage Function and Dueling Network Architecture
8.4 Summary
References
Part III Policy Approximation
9 Policy Gradient Methods
9.1 Policy-Based Methods
9.2 Policy Gradient
9.3 REINFORCE
9.4 REINFORCE with Baseline
9.5 Actor-Critic
9.6 Using Entropy to Encourage Exploration
9.7 Summary
References
10 Problems with Continuous Action Space
10.1 The Challenges of Problems with Continuous Action Space
10.2 MuJoCo Environments
10.3 Policy Gradient for Problems with Continuous Action Space
10.4 Summary
References
11 Advanced Policy Gradient Methods
11.1 Problems with the Standard Policy Gradient Methods
11.2 Policy Performance Bounds
11.3 Proximal Policy Optimization
11.4 Summary
References
Part IV Advanced Topics
12 Distributed Reinforcement Learning
12.1 Why Use Distributed Reinforcement Learning
12.2 General Distributed Reinforcement Learning Architecture
12.3 Data Parallelism for Distributed Reinforcement Learning
12.4 Summary
References
13 Curiosity-Driven Exploration
13.1 Hard-to-Explore Problems vs. Sparse Reward Problems
13.2 Curiosity-Driven Exploration
13.3 Random Network Distillation
13.4 Summary
References
14 Planning with a Model: AlphaZero
14.1 Why We Need to Plan in Reinforcement Learning
14.2 Monte Carlo Tree Search
14.3 AlphaZero
14.4 Training AlphaZero on a 9 × 9 Go Board
14.5 Training AlphaZero on a 13 × 13 Gomoku Board
14.6 Summary
References
Index
About the Author
Michael Hu
is an exceptional software engineer with a wealth of
expertise spanning over a decade, specializing in the
design and implementation of enterprise-level
applications. His current focus revolves around leveraging
the power of machine learning (ML) and artificial
intelligence (AI) to revolutionize operational systems
within enterprises. A true coding enthusiast, Michael finds
solace in the realms of mathematics and continuously
explores cutting-edge technologies, particularly machine learning and
deep learning. His unwavering passion lies in the realm of deep
reinforcement learning, where he constantly seeks to push the
boundaries of knowledge. Demonstrating his commitment to the field,
he has built numerous open source projects on GitHub that
closely emulate state-of-the-art reinforcement learning algorithms
pioneered by DeepMind, including notable examples like AlphaZero,
MuZero, and Agent57. Through these projects, Michael demonstrates
his commitment to advancing the field and sharing his knowledge with
fellow enthusiasts. He currently resides in the city of Shanghai, China.
About the Technical Reviewer
Shovon Sengupta
has over 14 years of expertise and a deepened
understanding of advanced predictive analytics, machine
learning, deep learning, and reinforcement learning. He
has established a place for himself by creating innovative
financial solutions that have won numerous awards. He is
currently working for one of the leading multinational
financial services corporations in the United States as the
Principal Data Scientist at the AI Center of Excellence. His job entails
leading innovative initiatives that rely on artificial intelligence to
address challenging business problems. He has a US patent (United
States Patent: Sengupta et al.: Automated Predictive Call Routing Using
Reinforcement Learning [US 10,356,244 B1]) to his credit. He is also a
Ph.D. scholar at BITS Pilani. He has reviewed quite a few popular titles
from leading publishers like Packt and Apress and has also authored a
few courses for Packt and CodeRed (EC-Council) in the realm of
machine learning. Apart from that, he has presented at various
international conferences on machine learning, time series forecasting,
and building trustworthy AI. His primary research is concentrated on
deep reinforcement learning, deep learning, natural language
processing (NLP), knowledge graph, causality analysis, and time series
analysis. For more details about Shovon’s work, please check out his
LinkedIn page: www.linkedin.com/in/shovon-sengupta-272aa917.
Part I
Foundation
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2023
M. Hu, The Art of Reinforcement Learning
https://doi.org/10.1007/978-1-4842-9606-6_1

1. Introduction
Michael Hu1
(1) Shanghai, Shanghai, China

Artificial intelligence has made impressive progress in recent years,
with breakthroughs achieved in areas such as image recognition,
natural language processing, and playing games. In particular,
reinforcement learning, a type of machine learning that focuses on
learning by interacting with an environment, has led to remarkable
achievements in the field.
In this book, we focus on the combination of reinforcement learning
and deep neural networks, which have become central to the success of
agents that can master complex games such as the board game Go and Atari
video games.
This first chapter provides an overview of reinforcement learning,
including key concepts such as states, rewards, policies, and the
common terms used in reinforcement learning, like the difference
between episodic and continuing reinforcement learning problems and
model-free vs. model-based methods.
Despite the impressive progress in the field, reinforcement learning
still faces significant challenges. For example, it can be difficult to learn
from sparse rewards, and the methods can suffer from instability.
Additionally, scaling to large state and action spaces can be a challenge.
Throughout this book, we will explore these concepts in greater
detail and discuss state-of-the-art techniques used to address these
challenges. By the end of this book, you will have a comprehensive
understanding of the principles of reinforcement learning and how they
can be applied to real-world problems.
We hope this introduction has sparked your curiosity about the
potential of reinforcement learning, and we invite you to join us on this
journey of discovery.

Fig. 1.1 A DQN agent learning to play Atari’s Breakout. The goal of the game is to
use a paddle to bounce a ball up and break through a wall of bricks. The agent only
takes in the raw pixels from the screen, and it has to figure out the right action
to take in order to maximize the score. Idea adapted from Mnih et al. [1]. Game
owned by Atari Interactive, Inc.

1.1 AI Breakthrough in Games


Atari
The Atari 2600 is a home video game console developed by Atari
Interactive, Inc. in the 1970s. It features a collection of iconic video
games. These games, such as Pong, Breakout, Space Invaders, and Pac-
Man, have become classic examples of early video gaming culture. In
this platform, players can interact with these classic games using a
joystick controller.
The breakthrough in Atari games came in 2015 when Mnih et al. [1]
from DeepMind developed an AI agent called DQN to play a list of Atari
video games, some even better than humans.
What makes the DQN agent so influential is how it was trained to
play the game. Similar to a human player, the agent was only given the
raw pixel image of the screen as inputs, as illustrated in Fig. 1.1, and it
has to figure out the rules of the game all by itself and decide what to do
during the game to maximize the score. No human expert knowledge,
such as predefined rules or sample games of human play, was given to
the agent.
The DQN agent is a type of reinforcement learning agent that learns
by interacting with an environment and receiving a reward signal. In
the case of Atari games, the DQN agent receives a score for each action
it takes.
Mnih et al. [1] trained and tested their DQN agents on 57 Atari video
games. They trained one DQN agent for one Atari game, with each agent
playing only the game it was trained on; the training was over millions
of frames. The DQN agent can play half of the games (30 of 57 games) at
or better than a human player, as shown by Mnih et al. [1]. This means
that the agent was able to learn and develop strategies that were better
than what a human player could come up with.
Since then, various organizations and researchers have made
improvements to the DQN agent, incorporating several new techniques.
The Atari video games have become one of the most used test beds for
evaluating the performance of reinforcement learning agents and
algorithms. The Arcade Learning Environment (ALE) [2], which
provides an interface to hundreds of Atari 2600 game environments, is
commonly used by researchers for training and testing reinforcement
learning agents.
In summary, the Atari video games have become a classic example
of early video gaming culture, and the Atari 2600 platform provides a
rich environment for training agents in the field of reinforcement
learning. The breakthrough of DeepMind’s DQN agent, trained and
tested on 57 Atari video games, demonstrated the capability of an AI
agent to learn and make decisions through trial-and-error interactions
with classic games. This breakthrough has spurred many
improvements and advancements in the field of reinforcement learning,
and the Atari games have become a popular test bed for evaluating the
performance of reinforcement learning algorithms.

Go
Go is an ancient Chinese strategy board game played by two players,
who take turns placing stones on a 19 × 19 board with the goal
of surrounding more territory than the opponent. Each player has a set
of black or white stones, and the game begins with an empty board.
Players alternate placing stones on the board, with the black player
going first.

Fig. 1.2 Yoda Norimoto (black) vs. Kiyonari Tetsuya (white), Go game from the 66th
NHK Cup, 2018. White won by 0.5 points. Game record from CWI [4]
The stones are placed on the intersections of the lines on the board,
rather than in the squares. Once a stone is placed on the board, it
cannot be moved, but it can be captured by the opponent if it is
completely surrounded by their stones. Stones that are surrounded and
captured are removed from the board.
The game continues until both players pass, at which point the
territory on the board is counted. A player’s territory is the set of empty
intersections that are completely surrounded by their stones, plus any
captured stones. The player with the larger territory wins the game. In
the case of the final board position shown in Fig. 1.2, White won by
0.5 points.
Although the rules of the game are relatively simple, the game is
extremely complex. For instance, the number of legal board positions in
Go is enormously large compared to Chess. According to research by
Tromp and Farnebäck [3], the number of legal board positions in Go is
approximately 2.1 × 10^170, which is vastly greater than the number of
atoms in the universe.
This complexity presents a significant challenge for artificial
intelligence (AI) agents that attempt to play Go. In March 2016, an AI
agent called AlphaGo developed by Silver et al. [5] from DeepMind
made history by beating the legendary Korean player Lee Sedol with a
score of 4-1 in Go. Lee Sedol is a winner of 18 world titles and is
considered one of the greatest Go players of the past decade. AlphaGo’s
victory was remarkable because it used a combination of deep neural
networks and tree search algorithms, as well as the technique of
reinforcement learning.
AlphaGo was trained using a combination of supervised learning
from human expert games and reinforcement learning from games of
self-play. This training enabled the agent to develop creative and
innovative moves that surprised both Lee Sedol and the Go community.
The success of AlphaGo has sparked renewed interest in the field of
reinforcement learning and has demonstrated the potential for AI to
solve complex problems that were once thought to be the exclusive
domain of human intelligence. One year later, Silver et al. [6] from
DeepMind introduced a new and more powerful agent, AlphaGo Zero.
AlphaGo Zero was trained using pure self-play, without any human
expert moves in its training, achieving a higher level of play than the
previous AlphaGo agent. They also made other improvements like
simplifying the training processes.
To evaluate the performance of the new agent, they set it to play
games against the exact same AlphaGo agent that beat the world
champion Lee Sedol in 2016, and this time the new AlphaGo Zero beat
AlphaGo with a score of 100-0.
In the following year, Schrittwieser et al. [7] from DeepMind
generalized the AlphaGo Zero agent to play not only Go but also other
board games like Chess and Shogi (Japanese chess), and they called this
generalized agent AlphaZero. AlphaZero is a more general
reinforcement learning algorithm that can be applied to a variety of
board games, not just Go, Chess, and Shogi.
Reinforcement learning is a type of machine learning in which an
agent learns to make decisions based on the feedback it receives from
its environment. Both DQN and AlphaGo (and its successor) agents use
this technique, and their achievements are very impressive. Although
these agents are designed to play games, this does not mean that
reinforcement learning is only capable of playing games. In fact, there
are many more challenging problems in the real world, such as
navigating a robot, driving an autonomous car, and automating web
advertising. Games are relatively easy to simulate and implement
compared to these other real-world problems, but reinforcement
learning has the potential to be applied to a wide range of complex
challenges beyond game playing.

1.2 What Is Reinforcement Learning


In computer science, reinforcement learning is a subfield of machine
learning that focuses on learning how to act in a world or an
environment. The goal of reinforcement learning is for an agent to learn
from interacting with the environment in order to make a sequence of
decisions that maximize accumulated reward in the long run. This
process is known as goal-directed learning.
Unlike other machine learning approaches like supervised learning,
reinforcement learning does not rely on labeled data to learn from.
Instead, the agent must learn through trial and error, without being
directly told the rules of the environment or what action to take at any
given moment. This makes reinforcement learning a powerful tool for
modeling and solving real-world problems where the rules and optimal
actions may not be known or easily determined.
Reinforcement learning is not limited to computer science, however.
Similar ideas are studied in other fields under different names, such as
operations research and optimal control in engineering. While the
specific methods and details may vary, the underlying principles of
goal-directed learning and decision-making are the same.
Examples of reinforcement learning in the real world are all around
us. Human beings, for example, are naturally good at learning from
interacting with the world around us. From learning to walk as a baby
to learning to speak our native language to learning to drive a car, we
learn through trial and error and by receiving feedback from the
environment. Similarly, animals can also be trained to perform a variety
of tasks through a process similar to reinforcement learning. For
instance, service dogs can be trained to assist individuals in
wheelchairs, while police dogs can be trained to help search for missing
people.
One vivid example that illustrates the idea of reinforcement learning
is a video of a dog with a big stick in its mouth trying to cross a narrow
bridge.1 The video shows the dog attempting to pass the bridge, but
failing multiple times. However, after some trial and error, the dog
eventually discovers that by tilting its head, it can pass the bridge with
its favorite stick. This simple example demonstrates the power of
reinforcement learning in solving complex problems by learning from
the environment through trial and error.

1.3 Agent-Environment in Reinforcement Learning


Reinforcement learning is a type of machine learning that focuses on
how an agent can learn to make optimal decisions by interacting with
an environment. The agent-environment loop is the core of
reinforcement learning, as shown in Fig. 1.3. In this loop, the agent
observes the state of the environment and a reward signal, takes an
action, and receives a new state and reward signal from the
environment. This process continues iteratively, with the agent learning
from the rewards it receives and adjusting its actions to maximize
future rewards.
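
To make this loop concrete, the following minimal sketch runs one episode of a made-up coin-guessing environment with a purely random agent. The reset()/step() convention loosely follows the style of the OpenAI Gym toolkit mentioned in the preface, but the environment, its reward, and the random policy are illustrative assumptions rather than examples from this book.

import random

class CoinFlipEnv:
    """A toy environment: the agent guesses a coin flip for a fixed number of steps.

    The environment is made up for illustration; it only mimics the
    reset()/step() interface convention popularized by OpenAI Gym.
    """

    def __init__(self, num_steps=10):
        self.num_steps = num_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # a dummy initial state

    def step(self, action):
        coin = random.randint(0, 1)               # the environment's internal dynamics
        reward = 1.0 if action == coin else 0.0   # reward signal for a correct guess
        self.t += 1
        done = self.t >= self.num_steps           # the episode ends after num_steps
        return coin, reward, done                 # next state, reward, termination flag

env = CoinFlipEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.randint(0, 1)                 # a random policy, standing in for a learned one
    state, reward, done = env.step(action)        # the environment responds with state and reward
    total_reward += reward
print("episode return:", total_reward)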

Environment
The environment is the world in which the agent operates. It can be a
physical system, such as a robot navigating a maze, or a virtual
environment, such as a game or a simulation. The environment
provides the agent with two pieces of information: the state of the
environment and a reward signal. The state describes the relevant
information about the environment that the agent needs to make a
decision, such as the position of the robot or the cards in a poker game.
The reward signal is a scalar value that indicates how well the agent is
doing in its task. The agent’s objective is to maximize its cumulative
reward over time.

Fig. 1.3 Top: Agent-environment in reinforcement learning in a loop. Bottom: The
loop unrolled by time

The environment has its own set of rules, which determine how the
state and reward signal change based on the agent’s actions. These
rules are often called the dynamics of the environment. In many cases,
the agent does not have access to the underlying dynamics of the
environment and must learn them through trial and error. This is
similar to how we humans interact with the physical world every day:
we normally have a pretty good sense of what’s going on around us, but
it’s difficult to fully understand the dynamics of the universe.
Game environments are a popular choice for reinforcement learning
because they provide a clear objective and well-defined rules. For
example, a reinforcement learning agent could learn to play the game of
Pong by observing the screen and receiving a reward signal based on
whether it wins or loses the game.
In a robotic environment, the agent is a robot that must learn to
navigate a physical space or perform a task. For example, a
reinforcement learning agent could learn to navigate a maze by using
sensors to detect its surroundings and receiving a reward signal based
on how quickly it reaches the end of the maze.

State
In reinforcement learning, an environment state or simply state is the
statistical data provided by the environment to represent the current
state of the environment. The state can be discrete or continuous. For
instance, when driving a stick shift car, the speed of the car is a
continuous variable, while the current gear is a discrete variable.
Ideally, the environment state should contain all relevant
information that’s necessary for the agent to make decisions. For
example, in a single-player video game like Breakout, the pixels of
frames of the game contain all the information necessary for the agent
to make a decision. Similarly, in an autonomous driving scenario, the
sensor data from the car’s cameras, lidar, and other sensors provide
relevant information about the surrounding environment.
However, in practice, the available information may depend on the
task and domain. In a two-player board game like Go, for instance,
although we have perfect information about the board position, we
don’t have perfect knowledge about the opponent player, such as what
they are thinking in their head or what their next move will be. This
makes the state representation more challenging in such scenarios.
Furthermore, the environment state might also include noisy data.
For example, a reinforcement learning agent driving an autonomous car
might use multiple cameras at different angles to capture images of the
surrounding area. Suppose the car is driving near a park on a windy
day. In that case, the onboard cameras could also capture images of
some trees in the park that are swaying in the wind. Since the
movement of these trees should not affect the agent’s ability to drive,
because the trees are inside the park and not on the road or near the
road, we can consider these movements of the trees as noise to the self-
driving agent. However, it can be challenging to ignore them from the
captured images. To tackle this problem, researchers might use various
techniques such as filtering and smoothing to eliminate the noisy data
and obtain a cleaner representation of the environment state.
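
As a small illustration of discrete and continuous state variables, the sketch below encodes a hypothetical state for the stick-shift car example above; the field names and the flattening into a numeric vector are assumptions made only for illustration.

from dataclasses import dataclass

@dataclass
class CarState:
    """Hypothetical state for the stick-shift car example."""
    speed_kmh: float  # continuous state variable
    gear: int         # discrete state variable

state = CarState(speed_kmh=42.5, gear=3)
# Many implementations flatten the state into a numeric vector before
# passing it to a learning algorithm.
state_vector = [state.speed_kmh, float(state.gear)]
print(state_vector)  # [42.5, 3.0]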

Reward
In reinforcement learning, the reward signal is a numerical value that
the environment provides to the agent after the agent takes some
action. The reward can be any numerical value, positive, negative, or
zero. However, in practice, the reward function often varies from task to
task, and we need to carefully design a reward function that is specific
to our reinforcement learning problem.
Designing an appropriate reward function is crucial for the success
of the agent. The reward function should be designed to encourage the
agent to take actions that will ultimately lead to achieving our desired
goal. For example, in the game of Go, the reward is 0 at every step
before the game is over, and +1 or -1 if the agent wins or loses the
game, respectively. This design incentivizes the agent to win the game,
without explicitly telling it how to win.
Similarly, in the game of Breakout, the reward can be a positive
number if the agent destroys some bricks, a negative number if the agent
fails to catch the ball, and zero reward otherwise. This design
incentivizes the agent to destroy as many bricks as possible while
avoiding losing the ball, without explicitly telling it how to achieve a
high score.
The reward function plays a crucial role in the reinforcement
learning process. The goal of the agent is to maximize the accumulated
rewards over time. By optimizing the reward function, we can guide the
agent to learn a policy that will achieve our desired goal. Without the
reward signal, the agent would not know what the goal is and would
not be able to learn effectively.
In summary, the reward signal is a key component of reinforcement
learning that incentivizes the agent to take actions that ultimately lead
to achieving the desired goal. By carefully designing the reward
function, we can guide the agent to learn an optimal policy.
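
To make these two reward designs concrete, the sketch below writes them out as plain Python functions; the function names and the exact Breakout values are assumptions chosen only to mirror the description above.

def go_reward(game_over: bool, agent_won: bool) -> float:
    """Terminal-only reward in the style described for Go: 0 until the
    game ends, then +1 for a win and -1 for a loss."""
    if not game_over:
        return 0.0
    return 1.0 if agent_won else -1.0

def breakout_reward(bricks_destroyed: int, missed_ball: bool) -> float:
    """A possible Breakout-style reward: positive for destroying bricks,
    negative for losing the ball, zero otherwise."""
    if missed_ball:
        return -1.0
    return float(bricks_destroyed)

print(go_reward(game_over=True, agent_won=False))              # -1.0
print(breakout_reward(bricks_destroyed=2, missed_ball=False))  # 2.0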

Agent
In reinforcement learning, an agent is an entity that interacts with an
environment by making decisions based on the received state and
reward signal from the environment. The agent’s goal is to maximize its
cumulative reward in the long run. The agent must learn to make the
best decisions by trial and error, which involves exploring different
actions and observing the resulting rewards.
In addition to the external interactions with the environment, the
agent may also have an internal state that represents its knowledge about the
world. This internal state can include things like memory of past
experiences and learned strategies.
It’s important to distinguish the agent’s internal state from the
environment state. The environment state represents the current state
of the world that the agent is trying to influence through its actions. The
agent, however, has no direct control over the environment state. It can
only affect the environment state by taking actions and observing the
resulting changes in the environment. For example, if the agent is
playing a game, the environment state might include the current
positions of game pieces, while the agent’s internal state might include
the memory of past moves and the strategies it has learned.
In this book, we will typically use the term “state” to refer to the
environment state. However, it’s important to keep in mind the
distinction between the agent’s internal state and the environment
state. By understanding the role of the agent and its interactions with
the environment, we can better understand the principles behind
reinforcement learning algorithms. It is worth noting that the terms
“agent” and “algorithm” are frequently used interchangeably in this
book, particularly in later chapters.

Action
In reinforcement learning, the agent interacts with an environment by
selecting actions that affect the state of the environment. Actions are
chosen from a predefined set of possibilities, which are specific to each
problem. For example, in the game of Breakout, the agent can choose to
move the paddle to the left or right or take no action. It cannot perform
actions like jumping or rolling over. In contrast, in the game of Pong, the
agent can choose to move the paddle up or down but not left or right.
The chosen action affects the future state of the environment. The
agent’s current action may have long-term consequences, meaning that
it will affect the environment’s states and rewards for many future time
steps, not just the next immediate stage of the process.
Actions can be either discrete or continuous. In problems with
discrete actions, the set of possible actions is finite and well defined.
Examples of such problems include Atari and Go board games. In
contrast, problems with continuous actions have an infinite set of
possible actions, often within a continuous range of values. An example
of a problem with continuous actions is robotic control, where the
degree of angle movement of a robot arm is often a continuous action.
Reinforcement learning problems with discrete actions are
generally easier to solve than those with continuous actions. Therefore,
this book will focus on solving reinforcement learning problems with
discrete actions. However, many of the concepts and techniques
discussed in this book can be applied to problems with continuous
actions as well.
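
The short sketch below contrasts sampling from a discrete action set with sampling from a continuous action range; the action names and bounds are made up for illustration (in OpenAI Gym, such spaces are typically described with spaces.Discrete and spaces.Box).

import random

# Discrete action space: a small, finite set of choices (e.g., Breakout's paddle).
PADDLE_ACTIONS = ["NOOP", "LEFT", "RIGHT"]   # assumed action names
discrete_action = random.choice(PADDLE_ACTIONS)

# Continuous action space: any value within a range (e.g., the angle
# movement of a robot arm joint).
LOW, HIGH = -1.0, 1.0                        # assumed bounds for illustration
continuous_action = random.uniform(LOW, HIGH)

print(discrete_action, continuous_action)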

Policy
A policy is a key concept in reinforcement learning that defines the
behavior of an agent. In particular, it maps each possible state in the
environment to the probabilities of choosing different actions. By
specifying how the agent should behave, a policy guides the agent to
interact with its environment and maximize its cumulative reward. We
will delve into the details of policies and how they interact with the
MDP framework in Chap. 2.
For example, suppose an agent is navigating a grid-world
environment. A simple policy might dictate that the agent should
always move to the right until it reaches the goal location. Alternatively,
a more sophisticated policy could specify that the agent should choose
its actions based on its current position and the probabilities of moving
to different neighboring states.
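
As a minimal sketch of these two kinds of policies, the code below represents a policy as a function that maps a state to action probabilities, along with a helper to sample an action; the action names and the grid-world state are assumptions for illustration.

import random

ACTIONS = ["up", "down", "left", "right"]

def always_right_policy(state):
    """The simple deterministic policy from the text: always move right."""
    return {"up": 0.0, "down": 0.0, "left": 0.0, "right": 1.0}

def uniform_random_policy(state):
    """A stochastic policy: every action is equally likely in every state."""
    prob = 1.0 / len(ACTIONS)
    return {action: prob for action in ACTIONS}

def sample_action(policy, state):
    """Draw one action according to the probabilities the policy assigns."""
    probs = policy(state)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(always_right_policy, state=(0, 0)))    # always "right"
print(sample_action(uniform_random_policy, state=(0, 0)))  # any of the four actions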

Model
In reinforcement learning, a model refers to a mathematical description
of the dynamics function and reward function of the environment. The
dynamics function describes how the environment evolves from one
state to another, while the reward function specifies the reward that the
agent receives for taking certain actions in certain states.
In many cases, the agent does not have access to a perfect model of
the environment. This makes learning a good policy challenging, since
the agent must learn from experience how to interact with the
environment to maximize its reward. However, there are some cases
where a perfect model is available. For example, if the agent is playing a
game with fixed rules and known outcomes, the agent can use this
knowledge to select its actions strategically. We will explore this
scenario in detail in Chap. 2.
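
Below is a minimal sketch of what such a model could look like for a tiny, made-up three-state corridor: the dynamics function and the reward function are stored as lookup tables, and the agent can use them to simulate the outcome of an action without touching the real environment. The corridor and the reward placement are assumptions for illustration.

# States 0, 1, 2 form a corridor; moving right from state 1 reaches the goal state 2.
dynamics = {
    (0, "right"): 1, (1, "right"): 2, (2, "right"): 2,
    (0, "left"): 0,  (1, "left"): 0,  (2, "left"): 1,
}
rewards = {(s, a): (1.0 if (s, a) == (1, "right") else 0.0)
           for (s, a) in dynamics}

def simulate(state, action):
    """Predict the next state and reward using the model, without
    interacting with the real environment."""
    return dynamics[(state, action)], rewards[(state, action)]

print(simulate(1, "right"))  # (2, 1.0)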
In reinforcement learning, the agent-environment boundary can be
ambiguous. Although a house-cleaning robot may appear to be a single
agent, what the agent directly controls typically defines its boundary, while the
remaining components comprise the environment. In this case, the
robot’s wheels and other hardware are considered to be part of the
environment since they aren’t directly controlled by the agent. We can
think of the robot as a complex system composed of several parts, such
as hardware, software, and the reinforcement learning agent, which can
control the robot’s movement by signaling the software interface, which
then communicates with microchips to manage the wheel movement.

1.4 Examples of Reinforcement Learning


Reinforcement learning is a versatile technique that can be applied to a
variety of real-world problems. While its success in playing games is
well known, there are many other areas where it can be used as an
effective solution. In this section, we explore a few examples of how
reinforcement learning can be applied to real-world problems.

Autonomous Driving
Reinforcement learning can be used to train autonomous vehicles to
navigate complex and unpredictable environments. The goal for the
agent is to safely and efficiently drive the vehicle to a desired location
while adhering to traffic rules and regulations. The reward signal could
be a positive number for successful arrival at the destination within a
specified time frame and a negative number for any accidents or
violations of traffic rules. The environment state could contain
information about the vehicle’s location, velocity, and orientation, as
well as sensory data such as camera feeds and radar readings.
Additionally, the state could include the current traffic conditions and
weather, which would help the agent to make better decisions while
driving.

Navigating Robots in a Factory Floor


One practical application of reinforcement learning is to train robots to
navigate a factory floor. The goal for the agent is to safely and efficiently
transport goods from one point to another without disrupting the work
of human employees or other robots. In this case, the reward signal
could be a positive number for successful delivery within a specified
time frame and a negative number for any accidents or damages
caused. The environment state could contain information about the
robot’s location, the weight and size of the goods being transported, the
location of other robots, and sensory data such as camera feeds and
battery level. Additionally, the state could include information about the
production schedule, which would help the agent to prioritize its tasks.

Automating Web Advertising


Another application of reinforcement learning is to automate web
advertising. The goal for the agent is to select the most effective type of
ad to display to a user, based on their browsing history and profile. The
reward signal could be a positive number for when the user clicks on
the ad, and zero otherwise. The environment state could contain
information such as the user’s search history, demographics, and
current trends on the Internet. Additionally, the state could include
information about the context of the web page, which would help the
agent to choose the most relevant ad.
Video Compression
Reinforcement learning can also be used to improve video
compression. DeepMind’s MuZero agent has been adapted to optimize
video compression for some YouTube videos. In this case, the goal for
the agent is to compress the video as much as possible without
compromising the quality. The reward signal could be a positive
number for high-quality compression and a negative number for low-
quality compression. The environment state could contain information
such as the video’s resolution, bit rate, frame rate, and the complexity of
the scenes. Additionally, the state could include information about the
viewing device, which would help the agent to optimize the
compression for the specific device.
Overall, reinforcement learning has enormous potential for solving
real-world problems in various industries. The key to successful
implementation is to carefully design the reward signal and the
environment state to reflect the specific goals and constraints of the
problem. Additionally, it is important to continually monitor and
evaluate the performance of the agent to ensure that it is making the
best decisions.

1.5 Common Terms in Reinforcement Learning


Episodic vs. Continuing Tasks
In reinforcement learning, the type of problem or task is categorized as
episodic or continuing depending on whether it has a natural ending.
A natural ending refers to a point in a task or problem where it is
reasonable to consider the task or problem completed.
Episodic problems have a natural termination point or terminal
state, at which point the task is over, and a new episode starts. A new
episode is independent of previous episodes. Examples of episodic
problems include playing an Atari video game, where the game is over
when the agent loses all lives or wins the game, and a new episode
always starts when we reset the environment, regardless of whether
the agent won or lost the previous game. Other examples of episodic
problems include Tic-Tac-Toe, chess, or Go games, where each game is
independent of the previous game.
On the other hand, continuing problems do not have a natural
endpoint, and the process could go on indefinitely. Examples of
continuing problems include personalized advertising or
recommendation systems, where the agent’s goal is to maximize a
user’s satisfaction or click-through rate over an indefinite period.
Another example of a continuing problem is automated stock trading,
where the agent wants to maximize their profits in the stock market by
buying and selling stocks. In this scenario, the agent’s actions, such as
the stocks they buy and the timing of their trades, can influence the
future prices and thus affect their future profits. The agent’s goal is to
maximize their long-term profits by making trades continuously, and
the stock prices will continue to fluctuate in the future. Thus, the
agent’s past trades will affect their future decisions, and there is no
natural termination point for the problem.
It is possible to design some continuing reinforcement learning
problems as episodic by using a time-constrained approach. For
example, the episode could be over when the market is closed.
However, in this book, we only consider natural episodic problems, that
is, problems with a natural termination point.
Understanding the differences between episodic and continuing
problems is crucial for designing effective reinforcement learning
algorithms for various applications. For example, episodic problems
may require a different algorithmic approach than continuing problems
due to the differences in their termination conditions. Furthermore, in
real-world scenarios, distinguishing between episodic and continuing
problems can help identify the most appropriate reinforcement
learning approach to use for a particular task or problem.
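
The sketch below illustrates the episodic structure described above on a made-up guessing task: each episode terminates naturally, either when the agent guesses correctly or after five tries, and every new episode starts from a fresh reset, independent of the previous one.

import random

class GuessEnv:
    """A toy episodic task: guess a hidden number between 0 and 3 in at most 5 tries."""

    def reset(self):
        self.secret = random.randint(0, 3)
        self.tries = 0
        return 0  # dummy initial observation

    def step(self, guess):
        self.tries += 1
        correct = guess == self.secret
        done = correct or self.tries >= 5   # the natural termination point
        reward = 1.0 if correct else 0.0
        return self.tries, reward, done

env = GuessEnv()
for episode in range(3):                    # each episode is independent of the last
    obs = env.reset()                       # a new episode always starts from a reset
    done, episode_return = False, 0.0
    while not done:
        obs, reward, done = env.step(random.randint(0, 3))
        episode_return += reward
    print("episode", episode, "return", episode_return)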

Deterministic vs. Stochastic Tasks


In reinforcement learning, it is important to distinguish between
deterministic and stochastic problems. A problem is deterministic if the
outcome is always the same when the agent takes the same action in
the same environment state. For example, an Atari video game or a game
of Go is a deterministic problem. In these games, the rules of the game
are fixed; when the agent repeatedly takes the same action under the
same environment condition (state), the outcome (reward signal and
next state) is always the same.
The reason that these games are considered deterministic is that
the environment’s dynamics and reward functions do not change over
time. The rules of the game are fixed, and the environment always
behaves in the same way for a given set of actions and states. This
allows the agent to learn a policy that maximizes the expected reward
by simply observing the outcomes of its actions.
On the other hand, a problem is stochastic if the outcome is not
always the same when the agent takes the same action in the same
environment state. One example of a stochastic environment is playing
poker. The outcome of a particular hand is not entirely determined by
the actions of the player. Other players at the table can also take actions
that influence the outcome of the hand. Additionally, the cards dealt to
each player are determined by a shuffled deck, which introduces an
element of chance into the game.
For example, let’s say a player is dealt a pair of aces in a game of
Texas hold’em. The player might decide to raise the bet, hoping to win a
large pot. However, the other players at the table also have their own
sets of cards and can make their own decisions based on the cards they
hold and the actions of the other players.
If another player has a pair of kings, they might also decide to raise
the bet, hoping to win the pot. If a third player has a pair of twos, they
might decide to fold, as their hand is unlikely to win. The outcome of
the hand depends not only on the actions of the player with the pair of
aces but also on the actions of the other players at the table, as well as
the cards dealt to them.
This uncertainty and complexity make poker a stochastic problem.
While it is possible to use various strategies to improve one’s chances
of winning in poker, the outcome of any given hand is never certain, and
a skilled player must be able to adjust their strategy based on the
actions of the other players and the cards dealt.
Another example of a stochastic environment is the stock market. The
stock market is a stochastic environment because the outcome of an
investment is not always the same when the same action is taken in the
same environment state. There are many factors that can influence the
price of a stock, such as company performance, economic conditions,
geopolitical events, and investor sentiment. These factors are
constantly changing and can be difficult to predict, making it impossible
to know with certainty what the outcome of an investment will be.
For example, let’s say you decide to invest in a particular stock
because you believe that the company is undervalued and has strong
growth prospects. You buy 100 shares at a price of $145.0 per share.
However, the next day, the company announces that it has lost a major
customer and its revenue projections for the next quarter are lower
than expected. The stock price drops to $135.0 per share, and you have
lost $1000 on your investment. Most likely, it was the stochastic nature of
the environment, rather than the action itself (buying 100 shares), that led
to the loss.
While it is possible to use statistical analysis and other tools to try
to predict stock price movements, there is always a level of uncertainty
and risk involved in investing in the stock market. This uncertainty and
risk are what make the stock market a stochastic environment, and why
it is important to use appropriate risk management techniques when
making investment decisions.
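
To summarize the distinction in code, the sketch below contrasts a deterministic transition with a stochastic one on a made-up number-line environment; the 80% success probability and the reward placement are assumptions for illustration.

import random

def deterministic_step(state, action):
    """The same (state, action) pair always produces the same outcome."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

def stochastic_step(state, action):
    """The same (state, action) pair can produce different outcomes: the
    intended move succeeds only 80% of the time in this made-up environment."""
    if random.random() < 0.8:
        next_state = state + action
    else:
        next_state = state               # the action "slips" and has no effect
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

print(deterministic_step(2, 1))   # always (3, 1.0)
print(stochastic_step(2, 1))      # usually (3, 1.0), sometimes (2, 0.0)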
In this book, we focus on deterministic reinforcement learning
problems. By understanding the fundamentals of deterministic
reinforcement learning, readers will be well equipped to tackle more
complex and challenging problems in the future.

Model-Free vs. Model-Based Reinforcement Learning
In reinforcement learning, an environment is a system with which an
agent interacts in order to achieve a goal. A model is a mathematical
representation of the environment's dynamics and reward functions.
Model-free reinforcement learning means the agent does not use a model
of the environment to help it make decisions. This may occur either
because the agent lacks access to an accurate model of the environment
or because the model is too complex to use during decision-making.
In model-free reinforcement learning, the agent learns to take
actions based on its experiences of the environment without explicitly
simulating future outcomes. Examples of model-free reinforcement
learning methods include Q-learning, SARSA (State-Action-Reward-
State-Action), and deep reinforcement learning algorithms such as
DQN, which we’ll introduce later in the book.
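As a concrete illustration of the model-free idea, the following is a minimal sketch of the tabular Q-learning update; the learning rate, discount factor, and example transition are illustrative assumptions, not values taken from this book.

    from collections import defaultdict

    # Q-table: maps (state, action) pairs to estimated action values.
    Q = defaultdict(float)

    def q_learning_update(state, action, reward, next_state, actions,
                          alpha=0.1, gamma=0.99):
        # One tabular Q-learning step: move Q(s, a) toward the target
        # r + gamma * max over a' of Q(s', a'), using only the observed transition.
        best_next = max(Q[(next_state, a)] for a in actions)
        td_target = reward + gamma * best_next
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])

    # Example usage with a hypothetical transition (s=0, a=1, r=1.0, s'=2):
    q_learning_update(state=0, action=1, reward=1.0, next_state=2, actions=[0, 1])

The update relies only on the observed transition (state, action, reward, next state); the agent never consults a model of the environment's dynamics, which is precisely what makes the method model-free.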
On the other hand, in model-based reinforcement learning, the
agent uses a model of the environment to simulate future outcomes and
plan its actions accordingly. This may involve constructing a complete
model of the environment or using a simplified model that captures
only the most essential aspects of the environment’s dynamics. Model-
based reinforcement learning can be more sample-efficient than model-
free methods in certain scenarios, especially when the environment is
relatively simple and the model is accurate. Examples of model-based
reinforcement learning methods include dynamic programming
algorithms, such as value iteration and policy iteration, and
probabilistic planning methods, such as the Monte Carlo Tree Search
used in the AlphaZero agent, which we'll introduce later in the book.
In summary, model-free and model-based reinforcement learning
are two different approaches to solving the same problem of
maximizing rewards in an environment. The choice between these
approaches depends on the properties of the environment, the
available data, and the computational resources.
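To contrast with the model-free sketch above, here is a minimal value iteration sketch for a tiny, made-up MDP whose transition probabilities and rewards are fully known to the agent; the MDP and discount factor are illustrative assumptions.

    # A toy MDP with a known model: model[state][action] is a list of
    # (probability, next_state, reward) tuples describing the dynamics.
    model = {
        0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
    }
    gamma = 0.9

    # Value iteration: repeatedly back up state values through the known model.
    V = {s: 0.0 for s in model}
    for _ in range(100):
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in model[s].values()
            )
            for s in model
        }
    print(V)  # approximate optimal state values

Because the dynamics are known, the agent can plan by sweeping over the model rather than by trial-and-error interaction with the environment.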

1.6 Why Study Reinforcement Learning


Machine learning is a vast and rapidly evolving field, with many
different approaches and techniques. As such, it can be challenging for
practitioners to know which type of machine learning to use for a given
problem. By discussing the strengths and limitations of different
branches of machine learning, we can better understand which
approach might be best suited to a particular task. This can help us
make more informed decisions when developing machine learning
solutions and ultimately lead to more effective and efficient systems.
There are three branches of machine learning. One of the most
popular and widely adopted in the real world is supervised learning,
which is used in domains like image recognition, speech recognition,
and text classification. The idea of supervised learning is very simple:
given a set of training data and the corresponding labels, the objective
is for the system to generalize and predict the label for data that’s not
present in the training dataset. These training labels are typically
provided by supervisors (e.g., humans), hence the name supervised
learning.
Another branch of machine learning is unsupervised learning. In
unsupervised learning, the objective is to discover the hidden
structures or features of the training data without being provided with
any labels. This can be useful in domains such as image clustering,
where we want the system to group similar images together without
knowing ahead of time which images belong to which group. Another
application of unsupervised learning is in dimensionality reduction,
where we want to represent high-dimensional data in a lower-
dimensional space while preserving as much information as possible.
Reinforcement learning is a type of machine learning in which an
agent learns to take actions in an environment in order to maximize a
reward signal. It’s particularly useful in domains where there is no clear
notion of “correct” output, such as in robotics or game playing.
Reinforcement learning has potential applications in areas like robotics,
healthcare, and finance.
Supervised learning has already been widely used in computer
vision and natural language processing. For example, the ImageNet
classification challenge is an annual computer vision competition
where deep convolutional neural networks (CNNs) dominate. The
challenge provides a training dataset with labels for 1.2 million images
across 1000 categories, and the goal is to predict the labels for a
separate evaluation dataset of about 100,000 images. In 2012,
Krizhevsky et al. [8] developed AlexNet, the first deep CNN system used
in this challenge. AlexNet achieved an 18% improvement in accuracy
compared to previous state-of-the-art methods, which marked a major
breakthrough in computer vision.
Since the advent of AlexNet, almost all leading solutions to the
ImageNet challenge have been based on deep CNNs. Another
breakthrough came in 2015 when researchers He et al. [9] from
Microsoft developed ResNet, a new architecture designed to improve
the training of very deep CNNs with hundreds of layers. Training deep
CNNs is challenging due to vanishing gradients, which makes it difficult
to propagate the gradients backward through the network during
backpropagation. ResNet addressed this challenge by introducing skip
connections, which allowed the network to bypass one or more layers
during forward propagation, thereby reducing the depth of the network
that the gradients have to propagate through.
While supervised learning is capable of discovering hidden patterns
and features from data, it is limited in that it merely mimics what it is
told to do during training and cannot interact with the world and learn
from its own experience. One limitation of supervised learning is the
need to label every possible stage of the process. For example, if we
want to use supervised learning to train an agent to play Go, then we
would need to collect the labels for every possible board position,
which is impossible due to the enormous number of possible
combinations. Similarly, in Atari video games, a single pixel change
would require relabeling, making supervised learning inapplicable in
these cases. However, supervised learning has been successful in many
other applications, such as language translation and image
classification.
Unsupervised learning tries to discover hidden patterns or features
without labels, but its objective is completely different from that of RL,
which is to maximize accumulated reward signals. Humans and animals
learn by interacting with their environment, and this is where
reinforcement learning (RL) comes in. In RL, the agent is not told which
action is good or bad, but rather it must discover that for itself through
trial and error. This trial-and-error search process is unique to RL.
However, there are other challenges that are unique to RL, such as
dealing with delayed consequences and balancing exploration and
exploitation.
While RL is a distinct branch of machine learning, it shares some
commonalities with other branches, such as supervised and
unsupervised learning. For example, improvements in supervised
learning and deep convolutional neural networks (CNNs) have been
adapted to DeepMind’s DQN, AlphaGo, and other RL agents. Similarly,
unsupervised learning can be used to pretrain the weights of RL agents
to improve their performance. Furthermore, many of the mathematical
concepts used in RL, such as optimization and how to train a neural
network, are shared with other branches of machine learning.
Therefore, while RL has unique challenges and applications, it also
benefits from and contributes to the development of other branches of
machine learning.
1.7 The Challenges in Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an
agent learns to interact with an environment to maximize some notion
of cumulative reward. While RL has shown great promise in a variety of
applications, it also comes with several common challenges, as
discussed in the following sections:

Exploration vs. Exploitation Dilemma


The exploration-exploitation dilemma refers to the fundamental
challenge in reinforcement learning of balancing the need to explore
the environment to learn more about it with the need to exploit
previous knowledge to maximize cumulative reward. The agent must
continually search for new actions that may yield greater rewards while
also taking advantage of actions that have already proven successful.
In the initial exploration phase, the agent is uncertain about the
environment and must try out a variety of actions to gather information
about how the environment responds. This is similar to how humans
might learn to play a new video game by trying different button
combinations to see what happens. However, as the agent learns more
about the environment, it becomes increasingly important to focus on
exploiting the knowledge it has gained in order to maximize cumulative
reward. This is similar to how a human might learn to play a game more
effectively by focusing on the actions that have already produced high
scores.
There are many strategies for addressing the exploration-
exploitation trade-off, ranging from simple heuristics to more
sophisticated algorithms. One common approach is to use an ε-greedy
policy, in which the agent selects the action with the highest estimated
value with probability 1 − ε and selects a random action with
probability ε in order to encourage further exploration. Another
approach is to use a Thompson sampling algorithm, which balances
exploration and exploitation by selecting actions based on a
probabilistic estimate of their expected value.
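A minimal sketch of the ε-greedy rule described above is shown below; the action-value estimates and the value of ε are placeholders chosen for illustration.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # With probability epsilon, explore by picking a random action;
        # otherwise exploit by picking the action with the highest estimated value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    # Example: estimated values for three actions (placeholder numbers).
    action = epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1)
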
It is important to note that the exploration-exploitation trade-off is
not a simple problem to solve, and the optimal balance between
exploration and exploitation will depend on many factors, including the
complexity of the environment and the agent’s prior knowledge. As a
result, there is ongoing research in the field of reinforcement learning
aimed at developing more effective strategies for addressing this
challenge.

Credit Assignment Problem


In reinforcement learning (RL), the credit assignment problem refers to
the challenge of determining which actions an agent took that led to a
particular reward. This is a fundamental problem in RL because the
agent must learn from its own experiences in order to improve its
performance.
To illustrate this challenge, let’s consider the game of Tic-Tac-Toe,
where two players take turns placing Xs and Os on a 3 × 3 grid until
one player gets three in a row. Suppose the agent is trying to learn to
play Tic-Tac-Toe using RL, and the reward is +1 for a win, −1 for a loss,
and 0 for a draw. The agent’s goal is to learn a policy that maximizes its
cumulative reward.
Now, suppose the agent wins a game of Tic-Tac-Toe. How can the
agent assign credit to the actions that led to the win? This can be a
difficult problem to solve, especially if the agent is playing against
another RL agent that is also learning and adapting its strategies.
To tackle the credit assignment problem in RL, there are various
techniques that can be used, such as Monte Carlo methods or temporal
difference learning. These methods use statistical analysis to estimate
the value of each action taken by the agent, based on the rewards
received and the states visited. By using these methods, the agent can
gradually learn to assign credit to the actions that contribute to its
success and adjust its policy accordingly.
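As a simple illustration of how Monte Carlo methods spread credit backward over a sequence of actions, the sketch below computes the discounted return for every step of a finished episode; the reward sequence and discount factor are illustrative (for instance, zero reward on every move and +1 only on the final, winning move of a Tic-Tac-Toe game).

    def discounted_returns(rewards, gamma=0.9):
        # Compute the return G_t = r_t + gamma * r_{t+1} + ... for every step,
        # working backward from the end of the episode.
        returns = [0.0] * len(rewards)
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        return returns

    # A finished episode: no reward until the final, winning move.
    print(discounted_returns([0.0, 0.0, 0.0, 0.0, 1.0]))
    # Earlier moves receive smaller (discounted) credit for the eventual win,
    # which is one simple way of assigning credit to the actions along the way.
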
In summary, credit assignment is a key challenge in reinforcement
learning, and it is essential to develop effective techniques for solving
this problem in order to achieve optimal performance.

Reward Engineering Problem


The reward engineering problem refers to the process of designing a
good reward function that encourages the desired behavior in a
reinforcement learning (RL) agent. The reward function determines the
feedback the agent receives for its actions and therefore shapes the
behavior it ultimately learns.