The Art of Reinforcement Learning
Fundamentals, Mathematics, and Implementations with Python

Michael Hu
Shanghai, Shanghai, China

ISBN-13 (pbk): 978-1-4842-9605-9
ISBN-13 (electronic): 978-1-4842-9606-6
https://doi.org/10.1007/978-1-4842-9606-6

Copyright © 2023 by Michael Hu


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned,
specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in
any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of
a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the
trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such,
is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors
nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher
makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr


Acquisitions Editor: Celestin Suresh John
Development Editor: Laura Berendson
Editorial Assistant: Gryffin Winkler

Cover designed by eStudioCalamar

Cover image designed by Freepik (www.freepik.com)

Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, Suite 4600,
New York, NY 10004-1562, USA. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or
visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science +
Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint, paperback, or audio rights,
please e-mail [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also
available for most titles. For more information, reference our Print and eBook Bulk Sales web page at
http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please visit https://www.apress.com/gp/services/source-code.

Paper in this product is recyclable


To my beloved family,
This book is dedicated to each of you, who have been a constant
source of love and support throughout my writing journey.
To my hardworking parents, whose tireless efforts in raising us
have been truly remarkable. Thank you for nurturing my dreams
and instilling in me a love for knowledge. Your unwavering
dedication has played a pivotal role in my accomplishments.
To my sisters and their children, your presence and love have
brought immense joy and inspiration to my life. I am grateful
for the laughter and shared moments that have sparked my
creativity.
And to my loving wife, your consistent support and
understanding have been my guiding light. Thank you for
standing by me through the highs and lows, and for being my
biggest cheerleader.
—Michael Hu
Contents

Part I Foundation
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 AI Breakthrough in Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 What Is Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Agent-Environment in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Examples of Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Common Terms in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Why Study Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 The Challenges in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Overview of MDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Model Reinforcement Learning Problem Using MDP . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Markov Process or Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Markov Reward Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Alternative Bellman Equations for Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Optimal Policy and Optimal Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1 Use DP to Solve MRP Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Policy Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 General Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Monte Carlo Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Incremental Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Exploration vs. Exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Monte Carlo Control (Policy Improvement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72


4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Temporal Difference Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Simplified ε-Greedy Policy for Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 TD Control—SARSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 On-Policy vs. Off-Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.6 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Double Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.8 N-Step Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Part II Value Function Approximation


6 Linear Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 The Challenge of Large-Scale MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.3 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4 Linear Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7 Nonlinear Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3 Policy Evaluation with Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.4 Naive Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5 Deep Q-Learning with Experience Replay and Target Network . . . . . . . . . . . . . . . . . 147
7.6 DQN for Atari Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8 Improvements to DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1 DQN with Double Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.2 Prioritized Experience Replay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.3 Advantage function and Dueling Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . 169
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

Part III Policy Approximation


9 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.1 Policy-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.2 Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.3 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.4 REINFORCE with Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
9.5 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

9.6 Using Entropy to Encourage Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192


9.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10 Problems with Continuous Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.1 The Challenges of Problems with Continuous Action Space . . . . . . . . . . . . . . . . . . . . 197
10.2 MuJoCo Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.3 Policy Gradient for Problems with Continuous Action Space . . . . . . . . . . . . . . . . . . . 200
10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11 Advanced Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.1 Problems with the Standard Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.2 Policy Performance Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
11.3 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
11.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

Part IV Advanced Topics


12 Distributed Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
12.1 Why Use Distributed Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
12.2 General Distributed Reinforcement Learning Architecture . . . . . . . . . . . . . . . . . . . . . 224
12.3 Data Parallelism for Distributed Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 229
12.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
13 Curiosity-Driven Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
13.1 Hard-to-Explore Problems vs. Sparse Reward Problems . . . . . . . . . . . . . . . . . . . . . . . 233
13.2 Curiosity-Driven Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
13.3 Random Network Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
13.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
14 Planning with a Model: AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.1 Why We Need to Plan in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
14.3 AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
14.4 Training AlphaZero on a 9 × 9 Go Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
14.5 Training AlphaZero on a 13 × 13 Gomoku Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
About the Author

Michael Hu is an exceptional software engineer with a wealth of expertise spanning over a decade, specializing in the design and
implementation of enterprise-level applications. His current focus
revolves around leveraging the power of machine learning (ML)
and artificial intelligence (AI) to revolutionize operational systems
within enterprises. A true coding enthusiast, Michael finds solace
in the realms of mathematics and continuously explores cutting-
edge technologies, particularly machine learning and deep learning.
His unwavering passion lies in the realm of deep reinforcement
learning, where he constantly seeks to push the boundaries of
knowledge. He has built numerous open source projects on GitHub that closely emulate
state-of-the-art reinforcement learning algorithms pioneered by
DeepMind, including notable examples like AlphaZero, MuZero,
and Agent57. Through these projects, Michael demonstrates his
commitment to advancing the field and sharing his knowledge with
fellow enthusiasts. He currently resides in the city of Shanghai,
China.

About the Technical Reviewer

Shovon Sengupta has over 14 years of expertise and a deep understanding of advanced predictive analytics, machine learning,
deep learning, and reinforcement learning. He has established a
place for himself by creating innovative financial solutions that
have won numerous awards. He is currently working for one of the
leading multinational financial services corporations in the United
States as the Principal Data Scientist at the AI Center of Excellence.
His job entails leading innovative initiatives that rely on artificial
intelligence to address challenging business problems. He has a US
patent (United States Patent: Sengupta et al.: Automated Predictive
Call Routing Using Reinforcement Learning [US 10,356,244 B1])
to his credit. He is also a Ph.D. scholar at BITS Pilani. He has
reviewed quite a few popular titles from leading publishers like
Packt and Apress and has also authored a few courses for Packt
and CodeRed (EC-Council) in the realm of machine learning. Apart
from that, he has presented at various international conferences on
machine learning, time series forecasting, and building trustworthy
AI. His primary research is concentrated on deep reinforcement
learning, deep learning, natural language processing (NLP), knowl-
edge graph, causality analysis, and time series analysis. For more
details about Shovon’s work, please check out his LinkedIn page:
www.linkedin.com/in/shovon-sengupta-272aa917.

Preface

Reinforcement learning (RL) is a highly promising yet challenging subfield of artificial intelligence
(AI) that plays a crucial role in shaping the future of intelligent systems. From robotics and
autonomous agents to recommendation systems and strategic decision-making, RL enables machines
to learn and adapt through interactions with their environment. Its remarkable success stories include
RL agents achieving human-level performance in video games and even surpassing world champions
in strategic board games like Go. These achievements highlight the immense potential of RL in solving
complex problems and pushing the boundaries of AI.
What sets RL apart from other AI subfields is its fundamental approach: agents learn by interacting
with the environment, mirroring how humans acquire knowledge. However, RL poses challenges that
distinguish it from other AI disciplines. Unlike methods that rely on precollected training data, RL
agents generate their own training samples. These agents are not explicitly instructed on how to
achieve a goal; instead, they receive state representations of the environment and a reward signal,
forcing them to explore and discover optimal strategies on their own. Moreover, RL involves complex
mathematics that underpin the formulation and solution of RL problems.
While numerous books on RL exist, they typically fall into two categories. The first category
emphasizes the fundamentals and mathematics of RL, serving as reference material for researchers
and university students. However, these books often lack implementation details. The second
category focuses on practical hands-on coding of RL algorithms, neglecting the underlying theory
and mathematics. This apparent gap between theory and implementation prompted us to create
this book, aiming to strike a balance by equally emphasizing fundamentals, mathematics, and the
implementation of successful RL algorithms.
This book is designed to be accessible and informative for a diverse audience. It is targeted toward
researchers, university students, and practitioners seeking a comprehensive understanding of RL. By
following a structured approach, the book equips readers with the necessary knowledge and tools to
apply RL techniques effectively in various domains.
The book is divided into four parts, each building upon the previous one. Part I focuses on the
fundamentals and mathematics of RL, which form the foundation for almost all discussed algorithms.
We begin by solving simple RL problems using tabular methods. Chapter 2, the cornerstone of this
part, explores Markov decision processes (MDPs) and the associated value functions, which are
recurring concepts throughout the book. Chapters 3 to 5 delve deeper into these fundamental concepts
by discussing how to use dynamic programming (DP), Monte Carlo methods, and temporal difference
(TD) learning methods to solve small MDPs.
Part II tackles the challenge of solving large-scale RL problems that render tabular methods
infeasible due to their complexity (e.g., large or infinite state spaces). Here, we shift our focus to value
function approximation, with particular emphasis on leveraging (deep) neural networks. Chapter 6
provides a brief introduction to linear value function approximation, while Chap. 7 delves into the
renowned Deep Q-Network (DQN) algorithm. In Chap. 8, we discuss enhancements to the DQN
algorithm.
Part III explores policy-based methods as an alternative approach to solving RL problems.
While Parts I and II primarily focus on value-based methods (learning the value function), Part III
concentrates on learning the policy directly. We delve into the theory behind policy gradient methods
and the REINFORCE algorithm in Chap. 9, where we also explore Actor-Critic algorithms,
which combine policy-based and value-based approaches. Chapter 10 turns to problems with
continuous action spaces. Furthermore, Chap. 11 covers
advanced policy-based algorithms, including surrogate objective functions and the renowned Proximal
Policy Optimization (PPO) algorithm.
The final part of the book addresses advanced RL topics. Chapter 12 discusses how distributed
RL can enhance agent performance, while Chap. 13 explores the challenges of hard-to-explore RL
problems and presents curiosity-driven exploration as a potential solution. In the concluding chapter,
Chap. 14, we delve into model-based RL by providing a comprehensive examination of the famous
AlphaZero algorithm.
Unlike a typical hands-on coding handbook, this book does not primarily focus on coding exercises.
Instead, we dedicate our resources and time to explaining the fundamentals and core ideas behind
each algorithm. Nevertheless, we provide complete source code for all examples and algorithms
discussed in the book. Our code implementations are done from scratch, without relying on third-
party RL libraries, except for essential tools like Python, OpenAI Gym, Numpy, and the PyTorch
deep learning framework. While third-party RL libraries expedite the implementation process in real-
world scenarios, we believe coding each algorithm independently is the best approach for learning RL
fundamentals and mastering the various RL algorithms.
Throughout the book, we employ mathematical notations and equations, which some readers
may perceive as heavy. However, we prioritize intuition over rigorous proofs, making the material
accessible to a broader audience. A foundational understanding of calculus at a basic college level,
minimal familiarity with linear algebra, and elementary knowledge of probability and statistics
are sufficient to embark on this journey. We strive to ensure that interested readers from diverse
backgrounds can benefit from the book’s content.
We assume that readers have programming experience in Python since all the source code is
written in this language. While we briefly cover the basics of deep learning in Chap. 7, including
neural networks and their workings, we recommend some prior familiarity with machine learning,
specifically deep learning concepts such as training a deep neural network. However, beyond
the introductory coverage, readers can explore additional resources and materials to expand their
knowledge of deep learning.
This book draws inspiration from Reinforcement Learning: An Introduction by Richard S. Sutton
and Andrew G. Barto, a renowned RL publication. Additionally, it is influenced by prestigious
university RL courses, particularly the mathematical style and notation derived from Professor Emma
Brunskill’s RL course at Stanford University. Although our approach may differ slightly from Sutton
and Barto’s work, we strive to provide simpler explanations. Additionally, we have derived some
examples from Professor David Silver’s RL course at University College London, which offers a
comprehensive resource for understanding the fundamentals presented in Part I. We would like to
express our gratitude to Professor Dimitri P. Bertsekas for his invaluable guidance and inspiration
in the field of optimal control and reinforcement learning. Furthermore, the content of this book
incorporates valuable insights from research papers published by various organizations and individual
researchers.
In conclusion, this book aims to bridge the gap between the fundamental concepts, mathematics,
and practical implementation of RL algorithms. By striking a balance between theory and implementa-
tion, we provide readers with a comprehensive understanding of RL, empowering them to apply these
techniques in various domains. We present the necessary mathematics and offer complete source
code for implementation to help readers gain a deep understanding of RL principles. We hope this
book serves as a valuable resource for readers seeking to explore the fundamentals, mathematics, and
practical aspects of RL algorithms. We must acknowledge that despite careful editing from our editors
and multiple rounds of review, we cannot guarantee the book's content is error-free. Your feedback and
corrections are invaluable to us. Please do not hesitate to contact us with any concerns or suggestions
for improvement.
Source Code
You can download the source code used in this book from github.com/apress/art-of-reinforcement-learning.

Michael Hu
Part I
Foundation
1 Introduction

Artificial intelligence has made impressive progress in recent years, with breakthroughs achieved
in areas such as image recognition, natural language processing, and playing games. In particular,
reinforcement learning, a type of machine learning that focuses on learning by interacting with an
environment, has led to remarkable achievements in the field.
In this book, we focus on the combination of reinforcement learning and deep neural networks,
which have become central to the success of agents that can master complex games such as the board
game Go and Atari video games.
This first chapter provides an overview of reinforcement learning, including key concepts such as
states, rewards, policies, and the common terms used in reinforcement learning, like the difference
between episodic and continuing reinforcement learning problems, model-free vs. model-based
methods.
Despite the impressive progress in the field, reinforcement learning still faces significant chal-
lenges. For example, it can be difficult to learn from sparse rewards, and the methods can suffer from
instability. Additionally, scaling to large state and action spaces can be a challenge.
Throughout this book, we will explore these concepts in greater detail and discuss state-of-the-art
techniques used to address these challenges. By the end of this book, you will have a comprehensive
understanding of the principles of reinforcement learning and how they can be applied to real-world
problems.
We hope this introduction has sparked your curiosity about the potential of reinforcement learning,
and we invite you to join us on this journey of discovery.

1.1 AI Breakthrough in Games


Atari

The Atari 2600 is a home video game console developed by Atari Interactive, Inc. in the 1970s. It
features a collection of iconic video games. These games, such as Pong, Breakout, Space Invaders,
and Pac-Man, have become classic examples of early video gaming culture. On this platform, players
can interact with these classic games using a joystick controller.


Fig. 1.1 A DQN agent learning to play Atari’s Breakout. The goal of the game is to use a paddle to bounce a ball
up and break through a wall of bricks. The agent only takes in the raw pixels from the screen, and it has to figure out
what’s the right action to take in order to maximize the score. Idea adapted from Mnih et al. [1]. Game owned by Atari
Interactive, Inc.

The breakthrough in Atari games came in 2015 when Mnih et al. [1] from DeepMind developed
an AI agent called DQN to play a list of Atari video games, some even better than humans.
What makes the DQN agent so influential is how it was trained to play the game. Similar to a
human player, the agent was only given the raw pixel image of the screen as input, as illustrated in
Fig. 1.1, and it had to figure out the rules of the game all by itself and decide what to do during the
game to maximize the score. No human expert knowledge, such as predefined rules or sample games
of human play, was given to the agent.
The DQN agent is a type of reinforcement learning agent that learns by interacting with an
environment and receiving a reward signal. In the case of Atari games, the DQN agent receives a
score for each action it takes.
Mnih et al. [1] trained and tested their DQN agents on 57 Atari video games. They trained one
DQN agent per Atari game, with each agent playing only the game it was trained on; the training
was over millions of frames. The DQN agent could play half of the games (30 of 57 games) at or better
than a human player, as shown by Mnih et al. [1]. This means that the agent was able to learn and
develop strategies that were better than what a human player could come up with.
Since then, various organizations and researchers have made improvements to the DQN agent,
incorporating several new techniques. The Atari video games have become one of the most used test
beds for evaluating the performance of reinforcement learning agents and algorithms. The Arcade
Learning Environment (ALE) [2], which provides an interface to hundreds of Atari 2600 game
environments, is commonly used by researchers for training and testing reinforcement learning agents.
In summary, the Atari video games have become a classic example of early video gaming
culture, and the Atari 2600 platform provides a rich environment for training agents in the field of
reinforcement learning. The breakthrough of DeepMind’s DQN agent, trained and tested on 57 Atari
video games, demonstrated the capability of an AI agent to learn and make decisions through trial-
and-error interactions with classic games. This breakthrough has spurred many improvements and
advancements in the field of reinforcement learning, and the Atari games have become a popular test
bed for evaluating the performance of reinforcement learning algorithms.

Go

Go is an ancient Chinese strategy board game played by two players, who take turns placing stones
on a 19 × 19 board with the goal of surrounding more territory than the opponent. Each player
has a set of black or white stones, and the game begins with an empty board. Players alternate placing
stones on the board, with the black player going first.

Fig. 1.2 Yoda Norimoto (black) vs. Kiyonari Tetsuya (white), Go game from the 66th NHK Cup, 2018. White won by
0.5 points. Game record from CWI [4]
The stones are placed on the intersections of the lines on the board, rather than in the squares.
Once a stone is placed on the board, it cannot be moved, but it can be captured by the opponent if it
is completely surrounded by their stones. Stones that are surrounded and captured are removed from
the board.
The game continues until both players pass, at which point the territory on the board is counted. A
player’s territory is the set of empty intersections that are completely surrounded by their stones, plus
any captured stones. The player with the larger territory wins the game. In the case of the final board
position shown in Fig. 1.2, the white won by 0.5 points.
Although the rules of the game are relatively simple, the game is extremely complex. For instance,
the number of legal board positions in Go is enormously large compared to Chess. According to
research by Tromp and Farnebäck [3], the number of legal board positions in Go is approximately
2.1 × 10¹⁷⁰, which is vastly greater than the number of atoms in the universe.

This complexity presents a significant challenge for artificial intelligence (AI) agents that attempt
to play Go. In March 2016, an AI agent called AlphaGo developed by Silver et al. [5] from DeepMind
made history by beating the legendary Korean player Lee Sedol with a score of 4-1 in Go. Lee Sedol
is a winner of 18 world titles and is considered one of the greatest Go players of the past decade.

AlphaGo’s victory was remarkable because it used a combination of deep neural networks and tree
search algorithms, as well as the technique of reinforcement learning.
AlphaGo was trained using a combination of supervised learning from human expert games and
reinforcement learning from games of self-play. This training enabled the agent to develop creative
and innovative moves that surprised both Lee Sedol and the Go community.
The success of AlphaGo has sparked renewed interest in the field of reinforcement learning and
has demonstrated the potential for AI to solve complex problems that were once thought to be the
exclusive domain of human intelligence. One year later, Silver et al. [6] from DeepMind introduced a
new and more powerful agent, AlphaGo Zero. AlphaGo Zero was trained using pure self-play, without
any human expert moves in its training, achieving a higher level of play than the previous AlphaGo
agent. They also made other improvements like simplifying the training processes.
To evaluate the performance of the new agent, they set it to play games against the exact same
AlphaGo agent that beat the world champion Lee Sedol in 2016, and this time the new AlphaGo Zero
beat AlphaGo with a score of 100-0.
In the following year, Schrittwieser et al. [7] from DeepMind generalized the AlphaGo Zero agent
to play not only Go but also other board games like Chess and Shogi (Japanese chess), and they called
this generalized agent AlphaZero. AlphaZero is a more general reinforcement learning algorithm that
can be applied to a variety of board games, not just Go, Chess, and Shogi.
Reinforcement learning is a type of machine learning in which an agent learns to make decisions
based on the feedback it receives from its environment. Both DQN and AlphaGo (and its successor)
agents use this technique, and their achievements are very impressive. Although these agents are
designed to play games, this does not mean that reinforcement learning is only capable of playing
games. In fact, there are many more challenging problems in the real world, such as navigating a robot,
driving an autonomous car, and automating web advertising. Games are relatively easy to simulate and
implement compared to these other real-world problems, but reinforcement learning has the potential
to be applied to a wide range of complex challenges beyond game playing.

1.2 What Is Reinforcement Learning

In computer science, reinforcement learning is a subfield of machine learning that focuses on learning
how to act in a world or an environment. The goal of reinforcement learning is for an agent to
learn from interacting with the environment in order to make a sequence of decisions that maximize
accumulated reward in the long run. This process is known as goal-directed learning.
Unlike other machine learning approaches like supervised learning, reinforcement learning does
not rely on labeled data to learn from. Instead, the agent must learn through trial and error, without
being directly told the rules of the environment or what action to take at any given moment. This
makes reinforcement learning a powerful tool for modeling and solving real-world problems where
the rules and optimal actions may not be known or easily determined.
Reinforcement learning is not limited to computer science, however. Similar ideas are studied in
other fields under different names, such as operations research and optimal control in engineering.
While the specific methods and details may vary, the underlying principles of goal-directed learning
and decision-making are the same.
Examples of reinforcement learning in the real world are all around us. Human beings, for example,
are naturally good at learning from interacting with the world around us. From learning to walk as a
baby to learning to speak our native language to learning to drive a car, we learn through trial and error
and by receiving feedback from the environment. Similarly, animals can also be trained to perform a
variety of tasks through a process similar to reinforcement learning. For instance, service dogs can be
trained to assist individuals in wheelchairs, while police dogs can be trained to help search for missing
people.
One vivid example that illustrates the idea of reinforcement learning is a video of a dog with a big
stick in its mouth trying to cross a narrow bridge.1 The video shows the dog attempting to pass the
bridge, but failing multiple times. However, after some trial and error, the dog eventually discovers
that by tilting its head, it can pass the bridge with its favorite stick. This simple example demonstrates
the power of reinforcement learning in solving complex problems by learning from the environment
through trial and error.

1 Dog Thinks Through A Problem: www.youtube.com/watch?v=m_CrIu01SnM.

1.3 Agent-Environment in Reinforcement Learning

Reinforcement learning is a type of machine learning that focuses on how an agent can learn to
make optimal decisions by interacting with an environment. The agent-environment loop is the core
of reinforcement learning, as shown in Fig. 1.3. In this loop, the agent observes the state of the
environment and a reward signal, takes an action, and receives a new state and reward signal from the
environment. This process continues iteratively, with the agent learning from the rewards it receives
and adjusting its actions to maximize future rewards.

Environment

The environment is the world in which the agent operates. It can be a physical system, such as a
robot navigating a maze, or a virtual environment, such as a game or a simulation.

Fig. 1.3 Top: Agent-environment in reinforcement learning in a loop. Bottom: The loop unrolled by time

The environment provides the agent with two pieces of information: the state of the environment and a reward signal.
The state describes the relevant information about the environment that the agent needs to make a
decision, such as the position of the robot or the cards in a poker game. The reward signal is a scalar
value that indicates how well the agent is doing in its task. The agent’s objective is to maximize its
cumulative reward over time.
The environment has its own set of rules, which determine how the state and reward signal change
based on the agent’s actions. These rules are often called the dynamics of the environment. In many
cases, the agent does not have access to the underlying dynamics of the environment and must learn
them through trial and error. This is similar to how we humans interact with the physical world every
day: normally we have a pretty good sense of what's going on around us, but it's difficult to fully
understand the dynamics of the universe.
Game environments are a popular choice for reinforcement learning because they provide a clear
objective and well-defined rules. For example, a reinforcement learning agent could learn to play the
game of Pong by observing the screen and receiving a reward signal based on whether it wins or loses
the game.
In a robotic environment, the agent is a robot that must learn to navigate a physical space or perform
a task. For example, a reinforcement learning agent could learn to navigate a maze by using sensors
to detect its surroundings and receiving a reward signal based on how quickly it reaches the end of the
maze.

State

In reinforcement learning, an environment state, or simply state, is the data provided by the
environment to describe its current situation. The state can be discrete or continuous.
For instance, when driving a stick shift car, the speed of the car is a continuous variable, while the
current gear is a discrete variable.
Ideally, the environment state should contain all relevant information that’s necessary for the agent
to make decisions. For example, in a single-player video game like Breakout, the pixels of frames
of the game contain all the information necessary for the agent to make a decision. Similarly, in an
autonomous driving scenario, the sensor data from the car’s cameras, lidar, and other sensors provide
relevant information about the surrounding environment.
However, in practice, the available information may depend on the task and domain. In a two-player
board game like Go, for instance, although we have perfect information about the board position, we
don’t have perfect knowledge about the opponent player, such as what they are thinking in their
head or what their next move will be. This makes the state representation more challenging in such
scenarios.
Furthermore, the environment state might also include noisy data. For example, a reinforcement
learning agent driving an autonomous car might use multiple cameras at different angles to capture
images of the surrounding area. Suppose the car is driving near a park on a windy day. In that case,
the onboard cameras could also capture images of some trees in the park that are swaying in the
wind. Since the movement of these trees should not affect the agent’s ability to drive, because the
trees are inside the park and not on the road or near the road, we can consider these movements of
the trees as noise to the self-driving agent. However, it can be challenging to ignore them from the
captured images. To tackle this problem, researchers might use various techniques such as filtering
and smoothing to eliminate the noisy data and obtain a cleaner representation of the environment
state.
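
To make the distinction concrete, the short sketch below contrasts a Breakout-style state (a raw screen image) with a driving-style state that mixes continuous and discrete variables. The array shape matches the standard Atari screen; the driving values are invented purely for illustration.

import numpy as np

# Breakout-style state: the raw screen image, a 210 x 160 RGB frame of
# pixel intensities (a placeholder array of zeros here).
screen = np.zeros((210, 160, 3), dtype=np.uint8)
print(screen.shape)                        # (210, 160, 3)

# Driving-style state: a mix of continuous and discrete variables, like
# the stick shift example above (speed is continuous, gear is discrete).
car_state = {"speed_kmh": 42.5, "gear": 3}
print(car_state)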

Reward

In reinforcement learning, the reward signal is a numerical value that the environment provides to the
agent after the agent takes some action. The reward can be any numerical value, positive, negative, or
zero. However, in practice, the reward function often varies from task to task, and we need to carefully
design a reward function that is specific to our reinforcement learning problem.
Designing an appropriate reward function is crucial for the success of the agent. The reward
function should be designed to encourage the agent to take actions that will ultimately lead to
achieving our desired goal. For example, in the game of Go, the reward is 0 at every step before the
game is over, and +1 or −1 if the agent wins or loses the game, respectively. This design incentivizes
the agent to win the game, without explicitly telling it how to win.
Similarly, in the game of Breakout, the reward can be a positive number if the agent destroys some
bricks, a negative number if the agent fails to catch the ball, and zero otherwise. This design
incentivizes the agent to destroy as many bricks as possible while avoiding losing the ball, without
explicitly telling it how to achieve a high score.
The reward function plays a crucial role in the reinforcement learning process. The goal of the
agent is to maximize the accumulated rewards over time. By optimizing the reward function, we can
guide the agent to learn a policy that will achieve our desired goal. Without the reward signal, the
agent would not know what the goal is and would not be able to learn effectively.
In summary, the reward signal is a key component of reinforcement learning that incentivizes the
agent to take actions that ultimately lead to achieving the desired goal. By carefully designing the
reward function, we can guide the agent to learn an optimal policy.
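
As a small sketch of the Breakout reward design described above (the function name, its signature, and the exact values are invented for illustration; the real Atari environment computes the game score internally):

def breakout_style_reward(bricks_destroyed: int, ball_lost: bool) -> float:
    """Positive reward for destroying bricks, negative for losing the ball,
    and zero otherwise, following the design described above."""
    if ball_lost:
        return -1.0
    if bricks_destroyed > 0:
        return float(bricks_destroyed)     # e.g., +1 per brick destroyed this step
    return 0.0

# The agent is never told how to score; it only ever observes numbers like these.
print(breakout_style_reward(bricks_destroyed=2, ball_lost=False))   # 2.0
print(breakout_style_reward(bricks_destroyed=0, ball_lost=True))    # -1.0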

Agent

In reinforcement learning, an agent is an entity that interacts with an environment by making decisions
based on the received state and reward signal from the environment. The agent’s goal is to maximize
its cumulative reward in the long run. The agent must learn to make the best decisions by trial and
error, which involves exploring different actions and observing the resulting rewards.
In addition to its external interactions with the environment, the agent may also have an internal
state that represents its knowledge about the world. This internal state can include things like memory of
past experiences and learned strategies.
It’s important to distinguish the agent’s internal state from the environment state. The environment
state represents the current state of the world that the agent is trying to influence through its actions.
The agent, however, has no direct control over the environment state. It can only affect the environment
state by taking actions and observing the resulting changes in the environment. For example, if the
agent is playing a game, the environment state might include the current positions of game pieces,
while the agent’s internal state might include the memory of past moves and the strategies it has
learned.
In this book, we will typically use the term “state” to refer to the environment state. However,
it’s important to keep in mind the distinction between the agent’s internal state and the environment
state. By understanding the role of the agent and its interactions with the environment, we can
better understand the principles behind reinforcement learning algorithms. It is worth noting that the
terms “agent” and “algorithm” are frequently used interchangeably in this book, particularly in later
chapters.

Action

In reinforcement learning, the agent interacts with an environment by selecting actions that affect the
state of the environment. Actions are chosen from a predefined set of possibilities, which are specific
to each problem. For example, in the game of Breakout, the agent can choose to move the paddle to
the left or right or take no action. It cannot perform actions like jumping or rolling over. In contrast,
in the game of Pong, the agent can choose to move the paddle up or down but not left or right.
The chosen action affects the future state of the environment. The agent’s current action may have
long-term consequences, meaning that it will affect the environment’s states and rewards for many
future time steps, not just the next immediate stage of the process.
Actions can be either discrete or continuous. In problems with discrete actions, the set of possible
actions is finite and well defined. Examples of such problems include Atari and Go board games.
In contrast, problems with continuous actions have an infinite set of possible actions, often within
a continuous range of values. An example of a problem with continuous actions is robotic control,
where the degree of angle movement of a robot arm is often a continuous action.
Reinforcement learning problems with discrete actions are generally easier to solve than those with
continuous actions. Therefore, this book will focus on solving reinforcement learning problems with
discrete actions. However, many of the concepts and techniques discussed in this book can be applied
to problems with continuous actions as well.
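
The difference between discrete and continuous actions can be expressed directly with OpenAI Gym's space objects. The sketch below is illustrative only: the four Breakout actions and the six-joint robot arm are chosen as examples, not taken from a specific environment.

from gym import spaces

# Discrete action space: Breakout's small, finite set of actions
# (NOOP, FIRE, LEFT, RIGHT), identified by the integer indices 0..3.
breakout_actions = spaces.Discrete(4)
print(breakout_actions.sample())           # an integer in {0, 1, 2, 3}

# Continuous action space: a torque in [-1, 1] for each joint of a
# hypothetical six-joint robot arm.
arm_actions = spaces.Box(low=-1.0, high=1.0, shape=(6,))
print(arm_actions.sample())                # an array of six real values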

Policy

A policy is a key concept in reinforcement learning that defines the behavior of an agent. In particular,
it maps each possible state in the environment to the probabilities of choosing different actions. By
specifying how the agent should behave, a policy guides the agent to interact with its environment and
maximize its cumulative reward. We will delve into the details of policies and how they interact with
the MDP framework in Chap. 2.
For example, suppose an agent is navigating a grid-world environment. A simple policy might
dictate that the agent should always move to the right until it reaches the goal location. Alternatively,
a more sophisticated policy could specify that the agent should choose its actions based on its current
position and the probabilities of moving to different neighboring states.
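As a rough sketch of this idea (the grid-world states, actions, and probabilities below are invented for illustration and are not the book's own example), a policy can be represented as a mapping from each state to a probability distribution over actions:

import random

# A policy maps each state to a probability distribution over actions.
# The states, actions, and probabilities here are hypothetical.
policy = {
    "cell_A": {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
    "cell_B": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
}

def select_action(state):
    """Sample an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action("cell_A"))  # most often "right"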

Model

In reinforcement learning, a model refers to a mathematical description of the dynamics function and
reward function of the environment. The dynamics function describes how the environment evolves
from one state to another, while the reward function specifies the reward that the agent receives for
taking certain actions in certain states.
In many cases, the agent does not have access to a perfect model of the environment. This makes
learning a good policy challenging, since the agent must learn from experience how to interact with
the environment to maximize its reward. However, there are some cases where a perfect model is
available. For example, if the agent is playing a game with fixed rules and known outcomes, the agent
can use this knowledge to select its actions strategically. We will explore this scenario in detail in
Chap. 2.
In reinforcement learning, the agent-environment boundary can be ambiguous. Although a house-cleaning robot may appear to be a single agent, the boundary of the agent is typically defined by what it directly controls,
while the remaining components comprise the environment. In this case, the robot's wheels and other hardware are considered part of the environment, since they aren't directly controlled by the
agent. We can think of the robot as a complex system composed of several parts, such as hardware,
software, and the reinforcement learning agent, which can control the robot’s movement by signaling
the software interface, which then communicates with microchips to manage the wheel movement.

1.4 Examples of Reinforcement Learning

Reinforcement learning is a versatile technique that can be applied to a variety of real-world problems.
While its success in playing games is well known, there are many other areas where it can be used as
an effective solution. In this section, we explore a few examples of how reinforcement learning can
be applied to real-world problems.

Autonomous Driving

Reinforcement learning can be used to train autonomous vehicles to navigate complex and unpre-
dictable environments. The goal for the agent is to safely and efficiently drive the vehicle to a desired
location while adhering to traffic rules and regulations. The reward signal could be a positive number
for successful arrival at the destination within a specified time frame and a negative number for any
accidents or violations of traffic rules. The environment state could contain information about the
vehicle’s location, velocity, and orientation, as well as sensory data such as camera feeds and radar
readings. Additionally, the state could include the current traffic conditions and weather, which would
help the agent to make better decisions while driving.

Navigating Robots on a Factory Floor

One practical application of reinforcement learning is to train robots to navigate a factory floor. The
goal for the agent is to safely and efficiently transport goods from one point to another without
disrupting the work of human employees or other robots. In this case, the reward signal could be
a positive number for successful delivery within a specified time frame and a negative number for
any accidents or damages caused. The environment state could contain information about the robot’s
location, the weight and size of the goods being transported, the location of other robots, and sensory
data such as camera feeds and battery level. Additionally, the state could include information about
the production schedule, which would help the agent to prioritize its tasks.

Automating Web Advertising

Another application of reinforcement learning is to automate web advertising. The goal for the agent is
to select the most effective type of ad to display to a user, based on their browsing history and profile.
The reward signal could be a positive number when the user clicks on the ad, and zero otherwise.
The environment state could contain information such as the user’s search history, demographics, and
current trends on the Internet. Additionally, the state could include information about the context of
the web page, which would help the agent to choose the most relevant ad.

Video Compression

Reinforcement learning can also be used to improve video compression. DeepMind’s MuZero agent
has been adapted to optimize video compression for some YouTube videos. In this case, the goal for
the agent is to compress the video as much as possible without compromising the quality. The reward
signal could be a positive number for high-quality compression and a negative number for low-quality
compression. The environment state could contain information such as the video’s resolution, bit rate,
frame rate, and the complexity of the scenes. Additionally, the state could include information about
the viewing device, which would help the agent to optimize the compression for the specific device.
Overall, reinforcement learning has enormous potential for solving real-world problems in various
industries. The key to successful implementation is to carefully design the reward signal and the
environment state to reflect the specific goals and constraints of the problem. Additionally, it is
important to continually monitor and evaluate the performance of the agent to ensure that it is making
the best decisions.

1.5 Common Terms in Reinforcement Learning


Episodic vs. Continuing Tasks

In reinforcement learning, the type of problem or task is categorized as episodic or continuing, depending on whether it has a natural ending. A natural ending refers to a point in a task or problem where it is reasonable to consider the task or problem completed.
Episodic problems have a natural termination point or terminal state, at which point the task is over,
and a new episode starts. A new episode is independent of previous episodes. Examples of episodic
problems include playing an Atari video game, where the game is over when the agent loses all lives or wins the game, and a new episode always starts when we reset the environment, regardless of whether
the agent won or lost the previous game. Other examples of episodic problems include Tic-Tac-Toe,
chess, or Go games, where each game is independent of the previous game.
On the other hand, continuing problems do not have a natural endpoint, and the process could go on
indefinitely. Examples of continuing problems include personalized advertising or recommendation
systems, where the agent’s goal is to maximize a user’s satisfaction or click-through rate over an
indefinite period. Another example of a continuing problem is automated stock trading, where the agent aims to maximize its profits in the stock market by buying and selling stocks. In this scenario, the agent's actions, such as which stocks it buys and the timing of its trades, can influence future prices and thus affect its future profits. The agent's goal is to maximize its long-term profits by trading continuously, and stock prices will continue to fluctuate in the future. Thus, the agent's past trades will affect its future decisions, and there is no natural termination point for the
problem.
It is possible to design some continuing reinforcement learning problems as episodic by using
a time-constrained approach. For example, the episode could be over when the market is closed.
However, in this book, we only consider natural episodic problems, that is, problems with a natural termination point.
Understanding the differences between episodic and continuing problems is crucial for designing
effective reinforcement learning algorithms for various applications. For example, episodic problems
may require a different algorithmic approach than continuing problems due to the differences in their
termination conditions. Furthermore, in real-world scenarios, distinguishing between episodic and
continuing problems can help identify the most appropriate reinforcement learning approach to use
for a particular task or problem.

Deterministic vs. Stochastic Tasks

In reinforcement learning, it is important to distinguish between deterministic and stochastic problems. A problem is deterministic if the outcome is always the same when the agent takes the same action in the same environment state. For example, an Atari video game or a game of Go is a
deterministic problem. In these games, the rules of the game are fixed; when the agent repeatedly
takes the same action under the same environment condition (state), the outcome (reward signal and
next state) is always the same.
The reason that these games are considered deterministic is that the environment’s dynamics and
reward functions do not change over time. The rules of the game are fixed, and the environment always
behaves in the same way for a given set of actions and states. This allows the agent to learn a policy
that maximizes the expected reward by simply observing the outcomes of its actions.
On the other hand, a problem is stochastic if the outcome is not always the same when the agent
takes the same action in the same environment state. One example of a stochastic environment is
playing poker. The outcome of a particular hand is not entirely determined by the actions of the player.
Other players at the table can also take actions that influence the outcome of the hand. Additionally,
the cards dealt to each player are determined by a shuffled deck, which introduces an element of
chance into the game.
For example, let’s say a player is dealt a pair of aces in a game of Texas hold’em. The player might
decide to raise the bet, hoping to win a large pot. However, the other players at the table also have
their own sets of cards and can make their own decisions based on the cards they hold and the actions
of the other players.
If another player has a pair of kings, they might also decide to raise the bet, hoping to win the pot.
If a third player has a pair of twos, they might decide to fold, as their hand is unlikely to win. The
outcome of the hand depends not only on the actions of the player with the pair of aces but also on the
actions of the other players at the table, as well as the cards dealt to them.
This uncertainty and complexity make poker a stochastic problem. While it is possible to use
various strategies to improve one’s chances of winning in poker, the outcome of any given hand is
never certain, and a skilled player must be able to adjust their strategy based on the actions of the
other players and the cards dealt.
Another example of stochastic environment is the stock market. The stock market is a stochastic
environment because the outcome of an investment is not always the same when the same action is
taken in the same environment state. There are many factors that can influence the price of a stock,
such as company performance, economic conditions, geopolitical events, and investor sentiment.
These factors are constantly changing and can be difficult to predict, making it impossible to know
with certainty what the outcome of an investment will be.
For example, let’s say you decide to invest in a particular stock because you believe that the
company is undervalued and has strong growth prospects. You buy 100 shares at a price of $145.0
per share. However, the next day, the company announces that it has lost a major customer and its
revenue projections for the next quarter are lower than expected. The stock price drops to $135.0
per share, and you have lost $1000 on your investment. It is most likely the stochastic nature of the environment, rather than the action itself (buying 100 shares), that led to this loss.
While it is possible to use statistical analysis and other tools to try to predict stock price movements,
there is always a level of uncertainty and risk involved in investing in the stock market. This
uncertainty and risk are what make the stock market a stochastic environment, and why it is important
to use appropriate risk management techniques when making investment decisions.

In this book, we focus on deterministic reinforcement learning problems. By understanding the fundamentals of deterministic reinforcement learning, readers will be well equipped to tackle more
complex and challenging problems in the future.

Model-Free vs. Model-Based Reinforcement Learning

In reinforcement learning, an environment is a system with which an agent interacts in order to achieve a goal. A model is a mathematical representation of the environment's dynamics and
reward functions. Model-free reinforcement learning means the agent does not use the model of the
environment to help it make decisions. This may occur because either the agent lacks access to the
accurate model of the environment or the model is too complex to use during decision-making.
In model-free reinforcement learning, the agent learns to take actions based on its experiences of
the environment without explicitly simulating future outcomes. Examples of model-free reinforce-
ment learning methods include Q-learning, SARSA (State-Action-Reward-State-Action), and deep
reinforcement learning algorithms such as DQN, which we’ll introduce later in the book.
On the other hand, in model-based reinforcement learning, the agent uses a model of the environ-
ment to simulate future outcomes and plan its actions accordingly. This may involve constructing a
complete model of the environment or using a simplified model that captures only the most essential
aspects of the environment’s dynamics. Model-based reinforcement learning can be more sample-
efficient than model-free methods in certain scenarios, especially when the environment is relatively
simple and the model is accurate. Examples of model-based reinforcement learning methods include
dynamic programming algorithms, such as value iteration and policy iteration, and probabilistic
planning methods, such as Monte Carlo Tree Search in AlphaZero agent, which we’ll introduce later
in the book.
In summary, model-free and model-based reinforcement learning are two different approaches
to solving the same problem of maximizing rewards in an environment. The choice between these
approaches depends on the properties of the environment, the available data, and the computational
resources.

1.6 Why Study Reinforcement Learning

Machine learning is a vast and rapidly evolving field, with many different approaches and techniques.
As such, it can be challenging for practitioners to know which type of machine learning to use for a
given problem. By discussing the strengths and limitations of different branches of machine learning,
we can better understand which approach might be best suited to a particular task. This can help us
make more informed decisions when developing machine learning solutions and ultimately lead to
more effective and efficient systems.
There are three branches of machine learning. One of the most popular and widely adopted in the
real world is supervised learning, which is used in domains like image recognition, speech recognition,
and text classification. The idea of supervised learning is very simple: given a set of training data
and the corresponding labels, the objective is for the system to generalize and predict the label for
data that’s not present in the training dataset. These training labels are typically provided by some
supervisors (e.g., humans), hence the name supervised learning.
Another branch of machine learning is unsupervised learning. In unsupervised learning, the
objective is to discover the hidden structures or features of the training data without being provided
with any labels. This can be useful in domains such as image clustering, where we want the system
to group similar images together without knowing ahead of time which images belong to which
group. Another application of unsupervised learning is in dimensionality reduction, where we want to
represent high-dimensional data in a lower-dimensional space while preserving as much information
as possible.
Reinforcement learning is a type of machine learning in which an agent learns to take actions in an
environment in order to maximize a reward signal. It’s particularly useful in domains where there is
no clear notion of “correct” output, such as in robotics or game playing. Reinforcement learning has
potential applications in areas like robotics, healthcare, and finance.
Supervised learning has already been widely used in computer vision and natural language
processing. For example, the ImageNet classification challenge is an annual computer vision
competition where deep convolutional neural networks (CNNs) dominate. The challenge provides
a training dataset with labels for 1.2 million images across 1000 categories, and the goal is to predict
the labels for a separate evaluation dataset of about 100,000 images. In 2012, Krizhevsky et al. [8]
developed AlexNet, the first deep CNN system used in this challenge. AlexNet achieved an 18%
improvement in accuracy compared to previous state-of-the-art methods, which marked a major
breakthrough in computer vision.
Since the advent of AlexNet, almost all leading solutions to the ImageNet challenge have been
based on deep CNNs. Another breakthrough came in 2015 when researchers He et al. [9] from
Microsoft developed ResNet, a new architecture designed to improve the training of very deep CNNs
with hundreds of layers. Training deep CNNs is challenging due to vanishing gradients, which makes
it difficult to propagate the gradients backward through the network during backpropagation. ResNet
addressed this challenge by introducing skip connections, which allowed the network to bypass one or
more layers during forward propagation, thereby reducing the depth of the network that the gradients
have to propagate through.
While supervised learning is capable of discovering hidden patterns and features from data, it
is limited in that it merely mimics what it is told to do during training and cannot interact with
the world and learn from its own experience. One limitation of supervised learning is the need to
label every possible stage of the process. For example, if we want to use supervised learning to train
an agent to play Go, then we would need to collect the labels for every possible board position,
which is impossible due to the enormous number of possible combinations. Similarly, in Atari video
games, a single pixel change would require relabeling, making supervised learning inapplicable in
these cases. However, supervised learning has been successful in many other applications, such as
language translation and image classification.
Unsupervised learning tries to discover hidden patterns or features without labels, but its objective
is completely different from that of RL, which is to maximize accumulated reward signals. Humans
and animals learn by interacting with their environment, and this is where reinforcement learning (RL)
comes in. In RL, the agent is not told which action is good or bad, but rather it must discover that for
itself through trial and error. This trial-and-error search process is unique to RL. However, there are
other challenges that are unique to RL, such as dealing with delayed consequences and balancing
exploration and exploitation.
While RL is a distinct branch of machine learning, it shares some commonalities with other
branches, such as supervised and unsupervised learning. For example, improvements in supervised
learning and deep convolutional neural networks (CNNs) have been adapted to DeepMind’s DQN,
AlphaGo, and other RL agents. Similarly, unsupervised learning can be used to pretrain the weights
of RL agents to improve their performance. Furthermore, many of the mathematical concepts used
in RL, such as optimization and how to train a neural network, are shared with other branches of
machine learning. Therefore, while RL has unique challenges and applications, it also benefits from
and contributes to the development of other branches of machine learning.

1.7 The Challenges in Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning in which an agent learns to interact with an
environment to maximize some notion of cumulative reward. While RL has shown great promise in a
variety of applications, it also comes with several common challenges, as discussed in the following
sections:

Exploration vs. Exploitation Dilemma

The exploration-exploitation dilemma refers to the fundamental challenge in reinforcement learning of balancing the need to explore the environment to learn more about it with the need to exploit previous
knowledge to maximize cumulative reward. The agent must continually search for new actions that
may yield greater rewards while also taking advantage of actions that have already proven successful.
In the initial exploration phase, the agent is uncertain about the environment and must try out a
variety of actions to gather information about how the environment responds. This is similar to how
humans might learn to play a new video game by trying different button combinations to see what
happens. However, as the agent learns more about the environment, it becomes increasingly important
to focus on exploiting the knowledge it has gained in order to maximize cumulative reward. This is
similar to how a human might learn to play a game more effectively by focusing on the actions that
have already produced high scores.
There are many strategies for addressing the exploration-exploitation trade-off, ranging from
simple heuristics to more sophisticated algorithms. One common approach is to use an ε-greedy policy, in which the agent selects the action with the highest estimated value with probability 1 − ε and selects a random action with probability ε in order to encourage further exploration. Another approach is to use a Thompson sampling algorithm, which balances exploration and exploitation by selecting actions based on a probabilistic estimate of their expected value.
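As a small sketch of the ε-greedy rule described above (the action names and value estimates are hypothetical placeholders for whatever estimates the agent actually maintains):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore.

    q_values: dict mapping each action to its current estimated value.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: pick a random action
    return max(q_values, key=q_values.get)    # exploit: pick the best estimate

# Hypothetical value estimates for three actions in some state.
q = {"left": 0.2, "right": 1.5, "noop": -0.3}
print(epsilon_greedy(q, epsilon=0.1))  # usually "right", occasionally random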
It is important to note that the exploration-exploitation trade-off is not a simple problem to solve,
and the optimal balance between exploration and exploitation will depend on many factors, including
the complexity of the environment and the agent’s prior knowledge. As a result, there is ongoing
research in the field of reinforcement learning aimed at developing more effective strategies for
addressing this challenge.

Credit Assignment Problem

In reinforcement learning (RL), the credit assignment problem refers to the challenge of determining
which actions an agent took that led to a particular reward. This is a fundamental problem in RL
because the agent must learn from its own experiences in order to improve its performance.
To illustrate this challenge, let’s consider the game of Tic-Tac-Toe, where two players take turns
placing Xs and Os on a 3×3 grid until one player gets three in a row. Suppose the agent is trying to learn to play Tic-Tac-Toe using RL, and the reward is +1 for a win, −1 for a loss, and 0 for a draw.
The agent’s goal is to learn a policy that maximizes its cumulative reward.

Now, suppose the agent wins a game of Tic-Tac-Toe. How can the agent assign credit to the actions
that led to the win? This can be a difficult problem to solve, especially if the agent is playing against
another RL agent that is also learning and adapting its strategies.
To tackle the credit assignment problem in RL, there are various techniques that can be used, such
as Monte Carlo methods or temporal difference learning. These methods use statistical analysis to
estimate the value of each action taken by the agent, based on the rewards received and the states
visited. By using these methods, the agent can gradually learn to assign credit to the actions that
contribute to its success and adjust its policy accordingly.
In summary, credit assignment is a key challenge in reinforcement learning, and it is essential to
develop effective techniques for solving this problem in order to achieve optimal performance.

Reward Engineering Problem

The reward engineering problem refers to the process of designing a good reward function that
encourages the desired behavior in a reinforcement learning (RL) agent. The reward function
determines what the agent is trying to optimize, so it is crucial to make sure it reflects the desired goal we want the agent to achieve.
An example of good reward engineering is in the game of Atari Breakout, where the goal of the
agent is to clear all the bricks at the top of the screen by bouncing a ball off a paddle. One way to
design a reward function for this game is to give the agent a positive reward for each brick it clears
and a negative reward for each time the ball passes the paddle and goes out of bounds. However, this
reward function alone may not lead to optimal behavior, as the agent may learn to exploit a loophole
by simply bouncing the ball back and forth on the same side of the screen without actually clearing
any bricks.
To address this challenge, the reward function can be designed to encourage more desirable
behavior. For example, the reward function can be modified to give the agent a larger positive reward
for clearing multiple bricks in a row or for clearing the bricks on the edges of the screen first. This can
encourage the agent to take more strategic shots and aim for areas of the screen that will clear more
bricks at once.
An example of bad reward engineering comes from CoastRunners, a very simple boat racing video game. The goal of the game is to finish the boat race as quickly as possible. But there's one small issue with the game: the player can earn higher scores by hitting targets laid out along the route. There is a video showing a reinforcement learning agent playing the game by repeatedly hitting the targets instead of finishing the race.2 This example should not be viewed as a failure of the reinforcement learning agent, but rather as a failure by humans to design and use the correct reward function.
Overall, reward engineering is a crucial part of designing an effective RL agent. A well-designed
reward function can encourage the desired behavior and lead to optimal performance, while a
poorly designed reward function can lead to suboptimal behavior and may even encourage undesired
behavior.

2 Reinforcement learning agent playing the CoastRunners game: www.youtube.com/watch?v=tlOIHko8ySg.



Generalization Problem

In reinforcement learning (RL), the generalization problem refers to the ability of an agent to apply
what it has learned to new and previously unseen situations. To understand this concept, consider the
example of a self-driving car. Suppose the agent is trying to learn to navigate a particular intersection,
with a traffic light and crosswalk. The agent receives rewards for reaching its destination quickly and
safely, but it must also follow traffic laws and avoid collisions with other vehicles and pedestrians.
During training, the agent is exposed to a variety of situations at the intersection, such as different
traffic patterns and weather conditions. It learns to associate certain actions with higher rewards, such
as slowing down at the yellow light and stopping at the red light. Over time, the agent becomes more
adept at navigating the intersection and earns higher cumulative rewards.
However, when the agent is faced with a new intersection, with different traffic patterns and weather
conditions, it may struggle to apply what it has learned. This is where generalization comes in. If the
agent has successfully generalized its knowledge, it will be able to navigate the new intersection based
on its past experiences, even though it has not seen this exact intersection before. For example, it may
slow down at a yellow light, even if the timing is slightly different than what it has seen before, or it
may recognize a pedestrian crossing and come to a stop, even if the appearance of the crosswalk is
slightly different.
If the agent has not successfully generalized its knowledge, it may struggle to navigate the new
intersection and may make mistakes that lead to lower cumulative rewards. For example, it may miss
a red light or fail to recognize a pedestrian crossing, because it has only learned to recognize these
situations in a particular context.
Therefore, generalization is a crucial aspect of RL, as it allows the agent to apply its past
experiences to new and previously unseen situations, which can improve its overall performance and
make it more robust to changes in the environment.

Sample Efficiency Problem

The sample efficiency problem in reinforcement learning refers to the ability of an RL agent to learn
an optimal policy with a limited number of interactions with the environment. This can be challenging,
especially in complex environments where the agent may need to explore a large state space or take a
large number of actions to learn the optimal policy.
To better understand sample efficiency, let’s consider an example of an RL agent playing a game
of Super Mario Bros. In this game, the agent must navigate Mario through a series of levels while
avoiding enemies and obstacles, collecting coins, and reaching the flag at the end of each level.
To learn how to play Super Mario Bros., the agent must interact with the environment, taking
actions such as moving left or right, jumping, and shooting fireballs. Each action leads to a new state
of the environment, and the agent receives a reward based on its actions and the resulting state.
For example, the agent may receive a reward for collecting a coin or reaching the flag and a penalty
for colliding with an enemy or falling into a pit. By learning from these rewards, the agent can update
its policy to choose actions that lead to higher cumulative rewards over time.
However, learning the optimal policy in Super Mario Bros. can be challenging due to the large
state space and the high dimensionality of the input data, which includes the position of Mario, the
enemies, and the obstacles on the screen.
To address the challenge of sample efficiency, the agent may use a variety of techniques to learn
from a limited number of interactions with the environment. For example, the agent may use function
approximation to estimate the value or policy function based on a small set of training examples. The
agent may also use off-policy learning, which involves learning from data collected by a different
policy than the one being optimized.
Overall, sample efficiency is an important challenge in reinforcement learning, especially in
complex environments. Techniques such as function approximation and off-policy learning can help
address this challenge and enable RL agents to learn optimal policies with a limited number of
interactions with the environment.

1.8 Summary

In the first chapter of the book, readers were introduced to the concept of reinforcement learning (RL)
and its applications. The chapter began by discussing the breakthroughs in AI in games, showcasing
the success of RL in complex games such as Go. The chapter then provided an overview of the agent-
environment interaction that forms the basis of RL, including key concepts such as environment,
agent, reward, state, action, and policy. Several examples of RL were presented, including Atari video
game playing, board game Go, and robot control tasks.
Additionally, the chapter introduced common terms used in RL, including episodic vs. continuing
tasks, deterministic vs. stochastic tasks, and model-free vs. model-based reinforcement learning. The
importance of studying RL was then discussed, including its potential to solve complex problems
and its relevance to real-world applications. The challenges faced in RL, such as the exploration-
exploitation dilemma, the credit assignment problem, and the generalization problem, were also
explored.
The next chapter of the book will focus on Markov decision processes (MDPs), which is a formal
framework used to model RL problems.

References
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex
Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik,
Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-
level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb 2015.
[2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform
for general agents. Journal of Artificial Intelligence Research, 47:253–279, Jun 2013.
[3] John Tromp and Gunnar Farnebäck. Combinatorics of go. In H. Jaap van den Herik, Paolo Ciancarini, and H. H.
L. M. (Jeroen) Donkers, editors, Computers and Games, pages 84–99, Berlin, Heidelberg, 2007. Springer Berlin
Heidelberg.
[4] CWI. 66th NHK Cup. https://homepages.cwi.nl/~aeb/go/games/games/NHK/66/index.html, 2018.
[5] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John
Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–
489, Jan 2016.
[6] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas
Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George
van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge.
Nature, 550(7676):354–359, Oct 2017.
[7] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot,
Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis.
Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural
networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information
Processing Systems, volume 25. Curran Associates, Inc., 2012.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
2 Markov Decision Processes

Markov decision processes (MDPs) offer a powerful framework for tackling sequential decision-
making problems in reinforcement learning. Their applications span various domains, including
robotics, finance, and optimal control.
In this chapter, we provide an overview of the key components of Markov decision processes
(MDPs) and demonstrate the formulation of a basic reinforcement learning problem using the MDP
framework. We delve into the concepts of policy and value functions, examine the Bellman equations,
and illustrate their utilization in updating values for states or state-action pairs. Our focus lies
exclusively on finite MDPs, wherein the state and action spaces are finite. While we primarily assume
a deterministic problem setting, we also address the mathematical aspects applicable to stochastic
problems.
If we were to choose the single most significant chapter in the entire book, this particular chapter
would undeniably be at the forefront of our list. Its concepts and mathematical equations hold such
importance that they are consistently referenced and utilized throughout the book.

2.1 Overview of MDP

At a high level, a Markov decision process (MDP) is a mathematical framework for modeling
sequential decision-making problems under uncertainty. The main idea is to represent the problem
in terms of states, actions, a transition model, and a reward function and then use this representation
to find an optimal policy that maximizes the expected sum of rewards over time.
To be more specific, an MDP consists of the following components:

• States (S): The set of all possible configurations or observations of the environment that the agent
can be in. For example, in a game of chess, the state might be the current board configuration, while
in a financial portfolio management problem, the state might be the current prices of various stocks.
Other examples of states include the position and velocity of a robot, the location and orientation
of a vehicle, or the amount of inventory in a supply chain.
• Actions (A): The set of all possible actions that the agent can take. In a game of chess, this might
include moving a piece, while in a financial portfolio management problem, this might include
buying or selling a particular stock. Other examples of actions include accelerating or decelerating
a robot, turning a vehicle, or restocking inventory in a supply chain.

• Transition model or dynamics function (P): A function that defines the probability of transitioning to a new state s' given the current state s and the action a taken. In other words, it models how
the environment responds to the agent’s actions. For example, in a game of chess, the transition
model might be determined by the rules of the game and the player’s move. In a finance problem,
the transition model could be the result of a stock price fluctuation. Other examples of transition
models include the physics of a robot’s motion, the dynamics of a vehicle’s movement, or the
demand and supply dynamics of a supply chain.
• Reward function (R): A function that specifies the reward or cost associated with taking an action
in a given state. In other words, it models the goal of the agent’s task. For example, in a game of
chess, the reward function might assign a positive value to winning the game and a negative value
to losing, while in a finance problem, the reward function might be based on maximizing profit
or minimizing risk. Other examples of reward functions include the energy efficiency of a robot’s
motion, the fuel efficiency of a vehicle’s movement, or the profit margins of a supply chain.
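To make these components concrete, the following is a minimal sketch of how a finite MDP could be collected into a single Python structure; the field names and the tiny two-state example are our own illustrative assumptions, not code from this book.

from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class FiniteMDP:
    """A finite MDP: states S, actions A, transition model P, and reward function R."""
    states: List[State]
    actions: List[Action]
    # transitions[(s, a)] maps each successor state s' to the probability P(s' | s, a).
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    # rewards[(s, a)] is the reward R(s, a) for taking action a in state s.
    rewards: Dict[Tuple[State, Action], float]

# A tiny, made-up two-state example.
toy_mdp = FiniteMDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s1": 1.0},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 0.5, "s1": 0.5},
    },
    rewards={
        ("s0", "stay"): 0.0,
        ("s0", "go"): 1.0,
        ("s1", "stay"): 0.0,
        ("s1", "go"): 0.0,
    },
)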

Why Are MDPs Useful?

MDPs provide a powerful framework for modeling decision-making problems because they allow us
to use mathematical concepts to model real-world problems. For example, MDPs can be used to

• Model robot navigation problems, where the robot must decide which actions to take in order to
reach a particular goal while avoiding obstacles. For example, the state might include its current
position and the obstacles in its environment, and the actions might include moving in different
directions. The transition model could be determined by the physics of the robot’s motion, and
the reward function could be based on reaching the goal as quickly as possible while avoiding
collisions with obstacles.
• Optimize portfolio management strategies, where the agent must decide which stocks to buy or
sell in order to maximize profits while minimizing risk. For example, the state might include the
current prices of different stocks and the agent’s portfolio holdings, and the actions might include
buying or selling stocks. The transition model could be the result of stock price fluctuations, and
the reward function could be based on the agent’s profits or risk-adjusted returns.
• Design personalized recommendation systems, where the agent must decide which items to
recommend to a particular user based on their past behavior. For example, the state might
include the user’s past purchases and the agent’s current recommendations, and the actions might
include recommending different items. The transition model could be the user’s response to
the recommendations, and the reward function could be based on how much the user likes the
recommended items or makes a purchase.
• Solve many other decision-making problems in various domains, such as traffic control, resource
allocation, and game playing. In each case, the MDP framework provides a way to model the
problem in terms of states, actions, transition probabilities, and rewards and then solve it by finding
a policy that maximizes the expected sum of rewards over time.

Our goal of modeling a problem using MDP is to eventually solve the MDP problem. To solve
an MDP, we must find a policy π that maps states to actions in a way that maximizes the expected
sum of rewards over time. In other words, the policy tells the agent what action to take in each state to
achieve its goal. One way to find the optimal policy is to use the value iteration algorithm or the policy
iteration algorithm, which iteratively updates the values of the state (or state-action pair) based on the
Bellman equations, until the optimal policy is found. These fundamental concepts will be explored in
this chapter and the subsequent chapters.

In summary, MDPs provide a flexible and powerful way to model decision-making problems in
various domains. By formulating a problem as an MDP, we can use mathematical concepts to analyze
the problem and find an optimal policy that maximizes the expected sum of rewards over time. The
key components of an MDP are the states, actions, dynamics function, and reward function, which
can be customized to fit the specific problem at hand.

2.2 Model Reinforcement Learning Problem Using MDP

A Markov decision process (MDP) models a sequence of states, actions, and rewards in a way that is
useful for studying reinforcement learning. For example, a robot navigating a maze can be modeled as
an MDP, where the states represent the robot’s location in the maze, the actions represent the robot’s
movement choices, and the rewards represent the robot’s progress toward the goal.
In this context, we use a subscript t to index the different stages of the process (or so-called time
step of the sequence), where t could be any discrete value such as t = 0, 1, 2, .... Note that the time step is not a regular time interval like seconds or minutes but refers to the different stages of the process. For example, in this book S_t, A_t, and R_t often mean the state, action, and reward at the current time step t (or current stage of the process).
It’s worth noting that the agent may take a considerable amount of time before deciding to take
action A_t when it observes the environment state S_t. As long as the agent does not violate the rules of
the environment, it has the flexibility to take action at its own pace. Therefore, a regular time interval
is not applicable in this case.
In this book, we adapt the mathematical notation employed by Professor Emma Brunskill in her
remarkable course on reinforcement learning [1]. We generally assume that the reward only depends on the state S_t and the action A_t taken by the agent. To keep things simple, we use the same index t for the reward, which is expressed as R_t = R(S_t, A_t), while some may prefer to use R_{t+1} = R(S_t, A_t, S_{t+1}) instead of R_t to emphasize that the reward also depends on the successor state S_{t+1}, as
Sutton and Barto discussed in their book.1 However, this alternative expression can sometimes lead
to confusion, particularly in simple cases like the ones we present in this book, where the reward does
not depend on the successor state.
It is important to remember that in practical implementation, the reward is typically received one
time step later, along with the successor state, as illustrated in Fig. 2.1.
We use upper case in S_t, A_t, R_t because these are random variables, and the actual outcome of these random variables could vary. When we are talking about the specific outcome of these random variables, we often use the lower case s, a, r.
Taken together, the state space, action space, transition model, and reward function provide a
complete description of the environment and the agent’s interactions with it. In the following sections,
we’ll explore how these elements interact to form the foundation for solving reinforcement learning
problems.
To better understand the interactions between the agent and the environment, we can unroll the
interaction loop as follows, as shown in Fig. 2.2:

• The agent observes the state S_0 from the environment.
• The agent takes action A_0 in the environment.
• The environment transitions into a new state S_1 and also generates a reward signal R_0, where R_0 is conditioned on S_0 and A_0.
• The agent receives the reward R_0 along with the successor state S_1 from the environment.
• The agent decides to take action A_1.
• The interaction continues to the next stage until the process reaches the terminal state S_T. Once it reaches the terminal state, no further action is taken and a new episode can be started from the initial state S_0.

Fig. 2.1 Agent-environment interaction

Fig. 2.2 Example of the agent-environment interaction loop unrolled by time for an episodic problem

1 In their book, Sutton and Barto briefly discussed why they chose to use R_{t+1} instead of R_t as the immediate reward (on page 48). However, they also emphasized that both conventions are widely used in the field.

The reason why there is no reward when the agent observes the first environment state S_0 is that the agent has not yet interacted with the environment. As mentioned earlier, in this book we assume that the reward function R_t = R(S_t, A_t) is conditioned on the current state of the environment S_t and the action A_t taken by the agent. Since no action is taken in state S_0, there is no reward signal associated with the initial state. However, in practice, it can sometimes be convenient to include a “fake” initial reward, such as 0, to simplify the code.
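The unrolled interaction above maps directly onto the familiar environment-loop pattern. The sketch below assumes a hypothetical env object with reset() and step() methods (similar in spirit to Gym-style environments) and a hypothetical agent with a select_action() method; it is an illustration of the loop, not code from this book.

def run_episode(env, agent):
    """Run one episode of agent-environment interaction and return the total reward."""
    state = env.reset()      # observe S_0; no reward is associated with the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.select_action(state)           # choose A_t based on S_t
        next_state, reward, done = env.step(action)   # receive R_t and successor S_{t+1}
        total_reward += reward
        state = next_state                            # the interaction moves to the next stage
    return total_reward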
In summary, the MDP provides a framework for modeling reinforcement learning problems, where
the agent interacts with the environment by taking actions based on the current state and receives
rewards that are conditioned on the current state and action. By understanding the interactions
between the agent and environment, we can develop algorithms that learn to make good decisions
and maximize the cumulative reward over time.

Markov Property

Not every problem can be modeled using the MDP framework. The Markov property is a crucial
assumption that must hold for the MDP framework to be applicable. This property states that the
future state of the system is independent of the past given the present. More specifically, the successor state S_{t+1} depends only on the current state S_t and action A_t, and not on any previous states or actions.
In other words, the Markov property is a restriction on the environment state, which must contain
sufficient information about the history to predict the future state. Specifically, the current state must
contain enough information to make predictions about the next state, without needing to know the
complete history of past states and actions.

P(S_{t+1} | S_t, A_t) = P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, . . . , S_1, A_1, S_0, A_0)    (2.1)

For example, a robot trying to navigate a room can be modeled using the MDP framework only if it
satisfies the Markov property. If the robot’s movement depends on its entire history, including its past
positions and actions, the Markov property will be violated, and the MDP framework will no longer
be applicable. This is because the robot’s current position would not contain sufficient information to
predict its next position, making it difficult to model the environment and make decisions based on
that model.

Service Dog Example

In this example, we imagine training a service dog to retrieve an object for its owner. The training is
conducted in a house with three rooms, one of which contains a personal object that the dog must
retrieve and bring to its owner or trainer. The task falls into the episodic reinforcement learning
problem category, as the task is considered finished once the dog retrieves the object. To simplify the
task, we keep placing the object in the same (or almost the same) location and initialize the starting
state randomly. Additionally, one of the rooms leads to the front yard, where the dog can play freely.
This scenario is illustrated in Fig. 2.3.
We will use this service dog example in this book to demonstrate how to model a reinforcement
learning problem as an MDP (Markov decision process), explain the dynamics function of the
environment, and construct a policy. We will then introduce specific algorithms, such as dynamic
programming, Monte Carlo methods, and temporal difference methods, to solve the service dog
reinforcement learning problem.

2.3 Markov Process or Markov Chain

To start our discussion on Markov decision processes (MDPs), let’s first define what a Markov process
(Markov chain) is. A Markov process is a memoryless random process where the probability of
transitioning to a new state only depends on the current state and not on any past states or actions.
It is the simplest case to study in MDPs, as it involves only a sequence of states without any
rewards or actions. Although it may seem basic, studying Markov chains is important as they provide
fundamental insights into how states in a sequence can influence one another. This understanding can
be beneficial when dealing with more complex MDPs.

Fig. 2.3 Simple drawing to illustrate the service dog example

A Markov chain can be defined as a tuple (S, P), where S is a finite set of states called the state space, and P is the dynamics function (or transition model) of the environment, which specifies the probability of transitioning from a current state s to a successor state s'. Since there are no actions in a Markov chain, we omit actions in the dynamics function P. The probability of transitioning from state s to state s' is denoted by P(s'|s).
For example, we could model a Markov chain with a graph, where each node represents a state,
and each edge represents a possible transition from one state to another, with a transition probability
associated with each edge. This can help us understand how states in a sequence can affect one another.
We’ve modeled our service dog example as a Markov chain, as shown in Fig. 2.4. The open circle
represents a non-terminal state, while the square box represents the terminal state. The straight and
curved lines represent the transitions from the current state s to its successor state s', with a transition probability P(s'|s) associated with each possible transition. For example, if the agent is currently in state Room 1, there is a 0.8 probability that the environment will transition to its successor state Room 2 and a 0.2 probability of staying in the same state Room 1. Note that these probabilities are chosen arbitrarily to illustrate the idea of the dynamics function of the environment.

Fig. 2.4 Service dog Markov chain

Transition Matrix for Markov Chain

The transition matrix P is a convenient way to represent the dynamics function (or transition model) of a Markov chain. It lists all the possible state transitions in a single matrix, where each row represents a current state s, and each column represents a successor state s'. The transition probability for transitioning from state s to state s' is denoted by P(s'|s). Since we are talking about probability, the sum of each row is always equal to 1.0. Here, we list the transition matrix for our service dog Markov chain:

P =
              Room 1  Room 2  Room 3  Outside  Found item  End
  Room 1        0.2     0.8     0       0        0          0
  Room 2        0.2     0       0.4     0.4      0          0
  Room 3        0       0.2     0       0        0.8        0
  Outside       0       0.2     0       0.8      0          0
  Found item    0       0       0       0        0          1.0
  End           0       0       0       0        0          1.0

With access to the dynamics function of the environment, we can sample some state transition sequences S_0, S_1, S_2, . . . from the environment. For example:

• Episode 1: (Room 1, Room 2, Room 3, Found item, End)
• Episode 2: (Room 3, Found item, End)
• Episode 3: (Room 2, Outside, Room 2, Room 3, Found item, End)
• Episode 4: (Outside, Outside, Outside, . . . )
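As a rough sketch, the transition matrix above can be stored as a NumPy array and used to sample such state sequences. The snippet below mirrors the service dog Markov chain (treating End as an absorbing state); it is our own illustrative code, not the book's implementation.

import numpy as np

states = ["Room 1", "Room 2", "Room 3", "Outside", "Found item", "End"]
# Rows are current states s, columns are successor states s'; each row sums to 1.0.
P = np.array([
    [0.2, 0.8, 0.0, 0.0, 0.0, 0.0],   # Room 1
    [0.2, 0.0, 0.4, 0.4, 0.0, 0.0],   # Room 2
    [0.0, 0.2, 0.0, 0.0, 0.8, 0.0],   # Room 3
    [0.0, 0.2, 0.0, 0.8, 0.0, 0.0],   # Outside
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Found item
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # End
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a valid probability distribution

def sample_chain(start, max_steps=20, seed=None):
    """Sample a state sequence from the Markov chain, stopping at the End state."""
    rng = np.random.default_rng(seed)
    episode = [start]
    s = states.index(start)
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
        if states[s] == "End":
            break
    return episode

print(sample_chain("Room 1"))  # e.g., ['Room 1', 'Room 2', 'Outside', ..., 'End']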

We now have a basic understanding of state transitions in the environment; let's move on to adding rewards to the process.

2.4 Markov Reward Process

As we’ve said before, the goal of a reinforcement learning agent is to maximize rewards, so the next
natural step is to add rewards to the Markov chain process. The Markov reward process (MRP) is an
extension of the Markov chain, where rewards are added to the process. In a Markov reward process,
the agent not only observes state transitions but also receives a reward signal along the way. Note that there are still no actions involved in MRPs. We can define a Markov reward process as a tuple (S, P, R), where

• S is a finite set of states called the state space.
• P is the dynamics function (or transition model) of the environment, where P(s'|s) = P(S_{t+1} = s' | S_t = s) specifies the probability that the environment transitions into successor state s' when in current state s.
• R is the reward function of the environment, where R(s) = E[R_t | S_t = s] is the reward signal provided by the environment when the agent is in state s.

As shown in Fig. 2.5, we added a reward signal to each state in our service dog example. As we briefly discussed in Chap. 1, we want the reward signals to align with our desired goal, which is to find
the object, so we decided to use the highest reward signal for the state Found item. For the state Outside, the reward signal is +1, because being outside playing might be more enjoyable for the agent compared to wandering between different rooms.

Fig. 2.5 Service dog MRP
With reward signals, we can compute the total rewards the agent could get for these different
sample sequences, where the total rewards are calculated for the entire sequence:

• Episode 1: (Room 1, Room 2, Room 3, Found item, End)
  Total rewards = −1 − 1 − 1 + 10 + 0 = 7.0
• Episode 2: (Room 3, Found item, End)
  Total rewards = −1 + 10 = 9.0
• Episode 3: (Room 2, Outside, Room 2, Room 3, Found item, End)
  Total rewards = −1 + 1 − 1 − 1 + 10 + 0 = 8.0
• Episode 4: (Outside, Outside, Outside, . . . )
  Total rewards = 1 + 1 + · · · = ∞

Return

To quantify the total rewards the agent can get in a sequence of states, we use a different term called
return. The return G_t is simply the sum of rewards from time step t to the end of the sequence and
is defined by the mathematical equation in Eq. (2.2), where T is the terminal time step for episodic
reinforcement learning problems.2

G_t = R_t + R_{t+1} + R_{t+2} + · · · + R_{T−1}    (2.2)

One issue with Eq. (2.2) is that the return G_t could become infinite in cases where there are loops in the process, such as in our sample sequence for episode 4. For continuing reinforcement learning problems, where there is no natural end to the task, the return G_t could also easily become infinite. To resolve this issue, we introduce the discount factor.
The discount factor γ is a parameter, where 0 ≤ γ ≤ 1, that helps us to solve the infinite return problem. By discounting rewards received at future time steps, we can avoid the return becoming infinite. When we add the discount factor to the regular return in Eq. (2.2), the return becomes the discounted sum of rewards from time step t to a horizon H, which can be the length of the episode or even infinity for continuing reinforcement learning problems. We'll be using Eq. (2.3) as the definition of return for the rest of the book. Notice that we omit the discount γ for the immediate reward R_t, since γ^0 = 1.

G_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + · · · + γ^{H−1−t} R_{H−1}    (2.3)

We want to emphasize that the discount factor γ not only helps us to solve the infinite return problem, but it can also influence the behavior of the agent (which will make more sense when we later talk about value functions and policies). For example, when γ = 0, the agent only cares about the immediate reward, and as γ gets closer to 1, future rewards become as important as immediate rewards. Although there are methods that do not use discounting, the mathematical complexity of such methods is beyond the scope of this book.

2 The notation used in Eq. (2.2), as well as Eqs. (2.3) and (2.4), may seem unfamiliar to readers familiar with the work of Sutton and Barto. In their book, they utilize a different expression, G_t = R_{t+1} + R_{t+2} + R_{t+3} + · · · + R_T, for the nondiscounted case. In their formulation, the immediate reward is denoted as R_{t+1}. However, in our book, we adopt a simpler reward function and notation, as explained earlier. We represent the immediate reward as R_t, assuming it solely depends on the current state S_t and the action A_t taken in that state. Therefore, in Eq. (2.2), as well as Eqs. (2.3) and (2.4), we start with R_t instead of R_{t+1}, and we use R_{T−1} instead of R_T for the final reward. It is important to note that despite this slight time step shift, these equations essentially compute the same result: the sum of (or discounted) rewards over an episode.
There is a useful property of the return Gt in reinforcement learning: the return Gt is the
sum of the immediate reward Rt and the discounted return of the next time step, γGt+1. We can
rewrite it recursively as shown in Eq. (2.4). This recursive property is important in MRPs, MDPs, and
reinforcement learning because it forms the foundation for a series of essential mathematical equations
and algorithms, which we will introduce later in this chapter and in the next few chapters.

$$
\begin{aligned}
G_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots \\
    &= R_t + \gamma \left( R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \right) \\
    &= R_t + \gamma G_{t+1} \quad (2.4)
\end{aligned}
$$

We can compute the return Gt for the sample sequences of a particular MRP or even MDP. The
following shows the returns for some sample episodes in our service dog example, where we use
discount factor γ = 0.9:

• Episode 1: (Room 1, Room 2, Room 3, Found item, End)
  $G_0 = -1 - 1 \times 0.9 - 1 \times 0.9^2 + 10 \times 0.9^3 \approx 4.6$
• Episode 2: (Room 3, Found item, End)
  $G_0 = -1 + 10 \times 0.9 = 8.0$
• Episode 3: (Room 2, Outside, Room 2, Room 3, Found item, End)
  $G_0 = -1 + 1 \times 0.9 - 1 \times 0.9^2 - 1 \times 0.9^3 + 10 \times 0.9^4 \approx 4.9$
• Episode 4: (Outside, Outside, Outside, …)
  $G_0 = 1 + 1 \times 0.9 + 1 \times 0.9^2 + \cdots = 10.0$
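
The same returns can be reproduced programmatically with the recursive form of Eq. (2.4), working backward from the end of each episode. This is an illustrative sketch, not the book’s code; the reward lists simply mirror the episodes above, and episode 4 is truncated to 100 steps, so its return only approaches the limiting value of 10.0.

```python
def episode_return(rewards, gamma=0.9):
    """Backward pass using the recursion G_t = R_t + gamma * G_{t+1} (Eq. (2.4))."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episodes = {
    "Episode 1": [-1, -1, -1, 10, 0],
    "Episode 2": [-1, 10],
    "Episode 3": [-1, 1, -1, -1, 10, 0],
    "Episode 4": [1] * 100,  # truncated stand-in for the endless Outside loop
}

for name, rewards in episodes.items():
    print(f"{name}: G_0 = {episode_return(rewards):.2f}")
# Prints roughly 4.58, 8.00, 4.92, and 10.00, matching the values above.
```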

As long as the discount factor is less than 1, we won’t have the infinite return problem even if there’s a
loop in the MRP. Comparing the returns for these different sample sequences, we can see that
episode 4 has the highest return value. However, if the agent gets stuck in a loop, staying in the same
state Outside, it has no way to achieve the goal, which is to find the object in Room 3. This does not
necessarily mean that the reward function is flawed. To show this, we need to use the value function,
which we will explain in detail in the upcoming sections.

Value Function for MRPs

In Markov reward processes (MRPs), the return Gt measures the total future reward from time step
t to the end of the episode. However, comparing returns for different sample sequences alone has its
limits, as it only measures returns starting from a particular time step. This is not very helpful for an
agent in a specific environment state who needs to make a decision. The value function can help us
overcome this problem.
Formally, the state value function V(s) for an MRP measures the expected return starting from state
s up to a horizon H. We call it the expected return because the trajectory starting from state s to
a horizon H is often random. In simple words, V(s) measures the average return starting
from state s and up to a horizon H. For episodic problems, the horizon is just the terminal time step,
that is, H = T.

$V(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]$   (2.5)

The state value function V(s) for MRPs also shares this recursive property, as shown in Eq. (2.6).
Equation (2.6) is also called the Bellman expectation equation for V(s) (for MRPs). We call it the
Bellman expectation equation because it’s written in a recursive manner but still has the expectation
operator E attached to it.

$$
\begin{aligned}
V(s) &= \mathbb{E}\left[ G_t \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_t + \gamma \left( R_{t+1} + \gamma R_{t+2} + \cdots \right) \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_t + \gamma G_{t+1} \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_t + \gamma V(S_{t+1}) \mid S_t = s \right] \quad (2.6) \\
     &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V(s'), \quad \text{for all } s \in S \quad (2.7)
\end{aligned}
$$

Equation (2.7) is called the Bellman equation for V(s) (for MRPs). We can see how the step from
Eq. (2.6) to Eq. (2.7) removed the expectation operator E from the equation, by considering the values
of all the possible successor states s′ and weighting each by the state transition probability P(s′|s)
from the environment for the MRP.
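
To illustrate Eq. (2.7), the sketch below performs a single Bellman backup for one state, with the reward function, the transition probabilities, and the current value estimates represented as plain Python dictionaries. This is an assumed representation chosen for illustration, not the book’s reference implementation.

```python
def bellman_backup(state, R, P, V, gamma):
    """One application of Eq. (2.7): V(s) = R(s) + gamma * sum over s' of P(s'|s) * V(s').

    R: dict mapping state -> reward R(s)
    P: dict mapping state -> {successor state: probability P(s'|s)}
    V: dict mapping state -> current value estimate
    """
    return R[state] + gamma * sum(
        prob * V[s_next] for s_next, prob in P[state].items()
    )
```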

Worked Example

As an example, we can use Eq. (2.7) to compute the expected return for a particular state. Let’s say
our initial values for state Room 2 are 5.0 and for state Found item are 10.0, as shown in Fig. 2.6
(these initial values were chosen randomly), and we use no discount, γ = 1.0. Now let’s compute the
expected return for state Room 3. Since we already know the model (dynamics function and reward
function) of the Markov reward process (MRP), we know the immediate reward is −1 no matter what
the successor state s′ will be. There’s a 0.8 probability the environment will transition to successor state
Found item, and the value for this successor state is V(Found item) = 10.0. There’s also a 0.2
probability the environment will transition to successor state Room 2, and the value for this successor
state is V(Room 2) = 5.0. So we can use the Bellman equation to compute our estimated value for
state Room 3 as follows:

$V(\text{Room 3}) = -1 + 0.8 \times 10.0 + 0.2 \times 5.0 = 8.0$

Of course, this estimated value for state Room 3 is not accurate, since we started with randomly
guessed values, and we didn’t include the other states. But if we include all the states in the state space
and repeat the process a large number of times, in the end the estimated values will be very
close to the true values.
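
Plugging the numbers from this worked example into the bellman_backup sketch above reproduces the 8.0 estimate. The dictionaries below are only the fragment of the service dog MRP quoted in the example (reward −1 in Room 3, transition probabilities 0.8 and 0.2, and the randomly chosen initial values), not the full model.

```python
# Fragment of the service dog MRP, using only the quantities quoted above.
R = {"Room 3": -1.0}
P = {"Room 3": {"Found item": 0.8, "Room 2": 0.2}}
V = {"Found item": 10.0, "Room 2": 5.0}  # randomly chosen initial guesses

print(bellman_backup("Room 3", R, P, V, gamma=1.0))  # -1 + 0.8*10.0 + 0.2*5.0 = 8.0
```

Sweeping this update over every state in the state space and repeating it many times is exactly the iterative procedure alluded to here; it is developed as dynamic programming in the next chapter.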

Fig. 2.6 Example of how to compute the value of a state for the service dog MRP, γ = 1.0; the numbers are chosen
randomly

Figure 2.7 shows the true values of the states for our service dog MRP. The values are computed
using Eq. (2.7) and dynamic programming, which is an iterative method that can also be used to solve
MDPs. We will introduce dynamic programming methods in the next chapter. For this experiment, we
use a discount factor of γ = 0.9. We can see that the state Found item has the highest value among all
states.
Why do we want to estimate the state values? Because it can help the agent make better decisions.
If the agent knows which state is better (in terms of expected return), it can choose actions that may
lead it to those better states. For example, in Fig. 2.7, if the current state is Room 2, then the best
successor state is Room 3, since it has the highest state value of 7.0 among all the possible successor
states for Room 2. By selecting actions that lead to high-value states, the agent can maximize its
long-term expected return.
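
As a small illustration of how value estimates guide decisions, the snippet below simply picks the candidate successor state with the highest estimated value. Apart from the value 7.0 for Room 3 quoted above, the other values and the candidate successor set are hypothetical placeholders, not the actual numbers from Fig. 2.7.

```python
# Hypothetical value estimates; only V(Room 3) = 7.0 is taken from the text.
V = {"Room 1": 4.0, "Room 3": 7.0, "Outside": 5.0}

# Candidate successor states reachable from Room 2 (assumed for illustration).
successors_of_room2 = ["Room 1", "Room 3", "Outside"]

best_next = max(successors_of_room2, key=lambda s: V[s])
print(best_next)  # "Room 3", the highest-valued successor
```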

2.5 Markov Decision Process

Fig. 2.7 State values for the service dog MRP, γ = 0.9

Now we’re ready to discuss the details of the MDP. Similar to how the MRP extends the Markov
chain, the Markov decision process (MDP) extends the MRP by including actions and a policy in
the process. The MDP contains all the necessary components, including states, rewards, actions, and
policy. We can define the MDP as a tuple (S, A, P, R):

• S is a finite set of states called the state space.
• A is a finite set of actions called the action space.
• P is the dynamics function (or transition model) of the environment, where $P(s' \mid s, a) = P\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right]$ specifies the probability that the environment transitions into successor state s′ when in current state s and taking action a.
• R is the reward function of the environment, where $R(s, a) = \mathbb{E}\left[ R_t \mid S_t = s, A_t = a \right]$ is the reward signal provided by the environment when the agent is in state s and taking action a.

Note that the dynamics function P and the reward function R are now conditioned on the action At
chosen by the agent at time step t.
In Fig. 2.8, we have modeled our service dog example using an MDP. Before we move on, we want
to explain the small changes we’ve made to Fig. 2.8. First, we have merged states Found item and End
into a single terminal state Found item. This makes sense because the reward is now conditioned on
(s, a), and there are no additional meaningful states after the agent has reached the state Found item.
Second, the straight and curly lines in Fig. 2.8 represent valid actions that the agent can choose in a
state, rather than state transition probabilities. Finally, the reward now depends on the action chosen
by the agent, and the reward values are slightly different.
In fact, our service dog example is now modeled as a deterministic (stationary) reinforcement
learning environment. This means that if the agent takes the same action a in the same state s, the
successor state s′ and reward r will always be the same, regardless of whether we repeat it 100 times
or 1 million times. The transition to the successor state s′ is guaranteed to happen with probability 1.0.
For example, if the agent is in state Room 2 and chooses to Go outside, the successor state will always
be Outside; there is zero chance that the successor state will be Room 3 or Room 1.

Fig. 2.8 Service dog MDP

We can list all the elements of the sets S, A, and R for our service dog MDP; notice that not all
actions are available (or legal) in each state:

• S = {Room 1, Room 2, Room 3, Outside, Found item}
• A = {Go to room1, Go to room2, Go to room3, Go outside, Go inside, Search}
• R = {−1, −2, +1, 0, +10}

As we have explained before, our service dog MDP is a deterministic (stationary) reinforcement
learning environment. The dynamics function is slightly different compared to our previous example.
For MDPs, the transition from the current state s to its successor state s′ depends on the current state
s and the action a chosen by the agent. For a deterministic environment, the transition probability is
always 1.0 for legal actions a ∈ A(s), and 0 for illegal actions a ∉ A(s), which are actions not allowed
in the environment. Illegal actions should never be chosen by the agent, since most environments have
enforced checks at some level. For example, in the game of Go, if a player makes an illegal move,
they automatically lose the game.
We can still construct a single matrix for the dynamics function for our service dog MDP, but this
time it needs to be a 3D matrix. Since it’s not easy for us to draw a 3D matrix, we chose to use a 2D
plane to explain the concept for a single state. Assume the current state of the environment is Room
2; each row of the plane represents an action, and each column of the plane represents a successor
state s′. For consistency purposes, we set the transition probability to 1.0 for the successor state Room
2 for all illegal actions (Go to room2, Go inside, Search). This just means that these illegal actions
won’t affect the state of the environment.
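
To make this 2D-plane description concrete, here is a sketch of the dynamics-function slice for state Room 2 as a nested dictionary. The text only confirms that Go outside leads to Outside and that Go to room2, Go inside, and Search are illegal in Room 2; the remaining legal-action transitions are assumptions read off Fig. 2.8, and the illegal actions keep the state unchanged with probability 1.0, as described above.

```python
# One slice of the dynamics function P(s' | s, a) for our deterministic
# service dog MDP: the row for state "Room 2".
P_room2 = {
    "Go to room1": {"Room 1": 1.0},   # assumed legal, read off Fig. 2.8
    "Go to room3": {"Room 3": 1.0},   # assumed legal, read off Fig. 2.8
    "Go outside":  {"Outside": 1.0},  # confirmed in the text
    # Illegal actions in Room 2 leave the state unchanged:
    "Go to room2": {"Room 2": 1.0},
    "Go inside":   {"Room 2": 1.0},
    "Search":      {"Room 2": 1.0},
}

# Every action's probabilities sum to 1.0, and each transition is deterministic.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P_room2.values())
```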
it. I’d like you to remember one thing, though, and that is that this
meeting was not of my seeking. If I’ve detained you, it was with the
hope that perhaps you might be willing to listen to the truth, to learn
what a dreadful mistake you have made, of the horrible wrong you
have done——”
“To you?”
“No,” sternly. “To Nina Jaffray. Think what you like of me,” he went
on with sudden passion. “It doesn’t matter. You can’t make a new
pain sharper than the old one. But you’ve got to do justice to her.”
“What is the use, Mr. Gallatin?”
“It’s a lie that they’ve told, a cruel lie, as you’ll learn some day
when it will be too late to repair the wrong you’ve done.”
“I don’t believe that it was a lie, Mr. Gallatin. A lie will not persist
against odds. This does. You’ve done your duty. Now please let me
go.”
“Not yet. You needn’t be afraid of me.”
“Let me pass.”
“In a moment—when you listen. You must. Nina Jaffray is
blameless. She would not deny such a story. It would demean her to
deny it as it demeans me.”
“It does demean you,” she broke in pitilessly, “as other things have
demeaned you. Shame, Mr. Gallatin! Do you think I could believe the
word of a man who seeks revenge for a woman’s indifference? Who
finding her invulnerable goes to the ends of his resources to attack
the members of her family? Trying by methods known only to
himself and those of his kind to hinder the success of those more
diligent than himself, to smirch the good name of an honest man, to
obtain money——”
“Stop,” cried Gallatin hoarsely, and in spite of herself she obeyed.
For he was leaning forward toward her, the long fingers of one hand
trembling before him.
“You’ve gone almost too far, Miss Loring,” he whispered. “You are
talking about things of which you know nothing. I will not speak of
that, nor shall you, for whatever our relations have been or are now,
nothing in them justifies that insult. Time will prove the right or the
wrong of the matter between Henry K. Loring and me as time will
prove the right and the wrong to his daughter. I ask nothing of her
now, nor ever shall, not even a thought. The girl I am thinking of
was gentle, kind, sincere. She looked with the eyes of compassion,
the far-seeing gaze of innocence unclouded by bitterness or doubt. I
gave her all that was best in me, all that was honest, all that was
true, and in return she gave me courage, purpose, resolution. I
loved her for herself, because she was herself, but more for the
things she represented—purity, nobility, strength which I drew from
her like an inspiration. It was to her that I owed the will to conquer
myself, the purpose to win back my self-respect. I thanked God for
her then and I’m thankful now, but I’m more thankful that I’m no
longer dependent on her.”
Jane had sunk on the bench again, her head bent and a sound
came from her lips. But he did not hear it.
“I do not need her now,” he went on quietly. “What she was is
only a memory; what she is, only a regret. I shall live without her. I
shall live without any woman, for no woman could ever be to me
what that memory is. I love it passionately, reverently, madly,
tenderly, and will be true to it, as I have always been. And, if ever
the moment comes when the woman that girl has grown to be looks
into the past, let her remember that love knows not doubt or
bitterness, that it lives upon itself, is sufficient unto itself and that,
whatever happens, is faithful until death.”
He stopped and stepped aside.
“I have finished, Miss Loring. Now go!”
The peremptory note startled her and she straightened and slowly
rose. His head was bowed but his finger pointed toward the door of
the conservatory. As she passed him she hesitated as though about
to speak, and then slowly raising her head walked past him and
disappeared.
XXVI
BIG BUSINESS

Tooker fidgeted uneasily with the papers on the junior partner’s desk, moving to the safe in the main office and back again,
bringing bundles of documents which he disposed in an orderly row
where Mr. Gallatin could put his hands on them. Eleven o’clock was
the hour set for the conference between Henry K. Loring and Philip
Gallatin. Mr. Leuppold had written last week that Mr. Loring had
agreed to a conference and asked Mr. Gallatin to come to his, Mr.
Leuppold’s, private office at a given time. Gallatin had agreed to the
day and hour named, but politely insisted that Mr. Leuppold and Mr.
Loring come to his office. It would have made no difference in the
result, of course, but Gallatin had reasons of his own.
At ten o’clock Philip Gallatin came in and read his mail. He had
returned yesterday from his southern visit, and in the afternoon had
gone over, with Mr. Kenyon and Mr. Hood, the details of the case.
The matter had been discussed freely, but it was clear to Tooker,
who had been present, that the other partners had been able to add
nothing but their approval to the work which Gallatin had done.
His mail finished, Gallatin took up the other papers on his desk
and scrutinized them carefully, after which he glanced at his watch
and pressed the button for the chief clerk.
“There has been no message from Mr. Leuppold, Tooker?” he
asked.
“Nothing.”
Gallatin smiled. “That’s good. I was figuring on a slight chance
that they might want more time, and ask a postponement.”
