Approximate Dynamic Programming
Solving the Curses of Dimensionality
Second Edition
Warren B. Powell
Princeton University
The Department of Operations Research and Financial Engineering
Princeton, NJ
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee
to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400,
fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/
permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts
in preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic formats. For more information about Wiley products, visit our web
site at www.wiley.com.
10 9 8 7 6 5 4 3 2 1
Contents
Acknowledgments xvii
6 Policies 221
6.1 Myopic Policies, 224
6.2 Lookahead Policies, 224
6.3 Policy Function Approximations, 232
6.4 Value Function Approximations, 235
6.5 Hybrid Strategies, 239
6.6 Randomized Policies, 242
6.7 How to Choose a Policy?, 244
6.8 Bibliographic Notes, 247
Problems, 247
Bibliography 607
Index 623
Preface to the Second Edition
The writing for the first edition of this book ended around 2005, followed by a
year of editing before it was submitted to the publisher in 2006. As with everyone
who works in this very rich field, my understanding of the models and algorithms
was strongly shaped by the projects I had worked on. While I was very proud of
the large industrial applications that were the basis of my success, at the time I had
a very limited understanding of many other important problem classes that help to
shape the algorithms that have evolved (and continue to evolve) in this field.
In the five years that passed before this second edition went to the publisher, my
understanding of the field and my breadth of applications have grown dramatically.
Reflecting my own personal growth, I realized that the book needed a fundamen-
tal restructuring along several dimensions. I came to appreciate that approximate
dynamic programming is much more than approximating value functions. After
writing an article that included a list of nine types of policies, I realized that every
policy I had encountered could be broken down into four fundamental classes:
myopic policies, lookahead policies, policy function approximations, and policies
based on value function approximations. Many other policies can be created by
combining these four fundamental classes into different types of hybrids.
I also realized that methods for approximating functions (whether they be pol-
icy function approximations or value function approximations) could be usefully
organized into three fundamental strategies: lookup tables, parametric models, and
nonparametric models. Of course, these can also be combined in different forms.
In preparing the second edition, I came to realize that the nature of the decision
variable plays a critical role in the design of an algorithm. In the first edition, one
of my goals was to create a bridge between dynamic programming (which tended
to focus on small action spaces) and math programming, with its appreciation of
vector-valued decisions. As a result I had adopted x as my generic decision vari-
able. In preparing the new edition, I had come to realize that small action spaces
cover a very important class of problems, and these are also the problems that a
beginner is most likely to start with to learn the field. Also action “a” pervades the
reinforcement learning community (along with portions of the operations research
community), to the point that it is truly part of the language. As a result the second
edition now uses action “a” for most of its presentation, but reverts to x specifically
for problems where the decisions are continuous and/or (more frequently) vectors.
The challenges of vector-valued decisions have been largely overlooked in the rein-
forcement learning community, while the operations research community that works
on these problems has largely ignored the power of dynamic programming.
The second edition now includes a new chapter (Chapter 6) devoted purely to
a discussion of different types of policies, a summary of some hybrid strategies,
and a discussion of problems that are well suited to each of the different strategies.
This is followed by a chapter (Chapter 7) that focuses purely on the issue of policy
search. This chapter brings together fields such as stochastic search and simulation
optimization. The chapter also introduces a new class of optimal learning strate-
gies based on the concept of the knowledge gradient, an idea that was developed
originally to address the exploration–exploitation problem before realizing that it
had many other applications.
I also acquired a much better understanding of the different methods for approx-
imating value functions. I found that the best way to communicate the rich set of
strategies that have evolved was to divide the material into three chapters. The first
of these (Chapter 8) focuses purely on different statistical procedures for approxi-
mating value functions. While this can be viewed partly as a tutorial into statistics
and machine learning, the focus is on strategies that have been used in the approxi-
mate dynamic programming/reinforcement learning literature. ADP imposes special
demands on statistical learning algorithms, including the importance of recursive
estimation, and the need to start with a small number of observations (which works
better with a low-dimensional model) and transition to a larger number of obser-
vations with models that are high-dimensional in certain regions. Next, Chapter 9
summarizes different methods for estimating the value of being in a state using
sample information, with the goal of estimating the value function for a fixed pol-
icy. Since I have found that a number of papers focus on a single policy without
making this apparent, this chapter makes this very explicit by indexing variables
that depend on a policy with a superscript π . Finally, Chapter 10 addresses the very
difficult problem of estimating the value of being in a state while simultaneously
optimizing over policies.
Chapter 11 of this book is a refined version of the old Chapter 6, which
focused on stepsize rules. Chapter 11 is streamlined, with a new discussion of
the implications of algorithms based on policy iteration (including least squares
policy evaluation (LSPE), least squares temporal differences) and algorithms based
on approximate value iteration and Q-learning. Following some recent research,
I use the setting of a single state to develop a much clearer understanding of the
demands on a stepsize that are placed by these different algorithmic strategies. A
new section has been added introducing a stepsize rule that is specifically optimized
for approximate value iteration.
Chapter 12, on the famous exploration–exploitation problem in approximate
dynamic programming, has been heavily revised to reflect a much more thorough
understanding of the general field that is coming to be known as optimal learning.
This chapter includes a recently developed method for doing active learning in the
presence of a physical state, by way of the concept of the knowledge gradient.
While this method looks promising, the general area of doing active learning in the
context of dynamic programs (with a physical state) is an active area of research
at the time of this writing.
A major theme of the first edition was to bridge the gap between disciplines,
primarily reinforcement learning (computer science), simulation, and math pro-
gramming (operations research). This edition reinforces this theme first by adopting
more broadly the notation and vocabulary of reinforcement learning (which has
made most of the contributions to this field) while retaining the bridge to math
programming, but now also including stochastic search and simulation optimization
(primarily in the context of policy search).
The mathematical level of the book continues to require only an understanding
of statistics and probability. A goal of the first edition was that the material would
be accessible to an advanced undergraduate audience. With this second edition
a more accurate description would be that the material is accessible to a highly
motivated and well prepared undergraduate, but the breadth of the material is more
suitable to a graduate audience.
Preface to the First Edition
The path to completing this book began in the early 1980s when I first started
working on dynamic models arising in the management of fleets of vehicles for
the truckload motor carrier industry. It is often said that necessity is the mother
of invention, and as with many of my colleagues in this field, the methods that
emerged evolved out of a need to solve a problem. The initially ad hoc models and
algorithms I developed to solve these complex industrial problems evolved into
a sophisticated set of tools supported by an elegant theory within a field that is
increasingly being referred to as approximate dynamic programming.
The methods in this book reflect the original motivating applications. I started
with elegant models for which academia is so famous, but my work with industry
revealed the need to handle a number of complicating factors that were beyond the
scope of these models. One of these was a desire from one company to understand
the effect of uncertainty on operations, requiring the ability to solve these large-
scale optimization problems in the presence of various forms of randomness (but
most notably customer demands). This question launched what became a multiple-
decade search for a modeling and algorithmic strategy that would provide practical,
but high-quality, solutions.
This process of discovery took me through multiple fields, including linear and
nonlinear programming, Markov decision processes, optimal control, and stochas-
tic programming. It is somewhat ironic that the framework of Markov decision
processes, which originally appeared to be limited to toy problems (three trucks
moving between five cities), turned out to provide the critical theoretical frame-
work for solving truly industrial-strength problems (thousands of drivers moving
between hundreds of locations, each described by complex vectors of attributes).
The ability to solve these problems required the integration of four major dis-
ciplines: dynamic programming (Markov decision processes), math programming
(linear, nonlinear and integer programming), simulation, and statistics. My desire
to bring together the fields of dynamic programming and math programming moti-
vated some fundamental notational choices (in particular, the use of x as a decision
variable). In this book there is as a result a heavy dependence on the Monte Carlo
methods so widely used in simulation, but a knowledgeable reader will quickly
see how much is missing. The book covers in some depth a number of important
techniques from statistics, but even this presentation only scratches the surface
of tools and concepts available from fields such as nonparametric statistics,
signal processing and approximation theory.
Audience
The book is aimed primarily at an advanced undergraduate/masters audience with
no prior background in dynamic programming. The presentation does expect a first
course in probability and statistics. Some topics require an introductory course in
linear programming. A major goal of the book is the clear and precise presentation
of dynamic problems, which means there is an emphasis on modeling and notation.
The body of every chapter focuses on models and algorithms with a minimum
of the mathematical formalism that so often makes presentations of dynamic pro-
grams inaccessible to a broader audience. Using numerous examples, each chapter
emphasizes the presentation of algorithms that can be directly applied to a variety
of applications. The book contains dozens of algorithms that are intended to serve
as a starting point in the design of practical solutions for real problems. Material for
more advanced graduate students (with measure-theoretic training and an interest
in theory) is contained in sections marked with **.
The book can be used quite effectively in a graduate level course. Several
chapters include “Why does it work” sections at the end that present proofs at an
advanced level (these are all marked with **). This material can be easily integrated
into the teaching of the material within the chapter.
Approximate dynamic programming is also a field that has emerged from several
disciplines. I have tried to expose the reader to the many dialects of ADP, reflect-
ing its origins in artificial intelligence, control theory, and operations research. In
addition to the diversity of words and phrases that mean the same thing—but often
with different connotations—I have had to make difficult notational choices.
I have found that different communities offer unique insights into different
dimensions of the problem. In the main, the control theory community has the most
thorough understanding of the meaning of a state variable. The artificial intelligence
community has the most experience with deeply nested problems (which require
numerous steps before earning a reward). The operations research community has
evolved a set of tools that are well suited for high-dimensional resource allocation,
contributing both math programming and a culture of careful modeling.
W. B. P.
Acknowledgments
The work in this book reflects the contributions of many. Perhaps most important
are the problems that motivated the development of this material. This work would
not have been possible without the corporate sponsors who posed these problems
in the first place. I would like to give special recognition to Schneider National, the
largest truckload carrier in the United States, Yellow Freight System, the largest
less-than-truckload carrier, and Norfolk Southern Railroad, one of the four major
railroads that serves the United States. These three companies not only posed
difficult problems, they provided years of research funding that allowed me to
work on the development of tools that became the foundation of this book. This
work would never have progressed without the thousands of hours of my two senior
professional staff members, Hugo Simão and Belgacem Bouzaiëne-Ayari, who have
written hundreds of thousands of lines of code to solve industrial-strength problems.
It is their efforts working with our corporate sponsors that brought out the richness
of real applications, and therefore the capabilities that our tools needed to possess.
While industrial sponsors provided the problems, without the participation of
my graduate students, I would simply have a set of ad hoc procedures. It is the
work of my graduate students that provided most of the fundamental insights and
algorithms, and virtually all of the convergence proofs. In the order in which
they joined my research program, the students are Linos Frantzeskakis, Raymond
Cheung, Tassio Carvalho, Zhi-Long Chen, Greg Godfrey, Joel Shapiro, Mike
Spivey, Huseyin Topaloglu, Katerina Papadaki, Arun Marar, Tony Wu, Abraham
George, Juliana Nascimento, Peter Frazier, and Ilya Ryzhov, all of whom are my
current and former students and have contributed directly to the material presented
in this book. My undergraduate senior thesis advisees provided many colorful
applications of dynamic programming, and they contributed their experiences with
their computational work.
The presentation has benefited from numerous conversations with profession-
als in this community. I am particularly grateful to Erhan Çinlar, who taught me
the language of stochastic processes that played a fundamental role in guiding my
notation in the modeling of information. I am also grateful for many conversa-
tions with Ben van Roy, Dimitri Bertsekas, Andy Barto, Mike Fu, Dan Adelman,
Lei Zhao, and Diego Klabjan. I would also like to thank Paul Werbos at NSF
for introducing me to the wonderful neural net community in IEEE, which con-
tributed what for me was a fresh perspective on dynamic problems. Jennie Si, Don
Wunsch, George Lendaris and Frank Lewis all helped educate me in the language
and concepts of the control theory community.
For the second edition of the book, I would like to add special thanks to Peter
Frazier and Ilya Ryzhov, who contributed the research on the knowledge gradient
for optimal learning in ADP, and improvements in my presentation of Gittins
indices. The research of Jun Ma on convergence theory for approximate policy
iteration for continuous states and actions contributed to my understanding in a
significant way. This edition also benefited from the contributions of Warren Scott,
Lauren Hannah, and Emre Barut who have combined to improve my understanding
of nonparametric statistics.
This research was first funded by the National Science Foundation, but the
bulk of my research in this book was funded by the Air Force Office of Sci-
entific Research, and I am particularly grateful to Dr. Neal Glassman for his
support through the early years. The second edition has enjoyed continued support
from AFOSR by Donald Hearn, and I appreciate Don’s dedication to the AFOSR
program.
Many people have assisted with the editing of this volume through numerous
comments. Mary Fan, Tamas Papp, and Hugo Simão all read various drafts of
the first edition cover to cover. I would like to express my appreciation to Boris
Defourny for an exceptionally thorough proofreading of the second edition. Diego
Klabjan and his dynamic programming classes at the University of Illinois provided
numerous comments and corrections. Special thanks are due to the students in my
own undergraduate and graduate dynamic programming classes who had to survive
the very early versions of the text. The second edition of the book benefited from
the many comments of my graduate students, and my ORF 569 graduate seminar
on approximate dynamic programming. Based on their efforts, many hundreds of
corrections have been made, though I am sure that new errors have been introduced.
I appreciate the patience of the readers who understand that this is the price of
putting in textbook form material that is evolving so quickly.
Of course, the preparation of this book required tremendous patience from my
wife Shari and my children Elyse and Danny, who had to tolerate my ever-present
laptop at home. Without their support, this project could never have been completed.
W.B.P.
CHAPTER 1

The Challenges of Dynamic Programming
The optimization of problems over time arises in many settings, ranging from the
control of heating systems to managing entire economies. In between are examples
including landing aircraft, purchasing new equipment, managing blood inventories,
scheduling fleets of vehicles, selling assets, investing money in portfolios, and just
playing a game of tic-tac-toe or backgammon. These problems involve making
decisions, then observing information, after which we make more decisions, and
then more information, and so on. Known as sequential decision problems, they
can be straightforward (if subtle) to formulate, but solving them is another matter.
Dynamic programming has its roots in several fields. Engineering and economics
tend to focus on problems with continuous states and decisions (these communities
refer to decisions as controls), which might be quantities such as location, speed,
and temperature. By contrast, the fields of operations research and artificial intel-
ligence work primarily with discrete states and decisions (or actions). Problems
that are modeled with continuous states and decisions (and typically in continuous
time) are often addressed under the umbrella of “control theory,” whereas problems
with discrete states and decisions, modeled in discrete time, are studied at length
under the umbrella of “Markov decision processes.” Both of these subfields set up
recursive equations that depend on the use of a state variable to capture history in a
compact way. There are many high-dimensional problems such as those involving
the allocation of resources that are generally studied using the tools of mathemati-
cal programming. Most of this work focuses on deterministic problems using tools
such as linear, nonlinear, or integer programming, but there is a subfield known as
stochastic programming that incorporates uncertainty. Our presentation spans all of
these fields.
[Figure 1.1  A shortest path network with origin q, intermediate nodes 1, 2, and 3, and destination r; each arc is labeled with its cost (the values 14, 8, 10, 5, 15, and 17 appear in the figure).]
Equation (1.1) has to be solved iteratively, where at each iteration, we loop over all
the nodes i in the network. We stop when none of the values v_i change. It should
be noted that this is not a very efficient way of solving a shortest path problem.
For example, in the early iterations it may well be the case that v_j = M for all
j ∈ I+. However, we use the method to illustrate dynamic programming.
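Here Equation (1.1) is the shortest-path recursion v_i = min over the arcs (i, j) leaving node i of (c_ij + v_j), with v_r = 0 at the destination. A minimal Python sketch of this iterative procedure follows; the function and variable names are illustrative, and the arc costs in the small example network are hypothetical choices consistent with the values reported in Table 1.1 below.

```python
M = 100  # a large initial value standing in for "infinity"

def shortest_path_values(arcs, nodes, destination):
    """Iteratively compute v[i], the cost of the cheapest path from node i to the destination.

    arcs  : dict mapping node i -> {successor j: arc cost c_ij}
    nodes : the order in which the nodes are visited on each pass
    """
    v = {i: M for i in nodes}
    v[destination] = 0
    while True:
        changed = False
        for i in nodes:
            if i == destination or i not in arcs:
                continue
            # Bellman update: v_i = min(v_i, min_j (c_ij + v_j)) over arcs leaving node i
            best = min(v[i], min(cost + v[j] for j, cost in arcs[i].items()))
            if best < v[i]:
                v[i] = best
                changed = True
        if not changed:  # a full pass with no change means the values have converged
            return v

# Illustrative network; node names follow the example, but the arc costs are assumed.
arcs = {
    "q": {"1": 8, "3": 15},
    "1": {"2": 8},
    "2": {"r": 10},
    "3": {"2": 5, "r": 17},
}
print(shortest_path_values(arcs, nodes=["q", "1", "2", "3", "r"], destination="r"))
```

With these particular costs the sketch returns the same final values as Table 1.1 (26, 18, 10, 15, and 0), although the actual network in Figure 1.1 may contain additional arcs.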
Table 1.1 illustrates the algorithm, assuming that we always traverse the nodes
in the order (q, 1, 2, 3, r). Note that we handle node 2 before node 3, which is the
reason why, even in the first pass, we learn that the path cost from node 3 to node
r is 15 (rather than 17). We are done after iteration 3, but we require iteration 4 to
verify that nothing has changed.
Table 1.1  Path cost from each node to node r after each node has been visited

                    Cost from Node
Iteration       q      1      2      3      r
    0          100    100    100    100     0
    1          100    100     10     15     0
    2           30     18     10     15     0
    3           26     18     10     15     0
    4           26     18     10     15     0

Shortest path problems arise in a variety of settings that have nothing to do with
transportation or networks. Consider, for example, the challenge faced by a college
freshman trying to plan her schedule up to graduation. By graduation, she must take
32 courses overall, including eight departmentals, two math courses, one science
course, and two language courses. We can describe the state of her academic
program in terms of how many courses she has taken under each of these five
categories. Let S_tc be the number of courses she has taken by the end of semester
t in category c ∈ {Total courses, Departmentals, Math, Science, Language}, and
let S_t = (S_tc)_c be the state vector. Based on this state, she has to decide which
courses to take in the next semester. To graduate, she has to reach the state S_8 =
(32, 8, 2, 1, 2). We assume that she has a measurable desirability for each course
she takes, and that she would like to maximize the total desirability of all her
courses.
The problem can be viewed as a shortest path problem from the state S_0 =
(0, 0, 0, 0, 0) to S_8 = (32, 8, 2, 1, 2). Let S_t be her current state at the beginning
of semester t, and let a_t represent the decisions she makes while determining what
courses to take. We then assume we have access to a transition function S^M(S_t, a_t),
which tells us that if she is in state S_t and takes action a_t, she will land in state
S_{t+1}, which we represent by simply using

S_{t+1} = S^M(S_t, a_t).

Letting C_t(S_t, a_t) be the contribution (the desirability) she earns by taking action
a_t while in state S_t, the value of being in state S_t satisfies the recursion

V_t(S_t) = max_{a_t} ( C_t(S_t, a_t) + V_{t+1}(S_{t+1}) )   for all S_t ∈ 𝒮_t,

where S_{t+1} = S^M(S_t, a_t) and where 𝒮_t is the set of all possible (discrete) states
that she can be in at the beginning of the year.
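As a concrete, hypothetical illustration of how the transition function and this recursion fit together, the sketch below collapses the five course categories to three, assigns each category a fixed per-course desirability, and uses a two-semester horizon with an exact graduation target. All of these numbers are invented for the example and do not come from the book.

```python
from functools import lru_cache
from itertools import product

# Hypothetical, drastically simplified instance: three course categories instead of five,
# a fixed desirability per course in each category, and a two-semester horizon.
DESIRABILITY = {"dept": 3.0, "math": 1.5, "other": 2.0}   # invented numbers
TARGET = (4, 2, 1)          # (total courses, departmentals, math) required to graduate
HORIZON = 2                 # number of semesters remaining
MAX_PER_SEMESTER = 2

def transition(state, action):
    """The transition function S^M(S_t, a_t): add this semester's course counts to the state."""
    total, dept, math = state
    d, m, o = action        # number of departmental, math, and other courses taken
    return (total + d + m + o, dept + d, math + m)

def contribution(action):
    """C_t(S_t, a_t): desirability earned by this semester's selections."""
    d, m, o = action
    return d * DESIRABILITY["dept"] + m * DESIRABILITY["math"] + o * DESIRABILITY["other"]

@lru_cache(maxsize=None)
def value(t, state):
    """V_t(S_t) = max_a [ C_t(S_t, a) + V_{t+1}(S^M(S_t, a)) ], with a terminal value of
    -infinity for states that miss the graduation requirement."""
    if t == HORIZON:
        return 0.0 if state == TARGET else float("-inf")
    best = float("-inf")
    for action in product(range(MAX_PER_SEMESTER + 1), repeat=3):
        if sum(action) > MAX_PER_SEMESTER:
            continue
        best = max(best, contribution(action) + value(t + 1, transition(state, action)))
    return best

print(value(0, (0, 0, 0)))  # maximum total desirability starting from an empty transcript
```

The pattern of enumerating actions, applying the transition function, and adding the downstream value is exactly what becomes intractable as the state and action vectors grow, which motivates the approximation strategies developed in the rest of the book.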
All dynamic programs can be written in terms of a recursion that relates the value
of being in a particular state at one point in time to the value of the states that we
can transition to at the next point in time.