A Concise Introduction to Decentralized POMDPs

Frans A. Oliehoek
Christopher Amato
Frans A. Oliehoek
School of Electrical Engineering, Electronics and Computer Science
University of Liverpool
Liverpool, UK

Christopher Amato
Computer Science and Artificial Intelligence Laboratory
MIT
Cambridge, MA, USA
This book presents an overview of formal decision making methods for decentral-
ized cooperative systems. It is aimed at graduate students and researchers in
artificial intelligence and related fields that deal with decision making, such
as operations research and control theory. While we have tried to make the book rel-
atively self-contained, we do assume some amount of background knowledge.
In particular, we assume that the reader is familiar with the concept of an agent as
well as search techniques (like depth-first search, A*, etc.), both of which are stan-
dard in the field of artificial intelligence [Russell and Norvig, 2009]. Additionally,
we assume that the reader has a basic background in probability theory. Although
we give a very concise background in relevant single-agent models (i.e., the ‘MDP’
and ‘POMDP’ frameworks), a more thorough understanding of those frameworks
would benefit the reader. A good first introduction to these concepts can be found
in the textbook by Russell and Norvig, with additional details in texts by Sutton and
Barto [1998], Kaelbling et al. [1998], Spaan [2012] and Kochenderfer et al. [2015].
We also assume that the reader has a basic background in game theory and game-
theoretic notations like Nash equilibrium and Pareto efficiency. Even though these
concepts are not central to our exposition, we do place the Dec-POMDP model in
the more general context they offer. For an explanation of these concepts, the reader
could refer to any introduction on game theory, such as those by Binmore [1992],
Osborne and Rubinstein [1994] and Leyton-Brown and Shoham [2008].
This book heavily builds upon earlier texts by the authors. In particular, many
parts were based on the authors’ previous theses, book chapters and survey articles
[Oliehoek, 2010, 2012, Amato, 2010, 2015, Amato et al., 2013]. This also means
that, even though we have tried to give a relatively complete overview of the work
in the field, the text in some cases is biased towards examples and methods that have
been considered by the authors. For the description of further topics in Chapter 8,
we have selected those that we consider important and promising for future work.
Clearly, there is a necessarily large overlap between these topics and the authors’
recent work in the field.
Acknowledgments
Writing a book is not a standalone activity; it builds upon all the insights devel-
oped in the interactions with peers, reviewers and coauthors. As such, we are grateful
for the interaction we have had with the entire research field. We specifically want
to thank the attendees and organizers of the workshops on multiagent sequential
decision making (MSDM) which have provided a unique platform for exchange of
thoughts on decision making under uncertainty.
Furthermore, we would like to thank João Messias, Matthijs Spaan, Shimon
Whiteson, and Stefan Witwicki for their feedback on sections of the manuscript.
Finally, we are grateful to our former supervisors, in particular Nikos Vlassis and
Shlomo Zilberstein, who enabled and stimulated us to go down the path of research
on decentralized decision making.
Contents
3 Finite-Horizon Dec-POMDPs
  3.1 Optimality Criteria
  3.2 Policy Representations: Histories and Policies
    3.2.1 Histories
    3.2.2 Policies
  3.3 Multiagent Beliefs
  3.4 Value Functions for Joint Policies
  3.5 Complexity
6 Infinite-Horizon Dec-POMDPs
  6.1 Optimality Criteria
    6.1.1 Discounted Cumulative Reward
    6.1.2 Average Reward
  6.2 Policy Representation
    6.2.1 Finite-State Controllers: Moore and Mealy
    6.2.2 An Example Solution for Dec-Tiger
    6.2.3 Randomization
    6.2.4 Correlation Devices
  6.3 Value Functions for Joint Policies
  6.4 Undecidability, Alternative Goals and Their Complexity
8 Further Topics
  8.1 Exploiting Structure in Factored Models
    8.1.1 Exploiting Constraint Optimization Methods
      8.1.1.1 Coordination (Hyper-)Graphs
      8.1.1.2 ND-POMDPs
      8.1.1.3 Factored Dec-POMDPs
    8.1.2 Exploiting Influence-Based Policy Abstraction
  8.2 Hierarchical Approaches and Macro-Actions
  8.3 Communication
    8.3.1 Implicit Communication and Explicit Communication
      8.3.1.1 Explicit Communication Frameworks
      8.3.1.2 Updating of Information States and Semantics
    8.3.2 Delayed Communication
      8.3.2.1 One-Step Delayed Communication
      8.3.2.2 k-Steps Delayed Communication
    8.3.3 Communication with Costs
    8.3.4 Local Communication
  8.4 Reinforcement Learning
9 Conclusion
References
Acronyms
AH action history
AOH action-observation history
BG Bayesian game
CG coordination graph
CBG collaborative Bayesian game
COP constraint optimization problem
DAG directed acyclic graph
DP dynamic programming
Dec-MDP decentralized Markov decision process
Dec-POMDP decentralized partially observable Markov decision process
DICE direct cross-entropy optimization
EM expectation maximization
EXP deterministic exponential time (complexity class)
FSC finite-state controller
FSPC forward-sweep policy computation
GMAA* generalized multiagent A*
I-POMDP interactive partially observable Markov decision process
MAS multiagent system
MARL multiagent reinforcement learning
MBDP memory-bounded dynamic programming
MDP Markov decision process
MILP mixed integer linear program
NEXP non-deterministic exponential time (complexity class)
ND-POMDP networked distributed POMDP
NDP nonserial dynamic programming
NLP nonlinear programming
NP non-deterministic polynomial time (complexity class)
OH observation history
POMDP partially observable Markov decision process
PSPACE polynomial space (complexity class)
PWLC piecewise linear and convex
RL reinforcement learning
TD-POMDP transition-decoupled POMDP
List of Symbols
Throughout this text, we tried to make consistent use of typesetting to convey the
meaning of used symbols and formulas. In particular, we use blackboard bold fonts
(A,B, etc.) to denote sets, and subscripts to denote agents (typically i or j) or groups
of agents, as well as time (t or τ).
For instance a is the letter used to indicate actions in general, ai denotes an action
of agent i, and the set of its actions is denoted Ai . The action agent i takes at a
particular time step t is denoted ai,t . The profile of actions taken by all agents, a joint
action, is denoted a, and the set of such joint actions is denoted A. When referring
to the action profile of a subset e of agents we write ae , and for the actions of all
agents except agent i, we write a−i . On some occasions we will need to indicate the
index within a set, for instance the k-th action of agent i is written aki . In the list of
symbols below, we have shown all possible uses of notation related to actions (base
symbol ‘a’), but have not exhaustively applied such modifiers to all symbols.
· multiplication,
× Cartesian product,
◦ policy concatenation,
⇓ subtree policy consumption operator,
Δ(·) simplex over (·),
ε (small) constant,
θ̄ joint action-observation history,
θ̄i action-observation history,
Θ̄i action-observation history set,
ιi information state, or belief update, function,
μi macro-action policy for agent i,
π joint policy,
πi policy for agent i,
π −i (joint) policy for all agents but i,
π∗ optimal joint policy,
ρ number of reward functions,
Σ alphabet of communication messages,
σt plan-time sufficient statistic,
τ stages-to-go (τ = h − t),
υ domination gap,
ΦNext set of next policies,
ϕt past joint policy,
ξ parameter vector,
ψ correlation device transition function,
E set of hyper-edges,
e hyper-edge, or index of local payoff function (corresponding to a hyper-edge),
fξ probability distribution, parameterized by ξ ,
fξ ( j) distribution over joint policies at iteration j,
h horizon,
Ii→ j influence of agent i on agent j,
Ii information states for agent i,
Ii set of information states for agent i,
M Markov multiagent environment,
MDecP Dec-POMDP,
MMPOMDP MPOMDP,
MPT plan-time NOMDP,
mi agent model, also finite-state controller,
m agent component (a joint model),
mi macro-action for agent i,
Nb number of best samples,
Nf number of fire levels,
Next operation constructing next set of partial policies,
NULL null observation,
n number of agents,
O observation function,
Oi local observation function for agent i,
OC optimality criterion,
O set of joint observations,
Oi set of observations for agent i,
Ō joint observation history set,
Ōi observation history set for agent i,
o joint observation,
oi observation for agent i,
oi,∅ NULL observation for agent i,
ōi observation history of agent i,
ōt joint observation history at stage t,
ōt,|k| joint observation history at stage t of length k,
Qπ Q-value function for π,
Qτi set of subtree policies qτi,
Qτ set of joint subtree policies qτ,
Qτ+1e,i set of subtree policies resulting from exhaustive backup,
Qτ+1m,i set of maintained subtree policies,
qkt−k joint subtree policy of length k to be executed at stage t − k,
qτi τ-stages-to-go subtree policy for agent i,
qτ τ-stages-to-go joint subtree policy,
R reward function,
Ri local reward function for agent i,
Re local reward function (with index e),
R real numbers,
Chapter 1
Multiagent Systems Under Uncertainty

The impact of the advent of the computer on modern societies can hardly be over-
stated; every single day we are surrounded by more devices equipped with on-board
computation capabilities taking care of the ever-expanding range of functions they
perform for us. Moreover, the decreasing cost and increasing sophistication of hard-
ware and software opens up the possibility of deploying a large number of devices
or systems to solve real-world problems. Each one of these systems (e.g., computer,
router, robot, person) can be thought of as an agent which receives information and
makes decisions about how to act in the world. As the number and sophistication of
these agents increase, controlling them in such a way that they consider and cooper-
ate with each other becomes critical. In many of these multiagent systems (MASs),
cooperation is made more difficult by the fact that the environment is unpredictable
and the information available about the world and other agents (through sensors
and communication channels) is noisy and imperfect. Developing agent controllers
by hand becomes very difficult in these complex domains, so automated methods
for generating solutions from a domain specification are needed. In this book, we
describe a formal framework, called the decentralized partially observable Markov
decision process (Dec-POMDP), that can be used for decision making for a team of
cooperative agents. Solutions to Dec-POMDPs optimize the behavior of the agents
while considering the uncertainty related to the environment and other agents. As
discussed below, the Dec-POMDP model is very general and applies to a wide range
of applications.
From a historical perspective, thinking about interaction has been part of many
different ‘fields’ or, simply, aspects of life, such as philosophy, politics and war.
Mathematical analyses of problems of interaction date back to at least the begin-
ning of the eighteenth century [Bellhouse, 2007], driven by interest in games such
as chess [Zermelo, 1913]. This culminated in the formalization of game theory with
huge implications for the field of economics since the 1940s [von Neumann and
Morgenstern, 1944]. Other single-step cooperative team models were studied by
Marschak [1955] and Radner [1962], followed by systems with dynamics modeled
as team theory problems [Marschak and Radner, 1972, Ho, 1980] and the result-
ing complexity analysis [Papadimitriou and Tsitsiklis, 1987]. In the 1980s, people
in the field of Artificial Intelligence took their concept of agent, an artificial en-
tity that interacts with its environment over a sequence of time steps, and started
thinking about multiple such agents and how they could interact [Davis and Smith,
1983, Grosz and Sidner, 1986, Durfee et al., 1987, Wooldridge and Jennings, 1995,
Tambe, 1997, Jennings, 1999, Lesser, 1999]. Dec-POMDPs represent a probabilis-
tic generalization of this multiagent framework to model uncertainty with respect
to outcomes, environmental information and communication. We first discuss some
motivating examples for the Dec-POMDP model and then provide additional details
about multiagent systems, the uncertainty considered in Dec-POMDPs and applica-
tion domains.
Before diving deeper, this section will present two motivating examples for the mod-
els and techniques described in this book. The examples briefly illustrate the diffi-
culties and uncertainties one has to deal with when automating decisions in real-
world decentralized systems. Several other examples and applications are discussed
in Section 1.4.
Fig. 1.1: Illustration of a simple Recycling Robots example, in which two robots
remove trash in an office environment with three small (blue) trash cans and two
large (yellow) ones. In this situation, the left robot may observe that the large trash
can next to it is full, and the other robot may detect that the adjacent small trash
can is empty. Note that neither robot can be sure of the trash can’s true state due
to limited sensing capabilities, nor do the robots see the state of trash cans further
away. Also, the robots cannot observe each other at this distance and they do not
know the observations of the other robot due to a lack of communication.
It is not known in advance how fast the trash cans will fill up, but it is known that, because more people use them, the larger
trash cans fill up more quickly. Each robot needs to take actions based on its own
knowledge: while we encourage the robots to share some important information, we
would not want them to communicate constantly, as this could overload the office’s
wireless network and seems wasteful from an energy perspective. Each robot must
also ensure that its battery remains charged by moving to a charging station before
it expires. The battery level for a robot degrades due to the distance the robot travels
and the weight of the item being carried. Each robot only knows its own battery
level (but not that of the other robots) and the location of other robots within sensor
range. The goal of this problem is to remove as much trash as possible in a given
time period. To accomplish this goal we want to find a plan, or policy, that specifies
for each robot how to behave as a function of its own observations, such that the
joint behavior is optimal. While this problem may appear simple, it is not. Due to
uncertainty, the robots cannot accurately predict the amount of battery reduction that
results from moving. Furthermore, due to noisy and insufficient sensors, each robot
does not accurately know the position and state of the trash cans and other robots.
As a result of these information deficiencies, deciding which trash cans to navigate
to and when to recharge the battery is difficult. Moreover, even if hand-coding the
solution for a single robot would be feasible, predicting how the combination of
policies (one for each robot) would perform in practice is extremely challenging.
Efficient Sensor Networks
Another application that has received significant interest over the last two decades
is that of sensor networks. These are networks of sensors (the agents) that are dis-
tributed in a particular environment with the task of measuring certain things about
that environment and distilling this into high-level information. For instance, one
could think about sensor networks used for air pollution monitoring [Khedo et al.,
2010], gas leak detection [Pavlin et al., 2010], tracking people in office environ-
ments [Zajdel et al., 2006, Satsangi et al., 2015] or tracking of wildlife [Garcia-
Sanchez et al., 2010]. Successful application of sensor networks involves answering
many questions, such as what hardware to use, how the information from different
sensors can be fused, and how the sensors should measure various parts of their
environment to maximize information while minimizing power use. It is especially
questions of the latter type, which involve local decisions by the different sensors,
for which the Dec-POMDP framework studied in this book is relevant: in order to
decide about when to sense at a specific sensor node we need to reason about the ex-
pected information gain from turning that sensor on, which depends on the actions
taken at other sensors, as well as how the phenomenon to be tracked moves through
the spatial environment. For example, when tracking a person in an office environ-
ment, it may be sufficient to only turn on a sensor at the location where the target
is expected given all the previous observations in the entire system. Only when the
target is not where it is expected to be might other sensors be needed. However,
when communication bandwidth or energy concerns preclude the sharing of such
previous information, deciding when to turn a sensor on or off is complicated even further.
Again, finding plans for such problems is highly nontrivial: even if we were able to
specify plans for each node by hand, we typically would not know how good the
joint behavior of the sensor network is, and whether it could be improved.
Concluding, we have seen that in both these examples there are many different
aspects such as decentralization and uncertainties that make it very difficult to spec-
ify good actions to take. We will further elaborate on these issues in the remainder
of this introductory chapter and give an overview of many more domains for which
Decentralized POMDPs are important in Section 1.4.
1.2 Multiagent Systems

This book focuses on settings where there are multiple decision makers, or agents,
that jointly influence their environment. Such an environment together with the mul-
tiple agents that operate in it is called a multiagent system (MAS). The field of MAS
research is a broad interdisciplinary field with relations to distributed and concur-
rent systems, artificial intelligence (AI), economics, logic, philosophy, ecology and
the social sciences [Wooldridge, 2002]. The subfield of AI that deals with principles
and design of MASs is also referred to as ‘distributed AI’. Research on MASs is
motivated by the fact that it can potentially provide [Vlassis, 2007, Sycara, 1998]:
• Speedup and efficiency, due to the asynchronous and parallel computation.
• Robustness and reliability, since the whole system can undergo a ‘graceful degra-
dation’ when one or more agents fail.
• Scalability and flexibility, by adding additional agents as required.
• Lower cost, assuming the agents cost much less than a centralized system.
• Lower development cost and reusability, since it is easier to develop and maintain
a modular system.
There are many different aspects of multiagent systems, depending on the type of
agents, their capabilities and their environment. For instance, in a homogeneous
MAS all agents are identical, while in a heterogeneous MAS the design and capa-
bilities of each agent can be different. Agents can be cooperative, self-interested
or adversarial. The environment can be dynamic or static. These are just a few of
many possible parameters, leading to a number of possible settings too large to
describe here. For a more extensive overview, we refer the reader to the texts by
Huhns [1987], Singh [1994], Sycara [1998], Weiss [1999], Stone and Veloso [2000],
Yokoo [2001], Wooldridge [2002], Bordini et al. [2005], Shoham and Leyton-Brown
[2007], Vlassis [2007], Buşoniu et al. [2008] and Weiss [2013]. In this book, we
will focus on decision making for heterogeneous, fully cooperative MASs in dy-
namic, uncertain environments in which agents need to act based on their individual
knowledge about the environment. Due to the complexity of such settings, hand-
coded solutions are typically infeasible [Kinny and Georgeff, 1997,
Weiss, 2013]. Instead, the approach advocated in this book is to describe such prob-
lems using a formal model—the decentralized partially observable Markov decision
process (Dec-POMDP)—and to develop automatic decision making procedures, or
planning methods, for them.
Related Approaches Due to the multi-disciplinary nature of the field of MASs,
there are many disciplines that are closely related to the topic at hand; we point
out the most relevant of them here.
For instance, in the field of game theory, much research focuses on extensive
form games and partially observable stochastic games, both of which are closely
related to Dec-POMDPs (more on this connection in Section 2.4.5). The main dif-
ference is that game theorists have typically focused on self-interested settings.
The ‘classical planning problem’ as studied in the AI community also deals with
decision making, but for a single agent. These methods have been extended to the
multiagent setting, resulting in a combination of planning and coordination, e.g.
distributed problem solving (DPS) [Durfee, 2001]. However, like classical planning
itself, these extensions typically fail to address stochastic or partially observable en-
vironments [desJardins et al., 1999, de Weerdt et al., 2005, de Weerdt and Clement,
2009].
The field of teamwork theory also considers cooperative MASs, and the
belief-desire-intention (BDI) model of practical reasoning [Bratman, 1987, Rao and
Georgeff, 1995, Georgeff et al., 1999] has inspired many teamwork theories, such
as joint intentions [Cohen and Levesque, 1990, 1991a,b] and shared plans [Grosz
and Sidner, 1990, Grosz and Kraus, 1996], and implementations [Jennings, 1995,
Tambe, 1997, Stone and Veloso, 1999, Pynadath and Tambe, 2003]. While such
BDI-based approaches do allow for uncertainty, they typically rely on (manually)
pre-specified plans, which can be difficult to specify. A further drawback is that it
is difficult to define clear quantitative measures for their performance, making
it hard to judge their quality [Pynadath and Tambe, 2002, Nair and Tambe, 2005].
Finally, there are also close links to the operations research (OR) and control the-
ory communities. The Dec-POMDP model is a generalization of the (single-agent)
MDP [Bellman, 1957] and POMDP [Åström, 1965] models which were developed
in OR, and later became popular in AI as a framework for planning for agents [Kael-
bling et al., 1996, 1998]. Control theory and especially optimal control essentially
deals with the same type of planning problems, but with an emphasis on continuous
state and action spaces. Currently, researchers in the field of decentralized control
are working on problems very similar to Dec-POMDPs [e.g., Nayyar et al., 2011,
2014, Mahajan and Mannan, 2014], and, in fact, some results have been established
in parallel both in this and the AI community.
1.3 Uncertainty
Many real-world applications for which we would want to use MASs are subject
to various forms of uncertainty. This makes it difficult to predict the outcome of a
particular plan (e.g., there may be many possible outcomes) and thus complicates
finding good plans. Here we discuss different types of uncertainty that the Dec-
POMDP framework can cope with.
Outcome Uncertainty. In many situations, the outcome or effects of actions may
be uncertain. In particular we will assume that the possible outcomes of an action
are known, but that each of those outcomes is realized with some probability (i.e.,
the state of the environment changes stochastically). For instance, due to different
surfaces leading to varying amounts of wheel slip, it may be difficult, or even im-
possible, to accurately predict exactly how far our recycling robots move. Similarly,
the amount of trash being put in a bin depends on the activities performed by the
humans in the environment and is inherently stochastic from the perspective of any
reasonable model.1
State Uncertainty. In the real world an agent might not be able to determine what
the state of the environment exactly is. In such cases, we say that the environment is
partially observable. Partial observability results from noisy and/or limited sensors.
Because of sensor noise an agent can receive faulty or inaccurate sensor readings,
or observations. For instance, the air pollution measurement instruments in a sensor
network may give imperfect readings, or gas detection sensors may fail to detect gas
with some probability. When sensors are limited, the agent is unable to observe the
differences between certain states of the environment because they inherently can-
not be distinguished by the sensor. For instance, a recycling robot may simply not
be able to tell whether a trash can is full if it does not first navigate to it. Similarly,
a sensor node typically will only make a local measurement. Due to such sensor
limitations, the same sensor reading might require different action choices, a phe-
nomenon referred to as perceptual aliasing. In order to mitigate these problems, an
agent may use the history of actions it took and the observations it made to get a
better estimate of the state of the environment.
Multiagent Uncertainty: Uncertainty with Respect to Other Agents. Another
complicating factor in MASs is the presence of multiple agents that each make de-
cisions that influence the environment. The difficulty is that each agent can be un-
certain regarding the other agents’ actions. This is apparent in self-interested and
especially adversarial settings, such as games, where agents may not share infor-
mation or try to mislead other agents [Binmore, 1992]. In such settings each agent
should try to accurately predict the behavior of the others in order to maximize its
payoff. But even in cooperative settings, where the agents have the same goal and
therefore are willing to coordinate, it is nontrivial how such coordination should be performed.
1 To be clear, here we exclude models that try to predict human activities in a deterministic fashion
(e.g., this would require perfectly modeling the current activities as well as the ‘internal state’ of
all humans in the office building) from the set of reasonable models.
1.4 Applications
Decision making techniques for cooperative MASs under uncertainty have a great
number of potential applications, ranging from more abstract tasks located in a digi-
tal or virtual environment to a real-world robotics setting. Here we give an overview
of some of these.
An example of a more abstract task is distributed load balancing among queues.
Here, each agent represents a processing unit with a queue, and can only observe
its own queue size and that of its immediate neighbors. The agents have to decide
whether to accept new jobs or pass them to another queue. Such a restricted prob-
lem can be found in many settings, for instance, industrial plants or a cluster of
webservers. The crucial difficulty is that in many of these settings, the overhead as-
sociated with communication is too high, and the processing units will need to make
decisions on local information [Cogill et al., 2004, Ouyang and Teneketzis, 2014].
Another abstract, but very important domain is that of transmission protocols
and routing in communication networks. In these networks, the agents (e.g.,
routers) operate under severe communication restrictions, since the cost of send-
either not possible (e.g., there is too little bandwidth to transmit video streams from
many cameras or transmission is not sufficiently powerful) or consumes resources
(e.g., battery power) and thus has a particular cost. Therefore Dec-POMDPs are
crucial for essentially all teams of embodied agents. Examples of such settings are
considered both in theory/simulation, such as multirobot space exploration [Becker
et al., 2004b, Witwicki and Durfee, 2010b], as well as in real hardware robot imple-
mentation, e.g., multirobot search of a target [Emery-Montemerlo, 2005], robotic
soccer [Messias, 2014] and a physical implementation of a problem similar to Recycling Robots [Amato et al., 2014].
A final, closely related, application area is that of decision support systems for
complex real-world settings, such as crisis management. Also in this setting, it is
inherently necessary to deal with the real world, which often is highly uncertain.
For instance, a number of research efforts have been performed within the context
of RoboCup Rescue [Kitano et al., 1999]. In particular, researchers have been able to
model small subproblems using Dec-POMDPs [Nair et al., 2002, 2003a,b, Oliehoek
and Visser, 2006, Paquet et al., 2005]. Another interesting application is presented
by Shieh et al. [2014], who apply Dec-MDPs in the context of security games which
have been used for securing ports, airports and metro-rail systems [Tambe, 2011].
Chapter 2
The Decentralized POMDP Framework
Before diving into the core of multiagent decision making under uncertainty, we
first give a concise treatment of the single-agent problems that we will build upon. In
particular, we will treat Markov decision processes (MDPs) and partially observable
Markov decision processes (POMDPs). We expect the reader to be (somewhat) familiar with
these models. Hence, these sections serve as a refresher and to introduce notation.
For more details we refer the reader to the texts by Russell and Norvig [2009], Put-
erman [1994], Sutton and Barto [1998], Kaelbling et al. [1998] and Spaan [2012].
Fig. 2.1: Schematic representation of an MDP. At every stage, the agent takes an
action and observes the resulting state s′.
2.1.1 MDPs
Two factors, however, complicate the computation of such plans: the inability of the
agent to observe the state of the environment, as well as the presence of multiple agents.
2.1.2 POMDPs
As mentioned in Section 1.3, noisy and limited sensors may prevent the agent
from observing the state of the environment, because the observations are inaccurate
and perceptual aliasing may occur. In order to represent such state uncertainty, a
partially observable Markov decision process (POMDP) extends the MDP model
by incorporating observations and their probability of occurrence conditional on the
state of the environment [Kaelbling et al., 1998, Cassandra, 1998, Spaan, 2012].
This is illustrated in Figure 2.2. In a POMDP, an agent no longer knows the state of
the world, but rather has to maintain a belief over states. That is, it can use the history
of observations to estimate the probability of each state and use this information to
decide upon an action.
Given the belief bt, the action at taken at stage t and the observation ot+1 received, the
updated belief is obtained via Bayes’ rule:

    bt+1(st+1) = (1 / Pr(ot+1 | at, bt)) ∑st Pr(st+1, ot+1 | st, at) bt(st).

In this equation, Pr(ot+1 | at, bt) is a normalizing constant, and Pr(st+1, ot+1 | st, at) is
the probability that the POMDP model specifies for receiving the particular new
state st+1 and the resulting observation ot+1 assuming st was the previous state.
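To make this update concrete, the following is a minimal sketch of the belief update in Python. The dictionary-based model format and the function name are our own illustration, not notation from the text.

```python
def belief_update(b, a, o, states, T, O):
    """Bayes' rule update of a POMDP belief.

    b: dict mapping state -> probability (the current belief b_t)
    a: action taken at stage t
    o: observation received at stage t+1
    T[s][a][s2]: probability of transitioning from s to s2 under action a
    O[a][s2][o]: probability of observing o in state s2 after taking a
    Returns the updated belief b_{t+1} as a dict over states.
    """
    new_b = {}
    for s2 in states:
        # Pr(s2, o | b, a) = O(o | a, s2) * sum_s T(s2 | s, a) * b(s)
        new_b[s2] = O[a][s2][o] * sum(T[s][a][s2] * b[s] for s in states)
    norm = sum(new_b.values())  # Pr(o | a, b), the normalizing constant
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return {s2: p / norm for s2, p in new_b.items()}
```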
In control theory, the (continuous) observations, also referred to as measure-
ments, are typically described as a deterministic function of the state. Sensor noise
is modeled by adding a random disturbance term to this function and is dealt with by
introducing a state estimator component, e.g., by Kalman filtering [Kalman, 1960].
Perceptual aliasing arises when a state component cannot be measured directly. For
instance, it may not be possible to directly measure angular velocity of a robot arm;
in this case it may be possible to use a so-called observer to estimate this velocity
from its positions over time.
Although the treatment of state uncertainty in classical control theory involves
terminology and techniques different from those used in POMDPs, the basic
idea in both is the same: use information gathered from the history of observations
in order to improve decisions. There also is one fundamental difference, however.
Control theory typically separates the estimation from the control component. For
example, the estimator returns a particular value for the angles and angle veloci-
ties of the robot arm and these values are used to select actions as if there was no
uncertainty. In contrast, POMDPs allow the agent to explicitly reason over the be-
lief and what the best action is given that belief. As a result, agents using POMDP
techniques can reason about information gathering: when beneficial, they will select
actions that will provide information about the state.
2.2 Multiagent Decision Making: Decentralized POMDPs

Although POMDPs provide a principled treatment of state uncertainty, they only consider
a single agent. In order to deal with the effects of uncertainty with respect to
other agents, this book will consider an extension of the POMDP framework, called
the decentralized POMDP (Dec-POMDP).
The Dec-POMDP framework is illustrated in Figure 2.3. As the figure shows, it
generalizes the POMDP to multiple agents and thus can be used to model a team of
cooperative agents that are situated in a stochastic, partially observable environment.
Formally, a Dec-POMDP can be defined as follows.1
1 Pynadath and Tambe [2002] introduced a model called multiagent team decision problem
(MTDP), which is essentially equivalent to the Dec-POMDP.
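The ingredients referenced throughout this chapter—the n agents, the states S with initial distribution b0, the per-agent actions Ai and observations Oi, the transition function T, the observation function O, the single shared reward function R and the horizon h—can be collected in a simple container. The sketch below is purely illustrative; the field names and types are ours, not the book's notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]        # one individual action per agent
JointObservation = Tuple[str, ...]   # one individual observation per agent

@dataclass
class DecPOMDPSpec:
    """Illustrative container for the components of a Dec-POMDP."""
    agents: List[str]                        # the n agents
    states: List[str]                        # S
    actions: Dict[str, List[str]]            # A_i for each agent i
    observations: Dict[str, List[str]]       # O_i for each agent i
    T: Callable[[str, JointAction, str], float]               # T(s' | s, a)
    O: Callable[[JointAction, str, JointObservation], float]  # O(o | a, s')
    R: Callable[[str, JointAction], float]   # shared reward R(s, a)
    h: int                                   # horizon
    b0: Dict[str, float]                     # initial state distribution
```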
For a finite horizon, the undiscounted expected cumulative reward (the expectation of the
sum of the rewards for all stages, introduced in Chapter 3) is commonly used as the
optimality criterion. The planning problem amounts to finding a tuple of policies,
called a joint policy, that maximizes the optimality criterion.
During execution, the agents are assumed to act based on their individual obser-
vations only and no additional communication is assumed. This does not mean that
Dec-POMDPs cannot model settings which concern communication. For instance,
if one agent has an action “mark blackboard” and the other agent has an observa-
tion “mark on blackboard”, the agents have a mechanism of communication through
the state of the environment. However, rather than making this communication ex-
plicit, we say that the Dec-POMDP can model communication implicitly through
the actions, states and observations. This means that in a Dec-POMDP, communi-
cation has no special semantics. Section 8.3 further elaborates on communication in
Dec-POMDPs.
Note that, as in other planning models (and in contrast to what is usual in rein-
forcement learning), in a Dec-POMDP, the agents are assumed not to observe the
immediate rewards. Observing the immediate rewards could convey information
regarding the true state that is not present in the received observations; this is
undesirable, as all information available to the agents should be modeled in the
observations. When planning for Dec-POMDPs, the only thing that matters is the
expectation of the cumulative future reward, which is available in the offline plan-
ning phase, not the actual reward obtained. It is not even assumed that the actual
reward can be observed at the end of the episode. If rewards are to be observed, they
should be made part of the observation.
2.3 Example Domains

2.3.1 Dec-Tiger
We will consider the decentralized tiger (Dec-Tiger) problem [Nair et al., 2003c]—
a frequently used Dec-POMDP benchmark—as an example. It concerns two agents
that are standing in a hallway with two doors. Behind one door, there is a treasure
and behind the other is a tiger, as illustrated in Figure 2.5.
The state describes which door the tiger is behind—left (sl ) or right (sr ), each
occurring with 0.5 probability (i.e., the initial state distribution b0 is uniform). Each
agent can perform three actions: open the left door (aOL ), open the right door (aOR )
or listen (aLi ). Clearly, opening the door to the treasure will yield a reward (+10),
but opening the door to the tiger will result in a severe penalty (−100). A greater
reward (+20) is given for both agents opening the correct door at the same time.
As such, a good strategy will probably involve listening first. The listen actions,
however, also have a minor cost (a negative reward of −1). The full reward model
is shown in Table 2.1.
a            sl      sr
aLi, aLi     −2      −2
aLi, aOL     −101    +9
aLi, aOR     +9      −101
aOL, aLi     −101    +9
aOL, aOL     −50     +20
aOL, aOR     −100    −100
aOR, aLi     +9      −101
aOR, aOL     −100    −100
aOR, aOR     +20     −50

Table 2.1: The reward model for Dec-Tiger.
At every stage the agents get an observation: they can either hear the tiger behind
the left (oHL ) or right (oHR ) door, but each agent has a 15% chance of hearing it
incorrectly (getting the wrong observation), which means that there is only a prob-
ability of 0.85 · 0.85 ≈ 0.72 that both agents get the correct observation. Moreover,
the observation is informative only if both agents listen; if either agent opens a door,
both agents receive an uninformative (uniformly drawn) observation and the prob-
lem resets to sl or sr with equal probability. At this point the problem just continues,
such that the agents may be able to open the door to the treasure multiple times.
Also note that, since the only two observations the agents can get are oHL and oHR ,
the agents have no way of detecting that the problem has been reset: if one agent
opens the door while the other listens, the other agent will not be able to tell that
the door was opened. The full transition and observation models are listed in Table 2.2.
a            sl → sl   sl → sr   sr → sr   sr → sl
aLi, aLi     1.0       0.0       1.0       0.0
otherwise    0.5       0.5       0.5       0.5

(a) Transition probabilities.

                  sl                 sr
a            oHL      oHR      oHL      oHR
aLi, aLi     0.85     0.15     0.15     0.85
otherwise    0.5      0.5      0.5      0.5

(b) Individual observation probabilities.

Table 2.2: The transition (a) and observation (b) model for Dec-Tiger.
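To show how such a small benchmark looks when written down explicitly, here is a sketch (ours, not from the book) that encodes the Dec-Tiger rewards, transitions and individual observation probabilities from Tables 2.1 and 2.2 as plain Python data. Under the usual assumption that the two agents' observations are drawn independently given the joint action and next state, the joint observation probability is the product of the two individual probabilities.

```python
states = ["sl", "sr"]             # tiger behind left / right door
actions = ["aLi", "aOL", "aOR"]   # listen, open left, open right
observations = ["oHL", "oHR"]     # hear tiger left / right

# Joint rewards R(joint_action, state) from Table 2.1.
R = {
    ("aLi", "aLi"): {"sl": -2,   "sr": -2},
    ("aLi", "aOL"): {"sl": -101, "sr": +9},
    ("aLi", "aOR"): {"sl": +9,   "sr": -101},
    ("aOL", "aLi"): {"sl": -101, "sr": +9},
    ("aOL", "aOL"): {"sl": -50,  "sr": +20},
    ("aOL", "aOR"): {"sl": -100, "sr": -100},
    ("aOR", "aLi"): {"sl": +9,   "sr": -101},
    ("aOR", "aOL"): {"sl": -100, "sr": -100},
    ("aOR", "aOR"): {"sl": +20,  "sr": -50},
}

def transition_prob(s, joint_a, s_next):
    """T(s_next | s, joint_a): the state persists under <listen, listen>,
    otherwise the problem resets uniformly (Table 2.2a)."""
    if joint_a == ("aLi", "aLi"):
        return 1.0 if s_next == s else 0.0
    return 0.5

def observation_prob(joint_a, s_next, o_i):
    """Individual observation probability for one agent (Table 2.2b):
    85% correct hearing under <listen, listen>, uniform otherwise."""
    if joint_a == ("aLi", "aLi"):
        correct = "oHL" if s_next == "sl" else "oHR"
        return 0.85 if o_i == correct else 0.15
    return 0.5
```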
Dec-POMDPs can also be used for the high-level formalization of multirobot tasks.
In fact, several benchmark problems are motivated by coordination in robotics. Here
we will briefly describe two of them: Recycling Robots and Cooperative Box Pushing.
Recycling Robots This is the problem described in Section 1.1 and can be rep-
resented as a Dec-POMDP in a natural way. The states, S, consist of the different
locations of each robot, their battery levels and the different amounts of trash in the
cans. The actions, Ai , for each robot consist of movements in different directions
as well as decisions to pick up a trash can or recharge the battery (when in range
of a can or a charging station). Recall that large trash cans can only be picked up
by two agents jointly. The observations, Oi , of each robot consist of its own battery
level, its own location, the locations of other robots in sensor range and the amount
of trash in cans within range. The rewards, R, could consist of a large positive value
for a pair of robots emptying a large (full) trash can, a small positive value for a
single robot emptying a small trash can and negative values for a robot depleting its
battery or a trash can overflowing. An optimal solution is a joint policy that leads
to the expected behavior (given that the rewards are properly specified). That is, it
ensures that the robots cooperate to empty the large trash cans when appropriate and
the small ones individually while considering battery usage.
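As a small illustration of how such a reward specification might be written down, the sketch below uses hypothetical constants chosen only to reflect the relative magnitudes described above (a large bonus for jointly emptying a big can, a smaller one for a small can, penalties for depleted batteries and overflowing cans); none of these numbers come from the book.

```python
def recycling_reward(num_large_emptied, num_small_emptied,
                     num_depleted_batteries, num_overflowing_cans):
    """Hypothetical reward for Recycling Robots; constants are illustrative only."""
    return (10.0 * num_large_emptied        # emptied jointly by a pair of robots
            + 2.0 * num_small_emptied       # emptied by a single robot
            - 5.0 * num_depleted_batteries  # penalty for running out of battery
            - 3.0 * num_overflowing_cans)   # penalty for letting a can overflow
```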
Box Pushing The Cooperative Box Pushing domain was introduced by
Seuken and Zilberstein [2007b] and is a larger two-robot benchmark. Also in this
domain the agents are situated in a grid world, but now they have to collaborate to
move boxes in this world. In particular, there are small boxes that can be moved
by one agent, and big boxes that the agents have to push together. Each agent has
four actions: turn left, turn right, move forward and stay, and five observations that
describe the grid position in front of the agent: empty, wall, other agent, small box,
large box.
Fig. 2.6: The Cooperative Box Pushing domain, in which boxes b1, b2 and b3 must be pushed into the goal area.
Fig. 2.7: A sensor network for intrusion detection. Scanning overlapping areas in-
creases the chance of detection, but sensor nodes should also try to preserve power.
Sensor networks have also been modeled as Dec-POMDPs [Nair et al., 2005,
Marecki et al., 2008]. For instance, consider the setting illustrated in Figure 2.7.
Here, a network of sensors needs to coordinate to maximize the chance of detecting
intruders, while minimizing power usage. The intruders navigate through the plane covered by the sensor network.
2.4 Special Cases, Generalizations and Related Models

Because solving Dec-POMDPs is complex (as will be discussed in the next chapter),
much research has focused on special cases of Dec-POMDPs. This section briefly
treats a number of special cases that have received considerable attention. For a more
comprehensive overview of all the special cases, the reader is referred to the articles
by Pynadath and Tambe [2002], Goldman and Zilberstein [2004] and Seuken and
Zilberstein [2008]. Additionally, we give a description of the partially observable
stochastic game, which generalizes the Dec-POMDP, and the interactive POMDP,
which is a related framework but takes a subjective perspective.
A different family of special cases focuses on using properties that the transition,
observation and reward function might exhibit in order to both compactly repre-
sent and efficiently solve Dec-POMDP problems. The core idea is to consider the
states and transition, observation and reward functions not as atomic entities, but as
consisting of a number of factors, and explicitly representing how different factors
affect each other.
For instance, in the case of a sensor network, the observations of each sensor
typically depend only on its local environment. Therefore, it can be possible to rep-
resent the observation model more compactly as a product of smaller observation
functions, one for each agent. In addition, since in many cases the sensing costs are
local and sensors do not influence their environment, there is likely special structure
in the reward and transition function.
A large number of models that exploit factorization have been proposed, such as
transition- and observation-independent Dec-MDPs [Becker et al., 2003],
ND-POMDPs [Nair et al., 2005], factored Dec-POMDPs [Oliehoek et al., 2008c],
and many others [Becker et al., 2004a, 2005, Shen et al., 2006, Spaan and Melo,
2008, Varakantham et al., 2009, Mostafa and Lesser, 2009, Witwicki and Durfee,
2009, 2010b, Mostafa and Lesser, 2011a,b, Witwicki et al., 2012]. Some of these
will be treated in more detail in Chapter 8. In the remainder of this section, we give
an overview of a number of different forms of independence that can arise.
We will discuss factorization in the context of Dec-MDPs, but similar factorization
can be done in full Dec-POMDPs. In an agent-wise factored Dec-MDP, the state
consists of a local state component for each agent, s = ⟨s1, . . . , sn⟩.4
For example, consider an agent navigation task where the agents are located in
positions in a grid and the goal is for all agents to navigate to a particular grid cell.
In such a task, an agent’s local state, si , might consist of its location in a grid. Next,
we identify some properties that an agent-wise factored Dec-MDP might possess.
An agent-wise factored Dec-MDP is said to be locally fully observable if each
agent fully observes its own state component. For instance, if each agent in the
navigation problem can observe its own location the state is locally fully observable.
A factored, n-agent Dec-MDP is said to be transition-independent if the state
transition probabilities factorize as follows:
    T(s′ | s, a) = ∏i Ti(s′i | si, ai).     (2.4.1)

Here, Ti(s′i | si, ai) represents the probability that the local state of agent i transitions
from si to s′i after executing action ai. For instance, a robot navigation task
is transition-independent if the robots never affect each other (i.e., they do not
bump into each other when moving and can share the same grid cell). On the other
hand, R ECYCLING ROBOTS (see Section 2.3.2) is not transition-independent. Even
though the movements are independent, the state cannot be factored into local com-
ponents for each agent: this would require an arbitrary assignment of small trash
cans to agents; moreover, no agent can deal with the large trash cans by itself.
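Concretely, a transition-independent model can be specified purely through the local transition functions Ti; a minimal sketch (our own) of recovering the joint probability from them follows. For the navigation example, local_T[i] would encode only robot i's (possibly noisy) movement on the grid.

```python
from math import prod

def joint_transition_prob(local_T, s, joint_a, s_next):
    """T(s' | s, a) = product over agents i of T_i(s'_i | s_i, a_i).

    local_T is a list with one function per agent, local_T[i](s_i, a_i, s_i_next);
    s, joint_a and s_next are tuples holding one local component per agent.
    """
    return prod(Ti(s[i], joint_a[i], s_next[i]) for i, Ti in enumerate(local_T))
```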
A factored, n-agent Dec-MDP is said to be observation-independent if the obser-
vation probabilities factorize as follows:

    O(o | a, s′) = ∏i Oi(oi | ai, s′i).

In the equation above, Oi(oi | ai, s′i) represents the probability that agent i receives
observation oi in state s′i after executing action ai. If the robots in the navigation
problem cannot observe each other (due to working in different locations or lack of
sensors), the problem becomes observation-independent.
A factored, n-agent Dec-MDP is said to be reward-independent if there is a
monotonically nondecreasing function f such that

    R(s, a) = f (R1(s1, a1), . . . , Rn(sn, an)).
4 Some factored models also consider an s0 component that is a property of the environment and
is not affected by any agent actions.
If this is the case, the global reward is maximized by maximizing local rewards. For
instance, additive local rewards,

    R(s, a) = ∑i Ri(si, ai),

satisfy this requirement.
In the discussion so far we have focused on models that, in the execution phase, are
truly decentralized: they model agents that select actions based on local observa-
tions. A different approach is to consider models that are centralized, i.e., in which
(joint) actions can be selected based on global information. Such global informa-
tion can arise due to either full observability or communication. In the former case,
each agent simply observes the same observation or state. In the latter case, we have
to assume that agents can share their individual observations over an instantaneous
and noise-free communication channel without costs. In either case, this allows the
construction of a centralized model.
For instance, under such communication, a Dec-MDP effectively reduces to a
multiagent Markov decision process (MMDP) introduced by Boutilier [1996].
In this setting a joint action can be selected based on the state without consider-
ing the history, because the state is Markovian and known by all agents. Moreover,
because each agent knows what the state is, there is an effective way to coordinate.
One can think of the situation as a regular MDP with a ‘puppeteer’ agent that selects
joint actions. For this ‘underlying MDP’ an optimal solution π∗ can be found effi-
ciently5 with standard dynamic programming techniques [Puterman, 1994]. Such a
solution π∗ = (δ0, . . . , δh−1) specifies a mapping from states to joint actions for each
stage, ∀t δt : S → A, and can be split into individual policies πi = (δi,0, . . . , δi,h−1)
with ∀t δi,t : S → Ai for all agents.
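For concreteness, here is a sketch (ours, not the book's) of finite-horizon value iteration for such an underlying MDP over joint actions; it computes the stage-dependent mapping δt from states to joint actions by backward induction. A dictionary-based model format is assumed purely for illustration. Each agent i can then execute its own part of the computed joint action, since the state is known to all agents.

```python
def solve_underlying_mdp(states, joint_actions, T, R, h):
    """Finite-horizon value iteration over joint actions.

    T[s][a][s2]: transition probability, R[s][a]: immediate reward,
    with a ranging over joint actions. Returns (policy, value), where
    policy[t][s] is the joint action delta_t(s).
    """
    V = {s: 0.0 for s in states}           # value after the final stage
    policy = [dict() for _ in range(h)]
    for t in reversed(range(h)):
        newV = {}
        for s in states:
            # Q-value of each joint action: immediate reward plus expected future value.
            q = {a: R[s][a] + sum(T[s][a][s2] * V[s2] for s2 in states)
                 for a in joint_actions}
            best = max(q, key=q.get)
            policy[t][s] = best
            newV[s] = q[best]
        V = newV
    return policy, V
```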
Similarly, adding broadcast communication to a regular Dec-POMDP results in
a multiagent POMDP (MPOMDP), which is a special type of POMDP. In this
MPOMDP, each agent can compute the joint belief: i.e., the probability distribu-
tion over states given the histories of joint actions and observations.
5 Solving an MDP is P-complete [Papadimitriou and Tsitsiklis, 1987], but the underlying MDP
of a Dec-POMDP still has size exponential in the number of agents. However, given the MMDP
representation for a particular (typically small) number of agents, the solution is efficient.
Definition 6 (Joint Belief). A joint belief is the probability distribution over states
induced by the initial state distribution b0 and the history of joint actions and obser-
vations:
    b(s) ≜ Pr(s | b0, a0, o1, a1, . . . , at−1, ot).

We will also write B = Δ(S) for the set of joint beliefs.
Since the MPOMDP is a POMDP, the computation of this joint belief can be
done incrementally using Bayes’ rule in exactly the same way as described in Sec-
tion 2.1.2.
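In terms of the belief-update sketch given in Section 2.1.2 above, the very same routine applies when a and o are taken to be the joint action and joint observation and the joint transition and observation models are supplied (again assuming our illustrative dictionary format):

```python
# b is the current joint belief; joint_a and joint_o are tuples with one entry per agent.
b_next = belief_update(b, joint_a, joint_o, states, T_joint, O_joint)
```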
Even though MPOMDPs are POMDPs and POMDPs are intractable to solve
(PSPACE-complete, Papadimitriou and Tsitsiklis 1987), solving an MPOMDP is
usually easier than solving a Dec-POMDP in practice. The solution of an MPOMDP
specifies a mapping from joint beliefs to joint actions for each stage, ∀t δt :
Δ(S) → A, and can be split into individual policies πi = (δi,0, . . . , δi,h−1) with
∀t δi,t : Δ(S) → Ai for all agents.
The attentive reader might wonder why we have not given a definition in terms of a
formal tuple for the MPOMDP framework. The reason is that this definition would
be identical to the definition of the Dec-POMDP given in Definition 2. That is, the
traditional definition of a Dec-POMDP presented in Section 2.2 is underspecified
since it does not include the specification of the communication capabilities of the
agents. We try to rectify this situation here.
In particular, we introduce a formalization of a more general class of multiagent
decision problems (MADPs) that will make more explicit all the constraints spec-
ified by its members. Specifically, it will make clearer what the decentralization
constraints are that the Dec-POMDP model imposes, and how the approach can be
generalized (e.g., to deal with different assumptions with respect to communica-
tion). We begin by defining the environment of the agents:
With the exception of the remaining two subsections, in this book we restrict
ourselves to collaborative models: a collaborative MME is an MME where all the
agents get the same reward: Ri = Rj for all agents i and j.
Definition 8 (Agent Model). A model for agent i is a tuple mi = ⟨Ii, Ii, Ai, Oi, Zi, πi, ιi⟩,
where
• Ii is the set of information states (ISs) (also internal states, or beliefs),
• Ii is the current internal state of the agent,
• Ai ,Oi are as before: the actions taken by / observations that the environment
provides to agent i,
• Zi is the set of auxiliary observations zi (e.g., from communication) available to
agent i,
• πi is a (stochastic) action selection policy πi : Ii → Δ(Ai),
• ιi is the (stochastic) information state function (or belief update function) ιi :
Ii × Ai × Oi × Zi → Δ(Ii).
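As a concrete (hand-built, not optimized) instance of this definition, the sketch below writes down a deterministic finite-state controller for one Dec-Tiger agent: the information states are controller nodes, πi maps each node to an action, and ιi is the node-transition function; Zi is empty here since no auxiliary observations are used.

```python
# Information states I_i: controller nodes of one Dec-Tiger agent.
info_states = ["n0", "nL", "nR", "openR", "openL"]
initial_info_state = "n0"                  # the internal state I_i at stage 0

# pi_i: deterministic action selection per information state.
policy = {"n0": "aLi", "nL": "aLi", "nR": "aLi",
          "openR": "aOR", "openL": "aOL"}

def update(info_state, action, observation, aux_observation=None):
    """iota_i: deterministic information-state update (Z_i unused).

    The agent listens until it hears the tiger on the same side twice in a row,
    then opens the opposite door and starts over.
    """
    if info_state == "n0":
        return "nL" if observation == "oHL" else "nR"
    if info_state == "nL":
        return "openR" if observation == "oHL" else "n0"
    if info_state == "nR":
        return "openL" if observation == "oHR" else "n0"
    return "n0"   # after opening a door, go back to listening
```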
This definition makes clear that the MME framework leaves the specification of
the auxiliary observations, information states, the information state function, as well
as the action selection policy unspecified. As such, the MME by itself is not enough
to specify a dynamical process. Instead, it is necessary to specify those missing
components for all agents. This is illustrated in Figure 2.9, which shows how a
dynamic multiagent system (in this case, a Dec-POMDP, which we redefine below)
evolves over time. It makes clear that there is an environment component, the MME,
as well as an agent component that specifies how the agents update their internal
state, which in turn dictates their actions.6 It is only these two components together
that lead to a dynamical process.
Fig. 2.9: Illustration of the new perspective on the Dec-POMDP for the two-agent
case. The process is formed by an environment and an agent component that together
generate the interaction over time.
Clearly, once the MME and a fully specified agent component are brought to-
gether, we have a dynamical system: a somewhat more complicated Markov reward
process. The goal in formalizing these components, however, is that we want to op-
timize the behavior of the overall system. That is, we want to optimize the agent
component in such a way that the reward is maximized.
As such, we provide a perspective of a whole range of multiagent decision prob-
lems that can be formalized in this fashion. On the one hand, the problem designer
1) selects an optimality criterion, 2) specifies the MME, and 3) may specify a subset
of the elements of the agent component (which determines the ‘type’ of problem
that we are dealing with). On the other hand, the problem optimizer (e.g., a planning
method we develop) has as its goal to optimize the nonspecified elements of the
agent component in order to maximize the value as given by the optimality criterion.
In other words, we can think of a multiagent decision problem as the specification
of an MME together with a non-fully specified agent component.
Redefining Dec-POMDPs We can now redefine what a Dec-POMDP is by mak-
ing use of this framework of MADPs.
The goal for the problem optimizer for a Dec-POMDP is to specify the elements
of m that are not specified: {Ii}, {Ii,0}, {ιi}, {πi}. That is, the action selection poli-
cies need to be optimized and choices need to be made with respect to the represen-
tation and updating of information states. As we will cover in more detail in later
chapters, these choices are typically made differently in the finite and infinite hori-
zon case: internal states are often represented as nodes in a tree (in the former case)
or as a finite-state controller (in the latter case) for each agent.7
Defining MPOMDPs Now, we can also give a more formal definition of an
MPOMDP. As we indicated at the start of this section, an MPOMDP cannot be
discriminated from a Dec-POMDP on the basis of what we now call the MME. In-
stead, it differs from a Dec-POMDP only in the partial specification of the agent
component. This is illustrated in Figure 2.10.
7 The optimality criterion selected by the problem designer also is typically different depending on
the horizon: maximizing the undiscounted and discounted sum of cumulative rewards are typically
considered as the optimality criteria for the finite and infinite horizon cases, respectively. While this
does not change the task of the problem optimizer, it can change the methods that can be employed
to perform this optimization.
Fig. 2.10: The agent component for an MPOMDP. The agents share their individual
observations via communication and therefore can maintain the same internal state.
In particular, given the history of joint actions and joint observations, each agent can
compute the joint belief I1,t = I2,t = bt .
In particular, the set of internal states of the agents is the set of joint beliefs, which
allows us to give a formal definition of the MPOMDP in terms of this framework.
We see that in an MPOMDP many more elements of the agent component are
specified. In particular, only the action selection policies {πi } that map from internal
states (i.e., joint beliefs) to individual actions need to be specified.
The Dec-POMDP is a very general model in that it deals with many types of un-
certainty and multiple agents. However, it is only applicable to cooperative teams
of agents, since it only specifies a single (team) reward. The generalization of the
Dec-POMDP is the partially observable stochastic game (POSG). It has the same
components as a Dec-POMDP, except that it specifies not a single reward function,
but a collection of reward functions, one for each agent. This means that a POSG
assumes self-interested agents that want to maximize their individual expected cu-
mulative reward.
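To make this purely representational difference concrete, here is a minimal Python sketch (names and data layout are illustrative only, not the book's formal definition): the dynamics, observations, horizon and initial state distribution are shared, and the only change from a Dec-POMDP to a POSG is that the single team reward function becomes one reward function per agent.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]
JointObs = Tuple[str, ...]

@dataclass
class MMEDynamics:
    """Shared ingredients: agents, states, joint-action dynamics and observations."""
    agents: List[str]
    states: List[State]
    transition: Callable[[State, JointAction], Dict[State, float]]
    observation: Callable[[JointAction, State], Dict[JointObs, float]]
    horizon: int
    b0: Dict[State, float]

@dataclass
class DecPOMDP(MMEDynamics):
    reward: Callable[[State, JointAction], float]            # single team reward

@dataclass
class POSG(MMEDynamics):
    rewards: List[Callable[[State, JointAction], float]]     # one reward per agent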
The consequence of this is that we arrive in the field of game theory: there is
no longer an optimal joint policy, simply because optimality is no longer defined.
Rather, the joint policy should be a (Bayesian) Nash equilibrium, and preferably a
Pareto-optimal one.8 However, there is no clear way to identify the best one. Moreover, such a Pareto-optimal NE is only guaranteed to exist in randomized policies
over, such a Pareto optimal NE is only guaranteed to exist in randomized policies
(for a finite POSG), which means that it is no longer possible to perform brute-force
policy evaluation (see Section 3.4). Also search methods based on alternating max-
imization (see Section 5.2.1) are no longer guaranteed to converge for POSGs. The
dynamic programming method proposed by Hansen et al. [2004], covered in Sec-
tion 4.1.2, does apply to POSGs: it finds the set of nondominated policies for each
agent.
Even though the consequences of switching to self-interested agents are severe
from a computational perspective, from a modeling perspective the Dec-POMDP
and POSG framework are very similar. In particular all dynamics with respect to
transitions and observations are identical, and therefore computation of probabilities
of action-observation histories and joint beliefs transfers to the POSG setting. As
such, even though solution methods presented in this book may not transfer directly
to the POSG case, the modeling aspect largely does. For instance, the conversion
of a Dec-POMDP to a type of centralized model (covered in Section 4.3) can be
transferred to the POSG setting [Wiggers et al., 2015].
8 Explanations of these concepts as well as other concepts in this section can be found in, for
example the texts by Binmore [1992], Osborne and Rubinstein [1994] and Leyton-Brown and
Shoham [2008].
The simplest approach is to try and model the decision making process of the
protagonist agent as a POMDP by simply ignoring other agents, and treating their
influence on the transitions and observations as noise. This approximation has the
drawback that the resulting policy is generally of lower value than the true decentralized optimum. Moreover, it cannot deal
with nonstationarity of the influence of other agents; in many settings the behavior
of other agents can change over time (e.g., as the result of changes to their beliefs).
A more sophisticated approach is to have the protagonist agent maintain explicit
models of the other agents in order to better predict them. This is the approach cho-
sen in the recursive modeling method (RMM) [Gmytrasiewicz and Durfee, 1995,
Gmytrasiewicz et al., 1998], which presents a stateless game framework, and the in-
teractive POMDP (I-POMDP) framework [Gmytrasiewicz and Doshi, 2005], which
extends this approach to sequential decision problems with states and observations.
Fig. 2.11: Schematic representation of an I-POMDP. The agent reasons about the
joint state of the environment and the other agent(s).
Definition 12. Formally, an interactive POMDP (I-POMDP) of agent i is a tuple
⟨Ŝi, A, Ti, Ri, Oi, Oi, h⟩, where the components are analogous to those of a POMDP,
except that the interactive states Ŝi augment the world states with models of the
other agents [Gmytrasiewicz and Doshi, 2005]. The intuition is that, in order to
predict the actions of the other agents, agent i uses the probabilities Pr(a_j | θ_j),
for all j, given by the model m_j.
An interesting case occurs when considering so-called intentional models: i.e.,
when assuming the other agent also uses an I-POMDP. In this case, the formal def-
inition of I-POMDPs as above leads to an infinite hierarchy of beliefs, because an
I-POMDP for agent i defines its belief over models and thus types of other agents,
which in turn define a belief over the type of agent i, etc. In response to this phe-
nomenon, Gmytrasiewicz and Doshi [2005] define finitely nested I-POMDPs. Here,
a 0th-level belief for agent i, bi,0 , is a belief over world states S. A kth-level belief
bi,k is defined over world states and models consisting of types that admit beliefs of
(up to) level k − 1. The actual number of levels that the finitely nested I-POMDP
admits is called the strategy level.
Chapter 3
Finite-Horizon Dec-POMDPs
An optimality criterion defines exactly what we (i.e., the problem optimizer from
Section 2.4.4) want to optimize. In particular, a desirable sequence of joint actions
should correspond to a high ‘long-term’ reward, formalized as the return.
The most common criterion is the expected cumulative reward,

E[ ∑_{t=0}^{h−1} R(s_t, a_t) ],     (3.1.2)

where the expectation refers to the expectation over sequences of states and executed joint actions. The planning problem is to find a conditional plan, or policy, for
each agent to maximize the optimality criterion. Because the rewards depend on the
actions of all agents, the agents need to jointly optimize their policies. A closely
related criterion is the discounted expected cumulative reward,

DECR = E[ ∑_{t=0}^{h−1} γ^t R(s_t, a_t) ],     (3.1.3)

where 0 < γ ≤ 1 is the discount factor. Discounting gives higher priority to rewards
that are obtained sooner, which can be desirable in some applications. This can be
thought of in financial terms, where money now is worth more than money received
in the future. Discounting is also used to keep the optimality criterion bounded in
infinite horizon problems, as we discuss in Chapter 6. Note that the regular (undis-
counted) expected cumulative reward is the special case with γ = 1.
In an MDP, the agent uses a policy that maps states to actions. In selecting its ac-
tion, an agent can ignore the history (of states) because of the Markov property. In
a POMDP, the agent can no longer observe the state, but it can compute a belief b
that summarizes the history and is also a Markovian signal. In a Dec-POMDP, how-
ever, during execution each agent will only have access to its individual actions and
observations and there is no method known to summarize this individual history. It
is not possible to maintain and update an individual belief in the same way as in a
POMDP, because the transition and observation function are specified in terms of
joint actions and observations.
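To see where the difficulty comes from, compare with the single-agent POMDP belief update, sketched below (a minimal illustration in Python, not code from the book): the update requires the transition and observation probabilities, which in a Dec-POMDP are only defined for joint actions and joint observations that an individual agent does not get to see.

import numpy as np

def pomdp_belief_update(b, a, o, T, O):
    """Single-agent POMDP belief update via Bayes' rule.

    b : array of shape (|S|,), current belief over states
    a : action index, o : observation index
    T : array with T[a, s, s2] = Pr(s2 | s, a)
    O : array with O[a, s2, o] = Pr(o | a, s2)
    """
    # predict: Pr(s2 | b, a) = sum_s T[a, s, s2] b[s]
    predicted = b @ T[a]
    # correct: weight by the likelihood of the received observation and normalize
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

# In a Dec-POMDP, T and O are indexed by joint actions and joint observations,
# so agent i cannot perform this update from its individual a_i and o_i alone.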
In a Dec-POMDP, the agents do not have access to a Markovian signal during
execution. As such, there is no known statistic into which the problem optimizer can
compress the histories of actions and observations without sacrificing optimality.1
As a consequence, planning for Dec-POMDPs involves searching the space of joint
Dec-POMDP policies that map full-length individual histories to actions. We will
see later that this also means that solving Dec-POMDPs is even harder than solving
POMDPs.
3.2.1 Histories
1 When assuming slightly more information during planning, one approach is known to compress
the space of internal states: Oliehoek et al. [2013a] present an approach to lossless clustering of
individual histories. This, however, does not fundamentally change the representation of all the
internal states (as is done when, for example, computing a belief for a POMDP); instead only some
histories that satisfy a particular criterion are clustered together.
Notation for joint action histories and sets is analogous to that for observation
histories. Finally we note that, clearly, a (joint) AOH consists of a (joint) action and
a (joint) observation history: θ̄_t = ⟨ō_t, ā_t⟩.
3.2.2 Policies
A policy πi for an agent i maps from histories to actions. In the general case, these
histories are AOHs, since they contain all information an agent has. The number of
AOHs grows exponentially with the horizon of the problem: At time step t, there
are (|Ai | · |Oi |)t possible AOHs for agent i. A policy πi assigns an action to each of
these histories. As a result, the number of possible policies πi is doubly exponential
in the horizon.
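The doubly exponential growth is easy to make concrete. The short Python computation below (illustrative only) counts deterministic policies that map observation histories to actions, which, as discussed around Figure 3.1 below, is what matters for deterministic policies: there are |Oi|^t observation histories at stage t, and a policy assigns one action to each of them.

def num_individual_policies(n_actions: int, n_obs: int, horizon: int) -> int:
    """Number of deterministic policies for one agent that map observation
    histories to actions (one action per history of length 0, ..., horizon-1)."""
    n_histories = sum(n_obs ** t for t in range(horizon))
    return n_actions ** n_histories

def num_joint_policies(n_agents: int, n_actions: int, n_obs: int, horizon: int) -> int:
    return num_individual_policies(n_actions, n_obs, horizon) ** n_agents

# Dec-Tiger has 3 actions and 2 observations per agent; with 2 agents:
for h in range(1, 5):
    print(h, num_joint_policies(2, 3, 2, h))
# prints: 1 -> 9, 2 -> 729, 3 -> 4782969, 4 -> about 2.06e14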
Under a deterministic policy, only a subset of possible action-observation histo-
ries can be reached. This is illustrated by the left side of Figure 3.1, where the actions
selected by the policy are given as gray arrows and the two possible observations
are given as dashed arrows. Because one action will have probability 1 of being ex-
ecuted while all other actions will have probability 0, policies that only differ with
respect to an AOH that can never be reached result in the same behavior. Therefore, a deterministic policy can equivalently be specified as a mapping from observation histories to actions.
2 In a particular Dec-POMDP, it may be the case that not all of these histories can actually be
realized, because of the probabilities specified by the transition and observation model.
Fig. 3.1: A deterministic policy can be represented as a tree. Left: a tree of action-
observation histories θ̄i for one of the agents from the Dec-Tiger problem. A deter-
ministic policy πi is highlighted, showing that πi only reaches a subset of histories
θ̄i . Note that θ̄i that are not reached are not further expanded. Right: The same pol-
icy can be shown in a simplified policy tree. When both agents execute this policy
in the Dec-Tiger problem with h = 3, the joint policy is optimal.
In a deterministic policy, πi (θ̄i ) specifies the action for the observation his-
tory contained in (action-observation history) θ̄i. For instance, if θ̄i = ⟨ōi, āi⟩, then
πi(θ̄i) ≜ πi(ōi). We use π = ⟨π1, . . . , πn⟩ to denote a joint policy. We say that a de-
terministic joint policy is an induced mapping from joint observation histories to
joint actions π : Ō → A. That is, the mapping is induced by the individual policies πi
that make up the joint policy. We will simply write π(ō) ≜ ⟨π1(ō1), . . . , πn(ōn)⟩, but
note that this does not mean that π is an arbitrary function from joint observation
histories: the joint policy is decentralized so only a subset of possible mappings
f : Ō → A are valid (those that specify the same individual action for each ōi of
each agent i). This is in contrast to a centralized joint policy that would allow any
possible mapping from joint histories to action (implying agents have access to the
observation histories of all other agents).
Agents can also execute stochastic policies, but (with the exception of some parts
of Chapters 6 and 7) we will restrict our attention to deterministic policies without
sacrificing optimality, since a finite-horizon Dec-POMDP has at least one optimal
pure joint policy [Oliehoek et al., 2008b].
3.4 Value Functions for Joint Policies
Joint policies differ in how much reward they can expect to accumulate, which
serves as the basis for determining their quality. Formally, we are interested in the
value of the optimality criterion, the expected cumulative reward (3.1.2), that a joint
policy realizes. This quantity will be simply referred to as the joint policy’s value.
This expectation can be computed using a recursive formulation. For the last
stage t = h − 1, the value is given simply by the immediate reward,

V^π(s_{h−1}, ō_{h−1}) = R(s_{h−1}, π(ō_{h−1})),

while for all earlier stages

V^π(s_t, ō_t) = R(s_t, π(ō_t)) + ∑_{s_{t+1}∈S} ∑_{o_{t+1}∈O} Pr(s_{t+1}, o_{t+1} | s_t, π(ō_t)) V^π(s_{t+1}, ō_{t+1}),     (3.4.2)

where ō_{t+1} = (ō_t, o_{t+1}).
Here, the probability is simply the product of the transition and observation proba-
bilities: Pr(s′, o | s, a) = Pr(o | a, s′) · Pr(s′ | s, a). In essence, fixing the joint policy trans-
forms the Dec-POMDP to a Markov chain with states (st ,ōt ). Evaluating this equa-
tion via dynamic programming will result in the value for all (s0 ,ō0 )-pairs. The value
V (π) is then given by weighting these pairs according to the initial state distribution
b0 :
V(π) = ∑_{s_0∈S} b_0(s_0) V^π(s_0, ō_0).     (3.4.3)
(Remember that ō_0 = ⟨(), . . . , ()⟩ is the empty joint observation history, which is fixed.)
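The recursion above is straightforward to implement. The sketch below is an illustrative Python version (not from the book), assuming a tabular model T, O, R and a deterministic joint policy given as a function of the joint observation history.

from functools import lru_cache

def evaluate_joint_policy(policy, T, O, R, b0, horizon, states, joint_obs):
    """Exact evaluation of a deterministic joint policy, cf. (3.4.2)-(3.4.3).

    policy(joint_oh) -> joint action, for a joint observation history (a tuple)
    T[s][a][s2]      -> Pr(s2 | s, a)
    O[a][s2][o]      -> Pr(o | a, s2)
    R[s][a]          -> immediate reward
    b0[s]            -> initial state probability
    """
    @lru_cache(maxsize=None)
    def V(s, joint_oh):
        t = len(joint_oh)
        a = policy(joint_oh)
        value = R[s][a]
        if t == horizon - 1:                 # last stage: immediate reward only
            return value
        for s2 in states:
            for o in joint_obs:
                p = T[s][a][s2] * O[a][s2][o]
                if p > 0:
                    value += p * V(s2, joint_oh + (o,))
        return value

    return sum(b0[s] * V(s, ()) for s in states)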
Finally, as is apparent from the above equations, the probabilities of states and
histories are important in many computations. The following equation recursively
specifies the probabilities of states and joint AOHs under a (potentially stochastic)
joint policy:
Pr(s_t, θ̄_t | b_0, π) = ∑_{s_{t−1}∈S} Pr(s_t, o_t | s_{t−1}, a_{t−1}) Pr(a_{t−1} | θ̄_{t−1}, π) Pr(s_{t−1}, θ̄_{t−1} | b_0, π),     (3.4.4)

where θ̄_t = (θ̄_{t−1}, a_{t−1}, o_t).
3.5 Complexity
Table 3.1: The number of joint policies for different benchmark problems and hori-
zons.
To get an idea of what this means in practice, Table 3.1 lists the number of joint
policies for a number of benchmark problems. Clearly, approaches that exhaustively
search the space of joint policies have little chance of scaling beyond very small
problems. Unfortunately, the complexity result due to Bernstein et al. [2002] sug-
gests that, in the worst case, the complexity associated with such an exhaustive
approach might not be avoidable.
Theorem (Bernstein et al. [2002]). Optimally solving a finite-horizon Dec-POMDP with two or more agents is NEXP-complete.
Proof. The proof is by reduction from the TILING problem. See Bernstein et al.
[2002] for details.
NEXP is the class of problems that in the worst case take nondeterministic expo-
nential time. Nondeterministic means that, similarly to NP, solving these problems
requires generating a guess about the solution in a nondeterministic way. Expo-
nential time means that verifying whether the guess is a solution takes exponential
time. In practice this means that (assuming NEXP
= EXP) solving a Dec-POMDP
takes doubly exponential time in the worst case. Moreover, Dec-POMDPs cannot
be approximated efficiently: Rabinovich et al. [2003] showed that even finding
an ‘ε-approximate solution’ is NEXP-complete. That is, given some positive real
number ε, the problem of finding a joint policy that has a value V (π) such that
V(π) ≥ V(π^∗) − ε is also intractable. The infinite-horizon problem is undecidable, as will be discussed in Chapter 6.
Independence Complexity
Transitions, observations and rewards P-complete
Transitions and observations NP-complete
Any other subset NEXP-complete
Chapter 4
Exact Finite-Horizon Planning Methods
This chapter presents an overview of exact planning methods for finite-horizon Dec-
POMDPs. This means that these methods perform a search through the space of
joint policy trees. There are three main approaches to doing this: dynamic program-
ming, which will be treated in Section 4.1, heuristic search, which will be treated
in Section 4.2, and converting to a special case of single-agent POMDP, treated in
Section 4.3. Finally, a few other methods will be treated in Section 4.4.
Since policies can be represented as trees (remember Figure 3.1), a way to decom-
pose them is by considering subtrees. Define the time-to-go, τ, at stage t as
τ = h − t. (4.1.1)
Now qτi denotes a τ-stage-to-go subtree policy for agent i. That is, qτi is a policy tree
that has the same form as a full policy for the horizon-τ problem. Within the original
horizon-h problem qτi is a candidate for execution starting at stage t = h − τ. The set
of τ-stage-to-go subtree policies for agent i is denoted by Qτi . A joint subtree policy
qτ ∈ Qτ specifies a subtree policy for each agent: qτ = qτ1 , . . . ,qτn .
Figure 4.1 shows different structures in a policy for a fictitious Dec-POMDP with
h = 3. This full policy also corresponds to a 3-stage-to-go subtree policy q3i ; two of
the subtree policies are indicated using dashed ellipses.
Fig. 4.1: Structure of a policy for an agent with actions {a,ȧ} and observations
{o,ȯ}. A policy πi can be divided into decision rules δi (which are introduced in
Section 4.2.1) or subtree policies qi .
Subtree policies exactly correspond to the notion of ‘policies that other agents
will execute in the future’ mentioned in Section 3.3, and allow us to formally define
multiagent beliefs: a multiagent belief for agent i is a probability distribution over
states and the subtree policies of the other agents, b_i ∈ Δ(S × Q^τ_{−i}).
This enables an agent to determine its best individual subtree policy at a multiagent
belief, i.e., a multiagent belief is a sufficient statistic for agent i to optimize its
policy. In addition, a multiagent belief is sufficient to predict the next multiagent
belief (given a set of policies for the other agents): an agent i, after performing
ai,t and receiving oi,t+1 , can maintain a multiagent belief via Bayes’ rule. Direct
substitution in (2.1.2) yields:
∀ s_{t+1}, q^{τ−1}_{−i}:   b_{i,t+1}(s_{t+1}, q^{τ−1}_{−i}) = (1 / Pr(o_{i,t+1} | b_{i,t}, a_{i,t})) ∑_{s_t, q^τ_{−i}} b_i(s_t, q^τ_{−i}) Pr(s_{t+1}, q^{τ−1}_{−i}, o_{i,t+1} | s_t, q^τ_{−i}, a_{i,t}),

where the transition and observation probabilities are the result of marginalizing
over the observations that the other agents could have received:

Pr(s_{t+1}, q^{τ−1}_{−i}, o_{i,t+1} | s_t, q^τ_{−i}, a_{i,t}) = ∑_{o_{−i,t+1}} Pr(s_{t+1}, o_{t+1} | s_t, a_t) 1_{ {q^{τ−1}_{−i} = q^τ_{−i} ↓ o_{−i,t+1}} },

where a_t combines a_{i,t} with the actions specified at the roots of q^τ_{−i}, o_{t+1} = ⟨o_{i,t+1}, o_{−i,t+1}⟩, and q^τ_{−i} ↓ o_{−i,t+1} denotes the joint subtree policy that q^τ_{−i} selects after joint observation o_{−i,t+1}.
The core idea of DP is to incrementally construct sets of longer subtree policies for
the agents: starting with a set of one-stage-to-go (τ = 1) subtree policies (actions)
that can be executed at the last stage, construct a set of two-step policies to be
executed at h − 2, etc. That is, DP constructs Q1i ,Q2i , . . . ,Qhi for all agents i. When
the last backup step is completed, the optimal policy can be found by evaluating
all induced joint policies π ∈ Qh1 × · · · × Qhn for the initial belief b0 as described in
Section 3.4.
Fig. 4.2: Policy construction in MAA* (discussed in Section 4.2.2 and shown left)
and dynamic programming (shown right). The figure shows how policies are con-
structed for an agent with two actions a, ȧ and two observations o, ȯ. Dashed com-
ponents are newly generated, dotted components result from the previous iteration.
DP formalizes this idea using backup operations that construct Q^{τ+1}_i from Q^τ_i.
For instance, the right side of Figure 4.2 shows how q^3_i, a three-stage-to-go subtree
policy, is constructed from two q^2_i ∈ Q^2_i. In general, a one-step extended policy q^{τ+1}_i
is created by selecting a subtree policy for each observation and an action for the
root. An exhaustive backup generates all possible q^{τ+1}_i that have policies from the
previously generated set q^τ_i ∈ Q^τ_i as their subtrees. We will denote the sets of subtree
policies resulting from exhaustive backup for each agent i by Q^{τ+1}_{e,i}.
Unfortunately, the exhaustive backup has an exponential complexity: if an agent
has |Q^τ_i| k-step trees, |A_i| actions, and |O_i| observations, there will be

|Q^{τ+1}_{e,i}| = |A_i| |Q^τ_i|^{|O_i|}

(k + 1)-step trees. This means that the sets of subtree policies maintained grow dou-
bly exponentially with k. This makes sense: since the q^τ_i are essentially full policies
for the horizon-k problem, their number must be doubly exponential in k.
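The growth is easy to see numerically; the illustrative snippet below simply iterates the recurrence |Q^{τ+1}_{e,i}| = |A_i| |Q^τ_i|^{|O_i|}, starting from |Q^1_i| = |A_i|.

def exhaustive_backup_sizes(n_actions: int, n_obs: int, max_tau: int):
    """Sizes |Q^tau_{e,i}| produced by repeated exhaustive backups."""
    sizes = [n_actions]                      # tau = 1: the individual actions
    for _ in range(max_tau - 1):
        sizes.append(n_actions * sizes[-1] ** n_obs)
    return sizes

# Dec-Tiger (3 actions, 2 observations per agent):
print(exhaustive_backup_sizes(3, 2, 4))      # [3, 27, 2187, 14348907]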
To counter this source of intractability, it is possible to prune dominated subtree
policies from Qτe,i , resulting in smaller maintained sets Qτm,i . As indicated by (4.1.4),
the value of a q^τ_i depends on the multiagent belief. Therefore, a q^τ_i is dominated if
it is not maximizing at any point in the multiagent belief space: the simplex Δ(S × Q^τ_{−i}).
variables: υ and x_{q_{−i},s}
maximize: υ
subject to:
  ∑_{q_{−i},s} x_{q_{−i},s} V(s, q_i, q_{−i}) ≥ ∑_{q_{−i},s} x_{q_{−i},s} V(s, q̂_i, q_{−i}) + υ    ∀ q̂_i
  ∑_{q_{−i},s} x_{q_{−i},s} = 1,   x_{q_{−i},s} ≥ 0    ∀ q_{−i}, s
Fig. 4.3: The linear program (LP) to test for dominance. The LP determines if agent
i's subtree policy q_i is dominated, by trying to find a multiagent belief point (en-
coded by the variables x_{q_{−i},s}) where the value of q_i is higher (by υ) than that of any
other subtree policy q̂_i (enforced by the constraints on the first line). If at the op-
timal solution υ is nonpositive, q_i is not the best subtree policy at any point in the
multiagent belief space and can be pruned. The constraints on the second line simply
guarantee that the variables encode a valid multiagent belief.
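The dominance test maps directly onto an off-the-shelf LP solver. Below is a small illustrative sketch using scipy.optimize.linprog; the function names and the data layout (a value table V[q][p] indexed by agent i's subtree policies and belief points p = (q_{-i}, s)) are assumptions for this example, not the book's implementation.

import numpy as np
from scipy.optimize import linprog

def is_dominated(qi, candidates, V, points, eps=1e-9):
    """Dominance test for subtree policy qi (cf. Figure 4.3).

    candidates : all subtree policies of agent i (including qi)
    V[q][p]    : value of subtree policy q at point p = (q_{-i}, s)
    points     : list of all (q_{-i}, s) pairs indexing the belief variables x
    Returns True if qi is maximizing at no multiagent belief point.
    """
    n = len(points)
    # variables: [x_1, ..., x_n, upsilon]; linprog minimizes, so minimize -upsilon
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # constraint per q_hat:  sum_p x_p (V[qi][p] - V[q_hat][p]) - upsilon >= 0,
    # rewritten for linprog (A_ub @ z <= b_ub):
    #                        sum_p x_p (V[q_hat][p] - V[qi][p]) + upsilon <= 0
    A_ub, b_ub = [], []
    for q_hat in candidates:
        if q_hat is qi:
            continue
        A_ub.append([V[q_hat][p] - V[qi][p] for p in points] + [1.0])
        b_ub.append(0.0)
    # x must be a probability distribution over the points
    A_eq, b_eq = [[1.0] * n + [0.0]], [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    upsilon = -res.fun
    return upsilon <= eps      # no point where qi is strictly best: prune it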
Finally, the algorithm constructs all the candidate joint policies that consist of nondominated individual poli-
cies, and subsequently the best one is selected out of this set.
Note that in the main (‘policy growing’) loop of the algorithm, the stored values
V (st ,qτ ) are left implicit, even though they form a crucial part of the algorithm; in
reality the exhaustive backup provides both one-step-longer individual policies, as
well as their values V (st ,qτ ), and those values are subsequently used in the pruning
step. Additionally, it is worthwhile to know that this algorithm was proposed in the
context of finding nondominated joint policies for POSGs. For more details, we
refer the reader to the original paper by Hansen et al. [2004].
In practice, the pruning step in DP often is not able to sufficiently reduce the
maintained sets to make the approach tractable for larger problems. However, the
idea of point-based dynamic programming formed the basis for a heuristic method,
which will be discussed in Section 5.2.2, that has achieved some empirical success
in finding good policies for problems with very long horizons.
Policies specify actions for all stages of the Dec-POMDP. A common way to repre-
sent the temporal structure in a policy is to split it into decision rules δi that specify
the policy for each stage. An individual policy is then represented as a sequence of
decision rules πi = (δi,0 , . . . ,δi,h−1 ). Decision rules are indicated by dotted ellipses
in Figure 4.1.
In the case of a deterministic policy, the form of the decision rule for stage t is a
mapping from length-t observation histories to actions δi,t : Ōi,t → Ai . In the more
general case its domain is the set of AOHs, δ_{i,t} : Θ̄_{i,t} → A_i. A joint decision rule
δ_t = ⟨δ_{1,t}, . . . , δ_{n,t}⟩ specifies a decision rule for each agent.
We will also consider policies that are partially specified with respect to time.
Formally, ϕ t = (δ 0 , . . . ,δ t−1 ) denotes the past joint policy at stage t, which is a
partial joint policy specified for stages 0,...,t − 1. By appending a joint decision rule
for stage t, we can ‘grow’ such a past joint policy.
Definition 21 (Policy Concatenation). We write ϕ_{t+1} = ϕ_t ◦ δ_t = (δ_0, . . . , δ_{t−1}, δ_t) for the concatenation of a past joint policy ϕ_t and a joint decision rule δ_t for stage t.
Figure 4.1 shows a past policy ϕi,2 and illustrates how policy concatenation ϕi,2 ◦
δi,2 = πi forms the full policy.
4.2.2 Multiagent A*
Multiagent A* (MAA∗) [Szer et al., 2005] performs an A*-like search over past joint policies: it maintains an open list of partially specified joint policies ϕ_t, ranked by an optimistic heuristic value, and repeatedly selects the highest-ranked node and expands it by appending joint decision rules for the next stage (the left side of Figure 4.2 illustrates this top-down policy construction). Nodes whose heuristic value does not exceed the value of the best fully specified joint policy found so far can be pruned, and the search continues until the list becomes empty, at which point an optimal fully specified joint policy has
been found.
In order to compute a node's heuristic value V̂(ϕ_t), MAA∗ takes V^{0...t−1}(ϕ_t), the
actual expected reward over the first t stages that are specified, and adds V̂^{t...h−1}(ϕ_t),
a heuristic value for the remaining h − t stages. A typical way to specify V̂^{t...h−1}(ϕ_t)
is to use the value function of the underlying MDP [Szer et al., 2005]. That is, we
can pretend that the Dec-POMDP is an MDP (by ignoring the observations and
the decentralization requirement), compute the (nonstationary) value function
V_MDP(s_t), and use the expectation of V_MDP over the states reached by ϕ_t as the heuristic for the remaining stages. As long as the heuristic is admissible (i.e., an overestimation, which V_MDP is), MAA∗ is guaranteed to find an optimal joint policy.
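A compact way to think about MAA∗ is as standard A∗ over past joint policies. The following Python skeleton is illustrative only: expand, heuristic and evaluate are assumed helper functions (not from any library), and past joint policies are assumed to expose a stage attribute.

import heapq

def maa_star(root, expand, heuristic, evaluate, horizon):
    """A*-style search over past joint policies (MAA*, Section 4.2.2).

    root           : the empty past joint policy (stage 0)
    expand(phi)    : all one-stage-longer past joint policies phi ∘ delta_t
    heuristic(phi) : optimistic value V̂(phi) = V^{0..t-1}(phi) + V̂^{t..h-1}(phi)
    evaluate(pi)   : exact value V(pi) of a fully specified joint policy
    """
    best_pi, best_value = None, float("-inf")
    open_list = [(-heuristic(root), 0, root)]          # max-heap via negated keys
    counter = 1                                        # tie-breaker for the heap
    while open_list:
        neg_h, _, phi = heapq.heappop(open_list)
        if -neg_h <= best_value:
            break                                      # no node can improve: optimal
        for child in expand(phi):
            if child.stage == horizon:                 # fully specified joint policy
                value = evaluate(child)
                if value > best_value:
                    best_pi, best_value = child, value
            elif heuristic(child) > best_value:        # keep only promising nodes
                heapq.heappush(open_list, (-heuristic(child), counter, child))
                counter += 1
    return best_pi, best_value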
4.3 Converting to a Non-observable MDP

A different route to exact solutions is to reformulate the planning problem itself as a centralized sequential decision problem: a 'plan-time MDP' in which the states are the past joint policies ϕ_t, the actions are the joint decision rules δ_t, and Ř(ϕ_t, δ_t) denotes the expected reward accrued at stage t when δ_t is executed given ϕ_t. Given that we have just defined an MDP, we can write down its optimal value func-
tion:
V^∗_t(ϕ_t) = max_{δ_t} Q^∗_t(ϕ_t, δ_t),     (4.3.1)
where Q^∗ is defined as

Q^∗_t(ϕ_t, δ_t) = Ř(ϕ_t, δ_t)                                  for the last stage t = h − 1,
Q^∗_t(ϕ_t, δ_t) = Ř(ϕ_t, δ_t) + V^∗_{t+1}(ϕ_t ◦ δ_t)           otherwise.          (4.3.2)
This means that, via the notion of plan-time MDP, we have been able to write
down an optimal value function for the Dec-POMDP. It is informative to contrast
the formulation of an optimal value function here to that of the value function of a
particular policy as given by (3.4.2). Where the latter only depended on the history
of observations, the optimal value function depends on the entire past joint policy.
This means that, even though this optimal formulation admits a dynamic program-
ming algorithm, it is not helpful, as this (roughly speaking) boils down to brute-force
search through all joint policies [Oliehoek, 2010].
The problem in using the optimal value function defined by (4.3.1) is that it is too
big: the number of past joint policies is too large to be able to compute it for most
problems. However, it turns out that it is possible to replace the dependence on
the past joint policy by a so-called plan-time sufficient statistic: a distribution over
histories and states [Oliehoek et al., 2013a, Dibangoye et al., 2013]. This is useful,
since many past joint policies can potentially map to the same statistic, as indicated
in Figure 4.5.
Fig. 4.5: A hypothetical MAA* search tree based on plan-time sufficient statistics.
Two joint decision rules from the root node can map to the same σ1 , and two δ 1
(from different σ1 ) can lead to the same σ2 .
Definition 22 (Sufficient Statistic for Deterministic Past Joint Policies). The suf-
ficient statistic for a tuple (b0 ,ϕ t ), with ϕ t deterministic, is the distribution over joint
observation histories and states: σ_t(s_t, ō_t) ≜ Pr(s_t, ō_t | b_0, ϕ_t).
Such a statistic is sufficient to compute the expected immediate reward, as well as the next statistic (a function of σ_t and δ_t). Let ō_{t+1} = (ō_t, o_{t+1}); then the
updated statistic is given by

σ_{t+1}(s_{t+1}, ō_{t+1}) = U_{ss}(σ_t, δ_t) = ∑_{s_t} Pr(s_{t+1}, o_{t+1} | s_t, δ_t(ō_t)) σ_t(s_t, ō_t).     (4.3.3)
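As a concrete illustration, the update (4.3.3) can be implemented over a dictionary representation of σ_t; the sketch below uses illustrative names and the same tabular T and O as in the earlier sketches.

from collections import defaultdict

def update_statistic(sigma, delta, T, O, states, joint_obs):
    """Plan-time sufficient-statistic update, cf. (4.3.3).

    sigma       : dict mapping (joint_oh, s) -> probability
    delta(oh)   : joint decision rule mapping a joint observation history to a joint action
    T[s][a][s2] : Pr(s2 | s, a);  O[a][s2][o] : Pr(o | a, s2)
    """
    next_sigma = defaultdict(float)
    for (joint_oh, s), p in sigma.items():
        a = delta(joint_oh)
        for s2 in states:
            for o in joint_obs:
                p2 = p * T[s][a][s2] * O[a][s2][o]
                if p2 > 0:
                    next_sigma[(joint_oh + (o,), s2)] += p2
    return dict(next_sigma)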
This means that we can define the optimal value function for a Dec-POMDP as

V^∗_t(σ_t) = max_{δ_t} Q^∗_t(σ_t, δ_t),     (4.3.4)

where

Q^∗_t(σ_t, δ_t) = Ř(σ_t, δ_t)                                      for the last stage t = h − 1,
Q^∗_t(σ_t, δ_t) = Ř(σ_t, δ_t) + V^∗_{t+1}(U_{ss}(σ_t, δ_t))        otherwise.          (4.3.5)
Since potentially many ϕ t map to the same statistic σt , the above formulation
can enable a more compact representation of the optimal value function. Moreover,
4.3 Converting to a Non-observable MDP 51
it turns out that this value function satisfies the same property as POMDP value
functions: it is a piecewise-linear and convex (PWLC) function of the statistic σ_t.
The PWLC property of the optimal value function seems to imply that we are actu-
ally dealing with a kind of POMDP. This intuition is correct [Nayyar et al., 2011,
Dibangoye et al., 2013, MacDermed and Isbell, 2013]. In particular, it is possible
to make a reduction to a special type of POMDP: a non-observable MDP (a POMDP
with just one ‘NULL’ observation).
Since an NOMDP is a special case of a POMDP, all POMDP theory and solu-
tion methods apply. In particular, it should be clear that the belief in this plan-time
NOMDP corresponds exactly to the plan-time sufficient statistic from Definition 22.
Moreover, it can easily be shown that the optimal value function for this plan-time
NOMDP is identical to the formulation in equations (4.3.4) and (4.3.5).
Here we briefly describe two more methods for finite-horizon Dec-POMDPs. The
first, point-based dynamic programming, is an extension of dynamic programming
that tries to avoid work by only considering reachable multiagent beliefs. The sec-
ond directly tries to transform the Dec-POMDP problem to a mathematical pro-
gramming formulation. While these methods have been less effective on benchmark
problems than the heuristic search and conversion to NOMDP methods discussed
above, they present an insight into the problem and a basis for extensions.
4.4.1 Point-Based DP
The main problem in the scalability of exact dynamic programming, is that the set
of maintained subtree policies grows very quickly. DP only removes qτi that are not
maximizing at any point in the multiagent belief space. Point-based DP (PBDP)
[Szer and Charpillet, 2006] proposes improving pruning of the set Qτe,i by consid-
ering only a subset of reachable belief points B_i ⊂ Δ(S × Q^τ_{−i}). Only those q^τ_i that
maximize the value at some b_i ∈ B_i are kept.
In order to define reachable beliefs, we consider mappings Γ_j from observation
histories to subtree policies, Γ_{j,t} : Ō_{j,t} → Q^τ_j. Let Γ_{−i,t} = ⟨Γ_{j,t}⟩_{j≠i} be a mapping
induced by the individual Γ_j. Now we can define the multiagent belief point induced
by such a Γ_{−i,t} and distribution Pr(s_t, ō_{−i,t} | b_0, ϕ_t) as

b_i(s_t, q^τ_{−i}) = ∑_{ō_{−i,t} : Γ_{−i,t}(ō_{−i,t}) = q^τ_{−i}} Pr(s_t, ō_{−i,t} | b_0, ϕ_t).
4.4.2 Optimization
Aras and Dutech [2010] proposed a mixed integer linear programming formulation for the optimal solution of finite-
horizon Dec-POMDPs, based on representing the set of possible policies for each
agent in sequence form [Koller and Pfeffer, 1997]. In this representation, a policy for
an agent i is represented as a subset of the set of sequences (roughly corresponding
to action-observation histories) for the agent. As such the problem can be interpreted
as a combinatorial optimization problem—find the best subset of sequences—and
solved with a mixed integer linear program (MILP).
For the formulation of the MILP, we refer the reader to the original paper of
Aras and Dutech [2010]. We point out that, while the performance of the MILP-
based approach has not been competitive with the newer MAA* variants, the link
to optimization methods is an important one and may inspire future insights.
Chapter 5
Approximate and Heuristic Finite-Horizon
Planning Methods
The previous chapter discussed methods for exactly solving finite-horizon Dec-
POMDPs: i.e., methods that guarantee finding the optimal solution. While there
have been quite a few insights leading to better scalability, finding an optimal solu-
tion remains very challenging and is not possible for many larger problems. In an
effort to scale to these larger problems, researchers have considered methods that
sacrifice optimality in favor of better scalability. Such methods come in two flavors:
approximation methods and heuristic methods.
Approximation methods are not guaranteed to find the optimal solution, but have
bounds on their solution quality. While such guarantees are very appealing, the com-
plexity result by Rabinovich et al. [2003]—computing an ε-approximate joint pol-
icy is NEXP-complete; see Section 3.5—suggests that they may be either difficult to
obtain, or will suffer from similar scalability problems as exact methods do. Looser
bounds (e.g., probabilistic ones [Amato and Zilberstein, 2009]) are possible in some
cases, but they suffer from the same trade-off between tightness and scalability.
Heuristic methods, in contrast, do not provide quality guarantees, but can pro-
duce high-quality results in practice. A difficulty with such methods is that it is
unclear how close to optimal the solution is. This does not mean that such methods
have no merit: they may produce at least some result where exact or approximation
algorithms would not, and, even though we may not know how far from optimal
they are, they can be objectively compared to one another on benchmark problems.
Many communities in artificial intelligence and computer science have successfully
used such benchmark-driven approaches. Moreover, we argue that the identification
of a successful heuristic should not be seen as the end goal, but rather the start of
an investigation into why that heuristic performs well: apparently there are some
not-yet-understood properties of the benchmarks that lead to the heuristic being
successful.
Finally, let us point out that the terminology for non-exact methods is used
loosely in the field. In order to avoid confusion with heuristic search methods
(which, when used with an admissible heuristic—cf. Section 4.2.2—are optimal),
heuristic methods are more often than not referred to as “approximate methods” in the literature.
Online error bounds can also be calculated during the execution of bounded DP (BDP, i.e., dynamic programming with ε-pruning) to determine
the actual values of υ that result in removing a policy. That is, if υ < 0 during ε-
pruning, the loss in value is at most |υ|. The magnitude of the negative deltas can
be summed to provide a more accurate estimate of the error that BDP has produced.
Small epsilons are useful when implementing pruning in algorithms (due to numer-
ical precision issues), but in practice large epsilons are often needed to significantly
reduce the memory requirements of dynamic programming.
The forward, heuristic search approach can also be transformed into an approxi-
mation method. In particular, remember that MAA* exhaustively searches the tree
of past joint policies ϕ t while maintaining optimistic heuristic values V (ϕ t ). Since
the method performs an A* search—it always selects the node to expand with the
highest heuristic value—we know that if V̂(ϕ^{selected}_t) does not exceed the value v of
the best joint policy found so far, MAA* has identified an optimal solution.
That is, MAA* stops when
V̂(ϕ^{selected}_t) ≤ v.
This immediately suggests a trivial way to turn MAA* into an algorithm that is
guaranteed to find an ε-absolute error approximation: stop the algorithm when
V̂(ϕ^{selected}_t) ≤ v + ε.
Alternatively, MAA* can be used as an anytime algorithm: every time the al-
gorithm finds a new best joint policy this can be reported. In addition, the value of
V̂(ϕ^{selected}_t) can be queried at any time during the search, giving the current tightest
upper bound, which allows us to compute the worst-case absolute error on the last
reported joint policy.
if ō_{−i,t+1} = (ō_{−i,t}, o_{−i,t+1}), and 0 otherwise. The observation probabilities of the aug-
mented model are given by

Ǒ(o_{i,t+1} | a_{i,t}, š_{t+1} = ⟨s_{t+1}, ō_{−i,t+1}⟩) = O(o_{i,t+1}, o_{−i,t+1} | ⟨π_{−i}(ō_{−i,t}), a_{i,t}⟩, s_{t+1}).
That is, MBDP pretends that the resulting joint beliefs are revealed to the agents and
it retains only the trees that have the highest value at these joint belief. While during
execution the belief state will not truly be revealed to the agents, the hope is that
the individual subtree policies that are specified by these joint subtree policies are
good policies in large parts of the multiagent belief space. Because it can happen
that multiple maximizing joint subtree policies specify the same individual subtree
for an agent, the algorithm continues sampling new joint beliefs bh−2 until it has
found MaxTrees subtrees for each agent. At this point, MBDP will again perform
an exhaustive backup and start with the selection of MaxTrees three-stage-to-go
subtree policies for each agent.
The big advantage that MBDP offers is that, because the size of maintained sub-
trees does not grow, the size of the candidate sets Qτi formed by exhaustive backup is
O(|A†| MaxTrees^{|O†|}), where |A†| and |O†| denote the sizes of the largest individual
action and observation sets. This does not depend on the horizon, and as such MBDP
scales linearly with respect to the horizon, enabling the solution of problems with
horizons of thousands of time steps.
While the complexity of MBDP becomes linear in the horizon, in order to per-
form the maximization in (5.2.1), MBDP loops over all candidate joint subtree policies (a number that is exponential in the number of agents) for each of the sampled joint belief points. To reduce the bur-
den of this complexity, many papers have proposed new methods for performing
1For a deeper treatment of the relation between MBDP and PBDP, we refer to the description by
Oliehoek [2012].
this so-called point-based backup operation [Seuken and Zilberstein, 2007b, Car-
lin and Zilberstein, 2008, Boularias and Chaib-draa, 2008, Dibangoye et al., 2009,
Amato et al., 2009, Wu et al., 2010a].2 Also, this backup corresponds to solving a
one-shot constraint optimization problem, or collaborative Bayesian game, for each
joint action [Kumar and Zilberstein, 2010a, Oliehoek et al., 2010].
Definition 25. A collaborative Bayesian game B(M_DecP, b_0, ϕ_t) for stage t of a Dec-
POMDP M_DecP induced by b_0, ϕ_t is a tuple ⟨D, A, Θ̄_t, Pr(·), Q̂⟩, consisting of
• the set of agents D,
• the set of their joint actions A,
• the set of their joint AOHs Θ̄ t (referred to as joint ‘types’ in Bayesian-game
terminology),
• a probability distribution over them Pr(θ̄ t |b0 ,ϕ t ), and
• a heuristic payoff function Q̂(θ̄_t, a).
Since our discussion will be restricted to CBGs in the context of a single Dec-
POMDP, we will simply write B(b0 ,ϕ t ) for such a CBG.
2 This name indicates the similarity with the point-based backup in single-agent POMDPs.
3 This section title nicely illustrates the difficulty with the terminology for approximate and
heuristic methods: This section covers heuristic methods (i.e., without guarantees) that are based
on heuristic search. In order to avoid the phrase ‘heuristic heuristic-search methods’, we will refer
to these as ‘approximate heuristic-search methods’.
In the CBG, agents use policies4 that map from their individual AOHs to actions.
That is, a policy of an agent i for a CBG corresponds to a decision rule δi,t for the
Dec-POMDP. The solution of the CBG is the joint decision rule δ_t that maximizes
the expected payoff with respect to Q̂:

δ̂_t = arg max_{δ_t} ∑_{θ̄_t} Pr(θ̄_t | b_0, ϕ_t) Q̂(θ̄_t, δ_t(θ̄_t)).     (5.2.2)
Here δ_t(θ̄_t) is shorthand for the joint action resulting from individual application
of the decision rules: δ_t(θ̄_t) ≜ ⟨δ_{1,t}(θ̄_{1,t}), . . . , δ_{n,t}(θ̄_{n,t})⟩. The probability is given
as the marginal of (3.4.4). If ϕ_t is deterministic, the probability of θ̄_t = ⟨ō_t, ā_t⟩ is
nonzero for exactly one āt , which means that attention can be restricted to OHs and
decision rules that map from OHs to actions.
This perspective of a stage of a Dec-POMDP immediately suggests the following
solution method: first construct a CBG for stage t = 0, solve it to find δ̂ 0 , set ϕ 1 =
(δ̂_0) and use it to construct a CBG B(b_0, ϕ_1) for stage t = 1, etc. Once we have
solved a CBG for every stage t = 0, 1, . . . , h − 1, we have found an approximate
solution π̂ = (δ̂_0, . . . , δ̂_{h−1}). This process is referred to as forward-sweep policy
computation (FSPC) and is illustrated in Figure 5.1a.
Fig. 5.1: (a) Forward-sweep policy computation (FSPC). (b) (Generalized) MAA∗ performs backtracking.
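The following sketch shows what solving a single CBG amounts to. It is illustrative Python only: the CBG is represented simply by its probability table and heuristic payoff table (both assumed given), and decision rules are enumerated by brute force, which is only feasible for tiny problems.

from itertools import product

def solve_cbg(histories_per_agent, actions_per_agent, prob, Q):
    """Brute-force solution of one collaborative Bayesian game, cf. (5.2.2).

    histories_per_agent[i] : list of agent i's (observation) histories at this stage
    actions_per_agent[i]   : list of agent i's actions
    prob[joint_history]    : Pr(joint history | b0, phi_t)
    Q[(joint_history, joint_action)] : heuristic payoff
    Returns one decision rule per agent: a dict mapping history -> action.
    """
    best_value, best_rules = float("-inf"), None
    # a decision rule for agent i assigns one action to each of its histories
    per_agent_rules = [
        [dict(zip(hists, choice)) for choice in product(acts, repeat=len(hists))]
        for hists, acts in zip(histories_per_agent, actions_per_agent)
    ]
    for joint_rule in product(*per_agent_rules):
        value = 0.0
        for joint_history, p in prob.items():
            joint_action = tuple(rule[h] for rule, h in zip(joint_rule, joint_history))
            value += p * Q[(joint_history, joint_action)]
        if value > best_value:
            best_value, best_rules = value, joint_rule
    return best_rules, best_value

FSPC then simply calls such a solver for t = 0, 1, . . . , h − 1, each time using the decision rules found so far, i.e., ϕ_{t+1} = ϕ_t ◦ δ̂_t, to construct the probabilities of the next CBG.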
A problem in FSPC is that (5.2.2) still maximizes over δ t that map from histo-
ries to actions; the number of such δ t is doubly exponential in t. There are two main
approaches to gaining leverage. First, the maximization in (5.2.2) can be performed
more efficiently: approximately via alternating maximization [Emery-Montemerlo
et al., 2004], or exactly via heuristic search or other methods from constraint
4 In game theory, these policies are typically referred to as ‘strategies’. To avoid introducing more
terms than necessary, we stick to policy here.
optimization [Kumar and Zilberstein, 2010a, Oliehoek et al., 2010, 2012a]. Second,
it is possible to reduce the number of histories under concern via pruning [Emery-
Montemerlo et al., 2004], approximate clustering [Emery-Montemerlo et al., 2005]
or lossless clustering [Oliehoek et al., 2009].
Heuristic Q-Value Functions The CBG for a stage is fully specified given b_0, ϕ_t
and Q̂, but we have not yet addressed the matter of choosing Q̂. Essentially, this is
quite similar to the choice of the heuristic V̂ in MAA* described in Section 4.2.2.
The difference is that here we constrain the heuristic to be of the form Q̂(θ̄_t, a), i.e.,
a payoff function over joint AOHs and joint actions.
If, for the last stage, the heuristic specifies the immediate reward, Q̂(θ̄_t, a) =
R(θ̄_t, a), it is easy to show that the decision rule δ̂_{h−1} that maximizes (5.2.2) in
fact maximizes the expected last-stage reward and thus is optimal (given b_0, ϕ_t).
For other stages it is not practical to specify such an optimal heuristic of the form
Q̂(θ̄_t, a); this essentially corresponds to specifying an optimal value function, but
there is no way to compute an optimal value function of such a simple form (cf. the
discussion in Section 4.3).
However, note that FSPC via CBGs is not suboptimal per se: it is possible to
compute a value function of the form Q^π(θ̄_t, a) for any π. Doing this for a π^∗ yields
Q^{π^∗}, and when using the latter as the payoff functions for the CBGs, FSPC is ex-
act [Oliehoek et al., 2008b].5 The practical value of this insight is limited since it
requires knowing an optimal policy to start with. In practice, researchers have used
approximate value functions, such as the QMDP , QPOMDP and QBG functions that
were mentioned in Section 4.2.2. It is worth pointing out, however, that since FSPC
does not give any guarantees, it is not restricted to using an ‘admissible’ heuristic:
heuristics that occasionally underestimate the value but are overall more accurate
can produce better results.
Generalized MAA* Even though Figure 5.1 shows a clear relation between FSPC
and MAA∗ , it may not be directly obvious how they relate: the former solves CBGs,
while the latter performs heuristic search. Generalized MAA∗ (GMAA∗ ) unifies
these two approaches by making explicit the ‘Next’ operator [Oliehoek et al.,
2008b].
Algorithm 5.1 shows GMAA∗ . When the Select operator selects the highest
ranked ϕ t and when the expansion (‘Next’) operator expands all the children of a
node, GMAA∗ simply is MAA∗ . Alternatively, the Next operator can construct a
CBG B(b0 ,ϕ t ) for which all joint CBG policies δ t are evaluated. These can then be
used to construct a new set of partial policies ΦNext = {ϕ t ◦ δ t } and their heuristic
values. This corresponds to MAA∗ reformulated to work on CBGs. It can be shown
5 There is a subtle but important difference between Q^{π^∗}(θ̄_t, a) and the optimal value function
from Section 4.3: the latter specifies the optimal value given any past joint policy ϕ_t, while the
former only specifies the optimal value given that π^∗ is actually being followed. For a more thorough
discussion of these differences we refer you to Oliehoek [2010].
that when using a particular form of Q̂ (Q̂ needs to faithfully represent the expected
immediate reward; the mentioned heuristics QMDP, QPOMDP and QBG all satisfy this
requirement), the approaches are identical [Oliehoek et al., 2008b]. GMAA∗ can
also use a Next operator that does not construct all new partial policies, but only
the best-ranked one, ΦNext = {ϕ t ◦ δ t∗ }. As a result the open list L will never con-
tain more than one partial policy, and behavior reduces to FSPC. A generalization
called k-GMAA∗ constructs the k best-ranked partial policies, allowing us to trade
off computation time for solution quality. Clustering of histories can also be applied
in GMAA∗ , but only lossless clustering will preserve optimality.
Rather, DICE samples complete joint policies and uses those for the parameter
update.
Selecting a random sample X of N joint policies π from the distribution fξ is
straightforward. For each observation history ō_{i,t} of agent i, an action can be
sampled from the action distribution ξ_{ō_{i,t}}. The result of this process is a deterministic
policy for agent i. Repeating this procedure for each agent samples a deterministic
joint policy. The evaluation of a joint policy can be done using (3.4.2). For larger
problems—where policy evaluation is expensive—it is also possible to do approx-
imate sample-based evaluation using only polynomially many samples [Oliehoek
et al., 2008b].
Parameter Update The final step is to update the distribution using the best joint
policies sampled. Let X_b be the set of best joint policies sampled from the
previous distribution f_{ξ^{(j)}}. These will be used to find the new parameters ξ^{(j+1)}. Let
1_{{π_i(ō_{i,t})}}(a_i) be an indicator function that indicates whether π_i(ō_{i,t}) = a_i. In the OH-
based distribution the probability of agent i taking action ai,t after having observed
ōi,t can be re-estimated as:
ξ^{(j+1)}_{ō_{i,t}}(a_i) = (1 / |X_b|) ∑_{π ∈ X_b} 1_{{π_i(ō_{i,t})}}(a_i),     (5.2.5)
where |Xb | normalizes the distribution. Note that the computed new parameter vec-
tor ξ ( j+1) can be smoothed using a learning rate parameter α.
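A minimal sketch of these two steps for a single agent is given below (illustrative Python, not the book's Algorithm 5.2): sampling a deterministic individual policy from the OH-based distribution, and re-estimating that distribution from the best sampled policies with a smoothing convention chosen here for illustration. For a joint policy, both steps are applied per agent (using π_i for the update), and policy evaluation is assumed to be provided externally.

import random

def sample_policy(xi):
    """Sample a deterministic individual policy from an OH-based distribution.
    xi[oh] is a dict mapping each action to its probability for history oh."""
    policy = {}
    for oh, dist in xi.items():
        acts = list(dist)
        policy[oh] = random.choices(acts, weights=[dist[a] for a in acts])[0]
    return policy

def update_distribution(xi, best_policies, alpha):
    """Re-estimate xi from the best sampled policies (cf. (5.2.5)),
    smoothed with learning rate alpha (one common convention)."""
    new_xi = {}
    for oh, dist in xi.items():
        counts = {a: 0 for a in dist}
        for pol in best_policies:
            counts[pol[oh]] += 1
        new_xi[oh] = {a: (1 - alpha) * dist[a] + alpha * counts[a] / len(best_policies)
                      for a in dist}
    return new_xi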
Summary Algorithm 5.2 summarizes the DICE policy search method. To start, it
needs I, the number of iterations, N, the number of samples taken at each iteration,
Nb , the number of samples used to update ξ , and α, the learning rate. The outer loop
of lines 3–17 covers one iteration. The inner loop of lines 5–13 covers sampling and
evaluating one joint policy. Lines 14–16 perform the parameter update. Because
the CE method can get stuck in local optima, one typically performs a number of
restarts. This algorithm has also been extended to solve the infinite-horizon prob-
lems discussed in the next chapter [Omidshafiei et al., 2016].
Chapter 6
Infinite-Horizon Dec-POMDPs
This chapter presents an overview of the theory and policy representations for
infinite-horizon Dec-POMDPs. The optimality criteria are discussed and policy rep-
resentations using finite-state controllers are considered. These controllers can be
thought of as an extension of the policy trees used in the finite-horizon case to allow
execution over an infinite number of steps. This chapter provides value functions for
such controllers. Furthermore, the decidability and computational complexity of the
problem of (ε-)optimally solving infinite-horizon Dec-POMDPs are discussed.
When an infinite horizon is considered, the expected cumulative reward may be in-
finite. A finite sum (and a well-defined optimization problem) can be maintained by
using discounting or average rewards. Both of these concepts are presented below,
but the vast majority of research considers the discounted cumulative reward case.
In the infinite horizon case, a finite sum can be guaranteed by considering the dis-
counted expected cumulative reward
DECR = E[ ∑_{t=0}^{∞} γ^t R(s_t, a_t) ],     (6.1.1)
where 0 ≤ γ < 1 is the discount factor. That is, the DECR here is exactly the same as
in the finite-horizon case (3.1.3), except that the upper limit of the sum is now ∞.
Where in the finite-horizon setting discounting might be applied if it makes
sense from the application perspective to give more importance to earlier rewards,
in the infinite-horizon setting discounting is (also) applied to make sure that the objective
is bounded. In this way, discounting makes the values of different joint policies
comparable even if they operate for an infinite amount of time.
A different way to overcome the unbounded sum that would result from regular
expected cumulative rewards is given by the expected average reward criterion:

EAR = E[ lim_{h→∞} (1/h) ∑_{t=0}^{h−1} R(s_t, a_t) ].     (6.1.2)
This criterion has the benefit that it does not require setting a discount factor. In
contrast to DECR, earlier rewards do not weigh more heavily and, in fact, any finite
sequence of initially poor rewards will be disregarded by this criterion: only the
limiting performance counts. Therefore, it is most appropriate for problems that are
truly expected to run for what can be considered an infinite amount of time (as
opposed to problems that must reach some goal or complete some set of tasks in an
unknown amount of time). Theoretical analysis of the average reward criterion in
the POMDP case is very involved and the complexity of the problem is the same as
that of the discounted case (undecidable) [Puterman, 1994]. Few researchers have
considered the average reward case, but it has been shown to be NP-complete in the
case of independent transition and observation Dec-MDPs [Petrik and Zilberstein,
2007] and has been used in conjunction with the expectation maximization methods
described by Pajarinen and Peltonen [2013]. Since the amount of work done on the
average-reward case is limited, we will focus on the discounted cumulative reward
criterion in this book.
In a finite-state controller, execution starts in an initial node, actions are selected based on the current node, and transitions between nodes are driven by the observations received; this process continues for the infinite steps of the problem. Nodes in the controller of agent i
represent its internal states I_i and prescribe actions based on this finite memory.
For Dec-POMDPs, a set of controllers, one per agent, provides the joint policy.
Finite-state controllers explicitly represent infinite-horizon policies, but can also be
used (as a possibly more concise representation) for finite-horizon policies. They
have been widely used in POMDPs1 (e.g., see Kaelbling et al. 1998, Hansen 1998,
Meuleau et al. 1999b, Poupart and Boutilier 2004, Poupart 2005, Toussaint et al.
2006, 2008, Grześ et al. 2013, Bai et al. 2014) as well as Dec-POMDPs (e.g., Bern-
stein et al. 2005, Szer and Charpillet 2005, Amato et al. 2007a, Bernstein et al.
2009, Kumar and Zilberstein 2010b, Pajarinen and Peltonen 2011a, Kumar et al.
2011, Pajarinen and Peltonen 2011b and Wu et al. 2013).
One thing to note is that compared to the finite-horizon setting treated in the pre-
vious chapters, introducing FSCs somewhat alters the multiagent decision problem
that we are dealing with. The previous chapters assumed that agents’ actions are
based on histories, thereby in fact (implicitly) specifying the agents’ belief update
function. When resorting to FSCs this is no longer the case, and we will need to
reason about both the action selection policies πi as well as the belief update func-
tions ιi .
Perhaps the easiest way to view a finite-state controller (FSC) is as an agent model,
as treated in Section 2.4.4, where the number of internal states (or simply ‘states’)
is finite. The typical notation employed for FSCs is different from the notation
from Section 2.4.4—these differences are summarized in Table 6.1— but we hope
that the parallel is clear and will reuse the agent model’s notation for FSCs (which
are an instantiation of such agent models).2 To differentiate internal states of the
FSC from states of the Dec-POMDP, we will refer to internal controller states as
nodes.
We will now focus on two main types of FSCs, Moore and Mealy: Moore con-
trollers associate actions with nodes and Mealy controllers associate actions with
controller transitions (i.e., nodes and observations). The precise definition of the
components of FSCs can be formulated in different ways (e.g., deterministic vs.
stochastic transition functions, as will be further discussed in the remainder of this
section).
1 In POMDPs, finite-state controllers have the added benefit (over value function representations)
that the policy is explicitly represented, alleviating the need for belief updating during execution.
2 In this section, we will not consider auxiliary observations, and thus omit Z_i from the definitions.
Note that FSCs are not per se incompatible with auxiliary observations: they could be allowed by
defining the Cartesian product Oi × Zi as the input alphabet in Table 6.1.
model m_i = ⟨I_i, I_{i,0}, A_i, O_i, ι_i, π_i⟩     FSC ⟨N, n_0, Σ, ϒ, β, α⟩              mixed
I_i      a set of internal states           S     finite set of states             Q_i
I_{i,0}  initial internal state             s_0   initial state                    q_i
O_i      the set of observations            Σ     the finite input alphabet        O_i
A_i      the set of actions                 Γ     the finite output alphabet       A_i
π_i      the action selection policy        λ     the output function              λ_i
ι_i      the belief update function         δ     the transition function          δ_i

Table 6.1: Mapping between model notation from Section 2.4.4 and (typical) finite-
state machine notation and terminology, as well as a 'mixed' notation common in
Dec-POMDP literature.
In a Mealy controller, the first node is treated as a Moore node for
the first stage (the action selection policy for this stage cannot depend on an ob-
servation since none have been seen yet). Examples of two-node Moore and Mealy
controllers are shown in Figure 6.1.
Both Moore and Mealy models are equivalent in the sense that for a given con-
troller of one type, there is a controller of the other type that generates the same
outputs. However, it is known that Mealy controllers are more succinct than Moore
controllers in terms of the number of nodes. Given a Moore controller m_i, one can
find an equivalent Mealy controller m′_i with the same number of nodes by constrain-
ing the outputs produced at each transition from a common node to be the same.
Conversely, given a (general) Mealy controller, the equivalent Moore controller
has |Ii | × |Oi | nodes [Hopcroft and Ullman, 1979]. Of course, more parameters are
needed for a Mealy controller (2|Ii |×|Oi | in the Mealy case, but only |Ii |+|Ii |×|Oi |
in the Moore case), but this added structure can be used by algorithms (e.g., limit-
ing the possible actions considered based on the observation seen at the node). In
general, both formulations can be useful in solution methods (as we will discuss in
the next chapter).
Fig. 6.2: Three node deterministic controllers for two agents in the D EC -T IGER
problem.
An example of a set of Moore controllers for the (two agent) D EC -T IGER prob-
lem is given in Figure 6.2. This is the highest quality deterministic solution which
uses at most three nodes for each agent. Here, agent 1 listens until it hears the tiger
on the left twice in a row and then chooses to open the door on the right, while agent
2 listens until it hears the tiger on the right twice in a row and then opens the door
on the left. After the door is opened, the agents transition back to the first node and
begin this process again. The value of these controllers (using a discount factor of
0.9) is approximately −14.12, while the value of listening forever is −20.
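As an illustration, a deterministic Moore controller is just an action labeling of the nodes plus an observation-indexed transition function. The snippet below (illustrative Python, not from the book) encodes agent 1's controller as described above and steps it through a sequence of observations; the node names and the exact transitions are my reading of that description, and the details in Figure 6.2 may differ.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MooreController:
    initial_node: str
    action: Dict[str, str]                     # node -> action (Moore: output on nodes)
    next_node: Dict[Tuple[str, str], str]      # (node, observation) -> next node

    def run(self, observations: List[str]) -> List[str]:
        node, actions = self.initial_node, []
        for obs in observations:
            actions.append(self.action[node])
            node = self.next_node[(node, obs)]
        return actions

# Agent 1: listen until hearing the tiger on the left (oHL) twice in a row,
# then open the right door (aOR) and return to the first node.
agent1 = MooreController(
    initial_node="n0",
    action={"n0": "aL", "n1": "aL", "n2": "aOR"},
    next_node={("n0", "oHL"): "n1", ("n0", "oHR"): "n0",
               ("n1", "oHL"): "n2", ("n1", "oHR"): "n0",
               ("n2", "oHL"): "n0", ("n2", "oHR"): "n0"},
)
print(agent1.run(["oHR", "oHL", "oHL", "oHR"]))   # ['aL', 'aL', 'aL', 'aOR']

A Mealy controller is represented analogously, except that the action labels sit on the transitions, i.e., the action map would be keyed by (node, observation) pairs.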
6.2.3 Randomization
Policies can also be randomized, and the agents' randomization can be correlated by means of a correlation device: a shared source of randomness whose signal is commonly known, so that each agent can condition its action choice on the commonly known signal. For instance, consider a domain in which there is a
large penalty when agents choose different actions, but a large reward for choosing
the same actions. For sufficiently small controllers (e.g., one node for each agent),
this type of policy is impossible without the correlation device. It has been shown
that policies can be randomized and correlated to allow higher values to be attained
in a range of domains [Bernstein et al., 2005, 2009, Amato et al., 2007a, 2010].
When each agent uses a Moore controller, this results in a fully specified agent
component m (cf. Definition 8), to which we will also refer as joint controller in the
current context of FSCs. Such a joint controller induces a Markov reward process
(which is a Markov chain with rewards, or, alternatively, an MDP without actions)
and thus a value. In particular, the infinite-horizon discounted reward incurred when
the initial state is s and the initial nodes of the controllers are given by I can be
denoted by V^m(I, s), and it satisfies:
V^m(I, s) = ∑_a π(a|I) [ R(s, a) + γ ∑_{s′, o, I′} Pr(s′, o | s, a) ι(I′ | I, o) V^m(I′, s′) ],     (6.3.1)

where
• π(a|I) ≜ ∏_i π_i(a_i | I_i), and
• ι(I′|I, o) ≜ ∏_i ι_i(I′_i | I_i, o_i).
The value of m at the initial distribution is V^m(b_0) = ∑_{s_0} b_0(s_0) V^m(I_0, s_0), where I_0
is the set of initial nodes for the agents.
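Since, for a fixed joint controller, (6.3.1) is a linear system in the unknowns V^m(I, s), it can be solved directly. The sketch below (illustrative Python/NumPy, deterministic Moore controllers, tabular joint-action model as in the earlier sketches, and the first listed node of each controller taken as its initial node) does exactly that.

import numpy as np
from itertools import product

def evaluate_moore_controllers(controllers, T, O, R, b0, gamma, states, joint_obs):
    """Solve (6.3.1) for deterministic Moore controllers, one per agent.

    controllers : list of (nodes, action_of_node, next_node) per agent, where
                  action_of_node[n] is the action and next_node[(n, o_i)] the
                  successor node for individual observation o_i
    T[s][a][s2], O[a][s2][o], R[s][a] : tabular joint-action model
    """
    joint_nodes = list(product(*[c[0] for c in controllers]))
    index = {(I, s): k for k, (I, s) in enumerate(product(joint_nodes, states))}
    n = len(index)
    A, r = np.eye(n), np.zeros(n)
    for (I, s), k in index.items():
        a = tuple(c[1][node] for c, node in zip(controllers, I))   # joint action
        r[k] = R[s][a]
        for s2 in states:
            for o in joint_obs:                                    # o is a joint observation
                p = T[s][a][s2] * O[a][s2][o]
                if p > 0:
                    I2 = tuple(c[2][(node, oi)] for c, node, oi in zip(controllers, I, o))
                    A[k, index[(I2, s2)]] -= gamma * p
    V = np.linalg.solve(A, r)                                      # (I - gamma*P) V = r
    I0 = tuple(c[0][0] for c in controllers)                       # initial nodes: first listed
    return sum(b0[s] * V[index[(I0, s)]] for s in states)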
For a Mealy controller, the selected actions will depend on the last received ob-
servations. For a particular joint controller, m, when the initial state is s, the last joint
observation was o, and the current node of m is I, the value is denoted by V m (I,o,s)
and satisfies:
V^m(I, o, s) = ∑_a π(a | I, o) [ R(s, a) + γ ∑_{s′, o′, I′} Pr(s′, o′ | s, a) ι(I′ | I, o′) V^m(I′, o′, s′) ],     (6.3.2)
where, now, π(a|I,o) = ∏i πi (ai |Ii ,oi ). In this case, recall that the first node is as-
sumed to be a Moore node (i.e., the action selection in the first stage is governed
by πi,0 , which only depends on the node, not on observations),4 so the value for the
initial belief b0 can be computed as V m (b0 ) = ∑s0 b0 (s0 )V0m (I 0 ,s0 ), with
4 Alternatively, the value can be represented as V^m(b_0) = ∑_{s_0} b_0(s_0) V^m_0(I_0, o^∗_0, s_0), where o^∗_0 is a
dummy observation that is only received on the first step of the problem.
V^m_0(I_0, s_0) = ∑_a π_0(a | I_0) [ R(s_0, a) + γ ∑_{s′, o′, I′} Pr(s′, o′ | s_0, a) ι(I′ | I_0, o′) V^m(I′, o′, s′) ].
These recursive equations are similar to those used in the finite-horizon version
of the problem, but evaluation continues for an infinite number of steps. Note that
the policy is now stationary (depending on the controller node, but not time) and the
value for each combination of nodes and states for a fixed policy can be found using
a set of linear equations or iterative methods [Amato et al., 2010, Bernstein et al.,
2009].
For the infinite-horizon problem, both the number of steps of the problem and the
possible size of the policy (i.e., the number of nodes in the controllers) are un-
bounded. That is, controllers of unbounded size may be needed to perfectly rep-
resent an optimal policy. As a result, solving an infinite-horizon Dec-POMDP op-
timally is undecidable. This follows directly from the fact that optimally solving
infinite-horizon POMDPs is undecidable [Madani et al., 1999], since a Dec-POMDP
is a generalization of a POMDP.
Similarly, the definition of multiagent beliefs, which is based on subtree poli-
cies (cf. Definition 19), is not appropriate in the potentially infinite space, but has
been reformulated in the context of bounded policies. Specifically, from the per-
spective of agent i and given a known set of controllers for the other agents −i, a probability distribution over the other agents being in nodes $I_{-i}$ while the state of the system is $s_t$ can be represented as $\Pr(s_t,I_{-i})$. Like the multiagent belief in the finite-horizon case, these probabilities can be used at planning time to evaluate agent i's policies against the space of other agents' policies and to estimate the outcomes of the other agents' controllers.
As an alternative, approximation methods (with guarantees) have been consid-
ered: due to discounting, a solution that is within any fixed ε of the optimal value can
be found in a finite number of steps [Bernstein et al., 2009]. That is, we can choose
t such that the maximum sum of rewards over the remaining stages t + 1,t + 2, . . . is
bounded by ε:
$$\sum_{k=t+1}^{\infty} \gamma^k |R_{max}| = \frac{\gamma^{t+1}|R_{max}|}{1-\gamma} \le \varepsilon,$$
where |Rmax | is the immediate reward with the largest magnitude. This ensures that
any sum of rewards after time t will be smaller than ε (due to discounting). There-
fore, t becomes an effective horizon in that an optimal solution for the horizon t
problem ensures an ε-optimal solution for the infinite-horizon problem. This pro-
cedure is also NEXP-complete because an optimal policy is found for the effective
horizon. Theoretically, finite-horizon methods from the previous chapter could also
be used to produce ε-optimal solutions, but the effective horizon is often too large for this to be feasible in practice.
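As a small illustration of this effective-horizon argument (a sketch under the stated assumptions, not code from the literature), the smallest such t can be computed directly from γ, |R_max| and ε:

```python
import math

def effective_horizon(gamma, r_max_abs, epsilon):
    """Smallest t such that gamma^(t+1) * |R_max| / (1 - gamma) <= epsilon,
    i.e., the discounted reward obtainable after stage t is at most epsilon."""
    bound = epsilon * (1.0 - gamma) / r_max_abs  # need gamma^(t+1) <= bound
    return max(0, math.ceil(math.log(bound) / math.log(gamma)) - 1)

# e.g. gamma = 0.9, |R_max| = 20, epsilon = 0.01 gives an effective horizon of 93
```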
Policy iteration (PI) for Dec-POMDPs [Bernstein et al., 2009] is similar to the
finite-horizon dynamic programming algorithm (Algorithm 4.1), but finite-state
controllers are used as policy representations (like the policy iteration approach for
POMDPs [Hansen, 1998]). In addition, there are two other main differences with
the finite-horizon DP:
1. Instead of using a set of separately represented policies (i.e., policy trees), PI
maintains a single controller for each agent and considers the value of beginning
execution from any node of the controller. That is, starting at each (joint) node
can be interpreted as an infinite-horizon (joint) policy and the set of policies can
be considered to be the set of joint nodes. Pruning can then take place over nodes
of these controllers to remove dominated policies.
2. Rather than initializing the iteration process with the (set of) one-stage-to-go
policies as in DP, PI can start from any initial joint controller for each agent
$m^0 = \langle m_{1,0},\dots,m_{n,0}\rangle$. Exactly what this initial joint controller $m^0$ is does not matter: the controller is going to be subject to exhaustive backups, as well as controller improvements, which will 'push' the initial controller beyond the effective horizon (where the discount factor renders its contribution to the value bounded by ε).
Fig. 7.1: A full backup for a single agent for action a1 when starting with the con-
troller in (a), resulting in the controller in (b).
The basic procedure of PI is that it continuously tries to improve the joint con-
troller by improving the controller maintained for each agent. This per-agent im-
provement is done using an exhaustive backup operation (this is the same as in
the finite-horizon DP case) by which nodes are added to the controller that intu-
itively correspond to all possible ‘one-step longer’ policies.1 To counter the growth
of the individual controllers, pruning can be conducted, which removes a node from an agent's controller if it is dominated, i.e., if for every $(s,I_{-i})$-pair some (possibly stochastic) combination of other nodes achieves at least the same value. That is, the policy for agent i given by beginning in the node is dominated by some set of policies given by starting at other nodes in the controller. These exhaustive backups and pruning
steps continue until the solution is provably within ε of an optimal solution. This
algorithm can produce an ε-optimal policy in a finite number of steps [Bernstein
et al., 2009]. The details of policy iteration follow.
The policy iteration algorithm is shown in Algorithm 7.1. The input is an initial
joint controller, m0 , and a parameter ε. At each step, evaluation, backup and pruning
occurs. The controller is evaluated using (6.3.1). Next, an exhaustive backup is per-
formed to add nodes to each of the local controllers. Similarly to the finite-horizon
case, for each agent i, $|A_i|\,|I_i|^{|O_i|}$ nodes are added to the local controller, where $|I_i|$
is the number of nodes in the current controller. The exhaustive backup represents
starting from each action and for each combination of observations transitioning to
each of the nodes in the current controller. Repeated application of exhaustive back-
ups amounts to a brute-force search in the space of deterministic policies, which
will converge to ε-optimality, but is obviously quite inefficient.
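The following sketch (an illustration under assumed data structures, not the book's code) enumerates the candidate nodes generated by one exhaustive backup for a single agent; each new node pairs an action with a mapping from observations to nodes of the current controller.

```python
import itertools

def exhaustive_backup(actions, observations, nodes):
    """Enumerate the new nodes added by one exhaustive backup for one agent:
    every pairing of an action with a mapping from observations to existing
    nodes, i.e., |A_i| * |I_i|**|O_i| candidate nodes in total.
    Node 'identities' are simply (action, transition-map) tuples here."""
    new_nodes = []
    for a in actions:
        for targets in itertools.product(nodes, repeat=len(observations)):
            transition = dict(zip(observations, targets))
            new_nodes.append((a, transition))
    return new_nodes

# e.g. with 2 actions, 2 observations and 3 current nodes: 2 * 3**2 = 18 new nodes
```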
To increase the efficiency of the algorithm, pruning takes place. Because plan-
ning takes place offline, the controllers for each agent are known at each step, but
agents will not know which node of their controller any of the other agents will be in
during execution. As a result, pruning must be completed over the multiagent belief
1 Essentially, doing (infinitely many) exhaustive backups will generate all (infinitely many) poli-
cies.
space (using a linear program that is very similar to that described for finite-horizon
dynamic programming in Section 4.3). That is, a node for an agent’s controller can
only be pruned if there is some combination of nodes that has higher value for all
states of the system and at all nodes of the other agents’ controllers. Unlike in the
finite-horizon case, edges to the removed node are then redirected to the dominating
nodes. Because a node may be dominated by a distribution of other nodes, the result-
ing transitions may be stochastic rather than deterministic. The updated controller
is evaluated, and pruning continues until no agent can remove any further nodes.
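A minimal sketch of this dominance test is given below; it assumes a hypothetical value array V[q, k] over agent i's nodes q and flattened (state, other-agents'-nodes) combinations k, and uses a linear program (here via SciPy's linprog) to check whether a distribution over the other nodes weakly dominates the candidate node everywhere. A strictly positive slack indicates that the node can be removed and its incoming edges redirected to the dominating (possibly stochastic) mixture.

```python
import numpy as np
from scipy.optimize import linprog

def is_dominated(node, values, eps_tol=1e-9):
    """Check whether `node` is dominated by a distribution over the other
    nodes for every (state, I_{-i}) combination.
    values[q, k]: value of starting agent i in node q while the flattened
    (state, other-agents'-nodes) combination is k (assumed layout)."""
    n_nodes, n_combos = values.shape
    others = [q for q in range(n_nodes) if q != node]
    # variables: mixture weights x(q) for q in `others`, plus a slack epsilon
    c = np.zeros(len(others) + 1)
    c[-1] = -1.0                      # maximize epsilon
    # for every combination k:  sum_q x(q) V[q, k] >= V[node, k] + epsilon
    A_ub = np.hstack([-values[others].T, np.ones((n_combos, 1))])
    b_ub = -values[node]
    A_eq = np.hstack([np.ones((1, len(others))), np.zeros((1, 1))])
    b_eq = [1.0]                      # weights form a probability distribution
    bounds = [(0, None)] * len(others) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.success and -res.fun > eps_tol   # epsilon > 0 means dominated
```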
Convergence to ε-optimality can be calculated based on the discount rate and the
number of iterations of backups that have been performed. Let |Rmax | be the largest
absolute value of any immediate reward in the Dec-POMDP. Then the algorithm
terminates after iteration t if $\frac{\gamma^{t+1}|R_{max}|}{1-\gamma} \le \varepsilon$. At this point, due to discounting, the value of any policy after step t will be less than ε.
Fig. 7.2: Choosing actions and transitions in each node of a fixed-size controller.
The approaches discussed in this section have sought to produce a high-quality solution while keeping the controller sizes for the agents fixed.
That is, the concept behind these approaches is to choose a controller size |Ii |
for each agent and then determine action selection and node transition parameters
that produce high values. However, because these methods fix the size of the agents’
controllers, the resulting size may be smaller than is needed for even an ε-optimal
solution. This means that, while some of the approaches in this section can produce a
controller that has the highest value given the fixed size, that value may be arbitrarily
far from the optimal solution.
A way to compute the best deterministic joint controller of given size is via heuristic
search. One such method, referred to as best-first search for Dec-POMDPs [Szer and
Charpillet, 2005], generalizes the MAA∗ technique from Section 4.2.2 to the infinite
horizon. Rather than filling templates of trees (cf. Figure 5.1b) the method here fills
templates of controllers; it searches through the possible actions that can be taken
at each agent’s controller nodes and the possible transitions that result from each
observation in each node.
In more detail, the algorithm searches through the possible deterministic transi-
tion mappings ιi : Ii × Oi → Ii and the deterministic action function, πi : Ii → Ai , for
all agents. The controller nodes are ordered and a forward search is conducted that
specifies the action selection and node transition parameters for all agents, one node
at a time. The heuristic value determines an upper bound value for these partially
specified controllers (again, in a way that is similar to MAA∗ ) by assuming central-
ized control is used for unspecified nodes. In this way, an approximate value for the
controller is calculated given the currently specified deterministic parameters and
the algorithm fills in the remaining nodes one at a time in a best-first fashion.
This process continues until the value of a set of fully specified controllers is
greater than the heuristic value of any partially specified controllers. Since this is an
instance of heuristic search applied with an admissible heuristic, this technique is
guaranteed to find the best deterministic joint finite-state controller of a given size.
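To make the search space concrete, the sketch below enumerates all deterministic controllers of a given size for one agent, i.e., an action function $\pi_i: I_i \to A_i$ and a transition function $\iota_i: I_i \times O_i \to I_i$; best-first search explores this space node by node, guided by the heuristic, rather than enumerating it exhaustively. The data layout is a hypothetical one chosen for brevity.

```python
import itertools

def deterministic_controllers(num_nodes, actions, observations):
    """Enumerate all deterministic Moore controllers of a given size for one
    agent: an action function pi: I -> A and a transition function
    iota: I x O -> I, yielded as (pi, iota) dictionaries."""
    nodes = range(num_nodes)
    for acts in itertools.product(actions, repeat=num_nodes):
        pi = dict(zip(nodes, acts))
        for trans in itertools.product(nodes, repeat=num_nodes * len(observations)):
            iota = {(I, o): trans[I * len(observations) + k]
                    for I in nodes for k, o in enumerate(observations)}
            yield pi, iota
```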
Because stochastic controllers with the same number of nodes are able to produce
higher-valued policies than deterministic controllers, researchers have also explored
optimizing stochastic controllers (a difficult problem by itself; see the overview and
complexity results by Vlassis et al. 2012). These approaches seek to find proba-
bilistic parameters for action selection and node transition. That is, for agent i, the
algorithms find the probability of taking action $a_i$ in node $I_i$, $\Pr(a_i|I_i)$, and the probability of transitioning from node $I_i$ to node $I_i'$ after action $a_i$ is taken and observation $o_i$ made, $\Pr(I_i'|I_i,a_i,o_i)$.
One method to produce stochastic controllers is by using a set of linear pro-
grams. In particular, bounded policy iteration for decentralized POMDPs (Dec-
BPI) [Bernstein et al., 2005], is such a method that extends the BPI algorithm for
POMDPs [Poupart and Boutilier, 2003] to the multiagent setting. This approach
iterates through the nodes of each agent’s controller, attempting to find an improve-
ment for that node. That is, it tries to improve a single local controller, assuming
that the controllers of the other agents are fixed, and thus is conceptually similar to
JESP, described in Section 5.2.1. In contrast to JESP, however, this improvement for
an agent i cannot be found using plain dynamic programming over a tree or directed
acyclic graph (DAG) of reachable ‘JESP beliefs’ bi (s,ō−i,t ) (even when replacing
histories by internal states leading to a belief of the form bi (s,I−i ) there would
be infinitely many in general) or enumeration of all controllers mi (there are un-
countably many stochastic controllers). Instead, Dec-BPI uses linear programming
to search for a new node to replace the old node.
Specifically, this approach iterates through the agents i, along with the nodes $I_i$ of agent i's controller.
Then, the method assumes that the current controller will be used from the second
step on, and tries to replace the parameters for Ii with ones that will increase value
for just the first step. That is, it attempts to find new parameters satisfying, for all states s and all nodes $I_{-i}$ of the other agents, the following inequality:
$$V(\langle I_i,I_{-i}\rangle, s) \le \sum_a \pi(a|I)\Big[ R(s,a) + \gamma \sum_{s',o,I'} \Pr(s',o|s,a)\,\iota(I'|I,o)\, V(I',s')\Big],$$
where, on the right-hand side, the action selection and node transition probabilities of node $I_i$ are given by the new parameters, while the current controller (and its value function V) is used from the second step on.
Here, I−i is the set of controller nodes for all agents besides i. The search for new
parameters can be formulated as a linear program in Figure 7.3. Note that a ‘com-
bined’ action selection and node transition probability,
$$\Pr(I_i',a_i|I_i,o_i) \triangleq \Pr(I_i'|I_i,a_i,o_i)\,\Pr(a_i|I_i),$$
is used to ensure the improvement constraints are linear; naive inclusion of the right-
hand side product would lead to quadratic improvement constraints. Instead, we
introduce more variables that lead to a linear formulation. The second probability
constraint in Figure 7.3 ensures that the action selection probabilities can be recov-
ered (i.e., that $y(a_i,o_i,I_i')$ does not encode an invalid distribution). Note that the first and second probability constraints together guarantee that
$$\forall_{o_i} \quad \sum_{I_i',a_i} y(a_i,o_i,I_i') = 1.$$
Probability constraints:
$$\sum_{a_i} x(a_i) = 1$$
$$\forall_{a_i,o_i} \quad \sum_{I_i'} y(a_i,o_i,I_i') = x(a_i)$$
$$\forall_{a_i} \quad x(a_i) \ge 0$$
$$\forall_{a_i,o_i,I_i'} \quad y(a_i,o_i,I_i') \ge 0$$
Fig. 7.3: The linear program (LP) to test for improvement in Dec-BPI. The LP de-
termines if there is a probability distribution over actions and transitions from node
Ii that improves value when assuming the current controller will be used from the
second step on. Note that Pr(Ii ,ai |Ii ,oi ) is the combined action and transition prob-
ability which is made consistent with the action selection probability πi (ai |Ii ) in
the probability constraints. This form is needed to ensure the objective function is
linear.
This linear program is polynomial in the sizes of the Dec-POMDP and the joint
controller, but exponential in the number of agents. Bernstein et al. allow each
agent’s controller to be correlated by using shared information in a correlation de-
vice (as discussed in Section 6.2.4). This may improve solution quality while requir-
ing only a limited increase in problem size [Bernstein et al., 2005, 2009].
Because Dec-BPI can often become stuck in local maxima, nonlinear programming
(NLP) has also been used [Amato et al., 2007a, 2010]. The formulation seeks to
optimize the value of a set of fixed-size controllers given an initial state distribution.
The variables for this problem are the action selection and node transition proba-
bilities for each node of each agent’s controller as well as the value of the resulting
policy starting from a set of controller nodes.
More formally, the NLP maximizes the expected value of the initial controller
node for each agent at the initial belief subject to the Bellman constraints. To this
end, let us translate the value of a joint controller (from Equation 6.3.1) in terms of
variables that will be used for optimization:
$$z(I,s) = \sum_a \prod_i x(I_i,a_i)\Big[ R(s,a) + \gamma \sum_{s',o} \Pr(s',o|s,a) \sum_{I'} \prod_j y(I_j,a_j,o_j,I_j')\, z(I',s') \Big]. \qquad (7.2.1)$$
As shown in Figure 7.4, $z(I,s)$ represents the value, $V(I,s)$, of executing the controller starting from nodes I and state s, while $x(I_i,a_i)$ is the action selection probability, $\Pr(a_i|I_i)$, and $y(I_i,a_i,o_i,I_i')$ is the node transition probability, $\Pr(I_i'|I_i,a_i,o_i)$.
Note that to ensure that the values are correct given the action and node transition
probabilities, these nonlinear constraints must be added to the optimization which
represent the Bellman equations given the policy determined by the action and tran-
sition probabilities. We must also ensure that the necessary variables are valid prob-
abilities in a set of probability constraints. This approach differs from Dec-BPI in
that it explicitly represents the node values as variables, thus allowing improvement
and evaluation to take place simultaneously. An optimal solution of this nonlinear
program represents an optimal fixed-size solution to the Dec-POMDP, but as this
is often intractable, approximate solvers have been used in practice [Amato et al.,
2010].
Maximize $\sum_{s_0} b^0(s_0)\, z(I^0,s_0)$, subject to:
Bellman constraints:
$$\forall_{I,s} \quad z(I,s) = \sum_a \prod_i x(I_i,a_i)\Big[ R(s,a) + \gamma \sum_{s',o} \Pr(s',o|s,a) \sum_{I'} \prod_j y(I_j,a_j,o_j,I_j')\, z(I',s') \Big] \qquad (7.2.2)$$
Probability constraints:
$$\forall_{I_i} \ \sum_{a_i} x(I_i,a_i) = 1, \qquad \forall_{I_i,a_i,o_i} \ \sum_{I_i'} y(I_i,a_i,o_i,I_i') = 1, \qquad x \ge 0, \ y \ge 0$$
Fig. 7.4: The nonlinear program (NLP) representing the optimal fixed-size solution
for the problem. The action selection probabilities, $\Pr(a_i|I_i)$, and node transition probabilities, $\Pr(I_i'|I_i,a_i,o_i)$, are optimized for each agent i to maximize the value of the controllers. This optimization is performed for the given initial belief $b^0$ and a given (arbitrarily selected) tuple of initial nodes, $I^0 = \langle I_{1,0},\dots,I_{n,0}\rangle$.
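As a sketch of how the Bellman constraints (7.2.2) can be handed to a generic nonlinear solver, the function below computes their residuals; to keep the example short it works with joint quantities (the products over agents are assumed to be precomputed), whereas the actual NLP keeps per-agent variables $x(I_i,a_i)$ and $y(I_i,a_i,o_i,I_i')$. All array layouts are assumptions.

```python
import numpy as np

def bellman_residuals(z, x, y, T, O, R, gamma):
    """Residuals of the Bellman constraints (7.2.2), written for *joint*
    quantities to keep the sketch short:
      x[I, a]          joint action probabilities (product over agents)
      y[I, a, o, I2]   joint node transition probabilities (product over agents)
      z[I, s]          value variables
      T[s, a, s'], O[a, s', o], R[s, a]   flattened joint model (assumed layout)
    A nonlinear solver would drive these residuals to zero while maximizing
    sum_s b0[s] * z[I0, s]."""
    nI, nS = z.shape
    nA = x.shape[1]
    nO = O.shape[2]
    res = np.zeros_like(z)
    for I in range(nI):
        for s in range(nS):
            rhs = 0.0
            for a in range(nA):
                cont = 0.0
                for s2 in range(nS):
                    for o in range(nO):
                        for I2 in range(nI):
                            cont += T[s, a, s2] * O[a, s2, o] * y[I, a, o, I2] * z[I2, s2]
                rhs += x[I, a] * (R[s, a] + gamma * cont)
            res[I, s] = z[I, s] - rhs
    return res
```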
The basic idea is that the problem can be represented as an infinite mixture of dy-
namic Bayesian networks (DBNs), which has one component for each time step t.
The DBN responsible for a particular t covers stages 0, . . . ,t and represents the
‘probability’ that the ‘maximum reward’ is received at its last modeled stage (i.e.,
at t). The intuition is that the probability of achieving the maximum reward can be
considered as a substitute for the value of the controller. We give a brief formaliza-
tion of this intuition next; for details we refer the reader to the papers by Toussaint
et al. [2006], Kumar and Zilberstein [2010b], and Kumar et al. [2011].
First, the formalization is based on binary reward variables, r, for each stage t that provide probability via $\Pr(r=1|s_t,a_t) \triangleq \frac{R(s_t,a_t)-R_{min}}{R_{max}-R_{min}}$, where $R_{min}$ and $R_{max}$ are
the smallest and largest immediate rewards. This probability encodes the ‘chance of
getting the highest possible reward' at stage t. This can be used to define $\Pr(r,Z|t;\theta)$, with $Z = \langle s_0,a_0,s_1,o_1,I_1,a_1,\dots,a_{t-1},s_t,o_t,I_t\rangle$ the entire history of states, actions, observations and internal states, and with $\theta = \{\Pr(a|I), \Pr(I'|I,o), \Pr(I)\}$ the pa-
rameter vector that specifies the action selection, node transition and initial node
probabilities. Now these probabilities constitute a mixture probability via
$$\Pr(r,Z;\theta) = \sum_{t} P(t)\,\Pr(r,Z|t;\theta),$$
with $P(t) = \gamma^t(1-\gamma)$ (used to discount the reward that is received for each time
step t). This can be used to define an overall likelihood function $L(r=1;\theta)$, and it can be shown that maximizing this likelihood is equivalent to optimizing value for the Dec-POMDP (using a fixed-size controller). Specifically, the value function of controller θ can be recovered from the likelihood as $V^\theta(b^0) = \frac{(R_{max}-R_{min})\,L}{1-\gamma} + \sum_t \gamma^t R_{min}$ [Kumar and Zilberstein, 2010b].
As such, maximizing the value has been cast as the problem of maximizing likeli-
hood in a DBN, and for this the EM algorithm can be used [Bishop, 2006]. It iterates
by calculating occupancy probabilities—i.e., the probability of being in each con-
troller node and problem state—and values given fixed controller parameters (in an
E-step) and improving the controller parameters (in an M-step). The likelihood and
associated value will increase at each iteration until the algorithm converges to a
(possibly local) optimum. The E-step calculates two quantities. The first is $P_t^\theta(I,s_t)$,
the probability of being in state st and node I at each stage t. The second quantity,
computed for each stage-to-go, is Pτθ (r = 1|I,s), which corresponds to the expected
value of starting from I,s and continuing for τ steps. The M-step uses the probabil-
ities calculated in the E-step and the previous controller parameters to update the
action selection, node transition and initial node parameters.
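The two 'bookkeeping' steps of this formulation, normalizing rewards into probabilities and recovering the value from the likelihood, are simple enough to sketch (assuming a reward array R[s, a]):

```python
import numpy as np

def reward_to_probability(R):
    """Pr(r = 1 | s, a) = (R(s,a) - R_min) / (R_max - R_min) for a reward
    array R[s, a] (assumed layout)."""
    r_min, r_max = R.min(), R.max()
    return (R - r_min) / (r_max - r_min)

def value_from_likelihood(L, R, gamma):
    """Recover the discounted value of a controller from the overall mixture
    likelihood L:  V = (R_max - R_min) * L / (1 - gamma) + R_min / (1 - gamma)."""
    r_min, r_max = R.min(), R.max()
    return (r_max - r_min) * L / (1.0 - gamma) + r_min / (1.0 - gamma)
```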
After this EM approach was introduced, additional related methods were devel-
oped. These updated methods include EM for Dec-POMDPs with factored state
spaces [Pajarinen and Peltonen, 2011a] and factored structured representations [Pa-
jarinen and Peltonen, 2011b], and EM using simulation data rather than the full
model [Wu et al., 2013].
where, using $\iota(I'|I,a,o) = \prod_{i\in\mathcal{D}} \iota_i(I_i'|I_i,a_i,o_i)$ for the joint information state update probability, the probability of transitioning to $(s',I')$ is given by
$$\Pr(s',I'|s,I,a) = \sum_{o} \Pr(s',o|s,a)\,\iota(I'|I,a,o). \qquad (7.2.4)$$
It is easy to show that, for a given set {ιi } of information state functions, one can
construct a plan-time NOMDP analogous to Definition 23 in Section 4.3.3, where
augmented states are tuples s̄ = s,I. However, as discussed before, in the infinite-
horizon setting, the selection of those information state functions becomes part of
the problem.
One idea to address this, dating back to Meuleau et al. [1999a], is to make
searching the space of deterministic information state functions part of the prob-
lem by defining a cross-product MDP in which “a decision is the choice of an
action and of a next node”. That is, selection of the ιi function (in a POMDP
with protagonist agent i) can be done by introducing $|O_i|$ new action variables (say, $a^\iota_i = \{a^{\iota,1}_i,\dots,a^{\iota,|O_i|}_i\}$) that specify, for each observation $o_i \in O_i$, to what next
internal state to transition. This approach is extended to Dec-POMDPs by Mac-
Dermed and Isbell [2013] who introduce the bounded belief Dec-POMDP2 (BB-
Dec-POMDP), which is a Dec-POMDP that encodes the selection of optimal {ιi }
by splitting each stage into two stages: one for selection of the domain-level actions
and one for selection of the aιi . We omit the details of this formulation and refer
the reader to the original paper. The main point that the reader should note is that
by making ι part of the (augmented) joint action, the probability Pr(s ,I |s,I,a) from
(7.2.4) no longer depends on external quantities, which means that it is possible
to construct an NOMDP formulation analogous to Definition 23 that in fact does
optimize over (deterministic) information state functions.
This is in fact what MacDermed and Isbell [2013] propose; they construct the
NOMDP (to which they refer as a ‘belief-POMDP’) for a BB-Dec-POMDP and
solve it with a POMDP solution method. Specifically, they use a modification of
the point-based method Perseus [Spaan and Vlassis, 2005] to solve the NOMDP.
The modification employed is aimed at mitigating the bottleneck of maximizing
over (exponentially many) decision rules in V ∗ (σ ) = maxδ Q∗ (σ ,δ ). Since the value
function is PWLC, the next-stage value function can be represented using a set of
vectors v ∈ V , and we can write
$$
\begin{aligned}
V^*(\sigma) &= \max_{\delta} \sum_{(s,I)} \sigma(s,I)\Big[ R(s,\delta(I)) + \max_{v\in\mathcal{V}} \sum_{(s',I')} \Pr(s',I'|s,I,\delta)\, v(s',I') \Big] \\
&= \max_{v\in\mathcal{V}} \max_{\delta} \sum_{(s,I)} \sigma(s,I) \underbrace{\Big[ R(s,\delta(I)) + \sum_{(s',I')} \Pr(s',I'|s,\delta(I))\, v(s',I') \Big]}_{v_{\delta}(s,I)}.
\end{aligned}
$$
2 The term ‘bounded belief’ refers to the finite number of internal states (or ‘beliefs’) considered.
The key insight is that, in the last expression, the bracketed part $v_\delta(s,I)$ only depends on $\delta(I) = \langle \delta_1(I_1),\dots,\delta_n(I_n)\rangle$, i.e., on that part of δ that specifies the actions for I only.
As such it is possible to rewrite this value as the maximum of solutions of a collec-
tion of collaborative Bayesian games (cf. Section 5.2.3), one for each $v \in \mathcal{V}$.
For each v ∈ V , the maximization over δ can be interpreted as the solution of a CBG
(5.2.2), and therefore can be performed more effectively using a variety of meth-
ods [Oliehoek et al., 2010, Kumar and Zilberstein, 2010a, Oliehoek et al., 2012a].
MacDermed and Isbell [2013] propose a method based on the relaxation of an inte-
ger program. We note that the maximizing δ directly induces a vector vδ , which is
the result of the point-based backup. As such, this modification can also be used by
other point-based POMDP methods.
It is good to note that a BB-Dec-POMDP is just a special case of an (infinite-
horizon) Dec-POMDP. The fact that it happens to have a bounded number of infor-
mation states is nothing new compared to previous approaches: those also limited
the number of information states (controller nodes) to a finite number. The concep-
tual difference, however, is that MacDermed and Isbell [2013] pose this restriction
as part of the model definition, rather than of the solution method. This is very much in
line with, and a source of inspiration for, the more general definition of multiagent
decision problems that we introduced in Section 2.4.4.
Chapter 8
Further Topics
One of the major directions of research in the last decade has been the identifica-
tion and exploitation of structure in Dec-POMDPs. In particular, much research has
considered the special cases mentioned in Section 2.4.2, and models that generalize
these.
One of the main directions of work that people have pursued is the exploitation of
structure between variables by making use of methods from constraint optimization
or inference in graphical models. These methods construct what is called a coordi-
nation graph, which indicates the agents that need to directly coordinate. Methods
to exploit the sparsity of this graph can be employed for more efficient computation.
After introducing coordination graphs, this section will give an impression of how
they are employed within ND-POMDPs and factored Dec-POMDPs.
The main idea behind coordination graphs [Guestrin et al., 2002a, Kok and Vlassis,
2006], also called interaction graphs [Nair et al., 2005] or collaborative graphical
games [Oliehoek et al., 2008c, 2012a], is that even though all agents are relevant
for the total payoff, not all agents directly interact with each other. For instance,
Figure 8.1a shows a coordination graph involving four agents. All agents try to
optimize the sum of the payoff functions, but agent 1 only directly interacts with
agent 2. This leads to a form of conditional independence that can be exploited
using a variety of techniques.
Fig. 8.1: (a) A four-agent coordination graph with payoff components $u^1(a_1,a_2)$, $u^2(a_2,a_3)$ and $u^3(a_2,a_4)$. (b) A three-agent coordination hyper-graph.
$$u(a) = \sum_{e\in E} u^e(a_e).$$
1 Even though an abuse of notation, the meaning should be clear from context. Moreover, it allows us to avoid the notational burden of correct alternatives such as defining the subset as $\mathcal{A}(e)$ and writing $u^e(a_{\mathcal{A}(e)})$.
Note that the local reward functions should not be confused with individual re-
ward functions in a system with self-interested agents, such as partially observable
stochastic games [Hansen et al., 2004] and (graphical) BGs [Kearns et al., 2001,
Kearns, 2007]. In such models agents compete to maximize their individual reward
functions, while we consider agents that collaborate to maximize the sum of local
payoff functions.
A CG is a specific instantiation of a constraint optimization problem (COP) [Mari-
nescu and Dechter, 2009] or, more generally, a graphical model [Dechter, 2013]
where the nodes are agents (rather than variables) and they can take actions (rather
than values). This means that there are a multitude of methods that can be brought
to bear on these problems. The most popular exact method is nonserial dynamic
programming (NDP) [Bertele and Brioschi, 1972], also called variable elimination
[Guestrin et al., 2002a, Kok and Vlassis, 2006] and bucket elimination [Dechter,
1997]. Since NDP can be slow for certain network topologies (those with high ‘in-
duced width’; see Kok and Vlassis 2006 for details and empirical investigation), a
popular (but approximate) alternative is the max-sum algorithm [Pearl, 1988, Kok
and Vlassis, 2005, 2006, Farinelli et al., 2008, Rogers et al., 2011]. It is also possi-
ble to use distributed methods as investigated in the field of distributed constraint
optimization problems (DCOPs) [Liu and Sycara, 1995, Yokoo, 2001, Modi et al.,
2005].
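As an illustration of how such COP methods operate, the following is a minimal sketch of nonserial dynamic programming (variable elimination) on a coordination graph; the data layout (scope/table pairs, `agent_actions`, `order`) is a hypothetical one chosen for brevity, and only the optimal value is returned, not the maximizing joint action.

```python
from itertools import product

def variable_elimination(payoffs, agent_actions, order):
    """Nonserial dynamic programming on a coordination graph.
    payoffs: list of (scope, table) pairs, where scope is a tuple of agent ids
             and table maps local joint actions (tuples in scope order) to values.
    agent_actions: dict mapping agent id -> list of actions.
    order: elimination order covering every agent that appears in some scope."""
    factors = list(payoffs)
    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        # new scope: all other agents appearing in the involved factors
        new_scope = tuple(sorted({i for scope, _ in involved for i in scope} - {agent}))
        new_table = {}
        for assignment in product(*(agent_actions[i] for i in new_scope)):
            ctx = dict(zip(new_scope, assignment))
            # maximize over the eliminated agent's action given the context
            best = max(
                sum(table[tuple(ctx.get(i, a_i) for i in scope)]
                    for scope, table in involved)
                for a_i in agent_actions[agent]
            )
            new_table[assignment] = best
        factors.append((new_scope, new_table))
    # remaining factors have empty scopes; their values sum to the optimum
    return sum(table[()] for _, table in factors)
```

Running it on the four-agent coordination graph of Figure 8.1a, with its three payoff tables and any elimination order over the four agents, yields $\max_a u(a)$.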
The framework of CGs can be applicable to Dec-POMDPs when replacing CG-
actions by Dec-POMDP policies. That is, we can state a requirement that the value
function of a Dec-POMDP can be additively decomposed into local values:
$$V(\pi) = \sum_{e\in E} V^e(\pi^e), \qquad (8.1.1)$$
where $\pi^e = \langle \pi_{e_1},\dots,\pi_{e_{|e|}}\rangle$ is the profile of individual policies of agents that partici-
pate in edge e. The class of Dec-POMDPs that satisfy this requirement, also called
value-factored Dec-POMDPs [Kumar et al., 2011], can be trivially transformed to
CGs.
8.1.1.2 ND-POMDPs
Fig. 8.2: An example coordination graph for a seven-agent ND-POMDP (the sensor network problem), with local value components such as $V^1(\pi_1,\pi_2)$ and $V^3(\pi_3,\pi_4)$.
Typical motivating domains for ND-POMDPs include sensor network and target
tracking problems. Figure 8.2 shows an example coordination graph corresponding
to an ND-POMDP for the sensor network problem from Section 2.3 (see Figure 2.7).
There are six edges, each connecting two agents, which means that the reward func-
tion decomposes as follows:
$$R(s,a) = R^1(s_0,s_1,s_2,a_1,a_2) + R^2(s_0,s_2,s_3,a_2,a_3) + \dots + R^6(s_0,s_6,s_7,a_6,a_7).$$
Let Ni denote the neighbors of agent i, including i itself, in the interaction graph.
In an ND-POMDP, the local neighborhood utility of an agent i is the expected return
for all the edges that contain agent i:
$$V(\pi^{N_i}) = \sum_{e\in E \text{ s.t. } i\in e} V^e(\pi^e). \qquad (8.1.3)$$
It can be shown that when an agent $j \notin N_i$ changes its policy, $V(\pi^{N_i})$ is not affected, a property referred to as locality of interaction [Nair et al., 2005].
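The local neighborhood utility (8.1.3) is straightforward to compute from per-edge values; the sketch below assumes a hypothetical dictionary mapping each (hyper-)edge, given as a tuple of agent ids, to its value $V^e(\pi^e)$.

```python
def local_neighborhood_utility(agent, edge_values):
    """Sum the values of all (hyper-)edges that contain the given agent,
    i.e., V(pi^{N_i}) of Equation (8.1.3)."""
    return sum(value for edge, value in edge_values.items() if agent in edge)
```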
This locality of interaction is the crucial property that allows COP methods to
optimize more effectively.2 In particular, Nair et al. [2005] propose a globally opti-
mal algorithm (GOA), which essentially performs nonserial dynamic programming
on the interaction graph, as well as locally interacting distributed joint equilibrium
search for policies (LID-JESP), which combines the distributed breakout algorithm
(DBA) [Yokoo and Hirayama, 1996] with JESP (see Section 5.2.1). Kim et al. [2006]
extend this method to make use of a stochastic version of DBA, allowing for a
speedup in convergence. Both Varakantham et al. [2007b] and Marecki et al. [2008]
address the COP using heuristic search (for the finite and the infinite-horizon case
respectively). Kumar and Zilberstein [2009] extend MBDP (see Section 5.2.2) to
exploit COP methods in both the heuristic used to sample belief points and in com-
puting the best subtree for a sampled belief. Kumar et al. [2011] cast the planning
problem as an inference problem [Toussaint, 2009] and employ the expectation max-
imization (EM) algorithm (e.g., see Koller and Friedman, 2009) to solve the prob-
lem. Effectively, this approach decomposes the full inference problem into smaller
problems by using message passing to compute the local values (E-step) of each
factor and then combining the resulting solutions in the M-step. Dibangoye et al.
[2014] use the reduction to an NOMDP (discussed in Section 4.3) and extend the
solution method of Dibangoye et al. [2013] by exploiting the factored structure in
the resulting NOMDP value function.
We note that even with factored transitions and observations, a policy in an ND-
POMDP is a mapping from observation histories to actions (unlike in the transition-
and observation-independent Dec-MDP case, where policies are mappings from lo-
cal states to actions) and the worst-case complexity remains the same as in a regular
Dec-POMDP (NEXP-complete), thus implying doubly exponential complexity in
the horizon of the problem. While the worst-case complexity remains the same as in
the Dec-POMDP case, algorithms for ND-POMDPs are typically much more scal-
able in the number of agents in practice. Scalability can increase as the hyper-graph
becomes less connected.
2 The formula (8.1.3), which explains the term locality of interaction, and (8.1.1), which states the requirement that the value function can be decomposed into local components, each emphasize different aspects. Note, however, that one cannot exist without the other and that, therefore, these terms can be regarded as synonymous.
3Such a DBN that includes reward nodes is also called an influence diagram (ID) [Howard and
Matheson, 1984].
Fig. 8.3: The FireFightingGraph (FFG) problem: (a) an illustration of FFG; (b) the DBN, over state factors $x_1,\dots,x_4$, actions $a_1,\dots,a_3$, observations $o_1,\dots,o_3$ and rewards $r_1,\dots,r_4$ between stages t and t+1, that represents the transition, observation and reward function.
Figure 8.3b also includes the local reward functions, one per house. In particular,
for each house $1 \le H \le 4$ the rewards are specified by the fire level at the next time step: $r^H(x_H') = -x_H'$. We can reduce these rewards to ones of the form $R^H(s,a)$ as in Definition 2 by taking the expectation over $x_H'$. For instance, for house 1,
$$R^1(x_{\{1,2\}},a_1) = \sum_{x_1'} \Pr(x_1'|x_{\{1,2\}},a_1)\, r^1(x_1'),$$
where $x_{\{1,2\}}$ denotes $\langle x_1,x_2\rangle$. While FFG is a stylized example, such locally connected systems can be found in applications such as traffic control [Wu et al., 2013] or
communication networks [Ooi and Wornell, 1996, Hansen et al., 2004, Mahajan and
Mannan, 2014].
Value Functions for Factored Dec-POMDPs Factored Dec-POMDPs can ex-
hibit very local behavior. For instance, in FFG, the rewards for each house only
depend on a few local factors (both ‘nearby’ agents and state factors), so it seems
reasonable to expect that also in these settings we can exploit locality of interac-
tion. It turns out, however, that due to the absence of transition and observation
independence, the story for fDec-POMDPs is more subtle than for ND-POMDPs.
In particular, in this section we will illustrate this by examining how the Q-function
for a particular joint policy Qπ decomposes.
Similarly to typical use in MDPs and RL, the Q-function is defined as the value
of taking a joint action and following π subsequently (cf. the definition of V π by
(3.4.2)):
$Q^\pi(s_t,\bar o_t,a_t) \triangleq R(s_t,a_t)$ for the last stage $t = h-1$, and otherwise
$$Q^\pi(s_t,\bar o_t,a_t) \triangleq R(s_t,a_t) + \sum_{s_{t+1}}\sum_{o_{t+1}} \Pr(s_{t+1},o_{t+1}|s_t,a_t)\, Q^\pi(s_{t+1},\bar o_{t+1},\pi(\bar o_{t+1})).$$
Fig. 8.4: The interaction hyper-graph illustrating the factorization of $Q^\pi$ for FireFightingGraph at stage t = h − 1.
For other stages, t < h − 1, it turns out that—even without assuming transition
and observation independence—it is still possible to decompose the value function
as the sum of functions, one for each payoff component e. However, because actions,
factors and observations influence each other, the scopes of the components Qe grow
over time. This is illustrated in Figure 8.5, which shows the scope of Q1 , the value
component that represents the expected (R1 ) reward for house 1 at different stages
of a horizon h = 3 FFG problem. Even though the scope of the function R1 only
contains $\{x_1,x_2,a_1\}$, at earlier stages we need to include more variables, since they can affect $R^1_{h-1}$, the reward for house 1 at the last stage t = h − 1.
We can formalize these growing scopes using 'scope backup operators' for state factor scopes ($\Gamma^X$) and agent scopes ($\Gamma^A$). Given a set of variables V (which can
be either state factors or observations) from the right-hand side of the 2DBN, they
return the variables (respectively the state factor scope and agent scope) from the left-hand side of the 2DBN that are ancestors of V. This way, it is possible to specify a local value function component $Q^\pi_e$ whose recursion, analogously to the Q-function above, involves the expected next-stage term
$$\sum_{x_{e,t+1},\, o_{e,t+1}} \Pr\big(x_{e,t+1}, o_{e,t+1} \mid x_{\Gamma^X,t}, a_{\Gamma^A}\big)\, Q^\pi_e\big(x_{e,t+1}, \bar o_{e,t+1}, \pi^e(\bar o_{e,t+1})\big), \qquad (8.1.5)$$
Fig. 8.5: Illustration of the interaction between agents and environment over time
in FFG. In contrast to Figure 8.3b, which represents the transition and observation
model using abstract time steps t and t + 1, this figure represents the last 3 stages
of a decision problem. Also the rewards are omitted in this figure. The scope of Q1 ,
given by (8.1.4), is illustrated by shading and increases when going back in time.
where $x_{\Gamma^X,t}, a_{\Gamma^A}$ denote the state factors and the actions of the agents that are needed to predict the state factors ($x_{e,t+1}$) as well as the observations of the agents ($o_{e,t+1}$) in the next-stage scope.
The value of the joint policy at stage t then decomposes as
$$V_t(\pi) = \sum_{e\in E} V^e_t(\pi) = \sum_{e\in E} \sum_{x_{e,t}} \sum_{\bar o_{e,t}} \Pr(x_{e,t},\bar o_{e,t}|b^0,\pi)\, Q^\pi_e\big(x_{e,t},\bar o_{e,t},\pi^e(\bar o_{e,t})\big). \qquad (8.1.6)$$
When one or more components becomes ‘fully coupled’ (i.e., contains all agents
and state factors), technically, the value function still is additively factored. How-
ever, at this point the components are no longer local, which means that factorization
will no longer provide any benefits. Therefore, at this point the components can be
collapsed to form a non-factored value function.
When assuming transition and observation independence, as in ND-POMDPs,
the scopes do not grow: each variable si,t+1 (representing a local state for agent i) is
dependent only on si,t (its own value at the previous state), and each oi is dependent
only on si,t+1 . As a result, the interaction graph for such settings is stationary.
In the more general case, such a notion of locality of interaction over full-length
policies is not properly defined, because the interaction graph and hence an agent’s
neighborhood can be different at every stage. It is possible to define such a notion
for each particular stage t [Oliehoek et al., 2008c]. As a result, it is in theory possible
to exploit factorization in an exact manner. However, in practice it is likely that the
gains of such a method will be limited by the dense coupling for earlier stages.
A perhaps more promising idea is to exploit factored structure in approximate
algorithms. For instance, even though Figure 8.5 clearly shows there is a path from
x4 to the immediate reward associated with house 1, it might be the case that this
influence is only minor, in which case it might be possible to approximate Q1 with
a function with a smaller scope. This is the idea behind transfer planning [Oliehoek
et al., 2013b]: the value functions of abstracted source problems involving few state
factors and agents are used as the component value functions for a larger task. In
order to solve the constraint optimization problem more effectively, the approach
makes use of a specific constraint formulation for settings with imperfect informa-
tion [Oliehoek et al., 2012a]. Other examples are the methods by Pajarinen and Pel-
tonen [2011a] and Wu et al. [2013] that extend the EM method for Dec-POMDPs
by Kumar and Zilberstein [2010b] and Kumar et al. [2011] such that they can be ap-
plied to fDec-POMDPs. While these methods do not have guarantees, they can be
accompanied by methods to compute upper bounds for fDec-POMDPs, such that it
is still possible to get some information about their quality [Oliehoek et al., 2015a].
Fig. 8.6: Influence-based abstraction (IBA) and influence space search (IS). (a) IBA: for the purpose of computing best responses, the policies of each agent can be represented in a more abstract way as influences. (b) IS: given a joint influence point, each agent i can independently compute its best response (subject to the constraint of satisfying $I_{i\to j}$).
The best response for agent i depends on the probability that $\pi_j$ will lead to agent j successfully taking a picture from his side of the canyon. That is, the probability $p_j(foto|\pi_j)$ that j successfully takes the picture is $I_{j\to i}$, the influence that $\pi_j$ exerts on agent i, and similarly $p_i(foto|\pi_i)$ corresponds to $I_{i\to j}$.
for such problems, Becker et al. [2003] introduce the coverage-set algorithm, which
searches over the space of probabilities p j (not unlike the linear support algorithm
for POMDPs by Cheng [1988]) and computes all possible best response policies for
agent i. The resulting set, dubbed the coverage set, is subsequently used to compute
all the candidate joint policies (by computing the best responses of agent j) from
which the best joint policy is selected.
The above example illustrates the core idea of influence search. Since then, re-
search has considered widening the class of problems to which this approach can
be applied [Becker et al., 2004a, Varakantham et al., 2009, Witwicki and Durfee,
2010b, Velagapudi et al., 2011, Oliehoek et al., 2012b], leading to different defini-
tions of ‘influence’ as well as different ways of performing the influence search
[Varakantham et al., 2009, Petrik and Zilberstein, 2009, Witwicki et al., 2012,
Oliehoek et al., 2012b]. Currently, the broadest class of problems for which in-
fluence search has been defined is the so-called transition-decoupled POMDP (TD-
POMDP) [Witwicki and Durfee, 2010b], while IBA has also been defined for fac-
tored POSGs [Oliehoek et al., 2012b]. Finally, the concept of influence-based ab-
straction and influence search is conceptually similar to techniques that exploit be-
havioral equivalence in subjective (e.g., I-POMDP, cf. Section 2.4.6) planning ap-
proaches [Pynadath and Marsella, 2007, Rathnasabapathy et al., 2006, Zeng et al.,
2011]; the difference is that these approaches abstract classes of behavior down to
policies, whereas IBA abstracts policies down to even more abstract influences.
particular fires or regions. Again, the problem here is that it is not possible to guarantee
that the tasks ‘fight fire A’ and ‘fight fire B’ taken by two subsets of agents will end
simultaneously, and a synchronized higher level will therefore induce delays: the
amount of time Δt needs to be long enough to guarantee that all tasks are completed,
and if a task finishes sooner, agents will have to wait. In some cases, it may be
possible to overcome the worst effects of such delays by expanding the state space
(e.g., by representing how far the current task has progressed, in combination with
time step lengths Δt that are not too long). In many cases, however, more principled
methods are needed: the differences in task length may make the above approach
cumbersome, and there are no guarantees for the loss in quality it may lead to.
Moreover, in some domains (e.g., when controlling airplanes or underwater vehicles
that cannot stay in place) it is not possible to tolerate any delays.
One potential solution for MPOMDPs, i.e., settings where free communication is
available, is provided by Messias [2014], who describes multirobot decision making
as an event-driven process: in the Event-Driven MPOMDP framework, every time
that an agent finishes its task, an event occurs. This event is immediately broadcast
to other agents, who become aware of the new state of the environment and thus can
immediately react. Another advantage of this approach is that, since events are never
simultaneous (they occur in continuous time), the model does not
suffer from exponentially many joint observations. A practical difficulty for many
settings, however, is that the approach crucially depends on free and instantaneous
broadcast communication.
A recent approach to extending the Dec-POMDP model with macro-actions, or
temporally extended actions [Amato et al., 2014] does not assume such forms of
communication. The formulation models prespecified behaviors as high-level ac-
tions (the macro-actions) in order to deal with significantly longer planning hori-
zons. In particular, the approach extends the options framework of Sutton et al. [1999]
to Dec-POMDPs by using macro-actions, mi , that execute a policy in a low-level
Dec-POMDP until some terminal condition is met. We can then define policies over
macro-actions for each agent, μi , for choosing macro-actions that depend on ‘high-
level observations’ which are the termination conditions of the macro-actions (la-
beled with β in Figure 8.7).4 Because macro-action policies are built from primitive
actions, the value for high-level policies can be computed in a similar fashion, as
described in Section 3.4.
As described in that section, the value can be expressed as an expectation
$$V(m) = E\Big[\sum_{t=0}^{h-1} R(s_t,a_t) \,\Big|\, b^0, m\Big],$$
with the difference that the expectation now additionally is over macro-actions. In
terms of computing the expectation, it is possible to define a recursive equation
similar to (3.4.2) that explicitly deals with the cases that one or more macro-actions
4 More generally, high-level observations can be defined that depend on these terminal conditions
or underlying states.
Fig. 8.7: A set of policies for one agent with macro-actions $m_1$ and $m_2$ and terminal conditions denoted $\beta_s$.
terminate. For details, we refer the reader to Amato et al. [2014]. The goal is to
obtain a hierarchically optimal joint macro-policy. This is a joint macro-policy that
produces the highest expected value that can be obtained by sequencing the agents’
given macro-actions.
Two Dec-POMDP algorithms have been extended to consider macro-actions [Am-
ato et al., 2014], but other extensions are possible. The key difference (as shown in
Figure 8.7) is that nodes in a policy tree now select macro-actions (rather than prim-
itive actions) and edges correspond to terminal conditions. Macro-action methods
perform well in large domains when high-quality macro-actions are available. For
example, consider the CooperativeBoxPushing-inspired multirobot domain
shown in Figure 8.8. Here, robots were tasked with finding and retrieving boxes of
two different sizes: large and small. The larger boxes can only be moved effectively
by two robots and while the possible locations of boxes are known (in depots), the
depot contents (the number and type of boxes) are unknown. Macro-actions were
given that navigate to the depot locations, pick up the boxes, push them to the drop-
off area and communicate any available messages.
The resulting behavior is illustrated in Figure 8.8, which shows screen captures from a video of the macro-policies in action. It is worth noting that the problem here, represented as a flat Dec-POMDP, is much larger than the CooperativeBoxPushing benchmark problem (with over $10^9$ states), and still the approach
can generate high-quality solutions for a very long horizon [Amato et al., 2015b].
Additional approaches have extended these methods to controller-based solutions,
to automatically generate macro-actions, and to remove the need for a full model of
the underlying Dec-POMDP [Omidshafiei et al., 2015, Amato et al., 2015a].
Fig. 8.8: (a) One robot starts first and goes to depot 1 while the other robots begin moving towards the top middle. (b) The robot in depot 1 sees a large box, so it turns on the red light (the light is not shown). (c) The green robot sees the light first, turns it off, and goes to depot 1. The white robot goes to depot 2. (d) The robots in depot 1 push the large box and the robot in depot 2 pushes a small box to the goal.

8.3 Communication
The main focus of this book is the regular Dec-POMDP, i.e., the setting without ex-
plicitly modeled communication. Nevertheless, in a Dec-POMDP the agents might
very well communicate by means of their regular actions and regular observations.
For instance, if one agent can use a chalk to write a mark on a blackboard and other
agents have sensors to observe such a mark, the agents have a clear mechanism
to send information from one to another, i.e., to communicate. We refer to such
communication via regular (‘domain’) actions as implicit communication.5 Further-
more, communication actions can be added to the action sets of each agent and
communication observations can be added to the observation sets of each agent,
allowing communication to be modeled in the same way as other actions and obser-
vations (e.g., with noise, delay or signal loss).
Definition 31 (Implicit and Explicit Communication). When a multiagent deci-
sion framework has a separate set of communication actions, we say that it supports
explicit communication. Frameworks without explicit communication can and typ-
ically do still allow for implicit communication: the act of influencing the observa-
tions of one agent through the actions of another.
5 Goldman and Zilberstein [2004] refer to this as ‘indirect’ communication.
We point out that the notion of implicit communication is very general. In fact,
the only models that do not allow for implicit communication are precisely those
that impose both transition and observation independence, such as the transition-
and observation-independent Dec-MDP (Section 2.4.2) and the ND-POMDP (Sec-
tion 8.1.1.2).
That is, in this perspective the semantics of the communication actions become part
of the optimization problem. This problem is considered by [Xuan et al., 2001,
Goldman and Zilberstein, 2003, Spaan et al., 2006, Goldman et al., 2007, Amato
et al., 2015b].
One can also consider the case where messages have specified semantics. In such
a case the agents need a mechanism to process these semantics (i.e., to allow the
messages to affect their internal state or beliefs). For instance, as we already dis-
cussed in Section 2.4.3, in an MPOMDP the agents share their local observations.
Each agent maintains a joint belief and performs an update of this joint belief, rather
than maintaining the list of observations. In terms of the agent component of the
multiagent decision process, introduced in Section 2.4.4, this means that the belief
update function for each agent must be (at least partly) specified.7
It was shown by Pynadath and Tambe [2002] that for a Dec-POMDP-Com un-
der instantaneous, noise-free and cost-free communication, a joint communication
policy that shares the local observations at each stage (i.e., as in an MPOMDP) is
optimal. Since this also makes intuitive sense, much research has investigated shar-
ing local observations in models similar to the Dec-POMDP-Com [Pynadath and
Tambe, 2002, Nair et al., 2004, Becker et al., 2005, Roth et al., 2005a,b, Spaan et al.,
2006, Oliehoek et al., 2007, Roth et al., 2007, Goldman and Zilberstein, 2008]. The
next two subsections cover observation-sharing approaches that try to lift some of
the limiting assumptions: Section 8.3.2 allows for communication that is delayed
one or more time steps and Section 8.3.3 deals with the case where broadcasting the
local observation has nonzero costs.
Here we describe models, in which the agents can share their local observations
via noise-free and cost-free communication, but where this communication can be
delayed. That is, the assumption is that the synchronization by which each agent comes to know the local observations of the other agents takes one or more time steps.
In the one-step delayed (1-SD) communication setting, the agents can thus, at every stage, reason in terms of a CBG for the previous joint belief $b^{t-1}$ and joint action $a_{t-1}$, in which the type of each agent is its individual observation $o_{i,t}$. Note that these CBGs are different from the ones in Section 5.2.3: where
the latter had types corresponding to the entire observation histories of agents, the
CBGs considered here have types that correspond only to the last individual obser-
vation. Effectively, communication makes the earlier observations common infor-
mation and therefore they need not be modeled in the type anymore.
As a result, the optimal value for the 1-SD setting can be expressed recursively in terms of the values of these CBGs; we omit the precise expression here.
We now consider the setting of k-step delayed communication8 [Ooi and Wornell,
1996, Oliehoek, 2010, Nayyar et al., 2011, Oliehoek, 2013]. In this setting, at stage t
each agent knows $\bar\theta_{t-k}$, the joint action-observation history of k stages earlier,
and therefore can compute bt−k , the joint belief induced by θ̄ t−k . Again, bt−k is a
Markov signal, so no further history needs to be retained and bt−k takes the role of
b0 in the no-communication setting and bt−1 in the one-step delay setting. Indeed,
one-step delay is just a special case of the k-step delay setting.
In contrast to the one-step delayed communication case, the agents do not know
the last taken joint action. However, if we assume the agents know each other’s
policies, they do know $q^k_{t-k}$, the joint policy that has been executed during stages $t-k,\dots,t-1$.
Fig. 8.9: Subtree policies in a system with k = 2 steps delayed communication. Top: policies at t − k. The policies are extended by a joint BG-policy $\beta^k_t$, shown dashed. Bottom: the resulting policies after joint observation $\langle \dot o_1, o_2\rangle$.
Each agent thus knows how every agent has been mapping its recent observations to actions, but they do not know each other's individual observation histories since stage t − k. That is, they have uncertainty with respect to the length-k observation history $\bar o_{t,|k|} = (o_{t-k+1},\dots,o_t)$. Effectively, this means that the agents have to use a joint BG-policy $\beta^k_t = \langle \beta^k_{1,t},\dots,\beta^k_{n,t}\rangle$ that implicitly maps length-k observation histories to joint actions: $\beta^k_t(\bar o_{t,|k|}) = a_t$.
For example, let us assume that in the planning phase we computed a joint BG-
policy $\beta^k_t$ as indicated in the figure. As is shown, $\beta^k_t$ can be used to extend the subtree policy $q^k_{t-k}$ to form a longer subtree policy with $\tau = k+1$ stages-to-go. Each agent has knowledge of this extended joint subtree policy $q^{k+1}_{t-k} = q^k_{t-k} \circ \beta^k_t$. Consequently each agent i executes the action corresponding to its individual observation history $\beta^k_{i,t}(\bar o_{i,t,|k|}) = a_{i,t}$ and a transition occurs to stage t + 1. At that point each
agent receives a new observation oi,t+1 through perception and the joint observation
ot−k+1 through communication, it transmits its individual observation, and it com-
putes bt−k+1 . Now, all the agents know what action was taken at t − k and what the
following observation $o_{t-k+1}$ was. Therefore the agents know which part of $q^{k+1}_{t-k}$
has been executed during the last k stages t − k + 1, . . . ,t and they discard the part
not needed further; i.e., the joint observation ‘consumes’ part of the joint subtree
policy: $q^k_{t-k+1}$ is the subtree of $q^{k+1}_{t-k}$ that is reached by joint observation $o_{t-k+1}$ (see Definition 20).
Now the basic idea is that of Section 4.3.1: it is possible to define a plan-time
MDP where the augmented states at stage t correspond to the common knowledge (i.e., $\langle b^{t-k}, q^k_{t-k}\rangle$-pairs) and where actions correspond to joint BG-policies ($\beta^k_t$). Comparing to Section 4.3.1, the former correspond to $\varphi_t$ and the latter to $\delta_t$. Con-
sequently, it is possible to define the value function of this plan-time MDP in a very
similar way; see Oliehoek [2010] for details. Similarly to the development in Section 4.3.2, it turns out to be possible to replace the dependence on $\langle b^{t-k}, q^k_{t-k}\rangle$-pairs by a plan-time sufficient statistic $\sigma_t(s_t,\bar o_{t,|k|})$ over states and joint length-k observa-
tion histories [Oliehoek, 2013], which in turn allows for a centralized formulation (a
reduction to a POMDP), similar to the reformulation as an NOMDP of Section 4.5.
The approach can be further generalized to exploit any common information that
the agents might have [Nayyar et al., 2013, 2014].
Another way in which the strong assumptions of the MPOMDP can be relaxed is to
assume that communication, while instantaneous and noise-free, no longer is cost-
free. This can be a reasonable assumption in settings where the agents need to con-
serve energy (e.g., in robotic settings).
For instance, Becker et al. [2009] consider the question of when to communicate
in a transition- and observation-independent Dec-MDP augmented with the ‘sync’
communication model. In more detail, each time step is separated in a domain ac-
tion selected phase and a communication action selection phase. There are only two
communication actions: communicate and do not communicate. When at least one
agent chooses to communicate, synchronization is initiated and all agents partic-
ipate in synchronizing their knowledge; in this case the agents suffer a particular
communication cost.
Becker et al. [2009] investigate a myopic procedure for this setting: as long as
no agent chooses to synchronize, all agents follow a decentralized (i.e, Dec-MDP)
policy. At each stage, however, each agent estimates the value of communication—
the difference between the expected value when communicating and when staying
silent—by assuming that 1) in the future there will be no further possibility to com-
municate, and 2) other agents will not initiate communication. Since these assump-
tions introduce errors, Becker et al. also propose modified variants that mitigate
these errors. In particular, the first assumption is modified by proposing a method
to defer communicating if the value of communicating after one time step is higher
than that of communicating now. The second assumption is overcome by modeling
the myopic communication decision as a joint decision problem; essentially it is
modeled as a collaborative Bayesian game in which the actions are communicate or
do not communicate, while the types are the agents’ local states.
The previous two subsections focused on softening the strong assumptions that the
MPOMDP model makes with respect to communication. However, even in settings
where instantaneous, noise-free and cost-free communication is available, broad-
casting the individual observations to all other agents might not be feasible since
this scales poorly: with large numbers of agents the communication bandwidth may
become a problem. For instance, consider the setting of a Dec-MDP where each
agent i can observe a subset of state factors si (potentially overlapping with that of
other agents) that are the most relevant for its task. If all the agents can broadcast
their individual observation, each agent knows the complete state and the problem
reduces to that of an MMDP. When such broadcast communication is not possible,
however, the agents can still coordinate their actions using local coordination.
The main idea, introduced by Guestrin et al. [2002a] and Kok and Vlassis
[2006], is that we can approximate (without guarantees) the value function of an
MMDP using a factored value function (similar to the ND-POMDP discussed in
Section 8.1.1.2), which can be computed, for instance, via linear programming
[Guestrin et al., 2002a]. When we condition the resulting factored Q-function
$Q(s,a) \approx \sum_{i\in\mathcal{D}} Q_i(s_i,a_{N(i)})$ on the state, we implicitly define a coordination graph $u(\cdot) \triangleq Q(s,\cdot)$, which allows the agents to coordinate their action selection online via
message passing (e.g., using max-sum or NDP) that only uses local communication.
The crucial insight that allows this to be applicable to Dec-MDPs is that in order to
condition the local factor Qi on the current state s each agent only needs access to
its local state si : ui (·) = Qi (si ,·).
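As a rough illustration of this scheme (a sketch under our own assumptions about data structures; q_factors, neighborhoods and agent_actions are hypothetical, and the centralized enumeration merely stands in for the message-passing step):

import itertools

def condition_local_factors(q_factors, local_states):
    # Each agent i conditions its factor Q_i(s_i, a_{N(i)}) on its *own* local
    # state s_i, yielding a local payoff function u_i over neighborhood actions.
    return [lambda a_nbh, q=q, s=s: q(s, a_nbh)
            for q, s in zip(q_factors, local_states)]

def coordinate(u_factors, neighborhoods, agent_actions):
    # Select a joint action maximizing sum_i u_i(a_{N(i)}).  For clarity this
    # enumerates all joint actions; in a decentralized implementation the same
    # maximization is carried out by message passing on the coordination graph
    # (e.g., max-sum or nonserial dynamic programming), using only local
    # communication between neighboring agents.
    best_value, best_joint = float('-inf'), None
    for joint in itertools.product(*agent_actions):
        value = sum(u(tuple(joint[j] for j in neighborhoods[i]))
                    for i, u in enumerate(u_factors))
        if value > best_value:
            best_value, best_joint = value, joint
    return best_joint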
A somewhat related idea, introduced by Roth et al. [2007], is to minimize the
communication in a Dec-MDP by using the (exact or approximate) decision-tree
based solution of a factored MMDP. That is, certain solution methods for factored
MDPs (such as SPI [Boutilier et al., 2000], or SPUDD [Hoey et al., 1999]) produce
policies in the form of decision trees: the internal nodes specify state variables,
edges specify their values and the leaves specify the joint action to be taken for the
set of states corresponding to the path from root to leaf. Now the idea is that in some
cases (certain parts of) such a policy can be executed without communication even
if the agents observe different subsets of state variables. For instance, Figure 8.10a
shows the illustrative relay world in which two agents have a local state factor that
encodes their position. Each agent can perform ‘shuffle’ to randomly reset its loca-
tion, ‘exchange’ a packet (only useful when both agents are at the top of the square)
or do nothing (‘noop’). The optimal policy for agent 1 is shown in Figure 8.10b and
clearly demonstrates that requesting the location of the other agent via communica-
tion is only necessary when ŝ1 = L1 . This idea has also been extended to partially
observable environments [Messias et al., 2011].
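The execution scheme can be sketched as follows (our own illustration of the idea rather than the algorithm of Roth et al.; the tree classes and the query_other_agent callback are hypothetical). Communication is triggered only when the path through the tree branches on a variable the agent does not observe itself:

class Leaf:
    def __init__(self, action):
        self.action = action                 # (joint) action at this leaf

class Node:
    def __init__(self, variable, children):
        self.variable = variable             # state variable tested here
        self.children = children             # dict: variable value -> subtree

def execute_tree_policy(root, local_obs, query_other_agent):
    """Walk the decision-tree policy, requesting a state variable from the
    other agents only when the current path actually branches on it."""
    node = root
    while isinstance(node, Node):
        if node.variable in local_obs:
            value = local_obs[node.variable]          # observed locally
        else:
            value = query_other_agent(node.variable)  # communicate only now
        node = node.children[value]
    return node.action

In the relay world, agent 1's tree tests ŝ1 at the root; only the branch for ŝ1 = L1 contains a node testing ŝ2, so that is the only case in which communication is triggered.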
Fig. 8.10: The structure of the policy for a factored multiagent problem can be ex-
ploited to reduce communication requirements. (a) Relay World [Messias et al., 2011]
(reproduced with permission). (b) The optimal policy of agent 1 is a decision tree
over ŝ1 and ŝ2: when ŝ1 = L2 agent 1 shuffles, and when ŝ1 = L1 the choice between
'exchange' and 'noop' depends on ŝ2; thus the policy only depends on ŝ2 when ŝ1 = L1.

8.4 Reinforcement Learning
This book focuses on planning for Dec-POMDPs, i.e., settings where the model of
the environment is known at the start of the task. When this is not the case, we
step into the realm of multiagent reinforcement learning (MARL). In such settings,
the model will have to be learned online (model-based MARL) or the agents will
have to learn a solution directly without the use of a model (model-free methods).
While there is a great deal of work on MARL in general [Panait and Luke, 2005,
Buşoniu et al., 2008, Fudenberg and Levine, 2009, Tuyls and Weiss, 2012], MARL
in partially observable settings has received little attention.
One of the main reasons for this gap in the literature seems to be that it is hard
to properly define the setup of the reinforcement learning (RL) problem in these
partially observable environments with multiple agents. For instance, it is not clear
when or how the agents will observe the rewards.9 Moreover, even when the agents
can observe the state, general convergence of MARL under different assumptions is
not fully understood: from the perspective of one agent, the environment has become
nonstationary (since the other agent is also learning), which means that convergence
guarantees from single-agent RL no longer hold. Claus and Boutilier [1998] argue
that, in a cooperative setting, independent Q-learners are guaranteed to converge to
a local optimum (but not necessarily to the global optimal solution). Nevertheless,
this method has been reported to be successful in practice [e.g., Crites and Barto,
1998] and theoretical understanding of convergence of individual learners is pro-
gressing [e.g., Tuyls et al., 2006, Kaisers and Tuyls, 2010, Wunder et al., 2010].
There are coupled learning methods (e.g., Q-learning using the joint action space)
that will converge to an optimal solution [Vlassis, 2007]. However, the guarantees
of these methods rely on the fact that the global states can be observed by all agents.
In partially observable settings such guarantees have not yet been established. Nev-
ertheless, a recent approach to Bayesian RL (i.e., the setting where there is a prior
over models) in MPOMDPs demonstrates that learning in such settings is possible
and scales to a moderate number of agents [Amato and Oliehoek, 2015].

9 Even in a single-agent POMDP, the agent is not assumed to have access to the immediate rewards,
since they can convey hidden information about the states.
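For concreteness, a minimal independent Q-learner of the kind discussed above could look as follows (a generic sketch, not a specific published implementation). In a Dec-POMDP setting, obs would be an individual observation or observation history rather than a global state, which is precisely where the convergence guarantees break down:

import random
from collections import defaultdict

class IndependentQLearner:
    """Independent Q-learner (sketch): the agent treats all other agents as
    part of the environment and runs standard epsilon-greedy Q-learning on
    its own observations and the shared reward."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)           # maps (obs, action) -> value

    def act(self, obs):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, next_obs):
        # Standard Q-learning backup; because the other agents are learning
        # simultaneously, the effective environment is nonstationary from
        # this agent's point of view, so single-agent guarantees do not apply.
        best_next = max(self.q[(next_obs, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(obs, action)]
        self.q[(obs, action)] += self.alpha * td_error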
Relatively few MARL approaches are applicable in partially observable settings
where agents have only local observations. Peshkin et al. [2000] introduced decen-
tralized gradient ascent policy search (DGAPS), a gradient-based policy search
method for MARL in partially observable settings. DGAPS represents individual poli-
cies using finite-state controllers and assumes that agents observe the global re-
wards. Based on this, it is possible for each agent to independently update its policy
in the direction of the gradient with respect to the return, resulting in a locally opti-
mal joint policy. This approach was extended to learn policies for self-configurable
modular robots [Varshavskaya et al., 2008]. Chang et al. [2004] also consider de-
centralized RL assuming that the global rewards are available to the agents. In their
approach, these global rewards are interpreted as individual rewards, corrupted by
noise due to the influence of other agents. Each agent explicitly tries to estimate
the individual reward using Kalman filtering and performs independent Q-learning
using the filtered individual rewards.
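The gradient-based scheme can be illustrated with the following sketch, which is simplified in that it uses a memoryless softmax policy over individual observations rather than the finite-state controllers of DGAPS; the class and its parameters are our own hypothetical choices. Each agent performs a REINFORCE-style update of its own parameters using the commonly observed global return:

import numpy as np

class LocalPolicyGradientAgent:
    """Per-agent policy updated by a REINFORCE-style rule using the global
    return, in the spirit of decentralized gradient-based policy search."""

    def __init__(self, n_obs, n_actions, learning_rate=0.01, seed=None):
        self.theta = np.zeros((n_obs, n_actions))   # policy parameters
        self.lr = learning_rate
        self.rng = np.random.default_rng(seed)

    def _probs(self, obs):
        logits = self.theta[obs] - self.theta[obs].max()   # for stability
        e = np.exp(logits)
        return e / e.sum()

    def act(self, obs):
        return int(self.rng.choice(self.theta.shape[1], p=self._probs(obs)))

    def update(self, trajectory, global_return):
        # trajectory: list of (obs, action) pairs experienced by *this* agent.
        # Each agent ascends the gradient of its own log-likelihood, weighted
        # by the global return that all agents observe.
        for obs, action in trajectory:
            grad_log_pi = -self._probs(obs)
            grad_log_pi[action] += 1.0
            self.theta[obs] += self.lr * global_return * grad_log_pi

Because every agent follows the gradient of the same global return with respect to its own parameters, the joint update is a stochastic estimate of the joint policy gradient, consistent with the local-optimality result mentioned above.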
The methods by Wu et al. [2010b, 2013] are closely related to RL since they do
not require a complete model as input. They do, however, need access to a simulator which
can be initialized to specific states. Similarly, Banerjee et al. [2012] iteratively learn
policies for each agent using a sample-based version of the JESP algorithm where
communication is used to alert the other agents that learning has been completed.
Finally, there are MARL methods for partially observed decentralized settings
that require only limited amounts of communication. For instance, Boyan and
Littman [1993] considered decentralized RL for a packet routing problem. Their
approach, Q-routing, performs a type of Q-learning where there is only limited lo-
cal communication: neighboring nodes communicate the expected future waiting
time for a packet. Q-routing was extended to mobile wireless networks by Chang
and Ho [2004]. A similar problem, distributed task allocation, is considered by Ab-
dallah and Lesser [2007]. In this problem there is also a network, but now the agents
send tasks, rather than communication packets, to their neighbors. Again, commu-
nication is only local. This approach was extended to a hierarchical variant that
includes so-called supervisors [Zhang et al., 2010]. The supervisors can communi-
cate locally with other supervisors and with the agents they supervise (‘workers’).
Finally, in some RL methods for MMDPs (i.e., coupled methods) it is possible to
have agents observe a subset of state factors if they have the ability to communicate
locally [Guestrin et al., 2002b, Kok and Vlassis, 2006]. Such methods have been
used in RoboCup soccer [Kok and Vlassis, 2005] and traffic control [Kuyer et al.,
2008].
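Returning to Q-routing, the update itself is compact; the sketch below (with a hypothetical class structure of our own) maintains, per destination, an estimate of the remaining delivery time via each neighbor and updates it using the only communicated quantity, the chosen neighbor's own estimate:

from collections import defaultdict

class QRoutingNode:
    """One node in Q-routing (sketch).  q[dest][nbr] estimates the remaining
    delivery time of a packet for destination dest when forwarded to nbr."""

    def __init__(self, neighbors, alpha=0.5):
        self.neighbors = list(neighbors)
        self.alpha = alpha
        self.q = defaultdict(lambda: defaultdict(float))

    def choose_next_hop(self, dest):
        return min(self.neighbors, key=lambda n: self.q[dest][n])

    def estimate(self, dest):
        # Estimate reported back to an upstream neighbor when a packet arrives.
        return min(self.q[dest][n] for n in self.neighbors)

    def update(self, dest, next_hop, queue_delay, transmission_delay,
               neighbor_estimate):
        # Temporal-difference update toward the locally measured delays plus
        # the downstream neighbor's estimate of the remaining delivery time.
        old = self.q[dest][next_hop]
        target = queue_delay + transmission_delay + neighbor_estimate
        self.q[dest][next_hop] = old + self.alpha * (target - old)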
Chapter 9
Conclusion
This book gives an overview of the research performed since the early 2000s on de-
cision making for multiagent systems under uncertainty. In particular, it focuses on
the decentralized POMDP (Dec-POMDP) model, which is a general framework for
modeling multiagent systems in settings that are both stochastic (i.e., the outcome
of actions is uncertain) and partially observable (i.e., the state is uncertain). The core
distinction between a Dec-POMDP and a (centralized) POMDP is that the execu-
tion phase is decentralized: each agent can only use its own observations to select its
actions. This characteristic significantly changes the problem: there is no longer a
compact sufficient statistic (or ‘belief’) that the agents can use to select actions, and
the worst-case complexity of solving a Dec-POMDP is higher (NEXP-complete for
the finite-horizon case). Such decentralized settings are important because they oc-
cur in many real-world applications, ranging from sensor networks to robotic teams.
Moreover, in many of these settings dealing with uncertainty in a principled man-
ner is important (e.g., avoiding critical failures while dealing with noisy sensors in
robotics problems, or minimizing delays (and thus economic cost) due to traffic conges-
tion while anticipating low-probability events that could lead to large disruptions).
As such, Dec-POMDPs are a crucial framework for decision making in cooperative
multiagent settings.
This book provides an overview of planning methods for both finite-horizon and
infinite-horizon settings (which proceed for a finite or infinite number of time steps,
respectively). Solution methods are provided that 1) are exact, 2) have some guaran-
tees, or 3) are heuristic (have no guarantees but work well on larger benchmark do-
mains). We also sketched some of the main lines of research that are currently being
pursued: exploiting structure to increase scalability, employing hierarchical models,
making more realistic assumptions with respect to communication, and dealing with
settings where the model is not perfectly known in advance.
There are many big questions left to be answered in planning for Dec-POMDPs
and we expect that much future research will continue to investigate these topics.
In particular, the topics treated in Chapter 8 (exploiting structured models, hierar-
chical approaches, more versatile communication models, and reinforcement learn-
ing) all have seen quite significant advances in just the last few years. In parallel,
due to some of the improvements in scalability, we see that the field is starting to
shift from toy problems to benchmarks that, albeit still simplified, are motivated by
real-world settings. Examples of such problems are settings for traffic control [Wu
et al., 2013], communication network control [Winstein and Balakrishnan, 2013]
and demonstrations on real multirobot systems [Emery-Montemerlo et al., 2005,
Amato et al., 2015b]. We are hopeful that progress in these domains will inspire new
ideas, and will attract the attention of both researchers and practitioners, thus lead-
ing to an application-driven influx of ideas to complement the traditionally theory-
driven community studying these problems.
References
K. Hsu and S. Marcus. Decentralized control of finite state Markov processes. IEEE
Transactions on Automatic Control, 27(2):426–431, 1982.
M. N. Huhns, editor. Distributed Artificial Intelligence. Pitman Publishing, 1987.
N. R. Jennings. Controlling cooperative problem solving in industrial multi-agent
systems using joint intentions. Artificial Intelligence, 75(2):195–240, 1995.
N. R. Jennings. Agent-based computing: Promise and perils. In Proceedings of the
Sixteenth International Joint Conference on Artificial Intelligence, pages 1429–
1436, 1999.
L. P. Kaelbling and T. Lozano-Pérez. Integrated task and motion planning in belief
space. The International Journal of Robotics Research, 32(9-10):1194–1227,
2013.
L. P. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4:237–285, 1996.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially
observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
M. Kaisers and K. Tuyls. Frequency adjusted multi-agent Q-learning. In Proceed-
ings of the International Conference on Autonomous Agents and Multiagent Sys-
tems, pages 309–316, 2010.
R. E. Kalman. A new approach to linear filtering and prediction problems. Trans-
actions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
M. J. Kearns. Graphical games. In N. Nisan, T. Roughgarden, E. Tardos, and
V. Vazirani, editors, Algorithmic Game Theory. Cambridge University Press,
2007.
M. J. Kearns, M. L. Littman, and S. P. Singh. Graphical models for game theory. In
Proceedings of Uncertainty in Artificial Intelligence, pages 253–260, 2001.
K. K. Khedo, R. Perseedoss, and A. Mungur. A wireless sensor network air pollution
monitoring system. International Journal of Wireless & Mobile Networks, 2(2):
31–45, 2010.
Y. Kim, R. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Exploiting locality of
interaction in networked distributed POMDPs. In Proceedings of the AAAI Spring
Symposium on Distributed Plan and Schedule Management, pages 41–48, 2006.
D. Kinny and M. Georgeff. Modelling and design of multi-agent systems. In In-
telligent Agents III Agent Theories, Architectures, and Languages, pages 1–20.
Springer, 1997.
H. Kitano, S. Tadokoro, I. Noda, H. Matsubara, T. Takahashi, A. Shinjoh, and S. Shi-
mada. RoboCup Rescue: Search and rescue in large-scale disasters as a domain
for autonomous agents research. In Proceedings of the International Conference
on Systems, Man and Cybernetics, pages 739–743, 1999.
M. J. Kochenderfer, C. Amato, G. Chowdhary, J. P. How, H. J. D. Reynolds, J. R.
Thornton, P. A. Torres-Carrasquillo, N. K. Üre, and J. Vian. Decision Making
Under Uncertainty: Theory and Application. MIT Press, 2015.
J. R. Kok and N. Vlassis. Using the max-plus algorithm for multiagent decision
making in coordination graphs. In RoboCup-2005: Robot Soccer World Cup IX,
pages 1–12, 2005.
S. Mannor, R. Rubinstein, and Y. Gat. The cross entropy method for fast policy
search. In Proceedings of the International Conference on Machine Learning,
pages 512–519, 2003.
J. Marecki, T. Gupta, P. Varakantham, M. Tambe, and M. Yokoo. Not all agents are
equal: Scaling up distributed POMDPs for agent networks. In Proceedings of the
International Conference on Autonomous Agents and Multiagent Systems, pages
485–492, 2008.
R. Marinescu and R. Dechter. AND/OR branch-and-bound search for combinatorial
optimization in graphical models. Artificial Intelligence, 173(16-17):1457–1491,
2009.
J. Marschak. Elements for a theory of teams. Management Science, 1:127–137,
1955.
J. Marschak and R. Radner. Economic Theory of Teams. Yale University Press,
1972.
J. V. Messias. Decision-Making under Uncertainty for Real Robot Teams. PhD
thesis, Institute for Systems and Robotics, Instituto Superior Técnico, 2014.
J. V. Messias, M. T. J. Spaan, and P. U. Lima. Efficient offline communication policies
for factored multiagent POMDPs. In Advances in Neural Information Processing
Systems 24, pages 1917–1925, 2011.
N. Meuleau, K. Kim, L. P. Kaelbling, and A. R. Cassandra. Solving POMDPs by
searching the space of finite policies. In Proceedings of the Fifteenth Conference
on Uncertainty in Artificial Intelligence, pages 417–426, 1999a.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state con-
trollers for partially observable environments. In Proceedings of the Fifteenth
Conference on Uncertainty in Artificial Intelligence, pages 427–436, 1999b.
P. J. Modi, W.-M. Shen, M. Tambe, and M. Yokoo. Adopt: Asynchronous distributed
constraint optimization with quality guarantees. Artificial Intelligence, 161:149–
180, 2005.
H. Mostafa and V. Lesser. Offline planning for communication by exploit-
ing structured interactions in decentralized MDPs. In Proceedings of 2009
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent
Agent Technology, pages 193–200, 2009.
H. Mostafa and V. Lesser. A compact mathematical formulation for problems with
structured agent interactions. In Proceedings of the AAMAS Workshop on Multi-
Agent Sequential Decision Making in Uncertain Domains (MSDM), pages 55–62,
2011a.
H. Mostafa and V. R. Lesser. Compact mathematical programs for DEC-MDPs with
structured agent interactions. In Proceedings of the Twenty-Seventh Conference
on Uncertainty in Artificial Intelligence, pages 523–530, 2011b.
R. Nair and M. Tambe. Hybrid BDI-POMDP framework for multiagent teaming.
Journal of Artificial Intelligence Research, 23:367–420, 2005.
R. Nair, M. Tambe, and S. Marsella. Team formation for reformation. In Proceed-
ings of the AAAI Spring Symposium on Intelligent Distributed and Embedded
Systems, pages 52–56, 2002.
N. Roy, G. Gordon, and S. Thrun. Planning under uncertainty for reliable health care
robotics. In Proceedings of the International Conference on Field and Service
Robotics, pages 417–426, 2003.
S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson
Education, 3rd edition, 2009.
Y. Satsangi, S. Whiteson, and F. A. Oliehoek. Exploiting submodular value func-
tions for faster dynamic sensor selection. In Proceedings of the Twenty-Ninth
AAAI Conference on Artificial Intelligence, pages 3356–3363, 2015.
S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for DEC-
POMDPs. In Proceedings of the International Joint Conference on Artificial
Intelligence, pages 2009–2015, 2007a.
S. Seuken and S. Zilberstein. Improved memory-bounded dynamic programming
for decentralized POMDPs. In Proceedings of Uncertainty in Artificial Intelli-
gence, pages 344–351, 2007b.
S. Seuken and S. Zilberstein. Formal models and algorithms for decentralized deci-
sion making under uncertainty. Journal of Autonomous Agents and Multi-Agent
Systems, 17(2):190–250, 2008.
J. Shen, R. Becker, and V. Lesser. Agent interaction in distributed MDPs and its
implications on complexity. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems, pages 529–536, 2006.
E. A. Shieh, A. X. Jiang, A. Yadav, P. Varakantham, and M. Tambe. Unleashing Dec-
MDPs in security games: Enabling effective defender teamwork. In Proceedings
of the Twenty-First European Conference on Artificial Intelligence, pages 819–
824, 2014.
Y. Shoham and K. Leyton-Brown. Multi-Agent Systems. Cambridge University
Press, 2007.
M. P. Singh. Multiagent Systems: A Theoretical Framework for Intentions, Know-
How, and Communications. Springer, 1994.
S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in
partially observable Markovian decision processes. In Proceedings of the Inter-
national Conference on Machine Learning, pages 284–292, 1994.
T. Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, The Robotics
Institute, Carnegie Mellon University, 2007.
T. Smith and R. G. Simmons. Heuristic search value iteration for POMDPs. In
Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence,
pages 520–527, 2004.
M. T. J. Spaan. Partially observable Markov decision processes. In M. Wiering and
M. van Otterlo, editors, Reinforcement Learning: State of the Art, pages 387–414.
Springer, 2012.
M. T. J. Spaan and F. S. Melo. Interaction-driven Markov games for decentralized
multiagent planning under uncertainty. In Proceedings of the International Con-
ference on Autonomous Agents and Multiagent Systems, pages 525–532, 2008.
M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for
POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
J. von Neumann and O. Morgenstern. The Theory of Games and Economic Behav-
ior. Princeton University Press, 1944.
M. de Weerdt and B. Clement. Introduction to planning in multiagent systems.
Multiagent and Grid Systems, 5(4):345–355, 2009.
M. de Weerdt, A. ter Mors, and C. Witteveen. Multi-agent planning: An introduc-
tion to planning and coordination. In Handouts of the European Agent Summer
School, pages 1–32, 2005.
G. Weiss, editor. Multiagent Systems: A Modern Approach to Distributed Artificial
Intelligence. MIT Press, 1999.
G. Weiss, editor. Multiagent Systems. MIT Press, 2nd edition, 2013.
M. Wiering. Multi-agent reinforcement learning for traffic light control. In Proceed-
ings of the International Conference on Machine Learning, pages 1151–1158,
2000.
M. Wiering and M. van Otterlo, editors. Reinforcement Learning: State of the Art.
Adaptation, Learning, and Optimization. Springer, 2012.
M. Wiering, J. Vreeken, J. van Veenen, and A. Koopman. Simulation and optimiza-
tion of traffic in a city. In IEEE Intelligent Vehicles Symposium, pages 453–458,
2004.
A. J. Wiggers, F. A. Oliehoek, and D. M. Roijers. Structure in the value function
of zero-sum games of incomplete information. In Proceedings of the AAMAS
Workshop on Multi-Agent Sequential Decision Making in Uncertain Domains
(MSDM), 2015.
K. Winstein and H. Balakrishnan. TCP ex machina: Computer-generated congestion
control. In SIGCOMM, pages 123–134, 2013.
H. S. Witsenhausen. Separation of estimation and control for discrete time systems.
Proceedings of the IEEE, 59(11):1557–1566, 1971.
S. Witwicki and E. Durfee. From policies to influences: A framework for nonlocal
abstraction in transition-dependent Dec-POMDP agents. In Proceedings of the
International Conference on Autonomous Agents and Multiagent Systems, pages
1397–1398, 2010a.
S. Witwicki, F. A. Oliehoek, and L. P. Kaelbling. Heuristic search of multiagent
influence space. In Proceedings of the Eleventh International Conference on
Autonomous Agents and Multiagent Systems, pages 973–981, 2012.
S. J. Witwicki. Abstracting Influences for Efficient Multiagent Coordination Under
Uncertainty. PhD thesis, University of Michigan, 2011.
S. J. Witwicki and E. H. Durfee. Flexible approximation of structured interactions
in decentralized Markov decision processes. In Proceedings of the International
Conference on Autonomous Agents and Multiagent Systems, pages 1251–1252,
2009.
S. J. Witwicki and E. H. Durfee. Influence-based policy abstraction for weakly-
coupled Dec-POMDPs. In Proceedings of the International Conference on Auto-
mated Planning and Scheduling, pages 185–192, 2010b.
S. J. Witwicki and E. H. Durfee. Towards a unifying characterization for quantify-
ing weak coupling in Dec-POMDPs. In Proceedings of the Tenth International
Conference on Autonomous Agents and Multiagent Systems, pages 29–36, 2011.