A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI: 10.2200/S00091ED1V01Y200705AIM002
Lecture #2
Series Editors: Ronald Brachman, Yahoo! Research and Thomas G. Dietterich, Oregon State University
First Edition
10 9 8 7 6 5 4 3 2 1
A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence

Nikos Vlassis
Department of Production Engineering and Management
Technical University of Crete
Greece

Morgan & Claypool Publishers
ABSTRACT
Multiagent systems is an expanding field that blends classical fields like game theory and
decentralized control with modern fields like computer science and machine learning. This
monograph provides a concise introduction to the subject, covering the theoretical foundations
as well as more recent developments in a coherent and readable manner.
The text is centered on the concept of an agent as a decision maker. Chapter 1 is a short
introduction to the field of multiagent systems. Chapter 2 covers the basic theory of single-
agent decision making under uncertainty. Chapter 3 is a brief introduction to game theory,
explaining classical concepts like Nash equilibrium. Chapter 4 deals with the fundamental
problem of coordinating a team of collaborative agents. Chapter 5 studies the problem of
multiagent reasoning and decision making under partial observability. Chapter 6 focuses on
the design of protocols that are stable against manipulations by self-interested agents. Chapter
7 provides a short introduction to the rapidly expanding field of multiagent reinforcement
learning.
The material can be used for teaching a half-semester course on multiagent systems
covering, roughly, one chapter per lecture.
Nikos Vlassis is Assistant Professor at the Department of Production Engineering and
Management at the Technical University of Crete, Greece. His email is [email protected]
KEYWORDS
Multiagent Systems, Distributed Artificial Intelligence, Game Theory, Decision Making under
Uncertainty, Coordination, Knowledge and Information, Mechanism Design, Reinforcement
Learning.
Contents

Preface

1. Introduction
   1.1 Multiagent Systems and Distributed AI
   1.2 Characteristics of Multiagent Systems
       1.2.1 Agent Design
       1.2.2 Environment
       1.2.3 Perception
       1.2.4 Control
       1.2.5 Knowledge
       1.2.6 Communication
   1.3 Applications
   1.4 Challenging Issues
   1.5 Notes and Further Reading

2. Rational Agents
   2.1 What is an Agent?
   2.2 Agents as Rational Decision Makers
   2.3 Observable Worlds and the Markov Property
       2.3.1 Observability
       2.3.2 The Markov Property
   2.4 Stochastic Transitions and Utilities
       2.4.1 From Goals to Utilities
       2.4.2 Decision Making in a Stochastic World
       2.4.3 Example: A Toy World
   2.5 Notes and Further Reading

3. Strategic Games
   3.1 Game Theory
   3.2 Strategic Games
   3.3 Iterated Elimination of Dominated Actions
   3.4 Nash Equilibrium
   3.5 Notes and Further Reading

6. Mechanism Design
   6.1 Self-Interested Agents
   6.2 The Mechanism Design Problem
       6.2.1 Example: An Auction
   6.3 The Revelation Principle
       6.3.1 Example: Second-price Sealed-bid (Vickrey) Auction
   6.4 The Vickrey–Clarke–Groves Mechanism
       6.4.1 Example: Shortest Path
   6.5 Notes and Further Reading

7. Learning
   7.1 Reinforcement Learning
   7.2 Markov Decision Processes
       7.2.1 Value Iteration
       7.2.2 Q-learning
   7.3 Markov Games
       7.3.1 Independent Learning
       7.3.2 Coupled Learning
       7.3.3 Sparse Cooperative Q-learning
   7.4 The Problem of Exploration
   7.5 Notes and Further Reading

Bibliography

Author Biography
Preface
This monograph is based on a graduate course on multiagent systems that I have taught at
the University of Amsterdam, The Netherlands, from 2003 until 2006. This is the revised
version of an originally unpublished manuscript that I wrote in 2003 and used as lecture notes.
Since then the field has grown tremendously, and a large body of new literature has become
available. Encouraged by the positive feedback I have received all these years from students and
colleagues, I decided to compile this new, revised and up-to-date version.
Multiagent systems is a subject that has received much attention lately in science and
engineering. It is a subject that blends classical fields like game theory and decentralized con-
trol with modern fields like computer science and machine learning. In the monograph I
have tried to translate several of the concepts that appear in the above fields into a coherent
and comprehensive framework for multiagent systems, aiming at keeping the text at a rela-
tively introductory level without compromising its consistency or technical rigor. There are no mathematical prerequisites for the text; the covered material should be self-contained.
The text is centered on the concept of an agent as a decision maker. Chapter 1 is an
introductory chapter on multiagent systems. Chapter 2 addresses the problem of single-agent
decision making, introducing the concepts of a Markov state and utility function. Chapter 3
is a brief introduction to game theory, in particular strategic games, describing classical solu-
tion concepts like iterated elimination of dominated actions and Nash equilibrium. Chapter 4
focuses on collaborative multiagent systems, and deals with the problem of multiagent co-
ordination; it includes some standard coordination techniques like social conventions, roles,
and coordination graphs. Chapter 5 examines the case where the perception of the agents
is imperfect, and what consequences this may have in the reasoning and decision making
of the agents; it deals with the concepts of information, knowledge, and common knowl-
edge, and presents the model of a Bayesian game for multiagent decision making under
partial observability. Chapter 6 deals with the problem of how to develop protocols that
are nonmanipulable by a group of self-interested agents, discussing the revelation principle
and the Vickrey–Clarke–Groves (VCG) mechanism. Finally, Chapter 7 is a short introduction to reinforcement learning, which allows the agents to learn how to make good decisions;
it covers the models of Markov decision processes and Markov games, and the problem of
exploration.
Nikos Vlassis
Chania, March 2007
CHAPTER 1
Introduction
In this chapter we give a brief introduction to multiagent systems, discuss how they differ from single-agent systems, and outline possible applications and challenging issues for research.
behaviors are often called heterogeneous, in contrast to homogeneous agents that are designed
in an identical way and have a priori the same capabilities. Agent heterogeneity can affect all
functional aspects of an agent from perception to decision making.
1.2.2 Environment
Agents have to deal with environments that can be either static or dynamic (change with time).
Most existing AI techniques for single agents have been developed for static environments
because these are easier to handle and allow for a more rigorous mathematical treatment. In
a MAS, the mere presence of multiple agents makes the environment appear dynamic from
the point of view of each agent. This can often be problematic, for instance in the case of
concurrently learning agents where unstable behavior can be observed. There is also the
issue of which parts of a dynamic environment an agent should treat as other agents and which
not. We will discuss some of these issues in Chapter 7.
1.2.3 Perception
The collective information that reaches the sensors of the agents in a MAS is typically dis-
tributed: the agents may observe data that differ spatially (appear at different locations), tem-
porally (arrive at different times), or semantically (require different interpretations). The fact
that agents may observe different things makes the world partially observable to each agent,
which has various consequences in the decision making of the agents. For instance, optimal
multiagent planning under partial observability can be an intractable problem. An additional
issue is sensor fusion, that is, how the agents can optimally combine their perceptions in order
to increase their collective knowledge about the current state. In Chapter 5 we will discuss some
of the above in more detail.
1.2.4 Control
In contrast to single-agent systems, the control in a MAS is typically decentralized. This means
that the decision making of each agent lies to a large extent within the agent itself. Decentralized
control is preferred over centralized control (which relies on a central controller) for reasons of robustness
and fault-tolerance. However, not all MAS protocols can be easily distributed, as we will see
in Chapter 6. The general problem of multiagent decision making is the subject of game theory
which we will briefly cover in Chapter 3. In a collaborative or team MAS where the agents
share the same interests, distributed decision making offers asynchronous computation and
speedups, but it also has the downside that appropriate coordination mechanisms need to be
additionally developed. Chapter 4 is devoted to the topic of multiagent coordination.
1.2.5 Knowledge
In single-agent systems we typically assume that the agent knows its own actions but not
necessarily how the world is affected by its actions. In a MAS, the levels of knowledge of
each agent about the current world state can differ substantially. For example, in a team MAS
involving two homogeneous agents, each agent may know the available action set of the other
agent, both agents may know (by communication) their current perceptions, or they can infer
the intentions of each other based on some shared prior knowledge. On the other hand, an
agent that observes an adversarial team of agents will typically be unaware of their action sets
and their current perceptions, and might also be unable to infer their plans. In general, in a
MAS each agent must also consider the knowledge of each other agent in its decision making.
In Chapter 5 we will discuss the concept of common knowledge, according to which every
agent knows a fact, every agent knows that every other agent knows this fact, and so on.
1.2.6 Communication
Interaction is often associated with some form of communication. Typically we view communi-
cation in a MAS as a two-way process, where all agents can potentially be senders and receivers
of messages. Communication can be used in several cases, for instance, for coordination among
cooperative agents or for negotiation among self-interested agents. This additionally raises
the issue of what network protocols to use in order for the exchanged information to arrive
safely and in a timely manner, and what language the agents must speak in order to understand each other
(especially, if they are heterogeneous). We will see throughout the book several examples of
multiagent protocols involving communication.
1.3 APPLICATIONS
Just as with single-agent systems in traditional AI, it is difficult to anticipate the full range of
applications where MASs can be used. Some applications have already appeared; we mention a few examples below.
A very challenging application domain for MAS technology is the Internet. Today the
Internet has developed into a highly distributed open system where heterogeneous software
agents come and go, there are no well-established protocols or languages on the ‘agent level’
(higher than TCP/IP), and the structure of the network itself keeps on changing. In such an
environment, MAS technology can be used to develop agents that act on behalf of a user and
are able to negotiate with other agents in order to achieve their goals. Electronic commerce
and auctions are such examples (Cramton et al., 2006, Noriega and Sierra, 1999). One can also
think of applications where agents can be used for distributed data mining and information
retrieval (Kowalczyk and Vlassis, 2005, Symeonidis and Mitkas, 2006).
Other applications include sensor networks, where the challenge is to efficiently al-
locate resources and compute global quantities in a distributed fashion (Lesser et al., 2003,
Paskin et al., 2005); social sciences, where MAS technology can be used for studying in-
teractivity and other social phenomena (Conte and Dellarocas, 2001, Gilbert and Doran,
1994); robotics, where typical applications include distributed localization and decision mak-
ing (Kok et al., 2005, Roumeliotis and Bekey, 2002); artificial life and computer games, where
the challenge is to build agents that exhibit intelligent behavior (Adamatzky and Komosinski,
2005, Terzopoulos, 1999).
A recent popular application of MASs is robot soccer, where teams of real or simulated
autonomous robots play soccer against each other (Kitano et al., 1997). Robot soccer provides
a testbed where MAS algorithms can be tested, and where many real-world characteristics
are present: the domain is continuous and dynamic, the behavior of the opponents may be
difficult to predict, there is uncertainty in the sensor signals, etc. A related application is robot
rescue, where teams of simulated or real robots must explore an unknown environment in
order to discover victims, extinguish fires, etc. Both applications are organized by the RoboCup
Federation (www.robocup.org).
Clearly the above problems are interdependent and their solutions may affect each other.
For example, a distributed planning algorithm may require a particular coordination mechanism,
learning can be guided by the organizational structure of the agents, and so on. In the
following chapters we will try to provide answers to some of the above questions.
CHAPTER 2
Rational Agents
In this chapter we describe what a rational agent is, we investigate some characteristics of
an agent’s environment like observability and the Markov property, and we examine what is
needed for an agent to behave optimally in an uncertain world where actions do not always
have the desired effects.
1. In this chapter we will use ‘it’ to refer to an agent, to emphasize that we are talking about computational entities.
A function

    π(θ_0, a_0, θ_1, a_1, . . . , θ_t) = a_t    (2.1)

that in principle would require mapping the complete history of observation–action pairs up to
time t to an optimal action a_t, is called the policy of the agent.
As long as we can find a function π that implements the above mapping, the part of
optimal decision making that refers to the past is solved. However, defining and implementing
such a function is problematic; the complete history can consist of a very large (even infinite)
number of observation–action pairs, which can vary from one task to another. Merely storing all
observations would require very large memory, aside from the computational cost for actually
computing π .
This fact calls for simpler policies. One possibility is for the agent to ignore all its percept
history except for the last observation θ_t. In this case its policy takes the form

    π(θ_t) = a_t    (2.2)
which is a mapping from the current observation of the agent to an action. An agent that simply
maps its current observation θt to a new action a t , thus effectively ignoring the past, is called a
reflex agent, and its policy (2.2) is called reactive or memoryless. A natural question to ask is
how successful a reflex agent can be. As we will see next, for a particular class of environments
a reflex agent can do pretty well.
each other, and other parameters that are relevant to the decision making of the agents like the
elapsed time since the game started, etc.
Depending on the nature of the problem, a world can be either discrete or continuous.
A discrete world can be characterized by a finite number of states, like the possible board
configurations in a chess game. A continuous world can have infinitely many states, like the
possible configurations of a point robot that translates freely on the plane, in which case S = ℝ².
Most of the existing AI techniques have been developed for discrete worlds, and this will be
our main focus as well.
2.3.1 Observability
A fundamental property that characterizes a world from the point of view of an agent is related
to the perception of the agent. We will say that the world is (fully) observable to an agent if
the current observation θt of the agent completely reveals the current state of the world, that is,
s t = θt . On the other hand, in a partially observable world the current observation θt of the
agent provides only partial information about the current state s t in the form of a deterministic
or stochastic observation model, for instance a conditional probability distribution p(s t |θt ).
The latter would imply that the current observation θt does not fully reveal the true world
state, but to each state s_t the agent assigns probability p(s_t|θ_t) that s_t is the true state (with
0 ≤ p(s_t|θ_t) ≤ 1 and Σ_{s_t ∈ S} p(s_t|θ_t) = 1). Here we treat s_t as a random variable that can take
all possible values in S. The stochastic coupling between s_t and θ_t may alternatively be defined
by an observation model in the form p(θ_t|s_t), and a posterior state distribution p(s_t|θ_t) can be
computed from a prior distribution p(s_t) using the Bayes rule:

    p(s_t|θ_t) = p(θ_t|s_t) p(s_t) / p(θ_t).    (2.3)
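As a minimal illustration of (2.3), the following Python sketch computes a posterior state distribution from an observation model p(θ|s) and a prior p(s) over a discrete state set; the particular states and probability values are hypothetical.

    # Posterior state distribution p(s|theta) from an observation model p(theta|s)
    # and a prior p(s), as in (2.3). All numbers below are hypothetical.
    def posterior(likelihood, prior):
        # Unnormalized posterior p(theta|s) p(s), then normalize by p(theta).
        unnorm = {s: likelihood[s] * prior[s] for s in prior}
        p_theta = sum(unnorm.values())
        return {s: v / p_theta for s, v in unnorm.items()}

    prior = {'s1': 0.5, 's2': 0.3, 's3': 0.2}
    likelihood = {'s1': 0.1, 's2': 0.7, 's3': 0.4}   # p(theta|s) for the observed theta
    print(posterior(likelihood, prior))              # s2 becomes the most likely state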
Partial observability can in principle be attributed to two factors. First, it can be the result
of noise in the agent’s sensors. For example, due to sensor malfunction, the same state may
‘generate’ different observations to the agent at different points in time. That is, every time the
agent visits a particular state it may perceive something different. Second, partial observability
can be related to an inherent property of the environment referred to as perceptual aliasing:
different states may produce identical observations to the agent at different time steps. In other
words, two states may ‘look’ the same to an agent, although the states are different from each
other. For example, two identical doors along a corridor will look exactly the same to the eyes
of a human or the camera of a mobile robot, no matter how accurate each sensor system is.
Partial observability is much harder to handle than full observability, and algorithms for
optimal decision making in a partially observable world can often become intractable. As we
    π(s_t) = a_t.    (2.4)
In other words, in an observable world the policy of a reflex agent is a mapping from world
states to actions. The gain comes from the fact that in many problems the state of the world at
time t provides a complete characterization of the history before time t. Such a world state that
summarizes all relevant information about the past is said to be Markov or to have the Markov
property. As we conclude from the above, in a Markov world an agent can safely use the
memoryless policy (2.4) for its decision making, in place of the memory-expensive policy (2.1).
So far we have discussed how the policy of an agent may depend on its past experience
and the particular characteristics of the environment. However, as we argued at the beginning,
optimal decision making should also take the future into account. This is what we are going to
examine next.
We saw in the previous section that sometimes partial observability can be attributed to
uncertainty in the perception of the agent. Here we see another example where uncertainty plays
a role; namely, in the way the world changes when the agent executes an action. In a stochastic
world, the effects of the actions of the agent are not known a priori. Instead, there is a random
element that decides how the world changes as a result of an action. Clearly, stochasticity in
the state transitions introduces an additional difficulty in the optimal decision making task of
the agent.
FIGURE 2.1: A 4 × 4 grid world (columns a–d, rows 1–4) with one desired (+1) and two undesired (−1) terminal states; the agent starts at a1.
According to the principle of maximum expected utility, the agent should choose the action

    a_t^* = arg max_{a_t} Σ_{s_{t+1} ∈ S} p(s_{t+1} | s_t, a_t) U(s_{t+1})    (2.5)

where we sum over all possible states s_{t+1} ∈ S the world may transition to, given that the current
state is s_t and the agent takes action a_t. In words, to see how good an action is, the agent has
to multiply the utility of each possible resulting state with the probability of actually reaching
this state, and sum up over all states. Then the agent must choose the action a_t^* that gives the
highest sum.
If each world state possesses a utility value, the agent can do the above calculations and
compute an optimal action for each possible state. This provides the agent with a policy that
maps states to actions in an optimal sense (optimal with respect to the given utilities). In
particular, given a set of optimal (that is, highest attainable) utilities U^*(s) in a given task, the
greedy policy

    π^*(s) = arg max_a Σ_{s'} p(s'|s, a) U^*(s')    (2.6)

is optimal with respect to these utilities.
Alternatively, if the agent has access to optimal action values (Q-values) Q^*(s, a), it can use the
greedy policy π^*(s) = arg max_a Q^*(s, a), which is a simpler formula than (2.6) that does not make
use of a transition model. In Chapter 7 we will see how we can compute optimal Q-values Q^*(s, a),
and hence an optimal policy, in a given task.
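The greedy policy (2.6) can be sketched directly in code: given a transition model p(s'|s, a) and a utility table U^*, the agent picks the action with the highest expected utility. The transition model and utility values below are hypothetical placeholders.

    # Greedy action selection according to (2.6):
    #   pi*(s) = argmax_a  sum_{s'} p(s'|s, a) U*(s').
    def greedy_action(state, actions, trans, U):
        # trans[(s, a)] maps each next state s' to p(s'|s, a).
        def expected_utility(a):
            return sum(p * U[s_next] for s_next, p in trans[(state, a)].items())
        return max(actions, key=expected_utility)

    U = {'s1': 0.2, 's2': 0.7, 's3': -1.0}                       # hypothetical utilities
    trans = {('s1', 'left'):  {'s1': 0.1, 's2': 0.8, 's3': 0.1},
             ('s1', 'right'): {'s1': 0.1, 's2': 0.1, 's3': 0.8}}
    print(greedy_action('s1', ['left', 'right'], trans, U))      # -> 'left'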
2.4.3 Example: A Toy World
Consider the toy world of Fig. 2.1, in which a single agent moves between cells by taking
the actions {Up, Down, Left, Right}. We assume that the world is fully observable (the agent
always knows where it is), and stochastic in the following sense: every action of the agent to
an intended direction succeeds with probability 0.8, but with probability 0.2 the agent moves
perpendicularly to the intended direction (0.1 for each side). Bumping into the border leaves the position of the
agent unchanged. There are three terminal states, a desired one (the ‘goal’ state) with utility
+1, and two undesired ones with utility −1. The initial position of the agent is a1.
We stress again that although the agent can perceive its own position and thus the state
of the world, it cannot predict the effects of its actions on the world. For example, if the agent
is in state c2, it knows that it is in state c2. However, if it tries to move Up to state c3, it may
reach the intended state c3 (this will happen in 80% of the cases) but it may also reach state b2
(in 10% of the cases) or state d2 (in the remaining 10% of the cases).
Assume now that optimal utilities have been computed for all states, as shown in Fig. 2.2.
Applying the principle of maximum expected utility, the agent computes that, for instance, in
state b3 the optimal action is Up. Note that this is the only action that avoids an accidental
transition to state b2. Similarly, by using (2.6) the agent can now compute an optimal action
for every state, which gives the optimal policy shown in parentheses.
Note that, unlike path planning in a deterministic world that can be described as graph
search, decision making in stochastic domains requires computing a complete policy that maps
states to actions. Again, this is a consequence of the fact that the results of the actions of an
agent are unpredictable. Only after the agent has executed its action can it observe the new
state of the world, from which it can select another action based on its precomputed policy.
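The stochastic transition model of this toy world can be made concrete with a short sketch: an action reaches the intended cell with probability 0.8 and each of the two perpendicular cells with probability 0.1, and bumping into the border leaves the agent where it is. The state encoding (column letter plus row number) is an assumption of the sketch; combined with a utility table such as the one of Fig. 2.2, the greedy_action function sketched earlier would reproduce the policy computation of this example.

    # Transition model p(s'|s, a) for the 4 x 4 toy world (columns a-d, rows 1-4).
    COLS = 'abcd'
    MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
    PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
            'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

    def shift(state, move):
        col, row = COLS.index(state[0]), int(state[1])
        dc, dr = MOVES[move]
        c, r = col + dc, row + dr
        if 0 <= c < 4 and 1 <= r <= 4:
            return COLS[c] + str(r)
        return state                          # bumped into the border: stay put

    def transition(state, action):
        probs = {}
        for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
            s_next = shift(state, move)
            probs[s_next] = probs.get(s_next, 0.0) + p
        return probs

    print(transition('c2', 'Up'))             # {'c3': 0.8, 'b2': 0.1, 'd2': 0.1}, as in the text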
CHAPTER 3
Strategic Games
In this chapter we study the problem of multiagent decision making where a group of agents
coexist in an environment and take simultaneous decisions. We use game theory to analyze
the problem. In particular, we describe the model of a strategic game and we examine two
fundamental solution concepts, iterated elimination of strictly dominated actions and Nash
equilibrium.
1. In this chapter we will use ‘he’ or ‘she’ to refer to an agent, following the convention in the literature (Osborne and Rubinstein, 1994, p. xiii).
In summary, in a strategic game each agent chooses a single action, and then he receives
a payoff that depends on the selected joint action. This joint action is called the outcome of
the game. Although the payoff functions of the agents are common knowledge, an agent does
not know in advance the action choices of the other agents. The best he can do is to try to
predict the actions of the other agents. A solution to a game is a prediction of the outcome of
the game using the assumption that all agents are rational and strategic.
In the special case of two agents, a strategic game can be graphically represented by
a payoff matrix, where the rows correspond to the actions of agent 1, the columns to the
actions of agent 2, and each entry of the matrix contains the payoffs of the two agents for
the corresponding joint action. In Fig. 3.1 we show the payoff matrix of a classical game, the
prisoner’s dilemma, whose story goes as follows:
Two suspects in a crime are independently interrogated. If they both confess, each will
spend three years in prison. If only one confesses, he will run free while the other will spend
four years in prison. If neither confesses, each will spend one year in prison.
In this example each agent has two available actions, Not confess or Confess. Translating
the above story into appropriate payoffs for the agents, we get in each entry of the matrix the
pairs of numbers that are shown in Fig. 3.1 (note that a payoff is by definition a ‘reward’,
whereas spending three years in prison is a ‘penalty’). For example, the entry (4, 0) indicates
that if the first agent confesses and the second agent does not, then the first agent will get
payoff 4 and the second agent will get payoff 0.
In Fig. 3.2 we see two more examples of strategic games. The game in Fig. 3.2(a) is
known as ‘matching pennies’; each of two agents chooses either Head or Tail. If the choices
differ, agent 1 pays agent 2 a cent; if they are the same, agent 2 pays agent 1 a cent. Such a
game is called strictly competitive or zero-sum because u 1 (a) + u 2 (a) = 0 for all a. The game
in Fig. 3.2(b) is played between two car drivers at a crossroad; each agent wants to cross first
(and he will get payoff 1), but if they both cross they will crash (and get payoff −1). Such a
game is called a coordination game (we will study coordination games in Chapter 4).
What does game theory predict that a rational agent will do in the above examples? In
the next sections we will describe two fundamental solution concepts for strategic games.
FIGURE 3.2: A strictly competitive game (a), and a coordination game (b)
book MOBK077-Vlassis August 3, 2007 7:59
Definition 3.1. We will say that an action a_i of agent i is strictly dominated by another action a'_i
of agent i if

    u_i(a'_i, a_{-i}) > u_i(a_i, a_{-i})    (3.1)

for all joint actions a_{-i} of the other agents.
In the above definition, u_i(a_i, a_{-i}) is the payoff agent i receives if he takes action a_i
while the other agents take a_{-i}. In the prisoner’s dilemma, for example, Not confess is a strictly
dominated action for agent 1; no matter what agent 2 does, the action Confess always gives
agent 1 higher payoff than the action Not confess (4 as opposed to 3 if agent 2 does not confess,
and 1 as opposed to 0 if agent 2 confesses). Similarly, Not confess is a strictly dominated action
for agent 2.
Iterated elimination of strictly dominated actions (IESDA) is a solution technique
that iteratively eliminates strictly dominated actions from all agents, until no more actions are
strictly dominated. It is solely based on the following two assumptions:
• a rational agent will never choose a strictly dominated action, and
• it is common knowledge that all agents are rational.
(a)
        L       M       R
  U     1, 0    1, 2    0, 1
  D     0, 3    0, 1    2, 0

(b)
        L       M       R
  U     1, 0    1, 2    0, 1
  D     0, 3    0, 1    2, 2

FIGURE 3.3: Examples where IESDA predicts a single outcome (a), or predicts that any outcome is possible (b).
A characteristic of IESDA is that the agents do not need to maintain beliefs about
the other agents’ strategies in order to compute their optimal actions. The only thing that is
required is the common knowledge assumption that each agent is rational. Moreover, it can be
shown that the algorithm is insensitive to the speed and the elimination order; it will always
produce the same result no matter how many actions are eliminated in each step and in which
order. However, as we saw in the examples above, IESDA can sometimes fail to make useful
predictions for the outcome of a game.
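IESDA is straightforward to implement for small games. The sketch below handles two-agent games in a payoff-matrix representation; the dictionary encoding of the game is an assumption of the sketch.

    # Iterated elimination of strictly dominated actions (IESDA) for a 2-agent game.
    # u[(a1, a2)] = (payoff of agent 1, payoff of agent 2).
    def iesda(actions1, actions2, u):
        def dominated(i, a, own, other):
            # Is action a of agent i strictly dominated by some action b of agent i?
            def payoff(x, o):
                return u[(x, o)][i] if i == 0 else u[(o, x)][i]
            return any(all(payoff(b, o) > payoff(a, o) for o in other)
                       for b in own if b != a)

        acts = [list(actions1), list(actions2)]
        changed = True
        while changed:
            changed = False
            for i in (0, 1):
                keep = [a for a in acts[i] if not dominated(i, a, acts[i], acts[1 - i])]
                if len(keep) < len(acts[i]):
                    acts[i], changed = keep, True
        return acts

    # Prisoner's dilemma: Not confess (N) is eliminated for both agents.
    pd = {('N', 'N'): (3, 3), ('N', 'C'): (0, 4),
          ('C', 'N'): (4, 0), ('C', 'C'): (1, 1)}
    print(iesda(['N', 'C'], ['N', 'C'], pd))      # [['C'], ['C']]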
Definition 3.2. A Nash equilibrium (NE) is a joint action a^* with the property that for every agent i

    u_i(a_i^*, a_{-i}^*) ≥ u_i(a_i, a_{-i}^*)    (3.2)

holds for all actions a_i ∈ A_i.
In other words, a NE is a joint action from where no agent can unilaterally improve his
payoff, and therefore no agent has any incentive to deviate. Note that, contrary to IESDA that
describes a solution of a game by means of an algorithm, a NE describes a solution in terms of
the conditions that hold at that solution.
There is an alternative definition of a NE that makes use of the so-called best-response
function. For agent i, this function is defined as

    B_i(a_{-i}) = { a_i ∈ A_i : u_i(a_i, a_{-i}) ≥ u_i(a'_i, a_{-i}) for all a'_i ∈ A_i }    (3.3)

and B_i(a_{-i}) can be a set containing many actions. In the prisoner’s dilemma, for example, when
agent 2 takes the action Not confess, the best-response of agent 1 is the action Confess (because
4 > 3). Similarly, we can compute the best-response function of each agent:

    B_1(Not confess) = Confess,   B_1(Confess) = Confess,
    B_2(Not confess) = Confess,   B_2(Confess) = Confess.
In this case, the best-response functions are singleton-valued. Using the definition of a best-
response function we can now formulate the following:
Definition 3.3. A Nash equilibrium is a joint action a^* with the property that for every agent i it
holds that

    a_i^* ∈ B_i(a_{-i}^*).    (3.4)
That is, at a NE, each agent’s action is an optimal response to the other agents’ ac-
tions. In the prisoner’s dilemma, for instance, given that B1 (Confess ) = Confess, and B2
(Confess ) = Confess, we conclude that (Confess, Confess ) is a NE. Moreover, we can easily
show the following:
Proposition 3.1. The two definitions 3.2 and 3.3 of a NE are equivalent.
Proof. Suppose that (3.4) holds. Then, using (3.3), we see that for each agent i the action a_i^*
must satisfy u_i(a_i^*, a_{-i}^*) ≥ u_i(a_i, a_{-i}^*) for all a_i ∈ A_i. The latter is precisely the definition of a
NE according to (3.2). Similarly for the converse.
The definitions 3.2 and 3.3 suggest a brute-force method for finding the Nash equilibria
of a game: enumerate all possible joint actions and then verify which ones satisfy (3.2) or (3.4).
Note that the cost of such an algorithm is exponential in the number of agents.
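The brute-force search can be sketched as follows for a game in tabular form with any number of agents; as before, the dictionary encoding of payoffs is an illustrative assumption.

    from itertools import product

    # Enumerate all joint actions and keep those satisfying condition (3.2).
    # u[joint_action] = tuple of payoffs, one entry per agent.
    def pure_nash_equilibria(action_sets, u):
        equilibria = []
        for joint in product(*action_sets):
            def can_deviate(i):
                # Can agent i improve by unilaterally switching to some action b?
                return any(u[joint[:i] + (b,) + joint[i + 1:]][i] > u[joint][i]
                           for b in action_sets[i])
            if not any(can_deviate(i) for i in range(len(action_sets))):
                equilibria.append(joint)
        return equilibria

    pd = {('N', 'N'): (3, 3), ('N', 'C'): (0, 4),
          ('C', 'N'): (4, 0), ('C', 'C'): (1, 1)}
    print(pure_nash_equilibria([['N', 'C'], ['N', 'C']], pd))    # [('C', 'C')]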
It turns out that a strategic game can have zero, one, or more than one Nash equilibria.
For example, (Confess, Confess ) is the only NE in the prisoner’s dilemma. We also find that the
zero-sum game in Fig. 3.2(a) does not have a NE, while the coordination game in Fig. 3.2(b)
has two Nash equilibria (Cross, Stop ) and (Stop, Cross ). Similarly, (U , M) is the only NE in
both games of Fig. 3.3.
We argued above that a NE is a stronger solution concept than IESDA in the sense
that it produces more accurate predictions of a game. For instance, the game of Fig. 3.3(b) has
only one NE, but IESDA predicts that any outcome is possible. In general, we can show the
following two propositions (the proof of the second is left as an exercise):
Proposition 3.2. A Nash equilibrium always survives IESDA.

Proof. Let a^* be a NE, and let us assume that a^* does not survive IESDA. This means that
for some agent i the component a_i^* of the action profile a^* is strictly dominated by another
action a'_i of agent i. But then (3.1) implies that u_i(a'_i, a_{-i}^*) > u_i(a_i^*, a_{-i}^*), which contradicts
Definition 3.2 of a NE.
Proposition 3.3. If IESDA eliminates all but a single joint action a, then a is the unique NE
of the game.
Note also that in the prisoner’s dilemma, the joint action (Not confess, Not confess ) gives
both agents payoff 3, and thus it should have been the preferable choice. However, from this
joint action each agent has an incentive to deviate, to be a ‘free rider’. Only if the agents had
made an agreement in advance, and only if trust between them was common knowledge, would
they have opted for this non-equilibrium joint action which is optimal in the following sense:
Definition 3.4. A joint action a is Pareto optimal if there is no other joint action a' for which
u_i(a') ≥ u_i(a) for each i and u_j(a') > u_j(a) for some j.
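Pareto optimality can likewise be checked directly against Definition 3.4; the sketch below reuses the hypothetical dictionary encoding of the prisoner's dilemma from the earlier sketches.

    # Definition 3.4: a is Pareto optimal if no a' weakly improves every agent's
    # payoff and strictly improves at least one agent's payoff.
    def is_pareto_optimal(a, u):
        for a_prime, payoffs in u.items():
            if a_prime != a and \
               all(p >= q for p, q in zip(payoffs, u[a])) and \
               any(p > q for p, q in zip(payoffs, u[a])):
                return False
        return True

    pd = {('N', 'N'): (3, 3), ('N', 'C'): (0, 4),
          ('C', 'N'): (4, 0), ('C', 'C'): (1, 1)}
    print(is_pareto_optimal(('N', 'N'), pd))   # True
    print(is_pareto_optimal(('C', 'C'), pd))   # False: (N, N) is better for both agents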
So far we have implicitly assumed that when the game is actually played, each agent i
will choose his action deterministically from his action set Ai . This is however not always true.
In many cases there are good reasons for an agent to introduce randomness in his behavior; for
instance, to avoid being predictable when he repeatedly plays a zero-sum game. In these cases
an agent i can choose actions a i according to some probability distribution:
Definition 3.5. A mixed strategy for an agent i is a probability distribution over his actions
a i ∈ Ai .
In his celebrated theorem, Nash (1950) showed that a strategic game with a finite num-
ber of agents and a finite number of actions always has an equilibrium in mixed strategies.
Osborne and Rubinstein (1994, sec. 3.2) give several interpretations of such a mixed strat-
egy Nash equilibrium. Porter et al. (2004) and von Stengel (2007) describe several algorithms
for computing Nash equilibria, a problem whose complexity has been a long-standing is-
sue (Papadimitriou, 2001).
CHAPTER 4
Coordination
In this chapter we address the problem of multiagent coordination. We analyze the problem
using the framework of strategic games that we studied in Chapter 3, and we describe several
practical techniques like social conventions, roles, and coordination graphs.
              Thriller    Comedy
  Thriller    1, 1        0, 0
  Comedy      0, 0        1, 1

FIGURE 4.1: A coordination game between two agents (rows: agent 1, columns: agent 2).
world. Besides the game primitives, the state now also contains the relative orientation of the
cars in the physical environment. If the state is fully observable by both agents (and this fact is
common knowledge), then a simple convention is that the driver coming from the right will
always have priority over the other driver in the lexicographic ordering. If we also order the
actions by Cross ≻ Stop, then coordination by social conventions implies that the driver from the
right will cross the road first. Similarly, if traffic lights are available, the established convention
is that the driver who sees the red light must stop.
When communication is available, we only need to impose an ordering i = 1, . . . , n of
the agents that is common knowledge. Coordination can now be achieved by the following
algorithm: Each agent i (except agent 1) waits until all previous agents 1, . . . , i − 1 in the
ordering have broadcast their chosen actions, and then agent i computes its component a i∗ of
an equilibrium that is consistent with the choices of the previous agents and broadcasts a i∗ to
all agents that have not chosen an action yet. Note that here the fixed ordering of the agents
together with the wait/send primitives result in a synchronized sequential execution order of
the coordination algorithm.
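In the communication-free case, coordination by social conventions can be sketched as follows: assuming that every agent has computed the same set of equilibria, each agent independently selects the first equilibrium in a lexicographic ordering of agents and actions that is common knowledge. The particular orderings used below are illustrative assumptions.

    # Coordination by social conventions (communication-free sketch): every agent
    # independently picks the lexicographically first equilibrium with respect to
    # a commonly known ordering of the agents and of each agent's actions.
    def select_equilibrium(equilibria, action_order):
        # action_order[i] lists agent i's actions from most to least preferred;
        # agents earlier in the list take precedence over later ones.
        def rank(joint):
            return tuple(action_order[i].index(a) for i, a in enumerate(joint))
        return min(equilibria, key=rank)

    # Crossroad game: agent 1 is the driver coming from the right, Cross before Stop.
    equilibria = [('Cross', 'Stop'), ('Stop', 'Cross')]
    action_order = [['Cross', 'Stop'], ['Cross', 'Stop']]
    print(select_equilibrium(equilibria, action_order))   # ('Cross', 'Stop'): she crosses first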
4.3 ROLES
Coordination by social conventions relies on the assumption that an agent can compute all
equilibria in a game before choosing a single one. However, computing equilibria can be
expensive when the action sets of the agents are large, so it makes sense to try to reduce the size
of the action sets first. Such a reduction can have computational advantages in terms of speed,
but it can also simplify the equilibrium selection problem; in some cases the resulting subgame
contains only one equilibrium which is trivial to find.
A natural way to reduce the action sets of the agents is by assigning roles to the agents.
Formally, a role can be regarded as a masking operator on the action set of an agent given
a particular state. In practical terms, if an agent is assigned a role at a particular state, then
some of the agent’s actions are deactivated at this state. In soccer for example, an agent that is
currently in the role of defender cannot attempt to Score.
A role can facilitate the solution of a coordination game by reducing it to a subgame
where the equilibria are easier to find. For example, in Fig. 4.1, if agent 2 is assigned a role that
forbids him to select the action Thriller (say, he is under 12), then agent 1, assuming he knows
the role of agent 2, can safely choose Comedy resulting in coordination. Note that there is only
one equilibrium left in the subgame formed after removing the action Thriller from the action
set of agent 2.
In general, suppose that there are n available roles (not necessarily distinct), that the state
is fully observable to the agents, and that the following facts are common knowledge among
agents:
• There is a fixed ordering {1, 2, . . . , n} of the roles. Role 1 must be assigned first, followed by role 2, etc.
• For each role there is a function that assigns to each agent a ‘potential’ that reflects how appropriate that agent is for the specific role, given the current state. For example, the potential of a soccer robot for the role attacker can be given by its negative Euclidean distance to the ball.
• Each agent can be assigned only one role.
Then role assignment can be carried out, for instance, by a greedy algorithm in which
each role (starting from role 1) is assigned to the agent that has the highest potential for
that role, and so on until all agents have been assigned a role. When communication is not
available, each agent can run this algorithm identically and in parallel, assuming that each agent
can compute the potential of each other agent. When communication is available, an agent
only needs to compute its own potentials for the set of roles, and then broadcast them to the
rest of the agents. Next it can wait for all other potentials to arrive in order to compute the
assignment of roles to agents as above. In the communication-based case, each agent needs to
compute O(n) (its own) potentials instead of O(n2 ) in the communication-free case, but this is
compensated by the total number O(n2 ) of potentials that need to be broadcast and processed
by the agents. Figure 4.2 shows the greedy role assignment algorithm when communication is
available.
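The greedy role assignment can be sketched as follows; the agents, roles, and potential values are hypothetical, and in the communication-based variant each agent would first broadcast only its own potentials before running the same assignment loop.

    # Greedy role assignment: roles are processed in a fixed order, and each role
    # goes to the not-yet-assigned agent with the highest potential for it.
    def assign_roles(roles, agents, potential):
        assignment, free = {}, set(agents)
        for role in roles:                        # role 1 first, then role 2, ...
            best = max(free, key=lambda ag: potential[(ag, role)])
            assignment[role] = best
            free.remove(best)
        return assignment

    # Hypothetical soccer example: the potential of a robot for 'attacker' is
    # minus its distance to the ball; the other numbers are made up.
    roles = ['attacker', 'defender', 'keeper']
    agents = ['r1', 'r2', 'r3']
    potential = {('r1', 'attacker'): -2.0, ('r2', 'attacker'): -0.5, ('r3', 'attacker'): -4.0,
                 ('r1', 'defender'):  0.8, ('r2', 'defender'):  0.3, ('r3', 'defender'):  0.1,
                 ('r1', 'keeper'):    0.2, ('r2', 'keeper'):    0.6, ('r3', 'keeper'):    0.9}
    print(assign_roles(roles, agents, potential))
    # {'attacker': 'r2', 'defender': 'r1', 'keeper': 'r3'}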
joint action space is exponentially large in the number of agents. As roles reduce the size of the
action sets, we also need a method that reduces the number of agents involved in a coordination
game.
Guestrin et al. (2002a) introduced the coordination graph as a framework for solving
large-scale coordination problems. A coordination graph allows for the decomposition of a
coordination game into several smaller subgames that are easier to solve. Unlike roles where a
single subgame is formed by the reduced action sets of the agents, in this framework various
subgames are formed, each typically involving a small number of agents.
In order for such a decomposition to apply, the main assumption is that the global
payoff function u(a) can be written as a linear combination of k local payoff functions f j , for
j = 1, . . . , k, each involving fewer agents. For example, suppose that there are n = 4 agents,
and k = 3 local payoff functions, each involving two agents:
    u(a) = f_1(a_1, a_2) + f_2(a_1, a_3) + f_3(a_3, a_4).    (4.1)

Here, for instance, f_2(a_1, a_3) involves only agents 1 and 3, with their actions a_1 and a_3. Such a
decomposition can be graphically represented by a graph (hence the name), where each node
represents an agent and each edge corresponds to a local payoff function. For example, the
decomposition (4.1) can be represented by the graph of Fig. 4.3.
Many practical problems can be modeled by such additively decomposable payoff func-
tions. For example, in a computer network nearby servers may need to coordinate their actions
in order to optimize overall network traffic; in a firm with offices in different cities, geograph-
ically nearby offices may need to coordinate their actions in order to maximize global sales; in
a soccer team, nearby players may need to coordinate their actions in order to improve team
performance; and so on.
Let us now see how this framework can be used for coordination. A solution to the
coordination problem is by definition a Pareto optimal Nash equilibrium in the corresponding
strategic game, that is, a joint action a ∗ that maximizes u(a). We will describe two solution
methods: coordination by variable elimination and coordination by message passing (max-plus).

FIGURE 4.3: The coordination graph corresponding to the decomposition (4.1): agent 1 is linked to agent 2 through f_1 and to agent 3 through f_2, and agent 3 is linked to agent 4 through f_3.

4.4.1 Coordination by Variable Elimination
In variable elimination, the agents are eliminated from the maximization one at a time. For the decomposition (4.1) we can write

    max_a u(a) = max_{a_2, a_3, a_4} [ f_3(a_3, a_4) + max_{a_1} ( f_1(a_1, a_2) + f_2(a_1, a_3) ) ].    (4.2)
Next we perform the inner maximization over the actions of agent 1. For each combination of
actions of agents 2 and 3, agent 1 must choose an action that maximizes f 1 + f 2 . This essentially
involves computing the best-response function B1 (a 2 , a 3 ) of agent 1 (see Section 3.4) in the
subgame formed by agents 1, 2, and 3, and the sum of payoffs f 1 + f 2 . The function B1 (a 2 , a 3 )
can be thought of as a conditional strategy for agent 1, given the actions of agents 2 and 3.
The above maximization and the computation of the best-response function of agent 1
define a new payoff function f_4(a_2, a_3) = max_{a_1} [ f_1(a_1, a_2) + f_2(a_1, a_3) ] that is independent of
a_1. Agent 1 has now been eliminated. The maximum (4.2) becomes

    max_a u(a) = max_{a_2, a_3, a_4} [ f_3(a_3, a_4) + f_4(a_2, a_3) ].    (4.3)
We can now eliminate agent 2 as we did with agent 1. In (4.3), only f 4 involves a 2 , and
maximization of f 4 over a 2 gives the best-response function B2 (a 3 ) of agent 2 which is a
function of a 3 only. This in turn defines a new payoff function f 5 (a 3 ), and agent 2 is eliminated.
Now we can write
    max_a u(a) = max_{a_3, a_4} [ f_3(a_3, a_4) + f_5(a_3) ].    (4.4)
For each agent in parallel
    F = {f_1, . . . , f_k}.
    For each agent i = 1, 2, . . . , n          (forward pass)
        Find all f_j(a_i, a_{-i}) ∈ F that involve a_i.
        Compute B_i(a_{-i}) = arg max_{a_i} Σ_j f_j(a_i, a_{-i}).
        Compute f_{k+i}(a_{-i}) = max_{a_i} Σ_j f_j(a_i, a_{-i}).
        Remove all f_j(a_i, a_{-i}) from F and add f_{k+i}(a_{-i}) to F.
    End
    For each agent i = n, n − 1, . . . , 1      (backward pass)
        Choose a_i^* ∈ B_i(a_{-i}^*) based on a fixed ordering of actions.
    End
End

FIGURE 4.4: Coordination by variable elimination.
one agent may have more than one best-response action, in which case the first action can be
chosen according to an a priori ordering of the actions of each agent that must be common
knowledge.
The complete algorithm, which we will refer to as coordination by variable elimina-
tion, is shown in Fig. 4.4. Note that the notation −i that appears in f j (a i , a −i ) refers to
all agents other than agent i that are involved in f j , and it does not necessarily include all
n − 1 agents. Similarly, in the best-response functions B_i(a_{-i}) the action set a_{-i} may involve
fewer than n − 1 agents. The algorithm runs identically for each agent in parallel. For that
we require that all local payoff functions are common knowledge among agents, and that
there is an a priori ordering of the action sets of the agents that is also common knowledge.
The latter assumption is needed so that each agent will finally compute the same joint ac-
tion. The main advantage of this algorithm compared to coordination by social conventions
is that here we need to compute best-response functions in subgames involving only few
agents, as opposed to computing best-response functions in the complete game involving all n
agents.
For simplicity, in the above algorithm we have fixed the elimination order of the agents
as 1, 2, . . . , n. However, this is not necessary; each agent running the algorithm can choose a
different elimination order, and the resulting joint action a ∗ will always be the same. The total
runtime of the algorithm, however, will not be the same; different elimination orders produce
different intermediate payoff functions, and thus subgames of different size. It turns out that
computing the elimination order that minimizes the execution time of the algorithm is a hard
(NP-complete) problem (Arnborg et al., 1987). A good heuristic is to eliminate agents that
have the fewest neighbors.
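Variable elimination on the decomposition (4.1) can be sketched as follows. The local payoff tables are hypothetical, and for brevity the computation is written as a single centralized routine rather than being run identically by every agent in parallel.

    from itertools import product

    # Variable elimination for an additively decomposed payoff u(a) = sum_j f_j(a),
    # as in (4.1). Each factor has a 'scope' (tuple of agents) and a 'table'
    # mapping local joint actions to payoffs.
    def variable_elimination(actions, factors, order):
        best_response = {}
        for i in order:                                      # forward pass
            involved = [f for f in factors if i in f['scope']]
            factors = [f for f in factors if i not in f['scope']]
            scope = sorted({j for f in involved for j in f['scope']} - {i})
            new_table, br = {}, {}
            for context in product(*[actions[j] for j in scope]):
                ctx = dict(zip(scope, context))
                def local_sum(a_i):
                    assignment = {**ctx, i: a_i}
                    return sum(f['table'][tuple(assignment[j] for j in f['scope'])]
                               for f in involved)
                best = max(actions[i], key=local_sum)        # conditional best response
                new_table[context], br[context] = local_sum(best), best
            factors.append({'scope': tuple(scope), 'table': new_table})
            best_response[i] = (scope, br)
        joint = {}
        for i in reversed(order):                            # backward pass
            scope, br = best_response[i]
            joint[i] = br[tuple(joint[j] for j in scope)]
        return joint

    actions = {i: ['x', 'y'] for i in (1, 2, 3, 4)}
    f1 = {'scope': (1, 2), 'table': {('x','x'): 5, ('x','y'): 0, ('y','x'): 0, ('y','y'): 3}}
    f2 = {'scope': (1, 3), 'table': {('x','x'): 2, ('x','y'): 0, ('y','x'): 0, ('y','y'): 4}}
    f3 = {'scope': (3, 4), 'table': {('x','x'): 1, ('x','y'): 0, ('y','x'): 0, ('y','y'): 2}}
    print(variable_elimination(actions, [f1, f2, f3], order=[1, 2, 3, 4]))
    # all four agents pick 'y', the joint action that maximizes u(a) for these tables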
When communication is available, we do not need to assume that all local payoff functions
f j are common knowledge and that the actions are ordered. In the forward pass, each agent
4.4.2 Coordination by Message Passing
Variable elimination is an exact method (it always computes an optimal joint action), but it
suffers from two limitations. First, for densely connected graphs the runtime of the method can
be exponential in the number of agents (for example, a particular elimination order may cause
the graph to become fully connected). Second, variable elimination can only produce a solution
after the end of its backward pass, which can be unacceptable when real-time behavior is
required: often, decision making is done under time constraints, and there is a deadline after which
the payoff of the team becomes zero (think of a research team that tries to submit a proposal
before a deadline).
quality of the solution over time and (if possible) eventually computes the optimal solution.
Such anytime behavior can be achieved by distributed algorithms that are based on
message passing. Here we will describe one such algorithm, called max-plus, that was originally
developed for computing maximum a posteriori (MAP) solutions in Bayesian networks (Pearl,
1988). In this algorithm, neighboring agents in the graph repeatedly send messages to each
other, where a message is a local payoff function for the receiving agent. Suppose that we have a
coordination graph that defines a payoff function as a sum of two-agent local payoff functions:
    u(a) = Σ_{(i,j)} f_ij(a_i, a_j)    (4.5)
where the summation is over all (i, j ) pairs of neighboring agents in the graph. In each time
step, each agent i sends a message µi j to a (randomly picked) neighbor j , where µi j is a local
payoff function for the receiving agent j defined as
    μ_ij(a_j) = max_{a_i} [ f_ij(a_i, a_j) + Σ_{k ∈ Γ(i)\j} μ_ki(a_i) ]    (4.6)

where Γ(i) \ j denotes all neighbors of agent i except agent j. Messages are exchanged until
they converge to a fixed point, or until some external signal stops the process. The two operators
involved in (4.6), a maximization and a summation, give the name max-plus to the algorithm.
When the graph is cycle-free (tree), max-plus always converges after a finite num-
ber of steps to a fixed point in which the messages do not change anymore (Pearl, 1988,
Wainwright et al., 2004). If we define local functions g i , one for each agent i, as
    g_i(a_i) = Σ_{j ∈ Γ(i)} μ_ji(a_i)    (4.7)

then it can be shown that at convergence each g_i satisfies

    g_i(a_i) = max_{a' : a'_i = a_i} u(a').    (4.8)
If each agent i then computes its action individually as

    a_i^* = arg max_{a_i} g_i(a_i)    (4.9)

and each optimal action a_i^* is unique (for all i), then at convergence the globally optimal
joint action a^* = arg max_a u(a) is also unique and has elements a^* = (a_i^*) computed by only local
optimizations (each agent maximizes g_i(a_i) separately). If the local a_i^* are not unique, an
optimal joint action can still be computed by dynamic programming (Wainwright et al., 2004,
sec. 3.1). Hence, max-plus allows the decomposition of a difficult global optimization problem
(a^* = arg max_a u(a)) into a set of local optimization problems (4.9) that are much easier to
solve.
When the graph contains cycles, there are no guarantees that max-plus will converge,
nor that the local maximizers a i∗ from (4.9) will comprise a global maximum at any time step.
However, max-plus can still be used as an approximate coordination algorithm, that produces
very good results in practice, much faster than variable elimination (Kok and Vlassis, 2006).
Max-plus is effective and simple to implement, but it comes with few performance
guarantees in general graphs. Other algorithms exist, based on branch and bound or hill
climbing, that can provably converge to the optimal solution (Modi et al., 2005), or to a
k-optimal solution in which no subset of k or fewer agents can jointly improve the global
payoff (Pearce and Tambe, 2007, Zhang et al., 2005).
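A minimal sketch of max-plus on a tree-structured coordination graph with pairwise payoffs as in (4.5)–(4.9); the payoff tables and the synchronous update schedule are illustrative assumptions (the text describes agents sending messages to randomly picked neighbors).

    # Max-plus on a coordination graph with pairwise payoffs f_ij, as in (4.5)-(4.9).
    def max_plus(actions, payoffs, iterations=10):
        agents = list(actions)
        nbrs = {i: [j for j in agents if (i, j) in payoffs or (j, i) in payoffs]
                for i in agents}
        def f(i, j, a_i, a_j):
            return payoffs[(i, j)][(a_i, a_j)] if (i, j) in payoffs \
                   else payoffs[(j, i)][(a_j, a_i)]
        # mu[(i, j)][a_j]: message from agent i to neighbor j, a function of a_j (4.6).
        mu = {(i, j): {a: 0.0 for a in actions[j]} for i in agents for j in nbrs[i]}
        for _ in range(iterations):
            mu = {(i, j): {a_j: max(f(i, j, a_i, a_j)
                                    + sum(mu[(k, i)][a_i] for k in nbrs[i] if k != j)
                                    for a_i in actions[i])
                           for a_j in actions[j]}
                  for i in agents for j in nbrs[i]}
        # g_i(a_i) as in (4.7); each agent then maximizes locally, as in (4.9).
        return {i: max(actions[i], key=lambda a: sum(mu[(j, i)][a] for j in nbrs[i]))
                for i in agents}

    actions = {i: ['x', 'y'] for i in (1, 2, 3, 4)}
    payoffs = {(1, 2): {('x','x'): 5, ('x','y'): 0, ('y','x'): 0, ('y','y'): 3},
               (1, 3): {('x','x'): 2, ('x','y'): 0, ('y','x'): 0, ('y','y'): 4},
               (3, 4): {('x','x'): 1, ('x','y'): 0, ('y','x'): 0, ('y','y'): 2}}
    print(max_plus(actions, payoffs))   # {1: 'y', 2: 'y', 3: 'y', 4: 'y'} on this tree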
also Chapter 5). Coordination graphs are due to Guestrin et al. (2002a) who suggested the use
of variable elimination for coordination. The max-plus algorithm on coordination graphs was
suggested by Vlassis et al. (2004). Coordination on a coordination graph is essentially identical
to a distributed constraint optimization problem (DCOP) (Modi et al., 2005, Yokoo, 2000), a
particular version of constraint processing (Dechter, 2003).
CHAPTER 5
Partial Observability
In the previous chapters we assumed that the world state is fully observable to the agents. Here
we relax this assumption and examine the case where parts of the state are hidden to the agents.
In such a partially observable world an agent must always reason about his knowledge, and the
knowledge of the others, prior to making decisions. We formalize the notions of knowledge
and common knowledge in such domains, and describe the model of a Bayesian game for
multiagent decision making under partial observability.
Three agents (say, girls) are sitting around a table, each wearing a hat. A hat can be either
red or white, but suppose that all agents are wearing red hats. Each agent can see the hats
of the other two agents, but she does not know the color of her own hat. A person who
observes all three agents asks them in turn whether they know the color of their hats. Each
agent replies negatively. Then the person announces ‘At least one of you is wearing a red
hat’, and then asks them again in turn. Agent 1 says No. Agent 2 also says No. But when he
asks agent 3, she says Yes.
How is it possible that agent 3 can finally figure out the color of her hat? Before the
announcement that at least one of them is wearing a red hat, no agent is able to tell her hat
color. What changes then after the announcement? Seemingly the announcement does not
reveal anything new; each agent already knows that there is at least one red hat because she can
see the red hats of the other two agents.
Given that everyone has heard that there is at least one red hat, agent 3 can tell her hat
color by reasoning as follows: ‘Agent’s 1 No implies that either me or agent 2 is wearing a red
hat. Agent 2 knows this, so if my hat had been white, agent 2 would have said Yes. But agent 2
said No, so my hat must be red.’
Although each agent already knows (by perception) the fact that at least one agent is
wearing a red hat, the key point is that the public announcement of the person makes this
fact common knowledge among the agents. (Implicitly we have also assumed that it is common
knowledge that each agent can see and hear well, and that she can reason rationally.) The puzzle
is instructive as it demonstrates the implications of interactive reasoning and the strength of
the common knowledge assumption.
Let us now try to formalize some of the concepts that appear in the puzzle. The starting
point is that the world state is partially observable to the agents. Recall that in a partially
observable world the perception of an agent provides only partial information about the true
state by means of a deterministic or stochastic observation model (see Section 2.3). In the puzzle
of the hats this model is a set-partition deterministic model, as we will see next.
Let S be the set of all states and s ∈ S be the current (true) state of the world. We assume
that the perception of an agent i provides information about the state s through an information
function P_i : S → 2^S that maps s to P_i(s), a nonempty subset of S called the information
set of agent i in state s. The interpretation of the information set is that when the true state
is s, agent i thinks that any state in P_i(s) can be the true state. The set P_i(s) will always
contain s, but essentially this is the only thing that agent i knows about the true state. In the
case of multiple agents, each agent can have a different information function.

                         World states
                  a   b   c   d   e   f   g   h
        Agent 1   R   R   R   R   W   W   W   W
        Agent 2   R   R   W   W   R   R   W   W
        Agent 3   R   W   R   W   R   W   R   W

FIGURE 5.1: The eight world states in the puzzle of the hats.
In the puzzle of the hats, a state is a three-component vector containing the colors of the
hats. Let R and W denote red and white. There are in total eight states, S = {a, b, c, d, e, f, g, h},
as shown in Fig. 5.1. By assumption, the true state is s = a. From the
setup of the puzzle we know that the state is partially observable to each agent; only two of
the three hat colors are directly perceivable by each agent. In other words, in any state s the
information set of each agent contains two equiprobable states, those in which the only dif-
ference is in her own hat color. For instance, in state s = a the information set of agent 2 is
P2 (s ) = {a, c }, a two-state subset of S.
As we mentioned above, the information set Pi (s ) of an agent i contains those states in S
that agent i considers possible if the true state is s . In general, we assume that the information
function of an agent divides the state space into a collection of mutually disjoint subsets, called
cells, that together form a partition Pi of S. The information set Pi (s ) for agent i in true
state s is exactly that cell of Pi that contains s , while the union of all cells in Pi is S.
Based on the information functions, we can compute the partitions of the agents in the
puzzle of the hats:

  P1^t = { {a, e}, {b, f}, {c, g}, {d, h} },
  P2^t = { {a, c}, {b, d}, {e, g}, {f, h} },
  P3^t = { {a, b}, {c, d}, {e, f}, {g, h} },

where t refers to the time step before any announcement took place. Clearly, in the true state
s = a = RRR no agent knows her hat color, since the corresponding cell of each partition
contains two equiprobable states. Thus, agent 1 considers a and e possible, agent 2 considers a
and c possible, and agent 3 considers a and b possible. (Note again that we know that the true
state is a but the agents in our puzzle do not.)
Now we make the additional assumption that all partitions are common knowledge
among the agents. In the case of homogeneous agents, for instance, this is not an unrealistic
assumption; typically each agent will be aware of the perception capabilities of each other. In
the puzzle of the hats, for example, it is reasonable to assume that all agents can see and hear
well, so that the partitions, the person's announcement, and the agents' replies are commonly
known. The announcement that at least one agent is wearing a red hat excludes state h = WWW,
and each agent's partition is refined accordingly:

  P1^{t+1} = { {a, e}, {b, f}, {c, g}, {d}, {h} },
  P2^{t+1} = { {a, c}, {b, d}, {e, g}, {f}, {h} },
  P3^{t+1} = { {a, b}, {c, d}, {e, f}, {g}, {h} }.

Note that h has been disambiguated from d, f, and g in the three partitions. The person then
asks each agent in turn whether she knows the color of her hat. Agent 1 says No. In which case
would agent 1 have said Yes? As we see from the above partitions, only in state d would agent 1
have known her hat color. But the true state is a, and in this state agent 1 still considers e
possible.
The reply of agent 1 eliminates state d from the set of candidate states. This results in a
refinement of the partitions of agents 2 and 3 (the partition of agent 1 is unchanged):

  P2^{t+2} = { {a, c}, {b}, {d}, {e, g}, {f}, {h} },
  P3^{t+2} = { {a, b}, {c}, {d}, {e, f}, {g}, {h} }.
Next agent 2 is asked. From her partition P2^{t+2} we see that she would have known her
hat color only in state b or f (d and h are already ruled out by the previous announcements).
However, in the true state a agent 2 still considers c possible, therefore she replies negatively.
Her reply excludes b and f from the set of candidate states, resulting in a further refinement
of the partitions of agents 1 and 3. The partitions are now

  P1^{t+3} = { {a, e}, {b, f}, {c, g}, {d}, {h} },
  P2^{t+3} = { {a, c}, {b}, {d}, {e, g}, {f}, {h} },        (5.6)
  P3^{t+3} = { {a}, {b}, {c}, {d}, {e}, {f}, {g}, {h} }.
The partition of agent 3 now contains only singleton cells, thus agent 3 can now tell her hat
color. Note that agents 1 and 2 still cannot tell their hat colors. In fact, they will be unable to tell
their hat colors no matter how many more announcements will take place; the partitions (5.6)
cannot be further refined. Interestingly, the above analysis would have been exactly the same if
the true state had been any one in the set {a, c , e , g }. (Try to verify this with logical reasoning.)
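The refinement steps above can also be checked mechanically. Below is a minimal Python sketch (the state encoding follows Fig. 5.1; the function names are illustrative, not part of the text) that computes the initial partitions, applies the public announcement, and then processes the agents' replies in turn:

```python
from itertools import product

AGENTS = range(3)
# States a..h encode the hat colors of agents 1, 2, 3, as in Fig. 5.1:
# a = RRR, b = RRW, c = RWR, d = RWW, e = WRR, f = WRW, g = WWR, h = WWW.
STATES = {name: ''.join(colors)
          for name, colors in zip('abcdefgh', product('RW', repeat=3))}

def initial_partition(i):
    """Agent i sees every hat except her own, so states that differ only in
    component i end up in the same cell."""
    cells = {}
    for name, s in STATES.items():
        visible = s[:i] + s[i + 1:]
        cells.setdefault(visible, set()).add(name)
    return list(cells.values())

def refine(partition, event):
    """Public announcement of `event`: split every cell into the part inside
    and the part outside the event."""
    return [part for cell in partition
            for part in (cell & event, cell - event) if part]

def cell_of(partition, s):
    return next(cell for cell in partition if s in cell)

def knows_color(i, partition, s):
    """Agent i knows her color at s iff all states in her cell agree on hat i."""
    return len({STATES[t][i] for t in cell_of(partition, s)}) == 1

true_state = 'a'
partitions = [initial_partition(i) for i in AGENTS]

# 'At least one of you is wearing a red hat' rules out state h = WWW.
announcement = {name for name, s in STATES.items() if 'R' in s}
partitions = [refine(P, announcement) for P in partitions]

# The agents are then asked in turn; every 'No' is itself a public
# announcement of the event 'agent i does not know her hat color'.
for i in AGENTS:
    knows = knows_color(i, partitions[i], true_state)
    print('agent', i + 1, 'says', 'Yes' if knows else 'No')
    if not knows:
        no_event = {s for s in STATES if not knows_color(i, partitions[i], s)}
        partitions = [refine(P, no_event) for P in partitions]
```

Running the sketch prints No, No, Yes for agents 1, 2, and 3, and replacing the true state by any state in {a, c, e, g} produces exactly the same sequence of replies.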
We can formalize such reasoning with a knowledge operator: in a state s, agent i knows an
event E ⊆ S if every state she considers possible belongs to E, that is, if P_i(s) ⊆ E, and we
define K_i(E) = {s ∈ S : P_i(s) ⊆ E}.1 That is, for any event E, the set K_i(E) contains all
states in which agent i knows E. It is not difficult to see that K_i(E) can be written as the
union of all cells of P_i that are fully contained in E. In the puzzle of the hats, for example,
in the final partitions (5.6) we have K_1({a, e, c}) = {a, e}, while for the event E = {a, c, e, g}
we have K_i(E) = E for all i = 1, 2, 3.
An event E ⊆ S is called self-evident to agent i if E can be written as a union of cells
of Pi . For example, in (5.6) the event E = {a, c , e , g } is self-evident to all three agents. As
another example, suppose that the state space consists of the integer numbers from 1 to 8, the
true state is s = 1, and two agents have the following partitions:
In s = 1 agent 1 thinks that {1, 2} are possible. Agent 1 also thinks that agent 2 may think that
{1, 2, 3} are possible. Furthermore, agent 1 thinks that agent 2 may think that agent 1 might
think that {1, 2} or {3, 4, 5} are possible. But nobody needs to think beyond 5. In this example,
the event {1, 2, 3, 4} is self-evident to agent 2, while the event {1, 2, 3, 4, 5} is self-evident to
both agents.
We can now formalize the notion of common knowledge. For simplicity, the first defi-
nition is formulated for only two agents.
Definition 5.1. An event E ⊆ S is common knowledge between agents 1 and 2 in true state s ∈ S,
if s is a member of every set in the infinite sequence K 1 (E), K 2 (E), K 1 (K 2 (E)), K 2 (K 1 (E)), . . ..
Definition 5.2. An event E ⊆ S is common knowledge among a group of agents in true state s ∈ S,
if s is a member of some set F ⊆ E that is self-evident to all agents.
1 This definition of knowledge is related to the one used in epistemic logic. There an agent is said to know a fact φ
if φ is true in all states the agent considers possible. In the event-based framework, an agent knows an event E
if all the states the agent considers possible are contained in E. Fagin et al. (1995, Sec. 2.5) show that the two
approaches, logic-based and event-based, are equivalent.
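The knowledge operator and Definition 5.2 are easy to operationalize as well. In the sketch below (names are illustrative), the partitions are the final partitions (5.6) of the puzzle, and common knowledge at s is tested by closing {s} under all agents' information sets, which yields the smallest event containing s that is self-evident to every agent:

```python
def K(E, partition):
    """Knowledge operator K_i(E): the union of all cells of the partition
    that are fully contained in the event E."""
    return set().union(*[cell for cell in partition if cell <= E])

def common_knowledge(E, s, partitions):
    """Definition 5.2: E is common knowledge at s iff some F, with s in F and
    F a subset of E, is self-evident to every agent.  The smallest candidate F
    is the closure of {s} under all agents' information sets."""
    F = {s}
    while True:
        closure = set(F)
        for P in partitions:
            for cell in P:
                if cell & closure:
                    closure |= cell
        if closure == F:
            return F <= E
        F = closure

# The final partitions (5.6) of the puzzle of the hats.
P1 = [{'a', 'e'}, {'b', 'f'}, {'c', 'g'}, {'d'}, {'h'}]
P2 = [{'a', 'c'}, {'b'}, {'d'}, {'e', 'g'}, {'f'}, {'h'}]
P3 = [{s} for s in 'abcdefgh']

print(K({'a', 'e', 'c'}, P1))                                      # {'a', 'e'}
print(common_knowledge({'a', 'c', 'e', 'g'}, 'a', [P1, P2, P3]))   # True
print(common_knowledge({'a', 'e'}, 'a', [P1, P2, P3]))             # False
```

The outputs agree with the discussion above: K_1({a, e, c}) = {a, e}, and {a, c, e, g} is common knowledge in the true state a, whereas {a, e} is not.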
In general, the relation between states and observations can be summarized by a joint probability
distribution p(s, θ) over states and joint observations, from which various other quantities can be computed, like p(θ)
or p(θ|s), by using the laws of probability theory.2
5.4.4 Payoffs
In the puzzle of the hats the agents reply truthfully to the questions of the person. Although we
have not explicitly defined a payoff function in this problem, we can think of an implicit payoff
function that the agents maximize, in which, say, truthfulness is highly valued. In general,
multiagent decision making requires defining an explicit payoff function Q i for each agent.
This function can take several forms; for instance, it can be a function Q i (s , a) over states
and joint actions; or a function Q i (θ, a) over joint observations and joint actions; or a function
Q i (θi , a) over individual observations and joint actions (we will see an example of such a function
in Chapter 6). Note that often one form can be derived from the other; for instance, when an
inverse observation model p(s|θ) is available, we can write Q_i(θ, a) = Σ_{s∈S} p(s|θ) Q_i(s, a).
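As a small illustration of this last conversion, the following hypothetical helper (the dictionary layout is an assumption for the example, not a prescribed interface) computes Q_i(θ, a) from Q_i(s, a) and an inverse observation model p(s|θ):

```python
def payoffs_over_observations(Q_state, p_state_given_obs):
    """Compute Q_i(theta, a) = sum_s p(s | theta) * Q_i(s, a).

    Q_state:           dict mapping (state, joint_action) -> payoff
    p_state_given_obs: dict mapping joint_observation -> {state: p(s | theta)}
    """
    actions = {a for (_, a) in Q_state}
    return {(theta, a): sum(p * Q_state[(s, a)] for s, p in dist.items())
            for theta, dist in p_state_given_obs.items()
            for a in actions}

# Example: two states, one binary joint action, and a noisy observation.
Q_state = {('s1', 0): 1.0, ('s1', 1): 0.0, ('s2', 0): 0.0, ('s2', 1): 2.0}
p_state_given_obs = {'o1': {'s1': 0.8, 's2': 0.2}, 'o2': {'s1': 0.1, 's2': 0.9}}
print(payoffs_over_observations(Q_state, p_state_given_obs))
# ('o1', 0): 0.8, ('o1', 1): 0.4, ('o2', 0): 0.1, ('o2', 1): 1.8
```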
When the above primitives are defined, multiagent decision making under partial ob-
servability can be modeled by a Bayesian game, also known as strategic game with imperfect
information. This is a combination of the strategic game model of Section 3.2 with the concepts
of knowledge and partial observability defined in this chapter. In particular, a Bayesian game
assumes that there is a set of states S, from which one state (the true state) is realized at the
start of the game. The true state is only partially observable by the agents; each agent i receives
an observation θi , also called the type of agent i, that is hidden to the other agents, and that is
related to the state via a deterministic or stochastic observation model. Each agent additionally
possesses a payoff function Q i as described above. The solution of the game is a profile of
individual policies πi (θi ) that are optimal according to some solution concept, for instance,
Nash equilibrium (defined below). Note that each individual policy πi (θi ) specifies an action
to take by agent i for each of his observations, and not only for the observation that the agent
actually receives.
2 Here p(A) = Σ_B p(A, B) and p(A|B) = p(A, B)/p(B).
Definition 5.3. A Nash equilibrium of a Bayesian game is a Nash equilibrium of a new strategic
game in which each player is a pair (agent i, observation θi ) and has payoff function
  u_i(π_i(θ_i)) = Σ_s p(s|θ_i) Q_i(s, [π_i(θ_i), a_{−i}(s)]),        (5.9)
where a −i (s ) is the profile of actions taken by all other players except player (i, θi ) at state s .
Clearly, in order for this definition to be applicable, each agent must be able to infer the
action of each other agent at each state. This requires that the observation model is common
knowledge, and that it is a deterministic model where, for each i, the observation θi is a
deterministic function of s (for instance a partitional model as in the puzzle of the hats). In this
case, the policy π j (θ j ) of an agent j uniquely identifies his action at s through a j (s ) = π j (θ j (s )).
The second model of a Bayesian game does not make use of states. Instead it assumes
that payoffs are defined over joint observations and actions, in the form Q i (θ, a), and that a
marginal observation model p(θ ) is available. In this case, a Nash equilibrium is defined as in
Definition 5.3 with (5.9) replaced by
  u_i(π_i(θ_i)) = Σ_{θ_{−i}} p(θ_{−i}|θ_i) Q_i(θ, [π_i(θ_i), π_{−i}(θ_{−i})]),        (5.10)
where now the quantities π−i (θ−i ) are directly available, and p(θ−i |θi ) can be computed from
p(θ ). This second model of a Bayesian game is easier to work with, and it is often preferred
over the first one in practical problems.
In the special case of n collaborative agents with common payoff functions Q 1 = . . . =
Q n ≡ Q, coordination requires computing a Pareto optimal Nash equilibrium (see Chapter 4).
In the second model of a Bayesian game described above, such an equilibrium can be computed
by the following:
                           θ2                     θ̄2
                      a2        ā2           a2        ā2
   θ1     a1        +0.1      +2.2         +0.4      −0.2
          ā1        −0.5     [+2.0]        +1.0     [+2.0]
   θ̄1     a1        +0.4      −0.2         +0.7      −2.6
          ā1        +1.0     [+2.0]        +2.5     [+2.0]

FIGURE 5.2: A Bayesian game with common payoffs involving two agents and binary actions and
observations. Rows correspond to the observation and action of agent 1, columns to those of agent 2.
The bracketed entries indicate the Pareto optimal Nash equilibrium of this game.
Proposition 5.1. A Pareto optimal Nash equilibrium for a Bayesian game with a common
payoff function Q(θ, a) is a joint policy π ∗ = (πi∗ ) that satisfies
  π* = arg max_π Σ_θ p(θ) Q(θ, π(θ)).        (5.11)
Proof. From the perspective of some agent i, the above formula reads
  π_i* = arg max_{π_i} Σ_{θ_i} p(θ_i) Σ_{θ_{−i}} p(θ_{−i}|θ_i) Q_i(θ, [π_i(θ_i), π_{−i}*(θ_{−i})]).        (5.12)

Since the choices π_i(θ_i) for different observations θ_i are independent, the sum over θ_i is
maximized when each of its terms is maximized, so there must hold

  π_i*(θ_i) = arg max_{π_i(θ_i)} Σ_{θ_{−i}} p(θ_{−i}|θ_i) Q_i(θ, [π_i(θ_i), π_{−i}*(θ_{−i})]),        (5.13)
which is the definition of a Nash equilibrium from (5.10). This shows that π ∗ is a Nash
equilibrium. The proof that π ∗ is also Pareto optimal is left as an exercise.
Figure 5.2 shows an example of a two-agent Bayesian game with common payoffs,
where each agent i has two available actions, A_i = {a_i, ā_i}, and two available observations,
Θ_i = {θ_i, θ̄_i}. Assuming uniform p(θ), we can compute from (5.11) the Pareto optimal Nash
equilibrium π* = (π_1*, π_2*) of the game, which is π_1*(θ_1) = π_1*(θ̄_1) = ā_1 and
π_2*(θ_2) = π_2*(θ̄_2) = ā_2, with expected payoff 2.0.
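The equilibrium can be verified by brute force: with two observations and two actions per agent there are only 16 joint policies to score under (5.11). Below is a minimal sketch, using an ad hoc 0/1 encoding for θ_i/θ̄_i and a_i/ā_i (the encoding is ours, not the text's):

```python
from itertools import product

# Common payoffs Q(theta1, theta2)[(a1, a2)] read off Fig. 5.2,
# with observation theta_i encoded as 0, theta_i-bar as 1, and
# likewise action a_i as 0, a_i-bar as 1.
Q = {
    (0, 0): {(0, 0):  0.1, (0, 1):  2.2, (1, 0): -0.5, (1, 1): 2.0},
    (0, 1): {(0, 0):  0.4, (0, 1): -0.2, (1, 0):  1.0, (1, 1): 2.0},
    (1, 0): {(0, 0):  0.4, (0, 1): -0.2, (1, 0):  1.0, (1, 1): 2.0},
    (1, 1): {(0, 0):  0.7, (0, 1): -2.6, (1, 0):  2.5, (1, 1): 2.0},
}
p_theta = 0.25   # uniform distribution over the four joint observations

# A policy pi_i is a tuple: pi_i[theta_i] is the action taken at observation theta_i.
best_value, best_policy = max(
    (sum(p_theta * Q[(t1, t2)][(pi1[t1], pi2[t2])]
         for t1 in (0, 1) for t2 in (0, 1)),
     (pi1, pi2))
    for pi1 in product((0, 1), repeat=2)
    for pi2 in product((0, 1), repeat=2))

print(best_policy, best_value)   # ((1, 1), (1, 1)) 2.0: both agents always play a_i-bar
```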
CHAPTER 6
Mechanism Design
In this chapter we study the problem of mechanism design, which is the development of
agent interaction protocols that explicitly take into account the fact that the agents may be
self-interested. We discuss the revelation principle and the Vickrey–Clarke–Groves (VCG)
mechanism that allows us to build successful protocols in a variety of cases.
1 In this chapter we will use 'he' to refer to an agent, and 'we' to refer to the mechanism designer.
One option is to ask each agent to tell us his type, but there is no guarantee that an agent will report his true
type! Recall that each agent i forms his own preferences over outcomes, given by his valuation
function ν_i(θ_i, o) that is parametrized by his true type θ_i. If by reporting a false type θ̃_i ≠ θ_i an
agent i expects to receive a higher payoff than by reporting his true type θ_i, then this agent may
certainly consider lying. For instance, if a social choice function chooses the outcome that is
last in the preferences of agent 1, that is, f(θ) = arg min_o ν_1(θ_1, o), then agent 1 will report a
false type θ̃_1 for which arg min_o ν_1(θ̃_1, o) = arg max_o ν_1(θ_1, o).
The challenge therefore is to design non-manipulable mechanisms in which no agent
can benefit from not abiding by the rules of the mechanism. For instance, if a mechanism
requires each agent to report his true type, then we would like truth-telling to be indeed in
the best interests of each agent. Viewed from a computational perspective, we can characterize
mechanism design as the development of efficient and robust algorithms for optimization
problems with distributed parameters, where these parameters are controlled by agents that
have different preferences for different solutions.
We focus here on simple mechanisms in the form of a Bayesian game with the following
primitives:
• A_i is the set of available actions of agent i.
• Θ_i is the set of types of agent i.
• g : A → O is an outcome function that maps a joint action a = (a_i) to an outcome o = g(a).
• ξ_i : O → IR is a payment function, so that agent i receives payment ξ_i(o) under outcome o.
• Q_i(θ_i, a) is the payoff function of agent i, defined as his valuation of the resulting outcome plus his payment:

      Q_i(θ_i, a) = ν_i(θ_i, g(a)) + ξ_i(g(a)).        (6.2)
In an equilibrium in dominant strategies, each agent i chooses for each of his types θ_i the
action π_i*(θ_i) with the highest payoff, no matter what actions the other agents take. In
particular, note that an agent i does not consider in the above equilibrium the types θ_{−i} or
the policies π_{−i}*(θ_{−i}) of the other agents. This is in contrast to the solution concept of
a Nash equilibrium (5.10) where each agent i is assumed to possess a conditional distribution
p(θ−i |θi ) over the types of the other agents, and must know the policies of the other agents at
the equilibrium.
Our choice of such a solution concept is motivated by the fact that we would like to design
mechanisms in which each agent can compute his optimal action without having to worry about
the actions of the other agents. In terms of predictive power for the solutions of a game, an
equilibrium in dominant actions is weaker than both a Nash equilibrium and an equilibrium
computed by iterated elimination of strictly dominated actions (see Chapter 3). However, in
the context of mechanism design, the existence of such an equilibrium guarantees that every
(rational) agent will adhere to it, even if he has no information about the preferences of the
other agents. Such an equilibrium solution is also very attractive computationally, because an
agent does not need to consider the types or the policies of the other agents.
Summarizing, the mechanism design problem can be defined as follows:
Definition 6.2 (The mechanism design problem). Given a set of outcomes o ∈ O, a profile of
valuation functions νi (θi , o ) parametrized by θi , and a social choice function f (θ ), find appropriate
action sets A_i, an outcome function g(a), and payment functions ξ_i(o), such that for any profile of true
types θ = (θ_i) and for payoff functions Q_i(θ_i, a) defined via (6.2), we have g(π*(θ)) = f(θ), where π*
is an equilibrium in dominant strategies of the Bayesian game M = (A_i, g, ξ_i). In this case we say
that the mechanism M implements the social choice function f in dominant strategies.
As an illustration, suppose we want to allocate a single item to one of n agents, ideally to the agent who
values it most, but we do not know the true valuations (types) of the agents. In this example,
an outcome o ∈ {1, . . . , n} is the index of the agent to whom the item is assigned, while the
valuation function of an agent i with type θi ∈ IR+ is νi (θi , o ) = θi if o = i and zero otherwise.
The social choice function is f(θ_1, . . . , θ_n) = arg max_i θ_i, which is a special case of (6.1). If we
do not include payments, that is, ξ_i = 0 for all i, then a mechanism M1 = (A_i, g, ξ_i)
that implements f is always individually rational, because for every agent i we have Q_i(θ_i, ·) = ν_i(θ_i, ·),
which is either θ_i > 0 or zero.
Since the ξ_i are identical in M and M', using (6.2) we can rewrite (6.4) as

  Q_i^M(θ_i, [θ_i, θ̃_{−i}]) ≥ Q_i^M(θ_i, [θ̃_i, θ̃_{−i}]).        (6.5)
A mechanism of the form M = (Θ_i, f, ξ_i) in which each agent is asked to report his type
is called a direct-revelation mechanism. A direct-revelation mechanism in which truth-telling
is a dominant strategy for every agent is called strategy-proof. The revelation principle is
remarkable because it allows us to restrict our attention to strategy-proof mechanisms only. One
of its consequences, for example, is that if we cannot implement a social choice function by a
strategy-proof mechanism, then there is no way to implement this function in dominant strategies
by any other general mechanism. The revelation principle has been a powerful theoretical tool
in mechanism design.
A celebrated family of strategy-proof mechanisms are the Groves mechanisms, which apply
to allocatively efficient social choice functions, that is, functions that maximize the total
reported valuation of the agents:

  f(θ̃) = arg max_{o∈O} Σ_{i=1}^n ν_i(θ̃_i, o).        (6.6)
In a Groves mechanism, the payment function that is associated with a profile of reported
types θ̃ is defined for each agent as
  ξ_i(f(θ̃)) = Σ_{j≠i} ν_j(θ̃_j, f(θ̃)) − h_i(θ̃_{−i}),        (6.7)
for an arbitrary function h_i(θ̃_{−i}) that does not depend on the report of agent i. In this case, and
for payoffs given by (6.2), we can show that truth-telling is a dominant strategy for every agent,
that is, every Groves mechanism is strategy-proof (the proof is left as an exercise).
Having the freedom to choose any function h i (θ̃−i ), the Clarke mechanism, also known
as Vickrey–Clarke–Groves (VCG) mechanism, uses
  h_i(θ̃_{−i}) = Σ_{j≠i} ν_j(θ̃_j, f(θ̃_{−i})),        (6.8)

where f(θ̃_{−i}) is an allocatively efficient social choice function with agent i excluded:

  f(θ̃_{−i}) = arg max_{o∈O} Σ_{j≠i} ν_j(θ̃_j, o).        (6.9)
Under quite general conditions, the VCG mechanism can be shown to be individually rational.
Moreover, in some applications the payments ξi to the agents are negative, so the mechanism
does not need to be externally subsidized (however, the collected tax must be burnt).
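For the single-item allocation example above, the VCG mechanism takes a particularly familiar form: the item goes to the highest reported value and the winner is charged the second-highest report, i.e., a Vickrey (second-price) auction. A minimal sketch (function and variable names are illustrative):

```python
def vcg_single_item(reports):
    """Clarke (VCG) mechanism for allocating one item among n >= 2 agents,
    where reports[i] is the reported value of agent i.

    The allocatively efficient choice (6.6) gives the item to the highest
    report; the Clarke payment (6.7) with h_i from (6.8)-(6.9) then charges
    the winner the best total value the others could achieve without him,
    i.e. the second-highest report.  Losing agents neither pay nor receive
    anything, since both terms of (6.7) coincide for them.
    """
    n = len(reports)
    winner = max(range(n), key=reports.__getitem__)
    runner_up = max(v for i, v in enumerate(reports) if i != winner)
    payments = [0.0] * n
    payments[winner] = -runner_up        # negative payment: the winner pays
    return winner, payments

print(vcg_single_item([3.0, 7.0, 5.0]))   # (1, [0.0, -5.0, 0.0])
```

The winner's payoff is his true value minus the second-highest report, which is nonnegative, and reporting truthfully is a dominant strategy, as in any VCG mechanism.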
A classical application of the VCG mechanism is shortest-path routing: each agent i owns an
edge of a graph, the cost θ_i of his edge is known only to him, and we want to buy a path of
minimum total cost between two fixed nodes. The mechanism selects the shortest path under
the reported costs and pays each agent whose edge lies on that path

  ξ_i = θ̃_i − C + C',        (6.10)

where C is the additive cost (length) of the shortest path solution, and C' is the length of
the shortest path solution after edge i is removed from the graph. From (6.2) and (6.10), the
payoff of agent i under truth-telling is Q_i(θ_i, [θ_i, θ̃_{−i}]) = −θ_i + θ_i − C + C' = C' − C, which is always
nonnegative since removing an edge from a graph can never generate a shorter path. It is
therefore individually rational for an agent to participate in this mechanism, and because VCG
mechanisms are strategy-proof, each agent will report his true cost.
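The payments (6.10) are straightforward to compute once shortest paths can be computed with and without a given edge. The following sketch (with a hypothetical edge-dictionary format, and with the chosen path supplied by hand for brevity) illustrates this on a tiny graph:

```python
import heapq

def shortest_path_cost(edges, source, target, exclude=None):
    """Dijkstra on edges = {name: (u, v, cost)} (undirected); returns the
    length of the cheapest source-target path, ignoring edge `exclude`."""
    adj = {}
    for name, (u, v, cost) in edges.items():
        if name != exclude:
            adj.setdefault(u, []).append((v, cost))
            adj.setdefault(v, []).append((u, cost))
    dist, heap = {source: 0.0}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float('inf')):
            continue
        for v, cost in adj.get(u, []):
            if d + cost < dist.get(v, float('inf')):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return float('inf')

def vcg_path_payments(edges, source, target, chosen_path):
    """Payment (6.10) to every agent whose edge lies on the chosen shortest
    path: xi_i = theta_i - C + C'; off-path agents are paid nothing."""
    C = shortest_path_cost(edges, source, target)
    return {name: edges[name][2] - C
                  + shortest_path_cost(edges, source, target, exclude=name)
            for name in chosen_path}

# Tiny example: the cheap route s-a-t (cost 5) is chosen over the direct
# edge s-t (cost 7).
edges = {'e1': ('s', 'a', 2.0), 'e2': ('a', 't', 3.0), 'e3': ('s', 't', 7.0)}
print(vcg_path_payments(edges, 's', 't', chosen_path=['e1', 'e2']))
# {'e1': 4.0, 'e2': 5.0}: each on-path agent is paid more than his cost,
# and his net payoff equals C' - C = 2.0, nonnegative as argued above.
```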
CHAPTER 7
Learning
In this chapter we briefly address the issue of learning, in particular reinforcement learning,
which allows agents to learn from delayed rewards. We outline existing techniques for single-agent
systems, and show how they can be extended to the multiagent case.
In the single-agent case the learning problem can be formalized as a Markov decision process (MDP), which consists of:
• Discrete time t = 0, 1, 2, . . ..
• A discrete set of states s ∈ S.
• A discrete set of actions a ∈ A.
• A stochastic transition model p(s'|s, a), so that the world transitions stochastically to state s' when the agent takes action a at state s.
• A reward function R : S × A → IR, so that the agent receives reward R(s, a) when it takes action a at state s.
• A planning horizon, which can be infinite.
The task of the agent is to maximize a function of accumulated reward over its planning
horizon. A standard such function is the discounted future reward R(s_t, a_t) + γ R(s_{t+1}, a_{t+1}) +
γ² R(s_{t+2}, a_{t+2}) + · · ·, where γ ∈ [0, 1) is a discount rate that ensures that the sum remains
finite for an infinite horizon.
A stationary policy of the agent in an MDP is a mapping π (s ) from states to actions, as
in Section 2.4. Clearly, different policies will produce different discounted future rewards, since
each policy will take the agent through different trajectories in the state space. The optimal
value of a state s for the particular agent is defined as the maximum discounted future reward
the agent can receive in state s by following some policy:
  V*(s) = max_π E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_t = π(s_t) ],        (7.1)
where the expectation operator E[·] averages over the stochastic transitions. Similarly, the
optimal Q-value of a state s and action a of the agent is the maximum discounted future
reward the agent can receive after taking action a in state s :
  Q*(s, a) = max_π E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a, a_{t>0} = π(s_t) ].        (7.2)
A policy π ∗ (s ) that achieves the maximum in (7.1) or (7.2) is an optimal policy for the agent. For
an MDP there is always an optimal policy that is deterministic and stationary. Deterministic
means that π ∗ (s ) specifies a single action per state. Stationary means that every time the agent
visits a state s , the optimal action to take at s is always π ∗ (s ). An optimal policy is greedy with
respect to V* or Q*, as we have seen in Section 2.4:

  π*(s) = arg max_a Q*(s, a).        (7.3)
Note that there can be many optimal policies in a given task, but they all share a unique V ∗
and Q ∗ .
The definition of V ∗ in (7.1) can be rewritten recursively by making use of the transition
model, to get the so-called Bellman equation:
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') ].        (7.4)

This is a set of nonlinear equations, one for each state, the solution of which defines the optimal
V*. A similar recursive definition holds for Q-values:

  Q*(s, a) = R(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'} Q*(s', a').        (7.5)
Value iteration computes these quantities by turning the above equations into updates that are
applied repeatedly over all states and actions:

  V(s) := max_a Q(s, a),        (7.6)
  Q(s, a) := R(s, a) + γ Σ_{s'} p(s'|s, a) V(s').        (7.7)

We repeat the above two equations until V does not change significantly between two consecutive
steps. Value iteration converges to the optimal Q* (and thus to V* and π*) for any
initialization (Bertsekas, 2001). After we have computed Q ∗ we can extract an optimal policy
π ∗ using (7.3). As an example, using value iteration in the world of Fig. 2.1 of Chapter 2,
with fixed reward R(s , a) = −1/30 for each nonterminal state s and action a, and with no
discounting, we get the optimal values (utilities) and the optimal policy shown in Fig. 2.2. The
reader is encouraged to verify this by implementing the method.
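In that spirit, here is a minimal value iteration implementation, applied to a tiny two-state MDP of our own rather than the grid world of Fig. 2.1 (all names are illustrative):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-9):
    """Q-value iteration in the spirit of (7.6)-(7.7).

    P[(s, a)] is a dict {s_next: probability}, R[(s, a)] the reward.
    Returns Q* and a greedy (optimal) policy as in (7.3)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                backup = R[(s, a)] + gamma * sum(
                    p * max(Q[(s2, a2)] for a2 in actions)   # V(s') = max_a' Q(s', a')
                    for s2, p in P[(s, a)].items())
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            return Q, {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Tiny two-state MDP: 'go' moves (usually) towards the rewarding state s1,
# where 'stay' keeps collecting reward 1.
states, actions = ['s0', 's1'], ['stay', 'go']
P = {('s0', 'stay'): {'s0': 1.0}, ('s0', 'go'): {'s1': 0.9, 's0': 0.1},
     ('s1', 'stay'): {'s1': 1.0}, ('s1', 'go'): {'s0': 0.9, 's1': 0.1}}
R = {('s0', 'stay'): 0.0, ('s0', 'go'): 0.0,
     ('s1', 'stay'): 1.0, ('s1', 'go'): 0.0}
Q, policy = value_iteration(states, actions, P, R)
print(policy)   # {'s0': 'go', 's1': 'stay'}
```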
7.2.2 Q-learning
In order to apply the value iteration updates (7.6) and (7.7) we need to know the transition model
p(s'|s, a), but in many applications the transition model is unavailable. Q-learning is a method
for estimating the optimal Q* (and from that an optimal policy) that does not require knowledge
of the transition model. In Q-learning the agent repeatedly interacts with the environment and,
after each observed transition (s, a, r, s') with reward r = R(s, a), updates its Q-function by

  V(s') := max_{a'} Q(s', a'),        (7.8)
  Q(s, a) := (1 − λ) Q(s, a) + λ [ r + γ V(s') ],        (7.9)

where λ ∈ (0, 1] is a learning rate. During learning the agent must also explore, for instance by
selecting actions according to a Boltzmann (softmax) distribution in which action a is chosen at
state s with probability proportional to exp(Q(s, a)/τ), where τ controls the smoothness of the
distribution (and thus the randomness of the choice), and is decreasing with time.
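A tabular Q-learning sketch with Boltzmann exploration follows; the toy environment and all names are illustrative, and for simplicity the learning rate λ is kept fixed rather than properly decreased:

```python
import math, random

def softmax_action(Q, s, actions, tau):
    """Boltzmann exploration: pick a with probability ~ exp(Q(s, a) / tau)."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

def q_learning(env_step, start, states, actions,
               episodes=2000, steps=50, gamma=0.9, lam=0.1, tau0=1.0):
    """Tabular Q-learning with updates (7.8)-(7.9).  `env_step(s, a)` is
    assumed to return a sampled (reward, next_state) pair."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for episode in range(episodes):
        s = start
        tau = max(0.05, tau0 / (1 + episode))   # temperature decreases with time
        for _ in range(steps):
            a = softmax_action(Q, s, actions, tau)
            r, s2 = env_step(s, a)
            V = max(Q[(s2, a2)] for a2 in actions)                      # (7.8)
            Q[(s, a)] = (1 - lam) * Q[(s, a)] + lam * (r + gamma * V)   # (7.9)
            s = s2
    return Q

# Toy environment: 'go' from s0 usually reaches s1, where 'stay' earns reward 1.
def env_step(s, a):
    if s == 's0':
        return (0.0, 's1') if a == 'go' and random.random() < 0.9 else (0.0, 's0')
    return (1.0, 's1') if a == 'stay' else (0.0, 's0')

Q = q_learning(env_step, 's0', ['s0', 's1'], ['stay', 'go'])
print(max(['stay', 'go'], key=lambda a: Q[('s0', a)]))   # should print 'go'
```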
A Markov game extends the MDP model to n agents: it involves a discrete set of states s ∈ S,
a set of joint actions A = A_1 × · · · × A_n, a stochastic transition model p(s'|s, a) defined over
joint actions a ∈ A, and additionally:
• For each agent i, a reward function R_i : S × A → IR, that gives agent i reward R_i(s, a) when joint action a is taken at state s.
• A planning horizon, which can be infinite.
Note that the individual reward functions Ri (s , a) define a set of strategic games, one for
each state s. A Markov game differs from an MDP in that a transition depends on the joint action
of the agents, and that each agent may receive a different reward as a result of a joint action. As
in MDPs, a policy of an agent i is a mapping π_i(s) from states to individual actions. As in
strategic games, a joint policy π* = (π_i*) is a Nash equilibrium if no agent has an incentive to
unilaterally change its policy; that is, no agent i would like to take at state s an action a_i ≠ π_i*(s)
assuming that all other agents stick with their equilibrium policies π_{−i}*(s). Contrary to MDPs,
in a Markov game an optimal policy of an agent need not be deterministic; we can see this
by noticing that a single-state Markov game is just a strategic game, for which we know that
deterministic equilibria may not always exist (see Fig. 3.2(a)).
where C is a function that applies some solution concept to the strategic game formed by the
Q 1 , . . . , Q n . When the transition model is not available, a corresponding coupled Q-learning
update scheme can be derived in which (7.12) is replaced by (7.9), one per agent. Note that
such a multiagent Q-learning scheme requires that each agent observes the selected joint action
in each step.
Depending on the type of game (zero-sum, general-sum, or coordination game),
the function C may compute a Nash equilibrium (Hu and Wellman, 2004, Littman, 1994,
2001), a correlated equilibrium1 (Greenwald and Hall, 2003), or a coordinated joint ac-
tion (Kok and Vlassis, 2006). Although successful in practice, the method may not always
be able to compute an optimal equilibrium policy (Zinkevich et al., 2006).
Such a model allows for a decentralized learning algorithm where all update steps are local, and
intermediate results are communicated by solving a global coordination game. In particular,
following the general multiagent learning approach of Section 7.3.2, we can derive a multiagent
1 A correlated equilibrium is a generalization of a mixed-strategy Nash equilibrium where the mixed strategies of the
agents can be correlated.
Q-learning algorithm for a CM-MDP. Assuming that each agent i observes the tuple (s, a, r_i, s')
with r_i = R_i(s, a), we obtain:

  V_i(s') := Q_i(s', a*),  where  a* ∈ arg max_{a'} Σ_{i=1}^n Q_i(s', a'),        (7.14)
  Q_i(s, a) := (1 − λ) Q_i(s, a) + λ [ r_i + γ V_i(s') ].        (7.15)
Note that the Q-learning update rule (7.15) is fully decentralized (each agent applies a
local update step separately), while (7.14) involves computing a coordinated joint action
(a Pareto optimal Nash equilibrium) in a global coordination game with common payoff
Q(s, a) = Σ_{i=1}^n Q_i(s, a). The latter can be carried out with a coordination algorithm as in
Chapter 4.
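One update step of (7.14)–(7.15) can be sketched as follows; here the coordination step is done by plain enumeration over joint actions rather than by a coordination-graph algorithm, and all names are illustrative:

```python
from itertools import product

def coordinate(Q_locals, s, joint_actions):
    """Coordination step of (7.14): a joint action maximizing the common
    payoff sum_i Q_i(s, a).  Found here by plain enumeration; with a sparse
    coordination-graph decomposition one would use variable elimination or
    max-plus instead (Section 4.4)."""
    return max(joint_actions, key=lambda a: sum(Q[(s, a)] for Q in Q_locals))

def multiagent_q_update(Q_locals, s, a, rewards, s_next, joint_actions,
                        gamma=0.9, lam=0.1):
    """One decentralized update (7.15) per agent after observing
    (s, a, r_i, s'); the only shared computation is the argmax of (7.14)."""
    a_star = coordinate(Q_locals, s_next, joint_actions)
    for Q, r in zip(Q_locals, rewards):
        V = Q[(s_next, a_star)]                    # V_i(s') from (7.14)
        Q[(s, a)] = (1 - lam) * Q[(s, a)] + lam * (r + gamma * V)

# Two agents with binary actions; Q_locals[i] maps (state, joint_action) -> value.
joint_actions = list(product((0, 1), repeat=2))
Q_locals = [{('s', a): 0.0 for a in joint_actions} for _ in range(2)]
multiagent_q_update(Q_locals, 's', (0, 1), rewards=[1.0, 0.5],
                    s_next='s', joint_actions=joint_actions)
print(Q_locals[0][('s', (0, 1))], Q_locals[1][('s', (0, 1))])   # prints 0.1 0.05
```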
When the optimal global Q-function of the task is decomposable as Q*(s, a) = Σ_{i=1}^n Q_i*(s, a),
we can easily show (by summing (7.15) over the n agents) that the above Q-learning
algorithm converges to an optimal joint policy. Such a decomposition of Q* does not hold in
general, but there are cases where Q* may indeed be decomposable (Wiegerinck et al., 2006). In
these cases, and assuming a properly decreasing learning rate λ, the above Q-learning algorithm
will be optimal.
If each local Q-function Q i (s , a) is stored as a table (one entry for each state and joint
action), the above approach will not scale to many agents. Alternatively we can use a coordination
graph approach to represent the global Q-function (see Section 4.4). For instance, we can
assume a decomposition Q(s, a) = Σ_{i=1}^n Q_i(s, a), where now each local term Q_i may depend
only on a few actions (say, those of the neighbors of i in the graph). When such a sparse decomposition of
the Q-function is assumed, the coordination step in (7.14) can be carried out (exactly or approx-
imately) by the techniques presented in Section 4.4. In this case, the multiagent Q-learning al-
gorithm (7.14)–(7.15) has been dubbed Sparse Cooperative Q-learning (Kok and Vlassis, 2006).
Various representations of the local Q-functions can be used, for instance a representation using
a function approximator (Guestrin et al., 2002b). A related approach that uses different local
update rules has been proposed by Schneider et al. (1999).
Lagoudakis and Parr (2003), Moallemi and Van Roy (2004), Peshkin et al. (2000), Singh et al.
(2000), Tesauro (2004), Zinkevich et al. (2006). Multiagent reinforcement learning is still a
young field; Shoham et al. (2007) and Gordon (2007) identify several research agendas that
can be used for guiding research and evaluating progress.
Bibliography
Adamatzky, A. and Komosinski, M., editors (2005). Artificial Life Models in Software. Springer-
Verlag, Berlin.
Arnborg, S., Corneil, D. G., and Proskurowski, A. (1987). Complexity of finding embeddings
in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284.
Aumann, R. J. (1976). Agreeing to disagree. Annals of Statistics, 4(6):1236–1239.
Bagnell, J. A. and Ng, A. Y. (2006). On local rewards and the scalability of dis-
tributed reinforcement learning. In Weiss, Y., Schölkopf, B., and Platt, J., edi-
tors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge,
MA.
Bellman, R. (1961). Adaptive Control Processes: a Guided Tour. Princeton University Press.
Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of
decentralized control of Markov decision processes. Mathematics of Operations Research,
27(4):819–840. doi:10.1287/moor.27.4.819.297
Bertsekas, D. P. (2001). Dynamic Programming and Optimal Control, volume I and II. Athena
Scientific, 2nd edition.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific.
Bordini, R., Dastani, M., Dix, J., and El Fallah Seghrouchni, A., editors (2005). Multi-Agent
Programming: Languages, Platforms and Applications. Springer, Berlin.
Boutilier, C. (1996). Planning, learning and coordination in multiagent decision pro-
cesses. In Proc. Conf. on Theoretical Aspects of Rationality and Knowledge, Renesse, The
Netherlands.
Bowling, M. (2005). Convergence and no-regret in multiagent learning. In Saul, L. K.,
Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17,
pp. 209–216. MIT Press, Cambridge, MA.
Brafman, R. I. and Tennenholtz, M. (2003). Learning to coordinate efficiently: A model based
approach. Journal of Artificial Intelligence Research, 19:11–23.
Brafman, R. I. and Tennenholtz, M. (2004). Efficient learning equilibrium. Artificial Intelli-
gence, 159(1-2):27–47. doi:10.1016/j.artint.2004.04.013
Castelpietra, C., Iocchi, L., Nardi, D., Piaggio, M., Scalzo, A., and Sgorbissa, A. (2000).
Coordination among heterogenous robotic soccer players. In Proc. IEEE/RSJ Int. Conf. on
Intelligent Robots and Systems, Takamatsu, Japan.
Gibbons, R. (1992). Game Theory for Applied Economists. Princeton University Press, Princeton,
NJ.
Gilbert, N. and Doran, J., editors (1994). Simulating Societies: The Computer Simulation of Social
Phenomena. UCL Press, London.
Gmytrasiewicz, P. J. and Durfee, E. H. (2001). Rational communication in
multi-agent environments. Autonomous Agents and Multi-Agent Systems, 4:233–272.
doi:10.1023/A:1011495811107
Gordon, G. J. (2007). Agendas for multi-agent learning. Artificial Intelligence. doi:
10.1016/j.artint.2006.12.006.
Greenwald, A. (2007). The Search for Equilibrium in Markov Games (Synthesis Lectures on
Artificial Intelligence and Machine Learning). Morgan & Claypool Publishers, San Rafael,
CA.
Greenwald, A. and Hall, K. (2003). Correlated-Q learning. In Proc. 20th Int. Conf. on Machine
Learning, Washington, DC, USA.
Groves, T. (1973). Incentives in teams. Econometrica, 41:617–631. doi:10.2307/1914085
Guestrin, C. (2003). Planning Under Uncertainty in Complex Structured Environments. PhD
thesis, Computer Science Department, Stanford University.
Guestrin, C., Koller, D., and Parr, R. (2002a). Multiagent planning with factored MDPs. In
Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information
Processing Systems 14. The MIT Press, Cambridge, MA.
Guestrin, C., Lagoudakis, M., and Parr, R. (2002b). Coordinated reinforcement learning. In
Proc. 19th Int. Conf. on Machine Learning, Sydney, Australia.
Hansen, E. A., Bernstein, D. S., and Zilberstein, S. (2004). Dynamic programming for partially
observable stochastic games. In Proc. 19th National Conf. on Artificial Intelligence, San Jose,
CA.
Harsanyi, J. C. (1967). Games with incomplete information played by ‘Bayesian’ players, Parts
I, II, and III. Management Science, 14:159–182, 320–334, and 486–502.
Harsanyi, J. C. and Selten, R. (1988). A General Theory of Equilibrium Selection in Games, MIT
Press, Cambridge, MA.
Hu, J. and Wellman, M. P. (2004). Nash Q-learning for general-sum stochastic games. Journal
of Machine Learning Research, 4:1039–1069. doi:10.1162/1532443041827880
Huhns, M. N., editor (1987). Distributed Artificial Intelligence. Pitman, Morgan Kaufmann,
San Mateo, CA.
Jennings, N. R. (1996). Coordination techniques for distributed artificial intelligence. In
O’Hare, G. M. P. and Jennings, N. R., editors, Foundations of Distributed Artificial In-
telligence, pp. 187–210. John Wiley & Sons, New York.
Kakade, S. (2003). On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby
Computational Neuroscience Unit, University College London.
Nisan, N., Tardos, E., and Vazirani, V., editors (2007). Algorithmic Game Theory. Cambridge
University Press, Cambridge.
Noriega, P. and Sierra, C., editors (1999). Agent Mediated Electronic Commerce. Lecture Notes
in Artificial Intelligence 1571. Springer, Berlin.
O’Hare, G. M. P. and Jennings, N. R., editors (1996). Foundations of Distributed Artificial
Intelligence. John Wiley & Sons, New York.
Oliehoek, F. A. and Vlassis, N. (2007). Q-value functions for decentralized POMDPs. In Proc.
of Int. Joint Conf. on Autonomous Agents and Multi Agent Systems, Honolulu, Hawai’i.
Osborne, M. J. (2003). An Introduction to Game Theory. Oxford University Press, Oxford.
Osborne, M. J. and Rubinstein, A. (1994). A Course in Game Theory. MIT Press, Cambridge,
MA.
Papadimitriou, C. H. (2001). Algorithms, games, and the Internet. In Proc. 33rd Ann. ACM
Symp. on Theory of Computing, Heraklion, Greece.
Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of Markov decision processes.
Mathematics of Operations Research, 12(3):441–450.
Parkes, D. C. (2001). Iterative Combinatorial Auctions: Achieving Economic and Computational
Efficiency. PhD thesis, Computer and Information Science, University of Pennsylvania.
Parkes, D. C. and Shneidman, J. (2004). Distributed implementations of Vickrey-Clarke-
Groves mechanisms. In Proc. 3rd Int. Joint Conf. on Autonomous Agents and Multiagent
Systems, New York, USA.
Paskin, M. A., Guestrin, C. E., and McFadden, J. (2005). A robust architecture for distributed
inference in sensor networks. In Proc. 4th Int. Symp. on Information Processing in Sensor
Networks, Los Angeles, CA.
Pearce, J. P. and Tambe, M. (2007). Quality guarantees on k-optimal solutions for distributed
constraint optimization problems. In Proc. 20th Int. Joint Conf. on Artificial Intelligence,
Hyderabad, India.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo,
CA.
Peshkin, L., Kim, K.-E., Meuleau, N., and Kaelbling, L. (2000). Learning to cooperate via
policy search. In Proc. 16th Int. Conf. on Uncertainty in Artificial Intelligence, San Francisco,
CA.
Peyton Young, H. (1993). The evolution of conventions. Econometrica, 61(1):57–84.
doi:10.2307/2951778
Porter, R., Nudelman, E., and Shoham, Y. (2004). Simple search methods for finding a Nash
equilibrium. In Proc. 19th National Conf. on Artificial Intelligence, San Jose, CA.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete
Bayesian reinforcement learning. In Proc. Int. Conf. on Machine Learning, Pittsburgh, USA.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA.
Sycara, K. (1998). Multiagent systems. AI Magazine, 19(2):79–92.
Symeonidis, A. L. and Mitkas, P. A., editors (2006). Agent Intelligence Through Data Mining.
Springer, Berlin.
Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In
Proc. 10th Int. Conf. on Machine Learning, Amherst, MA.
Terzopoulos, D. (1999). Artificial life for computer graphics. Commun. ACM, 42(8):32–42.
doi:10.1145/310930.310966
Tesauro, G. (2004). Extending Q-learning to general adaptive multi-agent systems. In Thrun,
S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems
16. MIT Press, Cambridge, MA.
Thorndike, E. L. (1898). Animal Intelligence: An Experimental Study of the Associative Processes
in Animals. PhD thesis, Columbia University.
Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. Journal of
Finance, 16:8–37. doi:10.2307/2977633
Vidal, J. M. (2007). Fundamentals of Multiagent Systems with NetLogo Examples. Electronically
available at www.multiagent.com.
Vlassis, N., Elhorst, R. K., and Kok, J. R. (2004). Anytime algorithms for multiagent decision
making using coordination graphs. In Proc. Int. Conf. on Systems, Man and Cybernetics, The
Hague, The Netherlands.
von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. John
Wiley & Sons, New York.
von Stengel, B. (2007). Equilibrium computation for two-player games in strategic and exten-
sive form. In Nisan, N., Tardos, E., and Vazirani, V., editors, Algorithmic Game Theory.
Cambridge University Press, Cambridge.
Wainwright, M. J., Jaakkola, T. S., and Willsky, A. S. (2004). Tree consistency and bounds
on the performance of the max-product algorithm and its generalizations. Statistics and
Computing, 14:143–166. doi:10.1023/B:STCO.0000021412.33763.d5
Wang, X. and Sandholm, T. (2003). Reinforcement learning to play an optimal Nash equi-
librium in team Markov games. In Becker, S., Thrun, S., and Obermayer, K., editors,
Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, MA.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
Weiss, G., editor (1999). Multiagent Systems: a Modern Approach to Distributed Artificial Intel-
ligence. MIT Press, Cambridge, MA.
Wiegerinck, W., van den Broek, B., and Kappen, B. (2006). Stochastic optimal control in
continuous space-time multi-agent systems. In Proc. 22nd Int. Conf. on Uncertainty in
Artificial Intelligence, MIT Press, Cambridge, MA.
Author Biography
Nikos Vlassis was born in 1970 in Corinth, Greece. He received an MSc (1993) and a PhD
(1998) in Electrical and Computer Engineering from the National Technical University of
Athens, Greece. In 1998 he joined the Informatics Institute of the University of Amsterdam,
The Netherlands, as research fellow, and in 1999 he visited the Electrotechnical Labora-
tory (ETL, currently AIST) in Tsukuba, Japan, with a scholarship from the Japan Industrial
Technology Association (MITI). From 2000 until 2006 he held an Assistant Professor posi-
tion in the Informatics Institute of the University of Amsterdam, The Netherlands. Since 2007
he has held an Assistant Professor position in the Department of Production Engineering and
Management of the Technical University of Crete, Greece. He is coauthor of about 100 papers
on various topics in the fields of machine learning, multiagent systems, robotics, and computer
vision, and has received numerous citations. Awards that he has received include the Dimitris
Chorafas Foundation prize for young researchers in Engineering and Technology (Luzern,
Switzerland, 1998), best-teacher mention at the University of Amsterdam (2001–2005), best
scientific paper award with the paper ‘Using the max-plus algorithm for multiagent decision
making in coordination graphs’ in the annual RoboCup symposium (2005), and various dis-
tinctions with the UvA Trilearn robot soccer team including the 1st position at the RoboCup
world championship (2003). His current research interests are in the areas of robotics, machine
learning, and stochastic optimal control.