pomdp_py: A Framework to Build and Solve POMDP Problems
Abstract

In this paper, we present pomdp_py, a general purpose Partially Observable Markov Decision Process (POMDP) library written in Python and Cython.
POMDPs
POMDPs (Kaelbling, Littman, and Cassandra 1998) model sequential decision making problems where the agent must act under partial observability of the environment state (Figure 2). POMDPs consider both uncertainty in action effects (i.e. transitions) and in observations, which are usually incomplete and noisy information related to the state. A POMDP is defined as a tuple ⟨S, A, O, T, O, R, γ⟩. The problem domain is specified by S, A, O: the state, action, and observation spaces. At each time step, the agent decides to take an action a ∈ A, which may be sampled from a ∼ π(ht, ·) according to a policy model π(ht, a) = Pr(a|ht). This leads to a state change from s to s′ ∼ T(s, a, s′) according to the transition model T. Then, the agent receives an observation o ∼ O(s′, a, o) according to the observation model O, and a reward r ∼ R(s, a, s′), r ∈ ℝ, according to the reward model R. Upon receiving o and r, the agent updates its history ht and belief bt to ht+1 and bt+1. The goal of solving a POMDP is to find a policy π(ht, ·) which maximizes the expectation of future discounted rewards:

V^π(h_t) = E[ ∑_{k=t}^{∞} γ^{k−t} R(s_k, a_k) | a_k = π(h_k, ·) ],

where γ is the discount factor. In pomdp_py, a few key interfaces are defined to help organize the definition of POMDPs in a simple and consistent manner.

[Figure 2: POMDP model of agent-environment interaction. (1) Agent takes an action. (2) Environment state transitions. (3) Agent receives an observation and a reward signal. (4) Agent updates history and belief.]
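To make the generative definitions above concrete, the following is a minimal, self-contained sketch in plain Python (illustrative only, not the pomdp_py API) of the interaction loop in Figure 2; the two-state domain, the 0.85 hearing accuracy, and the reward values are hypothetical stand-ins for π, T, O and R.

import random

STATES = ["left", "right"]
ACTIONS = ["listen", "open-left", "open-right"]

def T(s, a):
    # transition model: listening leaves the state unchanged; opening resets it
    return s if a == "listen" else random.choice(STATES)

def O(s_next, a):
    # observation model: a noisy hint about the hidden state when listening
    if a == "listen":
        wrong = "left" if s_next == "right" else "right"
        return s_next if random.random() < 0.85 else wrong
    return "none"

def R(s, a, s_next):
    # reward model: opening the door on the hidden side is penalized, the other rewarded
    if a == "listen":
        return -1
    return -100 if a == "open-" + s else 10

def policy(history):
    # placeholder for pi(h_t, .); a planner or learned policy would go here
    return random.choice(ACTIONS)

history, s = [], random.choice(STATES)     # the true state s is hidden from the agent
for t in range(5):
    a = policy(history)                    # (1) agent takes an action
    s_next = T(s, a)                       # (2) environment state transitions
    o, r = O(s_next, a), R(s, a, s_next)   # (3) agent receives observation and reward
    history.append((a, o))                 # (4) agent updates its history (and belief)
    s = s_next
    print(t, a, o, r)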
Solvers. Most recent POMDP solvers are anytime algorithms (Zilberstein 1996; Ross et al. 2008), due to the intractable computation required to solve POMDPs exactly (Madani, Hanks, and Condon 1999). There are currently two major camps of anytime solvers: point-based methods (Kurniawati, Hsu, and Lee 2008; Shani, Pineau, and Kaplow 2013), which approximate the belief space by a set of reachable α-vectors, and Monte-Carlo tree search-based methods (Silver and Veness 2010; Somani et al. 2013), which explore a subset of future action-observation sequences.

Currently, pomdp_py contains an implementation of POMCP and PO-UCT (Silver and Veness 2010), as well as a naive exact value iteration algorithm without pruning (Kaelbling, Littman, and Cassandra 1998). The interfaces of the library support the implementation of other algorithms; we hope to cultivate a community to implement more solvers or to create bridges between pomdp_py and other libraries.
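As a usage sketch (not prescriptive), the snippet below shows how one of the included planners could drive the control loop, assuming an agent and env have already been constructed from the interfaces described later in this section. POUCT, plan, update, update_history, state_transition and provide_observation follow pomdp_py's documented usage as we understand it, but the exact constructor keywords are assumptions and should be checked against the installed version.

import pomdp_py

def run_pouct(agent, env, num_steps=10):
    # Plan-act-observe loop with the anytime PO-UCT planner (keyword names assumed).
    planner = pomdp_py.POUCT(max_depth=10, discount_factor=0.95,
                             num_sims=1000, rollout_policy=agent.policy_model)
    for _ in range(num_steps):
        action = planner.plan(agent)                # Monte-Carlo tree search from the current belief
        reward = env.state_transition(action, execute=True)   # environment state transitions
        observation = env.provide_observation(agent.observation_model, action)
        agent.update_history(action, observation)   # h_t becomes h_{t+1}
        planner.update(agent, action, observation)  # reuse the relevant part of the search tree
        print(action, observation, reward)
        # a belief update appropriate to the chosen belief representation
        # would follow here (see the discussion of belief representation below)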
Belief representation. The partial observability of the environment state implies that the agent has to maintain a posterior distribution over possible states (Thrun, Burgard, and Fox 2005). The agent should update this belief distribution upon new actions and observations. The exact belief update is given by b_{t+1}(s′) = η Pr(o|s′, a) ∑_s Pr(s′|s, a) b_t(s), where η is the normalizing factor. Hence, a naive tabular belief representation requires nested iterations over the state space to update the belief, which is computationally intractable in large domains. Particle belief representation is a simple and scalable alternative, which is updated by exactly matching simulated and real observations (Silver and Veness 2010). Different schemes of weighted particles have been proposed to handle large or continuous observation spaces where exact matching results in particle depletion (Sunberg and Kochenderfer 2018; Garg, Hsu, and Lee 2019).

pomdp_py does not commit to any specific belief representation. It provides implementations of basic belief representations and update algorithms, including tabular, particles, and multi-variate Gaussians, but, more importantly, allows the user to create their own new or problem-specific representation, according to the interface of a generative probability distribution.
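The two belief updates discussed here can be sketched in a few lines of plain Python (illustrative only, not the pomdp_py API); trans, obs, sample_transition and sample_observation are assumed user-supplied model functions for Pr(s′|s, a), Pr(o|s′, a) and their generative counterparts.

import random

def exact_belief_update(belief, action, observation, states, trans, obs):
    # Tabular update: b_{t+1}(s') = eta * Pr(o|s',a) * sum_s Pr(s'|s,a) * b_t(s).
    # The nested loops over `states` are the nested iterations noted above.
    new_belief = {}
    for sp in states:
        pred = sum(trans(s, action, sp) * belief[s] for s in states)
        new_belief[sp] = obs(sp, action, observation) * pred
    total = sum(new_belief.values())   # normalizing constant (1/eta)
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {sp: p / total for sp, p in new_belief.items()}

def particle_belief_update(particles, action, observation,
                           sample_transition, sample_observation, max_tries=100000):
    # Unweighted particle update: keep simulated next states whose simulated
    # observation matches the real one exactly; may deplete in large observation spaces.
    new_particles = []
    for _ in range(max_tries):
        if len(new_particles) >= len(particles):
            break
        sp = sample_transition(random.choice(particles), action)
        if sample_observation(sp, action) == observation:
            new_particles.append(sp)
    return new_particles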
Design Philosophy

Our goal is to design a framework that allows simple and intuitive ways of defining POMDPs at scale for both discrete and continuous domains, as well as solving them either through planning or through reinforcement learning. In addition, we implement this framework in Python and Cython to improve accessibility and prototyping efficiency without losing orders of magnitude in performance (Behnel et al. 2011; Smith 2015). We summarize the design principles behind pomdp_py below:

• Fundamentally, we view the POMDP scenario as the interaction between an agent and the environment, through a few important generative probability distributions (π, T, O, R, or blackbox model G).

• The agent and the environment may carry different models to support learning, since for real-world problems, especially in robotics, the agent generally does not know the true transition or reward models underlying the environment, and only acts based on a simplified or estimated model (see the sketch after this list).

• The POMDP domain could be very large or continuous, thus explicit enumeration of elements in the spaces should be optional.

• The representation of the belief distribution is decided by the user and can be customized, as long as it follows the interface of a generative distribution.

• Models can be reused across different POMDP problems. Extensions of the POMDP framework to, for example, decentralized POMDPs, should also be possible by building upon existing interfaces.
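The following sketch illustrates the second principle above: the agent plans with its own (possibly simplified or learned) models while the environment evolves according to the true ones. Agent and Environment are pomdp_py interfaces shown in Figure 3; the argument order of their constructors reflects our reading of the documentation and should be treated as an assumption, and the model objects are user-defined subclasses of the corresponding interfaces.

import pomdp_py

def build_problem(init_true_state, init_belief, agent_models, true_models):
    # agent_models: (policy_model, transition_model, observation_model, reward_model)
    # estimated by or supplied to the agent; true_models: (transition_model, reward_model)
    # governing the actual environment dynamics.
    policy_model, agent_T, agent_O, agent_R = agent_models
    true_T, true_R = true_models
    agent = pomdp_py.Agent(init_belief, policy_model, agent_T, agent_O, agent_R)
    env = pomdp_py.Environment(init_true_state, true_T, true_R)
    return agent, env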
Programming Model and Features

The basis of pomdp_py is a set of simple interfaces that collectively form a framework for building and solving POMDPs. Figure 3 illustrates some of the key components of the framework.

[Figure 3: (1) Interfaces to define a POMDP: the Agent carries a GenerativeDistribution (the belief), a PolicyModel, and either a TransitionModel, ObservationModel and RewardModel, or a BlackboxModel; the Environment carries the true State, and either a TransitionModel and RewardModel, or a BlackboxModel. (2) POMDP control flow implemented via the interfaces.]
When defining a POMDP, one first defines the domain by implementing the State, Action, Observation interfaces. The only required functions for each interface are __eq__ and __hash__. For example, the interface for State is simply:

class State:
    def __eq__(self, other):
        raise NotImplementedError
    def __hash__(self):
        raise NotImplementedError
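As a concrete, hypothetical example of implementing this interface, a state for a tiger-like domain could look as follows; states that compare equal must hash identically so they can index tabular beliefs and be matched during tree search.

class TigerState(State):
    def __init__(self, side):
        self.side = side   # e.g. "tiger-left" or "tiger-right"
    def __eq__(self, other):
        return isinstance(other, TigerState) and self.side == other.side
    def __hash__(self):
        return hash(self.side)
    def __repr__(self):
        return "TigerState(%s)" % self.side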
ment (e.g. for learning). One also defines a PolicyModel