

Symbolic Reasoning Among 3-D Models and 2-D Images

Rodney A. Brooks
AI Laboratory, Stanford University, Palo Alto, CA 94305, U.S.A.

Artificial Intelligence 17 (1981) 285-348
0004-3702/81/0000-0000/$02.50 © North-Holland

ABSTRACT
We describe model-based vision systems in terms of four components: models, prediction of image
features, description of image features, and interpretation which relates image features to models. We
describe details of modelling, prediction and interpretation in an implemented model-based vision
system. Both generic object classes and specific objects are represented by volume models which are
independent of viewpoint. We model complex real world object classes. Variations of size, structure
and spatial relations within object classes can be modelled. New spatial reasoning techniques are
described which are useful both for prediction within a vision system, and for planning within a
manipulation system. We introduce new approaches to prediction and interpretation based on the
propagation of symbolic constraints. Predictions are two pronged. First, prediction graphs provide a
coarse filter for hypothesizing matches of objects to image features. Second, they contain instructions on
how to use measurements of image features to deduce three dimensional information about tentative
object interpretations. Interpretation proceeds by merging local hypothesized matches, subject to
consistent derived implications about the size, structure and spatial configuration of the hypothesized
objects. Prediction, description and interpretation proceed concurrently from coarse object subpart and
class interpretations of images, to fine distinctions among object subclasses and more precise three
dimensional quantification of objects. We distinguish our implementations from the fundamental
geometric operations required by our general image understanding scheme. We suggest directions for
future research for improved algorithms and representations.

1. Introduction
We present both a general philosophy of model-based vision and a specific
implementation of many of those ideas in the ACRONYM model-based vision
system. An earlier version of ACRONYM was described in [18]. Here we describe
a new version of ACRONYM which is almost a completely new implementation. It
includes new methods for modelling generic classes of objects, new techniques
for geometric reasoning, and a method for using noisy measurements from
images to gain three dimensional understandings about objects.
ACRONYM is a domain independent model-based vision system. The user
describes to ACRONYM classes of three dimensional objects, and their relation-
ships in the world. The system tries to interpret images by locating instances of
modelled objects. The same models may be used for other purposes, such as
planning manipulator assemblies.

1.1. Model-based vision


Much of the current work in computer vision is based on trying to extract
maximal information from an image using no knowledge about the objects
being viewed. Often the techniques are based on physical considerations
concerning the image producing process (e.g. [7, 26, 51]). Others, principally
David Marr's research group at MIT, have found in physiological evidence
suggestions for algorithms which might be implemented on conventional com-
puters to extract information from images. They call this information two and a
half dimensional information as it includes identification of surfaces and their
local orientation, but not three dimensional location. The idea is that higher
level processes will make use of these rich descriptions to interpret the image
(see [32, 33, 23] for instance).
Once such descriptions of local surface characteristics have been extracted
the problem still remains of meaningfully interpreting those descriptions as
instances of objects in the field of view of the imaging device. This is essentially
the problem we are approaching in this paper.
We must find some mapping between the descriptive elements and objects. If
we are to identify objects, we must have some a priori representation for those
objects. We need to find correspondences between components of the image
description and the object representations. In general the representation we
have for objects might not be in terms of the same primitives as provided by
low level image description processes (see Section 2). Thus there may need to
be more than one mapping carried out.
One approach is to map the image description into a new description sharing
the primitives used in the object models, then to match the descriptions to the
models. This approach may either be impossible, due to unresolvable am-
biguities, or extremely difficult due to lack of sufficient information in the
image descriptions.
The approach we have taken is to map from a priori object class descriptions
to descriptions in terms of the same primitives as produced by the image
description processes. This can be viewed as prediction of image features.
Matching is done between image and object at that level. The matches are not
conservative, and in the interest of not rejecting correct matches some incor-
rect matches may also be accepted. Then a mapping from image description
terms to object model primitive terms is made, making use both of the
information gained at the image description level match and the information
included in the image description prediction. Incorrect matches are found to be
inconsistent with the detailed models at this stage. Eventually a three dimen-
sional interpretation of the image is obtained in terms of the a priori models.


Thus our approach to image understanding relies on four components:
object models, prediction from models, interpretation of image descriptions in
terms of models, and descriptions of images. In this paper we deal with only the
first three.
We have not made use of the rich descriptions available from some of the
recent work mentioned above, as that is still somewhat of a moving target.
Rather we chose a fairly simple and primitive system which we had available,
and have used the inaccurate and crude descriptions which it produces. We feel
that the system can only improve as better low level descriptive systems are
used instead. Our implementation in fact goes further downwards on the
predictive side than described in this paper. This is necessary because of the
more primitive descriptive system used.

1.2. An overview of ACRONYM

The user gives ACRONYM models of objects and their spatial relationships, as
well as classes of models and their subclass relationships. There is a choice of
input techniques.¹ A text-based description language has proved to be more
useful for describing classes of objects and their spatial relations. MODITOR (a
model editor) implemented originally by Harald Westphal and Amy Plikerd,
and revised and significantly expanded by Soroka [44] provides a GEOMED-like
[8] interactive interface, via keyboard and graphics display. This tends to be
more convenient for modelling specific objects. The two input systems produce
the same internal representation. Volumetric models and spatial relations are
represented in the object graph. Volume elements form the nodes while spatial
relations and subpart relations form the arcs. Object class relations are
represented in the restriction graph. Nodes are sets of constraints on volumetric
models. Directed arcs represent subclass inclusion. A graphics module provides
feedback to the user during the modelling process, via a raster display. It
generates images of objects being modelled under the modelled camera con-
ditions. The diagrams in this paper were made by the graphics module.

¹We have not yet tried to incorporate model acquisition from images. Techniques of seg-
mentation and description were developed by Nevatia and Binford [39] to build tree structured
generalized cone models of objects detected using a laser range finder. Winston [50] has shown how
to infer object classes over variations in both size and structure from examples and non-examples
of objects. Together these techniques seem to provide a strong basis for future work on teaching
object class descriptions to ACRONYM, by showing it examples whose component parts it would first
instantiate to specializations of a library of qualitatively different generalized cone models,
including both single cones and joined cones.
Geometric reasoning techniques are used to predict features which will be
invariantly observable. This requires analysis of the ranges of variations in the
size, structure, and spatial relations in the object model classes. Notice that we
are not predicting the complete appearance of objects from all possible
viewpoints, but rather we are predicting features which will enable us to
identify instances of objects, and also determine their orientation and position.
Sometimes case analysis is necessary to subdivide ranges of variations in order
to establish observable features. The result is the prediction graph. The nodes of
the graph are predictions of image features, and the arcs specify relations which
must hold between them in the image. Predictions are two pronged. First, they
provide a coarse filter for hypothesizing object to image feature matches.
Second, they contain instructions on how to use measurements of an image
feature to deduce three dimensional information about the object to which it
has been hypothetically matched. The predictor is implemented as a set of
production rules.
In our current implementation we use the results of images processed with
the line finder of Nevatia and Babu [38]. This provides on the order of 1000
edge elements, segmented as linear pieces, ranging between approximately 3
and 100 pixels in length in a 512 by 512 image. Prediction nodes provide
goal-direction to an edge linking algorithm [17] which produces descriptions of
shape elements found. Typically there are 5 to 50 elements from a search of the
whole image. The descriptive process is reinvoked many times during the
interpretation of an image. At first the multiple invocations search for different
image features to determine a coarse image interpretation. Later invocations
search small areas of the image for particular features, both for detailed object
class identification and to gain detailed three dimensional information about
the objects. We plan to later include other low level descriptive processes, such
as the stereo work underway within the Stanford vision group [5].
Invocations of the descriptive processes provide candidate image features for
matching to predicted features. Matching does not proceed by comparing
image feature measurements with predictions for those measurements. Rather
the measurements are used to put constraints on parameters of the three
dimensional models, of which the objects in the world are hypothesized to be
instances. Only if the constraints are consistent with what is already known of
the model in three dimensions are these local matches retained for later
interpretation. A local match can thus constrain camera parameters, object size
and structure, or perhaps only relations between camera parameters and object
size. This, for instance, automatically handles problems of scaling. Local
matches are combined to form more global interpretations, but all constraints
implied by local matches must be mutually consistent. Combining local matches
may produce additional constraints which also must be consistent. Additional
iterations of prediction, description, and interpretation occur as finer and finer
details of objects are identified. Once a member of an object class has been
identified, it is easy to check whether it is possible that the object is also a
member of a subclass. It is merely a matter of checking whether the constraints
introduced by the interpretation are consistent with constraints describing the
subclass.

The ACRONYM system has been used for a number of tasks other than image
understanding. D. Michael Overmeyer implemented a set of rules in the rule
language used for the predictor and interpreter, useful for planning manipula-
tor tasks. The system, GRASP [10], was given ACRONYM models of simple objects,
from which it automatically deduced positions and orientations which could be
grasped by a manipulator arm, and which would provide a firm stable grip on
the object. Soroka [44] has built SIMULATOR on top of ACRONYM. SIMULATOR is a
system for off-line debugging of manipulator programs. It uses the ACRONYM
modelling system to model manipulator arms and their environment. The
graphics system is used to provide stereo pairs of images of the scenes, so that
the user perceives a three dimensional model. Currently the system can be
driven by the output of AL [22], which is normally used to drive manipulator
arms directly. Instead SIMULATOR drives models of manipulators in real time, by
specializing the spatial relations between manipulator links. Work is underway
to extend SIMULATOR to interface to other manipulator languages.

1.3. Outline of the paper


The bulk of this paper is divided into four major sections. Sections 2-4 describe
major subsystems necessary for a general purpose model-based vision system.
Section 5 shows how these modules can work together to carry out image
interpretation. Each section surveys related work in the field, describes the
computational problems involved, and explains the particular approach taken
to solve these problems in the implementation of ACRONYM.
Section 2 deals with geometric modelling. Geometric modelling is often
associated with modelling specific objects. We extend the demands on
geometric modelling to include generic object classes and partially specified
spatial relationships between instances of object classes. We further describe
how to maintain complex internal relationships between parameters of object
class instances.
The deductive power for implementations of our paradigm for model-based
vision is provided by a constraint manipulation system. Section 3 describes the
formal requirements for the constraint manipulation system and our constraint
manipulation system implemented for ACRONYM. Its use is demonstrated in
bounding nonlinear functions over nonlinearly defined subsets of Euclidean space.
An understanding of geometry is required to make full use of volumetric
models, and make inferences from their modelled spatial relations, no matter
how incompletely specified. Section 4 describes methods for handling complex
geometric relationships. It also provides methods for making deductions from
the relationships between objects and the camera. Explicit rules are given
which implement these methods.
Section 5 shows how generic models, constraint manipulation and geometric
reasoning can be used to make predictions from models (as distinct from
making predictions of the appearance of instances of the models) and using
those predictions to interpret images. Further it is shown how to make use of
noisy image measurements to gain a three dimensional understanding of the
objects which generated the image. Prediction and interpretation are im-
plemented as a set of production rules in ACRONYM.

2. Model Representation
The world is described to ACRONYM as volume elements and their spatial
relationships, and as classes of objects and their subclass relationships.
A single simple mechanism is used within the geometric models to represent
variations in size, structure, and spatial relationships. Sets of constraints on
such variations specify classes of three dimensional objects. Adding constraints
specializes classes to subclasses and eventually to specific instances.
The model representation scheme used in a vision system must be able to
represent the classes of objects which the system is required to recognize.
When the representation is in world terms rather than image terms, it is
necessary that observables be computable from the representation.
Previous model-based vision systems have not made a distinction between
models of objects in world terms and models of objects in terms of directly
observable image features. The models themselves have been descriptions of
observable two dimensional image features and relations among them. MSYS [6]
models objects as usually homogeneous image regions. ISIS [21] includes
brightness, hue and saturation of image regions in its object models, which are
constrained to meet viewpoint-dependent spatial relations. Ohta et al. [40] also
model objects as image regions, but they include shape descriptions in two
dimensions. Again viewpoint-dependent spatial relations are used.
For the general vision problem, where exact contexts are unknown and often
even approximate orientations are unknown, viewpoint-dependent image
models require multiple models or descriptions of a given object or
object class. Instead, viewpoint-independent models should be given to the
system. The resolution of the problem of multiple appearances from multiple
viewpoints then becomes the responsibility of the vision system itself. For a
model to be completely viewpoint-independent yet still provide shape in-
formation, it must embody the three dimensional structure of the object being
modelled. Volume descriptions are useful for other applications too. Planning
how to manipulate objects while avoiding collisions requires volume descrip-
tions (e.g. [31, 44]). Objects can be recognized from range data, given volume
descriptions (e.g. [39, 47]). For individual applications additional information
might be included; e.g. surface properties for image understanding and density
of subparts for manipulation planning. Volume descriptions provide a common
representational basis for various distinct but possibly interacting processes,
each of which needs models of the world.

Consider the situation where the vision system is one component of a much
larger system which deals with models or representations of objects which will
appear in the images to be examined. For example in a highly automated
production system we might wish to use the CAD (computer aided design)
model of some industrial part as the only description necessary for a vision
system. It would be able to recognize, locate and orient instances of the part
when they later appear on a conveyor belt leading to a coordinated vision and
manipulation assembly station, with no description further than the CAD
model. It should not be necessary to have a human in the control path, whose
task is to understand the CAD model and then to translate it into a description
of observable features for the vision system. CAD systems for industrial parts
deal in models which are viewpoint independent and which embody a three
dimensional description of the volume occupied by the part (e.g. both the PADL
system [45] and that of Braid [16] meet these requirements; see also the survey
[4]). The representation scheme should also facilitate automatic computation of
observable features from models. Lieberman's system [28] provides for
automatic computation of silhouettes of objects as they will appear in
binary images. In general, more comprehensive descriptions of observable
features provide for robust vision in situations which are not completely con-
trolled.
ACRONYM is by no means the first model-based vision system to use volu-
metric models. Baumgart [8] and Lieberman [28] both used polyhedral
representations of objects. Nevatia and Binford [39] used generalized cones.
However ACRONYM goes beyond these systems. It has the capability to
represent generic classes of objects as well as individual specific objects, and
situations which are only partially specified and constrained, as well as specific
situations.
We do not claim that ACRONYM's class mechanism is adequate for all image
interpretation tasks. In fact some of the examples below may seem to have
been carried out successfully in spite of the representation mechanism. Other
vision and modelling systems, however, do not have even that capability.
The following description of our model representation centers around the
types of things which must be represented about objects, for a variety of image
interpretation tasks. We first describe a volumetric representation for objects.
A method for describing variations in such models by describing allowed
variations in place holders for object parameters is given. This method allows
representation of variations in size, structure and position, and orientation of
objects. A class mechanism, based on specialization of variations, is built
orthogonally to the volumetric representations.

2.1. Volumetric representation


Generalized cones have been used by many people both as the output
language for descriptive processes working from range data [2, 39, 43] and for
modelling systems for vision [25, 32, 34, 37].

FIG. 2.1. A selection of generalized cones used by ACRONYM as primitive volume elements.
Generalized cones [9] provide a compact, viewpoint-independent represen-
tation of volume elements. A generalized cone is defined by a planar cross
section, a space curve spine, and a sweeping rule. It represents the volume
swept out by the cross section as it is translated along the spine, held at some
constant angle to the spine, and transformed according to the sweeping rule.
Each generalized cone has its own local coordinate system. We use a right
handed system such that the initial end of the spine is at the origin, the initial
cross section lies in the y-z plane, and the x component of the directional
tangent to the spine at the origin is positive. Thus for cones where the cross
section is normal to a straight spine, the spine lies along the positive x-axis.
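As a concrete illustration (ours, not the paper's), the three defining components of a generalized cone can be transcribed into Python records; every class and field name below is hypothetical:

from dataclasses import dataclass

# Hypothetical record types (ours) for the three components that define a
# generalized cone: a planar cross section, a space curve spine, and a
# sweeping rule.

@dataclass
class CrossSection:
    type: str            # e.g. "CIRCLE", or a simple polygon
    radius: float = 0.0  # meaningful for circular cross sections

@dataclass
class Spine:
    type: str            # "STRAIGHT" or "CIRCULAR"
    length: float = 0.0

@dataclass
class SweepingRule:
    type: str            # "CONSTANT", or contractions linear in one or two directions

@dataclass
class GeneralizedCone:
    spine: Spine
    cross_section: CrossSection
    sweeping_rule: SweepingRule

# A simple right circular cylinder: a circular cross section swept unchanged
# along a straight spine.
cylinder = GeneralizedCone(Spine("STRAIGHT", 8.0),
                           CrossSection("CIRCLE", 2.5),
                           SweepingRule("CONSTANT"))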
Fig. 2.1 gives examples of generalized cones used as the primitive volume
elements in ACRONYM's representation. They include straight and circular
spines, circles and simple polygons for cross sections and sweeping rules which
can be constant, linear contractions or more generally, contractions linear in
two orthogonal directions. Cross sections may be held at any constant angle to
noncircular spines.
The internal representation of all ACRONYM data structures is frame-like in
that each data object is an instance of a unit. Units have a set of associated slots
whose fillers define their values [14]. Fig. 2.2 shows the unit representation of a
generalized cone representing the body of a particular electric motor. Its cross
section, spine and sweeping rule units are also shown. It is a simple right
circular cylinder of length 8.0 and radius 2.5 (our system currently does not
enforce any particular units of measurement).

Node: ELECTRIC_MOTOR_CONE
CLASS: SIMPLE_CONE
SPINE: Z0014
SWEEPING_RULE: CONSTANT_SWEEPING_RULE
CROSS_SECTION: Z0013
Node: Z0014
CLASS: SPINE
TYPE: STRAIGHT
LENGTH: 8.0
Node: CONSTANT_SWEEPING_RULE
CLASS: SWEEPING_RULE
TYPE: CONSTANT
Node: Z0013
CLASS: CROSS_SECTION
TYPE: CIRCLE
RADIUS: 2.5

FIG. 2.2. A generalized cone model of a specific electric motor body.
ACRONYM's volumetric representation is built around units of class object (a
unit's class is given by its class slot; this corresponds roughly to the self slot of
KRL units [14]). Objects are the nodes of the object graph. The arcs are units of
class subpart and class affixment. Objects have slots for an optional cone-descriptor
(which is filled with a pointer to a unit representing a generalized cone),
subparts and affixments which are filled with lists of pointers to instances of
the appropriate classes of units, and a few more which we will not discuss here.
Subpart and affixment arcs are directional, pointing from the object whose unit
references them to the object referenced in their object slot.
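A skeletal rendering (ours) of this structure follows; ACRONYM stores these as units with slots, so the Python classes and field names are only a hypothetical transcription:

from dataclasses import dataclass, field

# Hypothetical sketch of the object graph: objects are nodes, and subpart
# and affixment units are the directed arcs.

@dataclass
class Subpart:
    obj: "Object"             # the object pointed at (destination of the arc)
    quantity: object = 1      # a number, or a quantifier in the generic case

@dataclass
class Affixment:
    obj: "Object"
    transform: object = None  # product of symbolic coordinate transforms
    quantity: object = 1

@dataclass
class Object:
    name: str
    cone_descriptor: object = None             # optional generalized cone
    subparts: list = field(default_factory=list)
    affixments: list = field(default_factory=list)

motor = Object("ELECTRIC_MOTOR")
flange = Object("FLANGE")
# A single subpart arc with quantity 4 stands for four identical flanges.
motor.subparts.append(Subpart(flange, quantity=4))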
The object graph has two natural subgraphs defined by the two classes of
directional arcs. Connected components of the subpart subgraph are required
to be trees. It is intended that each such tree be arranged in a coarse to fine
hierarchy. Cutting the tree off at different depths gives models with different
levels of detail. For example the subpart tree for the electric motors illustrated
in Fig. 2.4 has a root-node whose cone descriptor is the large cylindrical body
of the motor. At the next lower level of the tree are the smaller flanges and
spindle. The coarse to fine representation has obvious utility in image under-
standing tasks. Unless ACRONYM has already hypothesized an interpretation of
some images features as an instance of an object with its own generalized cone
descriptor, it does not search for subparts of the object in the image.
Currently the user inputs the subpart trees directly; there is no enforcement
of coarse to fine levels of representation. It is certainly within the capabilities of
ACRONYM'S geometric reasoning system (see Section 4) to detect when the
condition is violated. It is eminently reasonable that in such cases the system
should build its own internal coarse to fine structure, while maintaining the
user's hierarchical decomposition for future interaction. We have not diverted
resources to implement such a capability. There may be minor problems with
such a scheme in light of the discussion of modality in Section 2.2.3 below.
Every object has its own local coordinate system. If an object has a cone
descriptor, then the generalized cone shares the same coordinate system as the
object. Affixment arcs relate coordinate systems of objects. An affixment
includes a product of symbolic coordinate transforms, which transform the
coordinate system of the object pointed at by the affixment to the coordinate
system of the original object.
We represent coordinate transforms as a pair (internally a unit with two
slots) written (r, v) where r is a rotation and v is a translation vector. A
rotation is a pair (again a unit with two slots) written (a, m) representing a
rotation of scalar magnitude m about unit axis vector a. A vector is a triple (x,
y, z). In this paper we will use infix * for composition both of rotations and of
coordinate transforms, meaning that the left argument is applied following the
right. Similarly we will use infix @ for application of a left argument which is
either a rotation or a coordinate transform to a vector as the right argument.
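An illustrative numeric rendering (ours) of these operations follows. A transform is a pair (r, v) where r = (a, m) is a rotation of magnitude m about unit axis a; the quaternion-based composition is our own implementation choice, whereas ACRONYM manipulates such transforms symbolically:

import math

def rot_apply(axis, angle, p):
    # Rotate point p about the unit axis by angle (Rodrigues' formula).
    ax, ay, az = axis
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * p[0] + ay * p[1] + az * p[2]
    cross = (ay * p[2] - az * p[1],
             az * p[0] - ax * p[2],
             ax * p[1] - ay * p[0])
    return tuple(c * p[i] + s * cross[i] + (1 - c) * dot * axis[i]
                 for i in range(3))

def rot_compose(r1, r2):
    # The paper's infix * on rotations: apply r2 first, then r1.
    # Composition is done through unit quaternions.
    def to_q(r):
        (ax, ay, az), m = r
        s = math.sin(m / 2)
        return (math.cos(m / 2), ax * s, ay * s, az * s)
    w1, x1, y1, z1 = to_q(r1)
    w2, x2, y2, z2 = to_q(r2)
    w = w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2
    x = w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2
    y = w1 * y2 + y1 * w2 + z1 * x2 - x1 * z2
    z = w1 * z2 + z1 * w2 + x1 * y2 - y1 * x2
    m = 2 * math.acos(max(-1.0, min(1.0, w)))
    n = math.sqrt(x * x + y * y + z * z) or 1.0
    return ((x / n, y / n, z / n), m)

def transform_apply(t, p):
    # The paper's infix @: apply transform t = (r, v) to vector p.
    (axis, angle), v = t
    q = rot_apply(axis, angle, p)
    return tuple(q[i] + v[i] for i in range(3))

def transform_compose(t1, t2):
    # The paper's infix *: (r1, v1) * (r2, v2) = (r1 * r2, r1 @ v2 + v1).
    (r1, v1), (r2, v2) = t1, t2
    rv = rot_apply(r1[0], r1[1], v2)
    return (rot_compose(r1, r2), tuple(rv[i] + v1[i] for i in range(3)))

# Example: rotate 90 degrees about z, then translate by (1, 0, 0):
t = (((0.0, 0.0, 1.0), math.pi / 2), (1.0, 0.0, 0.0))
print(transform_apply(t, (1.0, 0.0, 0.0)))   # approximately (1.0, 1.0, 0.0)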
Our affixments do not carry any connotation of attachment. For instance
affixments do not distinguish between the case of the coordinate transform
relating an electric motor sitting on a table, and the coordinate transform
relating a permanently attached flange to the motor body. The attachment
notion (whether rigid or articulated) is implied by the subpart relation. There
are valid objections to such an assumption. A model of an operational airfield
should include the fact that aircraft must be present. The only way to represent
such a fact in ACRONYM (see again the discussion of modality in Section 2.2.3) is
to make aircraft a subpart of airfield, and clearly in that case any assertion of
permanent attachment is false. We will probably encounter problems, especi-
ally in planning manipulator tasks, from this aspect of the representation.
Both subpart and affixment arcs are represented by units. Subpart units have
a quantity slot which specifies how many instances of a subpart an object has.
For example the left-most electric motor in Fig. 2.4 has four identical flanges.
The subpart relation for all four was represented as a single subpart arc
between an electric motor and a flange node in the object graph. Affixment
arcs similarly have a quantity slot. In the case of a quantity greater than one,
the expression for the coordinate transform includes a free variable which is
iterated over the specified range to produce the distinct coordinate transforms.
That process produced the spatial relations of the numerous flanges to the
electric motor bodies in Fig. 2.4.
Objects are placed in a world by affixing them to a world coordinate system.
A camera position and orientation is described by affixing a camera unit to a
world coordinate system. A camera views the world along the negative z-axis
of its coordinate system, with the y-axis pointing in the direction of the top of
the image plane, and the x-axis to the right. A camera unit also has a
focal-ratio slot, which is filled with a number. If r is the focal ratio of a camera, and
an object of length l is parallel to the image plane of the camera at distance d
from the center of the camera, then the image of the object will measure rl/d in
image plane coordinates.
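As a check on this focal-ratio convention, here is a one-line Python rendering; the numbers in the example are ours, purely illustrative:

def image_measure(focal_ratio, length, distance):
    # Image-plane size r*l/d for an object of length l parallel to the
    # image plane at distance d from the camera center.
    return focal_ratio * length / distance

# A motor body of length 8.0 viewed side-on from distance 40.0 with
# focal ratio 2.0:
print(image_measure(2.0, 8.0, 40.0))   # 0.4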

2.2. Quantification and constraints


The previous section described how we represent specific objects. On top of
that representation scheme we have built a mechanism for representing classes
of objects. We use the term class rather than set because we use a criterial or
intensional augmentation of the volumetric representation. A class is the
extension of a description of allowed variations in values of numeric slots of a
volumetric model.
In KRL-type systems [14], it is usual to describe allowed variations of a slot
filler by attaching a description directly to the slot. We have chosen a different
approach. Any numeric slot can be filled with an algebraic expression ranging
over numeric constants, declared constant symbols, and free variables. We
refer to the latter as quantifiers. The simplest case is when the expression is a
simple numeric constant, which is exactly that described in the previous
section. Declared constant symbols are purely for user convenience. Classes of
objects are specified by supplying a set of constraints (inequalities on algebraic
expressions) which define the set of values which can be taken by quantifiers.
We describe the benefits of such an approach below. Clearly this approach
could be extended to allow variations in non-numeric slots. The language of
expressions for slot fillers and constraints would need to be extended to include
nonnumeric operators and comparators. Any such extensions would require a
more comprehensive constraint manipulation system than the one we describe
in Section 3.
The PADL system [45] seems to be the only other geometric modelling system
which allows detailed geometric models with quantifiable variations. Variations
are limited to numeric tolerances on nominal values. The system uses a mixture
of attaching descriptions of variations to slots, and attaching them to named
variables. Slots can be filled by expressions, but each term has a tolerance
associated with it, which propagates from the expression to the slot. A default
tolerance is given to numbers and variables for which no explicit tolerance is
given.
The current restriction of allowing variations in numeric valued slots of
ACRONYM's representations still allows large generic classes of objects to be
easily and naturally defined. In our work on aerial images we have made
extensive use of models of the generic classes of airports and wide-bodied
passenger jet aircraft (see [19] for details). Variations of numeric valued slots
allow three distinct types of variations within a class of models: variations in
size, limited variations in structure, and variations in spatial relationships. We
examine each of these in more detail.

2.2.1. Variations in size


Fig. 2.3 shows the unit representation of a generalized cone which is the body
of a generic electric motor. Compare it to the cone for the specific electric
motor of Fig. 2.2. The only difference is that the spine length and cross section
radius slots are now filled with the quantifiers MOTOR-LENGTH and
MOTOR-RADIUS respectively, rather than 8.0 and 2.5.

Node: GENERIC_ELECTRIC_MOTOR_CONE
CLASS: SIMPLE_CONE
SPINE: Z0014
SWEEPING_RULE: CONSTANT_SWEEPING_RULE
CROSS_SECTION: Z0013
Node: Z0014
CLASS: SPINE
TYPE: STRAIGHT
LENGTH: MOTOR-LENGTH
Node: CONSTANT_SWEEPING_RULE
CLASS: SWEEPING_RULE
TYPE: CONSTANT
Node: Z0013
CLASS: CROSS_SECTION
TYPE: CIRCLE
RADIUS: MOTOR-RADIUS

FIG. 2.3. A generalized cone model of a generic electric motor body.
Suppose we want to represent a class of small electric motors that might be
built on a particular assembly line. (Abraham et al. [1] describe a manufactur-
ing situation where approximately 450 different styles of motors are manufac-
tured with an average batch size of 600 and a number of style changes each
day. The example models in this paper are loosely based on examples in that
report. All dimensions are in inches.) Then we could restrict the length and
radius of the motor independently, using the constraints

6.0 ≤ MOTOR-LENGTH ≤ 9.0,
2.0 ≤ MOTOR-RADIUS ≤ 3.0.

Suppose, also, that the length of a motor is roughly inversely proportional to its
radius; i.e. over the class of motors which are to be modelled it is true that the
longer motors are of narrower diameter. Then this fact could be expressed by a
constraint of the form:

17.0 ≤ MOTOR-LENGTH × MOTOR-RADIUS ≤ 21.0.
If there was an exact relationship between motor length and radius (unlikely in
this case), then an equality could be employed in the constraint.
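A minimal sketch (ours, not ACRONYM's) of this machinery, testing trial numeric values for the quantifiers; ACRONYM of course reasons about the constraints symbolically rather than by testing point values:

# Slot fillers are expressions over an environment of quantifier values,
# and a class is the set of environments satisfying every constraint.

spine_length = lambda env: env["MOTOR-LENGTH"]       # slot filler
cross_radius = lambda env: env["MOTOR-RADIUS"]       # slot filler

constraints = [
    lambda env: 6.0 <= env["MOTOR-LENGTH"] <= 9.0,
    lambda env: 2.0 <= env["MOTOR-RADIUS"] <= 3.0,
    # length roughly inversely proportional to radius:
    lambda env: 17.0 <= env["MOTOR-LENGTH"] * env["MOTOR-RADIUS"] <= 21.0,
]

def satisfies(env):
    return all(c(env) for c in constraints)

print(satisfies({"MOTOR-LENGTH": 8.0, "MOTOR-RADIUS": 2.5}))   # True
print(satisfies({"MOTOR-LENGTH": 9.0, "MOTOR-RADIUS": 3.0}))   # False (27 > 21)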
Notice that the last constraint relates the fillers of two distinct slots from two
distinct units. Such a relation would be harder (or at least, more clumsy) to
specify if descriptions of allowed variations were attached directly to slots of
units. In this case the description attached to at least one slot would have to
explicitly refer to the other slot. If the system is to make use of new tighter
constraints on either the length or radius to further constrain the other (we will
see this happen during image interpretation in Section 5), then the description
attached to the two slots would have to refer to each other. If a relation exists
between more than two slots, the situation becomes worse (such relations
commonly arise during the image interpretation process). By placing the
restrictions directly on quantifiers no such duplication of information is neces-
sary.
Another benefit of attaching descriptions of allowed variations to quantifiers
rather than to slots is that it becomes very easy to express many symmetries
and other exact geometric relationships. For instance to specify that the wings
of an aircraft are the same length it suffices to fill the length slots of the spines
of the two wings with the same expression; e.g. just a single quantifier
WING-LENGTH. Similarly to express the fact that a chair has four legs of
the same length their spine length slots could all be filled with the quantifier
LEG-LENGTH. Compare this to the representation of this fact used by
Shapiro et al. [41].
The PADL system [45] allows the user to supply tolerances for object models,
as described above. Grossman [24] has approached tolerancing by generating a
large number of instances of models, using a random number generator to
produce varying dimensions of objects within prescribed bounds and dis-
tributions. The ACRONYM system of constraining quantifiers allows tolerancing
of objects in a simple manner. For instance suppose we wanted to represent a
particular type of electric motor in Fig. 2.4 with length 8.0 ± 0.01 inches. Then
we could simply use the constraint

8.0 - 0.01 ≤ MOTOR-LENGTH ≤ 8.0 + 0.01.

Alternatively we might fill the spine length slot with the expression

8.0 + LENGTH-ERROR

and use the constraint

-0.01 ≤ LENGTH-ERROR ≤ 0.01.

FIG. 2.4. Three specializations of the generic class of small electric motors.

Notice however that ACRONYM models need not be restricted to use only such
simple plus-minus tolerances as are models in the PADL system. Tolerances can be
specified using arbitrary algebraic expressions.
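In the style of the constraint sketch of Section 2.2.1 (ours, illustrative), the two tolerancing forms look like this:

# Style 1: constrain the quantifier directly.
constraints_direct = [
    lambda env: 8.0 - 0.01 <= env["MOTOR-LENGTH"] <= 8.0 + 0.01,
]

# Style 2: fill the slot with an expression over an error quantifier.
spine_length = lambda env: 8.0 + env["LENGTH-ERROR"]
constraints_error = [
    lambda env: -0.01 <= env["LENGTH-ERROR"] <= 0.01,
]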
2.2.2. Variations in structure
The fact that ACRONYM's subpart and affixment arcs are units with quantity slots
allows a limited form of structural variation to be included in model classes.
Filling the quantity slot of a subpart arc with 1 or 0 can be used to indicate the
presence or absence of a subpart. The slot can alternately be filled with a
quantifier, constrained to be 1 or 0, to model the possibility that the subpart
may or may not be present. Similarly a variable number of identical subparts of
an object can be indicated; e.g. the number of flanges on the electric motors in
Fig. 2.4 or the number of engines on an aircraft wing.
Fig. 2.4 shows the generic model of an electric motor under three different
sets of constraints which each fully determine values for the quantifiers
BASE-QUANTITY and FLANGE-QUANTITY which fill the obvious quantity
slots.
Given such a mechanism for representing structural variations, we must
consider what class of structure varying models we can describe. Suppose we
wish to specify that an electric motor has either a base or flanges but not both.
Furthermore if there are flanges, then there are between 3 and 6 of them. This
could be expressed with the following constraint.
((3 ≤ FLANGE-QUANTITY ≤ 6) ∧ (0 = BASE-QUANTITY))
∨ ((0 = FLANGE-QUANTITY) ∧ (1 = BASE-QUANTITY)).
Such a constraint is beyond the currently implemented capabilities of
ACRONYM. Constraints must be algebraic inequalities, with an implicit con-
junction over sets of such constraints. The explicit inclusion of logical dis-
junction requires a more comprehensive reasoning system for prediction and
interpretation than our current system (see Section 5).
Since our algebraic constraints can be nonlinear it is possible to represent
many disjunctions without overtaxing our theorem prover. In fact the above
constraint is equivalent to the following set of linear constraints:
0 ≤ BASE-QUANTITY ≤ 1,
0 ≤ FLANGE-QUANTITY ≤ 6,
FLANGE-QUANTITY + 6 × BASE-QUANTITY ≤ 6,
3 ≤ FLANGE-QUANTITY + 3 × BASE-QUANTITY.

Such a set of constraints is clearly not intuitive, unlike the previous
constraint. For vision tasks it is probably not necessary. In our work with ACRONYM
we have been content to underconstrain classes of objects. For example our
generic model of electric motors uses only the first two constraints.
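The claimed equivalence can be checked mechanically; the brute-force test below (ours, not from the paper) enumerates integer values and confirms that the four linear constraints admit exactly the combinations allowed by the disjunction:

def linear_ok(f, b):
    return (0 <= b <= 1 and 0 <= f <= 6
            and f + 6 * b <= 6
            and 3 <= f + 3 * b)

def disjunction_ok(f, b):
    # Either 3 to 6 flanges and no base, or no flanges and one base.
    return (3 <= f <= 6 and b == 0) or (f == 0 and b == 1)

assert all(linear_ok(f, b) == disjunction_ok(f, b)
           for f in range(0, 10) for b in range(0, 3))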

In an ideal situation the modelling language should provide easy and natural
means for the user to specify objects and classes in as much detail as is wished.
The system should then sift out just enough detail of constraint for its own
purposes. We have not tackled these problems.
2.2.3. Variations in spatial relationships
An affixment specifies the spatial relationship between two objects by provi-
ding a product of coordinate transforms which relate the local coordinate
systems of the objects. Each coordinate transform consists of a rotation and a
translation vector. The slots in the units representing these can naturally be
filled with quantifiers or even expressions on quantifiers. Thus variable spatial
relationships can be represented.
Suppose that members of the class of electric motors with bases are going to
be placed at a work station, upright but with arbitrary orientation about the
vertical, and at a constrained but inexact position. The coordinate system of the
motor has its x-axis running along the center of the spindle, and its z-axis
vertical. The work station coordinates have a vertical Z-axis also. The position
and orientation of the motor relative to the work station could then be
represented by the transform
((ẑ, ORI), (X-POS, Y-POS, BASE-THICKNESS + MOTOR-RADIUS))

where ẑ as usual denotes a unit vector in the z direction. Typically X-POS and
Y-POS might be constrained by

0 ≤ X-POS ≤ 24,
18 ≤ Y-POS ≤ 42,
and ORI would be left free. The geometric reasoning system described in
Section 3 manipulates such underconstrained transforms.
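Using the transform helpers sketched in Section 2.1, the work-station affixment above could be instantiated once values are chosen; here ORI, X-POS and Y-POS are bound to sample numbers of our own choosing, whereas ACRONYM keeps them as constrained quantifiers:

import math

def motor_at_station(ori, x_pos, y_pos, base_thickness, motor_radius):
    # ((z-hat, ORI), (X-POS, Y-POS, BASE-THICKNESS + MOTOR-RADIUS))
    return (((0.0, 0.0, 1.0), ori),
            (x_pos, y_pos, base_thickness + motor_radius))

t = motor_at_station(ori=math.pi / 3, x_pos=12.0, y_pos=30.0,
                     base_thickness=0.5, motor_radius=2.5)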
There is an inadequacy in such specifications of spatial relations. It is
possible to represent that aircraft can be found on runways or on taxiways, for
instance, by affixing the generic model of an aircraft to both, using similar
coordinate transforms to those described above. It is not possible however to
specify that the one and only motor cover which will appear in an image will be
located on either the left or right parts feeder. The only way within ACRONYM's
representational mechanism to allow for such a possibility is to express some
connected area between both parts feeders where it might be found. Such
inexactness might lead to much greater searches in locating the motor cover in
an image, given that the feeders have already been located. Again, the
reasoning systems described in the rest of this paper need major additions to
handle a more concise specification language. Furthermore the current inter-
pretation algorithm treats affixments as defining a necessary condition on
where objects are located. A more flexible scheme would allow the user to give
300 R.A. BROOKS

FIG. 2.5. Variable affixments are used to model articulated objects such as this piston assembly.

advice to look first in one location for a particular object and then in another
higher cost location if that fails.
Variable affixments can also be used to model articulated objects. Fig. 2.5
shows two views of a piston model with different values assigned to a quantifier
filling the rotation magnitude slot of the coordinate transform between the
piston and the con-rod. Constraints on the quantifier express the range of
travel of the con-rod.
The representation of articulated objects may be important if manipulator
arms are present in images, and it is desired to visually calibrate or servo them.
Soroka's [44] simulator is based on these representations.
Variable camera geometry can also be represented by filling the slots of the
transforms affixing the camera to world coordinates with quantifiers. Section 4
gives two examples of variable camera geometries. If the characteristics of the
imaging camera are not known exactly, the focal-ratio slot can be filled with a
quantifier rather than a number. Any image interpretation will provide in-
formation which can be used to constrain this quantifier (see Section 5 for an
example of how this comes about).

2.3. Restriction nodes and specialization


From the foregoing discussion it should be clear that given a volumetric model
which includes quantifiers in various of its slots, different sets of constraints on
those quantifiers define different classes of models. We organize sets of
constraints using units of class restriction as nodes in a directed graph called the
restriction graph.
A restriction unit has a constraint slot filled by a set of algebraic constraints
on quantifiers. The constraints used earlier in this section are typical examples,
although they are reduced to a normal form described in Section 3.3.1. A set of
constraints on n quantifiers defines a subset of n-dimensional Euclidean space.
It is the set of substitutions for the quantifiers which satisfy the given set of
constraints. That set may be empty. We call this set the satisfying set of the
restriction node. Set inclusion on the satisfying sets provides a natural partial
order on restriction nodes, defining a distributive lattice on them. The lattice
meet operation ( A ) is used during image interpretation (see Section 5). Arcs of
the restriction graph must be directed from a less restrictive node (a larger
satisfying set) to a more restrictive node (a smaller satisfying set). Restriction-
nodes keep track of the arc relations in which they participate via suprema and
infima slots which are filled with lists of sources and destinations of incoming
and outgoing arcs respectively. It is permissible that comparable restriction
nodes do not have an explicit arc indicating that fact. In fact the restriction
graph is just that part of the restriction lattice which has been computed.
A restriction graph always includes a base-restriction node, which has an
empty set of constraints, and is thus the least restrictive node in the graph.
Every other node in the graph must be an explicitly indicated infimum of
another restriction node.
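A minimal sketch (ours) of restriction nodes and the lattice meet follows. Taking the union of two constraint sets yields a node whose satisfying set is the intersection of the two satisfying sets (as Section 3 notes for requirement (I3)); deciding whether that set is empty is the job of the constraint manipulation system, not shown here:

from dataclasses import dataclass, field

@dataclass
class Restriction:
    constraints: frozenset
    suprema: list = field(default_factory=list)   # less restrictive nodes
    infima: list = field(default_factory=list)    # more restrictive nodes

def meet(r1, r2):
    # Lattice meet: union the constraint sets and record the arc relations.
    node = Restriction(r1.constraints | r2.constraints)
    node.suprema = [r1, r2]
    r1.infima.append(node)
    r2.infima.append(node)
    return node

# The least restrictive node has an empty constraint set:
BASE_RESTRICTION = Restriction(frozenset())

# During interpretation (Section 5), testing whether a perceived object is
# also an instance of a subclass reduces to asking the CMS whether the meet
# of the interpretation's node with the subclass node is satisfiable.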
The user specifies part of the restriction graph to the system. Other parts are
added by ACRONYM while carrying out image understanding tasks. By contrast
the object graph is completely specified by the user, perhaps from a C A D
data-base, and remains static during image interpretation. Eventually we plan
to build from examples, using techniques of Nevatia and Binford [39].
Restriction nodes have type and specialization-of slots. In nodes specified by
the user the type slot is filled with the atom model-specialization and the
specialization-of slot with an object node from the object graph.
A restriction node specified by the user represents an object class; those
objects which have the volumetric structure modelled by the object in the
specialization-of slot subject to the constraints associated with the restriction
node.
Thus the arcs of the subgraph defined by the user specify object class
specialization. The arcs added later by ACRONYM also indicate specialization,
but of a slightly different nature. They can specialize a model for case analysis
during prediction (see Section 5.1), or they can indicate specialization implied
for a particular instance of the model by a hypothesized match with an image
feature or features (see Section 5.2).
Fig. 2.6 is a typical example of the portion of the restriction graph which the
user might specify. The constraints associated with the node generic-electric-
motor would be those described in the previous sections. The motor-with-base
node includes the additional constraints
BASE-QUANTITY = 1,
FLANGE-QUANTITY = 0,

while the motor-with-flanges node has

BASE-QUANTITY = 0,
3 ≤ FLANGE-QUANTITY ≤ 6.

[Diagram: BASE-RESTRICTION at the top; GENERIC-ELECTRIC-MOTOR below it; specialized to MOTOR-WITH-BASE and MOTOR-WITH-FLANGES; further specialized to INDUSTRIAL-MOTOR, CARBONATOR-MOTOR and GAS-PUMP.]

FIG. 2.6. Part of the restriction graph: a model class hierarchy defined by the user.

Of course additional constraints on quantifiers determining size, and perhaps
their relationships to, say, the structure-determining quantifier FLANGE-
QUANTITY, might be included at these restriction nodes.
Additional constraints specialize the subclasses of electric motors further to
particular functional classes (these classes are taken from [1]), namely in-
dustrial-motor, carbonator-motor and gas-pump. Further constraints on these
three classes were added to restrict each quantifier to specific values in order to
produce Fig. 2.4, which shows instances of the three classes of motor in left to
right order.
The specialization mechanism we have described here relies on complete
sharing of the volumetric description amongst all its specializations. There are
never multiple copies of fragments of the volume model. The specialization
information is in a domain orthogonal to the underlying representation. It is
therefore compact. More importantly, during image interpretation we will see
that when an instance of a superclass has been identified it is rather easy to
check whether it happens to also be an instance of a more specialized class.
Instead of it being necessary to recompute image to model correspondences for
the specialized model, we simply take the meet of the specialization restriction
node with a restriction node produced in the original interpretation. If the
resultant restriction node has a non-empty satisfying set, then the perceived
object is also an instance of the subclass. Section 5 describes this in more detail.

3. Constraint Manipulation
In this paper we propose a number of refined or new techniques to be used in
understanding what in a three dimensional world produces a given image.
These include volumetric representation of generic classes of three dimensional
objects, concise representation of generic spatial relationships, geometric
reasoning about uncertain situations, generic prediction of appearance of
objects, and use of information from matches of predicted image features with
those discovered by goal-directed search, to gain three dimensional knowledge
of what is in the world. In ACRONYM we tie all these pieces together by using
systems of symbolic constraints. We do not solve such systems, but rather
propagate their implications both downward during prediction and upward
during interpretation. In our current implementation those constraints are
algebraic inequalities over a set of variables (quantifiers). Our methods pro-
pagate these nonlinear constraints and handle them algebraically, rather than
resorting to numerical approximations and traditional numerical methods for
solution.
In this section we describe some implementation-independent requirements
for a 'constraint manipulation system' (a CMS) in an ACRONYM-like system, and
then the particular system which we have implemented and use.
Systems of algebraic constraints have arisen in a number of domains of
artificial intelligence research.
Bobrow [13] used algebraic problems stated in English as a domain for an
early natural language understanding system. The proof of understanding was
to find a correct solution to the algebraic constraints implied by the English
sentences. The domain was restricted to those sentences which could be
represented as single linear equations and for which constraint problems could
always be solved by simple linear algebra.
Fikes [20] developed a heuristic problem-solving program, where problems
were described to the system in a nondeterministic programming language. The
constraints were a mixture of algebraic relations and set inclusion statements
over finite sets of integers. Fikes could thus solve constraint systems by
backtracking, although he included a number of heuristic constraint pro-
pagation techniques to prune the search space.
A common source of algebraic constraint systems is in systems for computer
aided design of electronic circuits. Stallman and Sussman [46] and de Kleer and
Sussman [27] describe systems for analysis and synthesis of circuits respectively.
In each case, systems of constraints are solved by using domain knowledge to
order the examination of constraints, and propagate maximal information from
one to the next. An algebraic simplifier is able to reduce these guided
constraints to simple sets of (perhaps nonlinear) constraints which can be
solved by well-known numeric methods. The original constraints usually form
too large a system to be solved in this way.

Borning [15] describes an interactive environment for building simulations of
things such as electrical circuits, constrained plane geometrical objects, and
simple civil engineering models. Algebraic constraints are used to specify the
relations to be maintained in the simulation. Borning uses the constraint
propagation technique described above, along with the dual method of pro-
pagating degrees of freedom among variables. When all else fails he uses a
relaxation technique which first approximates the constraints with linear equa-
tions, then uses least-mean-squares fit to guide the relaxation.
We have two main requirements for our CMS. First, we want to decide
(partially, see below) whether a set of constraints is satisfiable. This is a weaker
requirement than asking that the CMS provide a solution for a set of con-
straints when one exists, such as is done by the previously described systems.
We are not interested in an actual solution (there may be many), but rather in
its existence. Second, we use the CMS to estimate bounds on algebraic
expressions on quantifiers over the satisfying set of values for the quantifiers.
This is quite different from the tasks required of other CMS's.

3.1. Requirements for a CMS


In Section 2.3 we noted that a set of constraints on n quantifiers defines a
subset of n-dimensional Euclidean space corresponding to all possible sets of
substitutions for the quantifiers such that all the constraints are simultaneously
satisfied. Given a set S of constraints we will write the satisfying set as C_S. We
will also interchangeably use sets of constraints and restriction nodes, as in
general instances of each are associated with unique instances of the other.
The algorithms presented in the remainder of this paper use the constraint
manipulation system in three ways. In decreasing order of importance it would
be ideal if the CMS could:
(I1) Given S decide whether or not C_S is empty.
(I2) Given satisfiable S and an expression E over quantifiers constrained by
S compute the supremum and infimum of values achieved by E over the set of
substitutions C_S.
(I3) Given constraint sets S and R calculate a constraint set T such that
C_T = C_S ∩ C_R; i.e. in the lattice defined in Section 2.3, T = S ∧ R.
If the constraints were always linear in the quantifiers, then it would not be
hard to construct a CMS to behave as required, based on the simplex method.
See Section 3.3 for further details. (Clearly (I3) can be simply achieved by
letting T = S ∪ R.)
However, the algorithms to be described use the CMS as a pruning tool, in
searches for invariant predictions and for interpretations. Imperfect pruning
does not necessarily lead to failure of the algorithms. It may lead to an increase
in the portion of the search space which must be examined. If the pruning is
very poor, the algorithms may fail to find predictions and interpretations for
lack of storage space and time.

We revise the above requirements to those actually required by the predic-
tion and interpretation algorithms, independent of the heuristic power which is
required for efficient operation of those algorithms.
(A1) Given S, partially decide whether or not C_S is empty; i.e. if C_S is
non-empty, return "don't know", and if C_S is empty return either "empty" or
"don't know". Conversely this can be stated as: if the CMS can prove that S is
unsatisfiable, it should indicate so, otherwise indicate that it may be satisfiable.
(A2) Given satisfiable S and an expression E over quantifiers constrained by
S, compute an upper bound on the supremum and a lower bound on the
infimum of values achieved by E over the set of substitutions C_S; i.e. compute l
and u (numbers, or ±∞) such that

l ≤ inf_{C_S} E ≤ sup_{C_S} E ≤ u.

(A3) Given constraint sets S and R calculate a constraint set T such that

(C_S ∩ C_R) ⊆ C_T ⊆ (C_S ∪ C_R).
Note that for T derived from S and R as in (A3), if CT is empty, then so is
Cs ∩ CR (which is equal to C(S∪R)). At first sight it may seem strange that a
straightforward requirement such as (I3) be relaxed to that of (A3). First, since
the prediction and interpretation algorithms can operate under (A3), it is only a
search efficiency consideration in deciding to settle for the weaker requirement.
Second, it may be that the CMS works better on sets of constraints in some
particular form. It may not be the case that if S and R have that form, then
necessarily so will T = S ∪ R.
While not strictly necessary it is also desirable that the CMS be monotonic,
where we define monotonicity as follows. If T is a constraint set derived from S
and R as in (A3), and in particular if T ⊇ S, then:
(M1) If the CMS decides S is unsatisfiable, then it also decides that T is
unsatisfiable.
(M2) For an expression E, if lS and uS are the bounds on E over S
calculated as in (A2), lT and uT the bounds over T similarly calculated, and
both Cs and CT are non-empty, then

    lS ≤ lT ≤ uT ≤ uS.

In Section 3.3 we describe the CMS which we have implemented to meet
these requirements. It is capable of doing so for a wide class of nonlinear
constraints, and it is monotonic.

3.2. Algebraic simplification


We digress briefly to discuss some issues involved in algebraic simplification
and the idea of reducing all algebraic expressions to a canonical symbolic form.
Any algebraic constraint manipulation system needs a simplifier to make use of
the results of formal manipulations of expressions.

De Kleer and Sussman [27] describe their experience with an algebraic
simplification system which mapped all algebraically equivalent expressions
into a canonical form as the ratio of two relatively prime multivariate poly-
nomials. Each variable has a global priority used to determine the main
into a canonical form as the ratio of two relatively prime multivariate poly-
nomials. Each variable has a global priority used to determine the main
variables of the polynomials and other orderings recursively. They point out
that the canonical form is sometimes not compact, and the size can vary greatly
if the variables are globally reordered. More importantly, they discovered the
algebraic manipulator spent most of its time and space calculating greatest
common divisors (GCD's) of polynomials. When their circuit synthesis system
failed due to lack of storage, it was always because of intermediate requirements
of a single GCD calculation, whose solution was actually quite small.
They point out that their system is forced into doing much more complex
manipulations than would ever be attempted by a human engineer.
The solution to this problem is to have a system which uses a simplifier which
can handle more complex cases than a single canonical form. Furthermore, the
system should be at least mildly intelligent in what it requests of the simplifier.
Lastly, it would be advantageous if the higher level system were robust in the
following sense. Suppose the simplifier returns a complex expression which is
really equal to 0 (since we are not insisting on a canonical form, the simplifier
may not have discovered this). Suppose further that the higher level system
eventually has to abandon that expression because it is greater than some
complexity bound. A robust system's outward behavior would not necessarily
be affected by such failure, as it would possibly find some other approach to
take. The algorithms to be described in Section 5 for prediction and inter-
pretation have some of this flavor.
3.2.1. ACRONYM's algebraic simplifier
Our particular algebraic simplifier treats the symbols ∞ and -∞ in the same way
as numbers, and we will include them when we refer to numeric expressions.
The simplifier propagates them through operators such as +, ×, max, min, etc.,
where such propagations are well defined.
The simplifier has special knowledge about how to handle +, -, ×, /, max
and min (other functions such as sin and cos are treated purely syntactically
and no trigonometric identities are used). The CMS we use makes heavy use of
expressions involving max and min. The expressions representable cannot even
be tested for equality syntactically. For instance the expression A max(B, C) is
equal to max(AB, AC) if the expression A is positive, but equal to min(AB,
AC) if A is negative. Thus a syntactic canonical form is not possible. We have
not tried to develop a semantic canonical form, but instead have increased the
interaction between the simplifier and the constraint manipulation system
which uses it. Of course, inclusion of sin and cos makes the problem of
simplification to a canonical form even more difficult.
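As a concrete illustration of this sign dependence, the following is a minimal sketch
(a hypothetical helper, written in Python purely for illustration) of the rewrite for
A max(B, C); the parity argument stands in for the query the simplifier makes of the CMS.

    def distribute_over_max(a_parity, A, B, C):
        # Rewrite "A*max(B, C)" when the sign ("parity") of A is known.
        # a_parity: '+' (A always nonnegative), '-' (A always negative),
        # or '?' (unknown, in which case the rewrite must not be made).
        if a_parity == '+':
            return "max(%s*%s, %s*%s)" % (A, B, A, C)
        if a_parity == '-':
            return "min(%s*%s, %s*%s)" % (A, B, A, C)   # max flips to min
        return "%s*max(%s, %s)" % (A, B, C)             # leave untouched

For example distribute_over_max('-', 'a', 'b', 'c') yields 'min(a*b, a*c)'.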
The exact details of the standard form produced by the simplifier are not

important. For the purpose of following the explanation of the constraint
manipulation system given in Section 3.3, it is sufficient to note that all
instances of '-' are removed (multiplication by -1 is used where necessary),
and quotients always have a numerator of 1. In general, multiplication is
distributed over addition, and addition is distributed over max and min, as are
multiplication and division, where possible.
The correctness of distributing division over max and min depends on the
original arguments to those functions (unlike the multiplication case which
depends only on properties of the term being distributed, as in the example
above). For instance the simplification

    1/min(A, B) = max(1/A, 1/B)
cannot be made when A and B are of different signs. If their signs cannot be
determined in advance therefore, the simplification should not be made. When
invoked by the CMS, however, our simplifier will not return an expression
having the form of the left of the above equation; doing so can lead to
non-monotonicity of the system as described in point (M2) of Section 3.1.
Instead it returns expressions which may not be equal to the supplied expres-
sion. For instance, given the expression on the left above, the simplifier is
guaranteed to return an expression smaller or equal. Since such an expression
can only arise as a lower bound on some quantity (see Section 3.3.2), this
'simplification' results in at worst a weaker bound. For instance, given the
expression

    1/min(A, B, C, D),

the simplifier interacts with the CMS to try to determine the sign of the
expressions A, B, C and D using a method described in Section 3.3.1. Suppose
it determines that A and C are strictly negative, D is strictly positive, and the
CMS cannot determine the sign of expression B. Then the simplifier will return
the possibly smaller expression

    1/min(A, C).
If the original expression had been

    1/max(A, B, C, D),

then given the same information about the signs of subexpressions, the sim-
plifier would return the possibly larger expression

    1/D.

Since such an expression could only arise as an upper bound, the result is
merely a weaker bound.
Finally we note that every term in a simplified expression is invariant when
simplified by the simplifier.

3.3. A particular CMS


The general requirements we have stated for a CMS can be satisfied by the
well-known linear programming simplex method in the case that all the
constraints are linear. Finding whether a set of constraints is satisfiable is the
first step of simplex: determining whether there is a feasible solution. Finding
a bound on an expression is referred to as maximizing, or minimizing, a linear
objective function. By the nature of the simplex method, it seems unlikely that
it can be extended to nonlinear cases. We have already seen an example of a
nonlinear constraint arising in model definition in Section 2.2.1. Nonlinear
constraints are also regularly generated in image interpretation as will be seen
in Section 5.
The CMS we have implemented is based on another method which solves
linear programming problems. This is the 'SUP-INF' method, developed
originally by Bledsoe [11, 12] and later improved by Shostak [42]. They
developed it as part of a method for determining the validity of universally
quantified logical formulas on linear integer expressions. These formulas often
arise in program verification systems.
We have taken the linear method described by Shostak and extended it in a
fairly natural way to handle certain nonlinear cases. We have integrated a
method of bounding difficult satisfying sets by n-space rectangloids, when
straightforward extensions to the method fail or are not applicable. This
additional method was the major part of an earlier attempt of ours to build a
CMS. By itself it weakly meets the requirements of the previous section, and
may be adequate for some interpretation tasks, where fine distinctions need not
be drawn and where structural considerations (see Section 5) remove most
ambiguities.
3.3.1. A normal form for constraints
Algebraic constraints are supplied to the CMS in a variety of forms. A set of
given constraints is incorporated into a consistent normal form: an implicit
conjunction over a set of inequalities using the relation '≤', where at least one
side is a single variable and the other side consists of numbers, variables and
the operators +, / (with numerator 1), ×, sin and cos. Furthermore every such
constraint derivable from the supplied constraint is merged into the constraint
set. Much of the work is done directly by the algebraic simplifier.
Constraint sets are actually attached to restriction nodes in our im-
plementation. Constraints in the normal form are grouped into subsets,
determined by the variable which appears alone on one side of the inequality.
Constraints with single variables on both sides appear twice, once in each
subset; for example a ≤ b is associated both with variable a and variable b.
A new constraint is split into one or more inequalities. Constraints involving
an equality are split into two inequalities: A = B becomes A ≤ B and A ≥ B.
Thus for instance the constraint x = y eventually becomes four inequalities:
y ≤ x and x ≤ y which are associated with x, and x ≤ y and y ≤ x which are
associated with y. A constraint such as A ∈ [B, C], where A, B and C are
expressions, can similarly be broken into two inequalities. The operators max
and min are removed, and equivalent constraints derived where possible (if not
possible, then the constraint is discarded, and if externally generated, the user
is warned; there should never be such constraints generated internally). Thus
for instance max(A, B) ≤ min(C, D) becomes the four constraints A ≤ C,
A ≤ D, B ≤ C and B ≤ D.
Next the constraints are 'solved' for each variable which occurs in them; i.e.
each variable is isolated on one side of the inequality. Since inequalities are
involved, the signs of variables and expressions are important for these
solutions. Sometimes the signs cannot be determined but often they can be
deduced simply from explicit numeric bounds on variables given in earlier
constraints (see the discussion of parity below). Finally inequalities using '≥'
are converted to use '≤'.
For example, given prior constraints of y ≤ -1 and x ≥ 0, the addition of
constraint x/y ≤ min(-100, 200 - z) generates the following set of constraints:

    0 ≤ x,    x ≤ ∞,
    -100y ≤ x,
    200y - yz ≤ x,
    -∞ ≤ y,    y ≤ -1,
    -x/100 ≤ y,
    1/(200/x - z/x) ≤ y,
    -∞ ≤ z,    z ≤ ∞,
    z ≤ 200 - x/y.
Constraint sets generated by the CMS always contain single numeric upper
and lower bounds on each variable, defaulted to ∞ or -∞ if nothing more
definite is known. If a new numeric bound is added for some variable, it is
compared to the current bound (since they are both numeric, or ∞, or -∞, they
are comparable), and the tighter bound is used.
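The splitting step can be pictured with a small sketch (a hypothetical representation,
for illustration only: each side of a constraint is held as a list of expression strings,
read as a max on the left and a min on the right).

    def split_constraint(lhs, rel, rhs):
        # lhs, rhs: lists of expression strings, read as max(lhs), min(rhs).
        # Returns atomic '<=' constraints.  '=' is used with singleton
        # sides, as in the x = y example above.
        if rel == '=':
            return (split_constraint(lhs, '<=', rhs) +
                    split_constraint(rhs, '<=', lhs))
        if rel == '>=':
            lhs, rhs = rhs, lhs
        # implicit conjunction: every max-argument <= every min-argument
        return [(l, '<=', r) for l in lhs for r in rhs]

Here split_constraint(['x/y'], '<=', ['-100', '200 - z']) yields the two inequalities
x/y ≤ -100 and x/y ≤ 200 - z, which would then be 'solved' for each variable as
described above.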
Constraint sets are accessed by two pairs of functions. Given a set of
constraints S and a variable v, HIVALs(v) and LOVALs(v) return the
numeric upper and lower bounds respectively that are represented explicitly in
S. For instance, given the example set E of constraints above, HIVALE(x)
returns ∞ and LOVALE(x) returns 0. (The CMS, using INF defined below, is
able to determine that 100 is the smallest value x can have and still satisfy all the
constraints in E.)
More generally, the constraint sets are accessed via the functions UPPER
and LOWER which return the symbolic upper and lower bounds on a variable,
represented explicitly in the constraint set. UPPERs(v) constructs an expres-
sion which applies min to the set of upper bounds on v appearing explicitly in
S. The algebraic simplifier SIMP is applied and the simplified expression
returned. Similarly LOWERs(v) returns the symbolic max of the explicit lower
bounds. Thus, for instance, LOWERs(x) returns max(0, -100y, 200y - yz),
while UPPERs(z) constructs min(∞, 200 - x/y) which gets simplified to 200 -
x/y. These definitions of UPPER and LOWER closely follow those used by
Bledsoe [11] and Shostak [42]. They did not use HIVAL and LOVAL.
We digress briefly to explain an important use of HIVAL and LOVAL. They
are used by the algebraic simplifier to try to determine whether an expression
is always nonnegative (we will loosely refer to this as positive) or always
negative. We call this information the parity of an expression. If LOVALs(v)
and HIVALs(v) have the same sign for a variable v, then v has a parity
determined by the sign. If the lower and upper numeric bounds on v have
different signs, then we say v has unknown parity. A few simple rules are used
to try to determine the parity of more complex expressions. For instance the
sum or product of two terms with the same known parity shares that parity.
The inverse of a term with known parity has that same parity. More complex
rules are possible; we have not used them.
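A loose sketch of such rules follows (hypothetical code; here '-' is read as 'always
nonpositive', slightly coarser than the strict reading above, so that signs compose cleanly):

    def variable_parity(loval, hival):
        # Parity of a variable from its explicit numeric bounds.
        if loval >= 0:
            return '+'        # always nonnegative
        if hival <= 0:
            return '-'        # always nonpositive
        return '?'            # bounds straddle zero: unknown parity

    def sum_parity(p, q):
        return p if p == q != '?' else '?'   # a shared sign survives addition

    def product_parity(p, q):
        if '?' in (p, q):
            return '?'
        return '+' if p == q else '-'        # signs multiply

    def inverse_parity(p):
        return p                             # 1/term keeps the sign of the term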
We return now to producing a normal form for constraint sets. As symbolic
bounds are added, an attempt is made to compare them to existing bounds.
This is done by symbolically subtracting the new bound from each of the old,
simplifying the resulting expressions and applying the parity determining
function. Whenever a parity for the difference can be found, the bounds are
comparable over the ranges of variables given by HIVAL and LOVAL, and
the stronger bound can be determined from that parity.
These techniques can be used to meet requirement (A3) of Section 3.1. In
fact they also meet the ideal requirement (I3), but they do more than merely
form the union of constraint sets. Instead an equivalent set of constraints is
produced which allows for efficient operation of the bounding algorithms
described in the next section.
3.3.2. Bounding algorithms
In this section we describe algorithms used to estimate upper and lower bounds
on expressions over satisfying sets of constraint sets. They satisfy the require-
ments of (A2) of Section 3.1. They are monotonic also. Our partial decision
procedure is based on these algorithms (see Section 3.3.3).
The major algorithms SUP, SUPP and SUPPP are described in Figs. 3.1, 3.2
and 3.3 respectively. There are three similarly defined algorithms INF, INFF

Algorithm SUPs(J, H)

IF                                  ACTION                               RETURN

1.  J is a number                                                        J

2.  J is a variable
2.1   J ∈ H                                                              J
2.2   SUPs(J, H) is already
      on the stack                                                       HIVALs(J)
2.3   J ∉ H                         Let A ← UPPERs(J)
                                        B ← SUPs(A, H ∪ {J})             SUPPs(J, SIMP(B), H)

3.  J = "rA" where
    r is a number
3.1   r < 0                         Let B ← INFs(A, H)                   "rB"
3.2   r > 0                         Let B ← SUPs(A, H)                   "rB"

4.  J = "rv + A" where r is
    a number, v a variable          Let B ← SUPs(A, H ∪ {v})
4.1   v occurs in B                 Let C ← SIMP("rv + B")               SUPs(C, H)
4.2   v does not occur in B         Let C ← SUPs("rv", H)                "C + B"

5.  J = "min(A, B)"                 Let C ← SUPs(A, H)
                                        D ← SUPs(B, H)                   "min(C, D)"

6.  J = "A + B"                     Let C ← SUPs(A, H)
                                        D ← SUPs(B, H)                   "C + D"

7.  J = "sin(A)"                                                         TRIGs(A, 'sin, 'SUP)

8.  J = "cos(A)"                                                         TRIGs(A, 'cos, 'SUP)

9.  J = "1/A"
9.1   A has known parity            Let B ← INFs(A, H)                   "1/B"
9.2   A has unknown parity          Let b ← INFs(A, ∅)
                                        c ← SUPs(A, ∅)
9.2.1   b > c                                                            -∞
9.2.2   bc > 0                                                           "1/b"
9.2.3   bc < 0                                                           ∞

10. J = "vⁿA" where v is a
    variable with known
    parity, not occurring
    in A, also of known
    parity
10.1    A, J same parity            Let B ← SUPs(A, H ∪ {v})
10.2    A, J opp. parity            Let B ← INFs(A, H ∪ {v})
10.n.1  v occurs in B               Let C ← SIMP("vⁿB")                  SUPs(C, H)
10.n.2  v, J same parity            Let C ← SUPs(v, H)                   "CⁿB"
10.n.3  v, J opp. parity            Let C ← INFs(v, H)                   "CⁿB"

11. J = "AB" where A and B
    have known parity
11.1    A, J same parity            Let C ← SUPs(A, H)
11.2    A, J opp. parity            Let C ← INFs(A, H)
11.n.1  B, J same parity            Let D ← SUPs(B, H)                   "CD"
11.n.2  B, J opp. parity            Let D ← INFs(B, H)                   "CD"

12. J = "AB" where A has known
    parity, B has unknown           Let c ← INFs(B, ∅)
                                        d ← SUPs(B, ∅)
12.1    0 < c                       Let E ← SUPs(A, H)
12.1.1    A positive                                                     "dE"
12.1.2    A negative                                                     "cE"
12.2    d < 0                       Let E ← INFs(A, H)
12.2.1    A positive                                                     "dE"
12.2.2    A negative                                                     "cE"
12.3    c ≤ 0 ≤ d
12.3.1    A positive                Let E ← SUPs(A, H)                   "dE"
12.3.2    A negative                Let E ← INFs(A, H)                   "cE"

13. J = "AB" where A and B
    have unknown parity             Let c ← INFs(A, ∅)
                                        d ← SUPs(A, ∅)
                                        e ← INFs(B, ∅)
                                        f ← SUPs(B, ∅)
13.1    c > d                                                            -∞
13.2    e > f                                                            -∞
13.3    otherwise                                                        max(ce, cf, de, df)

14.                                                                      SUPPPs(J, H)

INF is defined exactly symmetrically to SUP above, with the following textual substitutions:
SUP → INF, INF → SUP, SUPP → INFF, HIVAL → LOVAL, UPPER → LOWER, min → max, max → min,
∞ → -∞ and -∞ → ∞; except in the ACTION columns of 12.1, 12.2 and 12.3, SUP and INF are
not changed, while the inequalities in those IF columns are reversed.

FIG. 3.1. Definition of algorithm SUP and lexical changes needed to define algorithm INF.

and INFFF whose definitions can be derived from the others by simple textual
substitutions. The necessary substitutions for each algorithm are described in
the captions of the appropriate figures.
The double quote marks around expressions in the figures mean that the
values of variables within their range should be substituted into the expression,
but no evaluation should occur. Thus, for instance, if the value of variable A is

Algorithm SUPPs(x, Y, H)

IF                                  ACTION                               RETURN

1.  x does not occur in Y                                                Y

2.  x = Y                                                                ∞

3.  Y = "min(A, B)"                 Let C ← SUPPs(x, A, H)
                                        D ← SUPPs(x, B, H)               "min(C, D)"

4.  Y = "bx + C" where b is a
    number, x does not
    occur in C
4.1   b > 1                                                              ∞
4.2   b < 1                                                              "C/(1 - b)"
4.3   b = 1
4.3.1   C has unknown parity                                             ∞
4.3.2   C < 0                                                            -∞
4.3.3   C ≥ 0                                                            ∞

5.                                                                       SUPPPs(Y, H)

INFF is defined exactly symmetrically to SUPP above, with the following textual substitutions:
SUPP → INFF, SUPPP → INFFF, min → max, ∞ → -∞ and -∞ → ∞. Also the inequalities in 4.3.2
and 4.3.3 are reversed.

FIG. 3.2. Definition of algorithm SUPP and lexical changes needed to define algorithm INFF.

symbol x, and that of B is the symbolic expression y + 3 - x, then the value of
"A + B" is x + y + 3 - x. In general in the definitions of the algorithms the
lower case variables have single numbers or symbols as their values while
upper case letters may also have complete expressions as their values. The
function SIMP refers to the algebraic simplifier described in Section 3.2.1
above. More liberal use of SIMP does not affect the correctness of the
algorithms, it merely decreases efficiency. The function TRIG is described in
detail below.
Each algorithm is described as a table of condition-action-return triples. This
follows the notation used by Bledsoe [11] to describe the first version of these
procedures. Our original implementation of these algorithms was in the
production rule system that is used for prediction and interpretation within
ACRONYM. Each step in the decision table was represented as a production rule.
However, the algorithms are highly recursive, and the overhead of 'procedure'
invocation for production rules made the algorithms very slow. We rewrote the
algorithms directly in MACLISP, gaining significant speedups. However, even in
the LISP environment, using different options for the code for procedure

Algorithm SUPPPs(Y, H)

IF                                  ACTION                               RETURN

1.  Y is a number                                                        Y

2.  Y is a variable
2.1   Y ∈ H                                                              Y
2.2   Y ∉ H                                                              HIVALs(Y)

3.  Y = "A + B"                     Let C ← SUPPPs(A, H)
                                        D ← SUPPPs(B, H)                 "C + D"

4.  Y = "min(A, B)"                 Let C ← SUPPPs(A, H)
                                        D ← SUPPPs(B, H)                 "min(C, D)"

5.  Y = "1/A" where A has
    known parity                    Let B ← INFFFs(A, H)                 "1/B"

6.  Y = "AB" where A and B have
    known parity
6.1     Y, A same parity            Let C ← SUPPPs(A, H)
6.2     Y, A opp. parity            Let C ← INFFFs(A, H)
6.n.1   Y, B same parity            Let D ← SUPPPs(B, H)                 "CD"
6.n.2   Y, B opp. parity            Let D ← INFFFs(B, H)                 "CD"

7.                                                                       ∞

INFFF is defined exactly symmetrically to SUPPP above, with the following textual substitutions:
SUPPP → INFFF, INFFF → SUPPP, HIVAL → LOVAL, min → max and ∞ → -∞.

FIG. 3.3. Definition of algorithm SUPPP and lexical changes needed to define algorithm INFFF.

invocation can change the running time of the algorithms by a factor of four.
This gives some indication of just how recursion-intensive these algorithms are.
Algorithms SUP, INF, SUPP and INFF are extensions to algorithms of the
same names given by Shostak [42]. Algorithms SUPPP and INFFF are new (as
is algorithm TRIG). The first five steps of our SUP and INF, minus step 2.2,
comprise Shostak's SUP and INF. Our additional steps (6-13) handle non-
linearities. Our algorithms SUPP and INFF are identical to those of Shostak,
with the addition of a final step which invokes SUPPP or INFFF in the
respective cases. For a set of linear constraints and a linear expression to
bound, our algorithms behave identically to those of Shostak.
Given a set of constraints S and an expression E, SUPs(E, ∅) produces an
upper bound on the values achieved by E over the satisfying set of S, and
INFs(E, ∅) a lower bound.

The following descriptions give an intuitive feel for what each of algorithms
SUP, SUPP and SUPPP compute. Dual statements hold for INF, INFF and
INFFF, respectively. S is always a set of constraints and H a set of variables (i.e.
quantifiers) which occur in S.

SUPs(J, H): where J is a simplified (by SIMP) expression in variables constrained by S, returns an
expression E in variables in H. In particular if H = ∅, then SUP returns a number. In general, if
numerical values are assigned to variables in H and E evaluated for those assignments, then its
value is an upper bound on the value achievable by expression J over the assignments in the
satisfying set of S which have the same assignments as fixed for the variables in H.

SUPPs(x, Y, H): where x is a variable, x is not in H, and Y is a simplified expression in
variables in H ∪ {x}, returns an upper bound for x, which is an expression in variables in H and is
computed by 'solving' x ≤ Y; e.g. solving x ≤ 9 - 2x yields an upper bound of 3 for x.

SUPPPs(Y, H): where Y is a simplified expression, returns an upper bound on Y, as does SUP,
but in general the bounds are weaker than those of SUP. Essentially SUP uses SUPPP when it
hasn't got specific methods to handle Y.

Algorithm TRIG is called from both SUP and INF. It is invoked with three
arguments: the first an expression A, the second the symbol 'sin' or 'cos', and the
third the symbol SUP or INF. Implicitly it has a fourth argument S which is
the constraint set. It takes lower and upper bounds on A using INFs(A, ∅) and
SUPs(A, ∅) and then finds the indicated bound on the indicated trigonometric
function over that interval.
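A plausible rendering of TRIG as code follows (a sketch, assuming the bounds on the
argument are finite; with infinite bounds the trivial bounds ±1 apply). It rests on the
fact that sin and cos attain interior maxima only at their peaks, so if no peak lies
within the interval the supremum is at an endpoint, and dually for the infimum.

    import math

    def contains_critical(lo, hi, phase):
        # Is there a point phase + 2*k*pi in [lo, hi] for some integer k?
        k = math.ceil((lo - phase) / (2 * math.pi))
        return phase + 2 * math.pi * k <= hi

    def trig_bound(lo, hi, fn, which):
        # fn in {'sin', 'cos'}; which in {'SUP', 'INF'}; lo <= A <= hi.
        f = math.sin if fn == 'sin' else math.cos
        peak = math.pi / 2 if fn == 'sin' else 0.0   # where f attains +1
        trough = peak + math.pi                      # where f attains -1
        if which == 'SUP':
            return 1.0 if contains_critical(lo, hi, peak) else max(f(lo), f(hi))
        return -1.0 if contains_critical(lo, hi, trough) else min(f(lo), f(hi))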
Consider the example of Fig. 3.4. The given constraints are a ≥ 2, b ≥ 1 and
ab ≤ 4. These are normalized by the procedure described in Section 3.3.1. Then
a trace of SUPs(a, ∅) is shown. It eventually returns 4 as an upper bound for a
over the satisfying set Cs of constraint set S. In fact 4 is the maximum value
which a can achieve on Cs.
Fig. 3.5 demonstrates finding an upper bound for a²b, by invoking SUPs(a²b,
∅) which returns 16. Again 16 is also the maximum value which can be achieved
by a²b over the satisfying set of S. In general, SUP will not return the
maximum value for an expression, merely an upper bound. Shostak [42] gives
an example of a linear constraint set and a linear expression to bound where it
fails to return the maximum.
Bledsoe [11] and Shostak [42] proved a number of properties of the al-
gorithms SUP and INF for sets of linear constraints and linear expressions to
be bound. The properties of interest to us are:
(P1) The algorithms terminate.
(P2) The algorithms return upper and lower bounds on expressions.
(P3) When the expression is a variable and the auxiliary set (H in our
notation) is empty, the algorithms return a maximum and minimum (including
±∞ when appropriate).
We can extend the proofs of (P1) and (P2) (due to Bledsoe [11]) to our
extended algorithms.
First note that algorithms SUPPP and INFFF terminate, since all recursive
calls reduce the number of symbols in their first argument and they exit simply

Given constraints a ≥ 2, b ≥ 1 and ab ≤ 4 the normalization procedure produces as set S the
constraints:

    2 ≤ a,    a ≤ 4 × 1/b,
    1 ≤ b,    b ≤ 4 × 1/a.

SUPs(a, ∅)
  = SUPPs(a, SIMP(SUPs(UPPERs(a), {a})), ∅)                          Step 2.3
  = SUPPs(a, SIMP(SUPs(min(4, 4 × 1/b), {a})), ∅)
  = SUPPs(a, SIMP(min(SUPs(4, {a}), SUPs(4 × 1/b, {a}))), ∅)         Step 5
        SUPs(4, {a}) = 4                                             Step 1
        SUPs(4 × 1/b, {a})
          = 4 × SUPs(1/b, {a})                                       Step 3.2
          = 4 × 1/INFs(b, {a})                                       Step 9.1
          = 4 × 1/INFFs(b, SIMP(INFs(LOWERs(b), {a, b})), {a})       Step 2.3
          = 4 × 1/INFFs(b, SIMP(INFs(1, {a, b})), {a})
          = 4 × 1/INFFs(b, 1, {a})                                   Step 1
          = 4 × 1/1                                                  Step 1 of INFF
  = SUPPs(a, SIMP(min(4, 4 × 1/1)), ∅)
  = SUPPs(a, 4, ∅)
  = 4                                                                Step 1 of SUPP

FIG. 3.4. Example of algorithm SUP bounding a variable over the satisfying set of a set of
constraints.

when the argument is a single symbol, via steps 1 or 2. By induction they
return upper and lower bounds on their first argument. Essentially the al-
gorithms evaluate their first argument at a vertex of a rectangloid which bounds
the satisfying set of S. The rectangloid is determined by the numeric upper and
lower bounds in the constraint set (as determined by HIVAL, LOVAL). If a
term can't be shown to achieve its extreme value at a vertex of the projection
of the rectangloid into the subspace of the variables of the term, then a most
pessimistic estimate is used for its value, namely ±∞.
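The vertex idea is easy to see for a multilinear term, i.e. a product of distinct
variables, where the extrema over the rectangloid really are attained at vertices; the
following sketch (ours, for intuition only) makes step 13's max(ce, cf, de, df) a special
case. A term such as a², by contrast, need not attain both extremes at a vertex of the
projection when the interval straddles zero, which is where the pessimistic ±∞
fallback enters.

    import itertools
    import math

    def multilinear_box_bounds(variables, box):
        # Bounds on the product of *distinct* variables over the rectangloid
        # box[v] = (loval, hival), assuming finite bounds; enumerating the
        # 2^n vertices suffices because a multilinear term is extremal there.
        values = [math.prod(vertex)
                  for vertex in itertools.product(*(box[v] for v in variables))]
        return min(values), max(values)

    multilinear_box_bounds(['a', 'b'], {'a': (2, 4), 'b': (1, 2)})   # (2, 8)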
Algorithms SUPP and INFF are identical to those of [42], except that they
can take more complex arguments, in which case they invoke SUPPP and
INFFF respectively. So from Bledsoe's proof and argument above, they too
terminate and provide appropriate bounds. Note that the overall performance
of the constraint manipulation system may be improved by including extra
techniques in SUPP and INFF to solve some nonlinear inequalities, rather than
passing the bounding expressions to SUPPP and INFFF in those cases.
The proof that SUP and INF terminate follows that of Bledsoe [11], and all
but steps 9.2, 12 and 13 can be so covered (step 14 is covered by the arguments
above for SUPPP and INFFF). The problem with these steps is that they reset
the auxiliary set to be empty, so there is the danger of infinite recursion, where

SUPs(a²b, ∅)
  Let B = SUPs(b, {a})                                               Step 10.1
        = SUPPs(b, SIMP(SUPs(UPPERs(b), {b, a})), {a})               Step 2.3
        = SUPPs(b, SIMP(SUPs(min(2, 4 × 1/a), {b, a})), {a})
        = SUPPs(b, SIMP(min(SUPs(2, {b, a}),
                            SUPs(4 × 1/a, {b, a}))), {a})            Step 5
              SUPs(2, {b, a}) = 2                                    Step 1
              SUPs(4 × 1/a, {b, a})
                = 4 × SUPs(1/a, {b, a})                              Step 3.2
                = 4 × 1/INFs(a, {b, a})                              Step 9.1
                = 4 × 1/a                                            Step 2.1
        = SUPPs(b, SIMP(min(2, 4 × 1/a)), {a})
        = SUPPs(b, min(2, 4 × 1/a), {a})
        = min(2, 4 × 1/a)                                            Step 1 of SUPP
  = SUPs(SIMP(a² min(2, 4 × 1/a)), ∅)                                Step 10.n.1
  = SUPs(min(2a², 4a), ∅)
  = min(SUPs(2a², ∅), SUPs(4a, ∅))                                   Step 5
        SUPs(2a², ∅)
          = 2 × SUPs(a², ∅)                                          Step 3.2
                Let B = SUPs(1, {a})                                 Step 10.1
                      = 1                                            Step 1
          = 2 × (SUPs(a, ∅))²                                        Step 10.n.2
          = 2 × 4²                                                   as in Fig. 3.4
        SUPs(4a, ∅)
          = 4 × SUPs(a, ∅)                                           Step 3.2
          = 4 × 4                                                    as in Fig. 3.4
  = min(2 × 4², 4 × 4)
  = 16

FIG. 3.5. Example of algorithm SUP bounding a nonlinear expression subject to a set of nonlinear
constraints.

an identical call is made further down the computation tree. But all of these
steps make recursive calls with first arguments containing fewer symbols. The
only place the number of symbols can grow is step 2.3, and there the first
argument is a single variable. Since there are only a finite number of pairs
consisting of a variable and a subset of the variables, any infinite recursion
must include an infinite recursion on some form SUPs(v, H), and similarly for
INF. But step 2.2 explicitly checks for duplication of such calls on the execution
stack, so step 2.3 will not be reached (note that we can't only check for
duplications of calls of the form SUPs(v, ∅), because steps 4 and 10, besides
step 2.3, can also increase the size of the set H). That SUP and INF bound their
first argument is a straightforward extension of the proof of Bledsoe [11].
Finally, we note that many of the recursive calls to SUP and INF are of the
form SUPs(v, ∅) for some variable v. Each such evaluation generates a large
computation tree. Therefore we have modified the algorithms to check for this
case explicitly. The first time such a call is made for a given set S, the result is
compared to the numeric bound on the variable v amongst the constraints in S
(as indexed with function HIVAL; recall the normal form for constraint sets).
If the calculated bound is better, then S is changed to reflect this. Subsequent
invocations of SUPs(v, ∅) on an unchanged S simply use HIVAL to retrieve the
previously calculated result. This is similar to the notion of a memo function as
described by Michie [36].
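A sketch of that caching scheme (all names are hypothetical; SUP is the algorithm of
Fig. 3.1):

    def sup_of_variable(S, v):
        # Memoized SUPs(v, {}): fold the result of the expensive recursive
        # evaluation back into the stored numeric bound, so that repeat
        # calls on an unchanged constraint set become simple HIVAL lookups.
        if not S.changed_since_memo(v):
            return S.hival(v)                # previously computed and stored
        bound = SUP(S, v, frozenset())       # the large computation tree
        if bound < S.hival(v):
            S.set_hival(v, bound)            # record the tighter bound
        S.note_memo(v)
        return S.hival(v)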

3.3.3. A partial decision procedure


We are now in a position to describe the partial decision procedure used by our
CMS. It is completely analogous to that used by Bledsoe and Shostak.
If for each variable (quantifier) x, constrained by a constraint set S, it is true that

    INFs("x", ∅) ≤ SUPs("x", ∅),

then S is said to be possibly satisfiable, otherwise it is definitely unsatisfiable.


As an example suppose we change the constraint b ≥ 1 to b ≥ 3 in the
example of Fig. 3.4. Then the bounds derived for a and b using INF and SUP
are

    2 ≤ a ≤ 1.333,
    3 ≤ b ≤ 2.

So the decision procedure concludes that the constraints are definitely not
satisfiable. Note also that for these constraints INF produces a larger lower
bound for a²b than the upper bound produced by SUP.
The soundness of the partial decision procedure follows directly from the
fact that for a satisfiable set S, SUP and INF return upper and lower bounds on
expressions over that set (for soundness it is not necessary that they return least
upper bounds and greatest lower bounds).
However, a partial decision procedure that always returned the same result,
namely that the constraint set is possibly satisfiable, is also sound. A partial
decision procedure is only interesting if it sometimes detects unsatisfiable sets
of constraints. The more often it successfully detects such sets, the more
interesting it is.
We do not have a good characterization of what classes of inconsistent
constraints our CMS can detect. In practice we have not encountered any cases
where it has failed to detect an inconsistency. We hypothesize that for sets of
linear constraints our CMS is in fact a full decision procedure. We further
hypothesize that for sets of constraints free of sin and cos, and where every
term has known parity, our CMS is also a full decision procedure. It is possible
to construct inconsistent constraints which the CMS cannot decide are un-
satisfiable.
Finally it should be pointed out that the decision procedure can be aug-
mented for a quantifier which is known to be an integer (e.g. one representing
the number of some type of subpart): the interval estimated for it can be
checked to see whether it includes an integer. If not, the set
S can be rejected as unsatisfiable.
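The check itself is a one-liner; a sketch:

    import math

    def interval_contains_integer(lo, hi):
        # True iff the closed interval [lo, hi] contains an integer.
        return math.ceil(lo) <= math.floor(hi)

A quantifier counting subparts that is bounded to [2.3, 2.9], say, lets the whole
constraint set be rejected, since interval_contains_integer(2.3, 2.9) is False.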

3.3.4. Approximating complex expressions


After implementing the CMS described in the previous section, we realized that
the functions involved had some useful applications which we had not anti-
cipated.
Since the partial decision procedure is at least exponential in the number of
symbols in the constraint set it is desirable to keep the constraint set simple
when possible. Informally at least, it also seems that inclusion of expressions
involving cos, sin, or simply expressions of indeterminate parity, is expensive, as
analysis of bounds on these expressions using SUP and INF involves 'resetting'
of the argument H to the empty set ∅, making the invocation tree even deeper.
Thus while Taylor [48] was interested in linearizing expressions so they could
be handled with the simplex method, we are interested in approximating
expressions with simpler expressions, with fewer symbols and perhaps free of
non-monotonic subexpressions.
The algorithms SUP and INF prove to be extremely useful for precisely this
task. By invoking them with a non-empty set H of variables, expressions in just
those variables in H are returned, which are respectively upper and lower
bounds on the given expression over the satisfying set of the constraint set.
More formally, given an expression E, a set of variables H, and a set of
constraints S, then INFs(E, H) and SUPs(E, H) are expressions involving only
variables in H, and

    INFs(E, H) ≤ E ≤ SUPs(E, H)

is true identically over the satisfying set of S.
We give a brief, and at this point largely unmotivated, example. The
following expression arises from the example used throughout in Section 4 on
geometric reasoning:

    E = 83.5 cos(-PAN) sin(TILT) + SH-Y sin(TILT) sin(-PAN)
        - 21.875 cos(TILT) - 30 sin(TILT) sin(-PAN)
        - SH-X cos(-PAN) sin(TILT) - cos(SH-ORI - PAN) sin(TILT).

Given a constraint set S derived from the following given constraints:

    π/12 ≤ TILT ≤ π/6,
    -π/12 ≤ PAN ≤ π/12,
    -∞ ≤ SH-ORI ≤ ∞,

then INFs(E, {SH-X, SH-Y}) produces a lower bound of

    -4.637 - 0.5 SH-X - 0.129 SH-Y,

and SUPs(E, {SH-X, SH-Y}) gives

    27.188 - 0.25 SH-X + 0.129 SH-Y

as an upper bound. The only alterations we have made to the expressions

actually used and generated by the system for this example are to reintroduce
the '-' sign, reorder the terms in the sums, and round the numeric constants,
all to increase readability.

4. Geometric Reasoning
Geometric reasoning is making deductions about spatial relationships of
objects in three dimensions, given some description of their positions, orien-
tations and shapes. There are many straightforward, and some not straight-
forward, ways to calculate properties of spatial relationships numerically when
situations are completely specified. Given the generic classes of objects which
we model in ACRONYM and generic positions and orientations which our
representation admits, purely numerical techniques are obviously inadequate.
A number of other workers have faced similar problems in the area of planning
manipulation tasks. We briefly compare a few of their solutions to these
problems below. They can be characterized as applying analytic algebraic tools
to geometry. That is the general approach that we take. We deal with more
general situations, however. There are other approaches to these problems;
most rely on simplifying the descriptive terms to coarse predicates. Deductive
results must necessarily be similarly unrefined in nature.
Ambler and Popplestone [3] assume they are given a description of a goal
state of spatial relationships between a set of objects, such as 'against' and 'fits',
and describe a system for determining the relative positions and orientations of
objects which satisfy these relations. The method assumes that there are at
least two distinct expressions for relative positions and orientations derivable
from the constraints. These are equated to give a geometric equation. They
then use a simplifier for geometric expressions which can handle a subset of
that of our system described below in Section 4.1. Finally they use special
purpose techniques to solve the small class of simplified equations that can be
produced from the problem which can be handled by the system. The solution
may retain degrees of freedom.
Lozano-Pérez [30] attacks a similar problem, but with more restrictions on
the relationships specifiable. He is therefore able to use simpler methods to
solve cases where there are no variations allowed in parameters. He describes a
method for extending these to cases where parameters can vary over an
interval by propagating those intervals through the constraints. He relies on
strong restrictions on the allowed class of geometric situations for this to work.
Taylor [48] tackles a problem similar to ours. He has positions and locations
of objects represented as parameterized coordinate transforms, and looks for
bounds on the position coordinates of objects, given constraints on the
parameters. An incomplete set of rules is used to simplify transform expres-
sions as much as possible, based on the constraints. Then, if only one rotational
degree of freedom remains, the transform is expanded into explicit coordinate
expressions which are linearized by assuming small errors. The simplex method
is used to estimate bounds on these expressions.
McDermott [35] describes a representational scheme for metric relations
between fairly unstructured objects in a planar map. Coordinates and orien-
tations within and between frames of reference are represented by ranges. A
multi-dimensional indexing scheme (based on k-d trees) is used to answer
questions involving near neighbors of objects which satisfy additional con-
straints. The system has mostly been used for planning paths past incompletely
specified obstacles. The ACRONYM constraint manipulation system and the
ACRONYM geometric simplifier described below are together able to make
stronger deductions than those described by McDermott.
In the ACRONYM context we have spatial relationships among the objects them-
selves and a camera frame, which are not specified at all directly. Typically it is
necessary to combine more than ten coordinate transforms, involving four or
more variables (quantifiers), to determine relative positions and orientations of
coordinate frames. We have two primary requirements for our geometric
reasoning system.
(i) Given an expression in many variables for a position and orientation of
an object relative to the camera frame, and given a set of constraints on those
variables (encapsulated in a restriction node), we wish to determine what image
features that object will generate quasi-invariantly over the modelled range of
variations.
(ii) Discover further constraints which can be used to split the range of
variations into cases in which further quasi-invariant features can be predicted.
We use the term image feature to mean those parts of an image which are
observable by descriptive processes. We expand on that definition in Section
5.1.
As a by-product of achieving the above objectives we also gain ways of using
measurements of image features to deduce three dimensional information. We
also believe that the techniques we are developing for geometric reasoning will
be useful in planning manipulation tasks, based on an ACRONYM-style represen-
tation of generic spatial relationships.
Throughout this section we will use as examples the two situations shown in
Fig. 4.1. These two views, with different camera geometries, are of the same
electric screwdriver sitting in its holder (it is one of the tools used by the
manipulator arms in our coordinated robotics experimental work station). The
position is represented by the quantifier SH-X inches in the x direction of table
(world) coordinates and SH-Y inches in the y direction. Its orientation is a
rotation about the vertical z-axis in world coordinates of magnitude SH-ORI.
The following constraints apply to the position quantifiers:

    0 ≤ SH-X ≤ 24,
    18 ≤ SH-Y ≤ 42.

FIG. 4.1. Two views of the electric screwdriver in its holder. The left (a) is from a camera a little
above the table, with variable pan and tilt. The right (b) is from a camera directly above the table,
with variable pitch and roll.

The orientation SH-ORI is unconstrained.
In Fig. 4.1a the camera is at world coordinates (83.5, 30, 25), with variable
pan and tilt represented by the quantifiers PAN and TILT. The following
constraints apply:

    π/12 ≤ TILT ≤ π/6,
    -π/12 ≤ PAN ≤ π/12.

Setting TILT and PAN to zero corresponds to the camera looking along a ray
parallel to, but opposite in direction to, the world x-axis. The camera is a
couple of feet above the table, tilted slightly downwards, looking at the
screwdriver and its holder about five to seven feet away.
Fig. 4.1b gives a view from an overhead camera at world coordinates (24, 30,
HEIGHT), where HEIGHT is a constrained quantifier (the uncertainty in the
height of the camera may make this example seem a little contrived; it is meant
to be illustrative in nature). The camera image plane is rotated about the y and
x axes, with magnitudes represented by the quantifiers PITCH and ROLL
respectively. The following constraints apply:

    -π/12 ≤ PITCH ≤ π/12,
    -π/12 ≤ ROLL ≤ π/12,
    60 ≤ HEIGHT ≤ 84.
In Section 4.1 we show how to simplify large products of coordinate
transforms using some identities which allow rotations to be transposed within
a rotation product expression. Simplification of the coordinate transforms
relating objects to each other and the camera allows us to decide whether
objects are in the field of view and what objects might be expected to occlude
others. The simplified expressions are in a form which allows for prediction of
invariant and quasi-invariant features. In particular we show how they can be
used in the prediction of the projected two-dimensional image shapes of objects.

4.1. Geometric simplification


The geometric simplifier takes a symbolic product of coordinate transforms and
produces a single coordinate transform, which is a pair of expressions for a
rotation and a translation, by using the following identity (recall the notation
from Section 2.1):

    (R1, T1) * (R2, T2) = (R1 * R2, R1 ⊗ T2 + T1).   (4.1)

The rotation expression obtained is thus a product of rotations. Using dis-
tributivity of rotations over translations, the translation expression becomes a
sum of terms, each of which is a product of rotations applied to a vector.
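Identity (4.1) can be made concrete with a small numeric sketch (ours, using NumPy and
the Rodrigues formula; the system itself of course manipulates these objects
symbolically). A rotation is an (axis, magnitude) pair, the rotation component of a
transform is kept as an unsimplified product list, and a transform is the pair (R, T).

    import numpy as np

    def rotate(axis, m, v):
        # Apply the rotation (axis, m) to vector v (Rodrigues formula).
        a = np.asarray(axis, float) / np.linalg.norm(axis)
        v = np.asarray(v, float)
        return (v * np.cos(m) + np.cross(a, v) * np.sin(m)
                + a * np.dot(a, v) * (1.0 - np.cos(m)))

    def compose(t1, t2):
        # (R1, T1) * (R2, T2) = (R1 * R2, R1 (x) T2 + T1); R1 and R2 are
        # lists of (axis, magnitude) pairs, read as left-to-right products.
        (r1, T1), (r2, T2) = t1, t2
        w = np.asarray(T2, float)
        for axis, m in reversed(r1):     # rightmost factor applies first
            w = rotate(axis, m, w)
        return (r1 + r2, np.asarray(T1, float) + w)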
The expressions in the coordinate transform are simplified to standard forms.
Equivalent expressions will not necessarily be reduced to the same standard
expression. The non-canonical nature of the reduction methods is not due only
to the fact that the algebraic simplifier they use does not produce a canonical
form. It is inherent in the methods themselves. Similar arguments to those of
de Kleer and Sussman [27] apply to this case also. If the mechanisms which use
the simplified geometric expressions are intelligent about their use, and are
robust in the face of occasional failure to identify a product of rotations as the
identity, for instance, then the utility of a natural standard form, which is not
necessarily canonical, far outweighs the benefits of having a canonical form
which may be clumsy in expression, and may be time and space consuming to
compute.
The geometric simplification mechanisms used by ACRONYM are useful
because of certain inherent properties of cultural artifacts. We state some such
properties here, but provide no empirical evidence for them. Suppose the
objects in a human-built setting have been described by generalized cones in a
'natural' way. By that we mean that for a given generalized cone the spine (the
x-axis of its coordinate system) lies along an axis of generalized translational
invariance [9], and that if the cross section has an axis of generalized symmetry,
then that corresponds to one of the other coordinate axes of the cone's
coordinate system. Then given two mechanically coupled cones (whether
attached or merely coincident in some way) frequently their coordinate systems
will have a pair of parallel axes (e.g. the x-axis of one may be parallel to the
z-axis of the other). Furthermore it will often be the case that there are two
(and hence three) pairs of parallel axes.
One approach to geometric simplification is to turn all rotation expressions
into three by three matrices involving sine and cosine terms, multiply them out
and then use an algebraic simplifier. (Similar approaches use homogeneous
coordinates; the same arguments apply.) We do not follow that course for two
reasons. First, it means that the algebraic simplifier must search for tri-
gonometric simplifications that are obscure in the expanded form, but obvious
in the unexpanded geometric notation, both due to the abundance of the

spatial relations described in the previous paragraph and due to the simple
algebraic relation in axis-magnitude representation between a rotation and its
inverse. Second, as we show in Section 4.2, we are able to make better use of
expressions describing spatial relations as combinations of simple geometric
transforms than we could make use of a single rotation and translation
expression, where the axis and magnitude of the rotation are both complex
trigonometric forms.
4.1.1. Products of rotations
Rotations of three space form a group under composition. The group is
associative so it is permissible to simplify a symbolic product of rotations by
collapsing adjacent ones with algebraically equal axis expressions (recall that we
represent rotations as magnitudes about an axis) by adding their magnitudes.
The group is not commutative, however. It is not possible to merely collect all
rotations with common axis expressions. There is a slightly weaker condition
on the elements of the group which allows partial use of this idea. Let al and a2
be vectors, and ml and mz be scalars. Then the following two identities are true
(the proof is simple but tedious and omitted here).
(al, mO*(a2, m2)= (a~, m~)*(((a~, -m2)®aO, m3,
(am, ml)*(a2, m2)--(((abrnl)@a2), m2)*(al, ml).
The geometric reasoning system of Ambler and Popplestone [3] collapses
adjacent rotations sharing common axis expressions, and uses the special case
of the above identities in which a1 and a2 are coordinate axes and m2 = π to
simplify geometric expressions.
We use a more general special case here (and the general case in parts of the
system; see Section 5) to 'shift' rotations to the left and right in the product
expression. However, as a rotation is shifted it leaves rotations with complex
axis expressions in its wake. There is a subgroup of rotations for which these
axis expressions are no more complex than the originals. This is the group of
24 rotations which permute the positive and negative x-, y- and z-axes among
themselves. When they are used with the above identities, the new axis
expression is a permutation of the original axis components, perhaps with some
sign changes.
Notice that these rotations are precisely the ones which relate two coordinate
systems with two (or three) parallel pairs of axes; they are very common in
models of human-made objects. We are particularly interested in a generating
subset of this rotational subgroup. It consists of the identity rotation i, and
rotations about the three coordinate axes whose magnitudes are multiples of
π/2. We write them x1, x2, x3, y1, y2, y3, z1, z2 and z3. The subscript indicates
the magnitude of the rotation as a multiple of π/2. We call these ten rotations
elementary. The fifteen other axis preserving rotations cannot be expressed as
rotations about a coordinate axis, but they can be expressed as a product of at

most 2 elementary rotations. Furthermore they can be pictured intuitively by


someone modelling an object; so they tend to be the most common way in
which users describe orientations to ACRONYM.
Since in general (-a, m) = (a, -m), elementary rotations are closed under
inverses (negation of the magnitude) and under the identities given above. For
instance

    x3 * y1 = y1 * z3,    x3 * y1 = z3 * x3.
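These relations are easy to check numerically, since the elementary rotations are just
signed permutation matrices; a small verification sketch (ours, not the symbolic
machinery):

    import numpy as np

    def rot(axis, quarter_turns):
        # Rotation about a coordinate axis by quarter_turns times pi/2.
        c, s = [(1, 0), (0, 1), (-1, 0), (0, -1)][quarter_turns % 4]
        if axis == 'x':
            return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        if axis == 'y':
            return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

    x3, y1, z3 = rot('x', 3), rot('y', 1), rot('z', 3)
    assert (x3 @ y1 == y1 @ z3).all()    # x3 * y1 = y1 * z3
    assert (x3 @ y1 == z3 @ x3).all()    # x3 * y1 = z3 * x3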

We call a rotation principal if its axis is in the direction of one of the
coordinate axes. These were the other type of rotation that we mentioned
above as commonly occurring in models of human-made objects. Elementary
rotations map x̂, ŷ and ẑ among themselves and their negations, so using the
two identities we see that moving an elementary rotation past a principal
rotation, either to the left or the right, leaves another principal rotation in its
wake. For example

    (x̂, m) * y1 = y1 * (ẑ, m),    y1 * (x̂, m) = (ẑ, -m) * y1.
We simplify products of rotations which include elementary rotations by
transposing, using the identities above and multiplying out adjacent elementary
and adjacent principal rotations which share the same axis. Consider the
following five simplification rules.
(SR1) Compose adjacent elementary or adjacent principal rotations sharing
the same axis of rotation, and remove all instances of the identity rotation.
(SR2) Move instances of zl, z2 and z3 to the left of the expression and apply
(SR1).
(SR3) While there is an x-axis elementary rotation in the expression which is
not right-most, choose the left-most such, move it one place to the right, and
apply (SR2).
(SR4) While there is a y-axis elementary rotation in the expression which is
not right-most or immediately to the left of an x-axis elementary rotation,
move it to the right one place and apply (SR1).
(SR5) Make substitutions at the right of the expression using the following
identities and apply (SR2):

    y1 * x1 = z3 * y1,    y1 * x2 = z2 * y1,    y1 * x3 = z1 * y1,
    y2 * x1 = z2 * x3,    y2 * x2 = z2,         y2 * x3 = z2 * x1,
    y3 * x1 = z1 * y3,    y3 * x2 = z2 * y3,    y3 * x3 = z3 * y3.

If these are applied to a symbolic product of rotations, then, after applying
each of the five rules in order, the expression contains at most two elementary
rotations. Any such elementary rotation will either be left-most and one of z1,
z2 or z3, or it will be right-most and one of x1, x2, x3, y1 or y3.
To show that the five rules do indeed produce such a standard form is
straightforward. The only potential difficulty is in showing the termination of
To show that the five rules do indeed produce such a standard form is
straightforward. The only potential difficulty is in showing the termination of

(SR3), since at each step the application of (SR2) may produce an x-axis
elementary rotation left of that which was previously left-most. Observe
however that if ze and xe are elementary z-axis and x-axis rotations, respec-
tively, and w * ze = ze * xe, then w must be an elementary y-axis rotation. Using
this, termination follows from showing that the number of elementary rotations
in the expression, apart from a left-most z-axis elementary rotation, is reduced
by one at each phase of (SR3).
The following expression is the 'raw' product of rotations expressing the
orientation of the screwdriver tool (the only cylinder in the left hand illus-
tration of Fig. 4.1) relative to the camera. It was obtained by inverting the
rotation expression for the camera relative to world coordinates and composing
that with the expression for the orientation of the tool in world coordinates,
found by tracing down the affixment tree:

    (x̂, TILT) * (ŷ, -PAN) * z3 * y3 * i * (ẑ, SH-ORI)
        * i * y3 * y1 * y1 * i * i * i * i.   (4.2)

When we apply the five rules (SR1)-(SR5) we obtain the much simpler
expression

    z3 * (ŷ, TILT) * (x̂, PAN - SH-ORI).   (4.3)
(In this case (SR3) had no effect.)
The appearance of a given object may be invariant with respect to certain
changes in relative orientation of object and camera. The standard form for the
rotation expressions was chosen to make it easy to further simplify the
expression by making use of such invariants. Section 4.3.2 gives an example of
this. The standard form for rotation expressions also happens to be very
convenient for the simplification of the translational component of a coordinate
transform.
4.1.2. Simplification of translation expressions
Simplification of translation expressions is quite straightforward and relies on
the rules given below. Rule (SR6) is applicable to a product of rotations in the
standard form described in the previous section. Rules (SR7)-(SR11) are
applicable to a sum of terms, each of which is a product of rotations applied to
a vector.
(SR6) Shift elementary z-axis rotations to the right end of products of
rotations.
(SR7) For each term in the sum, apply rule (SR6) to the rotation product,
then apply the elementary rotations at the right to the vector by permuting its
components and changing their signs appropriately.
(SR8) Remove terms in the sum where the vector is zero.
(SR9) Collect terms with symbolically identical rotation expressions by
symbolically summing the components of the vectors to which they are applied,
then apply rule (SR8).

(SR10) In each term remove a right-most rotation from the rotation product
if its axis vector is collinear with the vector to which the product is being
applied.
(SR11) While there is a term whose right-most rotation has an axis which is
neither collinear with, nor normal to, the vector to which the product is
applied, split the vector into collinear and normal component vectors, replace
the single term with the two new ones formed in this way, and apply rule
(SR10).
In the process of determining the translation component of a transform
expression by using (4.1) the geometric simplification system simplifies all the
rotation products in the terms of the sum. To simplify the final translation
expression, rules (SR7), (SR9) and (SR11) are applied in order. The following is
the expression for the position of the screwdriver tool in camera coordinates in
the situation shown in Fig. 4.1a:

    (x̂, TILT) ⊗ (0, -21.875, 0)
    + (x̂, TILT) * (ŷ, SH-ORI - PAN) ⊗ (0, 0, 1)
    + (x̂, TILT) * (ŷ, -PAN) ⊗ (SH-Y - 30, 0, SH-X - 83.5).   (4.4)
The original unsimplified form is far too large to warrant inclusion here. The
simplified form is both tractable and useful as we will see in the next section.
Finally we note that it is simple to subtract one translation expression from
another. In the translation to be subtracted, simply negate each component of
the vector in each of its terms, symbolically add the two translations by
appending the lists of terms, and then simplify as above.

4.2. Deciding on visibility


Given a spatial reasoning capability, one of the simplest questions which can be
asked when predicting the appearance of an object is whether the object will
be visible at all. If it is known in advance that an object will definitely not be
visible, then a lot of time can be saved by not searching for it in the image.
There are two ways that an object may not be visible. First it may be outside
the field of view. Second, even if it is in the field of view, it may be obscured by
another object closer to the camera.
It is quite straightforward to use the geometric simplification algorithms
described in the previous section and the constraint manipulation system of
Section 3.3 (or more generally any constraint manipulation system which can
meet requirements (A1) and (A2) of Section 3.1) to answer questions of possible
invisibility.
To determine whether an object is in the field of view it is necessary to know
its coordinates (cx, cy, cz) in the camera frame of reference, and the focal ratio r
of the camera (see Section 2.1). In general these can all be expressions in
quantifiers. The z coordinate cz must be negative for the object to be in front
of the camera. In that case the image coordinates of the object are (rcx/(-cz),
rcy/(-cz)). These two components can be bounded using algorithms INF and

SUP of Section 3.3. The bounds are then compared to the extreme visible
image coordinates (-0.5 and 0.5 by convention in ACRONYM), and whatever
deductions possible are made (one of 'definitely invisible', 'definitely in field of
view').
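A sketch of this test follows; it is our own illustrative formulation, with the bound pairs standing for the results of algorithms INF and SUP on the camera-coordinate expressions.

```python
def interval_quotient(num_lo, num_hi, den_lo, den_hi):
    """Bounds on num/den from bounds on each, assuming den > 0 throughout."""
    corners = [num_lo / den_lo, num_lo / den_hi, num_hi / den_lo, num_hi / den_hi]
    return min(corners), max(corners)

def field_of_view_deduction(cx, cy, cz, r=2.42):
    """cx, cy, cz are (inf, sup) bound pairs on the object's camera
    coordinates; r is the focal ratio. Image coordinates are r*c/(-cz)."""
    if cz[0] >= 0:
        return 'definitely invisible'        # never in front of the camera
    if cz[1] >= 0:
        return 'no deduction'                # may or may not be in front
    d_lo, d_hi = -cz[1], -cz[0]              # bounds on -cz, both positive
    ix = interval_quotient(r * cx[0], r * cx[1], d_lo, d_hi)
    iy = interval_quotient(r * cy[0], r * cy[1], d_lo, d_hi)
    # compare with the extreme visible image coordinates -0.5 and 0.5
    if ix[1] < -0.5 or ix[0] > 0.5 or iy[1] < -0.5 or iy[0] > 0.5:
        return 'definitely invisible'
    if -0.5 <= ix[0] and ix[1] <= 0.5 and -0.5 <= iy[0] and iy[1] <= 0.5:
        return 'definitely in field of view'
    return 'no deduction'
```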
For example, the expression E of Section 3.3.4 is the y camera coordinate cy
of the origin of the coordinate frame of the screwdriver tool in the camera
geometry in Fig. 4.1a. The z camera coordinate is of similar complexity. For a
focal ratio r of 2.42, algorithms INF and SUP provide bounds of -2.658 and
3.326, respectively, for the y image coordinate of the screwdriver tool. Thus
ACRONYM can deduce that the screwdriver tool may indeed be visible. For other
constraints on the position quantifiers (SH-X and SH-Y) it can be deduced
that the screwdriver tool is invisible even though its position and orientation
and the pan and tilt of the camera are all uncertain.
Similar techniques can be used to decide whether an object might occlude
another; whether over the whole range of variations in their sizes, structures
and spatial relations, over some ranges, or never. In this case it is better to
examine the translation between the origins of their coordinate frames. This
can be calculated by symbolically differencing their coordinates in the camera
frame and simplifying as in the previous section. Various heuristics (implemented
as rules in ACRONYM's predictor) can then be used to decide about
occlusion possibilities.
For example, consider the camera overhead geometry which gives rise to the
illustration in Fig. 4.lb. The expression for the position of the screwdriver
holder base minus the position of the screwdriver tool is
(x̂, -ROLL) ∗ (ŷ, -PITCH) ⊗ (0, 0, -2.625)
+ (x̂, -ROLL) ∗ (ŷ, -PITCH) ∗ (ẑ, SH-ORI) ⊗ (-1, 0, 0).
Expanding this out and applying algorithms INF and SUP gives bounds of
-1.679 and 1.679 on the x component, -1.746 and 1.746 on y, and -3.143 and
-1.932 on the z component. Thus ACRONYM can conclude that the screwdriver
holder base is always further from the camera than the screwdriver tool. One
heuristic rule concludes that since the x and y components are comparable in
size to the z component, it is possible that the screwdriver tool will appear in
front of the screwdriver holder base in images. Another rule, however, says
that since the view of the tool that can be seen (see Section 4.3 for the
deduction of the view to be seen) is small compared to the view that will be
seen of the holder base, it will not interfere significantly with observation of the
latter. (Actually in this case it is also concluded that the screwdriver tool is
occluded always by the screwdriver motor above it. Also other subparts of both
the screwdriver holder and the screwdriver itself interfere more with obser-
vation of the screwdriver holder base.)
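The following sketch captures the flavor of these occlusion heuristics; the bounds again stand for INF/SUP results, and the 0.5 threshold is our own illustrative stand-in for the rule's notion of 'comparable in size'.

```python
def occlusion_deductions(dx, dy, dz):
    """dx, dy, dz are (inf, sup) bounds on the camera-frame position of
    object A minus object B, as with the holder base minus the tool."""
    facts = []
    if dz[1] < 0:
        facts.append('A is always further from the camera than B')
    lateral = max(abs(dx[0]), abs(dx[1]), abs(dy[0]), abs(dy[1]))
    depth = min(abs(dz[0]), abs(dz[1]))
    if lateral > 0.5 * depth:    # lateral offset comparable to the depth gap
        facts.append('B may appear in front of A in images')
    return facts

# With the bounds from the example, dz entirely negative gives the first
# deduction, and lateral extents of about 1.7 against a depth gap of about
# 1.9 give the second.
```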
Before leaving the subject of visibility consider the following. If an object is
visible, then its image coordinates must be within the bounds of the visible part
of the image plane. Thus the expressions for the image coordinates, as
calculated above, can be bounded above and below by 0.5 and -0.5 respec-
tively, and those constraints can be merged into the constraint set. If the object is
visible it must satisfy those constraints anyway. Having the constraints expli-
citly stated may help prune some incorrect hypotheses as we will see in Section
5. Note that if the decision procedure as described in Section 3 was actually a
complete decision procedure, then we could simply merge the constraints and
test the constraint set for satisfiability to decide whether the object was visible.
However, since the decision procedure we use is only partial and cannot always
detect unsatisfiable sets of constraints, we use the less direct procedure as
described above. Even with the new constraints merged into the constraint set,
algorithms SUP and INF may not produce image coordinate bounds of 0.5 and
-0.5. This is because the bound on the expressions must be reconstructed from
the normal form of the constraints, rather than referring directly to the newly
added constraints. As we have seen in Section 3.3, SUP and INF produce only
upper and lower bounds on expressions, not suprema and infima. Furthermore,
to keep the number of symbols in the constraint set at a reasonable level we do
not use the image coordinate expressions directly in the bounds, but rather use
simplified bounding expressions as demonstrated in Section 3.3.4.

4.3. Finding invariants


The best things to predict about the appearance of an object are those which
will always be observable. We define an observable as something which can be
observed in an image; it is either a feature which might be described directly by
the low level descriptive processes, or it is a directly computable relation
between two or more such features. We say that something is an invariant
observable if it is constant and observable over the whole range of variations in
model size and structure, and its spatial relation to the camera coordinate
system.
For instance, collinear features of models (not image features) which are
observable (as image features) give rise to observable collinearity over the
whole range of spatial relations between the models and the camera coordinate
system. Parallel features of models which are observable produce parallel
observable features over the range of relative camera object orientations,
where the plane defined by the parallel model features is itself parallel to the
image plane of the camera (i.e. in ACRONYM the plane is parallel to the x-y
plane of camera coordinates).
Collinearity and parallelism can easily be detected with the coordinate
transform simplifications of Section 4.1. First, a coordinate system for the
object features is defined so that the linear feature lies along the x-axis. For
straight spines of cones, for instance, this is just the local coordinate system of
the cone. The relative orientations of these coordinate
systems can be determined by inverting the orientation of one with respect to
the camera (this simply involves reversing the order of the rotation product and
inverting the sign of the rotation magnitudes) and multiplying on the right by
the orientation of the other, and applying the simplification algorithm of
Section 4.1.1, followed by rule (SR6) of Section 4.1.2. If the resulting expres-
sion is a product of rotations containing only the identity rotation i, y2, z2, and
rotations of the form (x̂, a) for arbitrary expressions a, then the object features
are certainly parallel and perhaps collinear. (It is certainly possible that other
more complex rotations be present in the orientation expression relating
parallel object features, but in general it is not worth pursuing them.) To
decide whether object features are collinear or whether they are parallel and
generate a plane parallel to the image plane, requires examination of the
translation between their local coordinate systems. The camera coordinates of
one can be subtracted from the other as described in Section 4.1.2. If the y and
z components of the resulting vector are zero, then the object features are
collinear. If the z component is zero, then they are parallel and will invariantly
appear parallel in the image.
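A sketch of this classification follows, using our own encoding: rotations as (axis, angle) pairs with y2 and z2 the 180-degree elementary rotations, and 0 denoting a symbolically zero translation component.

```python
def classify_linear_features(rel_rotation, rel_translation):
    """rel_rotation: the simplified relative orientation of the two feature
    coordinate systems, a list of (axis, angle) rotations. rel_translation:
    the simplified (x, y, z) camera-frame difference of their origins."""
    def preserves_x_axis(axis, angle):
        # identity, y2, z2, and any x-axis rotation map the x-axis to
        # (plus or minus) itself, so features along x stay parallel
        return angle == 0 or axis == 'x' or (axis in ('y', 'z') and angle == 180)
    if not all(preserves_x_axis(a, m) for a, m in rel_rotation):
        return 'no deduction'
    _, ty, tz = rel_translation
    if ty == 0 and tz == 0:
        return 'collinear'
    if tz == 0:
        return 'parallel; invariantly parallel in the image'
    return 'parallel in space'
```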
Collinearity and parallelism are important relations which can be used to
check for consistency of interpretation of image features; consistency with
relations expected from examination of the models. Another such relation is
connectivity which, as with collinearity, is invariant over all camera models,
given that the features to be checked for connectivity are both observable.
However, to make use of all these relations we first need observable image
features. Currently we make use of primitive shape descriptions as the primary
image features used by ACRONYM for initial hypothesis of object image cor-
respondences. We digress briefly to describe the shape descriptions produced
by ACRONYM'S lOW level processes. We will then return to the topic of using
geometric reasoning to deduce shape invariants. In Section 4.3.3 we generalize
the notion of invariants to quasi-invariants; features which are observable over
a wide range of modelled variations.
4.3.1. Ribbons and ellipses
In the current implementation of ACRONYM we use ribbons and ellipses as the
features which low level processes produce. Ribbons are two dimensional
specializations of generalized cones. A ribbon is a planar shape which can be
described by three components. The spine is a two dimensional curve. A line
segment, the cross-section, is held at a constant angle to the spine, and swept
along it varying according to the sweeping-rule.
Ribbons are a good way of describing the images generated by generalized
cones. Consider a ribbon which corresponds to the image of the swept surface
of a generalized cone. For straight spines the projection of the cone spine into
the image would closely correspond to the spine of the ribbon. Thus a good
approximation of the observed angle between the spines of two generalized
FIG. 4.2. Top figure (a) shows the low level input to ACRONYM. Bottom figure (b) shows the ribbon
descriptions returned by the descriptive process, when directed by predictions to look for shapes
generated by the fuselage and wings.
cones is the angle between the spines of the two ribbons in the image
corresponding to their swept surfaces. We do not have a quantitative theory of
these correspondences.
Ellipses are a good way of describing the shapes generated by the ends of
generalized cones. The perspective projections of ends of cones with circular
cross-sections are exactly ellipses. For other cross-sections, ellipses can some-
times provide better descriptions of the ends. For example, over a given class
of orientations of a cone relative to the camera any axis of symmetry of the
cross-section is strongly skewed. Thus the axis of symmetry might be the
obvious choice for the spine of a ribbon in a geometrically simpler situation. In
a more complex situation an ellipse can provide a more tolerant prediction and
an easier descriptive hypothesis.
The descriptive module consists of two algorithms [17]: first an edge linking
algorithm based on best-first search, and second an algorithm to fit ribbons and
ellipses to sets of linked edges. The descriptive module returns a graph
structure, the observation graph. The nodes are ribbon (ellipse) descriptions.
The arcs are observed image relations between ribbons; currently we use only
image connectivity. The module produces ribbons which have straight spines
and sweeping-rules which describe linear scalings. The module provides in-
formation regarding orientation and position of the spine in image coordinates.
Fig. 4.2 demonstrates the action of the descriptive module. In Fig. 4.2a are the
692 edges found by Nevatia and Babu's [38] line finder in an 8 bit aerial image
taken from above San Francisco airport. In Fig. 4.2b are 39 ribbons found by
ACRONYM'S descriptive module when searching for shapes generated by the
fuselage and wings. There are 161 connectivity arcs in the observation graph so
produced.
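The following dataclasses sketch one plausible layout for these descriptions and the observation graph; the field names are ours, not ACRONYM's.

```python
from dataclasses import dataclass, field

@dataclass
class Ribbon:
    """A straight-spined ribbon with a linearly scaling sweeping-rule."""
    spine_position: tuple     # image coordinates of the spine's start point
    spine_orientation: float  # orientation of the spine in the image
    spine_length: float
    width: float              # cross-section length at the spine's start
    taper: float              # linear scaling of the cross-section along the spine

@dataclass
class ObservationGraph:
    """Nodes are ribbon (or ellipse) descriptions; arcs are observed image
    relations, currently only connectivity."""
    nodes: list = field(default_factory=list)
    connectivity: set = field(default_factory=set)  # frozensets of node indices
```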
4.3.2. Invariant shapes
The most important factor in predicting shape is the orientation of the object
relative to the camera. It is therefore potentially interesting to consider under
what variations in orientation of an object relative to the camera does its
perceived shape remain invariant. In fact such invariants are very useful for
reducing the complexity of the expressions derived, using the methods des-
cribed in Section 4.1.1 for object orientations, to manageable levels where shape
can be predicted directly.
Note first that for a generalized cone which is small compared to its distance
from the camera, perspective effects are small. There may still be strong
perspective effects between such objects, however. (For instance cones with
parallel spines defining a plane which is not parallel to the image plane will still
have a vanishing point.) In any case it is therefore true that in predicting the
apparent shape of such generalized cones, we can approximate the perspective
projection with a slightly simpler projection. In ACRONYM we carry out shape
prediction using a 'perspective-normal' projection. For a generalized cone
whose closest point to the camera has z coordinate z', the projection of a point
(x, y, z) in three space into the image plane of a camera with focal ratio r is
(rx/(-z'), ry/(-z')). Intuitively we think of this as a normal projection into a
plane which is parallel to the image plane, intersects the generalized cone, and
is the closest such plane to the camera, followed by a perspective projection of
the image into the camera. It is also equivalent to a normal projection scaled
according to the distance of the object from the camera. We will see examples
of why this is so useful in Section 4.3.3. We further simplify the perspective-
normal projection in ACRONYM by using the z camera coordinate of the origin
of the cone coordinate frame, rather than z' as defined above.
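In code, the perspective-normal projection is just the following (a sketch, using the cone-origin z coordinate as stated):

```python
def perspective_normal(point, z_origin, r):
    """Project (x, y, z) with the perspective-normal approximation: a normal
    projection scaled by the fixed depth z_origin of the cone's coordinate
    origin, with z_origin < 0 for objects in front of the camera."""
    x, y, _ = point
    return (r * x / (-z_origin), r * y / (-z_origin))
```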
We return now to the problem of simplifying orientation expressions while
keeping the implied shape invariant. The normal form described in Section
4.1.1 was designed with such problems in mind. First note that a rotation about
the z-axis at the left of a rotation product corresponds to a rotation in the
image plane (recall the definition of camera geometry in Section 2.1). Our two
dimensional shape descriptions are invariant with respect to orientation. So all
shape prediction is unaffected by ignoring such rotations. Thus, for instance,
(4.3) for the orientation of the screwdriver tool in Fig. 4.1a, is equivalent to
(ŷ, TILT) ∗ (x̂, PAN - SH-ORI)   (4.5)
for the purpose of predicting the shape of the image of the screwdriver tool. In
general our standard form for rotation expressions has all elementary z-axis
rotations moved to the left--ready to be ignored.
The screwdriver tool is a cylinder, with its spine (an axis of radial symmetry)
along the x-axis of its coordinate system. Thus the appearance of the tool is
invariant with respect to a rotation about its x̂-axis. The right rotation of (4.5)
can thus be ignored for the purpose of shape prediction, leaving
(ŷ, TILT)   (4.6)
to be analyzed. In physical terms this says that the camera tilt is the only
variable of the case in Fig. 4.1a that is important for shape prediction.
Expression (4.6) is simple enough that special case rules are applicable. One
says that the cylinder will appear as a ribbon generated by its swept surface and
an ellipse generated by its initial cross-section. Furthermore they will be
connected in the image. (If the descriptive process which found ellipses was
able to accurately determine their major axis, then another useful rule could
come into play. From (4.6) it would deduce that in the image the major axis of
the ellipse will be normal to the spine of the ribbon.) Later in the prediction it
is decided that the ellipse corresponding to the top of the screwdriver tool will
actually be occluded (as described in Section 4.2), but that need not concern us
here.
The screwdriver modelled in Fig. 4.1 is actually a particular screwdriver with
specific dimensions. To make this example more general, suppose that the
screwdriver tool has variable size, with its length represented by the quantifier
TOOL-LENGTH and its radius by TOOL-RADIUS. Using the perspective-
normal projection approximation, the length to width ratio of the ribbon
corresponding to the swept surface of the screwdriver tool can be predicted to
be
TOOL-LENGTH × cos(TILT) / TOOL-RADIUS.

Consider the ellipse corresponding to the top of the screwdriver tool. The ratio
of its minor axis to its major axis is simply
sin(TILT).
Thus the range of shapes that can be generated have been comprehensively
predicted. The actual form in which these predictions are used is not just to
establish a predicate against which hypothesized shape matches will be tested.
They are used in a more powerful way, described in Section 5, to actually
extract three dimensional information about the viewed scene.
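As a check on these formulas, the two predicted shape parameters can be written directly (a sketch; TILT in radians, function names ours):

```python
import math

def ribbon_length_to_width(tool_length, tool_radius, tilt):
    """Predicted length-to-width ratio of the swept-surface ribbon of the
    variable-size screwdriver tool, per the formula above."""
    return tool_length * math.cos(tilt) / tool_radius

def ellipse_minor_to_major(tilt):
    """Predicted minor-to-major axis ratio of the ellipse generated by the
    tool's initial cross-section."""
    return math.sin(tilt)
```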
Beside shape, bounds on the dimensions of objects can also be predicted.
This too is used to extract three dimensional information as described in
Section 5. Size bounds have a more immediate application, however. They are
used to direct the low level descriptive processes [17], which search the image
for candidate shapes to be matched to predictions. Given that the focal ratio is
2.42 and the length of the screwdriver tool is 1, and using the expanded z
component of (4.4), the algebraically simplified prediction of the length of the
ribbon in the image is
-2.42 cos(TILT)/(30 cos(TILT) sin(-PAN) + SH-X × cos(-PAN)
+ cos(TILT) cos(SH-ORI - PAN) - 21.875 sin(TILT)
- 83.5 cos(TILT) cos(-PAN) - SH-Y × cos(TILT) sin(-PAN)).
Algorithms INF and SUP are used to determine that this quantity is bounded
by 0.0190 and 0.0701, information which can be used by the descriptive
processes to limit the search for candidate ribbons in the image. These are not
particularly accurate estimates on the infimum and supremum of the above
expression because it contains sines and cosines which have coupled arguments,
but our constraint manipulation system treats them independently and makes
the most pessimistic bounds. However, they are still exceedingly useful for
limiting search.
In general, at the right of the standard form for a product of rotations is one
of six elementary rotations: i (implicitly only), y1, y3, x1, x2 and x3. If there are
no other rotations in the expression these correspond to the six views of a
generalized cone from along the positive and negative coordinate axes. Rota-
tions y1 and y3 correspond to viewing the initial and final cross-sections of the
generalized cone. In (4.3) the right-most elementary rotation is
implicitly the identity i, which corresponds to a side view of the cylinder. In
trying to reduce the complexity of the orientation expression, ACRONYM essen-
tially tries to find how the nonelementary rotations change the viewpoint from
one of the six primitive viewpoints of a generalized cone.
As a final example consider the orientation expression derived for the
screwdriver tool in the camera geometry illustrated in Fig. 4.1b:
(x̂, -ROLL) ∗ (ŷ, -PITCH) ∗ (ẑ, SH-ORI) ∗ y1.
Here the right-most elementary rotation is y1 which corresponds to viewing the
initial cross-section of the cylinder of the screwdriver tool. In the modelled
situation that is the top of the cylinder, but that is not derivable from this
expression with as simple an analysis as used in (4.3). Various heuristic rules try
rearranging the expression to find a situation in which an invariant sim-
plification can be detected. One such rule tries shifting the right-most elemen-
tary rotation left one position. Using the identities of Section 4.1 this gives the
expression
(x̂, -ROLL) ∗ (ŷ, -PITCH) ∗ y1 ∗ (x̂, -SH-ORI).
As in the previous example we now have an x-axis rotation at the right. A
cylinder's appearance is invariant with respect to a rotation about its axis, so
we can use the simplified expression
(x̂, -ROLL) ∗ (ŷ, -PITCH) ∗ y1   (4.7)
for predicting shape. This expression is still too complex for direct prediction,
and to handle it we need to use quasi-invariance, introduced above.
4.3.3. Quasi-invariants
Sometimes the search for invariants will be unsuccessful. Often when there are
no invariants directly available, ACRONYM is able to carry out case analysis. It
produces new restriction nodes, descendants of the old, each with additional
constraints, which restrict the situation in each restriction node to one where
there are adequate invariants available. The additional constraints are chosen
so that the lattice supremum of the new restriction nodes is the original node.
Often, however, case analysis is not enough. Expression (4.7) is a case in
point. The problem is that there are two rotations with uncertain magnitude. If
there was only one, shape prediction could be achieved in a similar manner to
that in the previous example. In this case both rotations have small magnitudes.
The effects of these rotations on the perceived shape of the object depend
roughly on the cosines of their magnitudes. The cosines are almost constant,
ranging from 0.965 to 1 over the modelled range of variations. Thus for shape
prediction at least they can be effectively ignored; the error involved in doing
so will be smaller than the errors incurred by low level descriptive processes.
Heuristic rules, written on the basis of error analyses, are used to identify such
cases. Later we may include rules which can carry out analyses dynamically
using differential approximations to nonlinear expressions.
The preceding is a particular case of a more general phenomenon, involving
local maxima, minima, or points of inflexion of expressions. Often some
prediction is very nearly invariant over the modelled range of variations.
Where an invariant is found by ignoring a small effect of some term we call it a
quasi-invariant. The most common instances of quasi-invariants arise from
ignoring cosine terms with small arguments.
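A sketch of such a quasi-invariance test for a rotation of uncertain but small magnitude follows; the tolerance is an illustrative stand-in for the error analyses mentioned above. The ROLL and PITCH example passes it, since the cosines range only from 0.965 to 1.

```python
import math

def cosine_quasi_invariant(angle_lo, angle_hi, tolerance=0.05):
    """True if cos(angle) is nearly constant over [angle_lo, angle_hi]
    (radians), so the rotation's effect on predicted shape can be ignored.
    For small intervals the extremes lie at the endpoints or at zero."""
    samples = [angle_lo, angle_hi]
    if angle_lo < 0.0 < angle_hi:
        samples.append(0.0)          # cos attains its maximum at 0
    values = [math.cos(a) for a in samples]
    return max(values) - min(values) < tolerance
```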

5. Prediction and Interpretation


In the previous section we showed how invariant and quasi-invariant observ-
ables could be discovered from reasoning geometrically about models. We now
describe how to combine quasi-invariants into predictions of image features.
Predictions include instructions on how to use image feature measurements
from hypothesized partial interpretations, to constrain the three dimensional
models, identifying class memberships and specific three dimensional spatial
relations. The predictions drive descriptive processes to produce descriptions of
image features, to be matched to predictions. An interpretation algorithm is
used to hypothesize such matches, to apply the resulting constraints to models,
and to combine local interpretations into globally consistent interpretations.
At the time of writing, the interpretation algorithm described here has not
been fully implemented as part of ACRONYM. All image interpretations carried
out completely automatically by ACRONYM so far have used a syntactic graph
matcher due to Greiner (see [18]). It does not have the back-constraining
capability, and thus it cannot use the most significant aspects of the predictions.
Implementation and integration of the proposed algorithm is underway and
will be completed soon.
A procedure [29] has been demonstrated which determines certain model
parameters (slot values rather than quantifiers) numerically by an iterative
technique once good matches have been established. It has not been integrated
into ACRONYM.
We first describe in detail the form of the prediction graph, then show how
back constraints are set up and the effects of using multiple back constraints.
We then outline an algorithm for interpretation which consists of screening
matches between predictions and observed features, followed by combining
local interpretations into more global interpretations.

5.1. Prediction
Prediction is used to build the prediction graph which provides a description of
features and their relations which should be matched by features in an image.
Prediction has two other major uses, however. The first is to provide direction
to the low level descriptive processes. This was described in Section 4.3.1. The
second is to provide instructions on how to use image measurements to
understand the three dimensional aspects of the objects which gave rise to the
measured image.
The preceding sections have dealt with many of the specific mechanisms used
for prediction. The following subsections give an overview of how these
mechanisms fit together.
5.1.1. Producing the prediction graph
The prediction graph consists of nodes and arcs. The nodes are either predic-
tions of specific image features, or recursively complete prediction graphs of
finer level features. In our current implementation only shapes are predicted.
The arcs of the graph specify relations between the nodes. There are three
types of arcs: must-be, should-be, and exclusive. The first two are similar but
imply slightly different acceptance criteria for instantiation of their associated
nodes. Details are given below in Section 5.2. Such arcs can predict a variety of
relations. For instance, we currently predict connectedness, relative spine
orientation in the image, and simply the AND relation that instantiations must
exist for both nodes, to consider an instantiation of either to be correct.
Exclusive arcs say that instantiations of the two related nodes cannot coexist in
an interpretation graph. This last type of arc is rarely intrinsically needed as
such information is usually encoded in the back constraints implied by different
instantiations. However, when the prediction algorithm knows that two predic-
tions are mutually exclusive (such as the visible shapes for two ends of a simple
generalized cone) it can save the interpretation algorithm the expense of
deciding that the meet of restrictions associated with two interpretation nodes
is unsatisfiable by joining the prediction nodes by an exclusive arc. Thus they
are an efficiency consideration.
A prediction graph has associated with it a restriction node which refers to
the object class being predicted. It could also be the base-restriction, the most
general restriction node, in which case the graph predicts the whole scene
which appears in the image to be interpreted.
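One plausible rendering of the prediction graph's structure is sketched below; all names are illustrative rather than ACRONYM's own.

```python
from dataclasses import dataclass, field
from enum import Enum

class ArcType(Enum):
    MUST_BE = 'must-be'        # instantiation required for the related node
    SHOULD_BE = 'should-be'    # relation checked, but absence tolerated
    EXCLUSIVE = 'exclusive'    # instantiations cannot coexist

@dataclass
class PredictionNode:
    """A specific image-feature prediction, or recursively a complete
    prediction graph of finer-level features."""
    feature_ranges: dict       # e.g. {'length': (0.0190, 0.0701)}
    back_constraints: list     # instructions for turning image measurements
                               # into constraints on the 3-D model

@dataclass
class PredictionGraph:
    restriction_node: object   # the object class being predicted
    nodes: list = field(default_factory=list)
    arcs: list = field(default_factory=list)  # (i, j, ArcType, relation)
```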
In Section 4 we showed how to predict the shape of a cylinder given certain
constraints on its orientation relative to the camera. Prediction proceeds by
examining the constraints on objects. If they are tractable, then specific rules
are used to make special case predictions. Otherwise case analysis is performed
by adding constraints which produce tractable situations. Each different con-
straint adds a new restriction node, more restrictive than the restriction node
associated with the prediction graph. It is the lattice infimum of that
restriction and the new constraint.
Single generalized cones can generate image shapes in a number of ways.
Shapes can be generated primarily by cross-sections at each end, by the swept
surfaces, or a combination of the two. In each case the shape boundary may be
generated by actual edges on the generalized cone (discontinuity in the
direction of surface normal) or on apparent edges dependent on camera
location (images of points where the surface normal is normal to the line of
sight). We call these boundary curves contours.
Image feature predictions are made for each contour; specifically, a predic-
tion of the shape of each contour is made. First the size of the contour in model
coordinates is calculated. Certain simple approximations can be made at this
point. For instance the occluding contour of a right circular cylinder is a
rectangle having the same length as the cylinder and a width twice its radius,
given that the cylinder is not extremely close to the camera. The dimensions of
the contour are expressed in terms of model parameters. The contour is then
symbolically projected with the perspective normal transform to obtain a
prediction of the ribbon or ellipse which it will generate in an image. This
whole process is carried out by special purpose rules which embody an analysis
of their domain of applicability. A rule-base is used to enhance extensibility of
the system. New constraint ranges and classes of generalized cones can be
handled simply by adding a new rule to the rule-base once a new analysis has
been carried out by the ACRONYM maintainer.
5.1.2. Setting up back constraints
The prediction algorithms produce symbolic expressions for predicted image
feature characteristics (e.g. length of a ribbon). Let E be an example expres-
sion. During prediction such expressions are bounded numerically to give
direction to the low level processes. The bounds are calculated over the
satisfying set of the quantifiers in the symbolic expression. When a feature is
hypothesized as a match for a predicted feature there is a corresponding image
measurement available. Let it be m. If the match is correct, then the
measurement m is the actual numeric value of the prediction expression E.
Assuming the expression has that value provides constraints on the values of
the individual quantifiers. If the image provided an exact measurement, then
we could simply add the constraint E = m, and use SUP and INF to find what
this implied about model and spatial parameters.
There are large errors in results of the feature description algorithms [17],
and instead of an exact measurement m the algorithms provide only estimates:
a closed error interval, [m_l, m_u] say, on image feature parameters.
Therefore we can add the constraint m_l ≤ E ≤ m_u. In practice such expressions
may not be the best ones to use as they may contain many symbols, and they
may be hard for the simplifier to manipulate due to uncertain parities of
subexpressions. During prediction, however, we may have special knowledge
about the expressions from geometric considerations and so can write in-
structions for the interpretation algorithm about how to build constraints from
feature measurements which can be handled by the simplifier and CMS in
general. These instructions are attached to predictions.
We illustrate the preceding by following through our example of the screw-
driver tool as it appears in the camera geometry illustrated in Fig. 4.1a. Given
the constraints on the location of the screwdriver holder (in terms of table
coordinates SH-X and SH-Y) we have already seen that the length of the
image ribbon corresponding to the screwdriver tool will lie between 0.0190 and
0.0701, which was obtained by bounding the formula rl/(-z') where r is the
camera focal ratio (2.42), l the normal projection length of the tool cylinder
(cos(TILT)), and z' the distance of the origin of the cylinder coordinates to the
image plane (an expression in PAN, TILT, SH-ORI, SH-X and SH-Y). We
know that if the object is visible, then it must be that z' ≤ 0. This information is
not derivable by the algebraic simplifier in this case because z' is a complex
expression. The prediction rules can safely assume it, however, and so they
specify that when a ribbon with length estimate [m_l, m_u] is hypothesized as the
image of the screwdriver tool in the context of restriction node S, then the
constraints obtained by evaluating the expressions within the two
inequalities
m_l × INF_S(z', H) ≤ SUP_S(2.42 cos(TILT), H),
m_u × SUP_S(z', H) ≥ INF_S(2.42 cos(TILT), H)
should be added to the constraints already in S, where H = {PAN, TILT,
SH-ORI}. (The exact mechanism for selection of node S is given in the next
section.) In this case the constraints will further constrain SH-X and SH-Y, the
table position coordinates of the screwdriver holder. In general, addition of
such constraints may constrain positions, orientations, model size, or camera
parameters.
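In code, the instructions attached to this prediction amount to something like the following sketch, where inf_s and sup_s stand for algorithms INF and SUP evaluated in restriction node S and the expression arguments remain symbolic; the representation is ours.

```python
def ribbon_length_back_constraints(m_lo, m_hi, inf_s, sup_s):
    """Build the two back constraints above from a measured length interval
    [m_lo, m_hi]. inf_s(e) and sup_s(e) bound expression e over the
    quantifiers H = {PAN, TILT, SH-ORI}, leaving the rest symbolic."""
    z_far = "z'"                      # distance expression, known to be <= 0
    numerator = "2.42*cos(TILT)"      # focal ratio times projected length
    return [
        ('<=', ('*', m_lo, inf_s(z_far)), sup_s(numerator)),
        ('>=', ('*', m_hi, sup_s(z_far)), inf_s(numerator)),
    ]
```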
We demonstrate the effect of these additional constraints. We add them to
the initial modelled set of constraints only. Recall that in that case we have
0 ≤ SH-X ≤ 24 and 18 ≤ SH-Y ≤ 42.
Suppose the interpretation processes described below hypothesize a match of
the swept surfaces of the screwdriver tool with a ribbon in the image. The
descriptive processes return image measurements as nominal values with
fractional error estimates. In the example at hand, suppose that the ribbon is
measured to have length of 0.05 units, with plus or minus 10% error. Then the
additional constraints generated ensure that
4.762 ≤ SH-X ≤ 24 and 18 ≤ SH-Y ≤ 42.
If the length is measured as 0.07 with an error bound of 10%, then the
constraints imply that
20.127 ≤ SH-X ≤ 24 and 27.035 ≤ SH-Y ≤ 42.
Even with a 40% error estimate, a measurement of 0.07 contributes three
dimensional information. This is to be expected as it is very close to an extreme
of the predicted range of measurement.
Note that the constraints added actually contain more information than is
reflected in examining the resulting ranges on individual parameters. The
constraints added actually chip off (in general nonlinear) portions of the
original rectangle of satisfying values achievable for SH-X and SH-Y. The
actual constraints added in the first example above were
3.017 - 0.0435 × SH-X - 0.0113 × SH-Y ≤ 2.338,
5.503 - 0.0460 × SH-X + 0.0138 × SH-Y ≥ 2.096,
which are much stronger than the simple inequalities in SH-X alone that can
be derived from these two constraints.
There are other image measurements even from a single ribbon which can be
used to constrain three dimensional parameters. Obviously ribbon width and
taper can be used analogously to ribbon length. Position of the ribbon within
the image can also be used. In the above example it will tend to constrain
camera parameters such as PAN, TILT, and also SH-Y. Prediction rules set up
the appropriate instructions for building constraints based on these measure-
ments.
5.1.3. Multiple back constraints
The previous example deals only with constraints derivable from hypothesizing
a match with a single ribbon. In identifying instances of an object whose
description is more complex than a single generalized cone, there will be more
than one primitive shape feature matched. Each provides a number of such
back constraints which combine to further constrain the individual parameters.
Suppose an object is modelled with a well-determined size, position and
orientation. When constraints from hypothesized matches for many objects are
combined, that particular object will be extremely useful for determining
parameters of the camera and other objects. If there are many such tightly
constrained modelled objects, then they are even more useful. Thus a mobile
robot can use known reference objects to visually determine its absolute
location and orientation, and the absolute location and orientation of other
movable objects.
In a bin picking task the camera parameters and location of the bin are
probably well determined (although ACRONYM would not be at a loss if this were
not the case). The problem is to distinguish instances of an object and
determine its orientation so that a manipulator can be commanded to pick it
from the bin. There will be many instances of each predicted image feature as
there will be many instances of each object. The back constraints provide a
mechanism for the interpretation algorithm to find mutually consistent fea-
tures, and thus identify object instances. Furthermore the back constraints
provide information on the position and orientation of the object instance.
In aerial photographs the back constraints tend to relate scale factors to
camera height and focal ratio. In aerial photographs an identifiable landmark
can provide one tight relationship between these parameters. Derived back
constraints from other objects interact to give relatively tight bounds on all
unknowns.

5.2. Interpretation
Interpretation proceeds by combining local matches of shapes to individual
generalized cones into more global matches for more complete objects
(ACRONYM currently relies on shapes only). The global interpretations must be
consistent in two respects. First, they must conform to the requirements
specified by the arcs of the prediction graph. Second, the constraints that each
local match implies on the three dimensional model must be globally con-
sistent; i.e. the total set of constraints must be satisfiable.
At a given time the interpreter looks for matches for a set of generalized
cones, called the search set. They are cones determined by the predictor to
have approximately equal importance in determining an interpretation for an
image. Smaller generalized cones, corresponding to finer image features, are
searched for later. Feature predictions include both an estimated range for
feature parameters (e.g. ribbon length) and constraints on the model implied by
hypothesizing a match with an image feature. The descriptive processes are
invoked with direction from the first aspect of the predictions. The observation
graph of features and observed relations between features is the result. Since
the search set in general contains more than one generalized cone, not all the
described features will match all, or even any, generalized cones in the search
set. A comparison of all the image feature parameters with their range
predictions is carried out to determine possible matches for each generalized
cone in the search set (e.g. a ribbon's length and width must both fall in the
predicted ranges to be considered further).
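The first pruning stage is essentially the following filter, a sketch over the illustrative structures shown earlier; the params dictionary on each observed feature is our own assumption.

```python
def first_stage_matches(search_set, observations):
    """Pair each predicted generalized cone with every observed feature whose
    measured parameters all fall inside the predicted numeric ranges."""
    matches = []
    for prediction in search_set:
        for feature in observations:
            if all(lo <= feature.params[name] <= hi
                   for name, (lo, hi) in prediction.feature_ranges.items()):
                matches.append((prediction, feature))
    return matches
```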
There is a question of partial matches for predicted features. The current
descriptive processes used [17] partially take care of this problem in a fairly
undirected manner. If edges associated with the two ends of a ribbon are
observed by the line finder [38], then the edge linking algorithm will probably
hypothesize a ribbon, despite possible interference in the middle sections. (The
strategy which works successfully is to make as many plausible hypotheses as
possible at the lowest levels, so that the likelihood of missing a correct
hypothesis which may be locally weak is low, and use the higher level know-
ledge of the system to prune away the excess later.) Sometimes, also the
predictor will predict specific obscurations and adjust its feature prediction
accordingly. In general, however, an additional mechanism which hypothesizes
image features as partial matches for larger predictions may be very useful.
Thus a ribbon might be hypothesized as being only one end of a larger ribbon
by not requiring that it fits the length prediction. It is also necessary in this case
to increase the error estimate in the length measurement for the next stage of
pruning, described below. We have not yet implemented such a mechanism,
but plan to in the near future.
For each feature prediction pair which survives this first stage of pruning a
restriction node is built which is more restrictive than the restriction node
which is associated with the prediction. It inherits the constraints from the
prediction restriction, but also has added those constraints built by following
the instructions in the prediction. Often the new restriction node will be
unsatisfiable, and so the feature prediction pair can be eliminated from further
consideration. For instance both the length and width of a ribbon may fall in
the predicted ranges, but perhaps the length is at the high end of the range
and the width at the low end. Then it is possible that the back constraints so
generated will put inconsistent demands on the orientation of the object
relative to the camera, or will be inconsistent with some modelled con-
straint on the length to diameter ratio of the cone. (For example in the generic
class of jet aircraft, the fuselage lengths and diameters can vary greatly, but the
length to diameter ratio varies much less. A constraint may be added to the
model class expressing this fundamental relationship of overall scaling in
aircraft.)
The interpreter tries to instantiate arcs of the prediction graph by pairwise
checking hypothesized instantations of predicted features which have a relation
predicted between them. Both must-be and should-be arcs are thus in-
stantiated. Instantiation of arcs is similar to that of nodes. Gross predicted
features are checked first, then a restriction node is constructed which includes
the constraints implied from image measurements of the relation. For instance,
suppose an arc predicts a range of angles between the spines of two ribbons.
First, the angle between the image spines must lie in the predicted numerical
range. Then the constraints associated with the arc may constrain the relation
between the orientation of the object relative to the camera and the relative
orientations in three space of the two generalized cones corresponding to the
two ribbons.
A combinatorial search is carried out to collect individual hypothesized
instantiations of nodes and arcs into hypothesized connected components of
the interpretation graph. The connectivity referred to, here, is that supplied by
instantiated must-be and should-be arcs. Exclusive arcs prevent the collection
together of some inherently mutually exclusive local interpretations. The
algorithm used here is a variation on the constraint propagation algorithm
introduced by Waltz [49], used for labelling line drawings.
Recall the semantics of the two arc types. If the two feature predictions
which participate in a should-be arc are instantiated, then they can be regarded
as consistent local interpretations only if the instantiations support an in-
stantiation of the arc predicted between them. A feature prediction which
participates in a must-be arc can only be instantiated if there is a mutually
consistent instantiation of the other node participating in that arc and also the
corresponding arc is instantiated. As an example consider the prediction that
an aerial view of an aircraft will include wings connected to the fuselage. If
must-be arcs are used, then all of the fuselage, port and starboard wings must
be observed to allow interpretation of image features as an aircraft. If should-
be arcs are used, then it is possible to return a partial interpretation such as
'an aircraft with one wing missing'.
In our previous implementation of the interpreter we found that just pruning
using must-be arc requirements was sufficient to carry out object classification
correctly. In that implementation of the interpreter we also used only the
simpler form of matching predictions to features where feature measurements
were compared to prediction ranges, but no back constraining was done. The
simple requirements specified by the must-be arcs of the observability graph,
while only moderately strong by themselves, are very strong in conjunction
with the requirements specified by the nodes. In our experience with aerial
images we found it extremely rare that two nodes and a connecting arc of the
prediction graph were incorrectly instantiated in the observation graph. We
have not observed a case of a three node, two arc subgraph of the prediction
graph being incorrectly instantiated.
The reasons that we have added the constraint mechanism to a successful
interpretation system are two fold. First, although the original scheme never
incorrectly interpreted image features as an object instance, they sometimes
failed to detect objects when predicted feature relations were not observed.
Merely relaxing feature relation predictions does lead to incorrect image
interpretations. The constraint system allows for relaxed predictions but still
provides a mechanism for checking consistency of partial matches to
disconnected components of the prediction graph (at least via must-be
arcs). The relaxation of predictions referred to is the replacement of must-be
by should-be arcs. Second, the constraints provide a mechanism for gaining
three dimensional information from image interpretations.
As each connected component is built, interpretation restriction nodes are
checked for consistency. The simplest way to do this would be to calculate the
lattice infimum (actually use (A3) of Section 3.1) over the restriction nodes
associated with the interpretations of each feature and feature relation.
However, in the general case this can lead to some problems. For example, a
class of aircraft may be modelled with the spines of generalized cones of the
two wings each having their length slot filled with the quantifier
W I N G - L E N G T H . When combining local matches for the two wings of a single
aircraft we want the constraints on W I N G - L E N G T H to be consistent, as each
wing should have the same length. However, when we are combining two local
interpretations of aircraft into, say, an interpretation of an image as an airfield,
then the W I N G - L E N G T H in the two cases refers to a different physical
quantity. Individual aircraft have their wings the same length but different
aircraft may have different wing lengths.
We use the term conglomeration to refer to the process of combining local
interpretations whether at the feature level, or when combining connected
components of the interpretation graph. One result of conglomeration should
be a new restriction node which is more restrictive than all the restriction nodes
associated with the interpretations conglomerated. Of course it should be the
least restrictive of such restriction if possible.
Somehow the system has to decide whether quantifiers with the same name
in two local matches refer to the same physical quantity. In an earlier paper
[17] we proposed that the user should include such information explicitly in the
geometric models. However, this information is actually implicitly available
elsewhere, and so we have developed a new scheme whereby the system
decides itself from class rather than geometric considerations. As described
above each prediction graph is associated with a particular user-supplied
restriction node, which describes the class of objects predicted by the graph. In
conglomerating submatches to the prediction graph, the system assumes that
only quantifiers which are constrained in that restriction node refer to unique
physical quantities. Therefore they are the only quantifiers retained in the
conglomeration restriction node.
The conglomeration restriction node is computed as follows. For each
restriction node to be conglomerated, a more general restriction node contain-
ing only quantifiers to be retained is computed. For this purpose algorithm INF
is used on all upper bounds in the normal constraint form and SUP on all lower
bounds. In both cases the set H is the set of quantifiers to be retained, so all
others are eliminated from the bounding expressions (see Section 3.3.4). Then
the lattice infimum of these new restriction nodes is calculated. If it is
unsatisfiable, then the local interpretations associated with each of the
restriction nodes are mutually inconsistent.
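A sketch of conglomeration follows, with the lattice and bounding operations passed in as stand-ins for the CMS machinery described in Section 3; the function decomposition is ours.

```python
def conglomerate(restriction_nodes, retained_quantifiers,
                 generalize, lattice_infimum, satisfiable):
    """generalize(r, q) eliminates all quantifiers outside q from r's bounds
    (INF on upper bounds, SUP on lower bounds); lattice_infimum combines
    restriction nodes; satisfiable is the (partial) decision procedure."""
    generalized = [generalize(r, retained_quantifiers)
                   for r in restriction_nodes]
    combined = lattice_infimum(generalized)
    if not satisfiable(combined):
        return None    # the local interpretations are mutually inconsistent
    return combined
```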
An alternative to eliminating quantifiers is to rename them, so that
quantifiers referring to unique physical quantities have unique names. The
advantage to this is that the current scheme of removing quantifiers leads to a
weaker conglomeration restriction node which conceivably (but with very low
probability) will allow an inconsistent interpretation to pass later in inter-
pretation. By renaming quantifiers no information is thrown away, so no later
errors can be introduced by the conglomeration process. The disadvantage is
that the number of quantifiers and bounding expressions tends to grow, making
the higher levels of interpretation run roughly exponentially slower in the
number of component interpretations. We feel that the advantages of renaming
are small and the disadvantages great. Also by renaming variables inter-
pretation never proceeds to higher level abstractions, but is inherently always
carrying around baggage from lower level details. For instance suppose the
system has hypothesized a number of aircraft in an aerial view on an airfield,
and then combines these in a global interpretation of an airfield instance.
Without removing some variables from the conglomeration as is done in our
current scheme it would be forced to carry around the variables for, say, the
lengths of the engine pods of each aircraft. At best this is aesthetically
unpleasing. Worse, the increasing complexity of constraint sets overwhelms the
constraint manipulation system. In our current scheme, the individual inter-
pretations for the aircraft contribute knowledge derived about the rest of the
world from their local hypothesis, but then can be treated simply as atomic
aircraft instances--a higher level abstraction.
At this stage of interpretation we now have hypothesized connected com-
ponents of the interpretation graph. These may be complete components, in
that they have instances of all predicted arcs and nodes, or they may only be
partial (e.g. an interpretation may correspond to an aircraft except that no
feature was found corresponding to the port wing). With each component is a
restriction node which describes the constraints on the three dimensional world
implied by accepting that hypothesis. A combinatorial search is now carried out
to find consistent connected components. Essentially this is done by deciding
whether the restriction node produced as the conglomeration of the component
restriction nodes is satisfiable. Conglomeration can also add constraints (equal-
ities) on quantifiers used to describe variable numbers of subparts (e.g. the
variable numbers of flanges on the electric motor in Section 2). These con-
straints too, of course, must be consistent with all the conglomerated restriction
nodes.
Eventually then, a number of interpretation graphs may be hypothesized. In
general, some will be large and mostly correct interpretation graphs and the
others will be small, consisting of individual incorrect interpretations of parts of
the image. The large graphs will be very similar in gross aspects but may differ
locally where they have accepted slightly different local interpretations. A
single interpretation can be synthesized from the gross similarities. Our
experience with our earlier interpretation algorithms suggests that the number
of large interpretation graphs will typically be on the order of less than five and
most likely only one or two. A large correct interpretation graph has associated
with it a restriction node which specializes both object models and their spatial
relations to the three dimensional understanding of the world derived from the
feature prediction hypothesized matches in the interpretation graph. Other
restriction nodes associated with components of the total interpretation may
contain extra three dimensional information pertinent to the appropriate local
interpretation.
A final aspect of this scheme for interpretation is the ease with which
subclass identification can be carried out once class identification has been
achieved. Suppose we have an interpretation of a set of image features as an
electric motor (see Section 2 for the subclass definitions of this example).
Associated with that interpretation is a restriction node. We can immediately
check whether the interpretation is consistent with the object being an instance
of some subclass of electric motors, e.g. carbonator motor, by taking the lattice
infimum of the subclass restriction node and the interpretation restriction. If
the infimum is unsatisfiable, then the object cannot be an instance of the
subclass. If no inconsistency is found for several subclasses, but those sub-
classes themselves are inconsistent (i.e. the lattice infimum of their restriction
nodes is known to be unsatisfiable), then perhaps prediction and search for
finer features of the object must be carried out to resolve the classification.
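Schematically, the subclass check is just the following (a sketch, reusing the stand-in lattice operations above):

```python
def consistent_subclasses(interpretation_restriction, subclass_restrictions,
                          lattice_infimum, satisfiable):
    """Return the subclasses whose restriction nodes are not known to be
    inconsistent with the interpretation's restriction node."""
    return [name for name, restriction in subclass_restrictions.items()
            if satisfiable(lattice_infimum([interpretation_restriction,
                                            restriction]))]
```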

6. Conclusion
We have concentrated on the predictive aspects of vision in this paper and
indeed in the ACRONYM system as a whole. This is not to say that descriptive
processes are not vitally important for robust and accurate vision. Rather, we
are investigating the question of how to use models independently of particular
descriptive processes which may eventually be available.
In investigating the use of models for vision we have found that many of the
requirements for the modelling and spatial understanding system are exactly
those needed in other areas of motor-sensory functions. The same models and
geometric reasoning capabilities are extremely useful for robot mobility and
manipulation. We have derived techniques to automatically deduce three
dimensional information from descriptions of monocular images in a general
way.
The particular class representation is not universal. We have shown,
however, how to use classes of models for understanding images. A more
general representation of classes, e.g. inclusion of disjunctions in constraints,
would require an upgrade of the various computing engines described (e.g. the
constraint manipulation system, and the geometric reasoning system).
However, the interaction of these parts of the system could still operate in
much the same manner.
Finally notice that there is no notion of assigning probabilities to local or
global interpretations, nor is there any underlying statistical model. ACRONYM
only 'labels' parts of an image for which it can find a globally consistent
interpretation.

ACKNOWLEDGMENT
Much of this work, especially that of Sections 2 and 5 has been carried out under close advice
from my thesis advisor Thomas Binford.

REFERENCES
1. Abraham, R.G., Csakvary, T., Korpela, J., Shum, L., Stewart, R.J.S. and Taleff, A., Program-
mable assembly research technology: Transfer to industry, 4th Bi-Monthly Report, NSF Grant
ISP 76-2416,1, Westinghouse R&D Center, Pittsburgh, June 1977.
2. Agin, G. J., Representation and description of curved objects, Memo AIM 173, Stanford
University AI Lab (1972).
3. Ambler, A.P. and Popplestone, R.J., Inferring the positions of bodies from specified spatial
relationships, Artificial Intelligence 6 (1975) 175-208.
4. Baer, A., Eastman, C. and Henrion, M., A survey of geometric modeling, CMU Institute of
Physical Planning, Research Rept. No. 66 (1977).
5. Baker, H.H., Edge based stereo correlation, Proceedings ARPA Image Understanding Work-
shop, College Park, MD (1980) 168-175.
6. Barrow, H.G. and Tenenbaum, J.M., MSYS: A system for reasoning about scenes, SRI AI
Center, Tech. Note 121 (1976).
7. Barrow, H.G. and Tenenbaum, J.M., Recovering intrinsic scene characteristics from images,
in: A. Hanson and E. Riseman, Eds., Computer Vision Systems (Academic Press, New York,
1978).
8. Baumgart, B.G., Geometric modeling for computer vision, Memo AIM 249, Stanford Uni-
versity AI Lab, (1974).
9. Binford, T.O., Visual perception by computer, invited paper at IEEE Systems Science and
Cybernetics Conference, Miami, Dec. 1971.
10. Binford, T.O., Computer integrated assembly systems, Proceedings NSF Grantees Conference
on Industrial Automation, Cornell Univ., Ithaca, Sep. 1979.
11. Bledsoe, W.W., The sup-inf method in Presburger arithmetic, Memo ATP 18, Dept. of Math.
and Comp. Sci., University of Texas at Austin, Austin, Texas (1974).
12. Bledsoe, W.W., A new method for proving certain Presburger formulas, Proceedings of IJCAI
4, Tbilisi, Georgia, U.S.S.R. (1975) 15-21.
13. Bobrow, D.G., Natural language input for a computer problem solving system, in: M.L.
Minsky, Ed., Semantic Information Processing (MIT Press, Cambridge, MA, 1968).
14. Bobrow, D.G. and Winograd, T., An overview of KRL, a knowledge representation language,
Cognitive Sci. 1 (1977) 3-46.
15. Borning, A., THINGLAB: A constraint-oriented simulation laboratory, Stanford CS Report,
STAN-CS-79-746 (July 1979).
16. Braid, I.C., Designing With Volumes (Cantab Press, Cambridge, England, 1973).
17. Brooks, R.A., Goal-directed edge linking and ribbon finding, Proceedings ARPA Image
Understanding Workshop, Menlo Park, CA (1979) 72-76.
18. Brooks, R.A., Greiner, R. and Binford, T.O., The ACRONYM model-based vision system,
Proceedings IJCAI 6, Tokyo (1979) 105-113.
19. Brooks, R.A. and Binford, T.O., Representing and reasoning about partially specified scenes,
Proceedings ARPA Image Understanding Workshop, College Park, MD (1980) 95-103.
20. Fikes, R.E., Ref-ARF: A system for solving problems stated as procedures, Artificial
Intelligence 1 (1970) 27-120.
21. Garvey, T.D., Perceptual strategies for purposive vision, SRI AI Center, Tech. Note 117
(1976).
22. Goldman, R., Recent work with the AL system, Proceedings IJCAI 5, Cambridge (1977)
733-735.
23. Grimson, W.E.L., Aspects of a computational theory of human stereo vision, Proceedings
ARPA Image Understanding Workshop, College Park, MD (1980) 128--149.
24. Grossman, D.D., Monte Carlo simulation of tolerancing in discrete parts manufacturing and
assembly, Memo AIM 280, Stanford University AI Lab (1976).
25. Hollerbach, J., Hierarchical shape description of objects by selection and modification of
prototypes, Tech. Rept. AI-TR-346, M1T, Cambridge (1975).
26. Horn, B.K.P., Obtaining shape from shading information, in: P.H. Winston, Ed., The Psy-
chology of Computer Vision (McGraw-Hill, New York, 1975).
27. de Kleer, J. and Sussman G.J., Propagation of constraints applied to circuit synthesis, Memo
AIM 485, MIT, Cambridge (1978).
28. Lieberman, L., Model-driven vision for industrial automation, in: P. Stucki, Ed., Advances in Digital Image Processing: Theory, Application, Implementation (Plenum Press, New York, 1979).
29. Lowe, D., Solving for the parameters of object models from image descriptions, Proceedings ARPA Image Understanding Workshop, College Park, MD (1980) 121-127.
30. Lozano-Pérez, T., The design of a mechanical assembly system, Tech. Rept. AI-TR-397, MIT, Cambridge (1976).
31. Lozano-Pérez, T. and Wesley, M.A., An algorithm for planning collision-free paths among polyhedral obstacles, Comm. ACM 22 (1979) 560-570.
32. Marr, D., Visual information processing: The structure and creation of visual representations, Proceedings IJCAI 6, Tokyo (1979) 1108-1126.
33. Marr, D. and Hildreth, E., Theory of edge detection, Memo AIM 518, MIT, Cambridge (1979).
34. Marr, D. and Nishihara, H.K., Representation and recognition of the spatial organization of three-dimensional shapes, Memo AIM 377, MIT, Cambridge (1976).
35. McDermott, D., A theory of metric spatial inference, Proceedings of the First Annual National Conference on Artificial Intelligence, Stanford (1980) 246-248.
36. Michie, D., Memo functions: A language feature with rote-learning properties, Proceedings IFIP, 1968.
37. Miyamoto, E. and Binford, T.O., Display generated by a generalized cone representation, IEEE Conference on Computer Graphics and Image Processing, May 1975.
38. Nevatia, R. and Ramesh Babu, K., Linear feature extraction and description, Comput. Graphics and Image Processing 13 (1980) 257-269.
39. Nevatia, R. and Binford, T.O., Description and recognition of curved objects, Artificial Intelligence 8 (1977) 77-98.
40. Ohta, Y., Kanade, T. and Sakai, T., A production system for region analysis, Proceedings IJCAI 6, Tokyo (1979) 684-686.
41. Shapiro, L.G., Moriarty, J.D., Mulgaonkar, P.G. and Haralick, R.M., Sticks, plates, and blobs: A three-dimensional object representation for scene analysis, Proceedings of the First Annual National Conference on Artificial Intelligence, Stanford (1980) 28-30.
42. Shostak, R.E., On the sup-inf method for proving Presburger formulas, J. Assoc. Comput. Mach. 24 (1977) 529-543.
43. Soroka, B.I., Understanding objects from slices: Extracting generalised cylinder descriptions from serial sections, Tech. Rept. TR-79-1, Dept. of Computer Science, University of Kansas, Lawrence (1979).
44. Soroka, B.I., Debugging manipulator programs with a simulator, to be presented at CAD/CAM 8, Anaheim, Nov. 1980.
45. Staff, An introduction to PADL: characteristics, status, and rationale, Production Automation Project, Tech. Memo TM-22, University of Rochester, Rochester (1974).
46. Stallman, R. and Sussman, G.J., Forward reasoning and dependency-directed backtracking in a system for computer-aided circuit analysis, Artificial Intelligence 9 (1977) 135-196.
47. Sugihara, K., Automatic construction of junction dictionaries and their exploitation for the analysis of range data, Proceedings IJCAI 6, Tokyo (1979) 859-864.
48. Taylor, R.H., A synthesis of manipulator control programs from task-level specifications, Memo AIM 282, Stanford University AI Lab (1976).
49. Waltz, D., Understanding line drawings of scenes with shadows, in: P.H. Winston, Ed., The Psychology of Computer Vision (McGraw-Hill, New York, 1975).
50. Winston, P.H., Learning structural descriptions from examples, in: P.H. Winston, Ed., The Psychology of Computer Vision (McGraw-Hill, New York, 1975).
51. Woodham, R.J., Relating properties of surface curvature to image intensity, Proceedings IJCAI 6, Tokyo (1979) 971-977.

Received November 1980
