
Advances in Computer Vision and Pattern Recognition

Eckart Michaelsen
Jochen Meidow

Hierarchical Perceptual Grouping for Object Recognition
Theoretical Views and Gestalt-Law Applications

Advances in Computer Vision and Pattern Recognition

Founding editor
Sameer Singh, Rail Vision, Castle Donington, UK

Series editor
Sing Bing Kang, Microsoft Research, Redmond, WA, USA

Advisory Board
Horst Bischof, Graz University of Technology, Austria
Richard Bowden, University of Surrey, Guildford, UK
Sven Dickinson, University of Toronto, ON, Canada
Jiaya Jia, The Chinese University of Hong Kong, Hong Kong
Kyoung Mu Lee, Seoul National University, South Korea
Yoichi Sato, The University of Tokyo, Japan
Bernt Schiele, Max Planck Institute for Computer Science, Saarbrücken, Germany
Stan Sclaroff, Boston University, MA, USA
More information about this series at http://www.springer.com/series/4205
Eckart Michaelsen · Jochen Meidow

Hierarchical Perceptual Grouping for Object Recognition
Theoretical Views and Gestalt-Law Applications
Eckart Michaelsen
Fraunhofer IOSB
Ettlingen, Baden-Württemberg, Germany

Jochen Meidow
Fraunhofer IOSB
Ettlingen, Baden-Württemberg, Germany

ISSN 2191-6586  ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-3-030-04039-0  ISBN 978-3-030-04040-6 (eBook)
https://doi.org/10.1007/978-3-030-04040-6

Library of Congress Control Number: 2018960737

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Around the year 2008, I realized that much of what we had published as
knowledge-based methods for image analysis was actually perceptual grouping.
These perceptual grouping rules turned out to be more robust than the actual
automatic knowledge-utilization part. Moreover, the same constructions were
needed over and over again, across many sensing modalities and tasks. One main
source of malfunction in rule-based systems was the threshold parameters: Should
two straight lines count as parallel if their orientation deviation is less than ten
degrees? Or rather five degrees? It became evident that such hard thresholds should
be replaced by soft assessment functions.
At the International Conference on Pattern Recognition 2012 in Tsukuba, I
discussed the issue with Vera Yashina of the algebraic branch of the pattern
recognition community of the Russian Academy of Sciences. We agreed that such
an approach is not really a syntactic approach anymore; it is an algebraic formulation:
the Gestalt algebra. The year 2012 happened to bring a major upheaval in pattern
recognition and machine vision. It was realized that deep learning utilizing
convolutional neural networks yields superior performance on object recognition from
imagery. Almost nobody in the community seemed to like those machines with
their vast number of parameters, but the facts could not be ignored. In the few years
that have passed since 2012, this neural network approach has been adapted to
almost any task in machine perception and artificial intelligence with remarkable
success. So aren't perceptual grouping utilizing Gestalt laws and knowledge-based
machine inference outdated topics?
Neural network approaches existed before 2012. Their superior performance
nowadays results from the amounts of training data now at hand and from
the advances in computing machinery. Still, anything that need not be learned,
because it is already known, helps in concentrating these precious resources on
learning the unknown things. The laws of seeing have been known for more than one
hundred years. Seeing need not be learned; it can be coded by implementing these
laws in computing machinery. Enough knowledge about this topic has been
published in numerous papers, and also in several very recommendable textbooks.
Why, then, yet another book on Gestalt laws?


Because the aspect of hierarchical grouping has hardly been treated in the
existing literature. For example, a window sash may be made of a lattice of 12 small
sub-windows; two such sashes make a reflection-symmetric window aggregate;
several of these are repeated as a frieze on a facade; and the building on which
the facade is seen is repeated along a road. It is much more likely that we, or our
machines, encounter images containing such deep hierarchies through the scales
than that the images contain only random noise and clutter. The Gestalt algebra has
been deliberately designed for such hierarchical patterns.
When asked to write a textbook on this topic, I realized that expertise in
probability calculus, least squares estimation, and projective geometry would be
needed, and I asked Jochen Meidow to join in. Together we revised the operations
of Gestalt algebra and present them in the volume at hand. For each such Gestalt
operation, there is a separate chapter, containing the definition, as well as examples
of application, and some brief review of the corresponding literature. The most
important chapter is the algebraic closure chapter, where all operations can par-
ticipate in the construction of hierarchies of such aggregates. But the book would
not be complete without a chapter connecting the method to the data—i.e., a chapter
on the extraction of primitives from pictures, a chapter on the cooperation with
machine-readable knowledge, and a chapter on cooperation with machine learning.
The book is intended for students, researchers, and engineers active in machine
vision. We hope that the field may benefit from our methods and that some of our
proposals may help to develop and improve future seeing machines. We thank the
management of the Fraunhofer Institute of Optronics, System Technologies and
Image Exploitation IOSB in Ettlingen, Germany, for facilitating the work on this book
as an ancillary activity alongside the day-to-day business.

Ettlingen, Germany
September 2018

Eckart Michaelsen
Contents

1 Introduction  1
  1.1 Examples of Pictures with Hierarchical Gestalt  1
  1.2 The State of the Art of Automatic Symmetry and Gestalt Recognition  5
  1.3 The Gestalt Domain  11
  1.4 Assessments for Gestalten  14
  1.5 Statistically Best Mean Direction or Axis  18
  1.6 The Structure of this Book  19
  References  21
2 Reflection Symmetry  23
  2.1 Introduction to Reflection Symmetric Gestalten  23
  2.2 The Reflection Symmetry Constraint as Defined for Extracted Primitive Objects  25
  2.3 Reformulation of the Constraint as a Continuous Score Function  27
  2.4 Optimal Fitting of Reflection Symmetry Aggregate Features  29
  2.5 The Role of Proximity in Evidence for Reflection Symmetry  31
  2.6 The Role of Similarity in Evidence for Reflection Symmetry and How to Combine the Evidences  33
  2.7 Nested Symmetries Reformulated as Successive Scoring on Rising Scale  35
  2.8 Clustering Reflection Symmetric Gestalten with Similar Axes  41
  2.9 The Theory of A Contrario Testing and its Application to Finding Reflection Symmetric Patches in Images  46
  2.10 The Minimum Description Length Approach for Nested Reflection Symmetry  48
  2.11 Projective Symmetry  48
  References  50
3 Good Continuation in Rows or Frieze Symmetry  53
  3.1 Related Work on Row Gestalt Grouping  55
  3.2 The Row Gestalt as Defined on Locations  56
  3.3 Proximity for Row Gestalten  58
  3.4 The Role of Similarity in Row Gestalten  59
    3.4.1 Vector Features  60
    3.4.2 Scale Features  62
    3.4.3 Orientation Features  63
  3.5 Sequential Search  64
    3.5.1 The Combinatorics of Row Gestalten  64
    3.5.2 Greedy Search for Row Prolongation  65
  3.6 The A Contrario Approach to Row Grouping  67
  3.7 Perspective Foreshortening of Rows  67
  References  69
4 Rotational Symmetry  71
  4.1 The Rotational Gestalt Law as Defined on Locations  72
  4.2 Fusion with Other Gestalt Laws  75
    4.2.1 Proximity Assessments for Rotational Gestalten  75
    4.2.2 Similarity Assessments for Rotational Gestalten  77
  4.3 Search for Rotational Gestalten  78
    4.3.1 Greedy Search for Rotational Gestalten  78
    4.3.2 A Practical Example with Rotational Gestalten of Level 1  79
  4.4 The Rotational Group and the Dihedral Group  82
  4.5 Perspective Foreshortening of Rotational Gestalts  82
  References  84
5 Closure—Hierarchies of Gestalten  85
  5.1 Gestalt Algebra  86
  5.2 Empirical Experiments with Closure  90
  5.3 Transporting Evidence through Gestalt Algebra Terms  92
    5.3.1 Considering Additional Features  93
    5.3.2 Propagation of Adjustments through the Hierarchy  95
  References  100
6 Search  101
  6.1 Stratified Search  101
  6.2 Recursive Search  102
  6.3 Monte Carlo Sampling with Preferences  103
  6.4 Any-time Search Using a Blackboard  104
  References  105
7 Illusions  107
  7.1 Literature about Illusions in Seeing  107
  7.2 Deriving Illusion from Top-down Search  108
  7.3 Illusion as Tool to Counter Occlusion  108
  References  109
8 Prolongation in Good Continuation  111
  8.1 Related Work on Contour Chaining, Line Prolongation, and Gap Filling  112
  8.2 Tensor Voting  112
  8.3 The Linear Prolongation Law and Corresponding Assessment Functions  116
  8.4 Greedy Search for Maximal Line Prolongation and Gap Closing  121
  8.5 Prolongation in Good Continuation as Control Problem  121
  8.6 Illusory Contours at Line Ends  123
  References  125
9 Parallelism and Rectangularity  127
  9.1 Close Parallel Contours  127
  9.2 Drawing on Screens as Graphical User Interface  129
  9.3 Orthogonality and Parallelism for Polygons  130
  References  133
10 Lattice Gestalten  135
  10.1 Related Work on Lattice Grouping  136
  10.2 The Lattice Gestalt as Defined on Locations  136
  10.3 The Role of Similarity in Lattice Gestalt Grouping  138
  10.4 Searching for Lattices  139
  10.5 An Example from SAR Scatterers  141
  10.6 Projective Distortion  143
  References  143
11 Primitive Extraction  145
  11.1 Threshold Segmentation  146
  11.2 Super-Pixel Segmentation  148
  11.3 Maximally Stable Extremal Regions  150
  11.4 Scale-Invariant Feature Transform  152
  11.5 Multimodal Primitives  154
  11.6 Segmentation by Unsupervised Machine Learning  154
    11.6.1 Learning Characteristic Colors from a Standard Three Bytes Per Pixel Image  155
    11.6.2 Learning Characteristic Spectra from a Hyper-Spectral Image  156
  11.7 Local Non-maxima Suppression  159
  References  161
12 Knowledge and Gestalt Interaction  163
  12.1 Visual Inference  163
  12.2 A Small Review on Knowledge-Based Image Analysis  166
  12.3 An Example from Remotely Sensed Hyper-spectral Imagery  169
  12.4 An Example from Synthetic Aperture RADAR Imagery  171
  References  173
13 Learning  175
  13.1 Labeling of Imagery for Evaluation and Performance Improvement  175
  13.2 Learning Assessment Weight Parameters  178
  13.3 Learning Proximity Parameters with Reflection Ground Truth  179
  13.4 Assembling Orientation Statistics with Frieze Ground Truth  181
  13.5 Estimating Parametric Mixture Distributions from Orientation Statistics  183
  References  187
Appendix A: General Adjustment Model with Constraints  189
Index  193
Notations

Assessment Functions
a_φ  Assessment function w.r.t. orientation
a_s  Assessment function w.r.t. scale
a_d  Assessment function w.r.t. proximity (distance)
a_f  Assessment function w.r.t. periodicity
a    Assessment of a Gestalt

Objects and Sets
g     Object, e.g., a Gestalt
G, S  Sets of Gestalten
X     Set of two-dimensional points

Features
φ  Orientation (direction)
e  Elongation
f  Periodicity
s  Size or scale
x  Coordinates of a two-dimensional point (x, y)

Constants and Parameters
t  Threshold
…  Constant
κ  Parameter of the von Mises distribution
Chapter 1
Introduction

Images, as they occur in our everyday life as well as in many technical and scientific
applications, often contain hierarchical arrangements of parts and aggregates. It is
likely that certain contents are repeated with high similarity within one image [1].
Such repetitions follow certain mappings, e.g., reflection, fixed repetitive translation,
or rotation. Thus patterns are ubiquitous in the pictorial data around us. And before
the term “pattern recognition” found its technical use in the scientific community, it
had a common-sense meaning: the unveiling of geometrical and hierarchical image
structure by the human observer. Those concerned with the topic were aware of
strong analogies with the perception of language or music, and wrote books with
titles such as “Picture Languages” [2].
Long before computers were at hand, psychologists had already opened the topic of
perceptual organization of patterns and parts. Important publications on the issue,
e.g., [3], were written in German, using German terms such as “Gestalt”.
Laws of grouping were found, and it became evident that these are a key to the seeing
of objects, the discrimination of background and clutter from objects of interest, and
the simplification of the visual stimulus without loss of meaning.
Before the technical issues are discussed in the following chapters, we motivate
our view on machine Gestalt perception by looking at some example images.

1.1 Examples of Pictures with Hierarchical Gestalt

The most interesting visual stimuli for a human observer are other human subjects.
Therefore, next to portraits, group pictures are among the oldest and most important
genres of photography. Figure 1.1 displays an example. The persons are aligned
in horizontal rows. A strong reflective symmetry is perceived.

Fig. 1.1 A typical group picture; image source FC Germania 07 Untergrombach

However, certain
difficulties which will be discussed in the technical chapters below already become
evident from this example:

• The natural reflection symmetry of man—in particular when the full figure is seen
in frontal view, such as with the central person in the front row—is broken for
most of the men. Symmetry breaking is sometimes intentionally used by artists
and designers in an attempt to make their products more eye-catching. This can
be a major obstacle for machine Gestalt perception.
• The “semantic” rows of this picture that human observers would naturally use when
referring to a specific person in a phrase like “the third from the left in the second
row” do not correspond to visual rows. Visually, based on the laws of proximity and
good continuation, diagonal rows of three faces each are more salient. These then
form a horizontal row of diagonals. Such phenomena constitute a major obstacle in
constructing proper ground truth for visual salience. The observer subjects should
distinguish between their object recognition and scene understanding (which is
not the topic here) and the pure perceptual grouping (see Sect. 13.1).
• Human observers will concentrate most of their attention on the faces. This could
be quantitatively verified by the use of eye trackers. Faces in frontal view are one
example of reflection symmetry, and the decomposition of this picture into rows
of reflection symmetric patches is probably the most valid Gestalt description
concerning visual salience. However, there is more repetitive structure in the image,
such as the stairs on which the persons stand and which are visible to the left and
right. Human observers would tend to omit these, because they can hardly switch
off their object recognition. It is hard to find a set of images that contains no such
objects, no faces, no animals, no cars, no facades, etc., and is still representative
for our visual surroundings. Such a set would be needed in order to separate Gestalt
perception from object recognition in the evaluation of machine vision.
• Gestalt grouping may construct relations between patterns that are almost arbitrar-
ily far away from each other. In this example the leftmost and the rightmost men
are obviously in correspondence. Image processing that is not locally bounded
will cause high computational effort (see Chap. 6).
• Occlusion is a major obstacle. Due to their team outfit, the men would be very
similar to each other, but only the ones in the front are completely visible. Only
a small portion of the Gestalt of the men standing in the middle or back rows is
visible at all. Moreover, many contours that outline the figures in the front row are
missing, because dark foreground yields low contrast on dark background.
• The ball in the image center may be considered as a good example for rotational
symmetry of order three (as treated in Chap. 4). It is self-similar with respect
to 120◦ rotations, but has no reflection symmetries. This is either the result of design
intention or a rather unlikely coincidence. The classical football features icosahedral
symmetry, with twelve pentagons and twenty hexagons—a perfect rotational 3D symmetry.
Certain parallel projections of it feature 2D rotational symmetry of order five or
three. The classical design with black pentagons and white hexagons also features
many reflection symmetries in these particular projections. However, an arbitrary
projection is probably not symmetric at all. In the general case, projection of a
3D scene into a 2D picture destroys the symmetries. This particular ball breaks
reflection symmetry—again probably by intention of the designer. Its 3D symmetry
may also be something derived from the tetrahedron. Identical 2D symmetries may
be obtained by projecting different 3D symmetries.
• The ball is oriented so that the rotational symmetry is preserved—up to a certain
precision. Salient symmetry in images is always a smooth concept. This ball
projection is fairly, but not perfectly, symmetric. Projections of such an object are almost never
symmetric in the strict mathematical sense (the probability of occurrence is zero).
The same holds for projections of reflection symmetric objects like the man in the
center. Treating such fuzziness by intervals with heuristic thresholds is doomed to
fail.

Our technical civilization produces imagery to which the human visual system has
not been adapted by nature. An example is remotely sensed imagery of the planet’s
surface. Figure 1.2 displays an example. Such a nadir view from above may be natural
for birds, but it has been accessible to humans for only a little more than a century. Yet,
the laws of Gestalt grouping aid the analysis of such imagery to a large extent.
Humans—even untrained random subjects—often still outperform automatic land-
use classification, and trained experts are capable of almost magical unveiling of
hidden information.
The difficulties and challenges with perceptual grouping on such imagery are
similar to the ones mentioned above. Details are discussed in the technical chapters.
However, here a preview is given with this example at hand:
Fig. 1.2 Aerial image of urban terrain in Ettlingen, Germany, oriented to North, image source
Google Earth

• Many hierarchical Gestalt groups are immediately evident, such as the East–West
reflection symmetry between the two large buildings to the North (top) of the
image, which are reflection symmetric on their own (with slightly tilted axes), and
decomposed into wings with symmetric roofs, etc. Repetition in rows is also very
salient in this image: the trees along the roads, triplets of houses, cars aligned in
parking lots, repetitive structures on the roofs, etc. There is a strong preference for
parallelism and orthogonality. Rows are often oriented parallel to each other or to
other linear structures.
• This image was captured in the visual spectral domain under sun lighting from
Southeast. Thus, the measured intensities highly depend on the angles between the
surface normals and the direction to the sun. This breaks the symmetry. While the
buildings are designed in perfect symmetry, they do not appear with symmetric
intensities.
• Large buildings are likely to have mirror symmetry in 3D, i.e., reflection planes.
In a general perspective projection this is unlikely to be preserved. However, the
symmetry plane is likely oriented orthogonal to the ground plane. Then it will
be preserved in nadir-looking views, and these are more frequently used in aerial
mapping than oblique views. All terrestrial images are more or less restricted to
the ground plane (depending on the body height of the photographer). There is an
intersection line between the ground plane and the symmetry plane. Taking the
picture of the building from a point on this line will give a good and characteristic
view of the building. The facade was made for such a viewpoint; this is the
perspective that was drawn before it was erected in 3D, and it will be preferred
(intentionally or not) by any photographer who has the task of taking a picture of
it.
• The interior organization of the building often has similar reflection symmetries
and repetitions. These are lost in such pictures. Due to occlusion, almost only the
roof structures and symmetries are visible to the bird’s eye. Interior symmetries are
explicit in schemata and plans, i.e., drawings and CAD-rendered screen images.
On these, the symmetries are instantaneously seen by the human grouping system
in just the same way as in any other pictorial mode.
• Buildings and urban structures have a deeper hierarchy over a larger domain of
scale than other genres of human artifacts. Here we have a reflection symmetric
arrangement of buildings at a scale of several hundred meters. The building parts
that are arranged in each building according to sophisticated symmetry laws are in
the scale of several dozens of meters. Reflection symmetric windows come in long
rows on the facades, and inside the windows are separated in parts in the decimeter
scale. The same scale is relevant for the roof tiles, which are spread over the roofs
in regular lattices with a lot of members.

In Sect. 1.3 we will give a mathematical domain, in which such patterns can be
described. Throughout the book elements of this domain will be the topic. Most
chapters will specialize in a particular perceptual law. In this book such laws are
defined as operations working on elements of the Gestalt domain. In the next section,
however, the state of the art will be discussed.

1.2 The State of the Art of Automatic Symmetry and Gestalt Recognition

A somewhat unusual alternative view on symmetry perception is presented by Leyton
in his controversial 1992 book [4]. There perception is always inference from
detectable traces on untouched homogeneous ground, i.e., symmetry-breaking
distortions on a symmetric background that does not give any information on its own. While
most other authors state that symmetric arrangements are perceived as foreground,
Leyton almost claims the opposite. For him, homogeneous background is symmetric,
and breaking the symmetry causes perception. For example, from scratches on a wall
we infer that something must have been moving along it, and we perceive footprints
in the snow and infer that someone has walked there.
Leyton lists the shape-from-x methods that were state of the art of machine vision
in those days and shows plausibly how these can be understood as inference of past
events that caused asymmetry in the percepts. He develops a terminology and sets
certain key assertions in it, establishing what he calls revolutionary machine vision
as a counter-approach to standard machine vision. Today this revolution appears to
have failed. Yet this book should not be forgotten. Many of its statements seem to
contradict the view and results presented in this book, or other related work. However,
it is sound in itself, and it allows the reader to take a complete different perspective
on the same issue.
The contradictions appearing here may be tempered or explained by the different
terminologies. “Symmetry” in Leyton’s book is not what it is in this book. The traces
on the wall, and the footprints in the snow, would probably be arranged in good
continuation and repetition. They would follow Gestalt laws. Thus, we would see
them as symmetric and explain their salience by such laws. We would not consider
the white uniform background at all, and it is not symmetric in our view. Another
thing we do not care about is causality. We leave the inference to other levels of the
perception system, e.g., to a knowledge-based interpreter. In Chap. 12 we explain
how our perceptual grouping system may interact with such inference machines. As
far as we are familiar with them they would regard the kind of inferences that Leyton
draws from traces, scratches, prints on otherwise perfect ground as abductive: A foot
set on snow will cause a footprint. Therefore, it can be inferred that if there is no
footprint in the snow, nobody can have walked there. The other way round, from
the present prints to the past walker, is not a sound deductive inference. Thus, the
meaning of “causality” and “time” is also a bit different in Leyton’s book.
A typical example of approaches to the topic from the psychology community is
given by the book of Pizlo and his group [5]. It claims to take an engineering stand-
point, giving a contribution to a handbook on how to construct seeing machines.
However, the model is clearly given by human vision, and evidence on how it
works is drawn from diligent and extensive psychological experiments. Pizlo criti-
cizes that too many such experiments have used oversimplified patterns on the screen
presented to the subjects, in particular, that a large portion of the prior work used
only dots. This has some tradition since the days of Wertheimer, and in the chapters
below numerous such dot figures are presented also in this book.
Pizlo prefers line drawings that are projections of 3D polyhedrons (including
hidden line occlusion handling). Abundant evidence is given that humans rely on
reflection symmetry as prior, when reconstructing previously unseen objects. Among
the infinitely many polyhedrons that project to the very same 2D line drawing, human
subjects instantaneously see the symmetric body. Actually, in many cases still a
continuum of different symmetric 3D polyhedrons may project to the same image.
Pizlo and his group established evidence that then a certain compactness serves as
additional prior yielding again a unique perception. In the end nobody sees such
drawings as what they are: sets of lines on a 2D plane. Every human being sees
symmetric and possibly compact 3D bodies.
Pizlo emphasizes that to humans 3D reflection symmetry is most salient, i.e.,
reflection with respect to a mirror plane in 3D space. This plane may project to a
mirror axis in the image, and then the image will inherit the reflection symmetry.
Pizlo argues that this happens almost never. He calls such projections “degenerate
views”. He means this “almost” in an almost mathematical, i.e., measure-theoretic,
sense. Picking a particular singular point from a continuous interval
under uniform density assumption has probability zero, because the measure of a
single element subset is zero. Thus, he comes to the conclusion that constructing a
machine that can only detect 2D reflection symmetry is a waste of time, because an
appropriate input image will never come.
If this were true, how could we dare to present such methods in most of Chap. 2? Well,
Pizlo himself admits that the probability might actually be a little bigger than zero
because the set of receptors in the retina is finite. We add that all the proposed methods
include some tolerances. For instance, most simple practical implementations would
use accumulators with a certain bin granulation. This will raise the true positive rate
substantially. We avoid hard bin margins or thresholds, and use soft membership
functions which are called “assessments” throughout this book. Something like 10°
off the degenerate view direction will be a problem neither for our approach nor for most
other state-of-the-art methods. Even under a uniformity assumption, an orientation
interval of ±10° yields a probability of more than 10% in a domain of 180°. That is
not never. Moreover, from the two example images given in Sect. 1.1 we concluded
that uniformity is violated in favor of symmetry preserving views.
We admit that skewed views make up a substantial part of the standard benchmark
image collections for symmetry recognition, such as [6], and that this may explain to
some degree the rates we achieved in those competitions. However, there is doubt that
view directions are uniformly distributed in the world of real pictures. In Sect. 1.1
we used a group picture and an aerial picture as example pictures. For both genres,
a slanted viewing direction would be degenerate, while the symmetry-preserving
viewing direction is standard. Whenever a visitor wants to take a tourist picture of a
major must-see attraction, such as a palace or a cathedral, or a selfie with this
background he or she would always try to use the spot where the perspective is in this
sense degenerate. In fact, you can virtually see the people clustering at corresponding
look-from spots in front of such attractions. If someone likes to take pictures of
butterflies, he or she will often move around the animal with the goal of getting the
most symmetric shot. An engineer would usually prefer such degenerate perspectives
in the schemes, drawings, and views of the objects and parts of concern. Even when
advertising some object, e.g., a car, people use maybe one general perspective view,
but the majority of the view directions will be degenerate. A skilled craftsman will
rotate the workpiece in his hand until the object–eye direction fits such a special view,
and the robot that might someday replace him will have to do the same thing. Indeed,
in industrial machine vision special views prevail vastly. In medical imaging most
methods give projections that preserve symmetry.
Of course we admit that skewed symmetries occur, and we add a section on
corresponding augmentations of our operations to each chapter, such as Sect. 2.11
at the end of the reflection symmetry part. And we do recommend studying [5] with
diligence. We particularly appreciate the deep knowledge of the literature on the
topic presented in it. Most people are not aware of how old the science of Gestalt
perception is. Pizlo refers to Alhazen’s work, which was published almost a thousand
years ago. One of his favorites is E. Mach who studied the topic and discussed his
findings in the mid-nineteenth century.
In computer graphics the symmetry topic is discussed more frequently. A good
review is given by Mitra et al. [7]. This includes the search for symmetries in given
data (measured or constructed). Usually the projection problem is circumvented by
directly working on 3D data, such as point clouds, polygon meshes, or NURBS.
The problems are similar to what is discussed in the book at hand. For instance,
for testing symmetry most often a segmentation is needed—a search in the set of
subsets. Stability is generally achieved by use of evidence accumulation. The machine
vision community can probably learn a lot from taking a look into the computer
graphics community when analyzing and utilizing symmetry. Mitra emphasizes that
symmetry is ubiquitous through the scales, from crystals to galaxies, in any world. Yet,
also in the graphics community, there seems to be little work that treats
hierarchies of symmetries, i.e., symmetric aggregates that are arranged in higher-
order symmetric aggregates on larger scale (and contain symmetric parts that may be
further decomposed). This is somewhat remarkable, because in computer graphics
the utilization of syntactic approaches has survived, whereas in machine vision it has
almost faded away.
To the best of our knowledge, the grammar of seeing of Kanizsa [8] has not been
translated. We still recommend this book, also for readers with little or no knowledge
of the Italian language. Most of the topic-specific terms are more or less Latin in both
English and Italian with moderate transformations. You will get used to it quickly. Of
course, also “Gestalt” is frequently found. The main point is that Kanizsa argues with
numerous fantastic and convincing drawings. During his long and fruitful teaching
and research days at the University of Trieste he collected a huge amount of empirical
evidence with representative sets of trial subjects. The corresponding quantitative
results are found as tables in the book. Evidence is given on the mutual preference of
Gestalt laws in figure/background seeing. Some results are surprising. Some results
also contradict the older classics such as Wertheimer [3]. In such situations,
more trust should be given to Kanizsa. His work is more diligent. Gestalt perception
research has a large overlap with fine arts, in particular architecture, drawing and
painting, and design. This aspect becomes most evident in Kanizsa’s work.
We understand the term “pattern recognition” more in its common-sense meaning
than in its technical meaning generally used in computer science. There is the
impression that the prevailing concentration on machine learning methods classifying
feature vectors lacks an important aspect of human pattern recognition, namely
the structural side of it. This impression was and is shared by many researchers. A distinguished
figure in this was Ulf Grenander. Avoiding double use of the term “pattern recog-
nition” he called his approach pattern theory, and the most important source is the
“general pattern theory” [9], an 850-page volume starting with the sentence “Reading
this book will require a determined effort”. Yet, studying this book brings important
insights.
The elements of the pattern domain are called generators. These have a finite
set of bonds, where other generators may connect to them. There is a table listing
admissible bondings. Here we may confine ourselves to the image generating and
analyzing cases. Then the topology is fixed to some regular pixel lattice, e.g., with
four neighbors for each node. Using the term configuration comes naturally for an
aggregate made of such generators in an admissible way. Grenander then defines
an algebraic structure on these configurations, namely certain equivalence relations
called identification rules. Thus, images come as equivalence classes of configura-
tions with respect to identifications. And, depending on the bond tables, repetitive
patterns may be contained in such images. In order to cope with real-world signals,
which are subject to uncertainty and noise, certain deformations are introduced.
Moreover, the hard Boolean constraints in the bond table are later softened to con-
ditional probabilities. A severe problem is the construction of the generators and
bond tables. Grenander warns his readers against hastily constructing them using
heuristics. In Chapter 19 of his book he instead gives a method for learning them from
given data. Dependencies between not directly connected nodes must be transported
through some connecting path. Thus the Gestalt operations presented in our book
can hardly be formulated in generators and bonding tables. Scale space and hierarchy
are missing. Also, psychological considerations do not play a major part in Grenan-
der’s pattern theory, and he hardly refers to the Gestalt literature. However, one
important member of his school is David Mumford and in further succession Agnes
Desolneux. There is a newer follow-up on pattern theory by these two authors [10].
This is much closer to the issues discussed in our contribution. In fact, the most
important reference for us is [11] of Desolneux and her group.
Interestingly, A. Desolneux bases her work on quite old discoveries, in particular
on those of H. von Helmholtz. The Helmholtz principle states: “We immediately per-
ceive whatever could not happen by chance.” So the role of the most homogeneous
“symmetry” in Leyton’s world is here played by randomly distributed clutter. The
Desolneux book is the most important reference for a sound and diligent probabilistic
elaboration of this ansatz. Formulating it in terms of statistical tests and combinato-
rial reckoning, the principle leads to the construction of very successful and robust
Gestalt perception machines. The standard model for the background clutter is a uniform
distribution. The Gestalt is detected as an unlikely outlier from it. One main result
reported in Desolneux’s book is the fact that estimations or even exact results on
the probabilities often require extensive computation. It turns out easier to estimate
expectation values. For example, if, at a certain clutter density and certain foreground
deviations from regularity, 10 friezes of primitive objects can be expected in a pure
clutter image, such a frieze will not be very salient. On the other hand, if in a different
setting of clutter density and foreground deviations the expectation would be 0.01
friezes, finding one will be a surprise and thus salient.
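To make this expectation-based reasoning concrete, the following minimal sketch estimates how many aggregates of a given size would arise purely by accident in clutter and flags a finding as salient when that expectation is well below one. The uniform-and-independent clutter model, the helper names, and the numbers are our own illustrative assumptions, not the formulation of [11].

# Minimal sketch of expectation-based salience in the a contrario spirit.
# Assumptions (ours): clutter objects are independent, and a candidate
# group of the given size satisfies the grouping law by accident with a
# small probability p_accidental.
from math import comb

def expected_accidental_count(n_objects, group_size, p_accidental):
    """Expected number of groups that fulfill the law purely by chance."""
    n_candidates = comb(n_objects, group_size)  # number of tested tuples
    return n_candidates * p_accidental

def is_salient(n_objects, group_size, p_accidental, threshold=1.0):
    """Surprising (salient) if fewer than `threshold` accidental groups
    are expected among all candidates."""
    return expected_accidental_count(n_objects, group_size, p_accidental) < threshold

# With 100 clutter objects and a per-pair accident probability of 1e-3,
# about 4950 * 1e-3 = 4.95 accidental pairs are expected -> not salient.
print(expected_accidental_count(100, 2, 1e-3))
# With a much stricter law (p = 1e-6) only about 0.005 pairs are expected,
# so finding one would be a surprise -> salient.
print(is_salient(100, 2, 1e-6))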
We follow this approach in that we would assume the background clutter Gestalten
to be uniformly distributed in their location, orientation, etc. Many of our example
figures use the uniform background versus normally distributed foreground rationale.
But we do not repeat Desolneux’s approach in detail. In particular, for the time
being, we do not base our work on the mathematical theory of probability. Our book
emphasizes the algebraic view on the topic instead. Perceptual grouping comes in
nested hierarchical organization. Desolneux is well aware of that, but the technical
difficulties of probabilistic reckoning seem to hinder her in advancing deeper into this
aspect. We think that one may well advance in that direction, and do the probability
math later.
Pioneering the application of Gestalt laws in machine vision, D. G. Lowe’s book
on perceptual organization appeared in 1985 [12]. This book is primarily cited when
treating the two-and-a-half D sketch, an idea that has not proven very successful.
However, it is full of brilliant ideas on how a future seeing machine would be best
organized and where the different aspects—knowledge, perceptual organization, and
learning—should be utilized and should interact. Lowe emphasizes frequently that
the key property is “non-accidentalness”—a configuration should be seen as an
aggregate if it is unlikely to have occurred by chance. Given the computational resources
and the limited sources of digital imagery in those days, the work shows
prophetic qualities. Lowe emphasizes that vision always works in the scale-space
domain. On the iconic level one should not concentrate on the pixel grid matrix.
Instead, the correct domain is a continuous plane together with a scale pyramid on it.
Not only are the locations of maximal curvature located in the plane with sub-pixel
accuracy, they are also most salient at a particular scale of the image. The SIFT
keypoints resulting from this approach are discussed in more detail below in
Sect. 11.4. It is a bit sad that some people today reduce Lowe’s work to this low-level
SIFT issue only. His contribution to perceptual grouping is of equal importance.
Analysis of remotely sensed images of the surface of the Earth has been one of the
major application fields of machine vision right from the start, and the authors of this
book have been active in this field in particular. A particularly accessible
book on perceptual organization in this context written by Sarkar and Boyer [13]
appeared in 1994, in a way concluding this period. There, vision is regarded as a
stratified process distinguishing: (1) signal level, (2) primitive level, (3) structural
level, and (4) assembly level. On each of these levels the laws of Gestalt perception—
they frequently prefer Lowe’s term “perceptual organization”—apply in a different
way. They introduce the perceptual inference net (PIN) as coding and interface
format for the automation of such processes.
The PIN is essentially a Bayesian network. Such networks are a graphical nota-
tion for the factorization of a joint probability into conditional probabilities. As an
example, take two line segment primitives. Their probability of appearance at a given
position, orientation, and scale will at first be assumed independent, so that the joint
probability is just the product of the individual probabilities. But then a hidden node
can be introduced, denoting a parallel pair, just as they are constructed in Sect. 9.1.
Sarkar and Boyer call such a construction a “composite”, while throughout this book
the word “aggregate” is used for it. Given that these two primitives are part of such an
aggregate, the probability of certain positions, orientations, and scales is reckoned very
differently. As a practical approach to pattern recognition, messages are introduced
traveling through the links of the net in both directions. Bottom-up, new nodes are
constructed, i.e., hypotheses of what aggregates might be present. Top-down, tests
are performed evaluating how well lower-level elements fit into the aggregate. The
parameters of the distributions can be set heuristically or also estimated from labeled
data.
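As a toy illustration of this bottom-up/top-down reasoning, the following sketch compares the likelihood of two observed segment orientations under an independence model with their likelihood under a hypothetical “parallel pair” parent node. The uniform prior over 180°, the Gaussian coupling, and the parameter values are our own assumptions for illustration, not the PIN formulation of Sarkar and Boyer.

# Toy sketch (our assumptions): evidence for a hidden "parallel pair" node.
# Without the aggregate, both orientations are modeled as uniform on 180 deg;
# given the aggregate, the second orientation stays close to the first.
# Angle wrap-around is ignored here for brevity.
from math import exp, pi, sqrt

def gauss(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def evidence_for_parallel_pair(phi1, phi2, sigma_pair=3.0):
    p_independent = (1.0 / 180.0) * (1.0 / 180.0)        # both uniform
    p_aggregate = (1.0 / 180.0) * gauss(phi2, phi1, sigma_pair)
    return p_aggregate / p_independent                   # >1 favours the pair

print(evidence_for_parallel_pair(10.0, 11.0))   # nearly parallel: ratio >> 1
print(evidence_for_parallel_pair(10.0, 55.0))   # far from parallel: ratio ~ 0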
Sarkar and Boyer demonstrate their PIN method on aerial images of urban ter-
rain. They achieve impressing results without using much knowledge about human
settlement or exhaustive training of classifiers. They emphasize hierarchy through
the scales of such data. We close this section by citing them (p. 64 of [13]):
• “We make the problem tractable by exploiting the conditional dependencies inher-
ent among the variables. Features which are dependent tend to be close together
spatially. In the context of a hierarchical system this assumption is generally
true. Dependencies among distant features are captured at higher levels of the
hierarchy. ...”
There is no better way to say this.

1.3 The Gestalt Domain

Motivated by the examples presented in Sect. 1.1 and the state of the art presented
above, we define a domain G in which the objects or things we have been talking
about can be handled by machines as well as by developers. Once such a domain is
given, its mathematical properties can be discussed, operations on its elements can be
defined, and theorems about these structures can be proven. Code can be constructed
and tested on appropriate example image data sets, procedures for ground truth
generation can be constructed, and so forth.
We refer to the elements of this domain as “Gestalt” or in plural as “Gestalten”,
a word borrowed from the German language, and we do so with reference to the rich
literature on Gestalt grouping in the English-speaking world. With the use of this term
we place our work in the interdisciplinary field between the empirical psychology
of the human visual perception system on the one hand and the world of machine
vision engineering on the other hand.
As a starting point, a finite set of such Gestalten can be extracted from an image.
There are numerous methods to do such extraction, some of which are given in
Chap. 11. Some are better for one type of images; some of them are better for other
types of images; some of them are complex to understand, while others are very
simple; some require substantial computational effort; others are faster than video
rate. For the time being we do not care what method is being used. Our interest here
is only in the properties of G and its elements—the Gestalten.
Any g ∈ G has the following compulsory features:
• A location in the image denoted as x_g. We prefer the standard 2D vector space on
real numbers for this feature.
• A scale denoted as s_g. This is a real number greater than zero.
• An orientation denoted as φ_g. This is the angle between the Gestalt and the horizontal
axis.
• A periodicity with respect to rotary self-similarity denoted as f_g. This is a positive
integer. It will be 1 if the object has no self-similarity when rotated in the plane. It
will be 2 if it appears similar when rotated by 180° and so forth.
• An assessment denoted as a_g. Assessments are real values bound between zero and
one. Throughout this book, assessments replace predicates or laws which can only
be fulfilled or violated. An arrangement will be assigned assessment 1.0 if the
corresponding symmetry law is fulfilled in perfection and 0.0 if it is a perfect
violation of the law. Most arrangements will be assessed somewhere in between.
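A minimal data structure capturing these compulsory features might look as follows; the field names and the simple validity checks are our own illustrative choices, not notation prescribed by the book.

# Minimal sketch of an element of the Gestalt domain G (illustrative naming).
from dataclasses import dataclass

@dataclass
class Gestalt:
    x: tuple          # location (x1, x2) in the continuous 2D plane
    s: float          # scale, a positive real number
    phi: float        # orientation w.r.t. the horizontal axis
    f: int            # periodicity of rotational self-similarity (>= 1)
    a: float          # assessment in [0, 1]; 1 = law fulfilled in perfection

    def __post_init__(self):
        assert self.s > 0, "scale must be greater than zero"
        assert self.f >= 1, "periodicity is a positive integer"
        assert 0.0 <= self.a <= 1.0, "assessment is bounded between 0 and 1"

# Example: a well-assessed Gestalt with 2-fold rotational self-similarity.
g = Gestalt(x=(42.0, 17.5), s=3.2, phi=0.5, f=2, a=0.9)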
All machine vision engineering uses some kind of 2D location feature. Most
image processing modules use square (or almost square) pixel grids. In this world
the location is an index pair (r, c), first the row index running down from first row
to last row and then the column index running left to right from the first column to
the last. In fact, this is a matrix format, enabling summation in rows or columns, or
integral image tricks. However, it is an awkward format if geometric constructions
are the topic. It is, e.g., not closed under averaging: the average location of a set of
pixels will most often not have integer pixel coordinates.
This results from the mathematical property of the set of integer numbers, which is not
closed with respect to division. But the raster is only one problem. Another important
problem results from the margins. For example, constructing the intersection location
of two straight line segments will often come up with an out-of-margins result. Because of
the margins, you can also never have true invariance with respect to shift. All this
causes unnecessary problems. Therefore throughout this book a location will be
just a point (x1, x2) in the 2D plane, i.e., a pair of real numbers, with the first axis
pointing right and the second axis pointing up. All primitive extraction methods must
transform their results accordingly. The vector space properties of this domain allow
the use of Gaussian distributions on it. Thus the probabilistic approach reduces to
quadratic and linear forms, and often closed-form solutions are possible.
A. Rosenfeld discovered the importance of image pyramids about fifty years ago.
Today many image processing tools use this representation. Obviously, scale is an
important and natural feature of any image content. Mathematically scale lives in a
multiplicative continuous group. It makes no sense to add or subtract scales, and you
would always multiply them with a factor smaller or bigger than 1, with 1 acting
as a neutral element. The proper mean scale between scales 2 and 8 is not 5, and
you’d prefer the geometric mean, which here yields 4. In image processing tools the
scale is again treated in raster formats, but this time an exponential raster spacing is
used, such as 1, 2, 4, ... or 1, √2, 2, 2√2, .... We prefer a continuous scale feature
s > 0. Normal distributions are not a good statistical model in this domain. Instead,
the log-normal distribution appears to be a good choice.
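The multiplicative nature of scale can be respected by averaging in the logarithmic domain. The following small sketch, an illustration of our own, reproduces the geometric mean mentioned above and fits the parameters of a log-normal model to a set of scales.

# Scales live in a multiplicative group: average them in log-space.
import math

def mean_scale(scales):
    """Geometric mean: exponential of the mean of the logarithms."""
    logs = [math.log(s) for s in scales]
    return math.exp(sum(logs) / len(logs))

def lognormal_params(scales):
    """Mu and sigma of a log-normal model fitted to the scales."""
    logs = [math.log(s) for s in scales]
    mu = sum(logs) / len(logs)
    var = sum((l - mu) ** 2 for l in logs) / len(logs)
    return mu, math.sqrt(var)

print(mean_scale([2.0, 8.0]))   # 4.0, not the arithmetic mean 5.0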
Not all image elements of concern must necessarily feature an orientation. Circular
dots or disks are completely symmetric with respect to rotation. However, we regard
these objects as comparably rare exceptions. Orientations constitute a continuous
domain as well, but this one has no vector space structure. You may add or subtract
angles, but there is no metric. The triangle inequality is violated. Mathematically, this
domain is a continuous additive group with zero rotation as neutral element, and each
rotation can be undone by counter-rotation. In practice, often normal distributions
are used to describe the variation of orientations. However, this is not sound. The
sound way of handling statistics on orientations is given below in Sect. 1.5. There
you will also find how a proper mean of a set of orientations must be reckoned.
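As a stand-in for the treatment in Sect. 1.5, the following sketch shows the usual axial-statistics recipe: angles that are equivalent modulo 180° are doubled, averaged as unit vectors, and halved again. This is the standard approach for axial data and only our illustration, not necessarily the exact formulation of Sect. 1.5.

# Standard axial-statistics sketch for orientations modulo 180 degrees.
import math

def mean_orientation(phis_deg):
    """Mean orientation (degrees, modulo 180) of a list of orientations."""
    # double the angles so that phi and phi + 180 map to the same point
    c = sum(math.cos(math.radians(2.0 * p)) for p in phis_deg)
    s = sum(math.sin(math.radians(2.0 * p)) for p in phis_deg)
    return (math.degrees(math.atan2(s, c)) / 2.0) % 180.0

# Naive arithmetic averaging of 179 and 1 degrees would give 90 degrees;
# the axial mean correctly yields an orientation near 0 (equivalently 180).
print(mean_orientation([179.0, 1.0]))   # approximately 0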
Fig. 1.3 Forty randomly drawn Gestalten

Mathematically the assessment domain is just the interval of real numbers between
0 and 1. We may understand it as degree of fulfillment in a similar way as the
membership value in the fuzzy-set theory of Zadeh [14]. Throughout this book, such
assessments will be combined or fused. Much vigilance and care have to be employed
when defining assessment functions. This is the key to success in the topic at hand.
Often assessments will be understood, interpreted, constructed, and fused similar to
probabilities or probability density functions. But we cannot at the current state of
the work guarantee that every assessment has all properties of probabilities. Yet in
the next section probabilities and statistics on our domain will be treated.
Figure 1.3 displays a set of randomly drawn elements from this domain. The loca-
tion of each element is indicated as center of a circle. Here this feature is distributed
uniformly within a rectangle of 150 × 100 units. Drawing circles has the advantage
that also the scale can be straightforwardly indicated as the size of the circle. Note that the number of spokes connecting the center of an element with its perimeter varies. This displays the periodicity feature. For example, if an element is displayed with
three spokes, it will be indistinguishable when rotated by 120◦ . Most Gestalten given
throughout this book have periodicity 2; i.e., they are indicated with a cross-sectional
line. Only special sorts of primitive Gestalten can have periodicity 1, and only in
Chap. 4 periodicities higher than 2 can be constructed. Displaying such spokes also
gives a natural way of indicating the orientation feature. Orientation zero is defined as

horizontal and pointing right. Rising orientation goes in counterclockwise direction.


The most important feature, the assessment, is indicated as gray-tone. Good Gestalten are drawn in black on white ground, so as to be salient. Bad Gestalten are lighter. A Gestalt with assessment 0 would be drawn in white on white background;
i.e., it would disappear. This corresponds to our intention, because such Gestalt is
meaningless.
These are the compulsory features, which an object must have in order to par-
ticipate in our constructions. However, the list of features may not yet be complete.
Many Gestalten will feature additional properties, such as colors, or any kind of
descriptors. These may also contribute to the assessment of aggregates. A particular
discrete feature is the class of an object. This book does not focus on classification,
but of course the methods presented may interact with or complement object recognition and classification. We will, however, use labels referring to the kind of Gestalt
grouping utilized in the construction of an aggregate, saying, e.g., this is a reflection
Gestalt, or that is a row Gestalt.

1.4 Assessments for Gestalten

The detection, extraction, and grouping of Gestalten are usually based on features
which are inherently uncertain due to the measurement process, missing or invalid
data, and invalid model assumptions. Therefore, even apparently identical objects fea-
ture relations which are not perfectly fulfilled and attributes which are not identical.
Any reasoning and evaluation process has to consider the assessment of Gestalten
and their relations in various respects:
• Similarity and proximity. The similarity and proximity of two or more Gestalten
must be specified by the use of distance measures to check hypotheses.
• Agglomerative grouping. The hierarchical clustering of Gestalten or Gestalt parts
can be performed by a bottom-up approach, preferably taking the complexity of
potential new Gestalten into account.
• Performance evaluation. Based on given ground truth the performance of the
processes and the results must be specified.
• Classification. The affiliation of a given Gestalt to a certain class of Gestalten must
be determined by classification.
In most methods the decisions are based on a distance or similarity function and
a linkage criterion. The former should use an appropriate metric to allow for the
combinations of different types of features.

Similarity and Distance Functions


A similarity measure or similarity function a is a real-valued function that quantifies
the similarity between two objects, preferably in the range [0, 1] for easy interpre-
tation. Although no single definition of a similarity measure exists, usually such

measures are in some sense the inverse of distance metrics d:

a = exp {−d} , d≥0 (1.1)

A metric or distance function d in turn is a function that defines a distance between


each pair of elements of a set X :

d : X × X → [0, ∞) (1.2)

where [0, ∞) is the set of nonnegative real numbers, and for all x, y, z ∈ X , the
following conditions are satisfied:

1. d(x, y) ≥ 0 non-negativity or separation axiom


2. d(x, y) = 0 ⇔ x = y identity of indiscernibles
3. d(x, y) = d(y, x) symmetry
4. d(x, z) ≤ d(x, y) + d(y, z) subadditivity or triangle inequality

Similarity of Gestalten
A common way to specify the distance between two feature sets x_j and x_k is the unitless and scale-invariant Mahalanobis distance

d(x_j, x_k) = √( (x_k − x_j)^T Σ_xx^{−1} (x_k − x_j) )     (1.3)

with the covariance matrix Σ_xx taking care for scale invariance and the consideration of correlations. If the covariance matrix is diagonal, i.e., the features are independent, the resulting distance measure is called a normalized Euclidean distance

d(x_j, x_k) = √( Σ_{i=1}^{n} ( (x_ij − x_ik) / σ_i )² ),     (1.4)

and with identically distributed features

d(x_j, x_k) = √( Σ_{i=1}^{n} (x_ij − x_ik)² ).     (1.5)

In this case the assessment or similarity function (1.1) reads

a(x_j, x_k) = exp{ −d(x_j, x_k) } = exp{ −√( Σ_{i=1}^{n} (x_ij − x_ik)² ) }.     (1.6)

Comparing the Properties of Assessment Fusion and T-Norms

Fuzzy-set theory uses triangular norms as generalization of conjunctive combina-


tion, i.e., logical and or set intersection. For further reading on such T-norms we
refer to [15], but we recapitulate here what is needed for combinations of Gestalt
assessment functions. A T-norm is a function t : [0, 1] × [0, 1] → [0, 1] with the
following properties:
• Commutativity: t (a, b) = t (b, a).
• Monotonicity: if a ≤ c and b ≤ d, then t (a, b) ≤ t (c, d).
• Associativity: t (a, t (b, c)) = t (t (a, b), c).
• Identity element: t (a, 1) = a.
It is easily verified that the logical conjunction “∧” fulfills these properties restricted
to the extreme values 0, 1 which are interpreted as false and true. With associativity
at hand we can write t (a1 , a2 , . . . , an ) instead of t (a1 , t (a2 , . . . , an ) · · · ). Most
common examples for T-norms are t (a, b) = min(a, b) and t (a, b) = a · b. In fuzzy
sets the former is standard for conjunction; however, in the logical combination of
our Gestalt assessments we will prefer the latter, i.e., the product. This is the correct
way of combining probabilities.
If Gestalt assessments are combined with respect to n different properties using a
T-norm, the result will tend to become small—and ever smaller with rising numbers
of properties. This can cause a problem when assessments are used to set priorities
in a competition for computational resources in a smart search system as outlined
below, e.g., in Sect. 6.4. In such context assessments are compared with each other
that are based on different numbers of properties. We therefore define a fifth property
for such functions, the balance:
• Balance: t (a, a) = a.
The T-norm “min” fulfills this property. It is known that this is the maximal T-norm.
All functions substantially different from min will violate balance. In particular the
product a · a violates it for any 0 < a < 1, i.e., for almost all a. If a and b are
uniformly distributed random variables, min(a, b) will not be uniformly distributed. Its expectation will be substantially lower than 0.5. The only balanced function that
transports expectations of assessments would to our knowledge be the mean:

a+b
t (a, b) = .
2
This follows from the linearity of expectation. Unfortunately, this function violates
associativity, and 1 is not an identity for this function. Therefore, this is not a T-norm.
Associativity can be enforced by using

a1 + · · · + an
t (a1 , . . . , an ) =
n

instead of nested terms. But the existence of an identity can to our knowledge not
be kept if balance and transport of expectation are demanded. The function min,
which constitutes the maximal T-norm, has a substantially lower expectation as
compared to the mean. For uniform assessments a and b the expectation turns out as
E (min(a, b)) = 1/3.
There is yet another property that conjunctive combinations of assessments should
fulfill: If any of the partial properties should be assessed as zero, the combination
should also be assessed as zero:
• Null element: t (a, 0) = 0.
This is always fulfilled for any T-norm, but not for the mean. There is a commutative and balanced function that also has this null-element property, namely the geometric mean:

t(a, b) = √(a · b).

It is not a T-norm, because it violates the identity property for the element 1. It does not transport expectations. For uniform assessments a and b the expectation turns out as E(√(a · b)) = 0.444.... This is much closer to 1/2 than the largest expectation that can be achieved with T-norms. The geometric mean also violates associativity.
But this can be fixed by using

t(a_1, . . . , a_n) = (a_1 · a_2 · · · a_n)^{1/n}

for conjunctions of more than two assessments. Indeed this function is used frequently
throughout this book.
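To make these properties concrete, the following small numerical sketch (plain Python with NumPy; the function and variable names are our own and not taken from the book) estimates the expectations of the candidate combination functions for uniformly distributed assessments, reproducing the values discussed above (1/3 for min, about 0.444 for the geometric mean).

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.uniform(size=(2, 100_000))

# candidate conjunctive combinations of two assessments
product = a * b                    # T-norm, probabilistic conjunction, declines quickly
minimum = np.minimum(a, b)         # maximal T-norm, expectation about 1/3
mean = (a + b) / 2                 # balanced, transports expectation, but no null element
geometric = np.sqrt(a * b)         # balanced, has the null element, expectation about 0.444

for name, v in [("product", product), ("min", minimum),
                ("mean", mean), ("geometric mean", geometric)]:
    print(f"{name:15s} E = {v.mean():.3f}")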
Performance Evaluation
The determination of empirical accuracy requires reference values ỹ or ground truth for the results x or functions y(x) of the results. If the reference values have at least the same accuracy as the estimated values, the differences

Δy = y(x) − ỹ     (1.7)
can be analyzed, e.g., by computing the histogram or just the extrema. In order to
test whether the accuracy potential of the observations is fully exploited, one needs
to compare the differences (1.7) with their standard deviations by considering the
ratios
Δy_i / σ_Δy_i     (1.8)

with the standard deviations σ_Δy_i = √(Σ_Δy_i Δy_i) obtained from Σ_ΔyΔy = Σ_yy + Σ_ỹỹ. The covariance matrix Σ_yy of the function values y needs to be determined by variance–covariance propagation, together with the covariance matrix Σ_ỹỹ of the reference values ỹ.

A combined test to check the complete set of n values y uses the Mahalanobis distance

(y − ỹ)^T (Σ_yy + Σ_ỹỹ)^{−1} (y − ỹ) ∼ χ²_n     (1.9)
as test statistic, which is χ2 -distributed with n degrees of freedom if the mathematical


model holds and if the data are normally distributed.
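As an illustration of Eq. (1.9), the combined test can be sketched as follows (Python with NumPy and SciPy's χ² quantile; the function name, the toy data, and the 5% significance level are our own assumptions, not prescribed by the text).

import numpy as np
from scipy.stats import chi2

def consistency_test(y, y_ref, cov_y, cov_ref, alpha=0.05):
    """Combined check of results y against reference values y_ref, Eq. (1.9)."""
    d = y - y_ref
    cov = cov_y + cov_ref                      # covariance of the differences
    t = float(d @ np.linalg.solve(cov, d))     # Mahalanobis form of Eq. (1.9)
    threshold = chi2.ppf(1.0 - alpha, df=len(y))
    return t, threshold, t <= threshold        # accepted if t does not exceed the quantile

# toy usage with simulated values
rng = np.random.default_rng(1)
y_ref = np.zeros(3)
cov = 0.01 * np.eye(3)
y = y_ref + rng.multivariate_normal(np.zeros(3), 2 * cov)
print(consistency_test(y, y_ref, cov, cov))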

1.5 Statistically Best Mean Direction or Axis

Directions are important features of many Gestalten. Axes are considered to be undi-
rected lines. For a comprehensive survey of this topic see [16] for instance.
Statistically Best Mean Direction
Given n unit vectors d_i representing directions, the statistically best mean direction d̄ is given by

d̄ = arg min_d Σ_{i=1}^{n} w_i ‖d_i − d‖² ,     (1.10)

i.e., the solution is found by minimizing the sum of weighted squared distances with
the weights w_i = 1/σ_i² allowing for different uncertainties σ_i² [17]. The estimate

d̄ = N( Σ_i w_i d_i / Σ_i w_i )     (1.11)

for the mean direction d̄ is simply the weighted sum of directions normalized to unit
length with the operator N(·). As shown in Fig. 1.4, vector addition is the natural
way to combine unit vectors. If the angles between the directions di and the mean
direction are small, the sum of the weighted squares of angles is also minimized.

Fig. 1.4 Example of addition of unit vectors, according to [16]. The resultant vector has the mean direction of the individual vectors

Statistically Best Mean Axis


Axes are homogeneous unit vectors; i.e., the vectors a and −a represent the same
axis. Given n unit vectors ai with arbitrary signs and weights wi , the statistically best
mean axis is found by minimizing

Σ_{i=1}^{n} w_i sin²(α_i)     (1.12)

with the angles α_i = arccos(a_i^T a) between a_i and a. Minimizing (1.12) is equivalent to maximizing

Σ_{i=1}^{n} w_i cos²(α_i) = Σ_{i=1}^{n} w_i (a_i^T a)² = Σ_{i=1}^{n} w_i a^T a_i a_i^T a.     (1.13)

Therefore, the optimal axis is given by the eigenvector corresponding to the largest
eigenvalue of the weighted moment matrix

M = Σ_{i=1}^{n} w_i a_i a_i^T .     (1.14)
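Both estimators, Eq. (1.11) and Eq. (1.14), translate directly into a few lines of NumPy; the following sketch uses our own function names and a small toy data set.

import numpy as np

def mean_direction(d, w):
    """Weighted mean direction, Eq. (1.11): normalized weighted vector sum."""
    v = (w[:, None] * d).sum(axis=0) / w.sum()
    return v / np.linalg.norm(v)

def mean_axis(a, w):
    """Weighted mean axis, Eq. (1.14): dominant eigenvector of the moment matrix."""
    M = (w[:, None, None] * a[:, :, None] * a[:, None, :]).sum(axis=0)
    eigval, eigvec = np.linalg.eigh(M)          # symmetric matrix, ascending eigenvalues
    return eigvec[:, -1]                        # eigenvector of the largest eigenvalue

# toy usage: axes given with arbitrary signs
angles = np.deg2rad([10.0, 12.0, 198.0])        # the third axis is flipped by 180 degrees
a = np.stack([np.cos(angles), np.sin(angles)], axis=1)
w = np.ones(len(a))
print(mean_axis(a, w))                          # an axis of about 13 degrees (up to sign)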

1.6 The Structure of this Book

The laws of Gestalt perception differ considerably in the structure of the equations
inferring the perceived features from the observations. When used in the analytic
direction, data-driven search for best instances also requires very specific algorithms
for each law. Moreover, it is also important to be aware of the algebraic operations
that are permitted on the input. They give again different equivalence relations for
every Gestalt law. For these reasons, the laws are treated in separate chapters:
• Reflection symmetry (see Fig. 1.5 upper left) is treated in Chap. 2.
• Repetition in a frieze in good continuation using a generator vector (see Fig. 1.5
upper right) is treated in Chap. 3.
• Rotational symmetry (see Fig. 1.5 second from top left) stands out because of its rich algebraic beauty. It is not listed as a Gestalt law in the classical literature and is less important in the applications. Yet it deserves its own Chap. 4.
• Parallelism and rectangular settings are of particular importance in the analysis
of man-made objects such as buildings or other infrastructure. They are treated
together in Chap. 9. Parallelism alone needs fusion with close proximity as indi-
cated in Fig. 1.5 third from top left.
Fig. 1.5 Operations on the Gestalt domain

• Chap. 8 is about contour prolongation (see Fig. 1.5 second from top right). Distinguishing this kind of good continuation law from the one presented as frieze law in Chap. 3 is important. It differs not only in the observation equations but also decisively in the algebraic structure.
• Last but not least, lattices (see lowest operation in Fig. 1.5) set an important
application field, for instance, encountered in facade analysis. These should not
be treated simply as rows of rows. Instead, there is a separate chapter for this:
Chap. 10.
There is general agreement that, while those individual laws may all be of interest
to the research community on their own, the most exciting point is in their combina-
tion. One way of such combination can be formulated as nested hierarchies on rising
scales. For instance, frequently things like rows, of reflection symmetric objects, are
encountered, which are made of lattices on an even finer scale, etc. Chap. 5 follows
an algebraic approach to this kind of hierarchical reasoning. It also gives an example of how the adjustment of observed features can be guided by propagating the Gestalt laws through the part hierarchy and enforcing all resulting constraints in one minimization.
All the laws of Gestalt perception, as well as their combination, can in principle be studied without any learning data at hand. That is, the corresponding machine perception code can be built by just translating the stochastic and algebraic content of these chapters. This distinguishes the subfield sharply from the deep-learning approaches to pattern recognition and machine vision that were so popular and successful in the last decade. However, when, after the setup of this Gestalt perception structure,
example data are considered, one may introduce parameters and optimize these in
order to improve the recognition performance on these given examples. Chapter 13
outlines such possibilities.
Learning such weight parameters from data is at the core of the so-called artificial
neural nets that received so much attention twice in the history of machine vision

and pattern recognition: first in the three decades following the 1960 adaptive linear
element model of B. Widrow and later in the most recent decade. In between there
was much interest in knowledge-based recognition methods. Though looking quite
old-fashioned today, there is good reason to also consider such approaches today. The
main reason can be seen in the huge amounts of machine interpretable knowledge
available today. In order to achieve highest possible recognition capabilities these
sources should not be left out. This book contains a separate chapter treating the
combination of Gestalt perception with knowledge utilization: Chap. 10.
The book would not be complete without treating two issues which are very
important to practical applicability:
• The primitive extraction—this step from the signal level to the object level, where
the data are represented as a set of tokens, is most critical. Much information can be lost here. In fact, the perceptual grouping step often fails before it has started, because the decisive items have been lost in the primitive extraction process. Chapter 11 lists a set of corresponding methods, together with their
advantages and disadvantages. Of course such listing cannot be exhaustive.
• Hierarchical Gestalt grouping is a combinatorial process. Thus, a severe obstacle
to practical application must be seen in its potentially high computational com-
plexity. For many applications of vision unpredictable data-dependent run-times
and storage requirements cannot be accepted. It is therefore important to discuss
any-time search algorithms that trade soundness for speed. Chapter 6 discusses
this aspect quite early in the book and proposes alternatives in these directions, in
order to counter any premature critique along these lines.
The sequence in which the chapters are presented may be a good sequence for reading them, but one may as well move around freely.

References

1. Glasner D, Bagon S, Irani M (2009) Super-resolution from a single image. In: IEEE 12th
international conference on computer vision (ICCV), pp 349–356
2. Rosenfeld A (1979) Picture languages. Academic Press
3. Wertheimer M (1923) Untersuchungen zur Lehre der Gestalt II. Psychologische Forschung
4:301–350
4. Leyton M (2014) Symmetry, causality, mind. MIT Press, Cambridge
5. Pizlo Z, Li Y, Sawada T, Steinman RM (2014) Making a machine that sees like us. Oxford
University Press
6. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from real-world images competition 2013: summary and results. In: CVPR 2013, Workshops
7. Mitra NJ, Pauly M, Wand M, Ceylan D (2013) Symmetry in 3D geometry: extraction and
applications. Comput Graph Forum 32(6):1–23
8. Kanizsa G (1980) Grammatica del vedere. Saggi su percezione e gestalt. Il Mulino
9. Grenander U (1993) General pattern theory. Oxford University Press
10. Mumford D, Desolneux A (2010) Pattern theory. CRC Press, A K Peters Ltd., Natick
11. Desolneux A, Moisan L, Morel J-M (2008) From Gestalt theory to image analysis: a proba-
bilistic approach. Springer
12. Lowe DG (1985) Perceptual organization and visual recognition. Kluwer Academic Publishing
13. Sarkar S, Boyer KL (1994) Computing perceptual organization in computer vision. World
Scientific
14. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
15. Klement EP, Mesiar R, Pap E (2000) Triangular norms. Kluwer
16. Fisher NI (1995) Statistical analysis of circular data. Cambridge University Press
17. Förstner W, Wrobel B (2016) Photogrammetric computer vision. Springer
Chapter 2
Reflection Symmetry

Buildings—in particular large representative buildings like temples or castles—often


feature reflection symmetry. Therefore in remote sensing on urban terrain reflection
symmetry constitutes a strong prior that unfortunately is rarely investigated. Also in
facade recognition, reflection symmetry has not been in the focus of attention. In face
recognition, reflection symmetry has been identified as some valuable feature [1, 2]
and indeed faces constitute a substantial part of the few publicly available benchmarks
[3, 4]. Animals often feature reflection symmetry as well. Indeed, when looking at random dot patterns, such as the ones given below in Fig. 2.1, human subjects might well perceive illusions showing, for instance, faces or insects where reflection symmetric clusters are present.
When the term “reflection symmetry” is used throughout this book, usually reflec-
tion in 2D with respect to a symmetry axis is meant. Sometimes people prefer the
word “mirror symmetry” for this law. Geometry also knows reflection symmetry
with respect to a point. This topic is treated as special case in Sect. 2.11.

2.1 Introduction to Reflection Symmetric Gestalten

Psychological investigation reveals reflection symmetry as an important grouping


law for foreground-to-background discrimination [5]. Following the classical Gestalt
approach—as, e.g., in [6, 7]—such conclusion can be drawn in the publications from
graphics containing dots or short line primitives, that form a Gestalt. With no com-
puter support for empirical evidence, early work demonstrated the effects abusing
the reader as trial subject. Numerous illusions were presented to the reader/viewer
accompanied with text stating what should be seen. Of course such evidence lacks
a representative observer set since it uses only one subject. But often it is still con-
vincing because of the strength of the effects.

Fig. 2.1 Reflection symmetric Gestalten on clutter: left to right rising amount of clutter, first row
locations only, second row location and similarity, third row location and orientation

Figure 2.1 follows this methodology, but today such graphics can be generated
using a random number generator in a computer. Following [8] there is not only the
foreground Gestalt aggregate in the graphic but also background clutter objects that
resemble the parts. Such clutter acts against the Gestalt perception. In the figure the
amount of clutter is increasing from left to right—left column 40 clutter elements,
center column 80 clutter elements, right column 160 clutter elements. Actually, the
hiding effect is so strong that in the right column not all readers may perceive the
reflection symmetric Gestalt at once—it is near the left margin. So one may state
that the transition quantity between “meaningful” and “meaningless” in the terms of Desolneux [9] must be somewhere near these clutter densities. Of course with
computer generated graphics such experiments can be repeated, and they can be
presented to more than one person. Averaging the results over a more representative
trial-subject set will constitute more stable and convincing evidence.
Figure 2.1 shows three rows containing primitives with different features. The first
row features location only. The second row features additionally intensity and size.
Thus, the influence of additional similarity can be investigated. On some patterns
the Gestalt may become visible where it was hidden in clutter with location feature
only. The third row shows orientation as additional feature following the algebraic
setting of Sect. 2.2 below.
Mathematically it is evident that constructing such a figure is much easier than analyzing it. The construction performs the following steps (a code sketch is given after the list):

• Choose a location and size for the Gestalt at random, in such a way, that it will fit
into the image margins, and choose an orientation for its reflection axis—also by
drawing from uniform density. This constitutes the ground truth for the trial.
• The size chosen in the previous step is used as length for the axis. Accordingly the
axis has two endpoints, and between these endpoints a certain number of pairs of
objects will fit (depending on the rough size of the primitives). For each such pair
a distance is drawn uniformly between zero and again the size, so that the width
of the cluster will also be expected to be the size (and maximally the width will
be double the size). Thus the location feature for all the parts is set.
• Possible additional features such as orientation, size, and intensity are set for each
pair using again uniform distributions.
• The features of the foreground primitives may be disturbed according to normal
distributions (in Fig. 2.1 location coordinates are disturbed with a standard devia-
tion of one unit).
• Generate the desired number of clutter objects using again uniform densities. It
is recommended to avoid placing background objects into the region where the
foreground object is—so that the objects will appear saliently denser there. So if
a clutter object falls into this region it will not be used, and instead a new one will
be drawn.
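A minimal sketch of this construction (Python/NumPy; all names are our own, and the 150 × 100 extent and the location noise of one unit follow the description above) could look as follows.

import numpy as np

rng = np.random.default_rng(42)
W, H, n_clutter, n_pairs = 150.0, 100.0, 80, 5

# ground truth for the Gestalt: location, size, and axis orientation
size = rng.uniform(15.0, 30.0)
cx = rng.uniform(size, W - size)
cy = rng.uniform(size, H - size)
alpha = rng.uniform(0.0, np.pi)                     # orientation of the reflection axis
axis_dir = np.array([np.cos(alpha), np.sin(alpha)])
normal = np.array([-np.sin(alpha), np.cos(alpha)])  # reflection direction

# pairs of foreground parts, mirrored across the axis and disturbed by noise
t = rng.uniform(-size / 2, size / 2, n_pairs)       # positions along the axis
d = rng.uniform(0.0, size, n_pairs) / 2             # half distance of each pair across the axis
centers = np.array([cx, cy]) + t[:, None] * axis_dir
foreground = np.concatenate([centers + d[:, None] * normal,
                             centers - d[:, None] * normal])
foreground += rng.normal(scale=1.0, size=foreground.shape)

# clutter drawn uniformly, rejecting points inside the foreground region
clutter = []
while len(clutter) < n_clutter:
    p = rng.uniform([0.0, 0.0], [W, H])
    if np.linalg.norm(p - [cx, cy]) > size:
        clutter.append(p)
clutter = np.array(clutter)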
It is clear that such method is of linear algorithmic complexity in terms of the
number of objects in the graphic. With 100 or 200 objects very low effort arises.
However, the automatic analysis of such data for such Gestalten may cause consid-
erable computational effort. Searching for the proper subset of foreground objects,
operates mathematically in the power set. It is well known that the power set of one
or two hundred objects is not tractable. Yet humans perform remarkably well in this
discipline.
The remainder of this chapter will first treat the analysis of pairs of given primitives
by defining assessments for the fit into reflection symmetry. Then search strategies
in the power set will be discussed that are of feasible computational complexity.

2.2 The Reflection Symmetry Constraint as Defined for Extracted Primitive Objects

The most natural feature of objects appearing in images is the 2D location. Most common coordinate systems for this in machine vision have row–column format with the origin in the upper left corner—just like matrix indices. However, here Cartesian x-coordinates are used throughout with the first axis pointing right and the second axis pointing up—like in textbook geometry. The default location for the origin is the image center. If the location remains the only feature of objects, every pair
(g p , gq ) of objects will obey the symmetry constraint perfectly. The most natural
position for a newly constructed symmetric aggregate object gs is the mean of the

Fig. 2.2 Two parts form an aggregate following the law of reflection symmetry with respect to an
axis

parts x_s = ½ (x_p + x_q). As indicated above in Sect. 1.3, “x” refers here to the position feature of objects which are given as indices.
Figure 2.2a demonstrates the construction exemplarily. If x_p ≠ x_q holds, an appropriate orientation feature for the newly constructed object g_s can be taken from the orientation of the line connecting x_p and x_q. In Fig. 2.2a it is indicated as φ_s. The
orientation of the axis of symmetry is assigned α which is perpendicular to φs . Note
that α, φs ∈ [0, π), and recall that the object gs remains the same whether it is con-
structed from (g p , gq ) or (gq , g p ). Thus its self-similarity periodicity will be 2, and
its orientation is an element in [0, π).
Section 1.3 demands compulsory orientation features for Gestalten. In Fig. 2.2b g p
and gq have orientation features in [0, 2π), i.e., self-similarity periodicity 1. These
orientations are indicated as φ p and φq . These features obey reflection symmetry
when

2 (φ_s + π/2) = φ_p + φ_q     (2.1)
holds. The validity of this can first be seen by setting φs = 0 and π = φ p + φq —
i.e., a vertical symmetry axis (α = π/2). Then the whole configuration is rotated
by an angle β ∈ [0, π). This adds 2β as well to the left-hand side as to the right-
hand side. For example, for a horizontal symmetry axis we have φ p = −φq , and
φs = π/2, so that Eq. (2.1) is again valid. Equality is only possible in Eq. (2.1) because
of the factor 2 on the left-hand side—recall φ p , φq ∈ [0, 2π) while α, φs ∈ [0, π).
All these operations are understood in the additive 2D rotation group R mod 2π.
This simple constraint formulation has already been used for symmetry detection in
[10]. Orientation features in [0, 2π) can be obtained, e.g., using brightness gradient
directions. This is possible for locations where the intensity gradient magnitude is
nonzero, i.e., for non-homogeneous locations.
When other object extraction methods are used, self-similarity with respect to
rotation may have to be considered, i.e., periodicity. For example, Fig. 2.2c shows
g p and gq as rotationally symmetric objects according to the group generated by
2π/3 rotation—and the mirror operation on them. So the objects are assumed to
be invariant under transformations of the dihedral group (in this case of order three

commonly known as D3). Equation (2.1) still holds, but care has to be taken: The support of this equation now is [0, 2π/3). So, for instance, if φ_p = 10° and φ_q = 80° are given, then possible solutions for φ_s are 15°, 75°, and 135°, φ_s still being given between 0° and 180°.
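The candidate axis orientations in such a case can be enumerated mechanically. The following small helper (Python/NumPy, our own code, not part of the book's implementation) lists all φ_s in [0°, 180°) compatible with Eq. (2.1) modulo the periodicity of the parts and reproduces the 15°, 75°, and 135° of the example.

import numpy as np

def candidate_axis_orientations(phi_p_deg, phi_q_deg, periodicity=3):
    """Solutions phi_s of 2(phi_s + 90) = phi_p + phi_q modulo the part periodicity."""
    period = 360.0 / periodicity                      # support of the orientations, e.g., 120 degrees for D3
    base = (phi_p_deg + phi_q_deg) / 2.0 - 90.0       # one solution of the constraint
    sols = (base + np.arange(2 * periodicity) * period / 2.0) % 180.0
    return np.unique(np.round(sols, 6))

print(candidate_axis_orientations(10.0, 80.0))        # yields 15, 75, and 135 degrees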
We close this section with a remark on the rotational self-similarity of such newly
aggregated Gestalt gs . It remains a reflection symmetric aggregate with the same
location, axis-orientation, and scale when rotated by 180◦ . But in the general case
the orientations of the parts will not map on each other under such rotations. Thus, if
the object is regarded as simple reflection Gestalt its periodicity is 2. In Chap. 5 below
aggregated Gestalten are considered, i.e., all parts and parts of parts have to fulfill
the laws as well. Then self-similarity with respect to such rotation requires special
orientations from the parts as well as dihedral symmetry. Otherwise, the aggregated
reflection Gestalt gs has self-similarity periodicity 1.

2.3 Reformulation of the Constraint as a Continuous Score Function

In machine vision, constraints as given by Eq. (2.1) cannot be utilized with rigor.
Entities to be tested with even small errors—in the case of (2.1) angles in some continuous 2D rotation group—will almost never fulfill such a constraint.
“Almost never” holds here in the mathematical measure-theoretic sense. Therefore,
often such constraint is reformulated as inequality:

| 2(φ_s + π/2) − (φ_p + φ_q) | < t_φ ,     (2.2)

using a threshold tφ . This reformulation has two disadvantages: (1) A new parameter
tφ is introduced, and with it a search for its optimal setting. (2) For elements of
rotational groups “|...|” is not a given thing. According to the rationale of this work
such step functions should be replaced by—if possible parameter-free—continuous
assessment functions with certain properties. For proximity in planar rotation groups
the following properties are natural:
1. Being 1 for the term between the absolute bars being 0.
2. Being 0 for the term between the absolute bars being maximal.
3. Being continuous and differentiable everywhere.
The natural choice here is to use a cosine function:

a_φ(g_p, g_q) = 1/2 + 1/2 cos( 2(φ_s + π/2) − (φ_p + φ_q) ).     (2.3)
Recall, the orientation feature of the aggregate results from the location features of
its parts: φs = arctan((x p − xq )/(y p − yq )). The function aφ is used as orientation

Fig. 2.3 Possible continuous score functions replacing the orientation interval constraint Eq. (2.2):
solid line—aφ as defined in Eq. (2.3), Mises assessments with parameters κ = 0.33 (· · · ), κ = 1
(−−), and κ = 3 (− · −)

assessment function in [8, 11–13]. It is plotted as polar plot in Fig. 2.3. For score
functions on the 2D rotation group with probabilistic semantics one has to consult the
families of distributions defined on such domain—e.g., von Mises distributions [14].
The function aφ defined in (2.3) is only one example for a class of possible
orientation assessment functions which is defined as:
Definition 2.1 A function a_φ : G × G → [0, 1) is called reflection orientation assessment iff for all g_p, g_q ∈ G: φ_p − φ_q maximal ⇒ a_φ(g_p, g_q) = 0, and φ_p = φ_q ⇒ a_φ(g_p, g_q) = 1.
The Gestalt domain G has been defined Chap. 1. Here only the orientation feature
of it is needed. The above example aφ is recommended as default setting. When
performance optimization on a representative data set is the goal, similar functions,
or their parameters, may be learned. Chapter 13 considers possible machine learning
approaches for this problem. For the motivation of such functions also the minimum-
description-length ansatz can be considered, see below in Sect. 2.10.
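For illustration, the default orientation assessment of Eq. (2.3) can be coded directly (Python/NumPy; the function name is ours). Here φ_s is taken as the direction of the line connecting the two locations, which reproduces the vertical-axis check of Eq. (2.1).

import numpy as np

def orientation_assessment(x_p, phi_p, x_q, phi_q):
    """Reflection orientation assessment a_phi of Eq. (2.3)."""
    dx, dy = np.subtract(x_p, x_q)
    phi_s = np.arctan2(dy, dx)                 # orientation of the connecting line
    return 0.5 + 0.5 * np.cos(2.0 * (phi_s + np.pi / 2.0) - (phi_p + phi_q))

# a perfectly mirror-symmetric pair about a vertical axis scores 1
print(orientation_assessment((-1.0, 0.0), np.deg2rad(45), (1.0, 0.0), np.deg2rad(135)))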

2.4 Optimal Fitting of Reflection Symmetry Aggregate Features

In Sect. 2.2 the reflection constraint was defined on two Gestalten g p and gq . If that
constraint is fulfilled—like in Fig. 2.2—the features of a newly built aggregate gs with
these two parts are straightforward: The location xs is set as mean 1/2(x p + xq ), the
orientation results from the connecting vector φ_s = arctan2(x_p − x_q), the scale is obtained from the same vector plus the scale of the parts s_s = |x_p − x_q| + √(s_p · s_q), the frequency is 2, and because of perfect fulfillment of the reflection constraint the assessment will preliminarily be set to 1.
In Sect. 2.3 deviations from the constraint were permitted. In such situation the
simple construction of the location and orientation of the aggregate gs as outlined
above is still a possibility. It was used in [8, 11, 13], etc. However, there are better
options that require a bit more attention and computation: Regarding the Gestalten
g p and gq as noisy observations subject to deviation as outlined in Sect. 1.4, and gs
as hidden object that is not observed but inferred, an optimization is required. The
parameters of a model are fitted such that the resulting estimation gives the most
likely features of gs given g p and gq . Most likely are the smallest deviations which
are called residuals in this context. The resulting features are the most likely features
a posteriori, i.e., given the observations and the model.
The adjusted observations (x̂_p, φ̂_p, ŝ_p) and (x̂_q, φ̂_q, ŝ_q) must fulfill the constraint. The optimization uses homogeneous representations. The straight lines l_p and l_q result from the orientation features using l_p = [cos(φ_p), sin(φ_p), −d_p]^T and l_q = [cos(φ_q), sin(φ_q), −d_q]^T, respectively. d_p and d_q are the distances of the two lines to the origin of the coordinate system. The straight line n connecting the two Gestalt locations is obtained by the cross product n = x_p × x_q of the two locations x_p and x_q in homogeneous representation x_p = [x_p, y_p, 1]^T and x_q = [x_q, y_q, 1]^T. The perpendicular bisector m of the side defined by the midpoint x_0 of the two locations is then m = [n_2, −n_1, −d]^T with the distance d = x_0^T [n_2, −n_1]^T.
The three straight lines l_p, l_q, and m have to meet in a point, which can readily be expressed with the concurrence constraint

det [ l̂_p, l̂_q, m̂ ] = 0     (2.4)

for the adjusted observations. Furthermore, ŝ_p / ŝ_q = 1 is required for the adjusted scales. Note that the point of intersection can be at infinity. In this case, the orientations of both Gestalten are identical, i.e., the two straight lines l_p and l_q are parallel.
Enforcing the constraint Eq. 2.4 yields the minimization of residuals. Residuals are the differences between the adjusted and the measured features, x̂_p − x_p, etc. More precisely, this approach minimizes the sum of squared residuals S in a tangent vector space of the projective plane. Preferably, the origin should be placed at the mean of x_p and x_q, and the coordinate scale at their distance. Furthermore

Fig. 2.4 Two observed Gestalten (black) and the corresponding adjusted Gestalten (gray) fulfilling the concurrence constraint Eq. (2.4). The straight lines defined by the adjusted Gestalt locations and orientations and the perpendicular bisector meet in a point

a mutual weighting of the different residual components is required, e.g., by giv-


ing standard deviations. Figure 2.4 exemplarily shows a mirror symmetric pair of
observed Gestalten, the resulting aggregate, and the adjusted parts after enforcing
the reflection constraint Eq. 2.4. Here, standard deviations of σx = 0.1 units (normal
distribution in 2D), σφ = 20◦ (von Mises approximation to wrapped normal distri-
bution in orientation), and σs = 0.4 units (normal distribution restricted to positive
scales) have been chosen. Note, the location of the aggregate always remains at the
mean of the locations of the parts.
A yet even more precise solution would minimize the residual Euclidean dis-
tances. For such minimization, an iteration using Jacobi matrices is required. Details
on that can be found in Appendix A.
The amount of adjustment required to fulfill constraint Eq. 2.4, i.e., the weighted
sum of squared residuals, should be used in the assessment of the aggregate Gestalt.
Definition 2.2 A function a_| : G × G → [0, 1) is called residual reflection constraint assessment iff for all g_p, g_q ∈ G: S maximal ⇒ a_|(g_p, g_q) = 0 (i.e., the constraint is violated in the strongest degree), and S = 0 ⇒ a_| = 1 (i.e., the constraint is already fulfilled by the measured features).
If an exponential function of the negative value of the sum a| = exp (−S) is used such
assessment will correspond to an a posteriori probability. The orientation assessment
given in Definition 2.1 can be seen as approximation to this more appropriate assess-

ment function based on the sum of residuals. Actually, it is a special case of it putting
most weight on the location features of the parts.
With these definitions at hand, we can formalize the definition of the operation |
mentioned in Chap. 1:
Definition 2.3 A binary operation | : G × G → G is called reflection symmetry oper-
ation iff for all g p , gq ∈ G:
• Location x p|q and orientation φ p|q result from the residual reflection constraint
assessment calculation.

• s p|q = |x p − xq | + s p · sq (the new scale is larger than the mean scale of the
parts),
• f p|q = 2 (periodicity is 2 because g p |gq = gq |g p ),
and a p|q is a conjunctive assessment combination Definition 2.6 of residual reflection
constraint assessment Definition 2.2, proximity assessment Definition 2.4, similarity
in scale assessment Definition 2.5, and assessment inheritance from both parts.
Some of the details of the assessment reckoning are given below. Algebraic closure
and other important formal properties of this operation are proven in Chap. 5.
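A simplified sketch of the operation | is given below (Python; the dataclass and function names are ours). It uses the direct construction of Sect. 2.2 instead of the residual-based adjustment, and fuses the partial assessments with the sixth root of their product as discussed in Sects. 1.4 and 2.6; the individual assessment functions a_φ, a_d, and a_s are assumed to be supplied, e.g., as in Eqs. (2.3), (2.5), and (2.7).

import numpy as np
from dataclasses import dataclass

@dataclass
class Gestalt:
    x: np.ndarray      # 2D location
    phi: float         # orientation in [0, pi)
    s: float           # scale > 0
    f: int             # periodicity
    a: float           # assessment in [0, 1]

def reflect(gp: Gestalt, gq: Gestalt, a_phi, a_d, a_s) -> Gestalt:
    """Simplified reflection symmetry operation gp | gq (no residual adjustment)."""
    v = gp.x - gq.x
    x_s = 0.5 * (gp.x + gq.x)                        # location: mean of the parts
    phi_s = np.arctan2(v[1], v[0]) % np.pi           # orientation of the connecting line
    s_s = np.linalg.norm(v) + np.sqrt(gp.s * gq.s)   # scale from distance and part scales
    # 1.0 stands in for the periodicity assessment a_f (parts with equal periodicity)
    partials = [a_phi(gp, gq), a_d(gp, gq), a_s(gp, gq), 1.0, gp.a, gq.a]
    a = float(np.prod(partials)) ** (1.0 / 6.0)      # conjunctive fusion, geometric-mean style
    return Gestalt(x=x_s, phi=phi_s, s=s_s, f=2, a=a)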

2.5 The Role of Proximity in Evidence for Reflection Symmetry

Reisfeld et al. propose in [10] a variant of the score function aφ Eq. (2.3) for accumu-
lating evidence for the degree of symmetry to be assigned to a location. In their work
objects are just single pixels, but for consistency the same notations as in Fig. 2.3
may be used. Then the position of the central pixel is xs between x p and xq . They also
propose an additional score function punishing distance. Using common sense this
is clear: The number of pixels rises linearly with their distance from xs . Therefore,
the probability of occurrence of symmetric pairs by chance also rises. The larger the
distance the less meaningful is a reflection symmetry. In Gestalt psychology this law
is known as the law of proximity (German “Nähe”). Rather heuristically Reisfeld et
al. choose D( p, q) = exp(−|x p − xq |) as proximity score function. This function
fulfills important properties that are natural for a proximity score, namely:
• Being 1 for |x p − xq | approaching 0.
• Asymptotically approaching 0 for |x p − xq | approaching ∞.
• Being continuous and differentiable.
• Decaying faster than linear—has a finite integral.
• D( p, q) = D(q, p).
But, there are two problems with simple negative exponential decay: First it has
no foundation in probability calculus, e.g., exp(−|x p − xq |2 ) which is known from
normal distributions; second, and more serious, there is a semantic problem with
the first property. The term “proximity” has more in common with “neighborhood”.

Actually, D(g p , g p ) should not be one, because g p may be as close as it is possible


to itself but it is not in perfect proximity to itself. The same holds for the classical
German Gestalt term “Nähe”. If both objects g p and gq also have a size or scale s p
and sq , we may instead demand that the objects should just touch each other for being
in perfect “proximity”. Thus the first property changes to:
• Being 1 for |x p − xq | = (s p + sq )/2.
• Being 0 for |x p − xq | = 0.
This new property also has a flaw: Scales are elements of the multiplicative group
on (0, ∞); there should never be addition or something like averaging with such

elements. Instead it is proposed to use the geometric mean √(s_p · s_q). A quite natural way of capturing these properties by a continuous score function is:

a_d(g_p, g_q) = exp( 2 − |x_p − x_q| / √(s_p · s_q) − √(s_p · s_q) / |x_p − x_q| ).     (2.5)

This function is not defined for |x_p − x_q| = 0, but since a_d(g_p, g_q) → 0 for |x_p − x_q| → 0, we may set a_d(g_p, g_q) = 0 for this case without violating continuity. The proximity assessment function (2.5) was used in [8, 11]. As a heuristic this is perfect; however, to our knowledge there is no probabilistic semantics in it—no known standard density has this form. The new properties are, however, perfectly met by a score function that is derived from a Rayleigh density:

a_d(g_p, g_q) = e · ( |x_p − x_q| / √(s_p · s_q) ) · exp( − |x_p − x_q|² / (s_p · s_q) ).     (2.6)

Figure 2.5 compares the shapes of these three possibilities. Function 2.6 was used
as proximity assessment function in [15]. Motivated by these examples we set the
following definition:
Definition 2.4 A function a_d : G × G → [0, 1) is called proximity assessment iff for all g_p, g_q ∈ G: |x_p − x_q| = 0 ⇒ a_d(g_p, g_q) = 0, |x_p − x_q| = 1 ⇒ a_d(g_p, g_q) = 1, |x_p − x_q| → ∞ ⇒ a_d(g_p, g_q) → 0, and a_d(g_p, g_q) = a_d(g_q, g_p).

Fig. 2.5 Possible choices for proximity assessment functions: solid two thresholds (for “near” and
“far”), dotted · · · ad as in Eq. (2.5), dashed −− ad as in Eq. (2.6)

The above examples Eq. (2.5) or (2.6) are recommended as default settings. When
performance optimization on a representative data set is the goal, also more specific
functions or their parameters may be learned. Chapter 13 treats possible machine
learning approaches for this problem. In addition, proximity assessment functions
can be motivated by the minimum-description-length ansatz, see Sect. 2.10 below.
Next to being plausible, utilization of a proximity law bears one other advantage—
it may reduce algorithmic complexity. Listing all pairs (p, q) will be of quadratic
computational complexity in terms of image size. If a proximity function fulfills the
above listed properties a threshold can be given for distances. If two points (x p , xq )
are further away from each other than that distance, their evidence will be negligible.
So around each g p only a search window of fixed size must be listed looking for
partners gq . This can be coded such that only linear effort results with respect to the
image size.
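The claimed linear effort can be realized with a fixed-radius neighbor query, as in the following sketch (Python with SciPy's cKDTree; the cutoff factor is our own choice, justified only by the proximity assessment becoming negligible beyond it).

import numpy as np
from scipy.spatial import cKDTree

def candidate_pairs(locations, scales, cutoff_factor=5.0):
    """Return index pairs (p, q) whose distance is small enough to matter."""
    radius = cutoff_factor * np.max(scales)   # beyond this the proximity score is negligible
    tree = cKDTree(locations)
    return tree.query_pairs(r=radius)         # set of (p, q) with p < q

# toy usage
rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=(200, 2))
print(len(candidate_pairs(pts, np.full(200, 4.0))))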

2.6 The Role of Similarity in Evidence for Reflection Symmetry and How to Combine the Evidences

Figure 2.1 second row already gave some evidence that additional features such as
size and color (or gray-tone) may help in perceiving the foreground Gestalt on clut-
tered background. An example for a higher dimensional feature space for measuring
similarity of objects is the descriptor space of the scale-invariant feature transform
(SIFT). Loy and Eklundh give a permutation of the dimensions of this 128 dimen-
sional space that can map reflection symmetry [16].
A rather extreme example for a similarity feature is given by Kondra et al. [17].
Here the image patch around a SIFT key location is taken as a high-dimensional
feature—with all its colors. Similarity with other patch objects is then measured in
terms of correlation.
If super-pixel segmentation is used as primitive extraction method—as in many
examples presented throughout this work—each super-pixel will feature an average
color. Also the second moments of the segment are calculated in order to get the
orientation feature (as arctan 2 of the eigenvector corresponding to the larger eigen-
value). So the ratio of the eigenvalues will be at hand. It is called elongation (or
eccentricity) feature. This feature is bounded between zero and one.
Exemplarily the following continuous function is considered:

a_s(g_p, g_q) = exp( 2 − s_p/s_q − s_q/s_p ).     (2.7)

This function will take value one if both partners have equal scale (or size) feature.
It will be large if the scales are similar, and approach zero if the scales are very
dissimilar. Motivated by this example we set the following definition

Definition 2.5 A function a_s : G × G → [0, 1] is called similarity in scale assessment iff for all g_p, g_q ∈ G:
• s_p = s_q ⇒ a_s(g_p, g_q) = 1,
• s_p/s_q → ∞ ⇒ a_s(g_p, g_q) → 0, and
• a_s(g_p, g_q) = a_s(g_q, g_p).

This kind of function is compatible with the other assessment functions given in
Definitions 2.1 and 2.4. As motivated in Sect. 1.3 any Gestalt throughout this work
has position x, orientation φ, scale s, frequency f , and assessment a features. The
following combination or fusion of the corresponding assessments can be set:

a_combined = a_φ(g_p, g_q) · a_d(g_p, g_q) · a_s(g_p, g_q) · a_f(g_p, g_q) · a(g_p) · a(g_q).     (2.8)
Here aφ is a mirror orientation assessment (Definition 2.1), ad is a proximity
assessment (Definition 2.4), as is a similarity in scale assessment (Definition 2.5),
a_f = 0 ↔ f_p ≠ f_q, and a_f = 1 else.
The fusion Eq. 2.8 is only one of many possibilities. It is a t-norm as introduced
in Sect. 1.4. Thus, the assessment functions can be interpreted as membership func-
tions in a fuzzy-set approach, and this would be a classical conjunction, a logical
“and”. Multiplication of assessments also allows a probabilistic interpretation: It
is a Bayesian fusion under independence assumption. In Sect. 2.10 the minimum-
description-length rationale for reflection symmetry is given. There it becomes evi-
dent that each feature of the Gestalt domain contributes its own gain in the number
of saved bits. Since the information domain (number of bits) is logarithmic, this
corresponds to multiplication of independent assessment functions as in Eq. 2.8.
However, there is a decisive practical disadvantage: With such fusion function
the combined assessments will tend to decline with rising number of components
under consideration. In the fuzzy-set community people prefer the maximal t-norm
for conjunctive fusion, which is acombined = min (a1 , . . . , an ). However, this would
still tend to decline. Throughout this book we prefer

a_combined = ( a_φ(g_p, g_q) · a_d(g_p, g_q) · a_s(g_p, g_q) · a_f(g_p, g_q) · a(g_p) · a(g_q) )^{1/6}.     (2.9)
This violates the identity role of 1. Therefore it is not a t-norm. The other properties
however, are fulfilled. It is a conjunctive fusion, because if any of the partial assess-
ments is zero the combination will also be assessed zero, and if all assessments have
value one (optima) the combination will also reckon as one. In fact for any 0 < a < 1
holds: If all the partial assessments are a the fusion will also be a. In Eq. 2.9 1/6 is
used as exponent. This does not change rank-orders. It is merely a heuristic measure
recommended when nested symmetries are considered (see Sect. 2.7). In such a case
the assessments remain in the same order of magnitude independent of the depth of
the nested hierarchy. Notice that the assessment features of the parts themselves are

combined with the other mutual assessments. Thus, assessments are also somehow
inherited from bottom to top, i.e., from primitives to aggregates. It is also possible to
introduce weight parameters with each partial assessment in Eq. 2.9. Then a rationale
is required for the adjustment of such parameters; Sect. 13.2 treats this in more detail.
Motivated by the example (2.9) we set the following definition:
Definition 2.6 A function a : [0, 1]⁶ → [0, 1] is called conjunctive assessment combination iff a(1, . . . , 1) = 1, and a(a_1, . . . , a_6) = 0 if any a_i = 0.
Additional similarities can be added to the definition if additional features f are at
hand. Many such features are defined in intervals in vector spaces (e.g., colors or
whole resampled patches with their pixel colors). There are maximal and minimal
values for each dimension. Then there is a maximal possible Euclidean distance d_max. So one may set a_f(g_p, g_q) = 0 if |f_p − f_q| = d_max and a_f(g_p, g_q) = 1 if |f_p − f_q| = 0. Between these extremes the function may be linear, for instance.
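A small sketch of the scale similarity of Eq. (2.7) together with such a linear assessment for a bounded feature vector (Python/NumPy; d_max has to be supplied by the application, and the RGB example values are our own):

import numpy as np

def scale_similarity(s_p, s_q):
    """Similarity in scale, Eq. (2.7): 1 for equal scales, towards 0 for very different scales."""
    return np.exp(2.0 - s_p / s_q - s_q / s_p)

def linear_feature_assessment(f_p, f_q, d_max):
    """1 for identical feature vectors, 0 at the maximal possible distance, linear in between."""
    return 1.0 - np.linalg.norm(np.asarray(f_p) - np.asarray(f_q)) / d_max

print(scale_similarity(4.0, 4.0), scale_similarity(2.0, 8.0))       # 1.0 and a small value
print(linear_feature_assessment([255, 0, 0], [250, 10, 5], d_max=np.sqrt(3) * 255))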

2.7 Nested Symmetries Reformulated as Successive Scoring on Rising Scale

It is known and emphasized by many authors that Gestalt perception comes in hier-
archies on rising scales [3, 9]. There may, for instance, be reflection symmetric
aggregates whose parts are again reflection symmetric aggregates on their own and
so forth. Within such hierarchies also different Gestalt laws may apply on the differ-
ent aggregation levels. Figure 2.6 presents an example again in the style of Fig. 2.1,
i.e., first 2D location only, then additional orientation and in the lowest frame with
size and gray-tone.
Here the primitive Gestalten follow first a row or frieze law (see Chap. 3) and
then the two row Gestalten are arranged in a reflection symmetry. The clutter level
is the same as in Fig. 2.1, rightmost column, i.e., 160 background objects. How-
ever, the saliency is much stronger here than in the right column of Fig. 2.1. Note
that deviations in orientation, gray-tone and size are larger here than in Fig. 2.1.
Obviously human observers have a strong preference for nested hierarchical Gestalt
organizations.
Gestalt researchers have been aware of this for a long time; however, it seems
to be difficult to extend strict mathematical approaches—such as the a contrario
test ansatz as presented in Sect. 2.9—to hierarchies. Most natural for hierarchical structures is of course a syntactic approach. This holds in particular for generating,
i.e., rendering scenes. However, for recognition of nested symmetries from noisy
and cluttered imagery, the syntactic approaches are notoriously unstable and often
demand intractable computational efforts.
The book at hand proposes to define the grouping of parts into aggregates along
the lines of the Gestalt laws as algebraic operations instead. Following this rationale,
any part can be combined with any other part. Thresholds controlling the combina-
torial growth are avoided. They have been a major source of instability for syntactic

Fig. 2.6 A (shallow) hierarchy of Gestalten

approaches in the past. Instead, continuous assessment functions are used for each
Gestalt law and combined by multiplication. Hierarchies of nested symmetries then
come naturally as algebraic closure.
Exemplarily, the search for nested reflection symmetry Gestalten is discussed
with data extracted from an aerial image. Figure 2.7 displays two important steps in
the extraction of primitive level-0 Gestalten: Given an aerial image—in this case a
Google Earth image of a part of Santa Barbara, California—a standard super-pixel
segmentation was performed using the standard MATLAB implementation [18] (with
recommended parameter settings). More details about this extraction method can be
found below in Sect. 11.2. Approximately 1500 segments result, which are cleared
from isolated small regions. They feature the mean color of the segments, but are
displayed in the upper part of the figure using their mean intensity only. The following
features are stored for the primitive Gestalten: mean location (first moment) as x,
size (square root of pixel number) as s, orientation as φ ∈ [0, π), and elongation
e ∈ [0, 1), where the latter is obtained via eigenvalue decomposition of the second
moment. Also stored with the primitives is the mid-color feature in RGB space. The
lower part of the figure gives these features as ellipses with again only the intensity
displayed instead of the color.
It can be seen that much information is lost during the primitive extraction in both
steps, the super-pixel segmentation, and the simplification of the segments using
the few features mentioned above. This corresponds to a drastic compression in the
number of bits representing the image. However, the main Gestalten salient to the
human eye are preserved, and the figure in a certain way emphasizes them—almost
like in abstract pieces of painting.
Gestalt algebra requires an additional assessment feature for the primitives to be
used as level-0 Gestalten. In this case this was obtained by reckoning the mid-color
difference between a segment and its neighbors. Recall that the super-pixel segmen-
tation yields an adjacency graph. Homogeneous image regions will be decomposed
in a hexagonal grid with each segment in it having a similar color as its six neighbors.
These primitives are of course meaningless.
Algorithm 1 enumerates a finite subset of the algebraic closure of the binary
operation | on an input set of primitives (level-0 Gestalten). As input it also requires
a threshold 0 < θ < 1 controlling the computational effort. Another possibility is
setting a maximal number m of acceptable Gestalten. Then inside the repeat loop
sorting with respect to the assessments is required. Only the best m Gestalten on
each level will be kept—before the level is incremented. In Fig. 2.8 this was done
with m = 500. Figure 2.8 (a) displays the level-0 Gestalten as they come from the
primitive extraction, the gray-tone codes assessments with black being good and
white being bad, (b) level-1, (c) level-2, (d) level-3. It can be observed that the
Gestalten are getting larger with rising level, and they are concentrating more and
more on the most salient region.
Assessments are displayed in Fig. 2.8 in gray-tones following the conventions
given in Sect. 1.3. On the first levels assessments are getting better with the level
depth. This could not be possible with conjunctive assessment fusion following
Eq. 2.8. However, we are preferring the version Eq. 2.9 with the sixth root.

Algorithm 1 Pseudo-code for stratified enumeration of nested hierarchies of reflection symmetric Gestalten

input: Set of Gestalten inSet, assessment threshold θ
output: Set of Gestalten outSet
level ← 0
outSet(level) ← inSet
repeat
  level++
  outSet(level) ← ∅
  for all pairs {g_p, g_q} ∈ outSet(level−1) do
    g_s ← g_p | g_q
    if a_s > θ then
      outSet(level) ← outSet(level) ∪ {g_s}
    end if
  end for
until outSet(level) empty
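A direct transcription of Algorithm 1 into runnable Python is sketched below (our own code; reflect is assumed to be a binary callable implementing the operation |, for instance the earlier sketch of Sect. 2.4 with the assessment functions bound via functools.partial, and g.a is the assessment feature). The optional pruning to the best m Gestalten per level corresponds to the variant used for Fig. 2.8.

from itertools import combinations

def stratified_enumeration(in_set, reflect, theta=0.5, max_per_level=500):
    """Stratified enumeration of nested reflection hierarchies (Algorithm 1)."""
    levels = [list(in_set)]
    while levels[-1]:
        new_level = []
        for gp, gq in combinations(levels[-1], 2):    # all unordered pairs of the previous level
            gs = reflect(gp, gq)
            if gs.a > theta:
                new_level.append(gs)
        # optional pruning: keep only the best m Gestalten, as done for Fig. 2.8
        new_level.sort(key=lambda g: g.a, reverse=True)
        levels.append(new_level[:max_per_level])
    return [lv for lv in levels if lv]                # drop the final empty level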

Further below, in Sect. 5.2, this effect will be studied in more detail on synthetic random data.
Generally, the Gestalten are growing in scale with level depth. Well-assessed small
Gestalten are impossible on deeper levels. And there is less uniformly distributed
clutter with growing in scale. Things concentrate on the salient regions. This was
observed on many images, not only on aerial imagery, and not only on this example.
Beginning with level-3 the assessments are declining. Level-4 is already empty (with
θ = 0.5). The best Gestalt appears on level-2. It is displayed in an enlarged scale in
Fig. 2.8e with its predecessors. Almost all good Gestalten at this level cluster closely
around the position an observer would mark as the most salient location. With respect to this aspect, this trial would thus be counted as a success.
If such observer would be asked to mark the most salient reflection axes, he
or she would probably only give a vertical and/or a horizontal axis intersecting at
that location. Empirical psychological investigations show strong priors for these
directions in human perception. Including such prior in automatic assessment and
search procedures would be easy. However, we avoid it throughout the book. For
aerial imagery such directions induced by gravity direction are meaningless anyway.
Instead the North direction—here it coincides with the vertical direction—might be
a useful preference which might help sometimes. Such preferences are one of the
topics addressed in Chap. 12.
The automatic enumeration outlined above will also find very many oblique axes.
Moreover, the decomposition into parts will not always coincide with manual decom-
positions plausible to human observers. Figure 2.8e displays the decomposition (in algebraic wording the term, in syntactic wording the parse tree) that leads to the best level-2 Gestalt. As a Gestalt term, this object would have the form

( (p_1|p_2) | (p_3|p_4) ) | ( (p_5|p_6) | (p_7|p_8) )     (2.10)

Fig. 2.7 Super-pixel segmentation of a salient nested mirror symmetric building complex in Santa
Barbara, California; upper part: super-pixels without isolated fragments; lower part: resulting Gestalt
primitives featuring also color and elongation

Fig. 2.8 Hierarchy of constructed mirror Gestalten

In Sect. 5.1 below such terms are analyzed in more depth. However, we may remark here
already that, due to the commutativity of the operation |, this term may be reordered in
128 different ways without touching the identity of the corresponding Gestalt; the
commutativity operates on each sub-term. This is why some algebraic understanding
helps when coding a search algorithm on sets of Gestalten such as Algorithm 1.
Finally, recall that here only reflection symmetric orientations, proximity, similarity
in size, and level-0 assessments are considered. If additional features, e.g.,
color and eccentricity, are included, the result will be more stable and more plausible.
Most important, no top-down comparison through the hierarchy and no
adjustment of features through the hierarchy was performed here. This will be the subject
of Sect. 5.3.

2.8 Clustering Reflection Symmetric Gestalten with Similar Axes

In Sect. 2.5 the Gestalt law of proximity was discussed with respect to the two parts
of a reflection symmetric Gestalt. Here proximity is discussed with respect to the
reflection Gestalten themselves. These have not only a location feature but also an
orientation in [0, π). Actually, it is natural to consider such objects as mutually
consistent and supporting if their axes are fairly collinear. Such a configuration is
shown in Fig. 2.9 with the reflection Gestalten s_1 and s_2. The question is: how far is
this valid? Should, e.g., s_3 also be a part of the cluster?
The Gestalt law of good continuation here means that mirror Gestalten with
roughly collinear axes in close vicinity should be grouped into a cluster (prolonging
the axis). In accordance with common practice in machine vision, the axis is
stored as a homogeneous 3-vector (a_1, a_2, a_3) ∈ ℙ². It contains the parameters of the
corresponding line equation a_1 x + a_2 y + a_3 = 0. Since this is a homogeneous
representation, it may be multiplied by any nonzero real number without changing
its identity. Two ways of canonical representation are recommended: forcing the
2D normal (a_1, a_2) to unit length (the Hesse form), so that the third entry a_3 gives
the displacement from the coordinate origin; or forcing the whole vector
(a_1, a_2, a_3) to unit length.
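The two canonical representations can be written down in a few lines. The following sketch assumes NumPy and a line given as a 3-vector (a1, a2, a3); it only illustrates the normalizations described above.

import numpy as np

def hesse_form(line):
    # scale so that the 2D normal (a1, a2) has unit length;
    # |a3| then is the distance of the line from the coordinate origin
    line = np.asarray(line, dtype=float)
    return line / np.hypot(line[0], line[1])

def unit_form(line):
    # scale the whole homogeneous 3-vector to unit Euclidean length
    line = np.asarray(line, dtype=float)
    return line / np.linalg.norm(line)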
The projective plane ℙ² is a 2-manifold with its own specific topology. However,
no metric can be given on it. Thus the usual method for clustering such entities—the
Hough transform—cannot be recommended. For the Hough transform a bin-raster is
initialized with each bin representing a specific orientation in [0, π) and offset from
the origin. Several parameters have to be chosen, for example 180 intervals for orientation,
64 intervals for offset, and the origin in the upper left image corner. Then for every
element in the set under consideration the corresponding bin is incremented. Accordingly,
there is only one linear loop through this set. Then the bin with the maximal
count is chosen as result. We admit that this procedure has very low computational
effort. However, that is about its only advantage.
The results of proximity tests in projective domains will be very sensitive to
changes in the choice of the coordinate system, if vector distances are used [19].
If, e.g., a, b ∈ ℙ² and we restrict the representations to unit Euclidean length
|a| = |b| = 1, the Euclidean distance d = |a − b| may still take very different values
depending on the choice of the origin of the coordinate system. If we use the Hesse
Fig. 2.9 The problem of mirror Gestalt clustering
normal form, we will actually compare angles with the first two coordinates and
distances from the origin with the third coordinate; i.e., we would arbitrarily set
a scale, which is a concept that makes no sense for an axis, and thereby arbitrarily set a
weight balancing deviation in offset against deviation in orientation. The only distance
definition that does not depend on the choice of the coordinate system would be the length
of the geodesic curve connecting a and b.
When a set of planar projective points X = {x_1, . . . , x_n} is to be worked upon,
Hartley and Zisserman recommend using the center of the image or of the objects as
coordinate origin and setting the scale such that the standard deviation from the center
is one [19]. Then Euclidean distances are a good approximation, and geodesic
curve distances can be avoided. Reflection symmetry Gestalten, as they are displayed
in Fig. 2.9, can be treated in this way. They combine location features x with axis
features a. Using these coordinates and a threshold θ, a pair of such Gestalten (g_s, g_t)
can be tested for axis consistency using θ > |a_s − a_t|.
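A minimal sketch of this normalization and of the axis consistency test may look as follows. It assumes NumPy; the choice of the overall standard deviation as scale is one simple reading of the recommendation in [19], and the sign ambiguity of the homogeneous vectors is resolved by taking the smaller of the two possible distances.

import numpy as np

def normalize_frame(points):
    # shift the centroid to the origin and scale so that the standard
    # deviation of the coordinates is one
    points = np.asarray(points, dtype=float)
    return points.mean(axis=0), points.std()

def axis_in_frame(axis, centroid, scale):
    # substitute x = scale * x' + centroid into a1*x + a2*y + a3 = 0 and
    # force the resulting 3-vector to unit length
    a1, a2, a3 = axis
    a = np.array([a1, a2, (a3 + a1 * centroid[0] + a2 * centroid[1]) / scale])
    return a / np.linalg.norm(a)

def axes_consistent(a_s, a_t, theta):
    # test theta > |a_s - a_t|, resolving the sign ambiguity
    return min(np.linalg.norm(a_s - a_t), np.linalg.norm(a_s + a_t)) < theta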
For a set of such Gestalten S = {s_1, . . . , s_n} we can thus check pairwise consistency
and determine an average axis a_A = Σ_i a_i. Along that axis, i.e., using the
homogeneous term a_{A,2} x_1 − a_{A,1} x_2 + a_{A,3} x_3, the extreme locations can be found:
the endpoints of the symmetric cluster. The axis and endpoints corresponding to the
largest consistent subset S form a proper output for evaluations such as [3]. Such a
solution avoids the awkward problems that come with accumulators and bins, as in
Hough transform approaches for instance. However, it would in principle require
listing the power set of the set of reflection symmetry Gestalten obtained from an
image.
Since this is intractable, a greedy search is performed instead. It is outlined in
Algorithm 2. It relies on the assessment feature that comes with reflection symmetry
Gestalten. Starting with the best one as a seed, it selects those partners that are
consistent with it. The first cluster is constructed from this inlier set.
Then this set is removed, and the procedure is repeated with the best of the remaining
Gestalten, until the set is empty.

Algorithm 2 Pseudo-code for greedy clustering of reflection symmetric Gestalten
input: inSet, θ
output: outSet
transfer2homogeneousCoord(inSet)
storeEndings(inSet)
outSet ← ∅
workSet ← inSet
while workSet not empty do
  bestGestalt ← pickBestAssessed(workSet)
  inlierSet ← consistentWith(workSet, bestGestalt, θ)
  newCluster ← determineClusterFeatures(inlierSet)
  outSet ← outSet ∪ newCluster
  workSet ← workSet \ inlierSet
end while
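A Python sketch of this greedy procedure is given below. It assumes Gestalt objects that carry a normalized homogeneous axis in an attribute axis and their assessment in an attribute a (hypothetical names), and it omits the determination of the cluster endpoints.

import numpy as np

def greedy_axis_clustering(gestalten, theta):
    work = sorted(gestalten, key=lambda g: g.a, reverse=True)
    clusters = []
    while work:
        seed = work[0]                        # best-assessed remaining Gestalt
        inliers = [g for g in work
                   if np.linalg.norm(g.axis - seed.axis) < theta]
        mean_axis = np.mean([g.axis for g in inliers], axis=0)
        mean_axis /= np.linalg.norm(mean_axis)
        clusters.append({"axis": mean_axis, "members": inliers})
        work = [g for g in work if g not in inliers]   # remove the inlier set
    return clusters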

Figure 2.10 displays such reckoning on a typical reflection symmetry benchmark
image (displayed in (a) without colors). For primitive extraction again standard
super-pixel segmentation was used, as in Sect. 2.7 above. For a small image
like that, approximately five hundred segments result, which are cleaned from isolated
clutter and displayed in (b) using their average intensities. Primitive Gestalten
feature location, orientation, size, eccentricity, and color. In (c) they are displayed
as ellipses with intensities. Eccentricity and color features are not used, but each
primitive Gestalt has an assessment feature shown as gray-tone in (d). Then all pairs
are listed and combined using the assessments as they are defined in Eqs. 2.3 and
2.7, but without a proximity assessment. The five hundred best resulting reflection
symmetry Gestalten are clustered using Algorithm 2 with θ = 0.1, and the result is
displayed in Fig. 2.10e.
The best axis cluster is displayed in lighter color with its direction and extreme
locations. The line thickness is proportional to the number of members accumulated
in a cluster. The ten best clusters are displayed. It can be seen that the result on this
image is acceptable as a success, and quite stable. Note that only the information present
in (d) was used for this.
Fig. 2.10 An example image from the symmetry benchmark http://symmetry.cs.nyu.edu/: a image; b super-pixel segmentation; c primitives with color and eccentricity; d assessed primitive Gestalten; e ten most dominant axis clusters

Such clustering (but with use of the proximity law and also on level-2 Gestal-
ten) was used, e.g., in [11, 12] in order to obtain recognition rates on benchmarks.
However, the following points should be kept in mind here:
• In particular without proximity law, there is a considerable bias for a vertical sym-
metry axis in the image center. This particular setting can accumulate most votes
because it allows the largest regions on both sides (in a landscape-format image).
Unfortunately, the benchmarks also have a strong bias for vertical symmetry axes
in the center. The result of accumulator or cluster methods depends seriously on the
location of the object with respect to the image margins—violating shift invariance
(see Chap. 1).
• With accumulators or cluster methods the detection of the object depends on its
size with respect to the image. In a 3000 × 2000 image a reflection symmetry of
100 pixel size is hard to detect, while in a 300 × 200 image it is much easier. This
violates scale invariance (see Chap. 1).
• Intuitively, a façade like the one presented in Fig. 2.10 contains a reflection sym-
metry not just of arbitrary objects on left and right wing, but of lattices of windows.
There is a nested hierarchy of Gestalten present that follow different ordering laws
on different scales.
A proper theory for accumulating or clustering evidence for reflection symmetry
is given below in Sect. 2.9 with the formalism of a contrario testing. This leads in
particular to the inference of correct, meaningful thresholds. Heuristics are avoided.
We close this section by remarking that clustering of axes with support segments
along each part element is very closely related to linear contour prolongation—
including gap closure. That topic is treated in detail below in Chap. 8. In particular,
Sect. 8.4 provides an efficient method for that purpose. The mathematical
model for it can be defined as a Gestalt operation of its own—see Sect. 8.3. There
is a difference in scale, which is irrelevant for our approach; there is also a
difference in that the parts may have very different sizes in the symmetry clustering case.
Viewed in this way, clustering of reflection axes is just another case of hierarchical
Gestalt operation application. The corresponding term takes the form
$$\Lambda_{i=1}^{n}\bigl(\mathbf{p}_{i,1}\,|\,\mathbf{p}_{i,2}\bigr). \qquad (2.11)$$

Algebraically the full permutation group operates on the indices i ∈ {1, ..., n} without
changing the aggregate at all, and of course this must be multiplied by the two
possibilities yielded by the commutativity of the operation | for each of the parts. The
standard way of setting the orientation of a reflection Gestalt in Eq. 2.3 is from part
to part, i.e., perpendicular to the axis. On the other hand Λ, as defined in Definition
8.1, prefers orientations collinear with the aggregated line. This needs to be fixed
here by a ninety-degree rotation.
2.9 The Theory of A Contrario Testing and its Application to Finding Reflection Symmetric Patches in Images

The winner of the symmetry contest held along with CVPR 2013 in the category of
reflection symmetric patches was the method of Pătrăucean et al. [20]. This work
was based on the theory of a contrario testing as developed by Desolneux et al. for
Gestalt recognition in a more general sense [9]. This technique sets the null hypothesis
that the image, or a particular patch in it, contains no symmetry and attempts to falsify
this hypothesis—which creates good evidence that there is indeed some symmetry.
This follows the most rigorous branch of empirical science. In other words, it is the
least heuristic approach published in Gestalt recognition to date. It has been applied
with much success to the recognition of straight contours. As null hypothesis, usually
uniform distributions are set.
In the case of symmetry recognition following [9], the brightness gradient direction
is assumed to be uniformly distributed in the orientation space [0, 2π). This domain
has to be understood as a continuous group with closed topology, 0 and 2π being
identical. Normal distributions do not exist on this domain, and something like
a mean cannot be defined for every set of such orientations. The uniform distribution
exists with density 1/(2π) on the whole domain. A problem arises for homogeneous

Fig. 2.11 A contrario test for mirror symmetry on a pixel grid
regions, where the gradient is null or very small, and thus the gradient direction is poorly
defined. Pătrăucean et al. identify such pixels by thresholding the brightness gradient
magnitude and exclude them from the testing [20]. Probably, in a picture like
the one presented in Fig. 2.10, the majority of the pixels will thus not be used for the
further inference at all. Moreover, the color is not used.
In contrast to the rationale of the book at hand, the a contrario works [9, 20]
assume the input data to be given on a pixel grid of size n × m. Figure 2.11 shows such a
grid. The null hypothesis sets the gradient orientations independently and uniformly
distributed. The light gray pixels were drawn from such a distribution, with their
orientation feature indicated as a little tail. One statistical test now picks two particular
locations from the grid—the endpoints of the white line. This line is then chosen as
symmetry axis, so that the orientation of each pixel in a region to the left of this line is
compared with the orientation of the corresponding pixel to the right of the line. In
[11] rectangular (half-square) regions are used instead of the circular disk indicated
by black color in Fig. 2.11, but that makes no substantial difference.
The idea is: it is unlikely that a pair of such corresponding pixel orientations
happens to fulfill the reflection constraint Eq. (2.2) by chance. For such a test-primitive
the noise on the orientations must be considered. Eventually, there will be a general
tolerance threshold for that, such as 10% (i.e., ±18° tolerance). Due to the independence
assumption, the probability that the constraint is fulfilled by chance on the whole
half disk—such as in the figure—is very small indeed. For a half disk of radius 17
pixels (as in this figure), the probability of this occurring by chance would be 0.1^h, where
h ≈ ½·π·17² ≈ 450 is the number of pixel pairs tested. For fewer than all of the 450 tests succeeding, the probabilities
are given by the binomial distribution with parameters 0.1 and 450. A threshold
k ≤ 450 may be defined such that the mass of the tail of the binomial distribution
from k on is sufficiently small to meet the level of the statistical test (usually 5% or
1%).
All a contrario works emphasize that care has to be taken at this point: these
extremely small probabilities result only for one such statistical test, i.e., one particular
choice of an axis line or pixel pair. An answer to the question “Is there evidence for
a reflection symmetry in this picture?” requires many such tests. How many there are can
be estimated by bounding the number of appropriate pixel pairs, e.g., simply by
n²·m². A bound for the probability of any such reflection symmetry test succeeding
by chance given the null hypothesis can then be obtained using the Bonferroni inequality:

$$P(\text{detection}) \le n^2 \cdot m^2 \cdot \sum_{j=k}^{450} \binom{450}{j}\, 0.1^{\,j}\, 0.9^{\,450-j}. \qquad (2.12)$$
Here k should be chosen larger accordingly.
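The bound of Eq. 2.12 is easy to evaluate numerically. The following sketch uses SciPy's binomial survival function; the image size and the test level in the example are arbitrary choices, not values taken from the text.

from scipy.stats import binom

def detection_bound(n, m, k, h=450, p=0.1):
    # Eq. 2.12: number of candidate axes times the binomial tail from k on
    return (n ** 2) * (m ** 2) * binom.sf(k - 1, h, p)

# smallest k for which the bound stays below a 1% test level
# (example values for a 300 x 200 pixel image)
n, m = 300, 200
k = next(k for k in range(1, 451) if detection_bound(n, m, k) < 0.01)
print(k, detection_bound(n, m, k))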


2.10 The Minimum Description Length Approach for Nested Reflection Symmetry

There is a possibility to derive plausible assessment functions similar to Eqs. 2.2–2.3 from the epistemic and information theoretic principle of minimum description
length. According to this approach one would prefer that explanation (or model)
for the given image that has the shortest description—in terms of number of bits.
Reflection symmetry recognition following this rationale has been outlined in [9].
So starting from the longest explanation—the list of all extracted primitive objects
with all their features—the task consists in explaining some of the objects (the
foreground objects) with less consumption of bits (fewer features).
Following [19, p. 141 ff] we count the number of bits we may save by introducing
reflection symmetry as model (or compressing code). First example for this are two
Gestalten g p and gq represented by their location and orientation features {x, φ}.
Exemplarily the following choices are appropriate:
• Ten bits per location coordinate and eight bits per orientation yields: 20 (for x p ) +
8 (for φ p ) + 20 (for xq ) + 8 (for φq ) = 56 bits for the uncompressed code.
• For gs = g p |gq only one location feature xs is needed. No residual error results
with respect to the location feature, but a scale (distance) feature ss codes where p
and q are—together with the orientation φs . There will be a residuum with respect
to the orientations φ p and φq and the reflection constraint Eq. 2.1. How many bits
are needed for ss and the residuum? For a large ss more bits are needed rising
logarithmically, but not more than ten. For large deviations from the reflection
constraint, maximally seven bits are needed. However, for small residual angular
deviations fewer bits are needed—again following a logarithmic law. This reckons
to 20 (for x_s) + 8 (for φ_s) + 10 (for s_s) + 8 (for φ_p) + 7 (for the angular residual)
= 53 bits in the worst case. But if g_p and g_q are fairly close to each other, maybe four
bits will suffice to code s_s, and if the orientation features almost fit the constraint
2.1, two bits may be enough for the residuum. In such a benign case 42 bits would
remain for the compressed code.
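The bit counts in the list above can be reproduced with a few lines of code. The logarithmic coding rule used below is only one possible choice that matches the numbers given above; the text does not prescribe a particular code.

import math

def bits_uncompressed(loc_bits=10, ori_bits=8):
    # two Gestalten, each with two location coordinates and one orientation
    return 2 * (2 * loc_bits + ori_bits)

def bits_compressed(distance, angular_residual, loc_bits=10, ori_bits=8):
    # one location, one orientation, one distance (scale), one part
    # orientation, and the angular residual; distance and residual are
    # coded logarithmically and capped at 10 and 7 bits, respectively
    dist_bits = min(10, max(1, math.ceil(math.log2(distance + 1))))
    res_bits = min(7, max(1, math.ceil(math.log2(angular_residual + 1))))
    return 2 * loc_bits + ori_bits + dist_bits + ori_bits + res_bits

print(bits_uncompressed())              # 56 bits
print(bits_compressed(1000.0, 120.0))   # worst case: 53 bits
print(bits_compressed(10.0, 2.0))       # benign case: 42 bits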
From such calculations assessment functions can be constructed which are quite
similar to the ones presented above. Thus heuristic assessments can be replaced by
assessments derived from a sound theory. For the time being we leave that as future
work.

2.11 Projective Symmetry

Plane objects with a bilateral symmetry feature corresponding points because of point
and/or axis reflections. In perspective images these points are related by a planar
harmonic homology [19]. The point transformation x'_i = H x_i can be parametrized
in homogeneous representation as
Fig. 2.12 Point and axis symmetry detected in perspective images of objects with bilateral sym-
metry. Confidence regions of the estimated axis and the vertex are denoted by hyperbola and ellipse

$$\mathbf{H} = \mathbf{H}^{-1} = \mathbf{I} - 2\,\frac{\mathbf{v}\mathbf{a}^{\mathsf T}}{\mathbf{v}^{\mathsf T}\mathbf{a}} \qquad (2.13)$$
with corresponding point coordinates {x_i, x'_i}, the image a of the symmetry axis,
and the vertex v. The transformation matrix H obeys HH = I, and the eigenvalues of
the matrix are, up to a common scale factor, {−1, 1, 1}. The eigenvectors are e_1 = v,
e_2 = a_1^⊥, and e_3 = a_2^⊥, and the axis is the join a = a_1^⊥ × a_2^⊥. Thus the transformation
x'_i = H x_i is an involution, and two pairs of point correspondences determine H.
The parametrization with homogeneous coordinates allows for the representation of
entities at infinity. If the axis or the vertex is at infinity, the transformation is affine.
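A short numerical check of Eq. 2.13 and of the involution property can be written as follows; the vertex and axis in the example are arbitrary illustrative values.

import numpy as np

def harmonic_homology(v, a):
    # H = I - 2 v a^T / (v^T a), Eq. 2.13
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    return np.eye(3) - 2.0 * np.outer(v, a) / (v @ a)

v = np.array([1.0, 0.0, 0.0])      # vertex at infinity in x-direction
a = np.array([1.0, 0.0, -2.0])     # axis: the vertical line x = 2
H = harmonic_homology(v, a)
assert np.allclose(H @ H, np.eye(3))     # involution: H H = I
print(np.linalg.eigvals(H))              # eigenvalues -1, 1, 1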
Figure 2.12 shows two examples of detected axes and point reflections in two
images of symmetric objects and of objects placed in a symmetric arrangement. For
the establishment of point correspondences, the original image and its mirrored
version are considered [21]. Interest points are extracted in the image and in its mirror
image, and the corresponding image descriptors are matched by applying the
RANSAC paradigm with the model Eq. 2.13.
A direct but approximate solution for the homology matrix H and its decomposition
into a and v is obtained by minimizing the algebraic distances (S(x'_i) ⊗ x_i^T) h = 0
with h = vec(H). The maximum likelihood estimation with the parameter constraints
‖a‖ = 1 and ‖v‖ = 1 yields statistically optimal solutions, including estimated covariance
matrices for the estimated parameters, see Appendix A.
In the presence of multiple symmetries, a clustering of the solutions in the parameter
space can be performed, e.g., by the j-linkage algorithm [22] (Fig. 2.13).
Fig. 2.13 Multiple symmetry detections with established point correspondences, estimated axes, and confidence regions of the estimated axes

References

1. Yang Q, Ding X (2002) Symmetrical PCA in face recognition. In: Image processing—2002.
Institute of Electrical and Electronics Engineers (IEEE)
2. Harguess J, Aggarwal JK (2011) Is there a connection between face symmetry and face recog-
nition? In: Computer vision and pattern recognition workshops—CVPRW 2011. Institute of
Electrical and Electronics Engineers (IEEE)
3. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from realworld images competition 2013: summary and results. In: CVPR 2013, Workshops
4. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) ICCV
challenge: detecting symmetry in the wild. In: ICCV 2017, Workshops
5. Pizlo Z, Li Y, Sawada T, Steinman RM (2014) Making a machine that sees like us. Oxford
University Press
6. Wertheimer M (1923) Untersuchungen zur Lehre der Gestalt II. Psychologische Forschung
4:301–350
7. Kanizsa G (1980) Grammatica del vedere. Saggi su percezione e gestalt. Il Mulino
8. Michaelsen E, Yashina VV (2014) Simple gestalt algebra. Pattern Recognit Image Anal
24(4):542–551
9. Desolneux A, Moisan L, Morel J-M (2008) From Gestalt theory to image analysis: a proba-
bilistic approach. Springer
10. Reisfeld D, Wolfson H, Yeshurun Y (1990) Detection of interest points using symmetry. In:
International conference on computer vision (ICCV 1990), pp 62–65
11. Michaelsen E, Münch D, Arens M (2013) Recognition of symmetry structure by use of gestalt
algebra. In: CVPR 2013 competition on symmetry detection
12. Michaelsen E (2014) Gestalt algebra—a proposal for the formalization of gestalt perception
and rendering. Symmetry 6(3):566–577
13. Michaelsen E, Arens M (2017) Hierarchical grouping using gestalt assessments. In: CVPR
2017, Workshops, detecting symmetry in the wild
14. Fisher NI (1995) Statistical analysis of circular data. Cambridge University Press
15. Michaelsen E, Münch D, Arens M (2016) Searching remotely sensed images for meaningful
nested Gestalten. In: XXII ISPRS Congress, (ISPRS Archives XLI-B3), pp 899–903
16. Loy G, Eklundh J (2006) Detecting symmetry and symmetric constellations of features. In:
European conference on computer vision (ECCV), pp 508–521
17. Kondra S, Petrosino A, Iodice S (2013) Multi-scale kernel operators for reflection and rotation
symmetry: further achievements. In: CVPR 2013 competition on symmetry detection
18. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Susstrunk S (2012) SLIC superpixels compared
to state-of-the-art superpixel, methods. Trans Pattern Anal Mach Intell 34(11):2274–2281
19. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge Uni-
versity Press
20. Pătrăucean V, von Gioi RG, Ovsjanikov M (2013) Detection of mirror-symmetric image
patches. In: 2013 IEEE conference on computer vision and pattern recognition workshops,
pp 211–216
21. Tang Z, Monasse P, Morel J-M (2014) Reflexive symmetry detection in single image. In Bois-
sonnat J-D, Cohen A, Gibaru O, Gout C, Lyche T, Mazure M-L, Schumaker LL (eds) Curves
and surfaces. Proceedings of the 8th international conference curves and surfaces, Lecture notes
in computer science, vol 9213. Springer, pp 452–460
22. Toldo R, Fusiello A (2008) Robust multiple structures estimation with j-linkage. In: European
conference on computer vision (ECCV 2008). Springer, pp 537–547
Chapter 3
Good Continuation in Rows or Frieze
Symmetry

About a hundred years ago, Wertheimer opened his classic investigation on Gestalt
perception [1] with a drawing of a row of dots arranged in equal spacing on a straight
line. With Fig. 3.1 we follow this tradition. However, similar to Chap. 2 on mirror
symmetry with its Fig. 2.1, this chapter starts with computer generated synthetic
Gestalten displayed on a background of synthetic random clutter. In addition, the
rows of the figure contain different information: upper row—location only, middle
row—location and similarity (in gray-tone and size), and the lower row—location
and orientation. In contrast to Fig. 2.1, the columns do not contain increasing amounts
of clutter objects but instead rising displacement of the foreground locations.
In all examples, the aggregated foreground Gestalt is a row of five primitive
Gestalten in roughly equidistant positions along a straight line. Thus, the foreground
locations are chosen with four random parameters: two for the location of the center,
and two for the generator vector that maps one part to the next. These are obtained
from a uniform distribution such that the foreground Gestalt fits into the image
margins, and the parts are not too close to each other. The resulting locations are
afterward disturbed by normally distributed, zero mean displacements. Each test
figure is eventually completed by adding a fixed number of uniformly distributed
background primitives. Such a clutter primitive will not be accepted if it is located too
close to one of the foreground Gestalten.
In Fig. 3.2 the density of background clutter objects doubles as compared to
Fig. 3.1. As expected, perceiving the foreground Gestalten is more challenging now,
and sometimes it may fail altogether. Some readers may have difficulties perceiving
the Gestalten in the first row without the aid of the second row, where additional
similarity in gray-tone and size helps. The similarities in orientations—displayed in
the third row—help less.
With a rising number of clutter objects, it becomes harder to pick the correct foreground
subset. For instance, in the example rendered in the upper left graphic of
Fig. 3.2 the reader may well perceive a large S-shaped curve swinging over more
Fig. 3.1 Rows of Gestalten on clutter: left to right rising displacement (1, 2, and 4 units respec-
tively); amount of clutter fixed to 40; first row locations only, second row location and similarity,
third row location and orientation

than half of the frame. Some people might extend it to filling the frame completely.
This particular example thus gives rise to three insights: (1) It can be difficult to pick
the correct first and last element, while enumerating the correct sequence inside is
quite self-evident; (2) the human perceptive system tolerates considerable curvature
in the law of good continuation; (3) some parameters of the human perceptual system
may well be tuned to the utmost edge, easily generating illusions on random patterns.
Even higher is the clutter density in Fig. 3.3. There are 165 objects present in
the frame, and again only five form a row Gestalt. From the locations alone, see
the upper row, almost no human subject will perceive the intended foreground row
Gestalt. With the aid of the second row, where similarity in gray-tone and size helps
distinguishing foreground from background, and taking some more time, one may
see the Gestalt. In the first column it is in the lower left corner, in the second column
it is in the upper right part, very close to the right margin, and in the last column it runs
horizontally close to the bottom right corner.
Fig. 3.2 Rows of Gestalten on clutter: left to right rising displacement like in Fig. 3.1; amount
of clutter now 80 objects; first row locations only, second row location and similarity, third row
location and orientation

3.1 Related Work on Row Gestalt Grouping

Often repetitive patterns positioned equidistantly along a straight line—that is, in good
continuation, German Gestalt terminus “gute Fortsetzung”—are treated as frieze
symmetry in the machine vision community [2, 3]. Throughout this work they are
often called row Gestalten. In an algebraic view on this topic the corresponding
shift operation defines a group. Accordingly, the pattern must be imagined as infinite
repetition on both sides. In contrast to this, we are interested in finite patterns where
there is a first and a last member. For instance, we analyze an aerial image that
contains a salient row of buildings. They cannot be modeled as an algebraic group
because the corresponding shift maps most of the part Gestalten on each other—but
not the last and the first. Yet, an algebraic perspective—e.g., concerning subgroup
hierarchies, commutativity, inverse—helps understanding the phenomena.
Apart from building rows in remotely sensed data, row Gestalten are almost ubiq-
uitous in our environments. They appear in architecture, e.g., facades, and indoor
scenes, as well as on animals and plants. In fact, repetition in rows is the standard
example for Gestalt phenomena from the very beginning [1].
Fig. 3.3 Rows of Gestalten on clutter: left to right rising displacement like in Figs. 3.1 and 3.2;
amount of clutter now 160 objects

3.2 The Row Gestalt as Defined on Locations

Figure 3.4 displays a group of objects placed in good continuation as intended by
the corresponding Gestalt law. There is a number of parts n > 1; one of the objects,
g_1, sets the beginning, and one other, g_n, sets the end of the group. The law of good
continuation demands that there is a common 2D vector v setting the differences
between the objects as they are enumerated:
$$\mathbf{x}_{g_{i+1}} = \mathbf{x}_{g_i} + \mathbf{v}, \quad \forall\, i = 1, \ldots, n-1. \qquad (3.1)$$
It is natural to consider an operation on the indices at this point. The aggregate
Gestalt is considered identical if the enumeration is reversed, i.e., if v is replaced by
−v and i is replaced by n + 1 − i. In an algebraic view this is a subgroup of the group
of index permutations S_n operating on the Gestalt, and the Gestalt is understood as
an equivalence class modulo this group.
For the case n = 2 the symmetric group S2 contains only two elements the identity
and the reversion of enumeration, respectively. Like the reflection symmetric con-
straint for location features only is always perfectly fulfilled for any pair of objects,
see Chap. 2, Sect. 2.1, the row location constraint is always perfectly fulfilled on
any pair of locations {x 1 , x 2 }. In the following we consider the case n > 2 where Sn
Fig. 3.4 Minimizing the sum of squared errors for a row Gestalt

contains n! elements, of which only two give a correct enumeration. In addition, we
will generally assume that in the case n > 2 there will be a displacement vector e_i
for the locations as they are found, measured, or given; i.e., in the task of analyzing
images, the good continuation law 3.1 will almost never hold precisely. There are
several reasons for this:
• The pixel raster may not be consistent with the locations and in particular with v.
• The primitive extraction method may cause displacements, e.g., because of mea-
surement noise.
• The objects may be slightly displaced in the scene, e.g., along a street because
the construction workers may have failed to produce the equal spacing that was
intended by the city planners.
The examples given above indicate that it may be a challenge to model these
deviations properly. Nevertheless, with no further knowledge, it is the best choice
to assume the residual errors ei for each location to be normally distributed in 2D
with zero mean and equal deviation in both directions. Under this assumption, the
most likely features for the aggregate Gestalt r result from minimizing the sum of
the squared residual displacements:
$$(\mathbf{x}_r, \mathbf{v}_r) = \arg\min_{\mathbf{x},\,\mathbf{v}} \sum_{i=1}^{n} e_i^2, \quad \text{which is equivalent to} \quad \frac{\partial \sum_i e_i^2}{\partial \mathbf{x}_r} = 0 = \frac{\partial \sum_i e_i^2}{\partial \mathbf{v}_r}. \qquad (3.2)$$

Here x r is the location feature and vr is its generating vector. Each squared residual
results from the construction 3.1:
  2
 1 
ei =  x gi − x r + (i − 1)vr − (n − 1)vr 
2 
 . (3.3)
2
There is a closed-form solution for the minimization 3.2. It turns out that the
location results as the mean $\mathbf{x}_r$ of all measured part locations. The solution for the
generator vector $\mathbf{v}_r$ turns out to be less intuitive:
$$\mathbf{v}_r = \frac{12}{n^3 - n} \sum_{i=1}^{n} \mathbf{x}_{g_i} \left( i - \frac{n+1}{2} \right). \qquad (3.4)$$

Figure 3.4 shows an example with five part Gestalten whose locations are pre-
sented as solid dots. The resulting set locations are indicated as empty boxes. For an
odd number of parts the set location for the middle part will be the mean of the part
locations and the location feature for the new aggregate Gestalt.
Two more Gestalt features of the newly generated aggregate can now be defined:
(1) its size results as
$$s_r = (n-1)\,\|\mathbf{v}_r\| + \left( \prod_{i=1}^{n} s_{g_i} \right)^{1/n}, \qquad (3.5)$$
where the geometric mean of the sizes of the part Gestalten is added to n − 1 times
the length of the generator; (2) its orientation in [0, π) is given as
$$\phi_r = \arctan\!\left( v_{r,y} / v_{r,x} \right). \qquad (3.6)$$

Furthermore, it is reasonable to set the frequency feature as f_r = 2, because the
Gestalt is considered identical if the enumeration of the parts is reversed. The
minimization of 3.2 also gives a sum of squared residuals, which can be used as
a reasonable assessment component for the newly constructed aggregate:
Definition 3.1 A function a : Gⁿ → [0, 1] is called residual row assessment iff n > 2, and there is a scale parameter τ > 0 with
$$a_{\Sigma(g_1, \ldots, g_n)} = \exp\!\left( -\frac{\tau}{u^2\,(n-2)} \sum_{i=1}^{n} e_i^2 \right), \qquad (3.7)$$
where u is the geometric mean of the scales of the parts.
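The closed-form solution of Eqs. 3.2–3.4, together with the features of Eqs. 3.5–3.6 and the residual assessment of Definition 3.1, can be coded compactly. The sketch below assumes NumPy and locations given in enumeration order; the scale parameter τ is left as an argument.

import numpy as np

def fit_row(locations, scales, tau=1.0):
    """Least-squares fit of a row Gestalt (Eqs. 3.2-3.4) plus the residual
    row assessment of Definition 3.1 (a sketch).

    locations -- (n, 2) array of part locations x_{g_i} in enumeration order
    scales    -- sequence of the n part scales s_{g_i}
    """
    x = np.asarray(locations, dtype=float)
    n = len(x)
    i = np.arange(1, n + 1)

    x_r = x.mean(axis=0)                                   # row location
    v_r = 12.0 / (n**3 - n) * ((i - (n + 1) / 2.0)[:, None] * x).sum(axis=0)

    # set locations and squared residuals, Eq. 3.3
    set_loc = x_r + ((i - 1) - (n - 1) / 2.0)[:, None] * v_r
    residuals = ((x - set_loc) ** 2).sum()

    u = np.exp(np.mean(np.log(scales)))                    # geometric mean scale
    assessment = np.exp(-tau * residuals / (u**2 * (n - 2))) if n > 2 else 1.0

    size = (n - 1) * np.linalg.norm(v_r) + u               # Eq. 3.5
    orientation = np.arctan2(v_r[1], v_r[0]) % np.pi       # Eq. 3.6
    return x_r, v_r, size, orientation, assessment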

3.3 Proximity for Row Gestalten

Already the figures above—in particular Fig. 3.2, middle row—indicate that with
comparable length of the generating vector v, larger dots (rightmost picture) make a
much more salient row than smaller dots (central picture). This is still true even though
the locations in the rightmost picture have more deviation from the good continuation
constraint defined above in Sect. 3.2. An assessment function is needed that scores
the ratio between the Euclidean length of the generator v and some mid-scale of the
parts smid . Without further prior knowledge, and following the same rationale as in
Sect. 2.5, the following assumptions are reasonable:
• The score should be one for |v|/smid = 1—that is for the objects being adjacent;
• The score should asymptotically approach 0 for |v|/smid approaching ∞—that is,
for very large generator vectors;
• The score should also approach 0 for |v|/smid approaching 0—that is, for very
small generator vectors that do not really give a proper row;
• The score should be continuous and differentiable;
• The score should decay faster than linearly, so that it has a finite integral.
In fact, we may use the same assessment functions as they are given in Sect. 2.5
in Formulae 2.5 or 2.6 and plotted in Fig. 2.3. The natural definition for the mid-scale
s_mid of n parts is again the geometric mean: s_mid = (s_1 · … · s_n)^{1/n}. Thus the
following definition is made:
Definition 3.2 A function a : Gⁿ → [0, 1) is called proximity assessment iff for all p_i ∈ G and r = Σ(p_1, …, p_n):
|v_r| = 0 ⇒ a_{Σ(p_1, …, p_n)} = 0,
|v_r| = s_mid ⇒ a_{Σ(p_1, …, p_n)} = 1,
|v_r| → ∞ ⇒ a_{Σ(p_1, …, p_n)} → 0, and
a_{Σ(p_1, …, p_n)} = a_{Σ(p_n, …, p_1)}.
Experience reveals that often the distance between ground-truth part Gestalten forming
a row is a little larger than their scale. For example, for the row shown
in Fig. 3.5 below, the generator is about twice as long as the longer diameter (height)
of the parts. This can either be compensated by introducing an appropriate factor in
the scale of the primitives—this has been done here—or such a factor can be learned
if sufficient example material with ground truth is provided (see Chap. 13).

3.4 The Role of Similarity in Row Gestalten

It was already mentioned above that most humans tend to perceive a large S-shaped
Gestalt in Fig. 3.2, upper left image. With the help of additional information from
the gray-tone and size in the picture below, most people will almost instantaneously
reproduce the ground truth Gestalt: the vertical row of five dots. Such perception
trials indicate that the law of good continuation in its form of the row constraint is
particularly strong when combined with similarity of other features. In the following
different kinds of features are distinguished based on their algebraic properties.
3.4.1 Vector Features

Many features such as gray-tone or colors can be treated as a vector. The situation is
easier than in Sect. 2.6. For example, there is no need to re-arrange the SIFT feature
vector before comparison. Most features of this kind can be directly compared using
cross-correlation. This even holds for image patches: Along the row at each primitive
location, or at each set location, a patch of certain size can be cropped from the image.
Then a common template patch is obtained by averaging, and the row Gestalt can be
re-assessed using the sum of squared differences from all patches to the template:
$$a_{\Sigma(p_1, \ldots, p_n)} = \exp\!\left( -\gamma \sum_{i=1}^{n} \sum_{j \in J} \left\| \mathbf{c}_{i,j} - \mathbf{c}_{t,j} \right\|^2 \right). \qquad (3.8)$$
Here c_{i,j} refers to the color of the jth pixel in the ith patch, c_{t,j} refers to the color of
the jth pixel in the template, and γ is some appropriate constant.
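A minimal NumPy sketch of this template assessment is given below. It assumes that the patches have already been cropped to a common size; the constant γ is a free parameter.

import numpy as np

def patch_similarity_assessment(patches, gamma=1e-3):
    """Template-based similarity assessment of a row (Eq. 3.8).

    patches -- list of n equally sized image patches (numpy arrays),
               cropped at the part or set locations of the row
    gamma   -- weighting constant of the exponential penalty
    """
    stack = np.stack([np.asarray(p, dtype=float) for p in patches])
    template = stack.mean(axis=0)            # common template patch
    ssd = ((stack - template) ** 2).sum()    # sum over patches and pixels
    return np.exp(-gamma * ssd), template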
Figure 3.5 shows an example where such template matching helps. The original
picture (taken from the 2013 symmetry competition benchmark [2]) is displayed in
Fig. 3.5a. It is a gray-tone picture. The usual SLIC super-pixel segmentation [4]
with the goal parameter set to 200 segments yields the set of primitive Gestalten
which is displayed in Fig. 3.5b. These objects only have location, orientation, scale,
frequency, gray-tone, and eccentricity features, respectively. It can be seen that thus
much information is lost. Still the search for row Gestalten, as outlined below in
Sect. 3.5, yields the frieze–Gestalt presented in Fig. 3.5c as one of its best results. In
fact it is the best-assessed row of length 7 or longer (there are some rows of length
8 in the result). However, the result is not very stable, the assessment gap to other
rows is not very dominant, and among the well-assessed long rows there are some
false positives.
Figure 3.6 presents seven image patches cropped from the input image at the
positions of the seven primitives forming the best row displayed in Fig. 3.5c. Matching
and assessing these patches using the similarity assessment Eq. 3.8 give a better
discrimination, as compared to the assessments based on position, scale, orientation,
eccentricity, and gray-tone features alone. In particular, the seventh—rightmost—
element can be clearly distinguished as an outlier. Removing it gives the row the
correct end location with respect to the ground truth of the benchmark.
The average patch utilized as template for Eq. 3.8 is shown as eighth patch in the
figure. In a way it looks a bit blurred, indicating that there is noise on the cropping
locations. Such noise is exactly what should be compensated by using the good
continuation assessment as defined in Definition 3.1 and the preceding constructions of
the position Eq. 3.1 and the generator Eq. 3.4.
Figure 3.7 shows corresponding patches cropped at the set-positions of the aggregated
row instead of the positions of the primitives. Compare to Fig. 3.4, where
part-positions are indicated as solid dots and set-positions as empty squares. Using
the set positions for cropping the patches compensates some of the noise of the prim-
Fig. 3.5 Similarity of patches—an example: a original image from the frieze part of the 2013 competition benchmark [2]; b about two hundred primitives extracted from it with gray-tone, orientation, and eccentricity features; c best long row Gestalt with predecessors and corresponding super-pixel segments
itive extraction. Correlation between the patches is higher, and the average patch
displayed again as eighth patch in the lower right corner is less blurred. It is thus
better suited for similarity assessment.
Fig. 3.6 Patches and templates resulting from the row Gestalt in Fig. 3.5—cropping centers
obtained from the primitive part positions; a–f inliers, g outlier, template (mean patch)

Fig. 3.7 Patches and templates resulting from the same row Gestalt—cropping centers now
obtained from the set positions of the row Gestalt; a–f inliers, g outlier, template (mean patch)

3.4.2 Scale Features

The scale s p of a Gestalt g p is always greater than zero, and the natural operation on
scales is multiplication. They do not form a vector space, and they should never be
added or subtracted. Section 3.3 already gave the geometric mean s_mid as a proper
average for a set of scales.
$$a_{\Sigma(p_1, \ldots, p_n)} = \exp\!\left( 2n - \sum_{i=1}^{n} \frac{s_{mid}}{s_i} - \sum_{i=1}^{n} \frac{s_i}{s_{mid}} \right). \qquad (3.9)$$
This assessment will yield one if and only if all parts have the same scale. Other-
wise it will be less, and if the scales deviate very much it will approach zero. With
such functions in mind the following definition is made:
Definition 3.3 A function a : Gⁿ → [0, 1) is called scale similarity assessment iff for all p_i ∈ G and r = Σ(p_1, …, p_n):
s_1 = … = s_n ⇒ a_{Σ(p_1, …, p_n)} = 1,
s_i / s_j → ∞ or 0 for any i, j ⇒ a_{Σ(p_1, …, p_n)} → 0, and
a_{Σ(p_1, …, p_n)} = a_{Σ(p_n, …, p_1)}.
Actually, the scale similarity assessment should not depend on the enumeration of
the set {p_1, …, p_n} at all.

3.4.3 Orientation Features

All Gestalten g_p have an orientation feature φ_p in some continuous additive rotational
group, such as [0, π). In Figs. 3.1, 3.2 and 3.3 this feature was given in the last row,
respectively. Obviously, similarity in orientations also helps when distinguishing
foreground from background. However, a continuous additive rotation group cannot
be treated like a vector space. We have seen that before in Definition 2.1, where the
orientations of the parts of a reflection symmetric pair were assessed.
Section 1.5 explains how the mean φ_mid of a set of orientations can be constructed
to give a proper average. It is clear that this is not well defined for degenerate
configurations where the parts are completely uniformly distributed, and that it is unstable
around such settings. However, these will be badly assessed anyway, so that for such
situations an arbitrary setting such as φ_mid = 0 suffices.
Definition 3.4 A function a : Gⁿ → [0, 1] is called orientation similarity assessment iff for all p_i ∈ G and r = Σ(p_1, …, p_n):
φ_1 = … = φ_n ⇒ a_{Σ(p_1, …, p_n)} = 1,
and a_{Σ(p_1, …, p_n)} → 0 if any φ_i deviates strongly from the mean orientation φ_mid.
With these definitions at hand, we can formalize the definition of the operation Σ
mentioned in Chap. 1:
Definition 3.5 An n-ary operation Σ : Gⁿ → G is called row symmetry operation iff for all g_1, …, g_n ∈ G:
x_Σ = (1/n) Σ_{i=1}^{n} x_i,
φ_Σ = arctan2(v_Σ) mod π, as resulting from the solution Eq. 3.4,
s_Σ = (n − 1)|v_Σ| + (s_1 · … · s_n)^{1/n},
f_Σ = 2, and
a_Σ is a conjunctive combination of the residual row assessment (Definition 3.1), the orientation
similarity assessment, the proximity assessment, the scale similarity assessment, and the
inherited assessments of the parts.
Algebraic closure and other important formal properties of this operation are
treated below in Sect. 3.5 and in more detail in Chap. 5.
3.5 Sequential Search

Frieze or row symmetry is serial by nature. The eye movements of an observer
will show a considerable tendency to follow the Gestalt as it is enumerated by its
geometric structure presented in Sect. 3.2. Of course, they may as well scan the
pattern in descending index order. In contrast to audio signals, there is no specific
order given by the domain. Eye movements will also be influenced by other stimuli
and may as well be random by nature to a large extent. Let us start with a more
mathematical view on the complexity of this search.

3.5.1 The Combinatorics of Row Gestalten

Throughout this book no particular order will be assumed on the primitives extracted
from an image. Some of the extraction methods given in Chap. 11 may sometimes
provide lists already in good sequence for enumerating rows, but in many other
cases the order will be meaningless for such a search. So the primitive Gestalten
obtained from an image, as well as any higher-order Gestalten, are always treated
as an unordered set. Finding the correct enumeration sequence for r = Σ(g_1, …, g_n) is a
problem. In order to give the reader an impression of the combinatorial nightmare
related to this issue, some definitions will help: Given a set of n Gestalten P we can
define the set of k-row Gestalten of depth level 1 as
$$R_{1,k} = \left\{ r \in G \,;\; r = \Sigma_{i=1}^{k}\, p_{t(i)},\; t \in \{1,\ldots,n\}^k,\; p_{t(i)} \in P \right\}. \qquad (3.10)$$

Here t lists all k-tuples of Gestalten from P. Obviously, this is exponential in k.
Repetitions are allowed, which is correct if Σ is seen as an algebraic operation. Thus
the set of rows of arbitrary length resulting from P is infinite. Repeating the same
Gestalt in a row is, however, meaningless and will decrease the assessment. Details
are discussed below in Chap. 5. Altering definition Eq. 3.10 accordingly gives
$$R_{1,k} = \left\{ r \in G \,;\; r = \Sigma_{i=1}^{k}\, p_{t(i)},\; t \text{ one of the } n(n-1)\cdots(n-k+1) \text{ injective index tuples},\; p_{t(i)} \in P \right\}. \qquad (3.11)$$
This is only possible for k ≤ n. The set of rows of arbitrary length resulting
from P without repetition is finite—but huge. All rows can be listed using the set union
$$R_1 = \bigcup_{k=2}^{n} R_{1,k}. \qquad (3.12)$$
The topic of this book is hierarchical grouping. Our interest is in rows made
from rows and so forth. Thus, Eq. 3.12 only sets the initialization for the recursive
definition
$$R_{j+1,k} = \left\{ r \in G \,;\; r = \Sigma_{i=1}^{k}\, p_{t(i)},\; t \text{ an injective index tuple},\; p_{t(i)} \in R_j \right\}. \qquad (3.13)$$
Then the union R_j is formed in the same way as in Eq. 3.12. Note that the
recursion step Eq. 3.13 will create increasing set sizes with each application. Repetition
of elements may have been prohibited by using Eq. 3.11 instead of Eq. 3.10,
but that was only for the running index i. Already with the union Eq. 3.12, all partial
rows of a longer row will be included in R_1. This set is finite but very big, and each
primitive p_i is contained in many of its rows r. No bound can be given for the running
index j in Eq. 3.13. The set of all R_j is infinite.
This is not meant to be really used as an enumerating search in practical recognition
work, because it obviously causes intractable computational effort. Instead, it sets
the structure in which the hierarchical Gestalten can be described and treated. It is
necessary to demonstrate now how—by use of appropriate assessment functions—a
computationally tractable subset can be given.

3.5.2 Greedy Search for Row Prolongation

Initially all pairs {p_1, p_2} of elements of P will be listed (without repetition), and r = Σ(p_1, p_2)
is evaluated. A threshold condition a_r ≥ θ controls the assessment of these rows
of pairs. This is very similar to one step in the search for reflective symmetry in
Algorithm 1 in Chap. 2. Here such a step is only the starting point for the search.

Definition 3.6 For any 0 < θ ≤ 1, a row Gestalt r = Σ(p_1, p_2) with two parts
p_1, p_2 ∈ P will be called a θ-row-seed in P if a_r ≥ θ.
One might argue that listing all pairs of elements of P is already of quadratic
complexity concerning the size of P. But the assessment comes with a proximity
component following Definition 3.2. For such trivial row-seeds the generating vector
v is simply the difference between the two parts. It cannot be much longer than the
geometric mean of the scales of the two parts. Depending on the form of the proximity
assessment function, there is a certain length factor for any such θ. If the distance is
larger, a_r < θ will result. Accordingly, the search for θ-row-seeds in P only needs to
list one partner and look for its partners only within this certain vicinity. This can
be implemented in sub-quadratic complexity.
Once all row-seeds have been collected, they will be prolonged fore and aft. This
will be controlled again by the assessment. A prolongation will not be accepted if
the resulting longer row has lower assessment than the original had:

• Fore-Prolongation: Given a set P and a row Gestalt r_old = Σ(p_1, …, p_n) of parts
from P, all p_0 ∈ P (but not present in the tuple (p_1, …, p_n)) are tested for r_new =
Σ(p_0, p_1, …, p_n), and the best one is chosen:
$$a_{new} = \max_{p_0 \in P} a_{r_{new}}.$$
If a_new ≥ a_{r_old}, the old row will be replaced by the new row. Otherwise the old row
will be kept, and the new row will be rejected.
• Aft-Prolongation: Given a set P and a row Gestalt r_old = Σ(p_1, …, p_n) of parts
from P, all p_{n+1} ∈ P (but not present in the tuple (p_1, …, p_n)) are tested for
r_new = Σ(p_1, …, p_n, p_{n+1}), and the best one is chosen:
$$a_{new} = \max_{p_{n+1} \in P} a_{r_{new}}.$$
If a_new ≥ a_{r_old}, the old row will be replaced by the new row. Otherwise the old row
will be kept and the new row will be rejected.
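The two prolongation steps can be sketched as follows. The function make_row and the assessment attribute a are hypothetical names for the construction r = Σ(p_1, …, p_k) and its assessment; the restriction of the candidate set to a spatial vicinity, which makes the search tractable, is omitted here.

def prolong(row_parts, candidates, make_row, fore=True):
    """One fore- or aft-prolongation step of the greedy row search."""
    old = make_row(row_parts)
    best_parts, best_a = tuple(row_parts), old.a
    for p in candidates:
        if p in row_parts:
            continue                               # no repetitions
        parts = (p,) + tuple(row_parts) if fore else tuple(row_parts) + (p,)
        new = make_row(parts)
        if new.a >= best_a:                        # keep the best prolongation
            best_parts, best_a = parts, new.a
    return best_parts

def maximal_row(seed_parts, candidates, make_row):
    """Repeat fore- and aft-prolongation until the row stops growing."""
    parts = tuple(seed_parts)
    while True:
        grown = prolong(prolong(parts, candidates, make_row, fore=True),
                        candidates, make_row, fore=False)
        if grown == parts:
            return parts
        parts = grown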
Both procedures will be repeated until nothing changes anymore. Several properties
of this search have to be emphasized:
1. It keeps only maximal rows: All partial rows that are also assessed better than
θ are deleted and forgotten. This implements A. Desolneux’s principle of the
maximum meaningful row (see [5] and Sect. 3.6 below).
2. It is greedy: It decides for the current best partner, which may actually be wrong or
misleading. The current best partner might not be the globally best next element
in the set R_1 as defined above in Eq. 3.12.
3. It can be tractable: Admitting that here with each step possible partners are
searched from P it is noted that the same proximity argument, that was true
when listing all θ-row-seeds in P, also holds here. The search region is even
narrower, because with the old row there was a generator vector v, and using
this the position of a partner with good prospects can be well predicted.
4. Limited depth: The depth will be limited not only by the size of P. If the primi-
tives come from a roughly square image and are roughly uniformly distributed
over it, no good rows much longer than the square root of the size of P are
possible.
5. It finds every row multiply: A good row Σ(p_1, …, p_n) contains n − 1
seeds r = Σ(p_i, p_{i+1}), of which more than one may be better than θ. All of these
will start the search and find the same Σ(p_1, …, p_n) in the end. Care has to be
taken that it is not multiply listed in the resulting set of maximal rows. This may
sound like a waste of resources, but if we only executed, say, aft-prolongation, we
would have a violation of commutativity. Plus, we may have situations where
the first pair fails to exceed θ and thus is not listed among the seeds, while the
whole row is good enough. Moreover, proceeding greedily into a wrong branch
of the search (see point 2 of this list) will be much less probable if we execute
the search from both ends into such a critical location.
Now that the first-level rows have been listed, the search has to proceed deeper
into hierarchical grouping. The combinatorial nature of rows of rows and so forth has
been given with Eq. 3.13. To this end the method outlined above is taken as starting
point:

Definition 3.7 For any 0 < θ ≤ 1 and any finite set of primitives P,
$$R_{\theta,1}$$
is the set obtained by the greedy method given above. It is called the level-1 row set
of P with threshold θ.

And of course the recursion step then follows Eq. 3.13 and reads:

Definition 3.8 For any 0 < θ ≤ 1 and any finite set of primitives P,
$$R_{\theta,j+1}$$
is the set obtained by the greedy method given above applied to $R_{\theta,j}$ instead of P.
It is called the level-(j+1) row set of P with threshold θ.

We are quite sure that this recursion is bounded. Details are given below in
Chap. 5 with Theorem 5.1. However, practical experience shows that intractable
growth of the sets with rising j may well occur. In particular, the choice of θ is critical.
It is crucial to be aware that we are acting in the combinatorial world outlined above
in Sect. 3.5.1. Methods for breaking the growth problem by use of propagation of
constraints through the hierarchy are also treated below in Chap. 5. There is another
alternative: one may replace the constant threshold parameter θ by a constant false
alarm rate.

3.6 The A Contrario Approach to Row Grouping

The uniform background clutter model utilized in the synthetically generated example
Gestalt sets presented in Figs. 3.1, 3.2 and 3.3 can also be used for the statistical
analysis of measured Gestalt sets or of sets inferred from measured data. In this case
the uniform distribution serves as null hypothesis H_0, and the goal will be to reject it
given a predefined test level. We follow here A. Desolneux [5].

3.7 Perspective Foreshortening of Rows

Like in Sect. 2.11, the last section of this chapter will be dedicated to the situation
where the 2D Gestalt is a subset of a plane which is tilted with respect to the viewer,
i.e., the projective case. In this case the perceived generating vector of a row or
Fig. 3.8 Example of perspective foreshortening: a binary version of image 47 of the frieze benchmark using a threshold; b primitives obtained from this; c best foreshortened row subset; d corresponding row Gestalt: observed locations (◦), adjusted set locations with further predicted positions to the left and the right, and estimated vanishing point with its positional uncertainty drawn as a standard error ellipse
frieze will be foreshortened with every step from part to part by a certain ratio—the
cross-ratio. So we now have an index i running with the generator v, and while the
direction of the vectors remains constant, their length will vary from step to step.
In this situation the grouping of Gestalten or parts of Gestalten must take this fact into account
by estimating the corresponding effect. The mapping of adjacent points or straight
lines in a row can be modeled by a special planar homology, assuming that the points
are equidistant and collinear or the lines coplanar, respectively. The homology which
maps a point x_i in homogeneous representation into the subsequent one x_{i+1} reads
$$\mathbf{x}_{i+1} = \mathbf{H}\,\mathbf{x}_i \quad \text{with} \quad \mathbf{H} = \mathbf{I}_3 + \mu\,\mathbf{v}\mathbf{a}^{\mathsf T}, \qquad (3.14)$$
where I_3 is the unit matrix, μ a factor, v the vertex, and a the axis.
For points in a row, we can choose the line at infinity a_∞ = [0, 0, 1]^T as axis a,
and Eq. 3.14 becomes an elation with three degrees of freedom, since v^T a = 0 holds
for all vertices and only the product u = μv can be estimated. The determination of
approximate values for u by minimizing algebraic distances, and the subsequent statistically
optimal estimation, are based on the constraints
$$\mathbf{c}_i = \mathbf{S}(\mathbf{x}_{i+1})\left( \mathbf{I}_3 + \mathbf{u}\,\mathbf{a}_\infty^{\mathsf T} \right)\mathbf{x}_i \qquad (3.15)$$
with two independent constraints per point pair. The utilization of the cross-product
in 3.15 cancels out the homogeneous factors of the observations xi and xi+1 .
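Given an estimate of u, the prediction of further row positions is a simple matrix operation. The following sketch assumes NumPy; the numerical values of u and of the starting point are arbitrary and only illustrate the shrinking step length toward the vanishing point.

import numpy as np

def row_homology(u):
    # H = I_3 + u a_inf^T with a_inf = [0, 0, 1]^T, cf. Eqs. 3.14-3.15 with u = mu*v
    return np.eye(3) + np.outer(u, np.array([0.0, 0.0, 1.0]))

u = np.array([8.0, 2.0, 0.01])          # hypothetical estimate of u
x = np.array([300.0, 200.0, 1.0])       # homogeneous start position
H = row_homology(u)
for _ in range(4):
    x = H @ x
    print(x[:2] / x[2])                 # predicted positions; steps shrink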
Figure 3.8 shows an image of size 537 × 720 with eleven persons in a row in perspective
foreshortening. Assuming equidistant and collinear points in space yields
adjusted Gestalt positions, an estimate of the vanishing point with its positional
uncertainty, and the position-to-position homography 3.14, which allows for the prediction
of further locations in the image. For the coordinates of the observed positions
a standard deviation of σ = 1 pixel is assumed. Provided that the model is valid, an
estimate of σ = 5.15 pixels for the coordinates is obtained. Obviously, the estimated
vanishing point position is too close, as it is not coincident with the white road markings.
This is due to the bias in the estimation. However, the close-range predictions
of Gestalt locations are valuable for grouping with greedy algorithms exploiting proximity.

References

1. Wertheimer M (1923) Untersuchungen zur Lehre der Gestalt. II. Psychologische Forschung
4:301–350
2. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from realworld images competition 2013: summary and results. In: CVPR 2013, workshops
3. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) ICCV
challenge: detecting symmetry in the wild. In ICCV 2017, workshops
4. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Susstrunk S (2012) SLIC superpixels compared
to state-of-the-art superpixel, methods. Trans Pattern Anal Mach Intell 34(11):2274–2281
5. Desolneux A, Moisan L, Morel J-M (2008) From gestalt theory to image analysis: a probabilistic
approach. Springer
Chapter 4
Rotational Symmetry

The picture presented in Fig. 4.1 has been obtained from the CVPR 2013 symmetry
competition data [1]. It is obvious that such patterns are salient to human beings,
although flower symmetries are more intended as stimulus to attract insects. Often
they additionally come with salient colors visible to the targeted insect species. Some
of these colors may be visible to human observers as well, and this picture of flowers
comes in RGB colors in the contest data set. However, for the results reported here,
and in [2], these colors were not used, and accordingly the figure displays the gray-
tone version.
Flowers and blossoms frequently come in rotational symmetry, meaning they are
self-similar with respect to rotation using a finite cyclic group or also a dihedral group.
Other forms created by biological evolution also exhibit rotational symmetry, such as
pollen or jellyfish. Many such structures even have 3D-rotational self-similarity,
which appears rotationally symmetric in 2D projections only under special viewing
directions. In aerial and satellite imagery of man-made environments rotational sym-
metry is less frequent. Roundabouts and other traffic infrastructure sometimes have
rotational symmetry, and there are famous examples of buildings, such as the Pen-
tagon. However, other imaging modalities, such as ground-based architecture pictures, often show rotational Gestalten. For instance, on facades such patterns are not rare, in
particular, on facades of representative or religious purpose. In Buddhism and Hin-
duism numerous rotational Gestalten are known as mandalas, and frequently appear
in the related documents, the interior decorations of such temples, and on facades.
They are believed to have a strong impact on the mind of humans, in particular when
imagined during meditation. Also in the technical world rotational symmetry is not
rare. The mechanical civilization is full of wheels.
In contrast to that, rotational symmetry is not mentioned among the Gestalt laws of
the classical authors, such as Wertheimer [3]. Yet, it sets one discipline of the recent
symmetry recognition contests organized by the Penn State working group [1]. The
winning approach by S. Kondra et al. utilizes cross-correlation [4]. Originally, this

Fig. 4.1 Example image showing multiple rotational Gestalten of order five

is a brute force approach exhaustively listing all possible parameters, i.e., locations,
scales, and periodicities in appropriate step width. For each such parameter combi-
nation the corresponding transformations are applied to the image patch defined by
the parameters, and subsequently the score for the presence of this setting is accumu-
lated by cross-correlation. The performance was reported as worse than the baseline
method (the state of the art in 2013) which was set by Loy and Eklundh in 2006 [5].
That method in turn is based on SIFT-primitives.

4.1 The Rotational Gestalt Law as Defined on Locations

Starting with the perfect configuration, the self-similarity is defined as a finite cyclic
group of rotation transforms. These map one set location to the next. Thus, the
locations x i of a rotational Gestalt of order n are located on a circular orbit around
the center c. Instead of the generating vector used in Sect. 3.2, Eq. 3.1, there is the
generating angle ϕ = 2π/n:
 
$$x_{i+1} - c = \begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix} \cdot \left(x_i - c\right) \qquad (4.1)$$

For row Gestalten the locations x_1 and x_n have a special role as the beginning and the end of the row. In a rotational Gestalt, by contrast, all locations play the same part. It is more appropriate to use the enumeration i = 0, ..., n − 1 here. The predecessor of 0 is n − 1 and the successor of n − 1 is 0 according to the generation law Eq. 4.1, where the indices i

are understood modulo n. Obviously, the center c is the location of the new aggregate Gestalt. But two important features result as well, namely the radius r = ‖c − x_n‖ and the phase τ = arctan2(x_{n,2} − c_2, x_{n,1} − c_1). The periodicity feature of the new Gestalt will of course be set to n. That is what we introduced the periodicity feature for. For defining the operation

$$g_p = \Pi_{i=1}^{n}\, g_i \qquad (4.2)$$

all Gestalt features, as demanded in Sect. 1.3, must be calculated from the features of the g_i. For the perfect configuration the set locations equal the locations of the parts, x_i = x_{g_i}. But for less perfect input there will be residual vectors σ_i = x_i − x_{g_i}. The center c, radius r, and phase τ, respectively, set the set locations by
   
$$x_i = c + \begin{pmatrix} \cos\chi & \sin\chi \\ -\sin\chi & \cos\chi \end{pmatrix} \cdot \begin{pmatrix} r \\ 0 \end{pmatrix}; \qquad \chi = \frac{2\pi \cdot i}{n} + \tau, \qquad (4.3)$$

and these parameters should be optimized such that


$$E = \sum_{i=1}^{n} \sigma_i^T \cdot \sigma_i \;\rightarrow\; \min \qquad (4.4)$$

This is a nonlinear minimization problem for which we have no closed-form solution.


An initial approximate solution is required as well as a Jacobian for the iterative
improvement steps along the lines outlined in Appendix A. After convergence of the
minimization the remaining sum of squared residuals 4.4 will also be used to assess
the new aggregate Gestalt g p .
The natural choice for the initial location feature c0 = x 0, p is the mean of the
locations of the parts
$$c_0 = \frac{1}{n} \sum_{i=1}^{n} x_{g_i} \qquad (4.5)$$

Once this center is given, the initial radius can be set as mean distance

$$r_0 = \frac{1}{n} \sum_{i=1}^{n} \left\| c_0 - x_{g_i} \right\| \qquad (4.6)$$

There are also n angles υ_i = arctan2(x_{g_i} − c). For the phase feature along the orbit of Eq. 4.1 these must be understood as phases modulo 2π/n. In this group they should cluster around a mean phase:

$$\tau_0 = \operatorname{mean}\!\left( \upsilon_i \bmod \frac{2\pi}{n} \right) \qquad (4.7)$$

Fig. 4.2 Example of rotational arrangement of order five: given locations x_{g_i} as ∗, set locations x_i as ◦; correspondence is indicated by a connecting line also representing the residuum σ_i

This is problematic and may fail in case of uniform spreading around the whole
phase group. Clustering and means in the domain of orientations are treated in detail
in Sect. 1.5. In case of failure we may set τ0 arbitrarily, for instance, at random or at
υ1 . No well-assessed aggregate is possible in such cases anyway (Fig. 4.2).
With the initial solution (c0 , r0 , τ0 ) the Gauss/Newton minimization for 4.4 can
start. In [6] the corresponding 2n × 4 Jacobian was already given. It reads
$$J = \begin{bmatrix}
1 & 0 & -r\sin\!\left(\tau + 1\cdot\tfrac{2\pi}{n}\right) & r\cos\!\left(\tau + 1\cdot\tfrac{2\pi}{n}\right) \\
0 & 1 & \phantom{-}r\cos\!\left(\tau + 1\cdot\tfrac{2\pi}{n}\right) & r\sin\!\left(\tau + 1\cdot\tfrac{2\pi}{n}\right) \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & -r\sin\!\left(\tau + n\cdot\tfrac{2\pi}{n}\right) & r\cos\!\left(\tau + n\cdot\tfrac{2\pi}{n}\right) \\
0 & 1 & \phantom{-}r\cos\!\left(\tau + n\cdot\tfrac{2\pi}{n}\right) & r\sin\!\left(\tau + n\cdot\tfrac{2\pi}{n}\right)
\end{bmatrix} \qquad (4.8)$$

This matrix must be newly filled, squared, and inverted with every iteration step
using

$$\begin{bmatrix} c_x \\ c_y \\ r \\ \tau \end{bmatrix}_{i+1} = \left( J_i^T J_i \right)^{-1} J_i^T \cdot \begin{bmatrix} \sigma_{1,x} \\ \sigma_{1,y} \\ \vdots \\ \sigma_{n,x} \\ \sigma_{n,y} \end{bmatrix}_i \qquad (4.9)$$

This iteration step updates both the solution (c, r, τ) and the residuals σ. Starting from very bad initializations, this iteration may fail to converge, for instance because of a rank deficit of the Jacobian. It may also oscillate for some steps, or yield a negative radius. In such cases we may assign zero assessment and an arbitrary solution. In benign cases it will converge already with the first step. After that, the residuals can

be used to assess the geometric fit of the rotational model using the function given
in the following definition.
Definition 4.1 A function a_ρ : G^n → [0, 1) is called rotational fit assessment iff

$$a_\rho\!\left(g_1, \ldots, g_n\right) = \exp\!\left( -\lambda \sum_{i=1}^{n} \sigma_i^2 \right)$$

using a scale factor λ > 0.


In the absence of any learned parameter λ (see Chap. 13) we use again the geo-
metric mean of the scales of the gi as scale parameter here.
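For illustration, a minimal Python/NumPy sketch of this procedure follows. The function name, the data layout, and the choice λ = 1 as default are assumptions made only for this sketch; the Jacobian is filled directly from the derivatives of Eq. 4.3, so its column ordering may differ from the printed form of Eq. 4.8.

```python
import numpy as np

def rotational_fit(xg, n_iter=10, lam=1.0):
    """xg: (n, 2) array with the part locations; returns (c, r, tau, assessment)."""
    n = len(xg)
    c = xg.mean(axis=0)                                    # Eq. 4.5: initial center
    r = np.mean(np.linalg.norm(xg - c, axis=1))            # Eq. 4.6: initial radius
    ang = np.arctan2(xg[:, 1] - c[1], xg[:, 0] - c[0])
    tau = np.angle(np.mean(np.exp(1j * n * ang))) / n      # Eq. 4.7: circular mean phase mod 2*pi/n
    for _ in range(n_iter):
        chi = 2.0 * np.pi * np.arange(1, n + 1) / n + tau
        xs = c + r * np.column_stack((np.cos(chi), -np.sin(chi)))  # set locations, Eq. 4.3
        sigma = (xg - xs).ravel()                          # residual vectors, stacked
        J = np.zeros((2 * n, 4))                           # 2n x 4 Jacobian w.r.t. (c_x, c_y, r, tau)
        J[0::2, 0] = 1.0
        J[1::2, 1] = 1.0
        J[0::2, 2] = np.cos(chi)
        J[1::2, 2] = -np.sin(chi)
        J[0::2, 3] = -r * np.sin(chi)
        J[1::2, 3] = -r * np.cos(chi)
        delta = np.linalg.lstsq(J, sigma, rcond=None)[0]   # Gauss/Newton step, cf. Eq. 4.9
        c, r, tau = c + delta[:2], r + delta[2], tau + delta[3]
    chi = 2.0 * np.pi * np.arange(1, n + 1) / n + tau
    xs = c + r * np.column_stack((np.cos(chi), -np.sin(chi)))
    E = float(np.sum((xg - xs) ** 2))                      # sum of squared residuals, Eq. 4.4
    return c, r, tau, np.exp(-lam * E)                     # assessment as in Definition 4.1
```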

4.2 Fusion with Other Gestalt Laws

After the features (c0 , r0 , τ0 ) have been optimized, we can combine the resulting
assessment with the other Gestalt assessments, namely proximity and similarity.

4.2.1 Proximity Assessments for Rotational Gestalten

The assessment functions presented here are very similar to the ones presented in Sect. 3.3, and while repeating them we will emphasize the differences. Figure 4.3 demonstrates this: displacement from the rotational pattern as defined in Sect. 4.1 disturbs salience, but proximity is even more important. A rotational pattern appears most salient if the scale of the parts is about as large as their mutual distance.
For rotational Gestalten, n parts are distributed around the circumference, which has length 2·π·r. Thus, the assessment function should score the ratio between the part-to-part spacing d = 2·π·r/n along this circumference and some kind of mid-scale of the parts s_mid. Without further prior knowledge, and following the same rationale as in Sects. 2.5 and 3.3, scores with the features listed below are preferable:
• The score should be one for d/s_mid = 1—that is, for the objects being adjacent;
• The score should asymptotically approach 0 for d/s_mid approaching ∞—that is, for very large radii;
• The score should also approach 0 for d/s_mid approaching 0—that is, for very small radii that do not really give a proper Gestalt;
• The score should be continuous and differentiable;
• The score should decay faster than linearly, so that it has a finite integral.
In fact, we may use the same assessment functions as they are given in Sect. 2.5
in Formulae 2.5 or 2.6, and plotted in Fig. 2.3. The natural definition for the mid-
scale smid of n parts is again the geometric mean: smid = (s1 · . . . · sn )1/n . Thus, the
following definition is made:

Fig. 4.3 Rotational Gestalten on clutter: top to bottom, rising displacement; left to right, rising scale of parts; 20 clutter objects uniformly placed

Definition 4.2 A function a : G^n → [0, 1) is called proximity assessment iff for all p_1, . . . , p_n ∈ G and r = Π(p_1, . . . , p_n):
d_r = 0 ⇒ a(p_1, . . . , p_n) = 0,
d_r = s_mid ⇒ a(p_1, . . . , p_n) = 1,
d_r → ∞ ⇒ a(p_1, . . . , p_n) → 0, and
a(p_1, . . . , p_n) = a(p_n, . . . , p_1).
It depends on the primitive extraction method, but at least for flower pictures the mutual distance of adjacent ground-truth part-Gestalten is often a little smaller than their scale. For example, for the pattern shown in Fig. 4.1 the leaf spacing may be about half of the leaf length. This can be compensated by introducing an appropriate heuristic factor on the scale of the primitives; such a factor can be set a priori, or also learned if sufficient example material with ground truth is provided.
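A minimal sketch of such a proximity score follows (Python/NumPy; the concrete decay function q·exp(1 − q) is only an assumed stand-in for the book's Formulae 2.5/2.6, chosen to satisfy the properties listed above):

```python
import numpy as np

def proximity_assessment(radius, scales):
    """radius: orbit radius r; scales: iterable with the n part scales."""
    n = len(scales)
    d = 2.0 * np.pi * radius / n                 # part-to-part spacing on the orbit
    s_mid = np.exp(np.mean(np.log(scales)))      # geometric mean of the part scales
    q = d / s_mid
    return q * np.exp(1.0 - q)                   # 1 at q = 1, -> 0 for q -> 0 and q -> infinity
```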

Fig. 4.4 Rotational Gestalten on clutter: top to bottom, rising displacement; left to right, rising distortion in orientation of the parts; 20 clutter objects uniformly placed

4.2.2 Similarity Assessments for Rotational Gestalten

Of course, a rotational Gestalt is also more salient if all parts are similar in scale. Here
the same similar-in-scale-assessment can be used that was also appropriate for rows,
namely Definition 3.3. This component should be fused with the other components
given above, in particular the proximity law, and the assessment based on the residual
displacements from the rotational arrangement.
Of course there is also a similar-in-orientation-assessment. What is special about
the orientation component becomes evident from Fig. 4.4: The orientations of the
parts should be rotating with their index following
Definition 4.3 An n-ary operation Π : G^n → G is called rotational symmetry operation iff for all p_1, . . . , p_n ∈ G:
• x_{Π(p_1...p_n)} = c,
• φ_{Π(p_1...p_n)} = τ mod 2π/n,
• s_{Π(p_1...p_n)} = r + (s_1 · . . . · s_n)^{1/n},
• f_{Π(p_1...p_n)} = n,
and a_{Π(p_1...p_n)} is a conjunctive assessment combination of rotational location, orientation, proximity, and similarity in scale.

4.3 Search for Rotational Gestalten

With respect to the combinatorial structure, the operation Π is n-ary and thus similar
to the operation Σ treated in Chap. 3. Recall, the combinations listed in Sect. 3.5.1
operate only on the indices, forming tuples with, or without, repetition. All that
remains true here as well. The unique main theoretical difference lies in the additional
algebraic group structure on the index-tuple, i.e., the different law of commutativity:
When we list all row Gestalten by listing all n-tuples, we have each row two times in
the list—first in forward enumeration and the second time in backward enumeration.
In contrast to this: When we list all rotational Gestalten by listing all n-tuples, we will
have each rotational aggregate n times in the list. Recall, cutting the first element from
the tuple and appending it behind the last does not change the Gestalt. Only for the
special case n = 2 the inverse enumeration of the parts will give the same aggregate.
For n ≥ 3 inversion of the enumeration sequence will yield a completely different
element. Recall that the law Eq. 4.1 defines a mathematically positive rotation (if
the first axis points right, and the second axis points upward, positive rotation will
be counter-clockwise). Using the wrong rotation will yield very bad assessments,
sometimes the Gauss/Newton iteration may even fail to converge.
In the following subsections, a greedy search will be given that may require larger
computational efforts than the search for rows presented in Sect. 3.5.2, but still
appears tractable.

4.3.1 Greedy Search for Rotational Gestalten

Given a finite set of primitives P, all pairs (p_1, p_2) of elements of P will be listed (without repetition), and r = Π(p_1, p_2) is evaluated. If a_r is very good we are done, and can list the new aggregate with the results. This simple case n = 2 is a very special case. It
is sometimes called point reflection symmetry. Recall that it was already mentioned
as special case of a planar harmonic homology in Sect. 2.11.
Else, if in particular the orientations φ1 and φ2 do not fit the 180◦ rotational law,
but the rest of the Gestalt laws—namely proximity and similarity—fit sufficiently,
a set of new seeds will be constructed. Again, a threshold parameter θ is utilized
for search seeds, just like in Definition 3.6. Actually, the same row-seeds can be
used here as well. Only the similarity-in-orientation part of the assessment should
be neglected.
Figure 4.5 displays a typical rotational seed. A collection of convex regular n-gons
is constructed from it. In this case there are a square, a pentagon, and a hexagon.
All have the first side in common. It is the vector connecting the locations of the

Fig. 4.5 Example of a rotational seed and the corresponding search orbits

two seed-Gestalten. The center of all n-gons is located on the perpendicular bisector.
This center forms an isosceles triangle with the two locations. The angle at the
center is 2π/n—which is 90◦ , 72◦ , and 60◦ , respectively. The n − 2 other vertices
result from this construction. They give the search locations.
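The construction can be sketched as follows (Python/NumPy; illustrative only, and only one of the two possible center sides and rotation senses is constructed here; in practice the mathematically positive sense demanded by Eq. 4.1 must be respected):

```python
import numpy as np

def rot(v, angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

def search_orbits(p, q, orders=(4, 5, 6)):
    """p, q: locations of the two seed Gestalten; returns search locations per order n."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mid, d = 0.5 * (p + q), q - p
    normal = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # perpendicular bisector direction
    orbits = {}
    for n in orders:
        phi = 2.0 * np.pi / n                              # angle at the center (90, 72, 60 degrees)
        h = 0.5 * np.linalg.norm(d) / np.tan(phi / 2.0)    # distance of the center from the midpoint
        center = mid + h * normal
        # the remaining n-2 vertices result from rotating p about the center
        orbits[n] = [center + rot(p - center, k * phi) for k in range(2, n)]
    return orbits
```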
For each n-gon, for each search location the best fitting partner is selected greed-
ily. Best fit means here using a conjunction of the assessments used for Π in
Definition 4.3. Namely:
• rotational symmetry meaning here closeness to the search location (close in the
mid-scale of the seed elements);
• similar size meaning here similar in scale to the mid-scale of the seed elements;
• fitting in orientation meaning here similar in orientation to the set angle resulting
from the seed-element orientations and the index in the orbit.
• similar with respect to all additional features to the corresponding seed-element
features.
All this can only be heuristic. The risk of greedily picking the wrong element for
one vertex is not negligible here. However, a non-greedy search would surely be
intractable here. The computational load caused by this greedy search is bad enough
already.

4.3.2 A Practical Example with Rotational Gestalten of


Level 1

Figure 4.6 displays a set of primitive Gestalten extracted from the example image.
Here SIFT-primitive-extraction was used (see Sect. 11.4). For the human observer
the pentagram is very salient in this display. Thus, we can assume that in this case the

Fig. 4.6 SIFT-primitives extracted from the example image Fig. 4.1

salient symmetry in the image shown in Fig. 4.1 is not lost in the primitive extraction
process. These primitives are preferably located at the tip of leaves or on the sharp
vertex appearing between two leaves. Other extraction methods—such as super-pixel
segmentation (see Sect. 11.2)—may yield primitives that are located in the center
of the leaves. There are ... primitives here, and the greedy search outlined above in
Sect. 4.3.1 yields ... level-1 Gestalten. Some of them have frequency five, but some
also have other frequencies, such as three or four. They are displayed in Fig. 4.7.
Also in Fig. 4.7 we see the most dominant cluster of similar rotational Gestalten. The location and size fit the expectation and the ground truth given with this example well enough to count it as a success. The symmetry contests of 2013 [1] and 2017 [7] did not give ground truth for frequency and phase. The former would of course be a success here as well. But the phase may be disputable. The result sets the phase between the leaves, which is as far away as possible from the leaf axes. These axes would probably be the ground truth most observers would click on in this example. The phases are often problematic and disputable in such patterns. These results were published in [2].

Fig. 4.7 Π -Gestalten of level 1 obtained from the example image Fig. 4.1 and the most dominant
cluster found

4.4 The Rotational Group and the Dihedral Group

Most of the patterns given in the competition data sets for rotational symmetry
[1, 7] also have multiple reflection symmetry. Their reflection axes then intersect at
the rotation center. It is actually not easy to find examples of pure rotational symmetry that do not have reflection symmetry; certain flags and arms are such examples, for instance the arms of the Isle of Man with its three rotating legs.
Algebraically, the corresponding group is known as the dihedral group of order
n. It has 2n elements, and the corresponding rotational group is a subgroup of it.
When searching simultaneously for both rotational Gestalten and reflection-symmetric Gestalten, the presence of a dihedral symmetry should manifest as a cluster of both types at the corresponding location. Such a cluster may well be detected, the number and intersection angles of the reflection axes tested for consistency with the rotational periodicity, and an aggregate of the dihedral type constructed and assessed accordingly.

4.5 Perspective Foreshortening of Rotational Gestalts

Circular man-made objects and plant parts often consist of parts or features which
are arranged equidistantly in a circle. Thus, such objects feature a periodicity by
construction. Examples are hexagonal bolts, Ferris wheels, blossoms as in Fig. 4.1,
or a ventilator as shown in Fig. 4.8. In perspective views, these circles usually appear as ellipses, and a conjugate rotation can be utilized to map a point's position x_i to the subsequent position x′_i on the ellipse.
If we model the mapping between the image plane and the object’s plane in space
by a general homography T, the general conjugate rotation for an image point i reads

$$x'_i = H\, x_i \quad \text{with} \quad H = T\, R\, T^{-1} \qquad (4.10)$$

and a rotation matrix R. As a special type of collineation, the homography 4.10 has
seven degrees of freedom and the eigenvalues of H are the same as for the rotation
matrix R with rotation angle ω, namely {μ, μeiω , μe−iω } with μ = 1 if det(H) = 1
holds [8]. Thus, the complex eigenvalues determine the rotation angle. Four point correspondences are required to determine 4.10, and the eigenvector corresponding to the real eigenvalue is the fixed point of the transformation.
Figure 4.8 shows the image of a ventilator in a perspective view. The four rotor
blades have been detected by the extraction of maximally stable extremal regions
(MSER) [9] (see also Sect. 11.3) and are illustrated by fitted ellipses. Since the
centroids of these image areas do not correspond to the midpoints of the blades
in space, we utilize the angular points of the ellipses as consecutive corresponding
points. This yields two corresponding sequences of four image points each, marked
by crosses in the figure.

Fig. 4.8 Image of a ventilator in perspective view (source ICCV 2017 competition [7]). Corre-
sponding points on the rotor blades transform according to a conjugate rotation with period 4. As
an illustration a virtual ventilator with period 16 is plotted, too

For the determination of the mapping we start with the estimation of a general
homography H0 , parametrized by eight parameters effectively. This can be done by
considering four or more consecutive point correspondences, for example, the four
outermost angular points or three consecutive points of both sequences. From this
approximate solution, the rotation angle can be derived by computing the phase of one
of the complex eigenvalues. Alternatively, the period n = 2π/ω can be determined
by the relation
tr(H0 ) − 1 = 2 cos(2π/n) (4.11)

with the trace of the matrix H0 . By construction, the period is integer. Thus, we round
n to obtain a precise period and rotation angle.
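A minimal sketch of this step (Python/NumPy; the normalization to unit determinant and the rounding are the only operations involved, function name chosen for this sketch):

```python
import numpy as np

def period_from_homography(H0):
    """Derive the integer period n and the rotation angle from an approximate conjugate rotation."""
    H = H0 / np.cbrt(np.linalg.det(H0))              # normalize so that det(H) = 1
    cos_omega = 0.5 * (np.trace(H) - 1.0)            # Eq. 4.11: tr(H) - 1 = 2 cos(2*pi/n)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0)) # approximate rotation angle
    n = int(round(2.0 * np.pi / omega))              # the period is integer by construction
    return n, 2.0 * np.pi / n                        # rounded period and the exact angle
```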
With this information at hand, an optimal estimation of the homography 4.10 can follow, enforcing the constraint 4.11. The cyclic homography determined in this way is an n-fold rotation, i.e., the transformation obeys H^n = I_3. The planar harmonic homology 2.13 utilized in Sect. 2.11 is of period n = 2 and therefore a so-called involution.
For the generation of grouping hypotheses, four consecutive point correspon-
dences are sufficient to determine the rotation angle and the mapping. This offers the
possibility to aggregate rotational objects even in the case of missing data, e.g., due
to occlusions, and to find new object parts by guided matching in a greedy manner
as sketched in Sect. 4.3.1.

References

1. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from realworld images competition 2013: summary and results. In: CVPR 2013, workshops
2. Michaelsen E (2014) Searching for rotational symmetries based on the gestalt algebra opera-
tion. In: OGRW 2014, 9-th open German-Russian workshop on pattern recognition and image
understanding
3. Wertheimer M (1923) Untersuchungen zur Lehre der Gestalt. II. Psychologische Forschung
4:301–350
4. Kondra S, Petrosino A, Iodice S (2013) Multi-scale kernel operators for reflection and rotation
symmetry: further achievements. In: CVPR 2013 competition on symmetry detection
5. Loy G, Eklundh J (2006) Detecting symmetry and symmetric constellations of features. In:
European conference on computer vision (ECCV), vol II, pp 508–521
6. Michaelsen E, Yashina VV (2014) Simple gestalt algebra. Pattern Recogn Image Anal
24(4):542–551
7. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) ICCV challenge: detecting symmetry in the wild. In: ICCV 2017, workshops
8. Förstner W, Wrobel B (2016) Photogrammetric computer vision. Springer
9. Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable
extremal regions. In: British machine vision conference BMVC 2002, pp 384–396
Chapter 5
Closure—Hierarchies of Gestalten

The traditional way to handle nested hierarchies of patterns in the machine recogni-
tion literature was inspired by the theory of formal languages of N. Chomsky. Here
we only mention the Picture Languages of Rosenfeld [1], Fu [2] who had great influ-
ence in those days, and Narasinham [3] showing very early the direction in which this
chapter is intended. Note that the earliest technical committees (TC1 and TC2) of the
International Association of Pattern Recognition (IAPR) [4] were dedicated to sta-
tistical pattern recognition—which was discussed in the form of “perceptrons,” i.e.,
artificial neural nets [5]—and syntactical pattern recognition—which was discussed
in the form of grammars [2]. TC2 was later renamed in “structural and syntactical
pattern recognition” and shows little interest in grammars today.
The obvious problem with any generalization of Chomsky grammars to “2D” is
that the latter refers to a vector domain while the former are defined on the domain of
strings. Strings are not “1D”. Replacing a sub-string of a string by another sub-string
that has different lengths will alter the length of the whole string. The very popu-
lar context-free Chomsky grammars, for instance, will generate longer and longer
strings, starting from a single symbol. Most syntactic 2D models discussed, e.g., in
[2] can only replace a tile in a grid of cells by another tile of the same size and format.
For the Gestalt laws discussed in this book a change in scale is ubiquitous. Every
sensible aggregate is larger in scale than its parts. We will treat this fundamental
property with more rigor below in Lemma 5.1.
There remains one type of grammar that can cope with such growing scales
and that is the multiset grammar. Such structure was first proposed by Milgram
and Rosenfeld [6] for the use of automatic formula parsing. There it was called
“coordinate grammar.” The location is stored as an attribute or feature with each
instance of the symbols. Thus, it is free from the raster, and locations can remain
empty or also occupied by more than one instance. Once such features are introduced,
the road is open to append additional features such as scale and orientation, and
such grammars come fairly close to the approach presented throughout the book at

hand. In [7] elaborated grammars of this type are presented. The main application
domain is seen not in the automation of vision or object recognition, but in graphical
man–machine interfaces and in automatic analysis of all kinds of schemata and
drawings. Marriott and Meyer call their approach constrained multiset grammars.
This emphasizes the role of constraints on the attribute values in the production rules.
Not every element can mate with any other element.
We replaced such constraints by continuous assessment functions (introduced
in Sect. 1.3). Thus, the applicability of a production rule is never denied for any
combination of elements. This last syntactic property is also transferred to the features
coming with the elements; precisely it is now modeled in the assessment feature. The
term grammar is not appropriate anymore and we rather speak of an algebra. The
production rules then become operations.

5.1 Gestalt Algebra

Following [8], and in accordance with the usual terminology of universal algebra,
we introduce the following structure:
Definition 5.1 A set D is called assessed domain iff for all p ∈ D there is an assess-
ment 0 ≤ a p ≤ 1.
Note that the 2D domain G given in Sect. 1.3 is an example, but also other domains,
e.g., of higher or lower dimension and with different arithmetic structure, fit this
definition. Universal algebra deals with operations on sets that have arities:
Definition 5.2 A function D^k → D is called a k-ary operation.
The set of operations permitted for an algebra is always finite. In universal algebra k
can be a fixed integer including 0 or also an unspecified integer. For assessed domains
any operation must also give an assessment. In what follows three cases prevail:
• k = 0: Such 0-ary operations are usually called constants. We will refer to them
as primitives.
• k = 2: Such binary operations are written between their arguments like the arith-
metic operator +.
• k ≥ 2 and unspecified: Such k-ary operators are written in front of the list of
arguments, and brackets are used.
It is a common practice in algebra to refer to the entities noted by such symbols
as terms. This includes the nested use of operations, such as in:
      
$$\Sigma\Big( \Pi\big(p_1, p_2, p_3\big),\; \Pi\big(p_4|p_5,\, p_6|p_7,\, p_8|p_9,\, p_{10}|p_{11}\big),\; p_{12} \Big) \qquad (5.1)$$

Machines are generally quite good in reading and analyzing such terms. This one
refers to a row Gestalt that is made of three parts: The first one of these is a rotational

Gestalt made of the primitives p1 p2 p3 ; the second one of these, again a rotational
Gestalt, is in itself composed of four mirror Gestalten, which are in turn made up of
the next eight primitives p4 . . . p11 ; the third and last one is just a simple primitive,
namely p12 . Some humans might prefer such natural language descriptions rather
than using the notation as term. In the end, the sub-sentence hierarchy of that text is
actually also a representation of the very same term.
Using terms has many advantages. One is that it can be easily verified that
    
$$\Sigma\Big( p_{12},\; \Pi\big(p_{11}|p_{10},\, p_7|p_6,\, p_8|p_9,\, p_4|p_5\big),\; \Pi\big(p_3, p_1, p_2\big) \Big) \qquad (5.2)$$

indicates the same Gestalt as Eq. 5.1 because of the commutativity laws associated
with the operations |, Σ, and Π . We may set an “=” between these terms. For a
human analyst such a test of equality might mean some labor and cost a couple of seconds. However, a machine can perform such a test in microseconds.
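Such a test can be sketched by computing a canonical form of the terms (Python; the tuple encoding with the tags 'S', 'P', and '|' and the implemented commutativity rules—free order for |, reversal for rows, cyclic shifts for rotational Gestalten—are choices made only for this sketch):

```python
def canonical(term):
    """Return a canonical string for a Gestalt term given as nested tuples or primitive names."""
    if isinstance(term, str):                      # a primitive
        return term
    op, parts = term[0], [canonical(t) for t in term[1:]]
    if op == '|':                                  # reflection: order of the two parts is free
        parts = sorted(parts)
    elif op == 'S':                                # row: forward or backward enumeration
        parts = min(parts, parts[::-1])
    elif op == 'P':                                # rotational: all cyclic shifts denote the same Gestalt
        parts = min(parts[i:] + parts[:i] for i in range(len(parts)))
    return op + '(' + ','.join(parts) + ')'

# Two notations of the same Gestalt, in the spirit of Eqs. 5.1 and 5.2, compare equal:
t1 = ('S', ('P', 'p1', 'p2', 'p3'),
      ('P', ('|', 'p4', 'p5'), ('|', 'p6', 'p7'), ('|', 'p8', 'p9'), ('|', 'p10', 'p11')), 'p12')
t2 = ('S', 'p12',
      ('P', ('|', 'p10', 'p11'), ('|', 'p4', 'p5'), ('|', 'p6', 'p7'), ('|', 'p8', 'p9')),
      ('P', 'p3', 'p1', 'p2'))
assert canonical(t1) == canonical(t2)
```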
Another reason to prefer the notation as term is that it precisely gives the function
calls. If the operations are coded in interpreter systems such as MATLAB or Octave,
typing the term will directly yield the corresponding aggregate Gestalt. “Closure”
then means nothing else but proper coding—there will never be an error report as a
result of such an input.
Next to terms or natural language representations humans may also prefer graphi-
cal representations. To this end, one may simply connect each aggregate Gestalt with
its preceding part Gestalten by directed links resulting in a term graph. This has been
done for the terms above in Fig. 5.1. Note that such graphical representation allows
also to show the detailed values of features. In the wording of semantic nets (see,
e.g., [9]) this would give a particular instance, while the terms in Eq. 5.1 or Eq. 5.2
would rather give a structured form or a concept, at least as long as the primitives
remain unspecified. Such graphical interfaces are standard in production systems,
semantic nets, ontologies, or other knowledge-based systems and have proven to be
very useful for knowledge explanation and acquisition.
Note also that this particular graph has tree structure. Algebraically multiple use of
the same sub-terms is not forbidden. However, by now the reader should be convinced
that such use would lead to very bad assessments. In fact something has gone badly
wrong if any sensible search engine should yield a graph which is not a tree. It
indicates a bug. For what follows, we will therefore use the word term tree for such
displays. We may discuss all properties that trees may have. For instance, this tree
has depth three. Note, it is imbalanced: One branch has depth one, one branch has
depth two, and only the third branch has depth three. We may decide to punish or
forbid such lack of balance.
Figure 5.2 displays the Gestalten of the same term using the standards found
throughout this book. It can be seen that a fairly well-assessed (0.894) aggregate
results from a configuration that is not particularly salient or appears extraordinarily
symmetric. This is the reason why the algebra at hand was called simple Gestalt
algebra in [10]. It is a beginning, but definitely not satisfactory. More constraints are
needed and given below in Sect. 5.3. But surely demanding balance of term depth on

Fig. 5.1 A term tree—this one has the same structure as the terms in Eq. 5.1 or Eq. 5.2; the features
given in the rows beneath the names are the standard Gestalt features: location in x and y, orientation
between zero and one, frequency, scale, and assessment

every branch is a first step toward more appeal and compliance with Gestalt intuition.
Before that is discussed in more detail we will give an important theorem that follows
already from our simple Gestalt algebraic setting.
We have seen that the operations defined in Chaps. 2, 3, and 4 do not allow the
definition of neutral (null or one) elements, and accordingly there are no inverse
elements. This is somewhat unusual in algebra. It is also easy to see that the, often
axiomatically demanded, associativity is violated: e.g., f |(g |h ) ≠ (f |g )|h holds for
almost any Gestalten. Thus the operation | defines only a commutative groupoid,
not a group and not a semigroup. However, some monotony lemmas follow from the
continuity of the proximity assessment:

Lemma 5.1 For any ε ≥ 0 there is a δ ≥ 0 such that for any Gestalten g and h the scale of g|h is bounded by

$$a_{g|h} \geq \varepsilon \;\Rightarrow\; s_{g|h} \geq (\delta + 1)\,\sqrt{s_g \cdot s_h} \geq (\delta + 1)\,\min\!\left(s_g, s_h\right)$$

Proof Definition 2.4 demands for proximity assessments a_p continuity and that x_g = x_h implies a_p(g|h) = 0. So for any ε ≥ 0 there is a δ ≥ 0 such that δ·√(s_g · s_h) ≤ ‖x_g − x_h‖. By Definition 2.3 we have s_{g|h} = ‖x_g − x_h‖ + √(s_g · s_h). From this follows Lemma 5.1 for the proximity part of the assessment function. The other assessment components are bounded by one, and the overall assessment is a conjunctive fusion

Fig. 5.2 Nested term of Gestalten: upper left—a row of three rotational Gestalten; upper right—
further decomposition of first and second Gestalt; lower left—decomposing the middle Gestalt
further; lower right—corresponding primitives only

of them (a product). Thus the lemma follows. Even if—such as in Eq. 2.9—a root
function is used for assessment fusion, there will still be such δ though it may be
smaller. Chaining continuous functions yields again continuous functions.

There are analogous properties for the other operations:


Lemma 5.2 For any ε ≥ 0 there is a δ ≥ 0 such that for any Gestalten tuple (g_1, ..., g_n) the scale of Σ(g_1, ..., g_n) is bounded by

$$a_{\Sigma(g_1,\ldots,g_n)} \geq \varepsilon \;\Rightarrow\; s_{\Sigma(g_1,\ldots,g_n)} \geq (\delta + 1)\cdot\left(s_{g_1}\cdot\ldots\cdot s_{g_n}\right)^{1/n}$$

Proof Also for the row operation proximity assessments a_p are zero for zero distances. So if all locations are equal, x_{g_1} = ... = x_{g_n}, we will have a_p(Σ(g_1, ..., g_n)) = 0. And the same demand for continuity of proximity assessments a_p that was stated for reflection also holds here: for any ε ≥ 0 there is a δ ≥ 0 such that δ·√(s_g · s_h) ≤ ‖x_g − x_h‖. So at least one x_i must be different from the others. It can be easily verified from Eq. 3.4 that then the generator vector v_{Σ(g_1,...,g_n)} also cannot be a zero vector. The scale of the new object s_{Σ(g_1,...,g_n)} is given by Eq. 3.5. Thus, the lemma follows.

Even if—such as in Eq. 2.9—a root function is used for assessment fusion, there will
still be such δ though it may be smaller. Chaining continuous functions yields again
continuous functions.

Lemma 5.3 For any ε ≥ 0 there is a δ ≥ 0 such that for any Gestalten tuple (g_1, . . . , g_n) the scale of Π(g_1, . . . , g_n) is bounded by

$$a_{\Pi(g_1,\ldots,g_n)} \geq \varepsilon \;\Rightarrow\; s_{\Pi(g_1,\ldots,g_n)} \geq (\delta + 1)\cdot\left(s_{g_1}\cdot\ldots\cdot s_{g_n}\right)^{1/n}$$

Proof Also for the rotational operation the proximity assessment a_p is zero for zero distances: if all part locations coincide, the spacing d_r vanishes, and Definition 4.2 yields a_p(Π(g_1, . . . , g_n)) = 0. The same demand for continuity holds here as well, so for any ε ≥ 0 there is a δ ≥ 0 such that a_{Π(g_1,...,g_n)} ≥ ε implies r ≥ δ·(s_{g_1} · . . . · s_{g_n})^{1/n} for the radius r. By Definition 4.3 the scale of the new object is s_{Π(g_1,...,g_n)} = r + (s_{g_1} · . . . · s_{g_n})^{1/n}, and the bound follows for the proximity part of the assessment function. The other assessment components are bounded by one, and the overall assessment is a conjunctive fusion of them (a product). Thus the lemma follows. Even if—such as in Eq. 2.9—a root function is used for assessment fusion, there will still be such δ though it may be smaller. Chaining continuous functions yields again continuous functions.

Note, we should not remove any of the properties used in this proof: For example,
if we used disjunctive fusion, or if we allowed a discontinuous step in the proximity
assessment at argument zero or nonzero assessments at this argument, this decisive
monotonicity may be lost.

Definition 5.3 Given an algebra A on an assessed domain D, a finite set of primitives P ⊂ D, and 0 ≤ ε ≤ 1 we define the A_ε-closure of P as

$$\{\, t \in T(A, P) \mid a_t \geq \varepsilon \,\}$$

For ε = 0 this closure will be infinite, since any primitive p ∈ P may appear an arbitrary number of times in a term of A. However, the following theorem is essential for our approach:

Theorem 5.1 For any ε > 0 the A_ε-closure of P is finite.

Proof From Lemmas 5.1, 5.2, and 5.3 we know that with rising depth of a term graph the scale feature has to grow strictly monotonically if all elements in it have at least assessment ε. Thus the depth is limited. From a finite set of primitives, only a finite number of term graphs is possible when the depth is limited.

5.2 Empirical Experiments with Closure

Next to the construction of bounds and theorems on the combinatorics of the closure—such as Theorem 5.1—one may well be interested in the combinatorial behavior that is to be expected. Even without any example images, empirical experiments can still be

Fig. 5.3 Random Gestalten clutter and its reflection closure with threshold 0.90: upper left—1000
primitives; upper right—level-1 |-Gestalten; then levels two and three and in the lower row levels
four and five; all higher levels are empty

performed by using random instance generation. The upper left part of Fig. 5.3 shows a set of a thousand such primitives. Sampling used the following specifications (a sampling sketch in code follows the list):
• Uniform location in [0, 1000] in both horizontal and vertical coordinates;
• Uniform orientation in [0, π];
• Rayleigh distribution with parameter √1000 in scale, so that the instances are packed densely but with little overlap;

Table 5.1 Combinatorial growth of number of Gestalten as function of term level and assessment threshold

Threshold | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5
0.89      | 1000    | 235     | 476     | 12012   | 8485493 | ...
0.90      | 1000    | 62      | 22      | 28      | 43      | 1
0.91      | 1000    | 45      | 6       | 2       | 0       | 0

• Uniform assessment in [0, 1];


• Rotational frequency 2 for all objects.
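A sampling sketch under these specifications (Python/NumPy; the dictionary field names are chosen for this sketch only):

```python
import numpy as np

def sample_primitives(m=1000, extent=1000.0, rng=None):
    """Sample m random primitive Gestalten according to the specifications above."""
    rng = rng or np.random.default_rng()
    return [{
        'x': rng.uniform(0.0, extent, size=2),    # uniform location
        'phi': rng.uniform(0.0, np.pi),           # uniform orientation
        's': rng.rayleigh(np.sqrt(extent)),       # Rayleigh-distributed scale
        'a': rng.uniform(0.0, 1.0),               # uniform assessment
        'f': 2,                                   # rotational frequency 2
    } for _ in range(m)]
```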

The other parts of the figure show the |-terms resulting from these primitives. Here a
threshold of 0.90 was used. Accordingly, the upper right figure shows all reflection
Gestalten assessed better than 0.90, left in the middle row are the reflection Gestalten
resulting from those in turn—and again better assessed than 0.90, and so forth. Thus,
the single Gestalt displayed in the lower right is a level-5 reflection: It is made from 2^5
primitives. The growth in scale according to the monotony Lemma 5.1 can be clearly
seen. Note also that while the primitives appear distributed rather homogeneously the
higher-order Gestalten concentrate more and more in clusters, whereas other regions
remain empty. It appears that this threshold is at the edge of hallucination. This is
also supported by statistics on the combinatorial growth of Gestalten number with
varying threshold, as given in Table 5.1. With a slightly higher threshold the numbers
decline rapidly. Lowering the threshold slightly causes the numbers to explode in a
combinatorial nightmare. The usual machine capacities will not suffice to hold the
level-5 Gestalten anymore. However, we have Theorem 5.1: There will be a level
where no further Gestalten are possible.

5.3 Transporting Evidence through Gestalt Algebra Terms

Now that the main terms have been defined and the important tools are at hand, we can make the core points of this book. Recall its title was "Hierarchical Perceptual Grouping for Object Recognition". Along the edges of a term tree information can be propagated up and down. The tree can set correspondences between tiny remote parts of a scene. It can therefore also help to reduce the computational load: recall that not everything in an image can be set in correspondence with everything else, since this would result in computational efforts growing faster than linearly with the image size, which is unacceptable.
In [10] the defined structure was called simple Gestalt algebra foreseeing that
such stepwise reckoning, where the features and the assessment of an aggregated
Gestalt do only depend on the immediate predecessors, would not suffice. It can
only be the first bottom-up hypothesis. Following that, deeper top-down testing,
feature adjustment, and reassessment should follow. The ultimate goal should be

a least squares adjustment on the whole hierarchy, minimizing the deviations in


the primitives—i.e., our measurements. Moreover, it is possible to propagate the
evidence even further down into the primitive extraction, e.g., readjusting thresholds
in order to find primitives that are—a posteriori—more consistent with the overall
Gestalt. In order to do so first an important property of Gestalt terms must be defined:

Definition 5.4 A Gestalt term t is called balanced iff every branch of it is a tree of
identical structure.

The example given above as term in Eq. 5.1 and displayed in Fig. 5.1 was imbal-
anced, because its sub-terms are of different structures: first a rotational Gestalt,
second a rotational Gestalt where the parts are reflection pairs, and third a single
primitive. The structure given for the example below as term in Eq. 5.5 is balanced.
Such a property is a prerequisite for the adjustments presented there.
First, in Sect. 5.3.1 the propagation of simple additional features such as colors is
considered. Then, in Sect. 5.3.2 the geometric Gestalt features are adjusted through
the hierarchy.

5.3.1 Considering Additional Features

It was demonstrated in [11] that the use of additional features will often improve
the recognition performance. SIFT primitives were extracted from the benchmark
images. The SIFT keypoints yield exactly the features demanded by the Gestalt domain (Sect. 11.4). Inspired by the successful symmetry recognition approach of Loy
and Eklundh [12] the SIFT descriptor vector was also utilized. Substantial improve-
ment of recognition rates could be achieved. When super-pixels are used, such as
in [13], the features of the Gestalt domain also come naturally with the extraction
process in Sect. 11.2. But there are unused additional features as well: eccentricity
and color in this case. Utilizing these in the assessment should again help improving
recognition performances.
A hint at such an approach was already given in the definition of conjunctive assess-
ments in Definition 2.6. Let, e.g., for colors the classical three-byte format be used. So
cog is a 3-vector containing the red, green, and blue components of the mid-color of
the super-pixel g. Then the similarity between two colors cog and coh can simply be
assessed by the color difference weighted by the maximal possible color difference,
and the reassessing may be performed using

$$a_{col,\,g|h} = a_{g|h} \cdot \left( 1 - \frac{\left\| co_g - co_h \right\|}{\sqrt{3 \cdot 256^2}} \right). \qquad (5.3)$$

A parameter may be introduced to weight the color evidence against the Gestalt assessment. This can be an exponent in float format.

Such additional assessments need not necessarily have this form, but they should have similar properties as the Gestalt assessments, i.e.,
• Being one for optimal consistency—here color difference zero;
• Being zero for maximal dissent—here color difference √(3 · 256²);
• Being differentiable in between in order to enable machine learning.
The point is that the aggregate Gestalt g |h now can also inherit the mid-color feature
of its predecessors by
$$co_{g|h} = \frac{co_g + co_h}{2}. \qquad (5.4)$$
The same construction is possible also for the operations Σ and Π .
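A minimal sketch of this reassessment and propagation follows (Python/NumPy; the weighting exponent w stands for the tuning parameter mentioned above and is an assumption of this sketch):

```python
import numpy as np

def fuse_color(a_gh, co_g, co_h, w=1.0):
    """Reassess a reflection Gestalt by color similarity (Eq. 5.3) and inherit the mid-color (Eq. 5.4)."""
    co_g, co_h = np.asarray(co_g, float), np.asarray(co_h, float)
    diff = np.linalg.norm(co_g - co_h) / np.sqrt(3 * 256 ** 2)   # normalized RGB distance
    a_col = a_gh * (1.0 - diff) ** w                             # reassessment, Eq. 5.3
    co_gh = 0.5 * (co_g + co_h)                                  # inherited mid-color, Eq. 5.4
    return a_col, co_gh
```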
In this way the color can be propagated and used for assessments also for terms
of arbitrary complexity. It can be expected that not only the recognition performance
is improved but also the computational effort caused by the combinatorics can be
reduced. This means that running into problems with computer storage capacity or
calculation times as indicated in Table 5.1 may either be mitigated or that lower
assessment thresholds can be used. The latter would avoid false negatives. Recall
there can be well-assessed complex Gestalten terms that contain a mediocre-assessed
intermediate-level Gestalt.
Of course color is not the only possible additional feature. For example, for super-
pixels there is also eccentricity. This is a scalar between zero and one. The absolute
of the difference between two eccentricities can serve as similarity-in-eccentricity
assessment. Other additional features will be of higher dimension and mathematical
structure. For example, for MSTAR primitives (see Sect. 11.3) often normalized,
circular descriptions of the perimeter or shape are used. For all such descriptors
matching functions are given in the corresponding literature, which can also be
utilized in assessment functions for such additional features.
All these additional features have the following properties in common:
• All complex Gestalten inherit the extended feature domain from the primitives.
• Relatively small objects that may be thousands of pixels away from each other
are set in correspondence, and matched and assessed in the state-of-the-art way.
The correspondence is constructed by the Gestalt term tree. A full all-against-all
search can be avoided.
There is also a possibility to have the domain growing with term depth. Recall the
soccer team picture presented in Fig. 1.1, where the shirts are reflection symmetric
in shape, but the two halves show different colors. A similar situation is encountered
in the analysis of aerial pictures of suburban terrain (in particular in Europe). In
production system similar to the knowledge-based approach of Matsuyama [14],
Jurkiewicz and Stilla grouped rows of houses along roads in scenes from Germany
[15]. In that country houses most frequently feature gabled roofs. Under oblique
lighting they will frequently appear with brighter color toward the sun and darker
color on the other half. When grouping a row of such reflection symmetric Gestalten
the assessment should include a comparison of both colors.

The influence of color consistency (also multiple color consistency) on perceptual


grouping on aerial images was investigated in [16]. There the Vaihingen benchmark
of the International Society for Photogrammetry and Remote Sensing (ISPRS) was
used.

5.3.2 Propagation of Adjustments through the Hierarchy

As an example, consider #27 in the frieze part of the 2017 symmetry recognition
competition [17]. This image displays the gateway of a fire station. It is shown in
Fig. 5.4. The gateway consists of four gull-winged doors in a row with wings featuring
two segments, each equipped with a window. Thus, one way to construct the gateway
hierarchically is
1. A wing consists of two symmetric segments, each featuring a window.
2. A door consists of two symmetric wings.
3. The gateway consists of four doors in a row.
As a Gestalt algebra term it has the following structure:
        
$$\Sigma\Big( \big(g_1|g_2\big)\,\big|\,\big(g_3|g_4\big),\; \ldots,\; \big(g_{13}|g_{14}\big)\,\big|\,\big(g_{15}|g_{16}\big) \Big) \qquad (5.5)$$

Note the positions of the sixteen windows happen to be aligned along a straight line;
however, they are not equidistantly spaced. Thus, an alternative decomposition of
the same set of windows in a simpler and shallower hierarchy is also possible:

$$\Sigma\big( g_1, \ldots, g_{16} \big) \qquad (5.6)$$

However, the term in Eq. 5.5 should give a substantially better assessment as com-
pared to the term in Eq. 5.6.
Figure 5.5 shows the result of the feature extraction stage. Here a threshold of 160
was used, resulting in about a hundred segments. Sixteen of these primitive Gestalten
correspond to the doors’ windows and represent the segments of the gateway. Note
there are substantial faults in this extraction. For instance the sixth window from
the left, i.e., the right window of left wing of the second gateway, is segmented into
two parts. Thus, g6 is too low in position and too small in scale, and has a fairly
bad assessment. There are similar problems with some of the others corresponding
primitives as well, while the majority—in this case ten of sixteen—fit well.
Such faults can be expected. While a posteriori always an explanation can be
found, the best a priori way to account for such deviations is assigning covariance matrices to their feature vectors. Such covariances express expected segmentation errors and measurement uncertainties. The covariance assigned to a primitive may well depend on its assessment—better-assessed segments can be expected to have smaller deviations.

Fig. 5.4 Example of hierarchical grouping. Four gull-winged doors and their windows in a row
feature a hierarchy of symmetries

Fig. 5.5 Primitives obtained from the image displayed in Fig. 5.4, note the sixteen vertically
oriented Gestalten close to row 350, corresponding to the sixteen windows

The first bottom-up step consists of the formation of the eight wings using the
sixteen primitives. The operation | of Chap. 2 is used in the variant outlined in Eq. 2.4
on each of the pairs (see also Fig. 2.4). The representation of a wing is obtained by
adjusting the observations for a pair of segments. Figure 5.6 compares the measured
with the adjusted observations for the primitives and the constructed eight aggregate
Gestalten corresponding to the wings. The adjusted orientations obey the concur-
rence constraint 2.4, and the sizes of paired Gestalten are identical. The positions

Fig. 5.6 Measured observations of the primitives (level 0, top), adjusted primitives (level 1, middle), and constructed wings (level 2, bottom). The adjusted Gestalten feature pairwise identical sizes and enforced axis symmetry

Fig. 5.7 Adjusted features of the wing Gestalten (level 2, top) and constructed door Gestalten (level 3, bottom). As in the lower level, the adjusted Gestalten feature pairwise identical sizes and enforced reflection symmetry

and orientations of the newly formed Gestalts are obtained by the constructions,
accompanied by error propagation to obtain uncertainties for the Gestalten on the
superior level.
In the second bottom-up step the eight newly constructed wings are the base for
the construction of the four doors on the third level—again assuming reflection sym-
metry. Figure 5.7 shows the adjusted features of the eight wings and the constructed
four double wing doors. Basically, the same operation (enforcing Eq. 2.4) is used
again.

Fig. 5.8 Adjusted Gestalt features of the four doors (level 3, top) and constructed gateway (level 4, bottom). The positions of the adjusted Gestalts are equidistant and incident with a straight line. Furthermore, all orientations are identical

The final bottom-up step is the alignment of the four double wing doors in a row
and the construction of the gateway as a derived high-level Gestalt using the operation
Σ given in Chap. 3. Recall the position of the aggregate is given by Eq. 3.3, and
the generator is given by Eq. 3.4. Orientations and sizes of the adjusted Gestalten
representing the four doors are enforced to be equal. Their adjusted positions are
incident to a straight line. Using Eqs. 3.3 and 3.4 the unknown parameters of this
line are estimated, together with the spacing of the door positions. Figure 5.8 shows
the adjusted features of the doors and the eventually derived representation of the
gateway.
The solution of the hierarchical grouping exemplified above yields sub-optimal results since on each level the newly constructed entities subsume the statistical properties of the components only approximately. Thus, a common adjustment of all observed Gestalt features is advisable. It considers all possible correlations in a statistically rigorous manner. Furthermore, the identification of blunders or outliers
is facilitated due to the increased redundancy. The bottom-up constructions given
above—i.e., the corresponding Gestalt term—are a prerequisite for the following
top-down adjustment. Figure 5.9 shows the original observations and the Gestalts
aligned in a row with equidistant positions and identical orientations. In this example
the result has been obtained by introducing 9 unknown parameters and 63 constraints
between observations, and observations and parameters, respectively. The parameters
are:
• The common size of all windows (row symmetry);
• The common orientation of all doors (row symmetry);

Fig. 5.9 Top: Gestalt features of the sixteen windows as given by the primitive extraction. Bottom: Adjusted Gestalten in a row now fulfilling the 63 constraints. Neighboring pairs of Gestalts are axis symmetric on each aggregation level. Furthermore, the four groups of Gestalten, each representing a door, are aligned in a row

• The straight line utilized to align the doors by point-line incidences (row symme-
try);
• The distance between subsequent doors positions;
• The angle between a window orientation and the axis orientation of the corre-
sponding pair of windows (axis symmetry);
• The angle between a wing orientation and the axis orientation of the corresponding
pair of wings (axis symmetry);
• The distance between a window position and the corresponding symmetry axis of
pairs;
• The distance between a wing position and the corresponding symmetry axis of
pairs.
All entities are represented by homogeneous coordinates to ease the formulation of
the constraints. Given the position (x, y) and orientation φ of a Gestalt, the straight
line representing the Gestalt’s orientation is l = [sin(φ), − cos(φ), −d]T with the
distance d = sin(φ)x −cos(φ)y between the line and the origin of the coordinate
system. The angle α between the pair of straight lines l and m is then
 
$$\alpha = \arctan\!\left( -l^T S\, m,\; -l^T G\, m \right) \qquad (5.7)$$

using the two-argument inverse tangent function with the skew-symmetric matrix S = [e_1]_× for the vector e_1 = [1, 0, 0]^T and the diagonal matrix G = Diag([1, 1, 0]) [18].
Point-line incidences are enforced by the constraint xT l = 0 between a point x and
a straight line l.
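For illustration, the homogeneous line construction and the incidence constraint can be sketched as follows (Python/NumPy; the angle is computed here directly from the line normals, an elementary equivalent of the information expressed by Eq. 5.7, and the function names are choices of this sketch):

```python
import numpy as np

def gestalt_line(x, y, phi):
    """Homogeneous line representing a Gestalt's orientation at position (x, y)."""
    d = np.sin(phi) * x - np.cos(phi) * y               # distance of the line from the origin
    return np.array([np.sin(phi), -np.cos(phi), -d])    # l = [sin(phi), -cos(phi), -d]^T

def angle_between(l, m):
    """Angle between two lines, computed from their normal directions."""
    cross = l[0] * m[1] - l[1] * m[0]
    dot = l[0] * m[0] + l[1] * m[1]
    return np.arctan2(cross, dot)

def incidence_residual(x, l):
    """x^T l, which is zero for a homogeneous point x lying on the line l."""
    return x @ l
```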
This is a maximum a posteriori solution yielding the most likely features of the
sixteen primitives given by the primitive extraction procedure. It assumes normally
distributed measurement uncertainty as well as the indicated hierarchical structure.
Moreover, it is also a minimum description length solution. We emphasize here that
such adjustment of the features of the primitives can be obtained for any balanced
Gestalt term, as defined above in Definition 5.4. The list of parameters and con-
straints corresponding to any particular such term can be compiled automatically

from its term structure. Given this list the Jacobian matrix corresponding to this set
of parameters and constraints can also be automatically set.
Just like in any other usual regression estimation, natural interest concentrates on the outliers after adjustment, or on those elements where the largest residuals occur, such as the sixth from the left, g6 in the notation of Eq. 5.5. Here the top-down reasoning may well dig deeper, i.e., into the original image and into the segmentation process yielding the primitives. Since now, a posteriori, the most likely features for such an object are known (depicted in the bottom row of Fig. 5.9), parameters of the extraction process can be varied and applied to the local neighborhood of the primitive in question. For instance, in the case at hand, a lower threshold for image binarization will merge the corresponding segment with its upper neighbor, resulting in a much better fitting primitive. With this new result the whole adjustment should then be repeated. Such a procedure can be iterated until the residuals are sufficiently small.
Further note that in this example all symmetry axes turn out close to perpendicular to the row generator. In such a situation the methods outlined below in Chap. 9 apply.

References

1. Rosenfeld A (1979) Picture languages. Academic Press


2. Fu KS (1974) Syntactic methods in pattern recognition. Academic Press
3. Narasimhan R (1964) Labeling schemata and syntactic description of pictures. Inf Control
7:151–179
4. IAPR. Technical committees of the international association for pattern recognition. https://www.iapr.org/committees/committees.php?id=6/. Last accessed September 2018
5. Minsky M, Papert SA (1987) Perceptrons, new edition. MIT Press
6. Milgram DL, Rosenfeld A (1972) A note on grammars with coordinates. In: Graphic languages,
pp 187–194
7. Marriott K (1998) Visual language theory. Springer
8. Malcev AI (1973) Algebraic systems. Springer
9. Niemann H (1990) Pattern analysis and understanding. Springer
10. Michaelsen E, Yashina VV (2014) Simple gestalt algebra. Pattern Recogn Image Anal
24(4):542–551
11. Michaelsen E (2014) Gestalt algebra-a proposal for the formalization of gestalt perception and
rendering. Symmetry 6(3):566–577
12. Loy G, Eklundh J (2006) Detecting symmetry and symmetric constellations of features. In:
European conference on computer vision (ECCV), pp II:508–521
13. Michaelsen E, Arens M (2017) Hierarchical grouping using gestalt assessments. In: CVPR
2017, workshops, detecting symmetry in the wild
14. Matsuyama T, Hwang VS-S (1990) SIGMA, A knowledge-based aerial image understanding
system. Springer
15. Jurkiewicz K, Stilla U (1992) Understanding urban structure. In: ISPRS
16. Michaelsen E (2012) Perceptual grouping of row-gestalts in aerial NIR images of urban terrain.
In: PRRS
17. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) 2017 ICCV challenge: detecting symmetry in the wild. In: ICCV 2017, workshops
18. Förstner W, Wrobel B (2016) Photogrammetric computer vision. Springer
Chapter 6
Search

In Chap. 5 algorithms were given which enumerate the set of all Gestalten, which
have better assessment than a fixed constant τ > 0, and can be aggregated from
a finite set of primitive input Gestalten. These algorithms can cause considerable
computational loads. In particular, the workload is data-dependent. The run-time
and required storage can grow critically with the size of the input set. Moreover, if
there are certain regularities in the data, the required computational resources may
rise dramatically.

6.1 Stratified Search

Assuming that there is a finite set of primitive Gestalten L_0 ⊂ G extracted from an image, and a set of Gestalt operations {|, Σ, ...} operating on G, we may exhaustively list all one-step applications and keep the better-assessed results in a set L_1. Then we can proceed in a stratified way from L_i to L_{i+1}:
Definition 6.1 For each i the search level set is defined by

$$L_{i+1} = \left\{\, g = h_1 | h_2 \;\vee\; g = \Sigma(h_1, \ldots, h_n) \;\vee\; \ldots \;\middle|\; h_1, \ldots, h_n \in L_i \;\wedge\; a_g > \tau \,\right\}.$$

The threshold τ controls the growth and decline of the sets L i with rising i. We
can prove that with any τ > 0 all L i will be finite, and in fact for every finite L 0 there
will be a maximal depth i max with all L j = ∅ if j > i max . This method implements
breadth-first search. It has been used for most previous work on Gestalt algebra
operations [1–5].
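The stratified construction of Definition 6.1 is simple enough to sketch in a few lines of Python. The sketch below is restricted to binary operations for brevity; the argument names (binary_ops, assess) and the interface they imply are assumptions made here, not the book's implementation.

from itertools import combinations

def stratified_search(primitives, binary_ops, assess, tau):
    # Breadth-first enumeration of the level sets L_0, L_1, ... in the spirit of Definition 6.1
    levels = [list(primitives)]
    while levels[-1]:
        next_level = []
        for g, h in combinations(levels[-1], 2):
            for op in binary_ops:
                aggregate = op(g, h)
                if aggregate is not None and assess(aggregate) > tau:
                    next_level.append(aggregate)      # keep only well-assessed aggregates
        levels.append(next_level)
    return levels[:-1]                                # the final, empty level is dropped

With any threshold τ > 0 the loop terminates, since, as stated above, all level sets are finite and become empty beyond some maximal depth.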


6.2 Recursive Search

In [6] a recursive search was presented. The approach works as follows: Given a finite set of primitive Gestalten L_0 ⊂ G and a recognition level 0 ≤ ε < 1, the recognition task can be formulated as a recursive enumeration of all terms, combined with testing them for the property a(g) ≥ 1 − ε. At first glance this enumeration might never stop, because in algebraic terms one p ∈ P may appear multiple times. In [7] lemmas are given stating that repetition will lead to arbitrarily small assessments. Eventually, they will be smaller than the threshold. Algorithm 3 calls the operations one after the other. Algorithm 4 implements the pairwise enumeration of |-Gestalten. Algorithm 5 prolongs row Gestalten until the assessment declines. Algorithm 6 calls Algorithms 4 and 5 as well as itself recursively and thus enumerates the hierarchy.

Algorithm 3 Search for all aggregated Gestalten assessed better than a threshold ε
Input: Sorted list BasicGestalten of primitives according to their assessment, assessment threshold ε
Output: Sorted list Gestalten of recognized Gestalten according to assessment
BasicGestalten ← select Gestalten g ∈ BasicGestalten with a(g) > ε
ListMirror ← calculateMirror(BasicGestalten, ε)
List2Row ← calculate2Row(BasicGestalten, ε)
ListRow ← calculateRow(BasicGestalten, List2Row, ε)
Gestalten ← recursiveGestalting(ListMirror, ListRow, ε)
sort Gestalten in descending order with respect to the assessment a(g)

Algorithm 4 calculateMirror (Searching for maximal meaningful mirrors)
Input: Sorted list of Gestalten Gestalten, assessment threshold ε
Output: List of aggregated mirror Gestalten ListMirror
ListMirror ← ∅
for all g_p ∈ Gestalten do
  for all g_q ∈ Gestalten \ {g_p} do
    g_s ← g_p | g_q
    if a(g_s) > ε then
      ListMirror ← ListMirror ∪ {g_s}
    end if
  end for
end for
return ListMirror

Algorithm 5 calculateRow (Searching for maximal meaningful rows)
Input: List of Gestalten BasicGestalten, list of recursively aggregated row Gestalten RowGestalten, assessment threshold ε
Output: Augmented list of recursively aggregated row Gestalten
if RowGestalten = ∅ then
  return RowGestalten
else
  ExtRowGestalten ← ∅
  for all r ∈ RowGestalten do
    for all g ∈ BasicGestalten do
      s ← appendGestalt(r, g)
      if a(s) > ε then
        ExtRowGestalten ← ExtRowGestalten ∪ {s}
      end if
    end for
  end for
  return RowGestalten ∪ calculateRow(BasicGestalten, ExtRowGestalten, ε)
end if

Algorithm 6 recursiveGestalting (Recursive search for Gestalten)
Input: List of aggregated mirror Gestalten ListMirror, list of aggregated row Gestalten ListRow, assessment threshold ε
Output: List of further aggregated Gestalten Gestalten
Gestalten ← ∅
ListMirrorRec ← ∅
ListMirrorRec ← calculateMirror(ListMirror, ε)
ListMirrorRec ← ListMirrorRec ∪ calculateMirror(ListRow, ε)
ListRowRec ← ∅
ListRowRec ← calculateRow(ListMirror, ListRow, ε)
ListRowRec ← ListRowRec ∪ calculateRow(ListRow, ListRow, ε)
if ListMirrorRec ≠ ∅ or ListRowRec ≠ ∅ then
  Gestalten ← recursiveGestalting(ListMirrorRec, ListRowRec, ε)
  Gestalten ← Gestalten ∪ ListMirrorRec ∪ ListRowRec
end if
return Gestalten

6.3 Monte Carlo Sampling with Preferences

Often, when full enumeration of a space of possibilities is infeasible, Monte Carlo sampling offers a reasonable alternative. When discussing row prolongation and finding ends and beginnings of rows in facades, S. Wenzel proposed marked point sampling [8]. She reported very good results on the eTRIMS [9] data with moderate computational effort. Probabilities on combinatorial domains of possibilities, which may be partially overlapping, are a very complicated topic. Recall that they need to be normalized to one. Marked point sampling is one possibility of handling such methods in a mathematically sound way. A similar approach has been proposed by Radim Tyleček in his thesis [10], using both the eTRIMS data [9] and the competition data of 2013 [11]. He used probabilistic sampling methods given by Radim Šára of the Czech Technical University in Prague, where the decisive details on norming are only available in internal reports and, to our knowledge, not published yet. Tyleček's results set the state-of-the-art performance, and it would be of interest to pursue this path further. However, proper random sampling in the hierarchical domain indicated in Chap. 5 is probably even more demanding with respect to mathematical expertise.

6.4 Any-time Search Using a Blackboard

In [12] an any-time interpreter was given for knowledge-based machine vision systems. Though originally designed for production systems coding knowledge, such a technique can also be utilized for Gestalt grouping. Basically, it administers a queue of working hypotheses. Each such hypothesis has an assessment associated with it, and the queue is frequently sorted with respect to these assessments. Thus the better-assessed hypotheses, and with them the more important objects, are handled first, and the less important possibilities have lower priority.
More formally: Let G be a finite set of Gestalten, and O be a finite set of operations
o : G n → G, then we can enumerate all simple (pairwise) hypotheses in a sorted
sequence with respect to the assessment of the Gestalten:
Definition 6.2 A processing queue is a sequence of hypotheses h_i = (g, o, a), where g ∈ G, o ∈ O, a = a_g, which is sorted with rising i with respect to the assessments a.
The operations are then coded in search modules. Such a search module gets a hypothesis as input and produces a set of new Gestalten:
Definition 6.3 A search module is a function that works on the globally held Gestalt set G: m : h = (g, o, a) ↦ {o(g, f_1, ..., f_n); f_i ∈ G}. The newly aggregated Gestalten all contain g as a part.
The simplest example of such a module is the search for partners for o = | (the reflection operation). It enumerates all f ∈ G and returns g|f. Reasonable modules will use a threshold τ and only yield those aggregates with better assessment: {g|f ; a_{g|f} > τ}. Note that this implies that f must be in proximity. So such a module may have constant time complexity, provided that an upper bound can be given for the number of possible partners f ∈ G in proximity.
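A minimal Python sketch of such an assessment-driven queue is given below. It uses a heap instead of repeated sorting, which preserves the best-first behavior described above; the names (modules, assess), the hashability of Gestalten, and the budget-based stopping rule are assumptions of this sketch.

import heapq

def anytime_search(primitives, modules, assess, budget=10000):
    # modules: dict mapping an operation tag to a search module m(g, gestalten) -> new Gestalten
    gestalten = set(primitives)
    queue = [(-assess(g), id(g), g, op) for g in primitives for op in modules]
    heapq.heapify(queue)                        # best-assessed hypotheses are popped first
    while queue and budget > 0:
        _, _, g, op = heapq.heappop(queue)
        budget -= 1
        for new in modules[op](g, gestalten):
            if new not in gestalten:            # avoid duplicate constructions of the same Gestalt
                gestalten.add(new)
                for other_op in modules:        # every new Gestalt spawns new hypotheses
                    heapq.heappush(queue, (-assess(new), id(new), new, other_op))
    return gestalten

The loop can be interrupted at any time (modeled here by the budget), returning the Gestalten found so far, which is the any-time property.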
Figure 6.1 shows how the processing queue interacts with the search modules in
order to implement assessment-driven search. Note:
• The queue compares the assessments of all Gestalten whether they are primitives
or very sophisticated aggregates. Therefore, if the assessments tend to be better
with aggregation level, this will perform depth-first search, and vice versa.
Fig. 6.1 Any-time search diagram

• G needs to be administered as a set, so that adding an element to it that is already present should not construct a new entry. Things like that are going to happen often—such as if g|f is already there and now (f, |, a) is triggering the search and finding g as a possible mate. It is evident that such multiple instances of the same construction cause overheads.

References

1. Michaelsen E, Münch D, Arens M (2013) Recognition of symmetry structure by use of gestalt algebra. In: CVPR 2013 competition on symmetry detection
2. Michaelsen E (2014a) Searching for rotational symmetries based on the gestalt algebra oper-
ation. In: OGRW 2014, 9th Open german-russian workshop on pattern recognition and image
understanding
3. Michaelsen E (2014b) Gestalt algebra—a proposal for the formalization of gestalt perception
and rendering. Symmetry 6(3):566–577
4. Michaelsen E, Gabler R, Scherer-Negenborn N (2015) Towards understanding urban patterns
and structures. In: Photogrammetric image analysis PIA 2015, archives of ISPRS
5. Michaelsen E, Arens M (2017) Hierarchical grouping using gestalt assessments. In: CVPR
2017, workshops, detecting symmetry in the wild
6. Michaelsen E, Münch D, Arens M (2016) Searching remotely sensed images for meaningful
nested gestalten. In: ISPRS 2016
7. Michaelsen E, Yashina VV (2014) Simple gestalt algebra. Pattern Recogn Image Anal
24(4):542–551
8. Wenzel S (2016) High-level facade image interpretation using marked point processes. PhD
thesis, Department of Photogrammetry, University of Bonn
9. Korč F, Förstner W (2009) eTRIMS image database for interpreting images of man-made
scenes. Technical Report TR-IGG-P-2009-01, Department of Photogrammetry, University of
Bonn. http://www.ipb.uni-bonn.de/projects/etrims_db/. Accessed Aug 2018
10. Tyleček R (2016) Probabilistic models for symmetric object detection in images. PhD thesis,
Czech Technical University in Prague
11. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from realworld images competition 2013: summary and results. In: CVPR 2013, workshops
12. Michaelsen E, Doktorski L, Lütjen K (2012) An accumulating interpreter for cognitive vision
production systems. Pattern Recogn Image Anal 22(3):1–6
Chapter 7
Illusions

So far, our thoughts and results on illusion are somewhat callow. In commonsense discourse "illusion" is used pejoratively, assuming that illusions are both false and detrimental. This raises the following question: "If this was true, why then would our visual system produce so many illusions?" Among the scientific Gestalt community most people are convinced that most, if not almost all, illusions are both true and useful, if not outright necessary. There is general agreement on the importance of this topic for successful seeing. The human visual system obviously has a strong tendency toward illusory perception. Many are convinced that there must be an advantage in that in a Darwinian sense. However, we are not aware of any sound theory of illusion. This chapter therefore contains much text and little technical content. Still, this topic is very important, and we would like to present our thinking about it here.

7.1 Literature about Illusions in Seeing

In particular in contour following, where gap filling has been a main focus of attention for decades [1], and also in automatic facade analysis, there is general agreement on the necessity of illusion.
Kanizsa's book [2] gives numerous figures with striking illusions. In every single one of them objects are seen which are not there. In the end the reader, or rather the viewer, wonders why he or she almost never encounters such examples in the real world.
Sometimes the illusion issue is hidden in the term hypothesis. For instance, in Matsuyama's SIGMA [3] (e.g., page 111) a database entry can be an object instance, measured or inferred by mostly abductive rule application, or it can be a hypothesis of where and how an object is expected, so as to fit into a larger and more hierarchic
aggregate object. Such a hypothesis entry has exactly the same format as an instance. It is to be verified, provided enough computational resources are available, in some future state of the search. But, for the time being, it is rather assumed to be there than missing. Clearly such mechanisms are a step in the direction of constructive illusion.

7.2 Deriving Illusion from Top-down Search

The definitions of Gestalt operations given throughout the book at hand are declarative in nature. With them a generative language of Gestalten is given. In this world there can be no illusion at all. However, in Chap. 6, and often in special sections along with the operations, search procedures were discussed. These can be a starting point for a theory of illusion: The search often constructs particular search regions in the domain, in which specific Gestalten are expected to complete the symmetry of a larger pattern. For example, near the end of Sect. 5.3.2, a posteriori, top-down adjustments of parameters of the primitive extraction method were proposed. The hope was that with some different parameter setting, erroneous over- or under-segmentation might be avoided, and the missing primitives completing the balanced Gestalt term could be found. But what if nothing can be found in such a situation?

7.3 Illusion as Tool to Counter Occlusion

The human species belongs to the order of primates, whose primary habitat is the forest. Correspondingly, the visual system has evolved in an environment where occlusion must be the predominant cause of failure of vision. Counter-occlusion mechanisms must have high priority in a world where even objects only a few meters away are likely to be at least partially occluded by trunks, branches, twigs, and foliage. This may explain the preference our visual system has toward gap bridging and contour completion, as demonstrated in Chap. 8.
In many places in this book, such as in Sects. 4.5, 8.2, 8.3, 8.6, and 10.4, the advantage gained from illusory Gestalten is mentioned in the context of different operations. Here we summarize the common main reason and try to analyze such phenomena and their reproduction in machine vision with more detail and rigor. An example for starting:
 
• Given a perfect row Gestalt r = Σ(g_1, ..., g_n) in the scene we assume that there is a certain missing probability p > 0 for any of the parts g_i. Missing means here
that the Gestalt might exist in the scene, but has been lost either during projection
(e.g., because of occlusion), or during primitive extraction (e.g., because of over- or
under-segmentation), or during inference if it is an aggregate of smaller Gestalten.
• In the absence of better knowledge it is wise to assume independence of the
occurrence of a miss for any gi from all others.
• Then the probability for all n parts to be actually there at the hands of the machine, when they are needed, can easily be estimated by use of the appearance probability q = 1 − p. It is q^n. The success of inferring the row Gestalt correctly declines exponentially with the number of members in it.
The consequence of this is dramatic. Experienced image processing engineers know
that they have done a good job if they missed only 10 or 5% of the objects they were
looking for. For n = 10 in the first case the probability of losing the row is 65% and
even with the second more optimistic assumption it would still be around 40%. The
same engineers are probably not satisfied with such rates.
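The numbers quoted here follow directly from q^n; a small Python check (the function name is ours) reproduces them:

def row_loss_probability(n, miss_probability):
    # probability that at least one of the n parts is missing, assuming independent misses
    return 1.0 - (1.0 - miss_probability) ** n

print(row_loss_probability(10, 0.10))   # about 0.65
print(row_loss_probability(10, 0.05))   # about 0.40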

References

1. Medioni G, Lee MS, Tang CK (2000) A computational framework for segmentation and group-
ing. Elsevier
2. Kanizsa G (1980) Grammatica del vedere: saggi su percezione e gestalt. Il Mulino
3. Matsuyama T, Hwang VSS (1990) SIGMA: a knowledge-based aerial image understanding
system. Springer
Chapter 8
Prolongation in Good Continuation

The term good continuation (German “Gute Fortsetzung”) is not only used for frieze
or row symmetry as it is defined in Chap. 3. A large part of the literature on Gestalt
laws treats contour or line prolongation and illusory virtual contours using the same
term. In this book contour or line prolongation is distinguished from repetition in
equidistant spacings and treated separately in its own chapter.
Man-made structures, in particular buildings, preferably have straight outlines.
Automatic mapping of these from imagery containing large amounts of clutter and
noise is a non-trivial endeavor. A very important application of prolongation methods
is also road extraction from aerial or space-borne imagery. For roads a certain amount of curvature must be tolerated. In large-scale imagery roads appear as lines of a color different from their environment. If the scale is finer, two contours appear on either side of the structure, which is then several pixels wide. At those scales the Gestalt law of parallelism is also applicable, which is treated in Chap. 9. In any case road extraction requires grouping in good continuation along the road direction and gap closing.
Grouping methods of similar structure are an important issue in medical image
processing, e.g., in blood vessel mapping. Medical data often come in 3D, that is, as a voxel block. Accordingly, the grouping methods are generalized to one additional
dimension. There are two ways for this generalization: good continuation in one
direction, like a stick or vector, for blood vessels or dendrites of neurons, or good
continuation along a surface, like a plate or plane, for all kinds of mutual tissue
surfaces. The book at hand deliberately avoids the detailed treatment of such 3D
domains. Most of such generalizations are obvious, but technical details can be
tedious.


8.1 Related Work on Contour Chaining, Line Prolongation, and Gap Filling

Contour chaining is a very frequent operation in machine vision. Often grouping of contour segments is preferably performed on collinear parts, prolonging successively
until a gap or strong curvature is reached.
In Gestalt psychology the corresponding law is known by the name of “good con-
tinuation”. The notion of “maximal meaningful segment” results from an a contrario
test foundation for this law—see Desolneux et al. [1]. Important earlier contributions
to this topic of non-accidentalness already appear in Lowe’s book [2]. Chapter 3
of that book treats the appearance of contours on a background of clutter objects,
and Chap. 4 makes a proposal where contours should be merged into one object and
where a curve should not be further prolonged. Lowe proposes to prolong as long as
the curvature is low and breaks the process on local maximal curvature locations and
junctions. At these locations a new medium-level object is to be constructed: the key
point. As always in Lowe's work, such prolongation and segmentation are valid only
for a particular scale of the picture.
The next section of this chapter treats the tensor voting approach to contour
completion as presented by G. Medioni. After that, we will recall some of our own
previous work and re-formulate it as yet another operation in Gestalt algebra, where
the properties deviate in some aspects from the operations defined earlier in this
book.

8.2 Tensor Voting

G. Medioni introduced tensor voting as a method for implementing the law of good continuation in image processing machines [3]. In particular, the impressive contour completion capabilities of human perception are the model for the tensor voting approach. It must not be confused with TensorFlow, a very popular open-source machine learning framework, mainly for deep learning nets, provided by Google research.
Looking at the image displayed in Fig. 8.1 we immediately see a black square in front of four white disks. We constructed this image in the style of Kanizsa. In his book [4] numerous variants of such illusions can be found. The image has five hundred and twelve pixels in both dimensions, with the square having three hundred and twelve. The square is supported only by two times seventy pixels of edge on each side. The other hundred and seventy-two pixels, i.e., more than half, are not there. Still everyone sees them instantaneously. Why?
For comparison we included a standard gradient into the figure, the Sobel gradient.
The magnitude is displayed right next to the picture, and below it are the horizontal
and vertical components. The gradient is zero everywhere except around the circu-
lar disks around the locations (100, 100), (100, 412), (412, 100), and (412, 412),

Fig. 8.1 Illusory black square constructed following the principles outlined by Kanizsa and its
gradient

respectively. Each disk has a quarter segment cut out. There is no square in this
image. Yet, we are convinced that an automaton not constructing the illusory square
contour will never model human seeing. Here, tensor voting can contribute a lot, as
Fig. 8.2 shows.
In contrast to the broad meaning the word tensor has in tensor calculus as a branch
of linear algebra, the meaning here is very specific: A tensor is a positive semi-definite
2 × 2 matrix T. Such matrices are symmetric, T = T^T. They have two nonnegative eigenvalues λ_1 ≥ λ_2, with corresponding eigenvectors e_1, e_2, which are of unit length and mutually orthogonal:
$$T = \begin{bmatrix} e_1 & e_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} e_1^{\mathsf T} \\ e_2^{\mathsf T} \end{bmatrix} \qquad (8.1)$$

Fig. 8.2 Tensor voting and subsequent non-maxima suppression on the illusory black square example

Fig. 8.3 Tensor voting field presented as stick drawing

They can be decomposed into a sum of a ball tensor and a stick tensor:

$$T = \lambda_2 \left( e_1 e_1^{\mathsf T} + e_2 e_2^{\mathsf T} \right) + (\lambda_1 - \lambda_2)\, e_1 e_1^{\mathsf T} \qquad (8.2)$$
$$\phantom{T} = \begin{bmatrix} \lambda_2 & 0 \\ 0 & \lambda_2 \end{bmatrix} + (\lambda_1 - \lambda_2) \begin{bmatrix} \cos^2\Phi & \sin\Phi \cos\Phi \\ \sin\Phi \cos\Phi & \sin^2\Phi \end{bmatrix} \qquad (8.3)$$

Visualization is preferably done as a Φ-tilted ellipse of length 2λ_1 and width 2λ_2.
Ball tensors thus appear circular since their eigenvalues are equal. Stick tensors
only have one nonzero eigenvalue. Thus, their rank is one, and they are displayed
as lines where the length codes their strength. Figure 8.3 displays such tensors. Of
course also null tensors exist. They are displayed white on white ground, just like
our meaningless Gestalten. Medioni uses the term “salience” for the strength of a
tensor, and we see strong relations to our assessments.
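The decomposition of Eq. 8.2 is straightforward to compute. The following Python sketch (assuming NumPy; the function name is ours) returns the ball saliency λ_2, the stick saliency λ_1 − λ_2, and the orientation of the dominant eigenvector:

import numpy as np

def ball_and_stick(T):
    # eigen-decomposition of a symmetric, positive semi-definite 2x2 tensor
    eigenvalues, eigenvectors = np.linalg.eigh(T)      # eigenvalues in ascending order
    lam2, lam1 = eigenvalues
    e1 = eigenvectors[:, 1]                            # eigenvector of the larger eigenvalue
    phi = np.arctan2(e1[1], e1[0]) % np.pi             # orientation is only defined modulo pi
    return lam2, lam1 - lam2, phi                      # ball saliency, stick saliency, orientation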
The word “tensor” is used in accordance with its associations in physics and
differential geometry; i.e., a field is meant that assigns such an object to every location
in the continuous 2D vector space. Figure 8.3 shows a stick tensor voting field
displaying the tensor at raster positions. Such a field exists for every object g. It is
rotated and scaled with the object. Proximity demands that the strength of this tensor
field declines with the distance to its location xg . In tensor voting proximity is not
measured along a straight connection between xg and the location under question.
It is integrated along the path of minimal curvature. But anyway, very far away the
field must be zero. Gaussian decay is preferred, i.e., exponentially with the square
of the distance, and there is a scale parameter in the field of every object. This law is
used for both the ball field and the stick field.
The second law for stick field voting is the law of low curvature. The connecting
path between xg and the location at hand is a minimal curvature path constrained
to φg at xg , and the curvature is integrated. Salience declines with accumulation of
curvature, and here also a Gaussian decay is preferred. The fusion of both laws is conjunctive, i.e., the proximity salience is multiplied by the low-curvature salience, just like in Eq. 2.8.
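A commonly used way to combine the two laws is a decay of the form exp(−(s² + c κ²)/σ²), where s is the arc length and κ the curvature of the minimal-curvature (circular) path to the receiving location. The Python sketch below follows this common formulation; the parameter names, the values of σ and c, and the 45-degree cutoff are assumptions of the sketch, not fixed choices of the book.

import numpy as np

def stick_vote_salience(receiver, sigma=10.0, c=0.2):
    # Salience of a stick vote cast from the origin with horizontal orientation
    x, y = receiver
    l = np.hypot(x, y)
    theta = np.arctan2(abs(y), abs(x))       # angle between the voter's tangent and the connection
    if theta > np.pi / 4:                    # votes are typically cut off beyond 45 degrees
        return 0.0
    if np.isclose(theta, 0.0):
        s, kappa = l, 0.0                    # straight continuation
    else:
        s = theta * l / np.sin(theta)        # arc length along the osculating circle
        kappa = 2.0 * np.sin(theta) / l      # curvature of that circle
    return np.exp(-(s**2 + c * kappa**2) / sigma**2)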
The corresponding field looks a bit like a dipole field. Visualizing it as a raster of sticks like in Fig. 8.3 emphasizes its orientation variation. The tensor has four
continuous components and can thus be displayed in four grayscale images, as in
Fig. 8.4. This emphasizes its continuous field character existing on any location

Fig. 8.4 The four components of a stick voting field displayed separately

in the 2D domain, not only on the raster. Note that here the mid-gray tone codes zero. Negative values, i.e., darker ones, only occur in the off-diagonal entries. Such a display also shows its symmetry and smooth beauty. Even a certain illusion of depth is evident from the shading. Of course, the stick tensor voting field can also be visualized in a single color image. Recall that it has only three distinct components because of the symmetry of its off-diagonals.
The most important algebraic property of such a tensor is its linearity. Tensors
can be added componentwise, which is done in the voting process, and they can be
multiplied by any scalar.
In contrast to most of the content of this book, tensor voting works on a discrete
domain. In the case of standard image processing, each pixel in the pixel lattice
stores a tensor. Tensors in each pixel are accumulated. This works very similarly to
the approximation of continuous convolution by summation over a smaller kernel
template at every pixel. In this way Fig. 8.2 has been obtained from the contours in
Fig. 8.1 by stick tensor voting.
Once all votes are completed, an analysis of the accumulated tensors follows. To
this end the tensor in each pixel is decomposed into ball and stick part using Eq. 8.3.
Where contours have high curvature, such as in the region around the corners of the
black square, the accumulated tensor will have a strong ball part and a weak stick part.
Note that mutually perpendicular stick tensors sum up to a perfect ball tensor. Where
contours—no matter if they are real or result from illusory gap closing—exhibit
low curvature the accumulated tensor will have a strong stick part and a weak ball
part. Accordingly, objects can be extracted: corner objects and contour segments,
respectively. The location and orientation features result from local optima. Thus
tensor voting may well be regarded as a primitive extraction method and should
therefore belong to Chap. 11 of this book with equal rights.
Tensor voting is even more successful in 3D. There the decomposition has a third
component between the stick and the ball tensor: the plate tensor. Medioni gives
numerous examples in his book [3], in particular on medical volume data, where the
contour hints may be very sparse and where clutter and noise may be quite dominant;
tensor voting can compete with human vision in marking the important surfaces. Gen-
eralization to higher-dimensional lattices, such as 4D, is straightforward. However,
there are few applications and data demanding such nD tensor voting.

8.3 The Linear Prolongation Law and Corresponding Assessment Functions

In this section a method is presented that was originally defined as a production in a knowledge-based image analysis system [5]. This production was used for linear contour prolongation. Figure 8.6 shows the situation intended by this prolongation operation. A set of n part locations is given, and the law for their aggregate is good continuation in a straight line. It is clear that the orientation of the aggregate
is given by this line. However, location and scale features are less self-evident.
Accepting the mean location of the parts as location of the aggregate, as it was
defined for the other operations in the previous chapters, is not appropriate here. The
parts may well be unevenly spread and clustered along the line. Instead, the location
should be the mean of the begin location and the end location, where the first part
and the last part along the line, respectively, project to the line. And also the scale of
the aggregate is obtained from these two locations.
In analogy to Definition 10.2, linear prolongation can be defined as follows:
Definition 8.1 An n-ary operation Λ : G^n → G is called linear prolongation operation iff for all g_1, ..., g_n ∈ G:
x_Λ results from the mean of the extreme positions, 1/2(Ḿ + M̀), along the constructed line, as defined below around Eq. 8.7;
φ_Λ = arctan2(l_1, −l_2) mod π, as resulting from solution 8.5;
s_Λ = Ḿ − M̀, as defined below around Eq. 8.7;
f_Λ = 2; and
a_Λ is a conjunctive combination of the line distance residual assessment (Definition 8.2), the orientation similarity of the parts with the aggregate, the overlap assessment, and the assessment inheritance from the parts, as described below.
Figure 8.5 shows this operation at work, where the assessment of the aggregates is
0.81, 0.72, and 0.64, respectively. This operation contains no proximity assessment.
Objects may be at arbitrary distances. Thus the theoretical results, as, e.g., presented in Lemma 5.2, cannot be given for this operation. Similarity-in-scale assessment, which
was also a part of the assessment fusion of the other operations given above, is also
not included here. Long contour pieces may contribute as well as short ones.
The resulting aggregated line Gestalt does not depend at all on the sequence of enumeration in the input. The full permutation commutativity law holds. This can easily be verified by observing that in the construction 8.5, as well as for all the other features, the enumeration sequence is irrelevant.
For line distance residual assessment a different regression is used than for the operation Σ. Recall that the displacement used in Chap. 3 in Fig. 3.4 for
the sum of squares Eq. 3.2 was a point-to-point distance, the distance between the
location of a part Gestalt and one of the set locations of the aggregate row. In the
prolongation case there are no set locations. A point-to-line distance must be used
as residual. For the aggregate a straight line is constructed represented by a normal
form l = [l1 , l2 , l3 ]T . This is a homogeneous entity that can be scaled arbitrarily by
any nonzero scalar without losing its identity. The displacement of a point location
x = [x, y]T from the line is given by x · l1 + y · l2 + l3 , and it can be positive or
negative. Figure 8.6 displays such point-to-line displacements. The squares of these
distances are always positive and can be summed to form a proper goal function.
Fig. 8.5 Prolongation operation: top to bottom declining assessment, number of parts 32, 16, and
8, respectively
Fig. 8.6 Displacements perpendicular to a straight line. This particular line results from orthogonal
regression

Minimization of this sum of squared displacements leads to orthogonal regression, which is a special case of total least squares (TLS). It should not be confused with
standard regression, where the displacements are taken in y-direction. For the solution
one may set, e.g., l2 = −1, and form the partial derivatives of the goal function with
respect to the remaining entries of l. Then the solution reads as


$$l_1 = \frac{C_{2,2} - C_{1,1} + \sqrt{\left(C_{2,2} - C_{1,1}\right)^2 + 4\,C_{1,2}^2}}{2\,C_{1,2}} \qquad (8.4)$$
$$l_3 = \bar{y} - l_1 \bar{x} \qquad (8.5)$$

Here C is the covariance matrix of the locations, and x̄ and ȳ are the means in the
respective directions. This approach only works if the line is not vertical. For almost
vertical lines one may prefer to set l1 = −1 instead of l2 for numerical stability, and
replace all the entries in 8.5 accordingly. But still, there may be configurations where
the approach fails because the denominator turns out zero. In such situations the
line will be either vertical or horizontal so that standard regression can be used or C
is even isotropic. In the latter case every straight line that passes through the mean
location fits equally well. If there are more than two entries to the minimization there
will be residuals, and the sum of squared residuals will be used for assessing the
aggregate:
Definition 8.2 A function a : G^n → [0, 1] is called line distance residual assessment iff n > 2, and there is a scale parameter τ > 0 with

$$a(g_1, \ldots, g_n) = \exp\left( -\frac{\tau}{u^2 (n-2)} \sum_{i=1}^{n} \left( x_{g_i,1}\, l_1 + x_{g_i,2}\, l_2 + l_3 \right)^2 \right), \qquad (8.6)$$

where u is the geometric mean of the scale of the parts.
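For illustration, the following Python sketch (assuming NumPy) fits the line by total least squares and evaluates an assessment of the form of Eq. 8.6. Instead of the closed form 8.4 it takes the eigenvector of the smallest eigenvalue of the covariance matrix as line normal, which is equivalent but also covers vertical lines; function and parameter names are ours.

import numpy as np

def orthogonal_regression(points):
    # Fit a homogeneous line l = [l1, l2, l3] with unit normal (l1, l2) through the centroid
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    C = np.cov(pts.T, bias=True)
    _, vecs = np.linalg.eigh(C)
    l1, l2 = vecs[:, 0]                        # normal = direction of smallest variance
    l3 = -(l1 * mean[0] + l2 * mean[1])
    return np.array([l1, l2, l3])

def line_distance_residual_assessment(points, l, u, tau=1.0):
    # Assessment in the sense of Definition 8.2; requires n > 2
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    res = pts[:, 0] * l[0] + pts[:, 1] * l[1] + l[2]   # signed point-to-line displacements
    return np.exp(-tau / (u**2 * (n - 2)) * np.sum(res**2))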

Once the normal form of the line has been obtained, the orientation similarity
of parts with the aggregate can be calculated. Recall that in the operation Σ only the
mutual orientation similarity of the parts was assessed, by first calculating a mean
orientation. Here the distance to the orientation of the regression line is assessed for
each part Gestalt.
Overlap assessing uses the direction orthogonal to the normal form l, i.e., homo-
geneous line coordinates of the form (−l2 , l1 , ...). For each participating Gestalt gi
a different third component results using

l3,i = xgi · l2 − ygi · l1 (8.7)

We used that 2D direction vector already in Definition 8.1 where the orientation of
the aggregate was obtained from it via arc tangent. It should be normalized to unit
length. In that case the scale of each part Gestalt gi can directly be used to set an
overlap interval [l_{3,i} − 1/2 s_{g_i}, l_{3,i} + 1/2 s_{g_i}].

The minimum and maximum of all interval borders,

$$\grave{M} = \min_i \left( l_{3,i} - \tfrac{1}{2} s_{g_i} \right), \qquad \acute{M} = \max_i \left( l_{3,i} + \tfrac{1}{2} s_{g_i} \right),$$

set the extreme ends of the aggregated line Gestalt. With them, they also set the
location of its center and its scale as already given in Definition 8.1. The central
reference location of the new Gestalt xΛgi results from taking the cross-product of
 
the line l with the direction line, with third component being 1/2 Ḿ + M̀ . Of
course a homogeneous location coordinate results from the cross-product. It needs
to be transformed into a Euclidean by division.
Once all intervals have been constructed, the overlap ratio can be determined. It is the ratio of the overall length covered by any interval to the overall length Ḿ − M̀. This ratio obviously is bounded by zero and one and can thus be directly accepted as overlap assessment. Logical union of intervals may cause some awkward nested code, with Booleans and comparisons, which may be hard to debug and verify. We found it easier to initialize a histogram between M̀ and Ḿ, e.g., in unit steps, and increment the bins while enumerating the intervals. The gaps show as zero regions afterward. This is a good approximation.
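The histogram trick reads, for instance, as follows in Python; the names and the unit bin width are choices of this sketch.

import numpy as np

def overlap_ratio(centers_along, scales):
    # centers_along: the projections l_{3,i}; scales: the part scales s_{g_i}
    lo = np.asarray(centers_along) - 0.5 * np.asarray(scales)
    hi = np.asarray(centers_along) + 0.5 * np.asarray(scales)
    begin, end = lo.min(), hi.max()
    bins = np.zeros(int(np.ceil(end - begin)) + 1)          # unit-step histogram
    for a, b in zip(lo, hi):
        bins[int(a - begin):int(np.ceil(b - begin)) + 1] += 1
    return np.count_nonzero(bins) / len(bins)               # covered length over total length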
8.4 Greedy Search for Maximal Line Prolongation and Gap Closing

In Sect. 3.5 greedy sequential search for rows was presented. Starting from row-seeds
that are made from pairs, the rows are prolonged on either side adding one more part,
as long as the assessment does not decline. In the end, this results in finding maximal
meaningful elements. For the prolongation operation, the search can even be more
greedy, but it should be augmented by outlier removal:
1. The contour primitives are enumerated. Preferably, the sequence of enumeration
can be guided by their assessment.
2. Each primitive defines a search region. For simplicity this used to be a tilted
rectangle around the line primitive where the width was one fixed parameter, and
the length a fixed factor, such as three or five times the length of the line primitive.
3. The set of primitives inside this region is formed which are furthermore also
oriented roughly parallel to it. With this set the prolongation operation, as defined
in Definition 8.1, is applied, and the resulting aggregated line is constructed.
4. After constructing a prolonged aggregate line, outliers can be removed, i.e., line primitives with unequivocally large residuals. If the assessment of an aggregate thus becomes better, the removal will be accepted.
5. Newly constructed and prolonged lines define an even longer search region using the same factor as in step 2. The search continues with step 3. Note that prolonged lines are not themselves the entry into further prolongation. Instead, the operation still works on primitives, only on a larger set.
6. This can be repeated until no further prolongation is possible without declining
assessment.
The declarative hierarchy, i.e., the compositional depth as Gestalt term, remains
shallow. Intermediate prolonged lines only serve as auxiliary structures to guide the
search. Figure 8.7 shows how such a search for long lines works on a synthetic scene with a horizontally oriented foreground with small deviations and a uniformly distributed background. Note that the procedure starts to produce illusions at the same amount of clutter where human perception also begins to fail.
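Step 2 of the procedure above relies on a tilted rectangular search region around a line primitive. A minimal Python membership test for such a region might look as follows; the parameter names and default values are choices of this sketch.

import numpy as np

def in_search_region(line_center, line_phi, line_length, point, width=5.0, factor=3.0):
    # Rectangle centered at the line, oriented along it, factor times as long and width units wide
    d = np.asarray(point, dtype=float) - np.asarray(line_center, dtype=float)
    along = d[0] * np.cos(line_phi) + d[1] * np.sin(line_phi)      # coordinate along the line
    across = -d[0] * np.sin(line_phi) + d[1] * np.cos(line_phi)    # perpendicular coordinate
    return abs(along) <= 0.5 * factor * line_length and abs(across) <= 0.5 * width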
Recall that cluster search for reflection symmetry axes, as presented in Sect. 2.8,
poses the same problem as contour prolongation and can thus be treated with the very
same method. Only that the "primitives" are then non-primitive Gestalten, in particular reflection symmetric aggregates. Those can be of arbitrary size and hierarchical
depth.

8.5 Prolongation in Good Continuation as Control Problem

Often images contain curved structure, and thus the straight line model (cf. Eq. 8.5)
does not suffice. At first glance, a generalization to a quadratic model, such as a
Fig. 8.7 Straight contour Gestalt on clutter: left column input data, right column grouping result;
top to bottom rising amount of clutter, 250, 500, 1000, and 2000
circular line, may suggest itself. However, for tasks, such as blood vessel mapping, or
road mapping, neither a straight nor a circular model will suffice. More appropriate models are splines. Also snakes, zip-lock filters, etc., have been proposed and perform
well. The remote sensing community knows such mechanisms traveling along the
road by the term “dynamic programming and grouping” [6]. In such approaches a
separation of declarative object definitions from procedural search mechanisms, as it
has been our guiding principle throughout the book at hand, can hardly be maintained.
Instead, the realm of filter and control theory takes over. An automaton navigates
through the picture. Moreover, here the complete scene may replace the picture with
its margins: Imagine an unmanned aerial vehicle with a nadir-looking camera, that is
tracing a power line, a road, a river, etc. Then this is not a virtual automaton; it is
the flight control of a physical device. The input data for the control are some pixel
colors along the target stripe to be compared to some pixel colors left and right of the
target. The motor control signals would be recorded in a long growing list, mapping
the curvature of the target, and also special events, such as crossings, endings, and
gaps. What used to be illusion throughout the book would just be filter prediction in
the absence of useful measurements.
We can imagine that such an automaton may use perceptual grouping, also in its hierarchical form, assessing complex aggregates by Gestalt operations. The simplest and most frequent Gestalt operation utilized here will be || as defined below in Chap. 9.
However, friezes of road markings may also well contribute to the measurement of the central axis location with precision, as well as reflection symmetric patterns,
or lattices deliberately painted on the target. However, it is hard to imagine how
the ever-growing list and map acquired by the automaton could be a Gestalt in the
domain given in Sect. 1.3 and thus part of an even larger aggregate in a hierarchical
Gestalt term.

8.6 Illusory Contours at Line Ends

The endpoint of a stripe or line Gestalt is of special interest to the human percep-
tive system. There is a considerable preference to perceive an occlusion at such
locations [4]. In Fig. 8.8 evidence for this is given in the classical way—i.e., using
the reader as test subject. As in the examples before, a certain amount of clutter lines
is used as background. Here the length of the lines varies uniformly between 10 and
20 units, and their orientation is also uniformly distributed, as well as their location.
They are occluded by a white line ten times wider than the clutter lines. This has
a length between 60 and 120 units and also a varying orientation and location such
that it fits in the frame.
Such a white line has no contrast to the white background color on which the
figure is drawn. It should be invisible. A simple explanation might be that the clutter
lines, when they are de-focused, can be perceived as light gray background that
gives a faint contrast to the foreground Gestalt. This may be the true reason if the
reader steps back from the book some three meters. However, the illusion also works
Fig. 8.8 White line Gestalt on clutter: left to right rising amount of clutter, 200, 300, 400, and 500

at close distance in perfect focus. Obviously, line endings or gaps in lines can be subject to a test on the good continuation law following Sect. 8.3. And if they pass that test, a salient—and illusory—foreground Gestalt is perceived. Of course, the effect is stronger with a rising amount of clutter objects and also with the size of the foreground Gestalt. In Fig. 8.8 the illusory line is weak in the upper left image and strong in the lower right image with its much higher clutter line density. This can be
explained by the rising number of line endings resulting from occlusion.
Such grouping can be included in the presented approach in the following way: As an additional primitive extraction step, a Gestalt is constructed at each end of a given line
Gestalt. Thus from n line Gestalten 2n additional end Gestalten are given. Next to their
location they need values for the other compulsory features of the Gestalt domain.
The orientation should not be the same as the orientation of the corresponding line.
Instead, the orientation perpendicular to that is chosen. The periodicity was fixed
as two. The scale can either be a certain fraction of the line length, such as 1/10,
or a fixed small scale. The assessment should be inherited from the line Gestalten.
Figure 8.9 (a) shows this construction. In (b) such line end Gestalten are displayed for
the upper right image given in Fig. 8.8. For better visibility, only those line-ends are
displayed that result from the occlusion. Obviously, on such a set of primitives there is a good chance of establishing the desired illusory contour by use of operation
Fig. 8.9 Gestalten at line end location: left, 30 random line Gestalten with their corresponding line end Gestalten; right, occluding line end Gestalten from the 300 lines in Fig. 8.8 (upper right part)

Λ as described in Definition 8.1. However, the other line ends not resulting from
occlusion are not displayed here. They could possibly mask the effect a bit. Also,
there are large gaps to bridge, the orientation deviation from the optimum is quite
high leading to a comparably bad assessment, and the ends of the new illusory line
margins are uncertain.
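The construction of the end Gestalten is simple; the following Python sketch implements it for line Gestalten given as plain records. The record layout (keys 'x', 'y', 'phi', 'scale', 'a') and the fraction 1/10 are assumptions of this sketch.

import numpy as np

def line_end_gestalten(lines, scale_fraction=0.1):
    # From n line Gestalten construct 2n end Gestalten with perpendicular orientation
    ends = []
    for g in lines:
        dx = 0.5 * g['scale'] * np.cos(g['phi'])
        dy = 0.5 * g['scale'] * np.sin(g['phi'])
        for sign in (+1, -1):                                    # one Gestalt per line end
            ends.append({'x': g['x'] + sign * dx,
                         'y': g['y'] + sign * dy,
                         'phi': (g['phi'] + np.pi / 2) % np.pi,  # perpendicular to the line
                         'periodicity': 2,
                         'scale': scale_fraction * g['scale'],
                         'a': g['a']})                           # assessment inherited
    return ends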

References

1. Desolneux A, Moisan L, Morel JM (2008) From gestalt theory to image analysis: a probabilistic
approach. Springer
2. Lowe DG (1985) Perceptual organization and visual recognition. Kluwer Academic Publishing
3. Medioni G, Lee MS, Tang CK (2000) A computational framework for segmentation and group-
ing. Elsevier
4. Kanizsa G (1980) Grammatica del vedere: saggi su percezione e gestalt. Il Mulino
5. Jurkiewicz K, Stilla U (1992) Understanding urban structure. In: ISPRS
6. Wang W, Yang N, Zhang Y, Wang F, Cao T, Eklund P (2016) A review of road extraction from
remote sensing images. J Traffic Transp Eng 3(3):271–282
Chapter 9
Parallelism and Rectangularity

The first section of this chapter treats parallelism together with close proximity.
Empirical Gestaltist research from Wertheimer to Kanizsa has revealed parallelism
as a strongly preferred law for Gestalt formation [1, 2]. For example Pizlo explains
that parallel structures in the 3D world will almost always project to parallel structures
in projections—as long as they are close to each other [3]. Of course that is not true
for structures in greater distance to each other under strong perspective distortion,
but parallelism turns out to be among the most stable relations surviving central
projection, at least approximately. Almost any other symmetric arrangement suffers
more seriously from such projection.
Rectangularity (or orthogonality which is used synonymously here) is quite unsta-
ble under central projection. Yet it is a very important law of organization in the man-
made world. For example Leyton uses the square as the most symmetric mother of all
figures from which the others are derived by successive deformation processes [4].
There is a preference in the human visual system to see a rectangle, a trapeze, or even a general quadrangle drawn on the plane as a square tilted in 3D. The square has
rotational periodicity four. Because four is an even number rotational periodicity
two is contained as a sub-group, which means that parallelism is also contained.
Furthermore, there are also four axes of reflection symmetry that map a square on
itself. Section 9.3 will be on orthogonality and parallelism for polygons.

9.1 Close Parallel Contours

Figure 9.1 displays sets of randomly drawn straight line segments. All feature the
same length, which is used as scale feature in their representation in the Gestalt
domain. Location and orientation have been obtained from a uniform distribution at
random, with one exception: In each image one line has a parallel partner nearby.

Fig. 9.1 Parallel contour Gestalten and clutter: left to right rising amount of clutter, top to bottom
declining proximity

Proximity is varying from top to bottom, and the number of Gestalten is varying
from left to right. Such experimental random line images can give evidence of where proximity is preferred by human seeing and how much clutter can be tolerated
without masking the salient pair of parallel contours.
We leave it as an exercise to the reader to write a small program for such experi-
ments, and note at what parameter setting he or she still perceives immediate salience,
and from what settings on only cumbersome search can unveil the location where
the non-random object is. Another exercise is following Desolneux’s approach [5]
estimating the expectation for the number of sufficiently parallel pair aggregates,
i.e., line pairs that happen to be parallel by chance. Here, deviation in orientation,
proximity, and clutter density should be the parameters. Both exercises should come
up with roughly the same result! If the expectation is greater than one, meaning that
it is likely that at least one such pair exists in a random image, it is not seen as salient.
On the other hand, if the expectation is much smaller than one, for instance in the order of 0.01, then such an object will be immediately perceived if it is present.
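For the first exercise, a small Python program in the following spirit suffices. It draws uniformly distributed clutter lines plus one parallel partner of the first line; all parameter names and default values are ours, so the reader can vary the gap and the clutter density.

import numpy as np
import matplotlib.pyplot as plt

def parallel_pair_experiment(n_clutter=100, length=20.0, gap=5.0, size=500, seed=None):
    rng = np.random.default_rng(seed)
    xy = rng.uniform(0, size, (n_clutter, 2))
    phi = rng.uniform(0, np.pi, n_clutter)
    # one extra line, parallel to the first clutter line and displaced by `gap` perpendicular to it
    partner = xy[0] + gap * np.array([-np.sin(phi[0]), np.cos(phi[0])])
    xy, phi = np.vstack([xy, partner]), np.append(phi, phi[0])
    for (x, y), p in zip(xy, phi):
        dx, dy = 0.5 * length * np.cos(p), 0.5 * length * np.sin(p)
        plt.plot([x - dx, x + dx], [y - dy, y + dy], 'k-', linewidth=1)
    plt.gca().set_aspect('equal')
    plt.axis('off')
    plt.show()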
As compared to the reflection symmetry operation given in Chap. 2, and also to
the other operations given in this book, proximity must be differently parametrized
here. As compared to their size, the parts must be much closer to each other. Also
very tight tolerance must be demanded for the similarity to the orientation of the
parts. They should never intersect. Therefore, we define a separate operation for this
important perceptual aggregate:
Definition 9.1 A binary operation || : G × G → G is called close parallelism oper-
ation iff for all g p , gq ∈ G:
• x p||q = (x p + x q )/2,
• φ p||q ,
• s p||q results from the overlap,
• f p||q = 2,
and a_{p||q} is a conjunctive assessment combination of parallel orientation, proximity, overlap along the common direction, and assessment inheritance from both parts.

9.2 Drawing on Screens as Graphical User Interface

This section might fit as well in Chap. 11, because it is about a method to obtain
primitive pictorial instances. Ever since the days of Milgram and Rosenfeld [6], engineers, designers, developers, teachers, etc., have been dreaming of a smart blackboard as an interface, where handwritten characters and formulae, sketches, etc., are used to communicate with the machine instead of a keyboard (or, in those days, punch card readers). Today we use our tablets and smartphones very naturally in this way, so that an ever-rising need is given for the automatic interpretation of such data.
Human beings communicate various structures and organizations by drawing
diagrams and schemes. This includes all kinds of maps and plans, where there is
a close geometric or at least topologic mapping between some scene in the real
world and the scheme, for example, interior design plans for placing furniture or
drawings in mechanical engineering. For such cases parallelism and orthogonality are
often inherited from the depicted scene. However, also included in graphical human
communications are more abstract structures such as UML flowcharts. In those cases
the content has no planar structure, and thus the designer is free to place his or her
objects and links anywhere. It turns out that in such situations people ubiquitously make use of Gestalt laws. They place similar objects in rows and columns orthogonal to each other and aligned with horizontal and vertical directions, and they use
parallelism and proximity for communicating their content. Thus Gestalt recognition
can contribute substantially to the automatic analysis of such human-drawn schemes.
As an example, we present in Fig. 9.2 a hand-drawn sketch of an electronic circuit.
This schema was drawn by hand on a tablet computer, using a program devised by
us for such purpose. The result is not only a set of black pixels on white background.
There is also the temporal sequence in which the pixels are set, and this contains
valuable additional information. It means that the topology list, that otherwise an
automaton as indicated in Sect. 8.5 would have to provide, is not necessary. A straight
line segment is drawn serially from one end to the other, and the sequence is already
Fig. 9.2 A small hand-drawn electronic schema

given here. Once such a line from start to end location has been detected, operation Λ
from Definition 8.1 can be applied, with the along sequence and the extreme locations
Ḿ and M̀ already known.

9.3 Orthogonality and Parallelism for Polygons

In all ages of human history and in all regional cultures, there is a strong prevalence
of the right angle over all other angles. For our global technical civilization this is
self-evident, but it also appears, e.g., in Native American patterns or in the Forbidden City in China. Engineers prefer to lay out their graphical schemes in a parallel and rectangular manner, and when they transfer, e.g., an electric scheme to the circuit board, it will still be preferably organized in parallel and rectangular patterns, at the risk of feedback and crosstalk. Also the preferred layout of cities—at least in contemporary America as well as in antique Roman settlements—is the “block.”
The Unified Modeling Language (UML) standards do not set parallelism or rectangular arrangements as compulsory. Yet the vast majority of diagrams we encounter are organized in this manner. Obviously, this has no semantic reason. A UML diagram still refers to the same content if it shows oblique connections. It is a preference
of Gestalt and order against unnecessary and arbitrary chaos. On the other hand, the
American language knows the antonyms “hip” and “square.” Obviously, the techni-
cal engineers’ world is so overloaded by these Gestalten that the more artistic and
sentimental part of the population is overfed with it and calls for some oblique or
round exception. This is the most interesting example for salience through breaking
of symmetry as an artistic design principle. We find the antagonism against the square
Gestalt in many designers and architects from Gaudi to Hundertwasser.
Much of the literature on automatic scheme recognition from scanned or hand-
drawn samples treats enforcement of parallelism and rectangular arrangement [7].
About half of the book of Leyton [4] is on the square and its corresponding sym-
metries. This is his master symmetry from which the percepts are constructed by
deforming processes.
Schemes, diagrams, technical drawings, such as the one presented in Fig. 9.2,
often show a Gestalt hierarchy much as it was treated in Chap. 5. Proximity, and in this case also the sequence of drawing, can help establish the most appropriate
Gestalt term:
• Most salient is the input as well as the output capacitor, which are drawn as pairs of
neatly parallel lines in close proximity and with almost perfect overlap. Operation
|| as defined in Definition 9.1 will yield high assessments. It is very likely that the
two lines of such a ||-pair are drawn immediately one after the other.
• Also salient are the elongated rectangles representing resistors. Most engineers will
draw a resistor in a similar way as a capacitor. That is, he or she would draw a pair
of parallel lines in proximity to each other first and then close the rectangle with
two short strokes. Some people might prefer drawing it in one flow, circumscribing
it. Operation || as defined in Definition 9.1 will yield high assessments for the pair
of long sides.
• The two resistors are aligned and connected in a voltage divider. Such elongated
rectangles have rich symmetries on their own, so that the reflection operation | as
defined in Definition 2.3 and the frieze operation Σ as defined in Definition 10.2
yield the same result. There is a certain asymmetry in the length of the two rect-
angles. Thus, the assessment of this aggregate will not be very close to one.
• There is a reflection symmetric triangle, i.e., two lines that form a very well-
assessed |-aggregate, closed by a third stroke. Engineers indicate amplifiers in this
manner. These two lines are the only ones that are oblique. All other lines are
roughly horizontal or vertical. With high probability such a triangle is drawn in one
flow, circumscribing it.
• Finally, the Gestalten thus found are connected by either vertical or horizontal
lines in a closed circle: the negative feedback loop. And we have three open ends,
where other circuits may connect: input, output, and ground.
We realize that the highest assessed Gestalt term in such a diagram may often already
indicate the semantics of such a drawing, although no domain knowledge was uti-
lized.
The Gestalt term indicates which constraints should be enforced, in order to
rearrange the sketch so as to transform it in the direction of a proper technical diagram.
In the end all such constraints should be enforced in one common equation system,
Fig. 9.3 Electronic schema with 45 constraints detected and enforced

Table 9.1 Constraints inferred by hypotheses testing for the freehand sketch depicted in Fig. 9.2.
22 constraints form a minimal set, i.e., a set of consistent and redundant-free constraints required
to describe the 45 found geometric relations

Constraint     Enforced    Required
Orthogonal     22          20
Parallel       22           1
Concurrent      1           1
Sum            45          22

just as it was proposed by Pohl et al. [8]. In Fig. 9.3, 45 constraints are enforced. Of
these, only 22 are independent.
Table 9.1 summarizes the 45 geometric relations found by testing hypotheses for
the adjacent segments of straight lines depicted in Fig. 9.2. These constraints form a
set of consistent but redundant equations. A rectangle, for instance, can topologically
be described by three right angles or by two pairs of parallel straight lines and
one orthogonality constraint. A greedy algorithm can identify consistent and
redundancy-free sets, either by numerical checks or by algebraic methods. Here, a set of
22 constraints has been identified. Note that this result depends on the order in which the
constraints are processed by the greedy algorithm.
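As an illustration of the numerical variant, the following Python sketch greedily accepts a constraint only if its linearized equation increases the rank of the system collected so far. The function name, the row-per-constraint representation, and the tolerance are our own assumptions for this sketch, not a transcription of the implementation used for Fig. 9.3.

import numpy as np

def greedy_minimal_set(jacobian_rows, tol=1e-9):
    """Greedily pick a consistent, redundancy-free subset of constraints.

    jacobian_rows: list of 1-D arrays, one linearized constraint equation each,
    evaluated at the current parameter estimate.  A constraint is kept only if
    its row increases the numerical rank of the stacked system; redundant
    constraints (e.g., the fourth right angle of a rectangle) are skipped.
    """
    selected, stacked = [], None
    for i, row in enumerate(jacobian_rows):
        candidate = row[None, :] if stacked is None else np.vstack([stacked, row])
        old_rank = 0 if stacked is None else np.linalg.matrix_rank(stacked, tol=tol)
        if np.linalg.matrix_rank(candidate, tol=tol) > old_rank:
            stacked = candidate        # rank grew: constraint is independent
            selected.append(i)
    # as stated above, the result depends on the enumeration order
    return selected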
Note that not only the usual Gestalt organization indicated in Fig. 9.3 is present.
There are also certain conventions in this domain, e.g., on how to indicate ground and
how to draw an amplifier, a capacitor, or a resistor. Such conventions exist for any
domain of practical relevance. They can be the topic of automatic knowledge-based
analysis of such imagery, which is treated in Chap. 12.

References

1. Wertheimer M (1923) Untersuchungen zur Lehre von der Gestalt II. Psychologische Forschung 4:301–350
2. Kanizsa G (1980) Grammatica del vedere: saggi su percezione e gestalt. Il Mulino
3. Pizlo Z, Li Y, Sawada T, Steinman RM (2014) Making a machine that sees like us. Oxford
University Press
4. Leyton M (2014) Symmetry, causality, mind. MIT Press, Cambridge, MA
5. Desolneux A, Moisan L, Morel JM (2008) From gestalt theory to image analysis: a probabilistic
approach. Springer
6. Milgram DL, Rosenfeld A (1972) A note on grammars with coordinates. Graph Lang:187–194
7. Marriott K (1998) Visual language theory. Springer
8. Pohl M, Meidow J, Bulatov D (2017) Simplification of polygonal chains by enforcing few
distinctive edge directions. In: Sharma P, Bianchi FM (eds) Scandinavian conference on image
analysis (SCIA). Lecture notes in computer science, vol 10270, pp 1–12
Chapter 10
Lattice Gestalten

With the term lattice we refer to an aggregate Gestalt composed of a row of columns
which are preferably perpendicularly oriented to the row, or at least not collinear
with it. Such constructions are ubiquitous in a man-made world and also result from
numerous natural causes such as crystallization or convection. Actually, such simple
organization is only one of seventeen possible tilings of the 2D plane. These possible
tilings correspond to the seventeen wallpaper groups mapping the complete
2D plane onto itself. Many publications on the visual symmetry topic such as [1, 2]
emphasize this fact.
Temptation is strong to understand lattices as Gestalten of hierarchy two, as a row
of rows, or as

l = \sum_{k=1}^{n} \sum_{j=1}^{m} g_{k,j}    (10.1)

using the operation Σ defined in Chap. 3 (Def. 3.5). But there are several severe
problems with this model:
• Both operations in 10.1 construct a generator vector. Let us call them v for the
row formation running with k and w for the row formation running with j. If w is
substantially longer than v, say m times, choosing the k aggregation as outer
and the j aggregation as inner grouping would yield a much better-assessed lattice
Gestalt than vice versa.
• Formation of the outer row is only possible after all part rows have already been
constructed. If aggregation of one of them fails for whatever reason, the whole
Gestalt cannot be established. Given a lattice of 6 × 4 members and a false negative
rate for the parts of 5%, the probability of finding the lattice reduces to 0.95^24 ≈
0.29. However, common sense teaches that if, e.g., 20 out of 24 parts are present
and well located, there will be enough evidence for such a lattice.

10.1 Related Work on Lattice Grouping

Perceptual grouping according to lattice structures has been the focus of numerous
publications in the past decades, though many of the papers would rather assign their
approaches to knowledge-based image analysis or syntactic recognition. The topic
has been fostered by facade recognition endeavors, in particular the European project
eTRIMS [3]. The outcome was as follows:
• Bayesian networks were utilized to model hierarchical structures.
• Markov random fields were utilized to model peer-to-peer relations.
• Logical structures were proposed for the representation of taxonomical and compositional hierarchies.
• 2D grammars were utilized to model the structural relations syntactically.
Several PhD theses were based on eTRIMS, including the one of Wenzel [4], in which
a sophisticated probabilistic model was presented together with a sampling method called
marked point sampling. Tyleček [5] also presented a probabilistic model together with
a sampling method and achieved very high performance on the eTRIMS data [3]
as well as on the 2013 symmetry competition data [2].
On the latest space-borne synthetic aperture RADAR (SAR) imagery, facades also
appear as lattices. This triggered a series of papers on the topic. In order to achieve
acceptable results Schack and Soergel [6] exploited almost all available knowledge
on the mapping geometry, as well as on the geometric properties of facades. They
were quite pessimistic about the feasibility of automatic grouping based on Gestalt
laws only, saying: “This means that clustering in the 2D SAR geometry is foredoomed
to fail,” and they recommend using 3D grouping, because these particular SAR data
come not only with the geographic location in North and East. There is also a phase
feature measured for each scatterer, which corresponds to elevation. Yet, looking at
the 2D SAR intensity data, one perceives salient lattices immediately. It is evident
that some non-random organization is present. And we think at least some automatic
lattice grouping should be possible. The example presented in Sect. 10.5 below is
based on the same data.

10.2 The Lattice Gestalt as Defined on Locations

As indicated by Fig. 10.1 our lattices are grids generated by two vectors. However,
unlike the lattice symmetries treated in the symmetry literature our lattices have mar-
gins. The figure displays a general case with non-orthogonal generators of different
lengths. Such drawings always induce strong spatial illusions in human observers.
We will discuss that later and ask the reader to look at this drawing as what it is: a
2D sketch of some vectors and points. Such a lattice Gestalt needs at least four parts
g_{1,1}, g_{1,2}, g_{2,1}, g_{2,2}, while a first estimate for the generators v and w can already
be given from only three of these parts. So in case of measurement errors there
will already be residuals for the minimal quadruple setting. Let us denote the new
operation by # with two indices running:
l = \#_{k=1,\,j=1}^{n,m}\, g_{k,j}    (10.2)

Set locations are given by

\hat{x}_{k,j} = x_l + (k-1)\,v_l - \tfrac{1}{2}(n-1)\,v_l + (j-1)\,w_l - \tfrac{1}{2}(m-1)\,w_l.    (10.3)
In Fig. 10.1 we have n = 6 and m = 4. It can be easily verified that there is a closed-
form linear solution to the problem of minimizing the squared residuals similar to
Eq. 3.4 in Chap. 3. The calculation can be separated into one linear system for the
horizontal coordinate of the locations and a second, independent linear system for
the vertical coordinate. Both systems have the same form, namely
\begin{bmatrix}
nm & \tfrac{1}{2}\,n(n-1)\,m & \tfrac{1}{2}\,n\,m(m-1) \\
\tfrac{1}{2}\,n(n-1)\,m & \tfrac{1}{6}\,n(n-1)(2n-1)\,m & \tfrac{1}{4}\,n(n-1)\,m(m-1) \\
\tfrac{1}{2}\,n\,m(m-1) & \tfrac{1}{4}\,n(n-1)\,m(m-1) & \tfrac{1}{6}\,n\,m(m-1)(2m-1)
\end{bmatrix}
\begin{bmatrix} x_0 \\ v_l \\ w_l \end{bmatrix}
=
\begin{bmatrix}
\sum_{k,j=1}^{n,m} x_{k,j} \\
\sum_{k,j=1}^{n,m} (k-1)\, x_{k,j} \\
\sum_{k,j=1}^{n,m} (j-1)\, x_{k,j}
\end{bmatrix},    (10.4)

where the sums indicate summation over the respective coordinate (horizontal or
vertical) of all part locations.
Here, x_0 gives the set location of the corner element with indices k = j = 1. In
order to get the location of the new lattice Gestalt x_l, which is in the center, we have
to add half of (n − 1)v_l and half of (m − 1)w_l, following Eq. 10.3. The new location
x_l, by the way, turns out to be just the mean location of all parts.
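A minimal sketch of this estimation, assuming the measured part locations are given on a complete n × m grid, is the following. Instead of writing out the normal equations of 10.4, it solves the equivalent least-squares problem directly; the function name and interface are hypothetical.

import numpy as np

def fit_lattice(locations):
    """Least-squares fit of the corner x0 and the generators v, w to an
    n-by-m grid of measured part locations.

    locations: array of shape (n, m, 2) with the measured positions x_{k,j}.
    Returns x0 (set location of the k = j = 1 corner), v, w, and the lattice
    center x_l.  The horizontal and vertical coordinates are solved with the
    same design matrix, as stated in the text.
    """
    n, m, _ = locations.shape
    # k, j run from 0 here, i.e., they already equal (k-1) and (j-1) of the text
    k, j = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    A = np.column_stack([np.ones(n * m), k.ravel(), j.ravel()])
    b = locations.reshape(n * m, 2)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)        # shape (3, 2)
    x0, v, w = params
    x_l = x0 + 0.5 * (n - 1) * v + 0.5 * (m - 1) * w      # center, cf. Eq. 10.3
    return x0, v, w, x_l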
Let us assume without loss of generality that n · ‖v_l‖ ≥ m · ‖w_l‖. Then we set the
scale or size

s_l = n · ‖v_l‖,    (10.5)

and we also obtain the orientation

φ_l = arctan(v_{l,y} / v_{l,x}).    (10.6)

The rotational self-similarity periodicity will be set to two, just like for the rows.
In rare exceptional cases of n = m, equal generator lengths ‖v_l‖ = ‖w_l‖, and
orthogonal generators v_l^T w_l = 0, it may be wiser to switch to rotational self-similarity
periodicity four.
Fig. 10.1 The lattice locations: Solid dots are set locations \hat{x}_{k,j}; empty dots are measured locations
x_{k,j}; the aggregate lattice Gestalt will be located at the position indicated by the larger gray dot

The assessment of a newly generated lattice Gestalt will be fused from several
components. In analogy to Definition 3.1, we obtain the first assessment component
from the residuals as
Definition 10.1 A function a : G^{n×m} → [0, 1] is called residual lattice assessment
iff n, m > 2, and there is a scale t > 0 with

a(g_{1,1}, \ldots, g_{n,m}) = \exp\!\left( -\frac{t}{u^2 (n-3)} \sum_{k=1,\,j=1}^{n,m} \| x_{k,j} - \hat{x}_{k,j} \|^2 \right),    (10.7)

where u is the geometric mean of the scales of all the parts.


For lattices we will reward similar lengths of the generators; for the corresponding
row-of-rows Gestalt, such dissimilarity would be punished automatically, in particular
for larger m, as indicated when introducing this chapter. In addition, a reward for
orthogonality of the generators is plausible.

10.3 The Role of Similarity in Lattice Gestalt Grouping

It suffices to refer to Sect. 3.4 for the details of similarity assessments. In particular,
there is similarity with respect to scale—see Definition 3.3, and similarity with respect
to orientation—see Definition 3.4. There is almost no difference here, except that we
have double indexing over k, j instead of single indexing over i there.
In the case of facade recognition where the parts arranged in a lattice are mostly
windows, similarity between these parts gives strong evidence for the presence of
the lattice.
With these definitions at hand, we can formalize the definition of the operation #
sketched above in 10.2:

Definition 10.2 An n · m-ary operation # : G^{n·m} → G is called lattice symmetry
operation iff for all g_{1,1}, \ldots, g_{n,m} ∈ G^{n·m}:

x_L = \frac{1}{n \cdot m} \sum_{i=1,\,j=1}^{n,m} x_{g_{i,j}}, the mean,

φ_L = \arctan2(v_Σ) \bmod π as resulting from solution (10.4), and where ‖v_Σ‖ · n >
‖w_Σ‖ · m is assumed without loss of generality,

s_L = (n-1)\,\|v_Σ\| + \big( \prod_{i,j} s_{g_{i,j}} \big)^{1/(n \cdot m)}, where ‖v_Σ‖ · n > ‖w_Σ‖ · m is assumed without loss
of generality,

f_L = 2, and

a_L is a conjunctive combination of residual lattice assessment, cf. Definition
10.1, orientation similarity assessment, proximity assessment, similarity-in-scale
assessment, punishment of illusions, and assessment inheritance of the parts.

Algebraic closure and most formal properties of this operation are similar to the
row operation. Basically, all that was said in Chap. 5 holds here as well. Below in
Sect. 10.4 an efficient greedy search procedure is given.

10.4 Searching for Lattices

In Sect. 3.5 the combinatorial nature of the search for proper subsets of a set of given
Gestalten and of proper enumeration in tuples for finding well-assessed friezes or
rows was discussed. For lattices the situation is similar if not worse. The enumeration
now uses two running indices. Minimal lattices have four members, and lattices with
hundreds of members are not rare. Theoretically, the power set of the possible parts
has to be searched for the set of maximal meaningful Gestalten, just the same as in
Sect. 3.5. However, in practice we are not dealing here with subsets of ten or twelve
elements, but with subsets of sixty or eighty elements. Recall, here we have binomial
coefficients. That means we are many orders of magnitude apart. Sound solutions
to such search problems that guarantee to find the best solution in all situations are
usually of intractable computational complexity.
In Sect. 3.5.2 a greedy search for row Gestalten was presented instead for the
smaller sets treated there. Such heuristic solutions give long rows early and do not
keep sub-maximal rows that are part of longer aggregations. For lattices we need to
be even more greedy in order to keep the search at a feasible effort. Augmentation
of lattices is a bit more complicated as compared to row prolongation fore and
aft. Figure 10.2 depicts such an augmentation step. Given an n × m lattice, one-time
extrapolations of the generators v and w in either direction yield 2(n + 1) + 2(m +
1) search locations around the perimeter of the object. These are indicated in the figure
as dotted circles. For each of these locations only the closest partner can be accepted
as corresponding to the index pair.
Fig. 10.2 The lattice Gestalt augmentation: a path of 2(n + 1) + 2(m + 1) search locations around
an n × m lattice

Thus the elements for the index pairs (1, 1), (2, 1), …, (n + 2, 1), (n + 2, 2),
…(n + 2, m + 2), (n + 1, m + 2), …, (1, m + 2), (1, m + 1), …, and (1, 2) are
found. The inner elements are the old elements. So all their indices are incremented
by one.
On real data and with an imperfect primitive extraction method, a nonzero false
negative rate φ has to be expected. Under independence assumption the probability
for the presence of all 2(n + 1) + 2(m + 1) elements would thus result as

p = (1 − φ)^{2(n+1)+2(m+1)}.    (10.8)

Even for small false negative rates, this is approaching zero with rising numbers n
and m very quickly. Such loss of performance can only be compensated by illusion.
A certain portion λ of missing elements is tolerated, e.g., λ = 0.3. Then the corre-
sponding probability of finding an existing lattice under false negative rate φ for the
primitives is much higher and acceptable. It can be computed or estimated by use of
the binomial distribution, as sketched below.
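The following small sketch reckons this probability with SciPy's binomial distribution; the function name and the default tolerance are our own choices for illustration.

from scipy.stats import binom

def augmentation_success_probability(n, m, fnr, lam=0.3):
    """Probability that at most a fraction lam of the 2(n+1) + 2(m+1) search
    locations is missing, given a per-primitive false negative rate fnr.
    Without tolerated misses this reduces to (1 - fnr) ** path, cf. Eq. 10.8."""
    path = 2 * (n + 1) + 2 * (m + 1)
    # the number of missing elements is binomially distributed
    return binom.cdf(int(lam * path), path, fnr)

For n = 6, m = 4, and φ = 0.05, the tolerant variant yields a value close to one, whereas the probability of all 24 perimeter elements being present is only about 0.29.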
If the illusion rate is used on the entire search path, only lattices with even numbers
n and m can be constructed by successive augmentation of 2 × 2 lattice seeds. In
order to allow lattices with odd numbers in either direction a case-by-case analysis
is done:
• If more than λ · (n + 2) elements are missing in the partial path (1, 1), …, (n +
2, 1) then this part of the lattice will be cropped. The result is a lattice with m + 1
members instead of m + 2 members. Second indices must be decremented by one
accordingly.
• If more than λ · (m + 2) elements are missing in the partial path (1, 1), …, (1, m +
2) then this part of the lattice will be cropped. The result is a lattice with n + 1
members instead of n + 2 members. First indices must be decremented by one
accordingly.
• If more than λ · (n + 2) elements are missing in the partial path (1, m + 2), …,
(n + 2, m + 2) then this part of the lattice will be cropped. The result is a lattice
with m + 1 members instead of m + 2 members.
• If more than λ · (m + 2) elements are missing in the partial path (n + 2, 1), …,
(n + 2, m + 2) then this part of the lattice will be cropped. The result is a lattice
with n + 1 members instead of n + 2 members.
The crop actions are delayed until all four conditions are known. If all four conditions
are given, then the lattice cannot be augmented and will be marked as maximal
meaningful lattice in the sense of Desolneux [7]. This terminates the search. Else,
if one or more of the conditions permit augmentation, the search for the maximal
lattice continues.
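To make the procedure concrete, here is a simplified Python sketch of one augmentation step. It assumes zero-based indices, a crude proximity threshold derived from the generator lengths, and a plain dictionary as lattice representation; all names and simplifications are ours and not a literal transcription of the implementation used for the experiments.

import numpy as np

def augment_once(grid, v, w, candidates, lam=0.3, max_dist=None):
    """One greedy augmentation step around an n-by-m lattice.

    grid:       dict mapping (k, j) -> accepted 2-D part location (zero-based)
    v, w:       current generator estimates
    candidates: array of shape (N, 2) with remaining primitive locations
    Returns the possibly grown, re-indexed grid and True if it is maximal.
    """
    v, w = np.asarray(v, float), np.asarray(w, float)
    if len(candidates) == 0:
        return grid, True
    if max_dist is None:
        max_dist = 0.25 * min(np.linalg.norm(v), np.linalg.norm(w))
    n = max(k for k, _ in grid) + 1
    m = max(j for _, j in grid) + 1
    (k0, j0), p0 = next(iter(grid.items()))
    origin = np.asarray(p0, float) - k0 * v - j0 * w       # set location of (0, 0)

    def predicted(k, j):                                   # extrapolated set location
        return origin + k * v + j * w

    sides = {"left":   [(-1, j) for j in range(-1, m + 1)],
             "right":  [(n, j)  for j in range(-1, m + 1)],
             "bottom": [(k, -1) for k in range(n)],
             "top":    [(k, m)  for k in range(n)]}        # 2(n+1)+2(m+1) cells in total
    accepted, maximal = {}, True
    for cells in sides.values():
        hits = {}
        for (k, j) in cells:
            d = np.linalg.norm(candidates - predicted(k, j), axis=1)
            if d.min() < max_dist:                         # only the closest partner counts
                hits[(k, j)] = candidates[d.argmin()]
        if len(cells) - len(hits) <= lam * len(cells):     # tolerated illusion rate
            accepted.update(hits)
            maximal = False                                # this side is not cropped
    grid.update(accepted)
    kmin = min(k for k, _ in grid)
    jmin = min(j for _, j in grid)
    return {(k - kmin, j - jmin): p for (k, j), p in grid.items()}, maximal

The step would be repeated until all four sides are rejected, which corresponds to the maximal meaningful lattice described above.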
The start of the search for lattices again uses seeds. For lattices, a seed is a
configuration of four similar objects in proximity of each other and arranged roughly
in a parallelogram configuration. Basically it uses the row forming operation Σ
introduced in Chap. 3 twice.
In Chap. 7 missing members were discussed in general. Such considerations are
most important for the lattice Gestalt. The more parts have to be considered for an
aggregate the higher will be the probability that some of them will fail to appear.

10.5 An Example from SAR Scatterers

Figure 10.3 shows the application of lattice grouping to a type of pictorial data with
which many machine vision experts or human perception researchers are not familiar.
These are remotely sensed scatterers of some section of the city of Berlin. However,
regardless how these data were obtained, any human observer will instantaneously
perceive salient patterns. It is our intention to code machines so that they can have
similar recognitions. The upper picture in the figure shows the primitives, the middle
picture shows the lattice seeds, from which the search procedure outlined above
in Sect. 10.4 starts, and the resulting Gestalten are depicted in the lower picture. To
some extent, the behavior is similar to human perception, and there also remain some
differences. Maybe the illusion parameter is set a bit too liberally. This is a first trial on
these data with a certain default parameter setting. One may now, either by hand or
by some automatic means, adjust the parameter values with the goal of consistency
representative group of (non-expert) observers marking what they perceive as salient.
Skewed lattices as shown in the lower part of Fig. 10.3 are perceived by humans
as tilted rectangular or even square lattices. There is a strong impression of depth in
such pictures. Many may refer to this effect as an illusion. Pizlo [8] insists that in a
Fig. 10.3 Finding salient lattice Gestalten in SAR data; upper: primitives, middle: seeds, lower: lattices
natural environment, under non-degenerate view, this will not be an illusion, but a
valuable, and most often true, 3D perception by use of symmetry as prior. Leyton [9]
also sees no illusion in this effect. Instead according to his theory perception works
as inference. The observer infers from the presence of asymmetry that there has been
a process in the past that tilted the symmetric square raster out of the viewing axis.
In the case of the SAR data at hand, this perception of depth is deceiving. We should
be aware that in SAR images the two axis directions of the image have a different
meaning: One direction, in Fig. 10.3 the horizontal direction, corresponds to signal
travel time, i.e., distance to the antenna. The other direction is given by the synthetic
aperture. The details are explained in the corresponding literature, e.g., [10]. In
Chap. 12 we return to this point. Knowledge about the SAR process may be utilized
in the Gestalt grouping, and this will improve the grouping, so that the true Gestalten
will be preferred, while false illusions are avoided.

10.6 Projective Distortion

A substantial fraction of the example lattices given with the 2013 and 2017 com-
petition data of the Penn State [2, 11] shows strong projective foreshortening. For
remotely sensed imagery, such as aerial images or SAR data of urban terrain, this
would be exceptional. However, ground-based urban views are often subject to pro-
jective distortions. The look-from location is most often restricted by human body
height, and multistory buildings are much taller. Thus, often the camera is pointed
upward. It is therefore justified to include many such pictures in data sets meant to
be representative for what is called “images in the wild”.
On such imagery lattice search as outlined above in Sect. 10.4 is doomed to fail
because the geometric model intrinsic to operation # is not valid. An easy way to
circumvent this problem is using automatic perspective correction of converging
lines. There are established tools for this, frequently used by amateurs and profes-
sionals doing architectural photography, such as ShiftN [12]. Because this problem
was regarded as largely solved by the scientific community, important other facade
data sets, such as the eTRIMS [3], were given with the projective correction and
re-sampling already done.

References

1. Mitra NJ, Pauly M, Wand M, Ceylan D (2013) Symmetry in 3D geometry: extraction and
applications. Comput Graph Forum 32(6):1–23
2. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from real-world images competition 2013: summary and results. In: CVPR 2013, workshops
3. Korč F, Förstner W (2009) eTRIMS image database for interpreting images of man-made
scenes. Technical report TR-IGG-P-2009-01, Department of Photogrammetry, University of
Bonn. http://www.ipb.uni-bonn.de/projects/etrims_db/. Accessed Aug 2018
144 10 Lattice Gestalten

4. Wenzel S (2016) High-level facade image interpretation using marked point processes. PhD
thesis, Department of Photogrammetry, University of Bonn
5. Tyleček R (2016) Probabilistic models for symmetric object detection in images. PhD thesis,
Czech Technical University in Prague
6. Schack L, Soergel U (2014) Exploiting regular patterns to group persistent scatterers in urban
areas. IEEE-JSTARS 7(1):4177–4183
7. Desolneux A, Moisan L, Morel J-M (2008) From gestalt theory to image analysis: a probabilistic
approach. Springer
8. Pizlo Z, Li Y, Sawada T, Steinman RM (2014) Making a machine that sees like us. Oxford
University Press
9. Leyton M (2014) Symmetry, causality, mind. MIT Press, Cambridge, MA
10. Sörgel U (ed) (2010) Radar remote sensing of urban areas. Springer
11. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) ICCV
challenge: detecting symmetry in the wild. In: ICCV 2017, workshops
12. Hebel M (2018) ShiftN – automatic correction of converging lines. http://www.shiftn.de/.
Accessed Aug 2018
Chapter 11
Primitive Extraction

In the symmetry recognition or Gestalt grouping community, there is an ongoing
dispute whether to use a set of certain primitive objects extracted from the image
like in [1], or to fill certain accumulators directly from the raw pixel colors like in
Hough transform methods, or [2, 3]. The latter usually results in nested enumeration
loops, and may thus cause high computational efforts, while being conceptually fairly
simple. The former will suffer from loss of information during primitive extraction.
simple. The former will suffer from loss of information during primitive extraction.
Generally, the best choice of the primitive extraction method depends on the type
of image to be processed. Moreover, the task for which the image analysis is per-
formed should be considered. In the following sections we give examples for primi-
tive extraction methods. All of these yield the compulsory Gestalt features location,
orientation, scale, periodicity, and assessment, which are needed for the grouping
operations. Most of them also give additional features such as colors. All primitive
extraction methods that we know of use parameters, and the proper adjustment of
these for optimal recognition performance, or an acceptable compromise between
recognition performance and computational efforts is a topic on its own (for each
method). It should be treated by use of statistical models or with similar machine
learning considerations like in Chap. 13.
The extracted objects vary strongly with the method used. For example, some
of the methods set the location of a primitive in the center of an image segment
of comparably constant color. Such primitives may well correspond to objects in
the depicted scene. Other methods avoid such locations because there is no gradient
information there, no direction, and no energy. These methods would prefer locations
on contours, corners, or isolated points. Then correspondence between a scene object
and a primitive is unlikely; instead, such correspondence would be more likely
between aggregated Gestalten constructed from such primitives.

11.1 Threshold Segmentation

The threshold operation is the simplest possible image processing operation trans-
forming an intensity image into a binary image. Figure 11.1 shows such an image
obtained from the example picture given in the introduction. Following this, one
may obtain a set of primitive objects by forming the connected bright components
using a 4- or 8-neighborhood. Then a primitive Gestalt can be constructed from each
connected component p by assigning the mean pixel location as position feature x p ,
the root of the number of pixels in it as scale feature s p , the eigenvector direction
corresponding to the larger eigenvalue of the second moment as orientation feature
φ p , and 2 as periodicity, respectively. A natural choice for assessment feature a p is
the brightness above threshold. The mean intensity î_p of all pixels in the segment will
be at least as high as the threshold. If this mean intensity were just equal to the thresh-
old, the segmentation would be very unstable. Then a_p = 0 is a consistent choice.
Else, if the segment is maximally bright, in the case of a byte-image î_p = 255, the
segmentation would be most stable. Then a_p = 1 is a consistent choice. In between,
the simplest choice is a_p = (î_p − τ)/(255 − τ), where τ is the threshold.
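A minimal sketch of this construction, assuming an 8-bit intensity image and using SciPy's connected-component labeling, may look as follows; the function name and dictionary representation are ours, while the feature choices mirror the text.

import numpy as np
from scipy import ndimage

def primitives_from_threshold(image, tau=128):
    """Primitive Gestalten from the bright 4-connected components of a
    thresholded intensity image."""
    labels, count = ndimage.label(image >= tau)            # 4-neighborhood by default
    primitives = []
    for lab in range(1, count + 1):
        ys, xs = np.nonzero(labels == lab)
        if len(xs) < 2:                                    # skip degenerate segments
            continue
        pts = np.column_stack([xs, ys]).astype(float)
        x_p = pts.mean(axis=0)                             # location: mean pixel position
        s_p = np.sqrt(len(pts))                            # scale: root of the pixel count
        eigval, eigvec = np.linalg.eigh(np.cov(pts.T))     # second moments
        phi_p = np.arctan2(eigvec[1, -1], eigvec[0, -1])   # direction of the major axis
        i_hat = image[ys, xs].mean()
        a_p = (i_hat - tau) / (255.0 - tau)                # assessment from mean brightness
        primitives.append({"x": x_p, "s": s_p, "phi": phi_p, "f": 2, "a": a_p})
    return primitives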
The resulting set of primitives is displayed in Fig. 11.2. Given the very simple
nature of this method, the result is remarkably good for this example. Visually, the
main content remains untouched in the binary image of Fig. 11.1. The situation is different
in Fig. 11.2: with the standard Gestalt domain features only, much visual information is lost.

Fig. 11.1 Example of a binary image using threshold 128 on the group picture Fig. 1.1
Fig. 11.2 Primitive Gestalten extracted from the binary picture Fig. 11.1

The threshold was arbitrarily chosen as τ = 128, which is just the mean between
maximal and minimal intensity. In this particular example, changing the threshold in
either direction does not alter the result substantially. In other images with less con-
trast, the choice of the best threshold is non-trivial. Automatic methods for it include
histogram analysis. Local minima in the intensity histogram are good candidates for
segmentation thresholds.
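Such a histogram analysis can be sketched in a few lines; the smoothing width used below is an assumed parameter, not a recommendation from the literature.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def candidate_thresholds(image, sigma=3.0):
    """Return the local minima of the smoothed intensity histogram as
    candidate segmentation thresholds."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    smooth = gaussian_filter1d(hist.astype(float), sigma)
    # interior bins that are lower than both neighbors
    return [i for i in range(1, 255)
            if smooth[i] < smooth[i - 1] and smooth[i] < smooth[i + 1]]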
There may well be multiple thresholds, but then overlapping primitives will result,
which may even be at the same positions. This is very bad for the combinatorial
growth in the number of possible aggregated Gestalten, when hierarchies of sym-
metries are analyzed. In the case of multiple thresholds, such cases should be
avoided by deciding for the best-assessed Gestalt and excluding competing nearby
alternatives. Moreover, the objects of interest may as well be dark instead of bright.
Therefore, in the absence of better knowledge, inverting the threshold comparison
operation should also be considered. The maximally stable extremal region extraction
method discussed in more detail below in Sect. 11.3 is actually a variant of the more
sophisticated threshold segmentation methods.
The loss of information between Figs. 11.1 and 11.2 is serious. Some grouping
in accordance with human perception might still be possible, such as the formation
of dominant horizontal rows. However, very many decisive details are now lost. One
possible way of mitigation is the use of additional features. Some of these are already
provided by the process, e.g., the mean intensity î_p. Others can be obtained with little
extra effort, e.g., the eccentricity can be easily reckoned from the second moments,
which were calculated anyway for the orientation. These features can then be utilized
for similarity assessment in the grouping operations. If the intensities were obtained
from a color variant of the image, one may also determine a mean color for each
primitive, from which similarity assessment can benefit.
An even higher dimensional feature domain can be used by analyzing the distance
of the contour of the segment around the location, as is done for the maximally stable
extremal region descriptors discussed below in Sect. 11.3. Even a resampled patch
of the image around the location in the given scale and orientation can be used as
high-dimensional feature (as was already indicated in Sect. 3.4). Using resampled
patches minimizes the information loss to a degree similar to the loss due to the same
resampling in symmetry recognition methods that use no primitives at all and work
directly on given image data.
Setting the periodicity feature to two is not much more than a first guess, assuming
that the primitive object has roughly the shape of an ellipse with the scale, orientation,
and eccentricity provided by the second moments. It is more reasonable to test the
object for rotational self-similarity. Such a test can be done at the segment level by rotating
it around the object’s location and counting pixels that find correspondence versus
those that do not find a corresponding partner. It can also be done on the intensities
or colors using cross-correlation.

11.2 Super-Pixel Segmentation

Segmentation of images into connected regions (subsets of pixels) has always been
in the focus of image processing. The union of all resulting segments must be the
whole image, and the intersection between two different segments must be empty.
There are two types of failure for such segmentation methods: Over-segmentation
produces more than one segment on the intended object. In other words, the resulting
segments are too small; pixels that should be united in one segment are spread over
different segments. And under-segmentation produces segments that cover more than
one of the intended objects, or includes background pixels into the same segment
with object pixels. We use the fast super-pixel implementation given by Achanta
et al. [4], which is called simple linear iterative clustering (SLIC).
The super-pixel method is a seeded region growing segmentation of the image
starting from a hexagonal grid. One of the parameters of the method is the number
of such seeds. The seeds may accidentally be located on edges or noise locations.
Therefore, in the vicinity of the seeds a minimum of the intensity gradient magni-
tude is searched, and the seed is shifted to that location. The key of the method is
a 5D distance measure. Three dimensions are color, where SLIC uses a perceptu-
ally uniform color space called CIELAB. At least small color distances should be
consistent with human seeing in this space. The other two dimensions are the compo-
nents of the vector connecting the centers of the segments in the image. The weights
for these two partial distances depend on the number of super-pixels and the image
size. Successively the best fitting pixels are added to the super-pixels with respect
to the mentioned 5D distance. With each step the super-pixel centers may again be
relocated to the new mean.
In the end, there is a list of pixels belonging to each super-pixel object, a center
location, and a color for each object. From these the Gestalt domain features outlined
in Chap. 1 are obtained. In particular, the number of pixels will deliver scale (which
cannot vary very much), and the second moments yield the orientation. Eccentricity
comes naturally as additional feature. These are displayed for the example group
picture in Fig. 11.3 (only the intensities in this book, but originally SLIC is intended
for color images). The visual information loss in SLIC is obviously fairly low, given
the very large data compression level. SLIC-primitives were used in [5]. It is our
impression that they perform superior to the scale-invariant feature transform prim-
itives used in the earlier publications and described below in Sect. 11.4. The colors
and the eccentricities should be used for additional similarity assessment and propa-
gated through the hierarchy. This kind of primitive extraction was used in Sects. 2.7
and 2.8.
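With the publicly available scikit-image implementation of SLIC, the mapping to the Gestalt domain may be sketched as follows; the neighbor-contrast assessment discussed next is omitted here, and the chosen number of segments is an arbitrary assumption.

import numpy as np
from skimage.segmentation import slic
from skimage.measure import regionprops

def slic_primitives(rgb_image, n_segments=800):
    """Super-pixel primitives from SLIC with moment-based Gestalt features."""
    labels = slic(rgb_image, n_segments=n_segments, start_label=1)
    primitives = []
    for r in regionprops(labels):
        cy, cx = r.centroid
        primitives.append({
            "x": np.array([cx, cy]),                    # location
            "s": np.sqrt(r.area),                       # scale from the pixel count
            "phi": r.orientation,                       # major-axis angle (skimage convention)
            "ecc": r.eccentricity,                      # extra feature for similarity
            "color": rgb_image[labels == r.label].mean(axis=0)})
    return primitives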
SLIC super-pixels come without a quality measure. However, they are obviously
meaningless in homogeneous regions. In such image areas, SLIC will just reproduce
the hexagonal tiling resulting from the regular seeding. It was therefore reasonable
to assign assessment 0 to super-pixels that have no color difference to their six
neighbors. Accordingly, assessment 1 was assigned to those that feature maximal

Fig. 11.3 Elliptic representation of super-pixels extracted from the group picture Fig. 1.1
150 11 Primitive Extraction

Fig. 11.4 Super-pixels primitive Gestalten extracted from the group picture Fig. 1.1

color difference to their neighbors, and correspondingly with medium contrasts.


Figure 11.4 displays the SLIC-primitives obtained from the example image using
the standard Gestalt domain display format, i.e., with gray-tone coding assessment.

11.3 Maximally Stable Extremal Regions

Standard reference for the maximally stable extremal regions is [6] of Matas et al. It
belongs to the class of multilevel threshold segmentation methods discussed above
in Sect. 11.1. However, it is more sophisticated than just using simple binarization
thresholds. The idea was born from the difficulties in wide-baseline stereo correspon-
dence. Invariance was seen as a key property: invariance w.r.t. direction of view, dis-
tance, and also w.r.t. variations of lighting. There is a chain of definitions on the
pixel lattice filled with ordered intensity values:
• A region is a connected subset of the pixels.
• In an extremal region all pixels are brighter than all the pixels on the outer margin,
i.e., pixels directly connected to it, but not belonging to it.
• Such region is a maximally stable extremal region (MSER) if the change in size
with varying threshold is minimal.
Of course, one may also be interested in the dark regions. Then “brighter” is just
replaced by “darker”.
Most Gestalt domain features are straightforwardly given by the MSER method,
such as location, scale, and orientation. MSER often provides very different scales
from the same image with the same parameter setting. The natural choice for the
assessment feature is derived from the stability value, i.e., the growth of the region
for declining intensity threshold. If this value is just better than the corresponding
threshold parameter, the primitive will be assessed as 0. If a region is not growing at
all at that level, it will be assessed as optimal, i.e., 1. In between, a linear interpolation
is used. Unfortunately, the standard MSER implementations do not yield this
stability feature. It must be added by augmenting the code accordingly. As standard
feature, MSER also provides an eccentricity, and the drawing routine coming with
the implementation displays ellipses. Therefore, the default setting for the periodicity
feature is 2. However, there is also the pixel list with the object, so that a rotation match
can be easily added testing for rotational self-similarity. If that fails the periodicity
will be set to 1.
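With OpenCV's MSER detector, the mapping to the Gestalt domain can be sketched as below. As noted above, the stability value is not exposed by the standard interface, so no assessment is assigned in this sketch; the function name is our own.

import cv2
import numpy as np

def mser_primitives(gray):
    """MSER regions from an 8-bit gray image mapped to Gestalt features."""
    regions, _ = cv2.MSER_create().detectRegions(gray)
    primitives = []
    for pts in regions:                                 # pts: pixel list of one region
        pts = pts.astype(float)
        eigval, eigvec = np.linalg.eigh(np.cov(pts.T))  # second moments
        primitives.append({
            "x": pts.mean(axis=0),                      # location
            "s": np.sqrt(len(pts)),                     # scale
            "phi": np.arctan2(eigvec[1, -1], eigvec[0, -1]),
            "f": 2})                                    # default periodicity
    return primitives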
Figure 11.5 shows MSER-primitives extracted from the example group picture
given in Fig. 1.1. In order to understand the correspondence to the scene content, the
original picture is depicted in a brighter tone as background. Some of the primitives
correspond to faces, but most faces are missed. The bright football socks in the lower
part of the image lead to almost perfect primitives with stable orientation and quite
stable eccentricity and scale, so that a row of reflection symmetric pairs of socks will
gain a good assessment. The white left sides of the shirts lead to large primitives with
stable orientation. A row of these will probably also gain good assessment. However,
there are several cases of under-segmentation. Different parameter settings may well

Fig. 11.5 Primitive Gestalten extracted from picture Fig. 11.1 using MSER segmentation
lead to better results. We did not attempt to tweak everything to optimal performance.
Instead we used default parameter settings, so that the faults that are likely to occur
when such a method is applied to previously unseen imagery can be discussed.
MSER was intended for correspondence construction. In order to improve the
robustness of such correspondence it is combined with a contour descriptor. This is
a circular function giving the distance of the contour from the reference location.
Such contour descriptor can be normalized in its length, i.e., its dimension, so that
scale invariance is achieved. There are also matching procedures that yield rota-
tional invariance. Even invariance with respect to affine distortions is possible. Such
descriptors can of course also be used for Gestalt similarity assessment.

11.4 Scale-Invariant Feature Transform

The scale-invariant feature transform (SIFT ) has been proposed by Lowe [7]. Empha-
sis is on scale space, i.e., on image pyramid construction. From an input image (in
the most commonly used version only intensities, no colors) a set of keypoints is
produced that may be attributed with descriptors. The procedure consists of the
following steps:
• Scale-space extrema detection: The image is filtered using several octaves of
Gaussians. Per octave, a fixed number of steps is selected (e.g., two). Then the
difference between adjacent versions is calculated. These images are known as
difference of Gaussians (DoG). SIFT selects local extrema both within the DoG images
and across the stack of scales.
• Keypoint selection: Among the extrema, only those are kept which are stable
with respect to their location. Two criteria must be fulfilled: (1) The contrast in the
DoG stack must be significant. There is a threshold parameter for this. Thus points
in almost homogeneous regions are discarded; (2) in order to exclude locations
along edges, which may be stable with respect to directions across the edge but
will not be stable with respect to directions along the edge, the principal curvature
is calculated, i.e., the eigenvalues of Hessian of the DoG. A threshold parameter
on the ratio of the eigenvalues is used to exclude weakly located elements.
• Assignment of orientations: The orientation is given by the intensity gradient
direction at the given location and scale.
• Calculating the descriptor: For a patch around the found location corresponding
in size and orientation of the keypoint, a description of the local image content is
stored. The idea here was robustness with respect to illumination changes. Lowe
decided to use rough small intensity gradient histograms. In the standard imple-
mentations, the patch is tiled in sixteen subpatches, and in these eight gradient
direction bins are accumulated. The result is a descriptor containing 128 bytes.
Thus SIFT keypoints naturally have the very same features that are required for
the Gestalt domain outlined in Chap. 1, namely location, scale, and orientation. As
periodicity we may set 1, since self-similarity with respect to rotation is unlikely.
Fig. 11.6 SIFT-primitives extracted from the group picture Fig. 1.1

There are several threshold decisions in the SIFT keypoint extractor—namely the
minimal contrast and the minimal curvature in step two. These can be used to assign an
assessment feature to each SIFT keypoint. Elements that were just above threshold
get assessments close to 0, and those with maximal value over the threshold get
assessment 1. The descriptor vector that comes with each SIFT keypoint can be
used as additional feature. Loy and Eklundh gave the reordering for the descriptor if
used for matching under reflection symmetry [1]. In [8] a gain in performance was
found when using such descriptor matching as additional similarity assessment.
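With the OpenCV implementation, the mapping to the Gestalt domain can be sketched as follows; normalizing the keypoint response to obtain the assessment is our own choice for this sketch, not Lowe's definition.

import cv2
import numpy as np

def sift_primitives(gray):
    """OpenCV SIFT keypoints mapped to the Gestalt domain."""
    keypoints, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)
    if not keypoints:
        return []
    max_response = max(kp.response for kp in keypoints)
    primitives = []
    for kp, desc in zip(keypoints, descriptors):
        primitives.append({
            "x": np.array(kp.pt),                   # location
            "s": kp.size,                           # scale
            "phi": np.deg2rad(kp.angle),            # orientation (degrees -> radians)
            "f": 1,                                 # periodicity, as argued above
            "a": kp.response / max_response,        # assessment in [0, 1]
            "desc": desc})                          # 128-d descriptor as extra feature
    return primitives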
Figure 11.6 shows primitive Gestalten extracted from the example image in
Fig. 1.1 by the use of SIFT keypoint extraction. This method produces a large variety
of scales. Some may be more than a factor of ten larger than others. Almost never
will a SIFT-primitive correspond to an object of interest. Instead, they tend to be
located on the contours, and in particular on corners of scene objects. Thus symmet-
ric objects—or rather those that appear symmetric in the image—can be found as
hierarchical Gestalt aggregates of SIFT-primitives. Figure 11.6 gives the primitives
resulting from the example group picture using this method displayed in the Gestalt
conventions. The information loss seems to be substantial at first glance. However, in
particular if the descriptors are used for additional similarity assessment the method
can have considerable symmetry recognition performance.
11.5 Multimodal Primitives

Originally for use in cognitive vision systems, the multimodal primitive extraction
method was elaborated at the beginning of the century by Krueger [9]. It is based
on the theory of the monogenic signal by Felsberg and Sommer [10] as well as on
psychological and neuro-physiological evidence. There are no pixels in the mono-
genic signal; it is an analytic 2D signal, derived from irrotational and solenoidal
vector fields using the Riesz transform. Symmetry, energy, and orthogonality are
preserved, and it has an allpass transfer function.
The term “multimodal” was introduced by N. Krueger because preservation of
visual information across several modalities was emphasized. These include energy
in certain frequency bands, phase, orientation, color, and optical flow. Apparently, the
scale was fixed in this extraction method, so that if used to produce Gestalten in the
domain given in Sect. 1.3 the scale feature should be set to an appropriate fixed value.
Location and orientation come naturally with the modalities. The energy can be used
as assessment demanding a certain minimal amount, and fixing the assessment 1
for the energy yielded by maximal contrasts. The phase modality distinguishes edge
patches from line segment patches. It should be used as additional feature. Most
multimodal primitives have two colors, one on each side of the edge, or one inside
the line and one outside. These should also be used as additional features. Some
special primitives are located on corners or junctions, and in textured regions. Most
often, a multimodal primitive will have no rotational self-similarity. Thus as default
periodicity we set 1. Line-like primitives will better be featured with periodicity 2.
Higher periodicities will be very rare.
Multimodal primitives of considerable energy will never be located in homoge-
neous image regions. Thus such primitives often do not correspond to objects. How-
ever, the aggregated hierarchical Gestalten derived from them may well correspond
to objects.
Unfortunately, no publicly available software tool for this method survived. This
is a pity. Our impression was that this method had the least loss of information in view
of the vast reduction of data. A mega-pixel color image is reduced to a few hundred
primitives with a handful of features each. When displaying these on a screen, often
the impression of an artist’s view on the same scene is given. The relevant content
is mostly preserved, but the style is quite clear and abstract now. The publications
contain very beautiful and striking examples.

11.6 Segmentation by Unsupervised Machine Learning

One image alone constitutes a set of often millions of measurements, and often these
are colors in the 3D red–green–blue domain. Thus, even one image can be a suitable
database for training an unsupervised learning machine. Such machines have been known
for decades. One of the most striking examples are the topological maps or self-
organizing maps by Kohonen [11]. The machine will learn a map of colors, as they
occur in the data. Preferably, the topology of a torus is chosen.

11.6.1 Learning Characteristic Colors from a Standard Three Bytes Per Pixel Image

For our group picture example, and also for the spectra in [12], we chose a torus
of 40 × 40 elements (or “neurons”). For each input, the element is chosen whose stored
color yields the highest inner product, i.e., response, with the input color.
This element learns the presented color, and its neighboring elements will learn the
same color with less weight.
Self-organizing map neurons should be normalized to unit length after each learn-
ing step. Otherwise, a few neurons may always win, and some never. Therefore, the 3D
color (in RGB) needs to be transformed to a sphere surface in 4D. This is achieved
by introducing a fourth value, the darkness. It will be one if all colors are zero, and
zero if all colors are maximal. Then the 4D-vector is normalized. Actually, this uses
only one octant of the sphere, but this does not matter. Figure 11.7 shows the four
components of a map learned from the example group picture.
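A didactic sketch of this training, with torus topology and the 4D lift of the colors, is given below; the learning rate, neighborhood radius, and per-epoch sample size are arbitrary choices for illustration, not the settings used for Fig. 11.7.

import numpy as np

def train_color_som(rgb_pixels, size=40, epochs=2, lr=0.1, radius=2.0, rng=None):
    """Minimal self-organizing map on a size-by-size torus learning the 4-D
    color vectors (R, G, B, darkness) described above."""
    if rng is None:
        rng = np.random.default_rng(0)
    rgb = rgb_pixels.reshape(-1, 3) / 255.0
    dark = 1.0 - rgb.max(axis=1, keepdims=True)           # 1 for black, 0 for white
    data = np.hstack([rgb, dark])
    data /= np.linalg.norm(data, axis=1, keepdims=True)   # lift to the unit sphere in 4D

    som = rng.random((size, size, 4))
    som /= np.linalg.norm(som, axis=2, keepdims=True)
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    for _ in range(epochs):
        idx = rng.choice(len(data), size=min(20000, len(data)), replace=False)
        for x in data[idx]:
            resp = (som * x).sum(axis=2)                   # inner products
            wi, wj = np.unravel_index(resp.argmax(), resp.shape)
            di = np.minimum(np.abs(ii - wi), size - np.abs(ii - wi))   # torus distance
            dj = np.minimum(np.abs(jj - wj), size - np.abs(jj - wj))
            h = np.exp(-(di ** 2 + dj ** 2) / (2 * radius ** 2))
            som += lr * h[..., None] * (x - som)           # neighbors learn with less weight
            som /= np.linalg.norm(som, axis=2, keepdims=True)
    return som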
Certain regions on the map stand for colors that repetitively occur in the picture.
There is, e.g., a bright sharp spot in the blue channel. It is interesting to see what
image parts activate this region. The upper part of Fig. 11.8 shows the activation map
for this small region. Obviously, the self-organizing map adapted here to the color
of the sponsor logo shown on the shirts. The lower part is from a different region,
which adapted to skin colors.
In the end, some of the activation maps will yield segments which are obviously
a very good input to hierarchical grouping, and of equal importance to any analysis
of the image content. Other activation maps, from other regions, e.g., here the green
colors of the background, will yield clutter, where hierarchical grouping will fail to
produce well assessed hierarchies. Probably those regions are background.
Of course the primitive extraction task is not accomplished yet with these binary
images. Further steps, such as morphological filtering, forming connected compo-
nents of a certain minimal size, reckoning first and second moments, and assessing
the resulting objects must follow. These steps have been explained in Sect. 11.1. The
same additional features for similarity assessment can be utilized here.

Fig. 11.7 4D Self-organizing map as learned from the example group picture
Fig. 11.8 Examples of activation maps of the self-organizing map obtained from the example
image: upper—pixels that activate a small blue region on the SOM, mostly on the logo on the shirts;
lower—pixels from a skin-tone region, mostly on faces

11.6.2 Learning Characteristic Spectra from a Hyper-Spectral Image

Learning characteristic colors will lead to better results if the occurring colors are
more specific. This is often true if the color space has higher dimension. Today high-
dimensional-color-space images are called hyper-spectral images. A pixel contains
a sampled spectrum. It must not necessarily cover the visual domain. In 2014 IEEE
and Telops Inc. released a benchmark for data fusion [13]. Telops produces devices
Fig. 11.9 Three thermal spectra: left—full 84d-spectral vector, right—only 20 bands

that can measure spectra—in this case thermal spectra—looking down from an air-
craft. Each pixel on a line perpendicular to the flight path contains a vector of about
80 intensity values, each representing radiation in a very narrow thermal frequency
band. While the aircraft moves, a data cube is assembled: One direction is perspec-
tive projection perpendicular to the flight path, the second direction is orthogonal
projection along the flight path, and the third dimension is the wavelength. The data
set at hand is taken from a suburban region in Canada. In a pre-processing step, the
data are resampled in geo-coordinates, so that now the vertical direction points North
and the horizontal direction East. The pixels have approximately one meter ground
sampling distance in both directions. For the fusion purpose, the data came combined
with aerial color images taken in the same flight and partial ground-truth on object
classes such as roads, vegetation, and buildings with different roof materials.
Figure 11.9 depicts three such spectra (in full on the left-hand side, and a section
to the right). The upper and darker line is from a pixel labeled as road, the central
one drawn in mid-gray-tone is from a building, and the coldest and lightest spec-
trum is vegetation. Thermal spectra do not differ very much from each other. They
are dominated by temperature emission. We are in a completely different world as
compared to the visual spectral domain example picture given in Sect. 11.6.1. It is
more comparable to a blacksmith observing how his workpieces are glowing in the
forge.
The self-organizing map procedure can map such highly correlated data to a space
of lower dimension emphasizing the differences and removing the correlation to a
certain extent. The map still has 40 × 40 elements with the topology of a torus, but
each element now has 84 weights to be learned, corresponding to the 84 wavelength
bands of the data. No additional darkness dimension was added, because we are not
interested in the absolute surface temperature, only in the spectra corresponding to
the materials. After training the map on the data, a standard watershed segmentation1

1 Some changes are necessary as compared to standard image segmentation, because the self-
organizing map has torus topology.
Fig. 11.10 Activation map of the second largest spectral segment on the self-organizing map

is performed on it. After that, the most dominant segment, i.e., the most frequent
spectrum type, is of the cool kind depicted in light gray in Fig. 11.9.
The second largest segment contains spectra of the type drawn in mid-gray-tone in
Fig. 11.9. By marking every pixel (i.e., square meter in the geographic map) where
the spectrum is mapped into this segment in black, an activation map is obtained
again. It is displayed in Fig. 11.10. Any non-expert human observer taking a quick
look at this image will instantaneously perceive the fish-bone pattern in which the
buildings are arranged in this suburb. People will see a hierarchical construction,
Fig. 11.11 Primitive Gestalten obtained from the IEEE-Telops spectra

a reflection symmetry with a backbone axis, and rows of similar houses in similar
spacing arranged along either side of the roads forming the bones of the pattern.
Here as well some suitable standard image processing steps are needed to obtain
a set of properly disconnected segments. Some details on that are given in [12]. Basi-
cally, morphological filtering is used to close the gaps between the often isolated
pixels, while also avoiding connections between segments that should be separated.
Then the connected components are formed, and a primitive Gestalt is constructed
from each using the first and second moments as outlined above. The assessment fea-
ture is calculated here using the ratio between the number of pixels and the geometric
mean of the eigenvalues. Thus fuzzy aggregations or those having many and large
holes are punished. Without any further information, the rotational self-similarity
periodicity is set again to 2. Only 384 primitives survive the threshold set on the
quality. These are depicted in Fig. 11.11. Note that these pictures are not organized
in pixel coordinates—here we have geo-coordinates in East and North. The aggre-
gation of hierarchical Gestalten from these primitives is discussed in Sect. 12.3.

11.7 Local Non-maxima Suppression

The combinatorial nature of our constructive approach to perceptual grouping was
outlined in Sect. 5.1 and is used all over the book. Objects are searching for partners
to form aggregated objects. This will not work if they do not find any partners,
because the laws determining what mates and what does not are too strict. It will
work perfectly if the objects usually find just one partner. If the objects often find
more than one partner it may still work for a short while. However, any sensible
amount of resources in memory and computation will very soon be overloaded.
Fig. 11.12 Space-borne SAR-primitives of urban terrain: Left, whole data set; right, section with
close multiple objects

The best way to overcome this problem is controlling the local density of objects.
Then the proximity law inherent in all grouping operations will prefer only an
adjustable number of expected partners. The search process becomes controllable
and feasible. There should never be more than one primitive in one location—where
location is understood in the scale of the present object. In a way, this enforces local
consistency. The standard way of assuring this is called non-maxima suppression.
Many segmentation methods and key location extractors already have built-in mech-
anisms of this kind. That is, for instance, the case in the extractors given in Sects. 11.3,
11.4, and 11.5 above. Other very simple segmentation methods, such as threshold
segmentation, can never give more than one object in one location.
However, some methods may yield sets of primitives that sometimes cluster very
densely somewhere, while leaving large spaces empty. As an example, Fig. 11.12
shows a set of permanent scatterers obtained by synthetic aperture RADAR (SAR)
satellites from an urban site in Berlin.
For the details of SAR-image processing in general and permanent scatterers in
particular, we refer to the corresponding handbook of Uwe Soergel [14]. The close-
up on the right side of the figure clearly shows the problem: Sometimes four or five
scatterer objects are located in very tight proximity. In such situation a successive re-
assessing can help. Let S be a set of objects. Then the following steps are performed:
1. Initialize T empty.
2. Pick the best assessed element s from S and add it to T .
3. Re-assess all elements t ∈ S with respect to their distance d from s . Use, e.g.,
1 − exp (d[t , s ]) as factor. This will yield 0 assessment for s now and bad assess-
ments for close neighbors. It will not touch the assessment of distant objects.
4. Continue with step 2 until assessments become worse than a threshold.
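A direct transcription of these steps into Python might look as follows; the distance scale inside the exponential and the stopping threshold are assumptions of this sketch.

import numpy as np

def non_maxima_suppression(objects, a_min=0.1, scale=1.0):
    """Successive re-assessment following steps 1-4 above.  Each object is a
    dict with a location 'x' and an assessment 'a'; assessments are modified
    on working copies only, so the original data stay untouched."""
    S = [dict(o) for o in objects]                       # step 1 works on copies
    T = []
    while True:
        s_best = max(S, key=lambda o: o["a"])            # step 2: best assessed element
        if s_best["a"] < a_min:                          # step 4: stop at the threshold
            break
        T.append(dict(s_best))
        for t in S:                                      # step 3: re-assess all elements
            d = np.linalg.norm(np.asarray(t["x"]) - np.asarray(s_best["x"]))
            t["a"] *= 1.0 - np.exp(-d / scale)           # 0 for s_best, small for close neighbors
    return T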
In the end, T will contain a sparse set of objects to start with. The set of scatterers
given above in Fig. 11.12 was modified in this way, and then used as input for
the lattice grouping in Sect. 10.5. Recall, however, that this method may well be
misleading. It may suppress the correct objects. A top-down analysis may well later
revise the re-assessments in favor of alternative choices. Therefore, the original data
should not be deleted or overwritten.

References

1. Loy G, Eklundh J (2006) Detecting symmetry and symmetric constellations of features. In:
European conference on computer vision (ECCV), pp 508–521
2. Pătrăucean V, von Gioi RG, Ovsjanikov M (2013) Detection of mirror-symmetric image
patches. In: 2013 IEEE conference on computer vision and pattern recognition workshops,
pp 211–216
3. Kondra S, Petrosino A, Iodice S (2013) Multi-scale kernel operations for reflection and rotation
symmetry: further achievements. In: CVPR 2013 competition on symmetry detection
4. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S (2012) SLIC superpixels compared
to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 34(11):2274–2281
5. Michaelsen E, Arens M (2017) Hierarchical grouping using gestalt assessments. In: CVPR
2017, workshops, detecting symmetry in the wild
6. Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally
stable extremal regions. In: British machine vision conference BMVC 2002, pp 384–396
7. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the
international conference on computer vision (ICCV ’99), pp 1150–1157
8. Michaelsen E (2014) Gestalt algebra—a proposal for the formalization of gestalt perception
and rendering. Symmetry 6(3):566–577
9. Krüger N, Lappe M, Wörgötter F (2004) Biologically motivated multi-modal processing of
visual primitives. Interdisc J Artif Intell Simul Behav 1(5):417–427
10. Felsberg M, Sommer G (2001) The monogenic signal. IEEE Trans. Signal Process
49(12):3136–3144
11. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern
43(1):59–69
12. Michaelsen E (2016) Self-organizing maps and gestalt organization as components of an
advanced system for remotely sensed data: an example with thermal hyper-spectra. Pattern
Recogn Lett 83(2):169–177
13. 2018 IEEE GRSS data fusion contest (2014). http://www.grss-ieee.org/community/technical-
committees/data-fusion/2014-ieee-grss-data-fusion-contest
14. Sörgel U (ed) (1990) Radar remote sensing of urban areas. Springer
Chapter 12
Knowledge and Gestalt Interaction

Knowledge-based machine vision used to be an auspicious topic some decades ago.


The term knowledge in this community refers to machine-interpretable data, such
as ontologies, semantic networks, systems of production rules, or expert systems.
The idea is that the machine can do the reasoning along the laws of logic, assigning
meaning to objects segmented from the image by appropriate image processing tools.
Thus knowledge-based machine vision was a sub-topic of artificial intelligence, as it
was understood in those days. We briefly give an introduction to this field and show
how it is interrelated to perceptual grouping, as well as how these two approaches
may collaborate.

12.1 Visual Inference

Knowledge utilization on given pictorial data takes the form of logical inference.
Let us have a look at a typical example from aerial image analysis: Encyclopedic
knowledge for the term runway reads like:
• A runway is a
defined rectangular area on a land aerodrome prepared for the landing and takeoff of
aircraft. ... Runways may be a man-made surface (often asphalt, concrete, or a mixture of
both) or a natural surface (grass, dirt, gravel, ice, or salt).

Obtained from English Wikipedia Sep. 1, 2018.


The same entry bounds dimensions between 245 m × 8 m and 5500 m × 80 m. This
can be coded in a rule for processing of remotely sensed images reading:

• If segment s is a runway then it is an elongated rectangle with length 245 m ≤ ls ≤
5500 m and width 8 m ≤ ws ≤ 80 m.
Written in a formal notation this may look like:

runway(s) → rectangle(s) ∧ ls ≥ 245 m ∧ ls ≤ 5500 m ∧ ws ≥ 8 m ∧ ws ≤ 80 m    (12.1)

where ls indicates the length and ws indicates the width of a rectangle. It is assumed
that rectangles have these features. The left-hand side of 12.1 is called premise. The
right-hand side—in this case a conjunctive composition of several facts—is called
the logical consequence. Note that the first line of the consequence contains a Gestalt
that may be instantiated or tested using the methods and the operations presented in
Chap. 9.
The features ls and ws that are used in the four bounding relations of 12.1
are defined for such rectangular objects only, not for any segment s. This is fairly
typical: Knowledge-based machine vision is often perceptual grouping with respect
to specific Gestalt laws with the addition of particular extra constraints obtained from
the knowledge source.
Note also, serious mistakes or inaccuracies may already be made at the formal-
ization step. The consequence of rule 12.1 includes rectangles of 80 m width and
245 m length (which really fits better to a parking lot), as well as narrow stripes of
8 m width and 5500 m length (which is probably rather a straight fairly narrow road).
When in practice such problems arise, often new parameters are introduced, such as
bounds on the ratio of length and width.
Nevertheless, formalized rules can be coded in logic programming languages such
as PROLOG. Thus, they are the basis for automatic inference. The only logically sound
deductive use of such a rule, going from the data to the object, is the negative one. It infers:
“If an image segment is not rectangular or if a rectangular segment is shorter than
245 m or if a rectangular segment is longer than 5500 m or if a rectangular segment
has a width less than 8 m or if a rectangular segment is wider than 80 m then it cannot
be a runway.”
This kind of machine reasoning is usually correct—at least if the knowledge as
well as the image segmentation was reliable—but it is not helpful. In the literature
on knowledge-based interpretation of remotely sensed data you may rather find an
inference in the other direction:
“If s is an elongated rectangular image segment with 245 m ≤ ls ≤ 5500 m and
8 m ≤ ws ≤ 80 m then s is probably a runway.”
This inference figure is called abductive. Abductive inference is unsafe and not
sound—but it may be useful. Many other things, apart from runways, may also appear
in this way in images. You never know what kind of new things appear in a new image
you have not seen yet. The most important word in this abductive inference sentence
is the word probably. This word is used in its common sense meaning here. No
probabilities are estimated. Abductive inference is heuristic. A very recommendable
short paper on the utilization of abductive inference for the analysis of images in
general, and for building extraction from aerial imagery in particular, has been given
by T. Schenk more than twenty years ago [1].
In the absence of statistics, one may use fuzzy inference for this rule. We can
build a membership function that is maximal, i.e. 1.0, for a long rectangle of 2500 m
× 50 m, and smaller both for other dimensions and for violations of parallelism
and orthogonality. If that function yields zero for violation of the constraints, then
that perfectly matches the sound deductive use of the rule. Such assessment on the
form and dimension can be fused with other such rules that prefer colors of asphalt
or concrete, or the presence of aircraft in proximity of the segment. Eventually, one
might end up with a fairly reliable machine vision system for this task.
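For illustration, such a membership could be sketched as follows; the triangular shape of the component memberships and their fusion by multiplication are assumptions of this sketch, while the numeric bounds are those of rule 12.1.

```python
def triangular(v, lo, peak, hi):
    """Triangular fuzzy membership: 0 at and outside [lo, hi], 1 at the peak."""
    if v <= lo or v >= hi:
        return 0.0
    if v <= peak:
        return (v - lo) / (peak - lo)
    return (hi - v) / (hi - peak)

def runway_membership(length_m, width_m):
    """Fuzzy degree to which an elongated rectangle could be a runway:
    zero outside the bounds of rule (12.1), maximal at 2500 m x 50 m,
    smaller toward the bounds."""
    return (triangular(length_m, 245.0, 2500.0, 5500.0) *
            triangular(width_m, 8.0, 50.0, 80.0))

# A violated hard bound reproduces the sound deductive (negative) use of the rule:
assert runway_membership(150.0, 50.0) == 0.0
```

Graded membership values such as these support the abductive reading of the rule and can be fused with further memberships, e.g., for asphalt-like color or nearby aircraft.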
Knowledge about things likely appearing in images comes in many forms. The
following short list of examples cannot claim completeness:
• Knowing which direction is North in a satellite or aerial image is of substantial help.
Human image interpreters may utilize such knowledge with or without awareness.
In such data a dark segment north of a brighter segment is often a shadow cast by
an elevated object—provided we are in the northern hemisphere and the picture
was taken around noon. In such cases a pair of segments is grouped from two parts
with different gray-tones or colors. Other Gestalt similarities such as in scale and
orientation, and, in particular, a proximity law remain valid.
• Most facade recognition proposals assume horizontal and vertical row grouping
directions as given and fixed; i.e., their search procedures do not accept oblique
generator vectors. Otherwise their row and lattice grouping follow mostly the
laws outlined in Chap. 3 and in Chap. 10. Some facade recognition proposals
would also use reflection symmetry—but only with vertical axes. If what is meant
by “horizontal” and “vertical” is fixed a-priori using an orientation interval, or a
heuristic threshold on admissible orientation deviations, we call this “knowledge.”
Else, if a distribution on the orientations is automatically estimated on a learning
set, we do not call this knowledge application anymore. Then it is machine learning.
• The lighting direction determines very much how an object appears in a synthetic
aperture radar (SAR) image. In fact, in such imagery the vertical and the hori-
zontal directions result from completely different measurement principles—time-
of-flight, i.e., distance in lighting direction and the synthetic aperture sharpening
perpendicular to the lighting. SAR image interpreters are aware of the correspond-
ing effects. Not only shadow is important here. Still, the laws of Gestalt perception
also play a considerable part, and the experts may not be aware of the role of Gestalt
laws in their inferences.
• When analyzing biological imagery—such as microscopic pictures of pollen or
pictures of plant parts—an expert will use her/his knowledge about the field to infer
species, etc. Often such knowledge is laid down in monographs, or it is available
in electronic data banks. There are formats, e.g. ontologies, that allow machine
inference.
A general theory on the combination and interaction of knowledge, learning,
and Gestalt perception in automatic image analysis is not in sight yet. Currently,
the machine vision community strongly focuses on automatic learning. In the next
section of this chapter we will therefore review the literature on knowledge-based
image analysis, so that the reader may get familiar with the terminology. Then, in
the following sections, some examples will be given what can be accomplished now
in distinguishing the three forms of machine vision: knowledge-based, learning, and
Gestalt grouping. Readers with good background in knowledge-based image analysis
may skip the first reviewing section.

12.2 A Small Review on Knowledge-Based Image Analysis

A well-elaborated terminology for knowledge-based image analysis has been given
by Niemann in [2]. In our opinion this is still a good source and sets the standard.
Along these lines a semantic net is a graph, where the nodes are concepts. For
example, the entities runway and rectangle presented in Production 12.1 can be
regarded as such concepts. Different sorts of links are distinguished:
• A special link indicates a more specific concept; e.g., grass runway is more specific
than runway, and the latter is more general than the former. Some authors prefer
the term is-a link for this type of link. The most general concept is often referred
to as object.
• A concrete link indicates a step from more abstract and task-related concepts
to concepts closer to measurements. This would be the adequate link for Pro-
duction 12.1: Finding a runway is the task. Such object shows more concrete as
elongated rectangle in the pictorial data.
• A part of link indicates a step in the hierarchy of aggregation. Parts and aggregate
are on the same level with respect to the other two link types. For instance, a
rectangle will have four parts: two long side margins and two short side margins.
Often there are specific mutual relations between the parts, which must hold in
order to yield a valid aggregate, such as parallel or orthogonal. All the Gestalt
operations given in the book at hand can be understood in this part of hierarchy.
It may appear to the reader that the Gestalt operations remain on a rather low level
as well in the special hierarchy as in the concrete hierarchy. Most of the examples
given in this book operate on primitives segmented from pictorial data by standard
image processing methods. However, on the one hand the operations are defined in
a rather general way, on a rather general domain. “Gestalt” is more special than just
“object,” but not much. Thus, the operations define quite general grouping principles,
which need not be newly coded every time for any specific new recognition task,
or object class. During an inheritance down the special hierarchy they may well be
re-parametrized, or augmented, but need not be newly coded or learned. And on
the other hand, the primitives of Gestalt grouping may well result from a semantic
segmentation, i.e., on a less concrete level. For example, if a classifier segments
objects of type house or road from an aerial image, the perceptual grouping will
aggregate rows of houses. This is clearly on a more abstract, more task-related level,
and less concrete, and less close to the sensor data.
Niemann emphasizes separation between the declaration of knowledge and the
construction of inference engines utilizing such knowledge. For the knowledge repre-
sentation he reviews several approaches, such as production systems and grammars.
His favorite is the semantic net. It is evident that often the efforts for complete
combinatorial enumeration of all possible inferences becomes infeasible. Therefore,
he proposes smart control mechanisms for the knowledge utilization. Search tree
traverse algorithms are proposed, and these are based on scores. Such values may
be probabilities or fuzzy memberships. Different nodes of the search tree with dif-
ferent numbers of instantiations should be comparable, so that the control module
can decide on which branches computational efforts should be used and where they
would probably be wasted. Such scores are very similar in their functional role to
the assessments used throughout this book for Gestalten. Only our assessments are
stored with the instances, instead of search states—admittedly a violation of the
separation between declarative knowledge and its utilization.
Many of Niemann’s examples for semantic net instantiation are taken from the
domain of automatic language understanding. However, he claims applicability also
to image and video analysis. Some image understanding examples are from medical
image analysis. Today the performance that can be achieved with such approach may
not be state-of-the-art anymore. However, recall that a knowledge-based system can
analyze observations of objects unseen before. It can recognize patterns without a
single training example. This can be an advantage in application domains where
representative labeled data are not provided at all or very expensive. Also, every
resulting inference can be explained step by step, rule by rule, in case something
goes wrong.
A pioneer in knowledge-based analysis of remotely sensed imagery, in particular,
is T. Matsuyama. In his system, called SIGMA [3], also three kinds of links connect
the objects, which correspond to Niemann’s concepts:
• The a-kind-of link is just another word for specialization.
• The appearance of link can be seen as special kind of concretization. Namely, it
connects scene objects in the world to their appearance in aerial images. A deeper
hierarchy is not used with respect to these links
• The part of link has the same name, meaning, and function.
Emphasis is on part of analysis. The relations between the concepts are laid down in
production rules. In principle, there is a separation between knowledge and search
control as well. But the control in SIGMA is performed by a geometric reasoning
expert, meaning this is not a general purpose knowledge interpreter, which may as
well analyze language data. This control module includes consistency tests, which
are even more specifically tailored to aerial imagery and not appropriate for other
types of imagery. Moreover, the global database does not only contain iconic data
and inferred symbolic instances, it also contains hypotheses. So the search is not
only bottom up and data driven. It also includes focus-of-attention mechanisms.
We discussed such issues close to the end of Sect. 5.3.2, and everywhere, where
we treated illusion. If separation between knowledge and search control should be
maintained, hypotheses cannot be a part of the database, i.e. the set of observed and
inferred instances, whereas illusory instances can be entries in such database.
A large part of the knowledge that is used in the examples given by T. Matsuyama
is in fact perceptual organization, i.e., repetition in rows, parallelism, proximity, good
continuation. Other parts are domain-specific and on a more symbolic level, such
as “Houses are connected to roads by drive-ways.” Most of reasoning performed
by SIGMA is abductive. Therefore, emphasis is on an elaborate query-and-answer
interface to a human user or expert. Thus, software-bugs and misunderstanding can
be fixed, and there is hope that such system may improve with its use.
One of the most advanced systems for the extraction of roads from aerial imagery
has been developed by Hinz et al. [4]. Figure 12.1 shows the fundamental declarative
road model given there. Again, in such systems it is possible to distinguish which
parts are knowledge utilization and which parts are in fact perceptual grouping along
the lines of Gestalt laws:
• The road model is given in the standards of a semantic net with part-of links, spe-
cialization links, concretization links, and so-called general relation links, respec-
tively. Vertical hierarchy in the net is concretization with the topmost semantic
level being “road network” and the lowest level being primitives obtained from
the images, such as lines, blobs, or signs found by template matching. Such declar-
ative modeling clearly belongs to the domain of knowledge-based image analysis.
• The work emphasizes the role of scale space. In the figure displaying the model
the other direction, i.e., the horizontal direction, is devoted to scale, scales with

Fig. 12.1 A semantic net for road extraction from aerial images, courtesy of Hinz et al. [4]
12.2 A Small Review on Knowledge-Based Image Analysis 169

fine details to the left and overview scales to the right. Utilization of scale space is
a general property of sophisticated image processing, also reflected in the Gestalt
domain used throughout this book. However, using different models on different
scales, as Hinz and Baumgartner propose, is clearly knowledge utilization.
• The least salient links in the model display are the general relation links, and
most of these may well be associated with Gestalt laws such as “is aligned,” see
Chaps. 3 and 8, or “is parallel or orthogonal,” see Chap. 9. Note that these links
are sometimes recurrent. That indicates that search procedures are used, similar to
the ones outlined in these chapters. The figure may well be simplified, and some
such recurrent general relation links may well be omitted. For example, “road
segment” needs such good continuation law grouping, including gap bridging
mechanisms. Probably a large part of the actual computation efforts of such system
is actually resulting from these perceptual groupings. Moreover, a proper setting of
the tolerance parameters for these relations is crucial for the success and stability
of the system.

12.3 An Example from Remotely Sensed Hyper-spectral Imagery

In Sect. 11.6, an unsupervised machine learning method was described for primi-
tive extraction on hyper-spectral imagery. Thus, the Gestalt grouping can now be
performed on a more symbolic level with respect to concretization. These are not
just any kind of spots; these are objects in the geographic plane with a very specific
common thermal spectral signature. We repeat them in the upper left part of Fig. 12.2.
A run of the search for hierarchical Gestalten using terms as outlined in Sect. 5.1
gives the result shown in the upper right of the Figure. Only reflection symmetries
of rows of primitives with assessment better than a certain minimal assessment are
shown. These are clustered using the method outlined in Sect. 2.8. Basically two big
cluster axes result, where the dominant one is shown in lighter gray for better visibility
on the dark background. The other cluster is perpendicular to it and indicated darker.
The point is that the best axes cluster is well enough consistent with the human fish
bone Gestalt perception. However, when analyzing in detail, i.e. drawing only the
Gestalten that participate in this cluster—as has been done in the lower left part of
the figure—it can be seen that not many of the fish bone rows participate. Instead,
some rows are participating that run along the axis on both sides and that a human
observer would not prefer, before being pointed on by the system.
It turns out that much of what the human observer sees in the raw data (Fig. 11.10)
is already lost in the primitive extraction step. We admit that this loss might be mit-
igated to a certain degree using more efforts in the settings of the image processing
chain. However, we emphasize: This is a typical experience shared by many peo-
ple testing knowledge-based inference systems on real data. Such results must be
expected, and the developer should not be disappointed or discouraged at this stage.
Fig. 12.2 Hierarchical Gestalt grouping on hyper-spectral imagery: Upper left—primitives; upper
right—hierarchical perceptual grouping result; lower left—best axis cluster of Σ|Σ-type; lower
right—adding the next self-organizing map segment fitting best to the row-end queries

The miraculous quality of human seeing is a result of much Gestalt grouping, some
knowledge utilization, and a little bit of learning. This is why humans are so much
better than standard image processing chains.
What we can learn from the pioneers of knowledge-based image analysis, such
as Niemann, Matsuyama, Sarkar, or Hinz, is that at this stage hypothesis-driven top-
down search can help. For example, when keeping record of the row-prolongation
steps, which are executed in the manner outlined in Sect. 3.5.2, one should ask:
• “What spectra are encountered where the rows participating in the grand Gestalt
could not be prolonged?”
A system that is capable of administering such queries can be imagined. It is just
a question of coding skill, endurance, and diligence. The answer to this query is
that most of the spectra in those locations fall into a particular other segment on the
self-organizing map. If then this segment is merged with the original segment, the
pattern will become much more complete. Performing the same primitive extraction
on the new united segment gives the set of primitives shown in the lower right part
of the figure. And now we can start the bottom-up grouping again with much better
prospects.
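One way such a query loop might be administered is sketched below; all interfaces here—the record of failed prolongations, the self-organizing map lookup, and the primitive extractor—are hypothetical placeholders, not functions of an existing system.

```python
from collections import Counter

def refine_primitives(failed_prolongations, som_label_of, current_labels,
                      extract_primitives):
    """Top-down refinement: find which self-organizing map segment dominates at
    the locations where row prolongation failed, merge it with the segment used
    so far, and extract primitives again on the union.

    failed_prolongations      -- image locations (x, y) where rows could not grow
    som_label_of(x, y)        -- hypothetical lookup of the SOM segment at a pixel
    current_labels            -- set of SOM labels forming the current segment
    extract_primitives(labels)-- hypothetical primitive extractor on a label set
    """
    votes = Counter(som_label_of(x, y) for (x, y) in failed_prolongations)
    votes = Counter({k: v for k, v in votes.items() if k not in current_labels})
    if not votes:                       # nothing new encountered at the gaps
        return current_labels, extract_primitives(current_labels)
    best_label, _ = votes.most_common(1)[0]
    merged = current_labels | {best_label}
    return merged, extract_primitives(merged)
```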

12.4 An Example from Synthetic Aperture RADAR Imagery

In Sect. 10.5 an example for lattice grouping is given. Recall that the primitives
appearing as dots in Fig. 10.3 are permanent scatterers in the sense of [5]. Such
objects are sensed from a satellite repeatedly flying the same orbit and sensing the
same strong response at the same location.
The following knowledge fragments can be utilized, either automatically or by
hand:
• When taking the data, the satellite looked at an urban area in which large buildings
are likely.
• On the facades of large urban buildings, windows and other structure are preferably
organized in vertical columns.
• In SAR imagery one direction (here horizontal) records the signal flight time.
Taking into account the signal speed this gives the distance between antenna and
scatterer. This is known as radar principle. Since the looking direction is always
oblique from above, a higher object, such as the roof of a tall building, is sensed
closer than a lower part, such as the foot of the building. The parameters of the
mapping geometry are known and given with the image.
• Urban buildings are mostly built for humans; thus a vertical organization into
structural levels of something like three meters height is likely, just high enough
even for the taller exemplars and with some half meter for the structure and another
half meter tolerance.
From these knowledge fragments, an inference can be drawn that rows of horizontal
scatterers can be expected in these data. The generator should be about eleven units
long. A vertical column of windows or similar facade structures will appear this way
in such image. As we emphasized earlier this does not mean that the inference from
a horizontal row of such scatterers to the presence of a window column is sound.
This is an abductive inference. A row of parked vehicles on the ground pointing
directly toward the antenna, and having the correct spacing, may cause the same
appearance. However, this is a somehow degenerate1 setting. Vehicle rows or other
repetitive metal structure will rather follow the road directions, which are unlikely
to be parallel to the looking direction.
Such inference is stronger if also the horizontal grouping on the facades in the
scene is utilized. Chapter 10 gives good reasons why such lattice grouping is much
stronger than exclusive grouping in one of the directions. It is also stronger than
hierarchical grouping first in one direction and then in the other.

1 Degenerate here in the sense of [6], the probability that a horizontal grouping accidentally points
exactly toward the antenna is zero, it almost never happens.

Fig. 12.3 Lattice grouping on the SAR data also used in Chap. 10, but here with knowledge-based
constraint on horizontal grouping of facades

Obviously, there is
another useful knowledge fragment:
• Horizontal repetition along a structural level of an urban facade is very likely.
However, while the expected length of the vertical generator is more or less deter-
mined by the height of the human body, the horizontal repetition generator can
take a large variety of lengths. It can be half a meter as well as five meters.
The search for lattice Gestalten on such data can therefore be modified, and compu-
tational resources can be used more productively, as well as the expected recognition
performance will improve: The initial row pair formation mentioned near the end
of Sect. 10.4 can be restricted to horizontal rows. It can also be restricted further
to pairs of a well-known distance in comparably tight tolerances. The correspond-
ing generator vector can be constrained to horizontal direction, and the assessment
function can include a term punishing residual deviations from this direction. Only
a small portion of the initial row pairs that gave the lattice seeds in the central part
of Fig. 10.3 is consistent with these additional constraints.
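Such a punishing term might, for instance, take the following form; the Gaussian shape and the tolerance value are assumptions of this sketch, not the assessment function used in Chap. 10.

```python
import math

def horizontality_factor(generator_dx, generator_dy,
                         tolerance_rad=math.radians(5.0)):
    """Assessment factor in [0, 1] punishing deviation of a row generator from
    the horizontal (range) direction: close to 1 within the tolerance, close to
    0 beyond roughly three tolerances."""
    deviation = abs(math.atan2(generator_dy, generator_dx))
    deviation = min(deviation, math.pi - deviation)   # generator direction is axial
    return math.exp(-0.5 * (deviation / tolerance_rad) ** 2)
```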
A pair of row pairs is needed for the formation of the initial lattice seeds (see
Sect. 10.4). If we have been more demanding in the first step, we can now be more
liberal, because we are aware that almost any direction is now permitted, and there
can be a large variety in the lengths of possible generators. This gives a different set
of seeds as compared to the ones presented in Fig. 10.3, where no domain knowledge
was used.
Then, in the augmentation steps, the more horizontal generator can also be con-
strained to the horizontal direction. A primitive that tends to draw it too much off
this preferred direction can rather be ignored and replaced by an illusion. The result
is shown in Fig. 12.3, and these are not just any salient lattices anymore. It is very
unlikely that some artist or other unknown process placed very large parallelograms
on the ground in such urban area and aligned it with the viewing direction of the
satellite looking later at it. Here we can follow [7] and infer that the tilt of the
parallelograms—the violation of their symmetry—is caused by the mapping process
and that the objects in the scene are in fact big rectangular lattices standing upright;
i.e., they are facades. A false positive is extremely unlikely.
However, comparison with GIS data of the area yields that there are more facades
in the area; i.e., there are false negatives. Probably the machine vision part cannot be
blamed for this. These additional facades simply do not give enough response in such
imagery. It is miracle enough that such results can be obtained from a distance of
several hundred kilometers and through clouds and atmosphere.

References

1. Schenk T (1995) A layered abduction model of building recognition. In: Automatic extraction
of man-made objects from aerial and space images, Ascona workshop of the ETH Zurich, pp
117–123
2. Niemann H (1990) Pattern analysis and understanding. Springer
3. Matsuyama T, Hwang VS-S (1990) SIGMA, a knowledge-based aerial image understanding
system. Springer
4. Hinz S, Baumgartner A, Steger C, Mayer H, Eckstein W, Ebner H, Radig B (1999) Road
extraction in rural and urban areas. In: Förstner W, Liedtke C-E, Bückner J (eds) Semantic
modelling for the acquisition of topographic information from images and maps (SMATI 1999),
pp 133–153
5. Sörgel U (ed) (1990) Radar remote sensing of urban areas. Springer
6. Pizlo Z, Li Y, Sawada T, Steinman RM (2014) Making a machine that sees like us. Oxford
University Press
7. Leyton M (2014) Symmetry, causality, mind. MIT Press, Cambridge, MA
Chapter 13
Learning

All methods outlined in the other chapters of this book are constructed without use
of any example data or labels. The examples were only used for clearer explanation.
This is a somewhat oppositional approach. As a technical or scientific term, “pattern
recognition” is used today in a way that deviates considerably from its common-sense meaning. We are
writing this book in times of general agreement that the recognition performance
of massive deep learning approaches cannot be beaten. Such machines must be
trained with millions of labeled images before they start beating the performance of
recognition methods designed by engineers according to their view on the problem.
If all the very many parameters in these deep learning machines are properly adjusted
by the use of huge masses of data, the performance figures will become impressive
indeed.
Given a corpus of representative and properly labeled data one may introduce
and adjust certain parameters so as to improve the recognition rates and estimation
precision of the Gestalt search as well. This chapter starts with a discussion of existing
and suitable labeling procedures. Then we give some examples for parameters which
may be built into the operations, assessments, etc., and outline training procedures
for the adjustment of such parameters.

13.1 Labeling of Imagery for Evaluation and Performance Improvement

The laws of Gestalt perception are meant to capture properties of the human per-
ception. It is therefore necessary to evaluate how well machine Gestalt perception
and human Gestalt perception conform. In particular, reliable quantitative figures
are required. Actually, more interdisciplinary efforts with psychologists should be
aspired. We emphasize that machine vision researchers, and in particular those who
are working on perceptual grouping, as well as the psychologists working on the
corresponding branch, should be avoided as test subjects. Their academic view on
the topic will necessarily bias their natural instincts and unaware functions of their
seeing.
There exist several corpora of data that have been compiled for use in the symmetry
recognition competitions, e.g., along with the CVPR 2013 [1] and ICCV 2017 [2].
For a single reflection symmetry, the observer had to mark two locations on the
image: the begin location of the reflection axis and the end location of the reflection
axis. Figure 13.1 shows an example from the 2017 contest. The axis marked as
ground truth is displayed as white thick line. This gives no indication as to how far
the symmetric part of the picture reaches to the right and left of the axis. However,
there is one very important advantage: Marking one symmetry in one image requires
only two mouse clicks. Thus, a fairly large number of pictures can be labeled by each
test subject within the given time constraints.
This type of ground truth can straightforwardly be translated into the Gestalt
domain: The position results from the mean of the two locations, and size and ori-
entation result from the connecting 2D vector. The corresponding element is drawn
into Fig. 13.1 as black line.
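In code, this translation might read as follows; taking the full length of the connecting vector as the Gestalt scale is an assumption of this sketch.

```python
import math

def axis_endpoints_to_gestalt(x1, y1, x2, y2):
    """Translate a two-click reflection-axis ground truth into position, scale,
    and orientation of the corresponding Gestalt."""
    position = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)   # mean of the two locations
    dx, dy = x2 - x1, y2 - y1                       # connecting 2D vector
    scale = math.hypot(dx, dy)                      # assumed: scale = vector length
    orientation = math.atan2(dy, dx) % math.pi      # axial feature, period pi
    return position, scale, orientation
```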
For frieze and lattice pictures, the observer had to mark a lattice of locations on the
image and indicate whether they are valid or virtual. The latter would be used if, e.g.,
due to occlusions, parts of the pattern were invisible, but inferred by the observer.

Fig. 13.1 Ground truth as given for the 2017 competition, #81, single reflection: white—reflection
axis as given; black—corresponding Gestalt
Fig. 13.2 Ground truth as given for the 2017 competition, #27, frieze: white—given valid part of
lattice; black—corresponding Gestalt and predecessors

Figure 13.2 shows the example used in Sect. 5.3.2. The lattice of valid points has
2 × 5 vertices and is displayed again in white thick lines. The regions inside the four
quadrangles are regarded as corresponding repetition in a frieze by the observer.
The generator vectors of such ground truth lattice were not constrained in the
sense of Chap. 3 or 10. Thus, the observer could also mark perspective foreshortening
or even arbitrary distortions. Accordingly, such ground truth contains considerably
more information than the corresponding row Gestalt, which is drawn again in thin
black lines together with four possible predecessors. There is no one-to-one relation
between these formats like in the reflection case above.
Instead of clicking so many points, an alternative would be marking only one
location, giving the numbers in row and column, and adjusting the generators. There
is a very important disadvantage in marking so many locations. A fairly large number
of clicks in each image means that only a small number of images can be labeled by
each test subject within the given time constraints. Moreover, so many clicks lead to a
certain sloppiness, a lack of diligence. There is very much information in this format,
and not all of it may be meaningful or accurate. Note in this example a considerable
tilt is evident.
13.2 Learning Assessment Weight Parameters

A large portion of the Gestalt literature deals with empirical evidence on the mutual
strength or superiority of the laws (see, e.g., [3]). Such findings can be included in a
heuristic way by the introduction of weights into the assessment fusion 2.9. Recall
that this fusion is a product. Thus, the usual weighted sum approach cannot be applied
here. Instead exponents αi are used:
acombined = ( a1^α1 · · · an^αn )^(1 / (α1 + · · · + αn)) .    (13.1)

Setting a weight parameter small reduces its influence. For example, if one of the
assessment weight parameters is set to zero, the corresponding Gestalt law will not
apply at all in this fusion. If the literature says that reflection symmetry is much
stronger than proximity, the user may set the corresponding weights to 2.0 and 0.5,
respectively, while leaving all other weights at 1. Maybe then the results will come
closer to the human Gestalt seeing. The normalizing exponent at the end of Eq. 13.1
automatically avoids decreasing or increasing assessment with the number of laws
under consideration or the depth in hierarchy.
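For illustration, the fusion (13.1) might be implemented as follows; working in the log domain and assuming that all assessments lie strictly between 0 and 1 are choices of this sketch.

```python
import math

def fuse_assessments(assessments, weights):
    """Weighted fusion (13.1): geometric mean of the assessments a_i with
    exponents alpha_i, normalized by the sum of the exponents.
    All assessments are assumed to lie in (0, 1]."""
    alpha_sum = sum(weights)
    log_product = sum(w * math.log(a) for a, w in zip(assessments, weights))
    return math.exp(log_product / alpha_sum)

# Reflection symmetry weighted 2.0, proximity 0.5, one further law left at 1.0:
# fuse_assessments([a_reflection, a_proximity, a_other], [2.0, 0.5, 1.0])
```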
The reader may worry about heuristic parameter setting. Choosing good or optimal
assessment weights may well be a tedious and tricky procedure. With no represen-
tative and labeled training data at hand, based only on the literature, we see no other
way. However, function 13.1 is analytical. Partial derivatives with respect to any
of the parameters can be used for gradient descent parameter learning. The partial
derivatives read:

∂acombined / ∂αi = [ −1/(α1 + · · · + αn)² · ln( a1^α1 · · · an^αn )
                     + 1/(α1 + · · · + αn) · ln(ai) ] · acombined .    (13.2)

Note the first term in the sum in the brackets is always positive and always equal for
all i. The factor −1 compensates for the logarithm of a value smaller than one. The
second term is always negative. Its size depends on ln(ai ), so it will be small if ai is
close to one, and can become arbitrarily large when ai approaches zero.
Examples for positive as well as negative training Gestalten established by our
search methods are given in Fig. 13.3, Tables 13.1 and 13.2. Such samples can be
used for gradient descent learning using Eq. 13.2.
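Building on the fusion above, one gradient step on the weights using the derivatives (13.2) might be sketched as follows; the learning rate, the ascent direction for a positive example, and the clipping to non-negative weights are assumptions of this sketch.

```python
import math

def weight_gradient_step(assessments, weights, learning_rate=0.01):
    """One gradient step on the exponents alpha_i of Eq. (13.1), using the
    partial derivatives (13.2).  For a positive training Gestalt the combined
    assessment should rise; for a negative example the same step can be taken
    with a negative learning rate."""
    alpha_sum = sum(weights)
    log_product = sum(w * math.log(a) for a, w in zip(assessments, weights))
    combined = math.exp(log_product / alpha_sum)
    gradients = [(-log_product / alpha_sum ** 2 + math.log(a) / alpha_sum) * combined
                 for a in assessments]
    return [max(0.0, w + learning_rate * g) for w, g in zip(weights, gradients)]
```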
Fig. 13.3 Positive and negative instance on #81: white—closest to ground truth among ||-Gestalten;
black—furthest from ground truth among the thousand best ||-Gestalten

Table 13.1 Examples of positive and negative learning Gestalten on image #81
– Horiz. Vert. Orient Frequ. Scale Assess
Negative 22.2985 314.0919 0.4774 2.0000 60.9924 0.8570
Positive 300.5831 240.6288 0.9484 2.0000 191.1783 0.6178
gr.-truth 294.8550 195.7720 0.9592 2.0000 288.5493 –

Table 13.2 Assessment components of positive and negative learning Gestalten on image #81
Negative 0.7919 0.9977 0.9864 0.6069 0.9775
Positive 0.1904 0.8342 0.9730 0.5865 0.9929

13.3 Learning Proximity Parameters with Reflection Ground Truth

As a first example we concentrate on the proximity law in the reflection Gestalt
operation |; see Sect. 2.5. One option for the corresponding assessment function
was given in Eq. 2.6. This particular choice has the form of a Rayleigh density func-
tion. It is well known that such densities are often given with a scale parameter σ. This
may be done for the assessment function as well, yielding the form

ad(gp, gq) = √e · |xp − xq| / (σ · √(sp · sq)) · exp( −|xp − xq|² / (2σ² · sp · sq) ) .    (13.3)

Let us assume a ground truth |-Gestalt gg with its location at x g . It has also a scale
sg and an orientation φg . Let us further assume a |-Gestalt g f positively corresponding
to gg which was found by one of the search methods outlined in Chap. 6. Naturally,
the goal function is the assessment acombined, f as defined in 2.9. This goal is to be
maximized under variation of a parameter σ of the goal function. If the goal
function is differentiable with respect to this parameter we can define a learning step:

σn+1 = σn + α · ∂acombined,f / ∂σ    (13.4)
The derivative of the assessment with respect to the parameter yields a sign and
strength. A learning parameter α controls the amount of change. Note that the pre-
decessors g p and gq of the found g f must be known: g f = g p |gq . So record has to
be kept on the Gestalt constructions.
Looking again at example #81 of the single reflection competition data, which
was already used in Sect. 13.2 above, the construction term of the best fitting Gestalt
found is displayed in Fig. 13.4. The distance between the direct predecessors in this
case is about 136 pixels, while the mean scale of the two direct predecessors is only
55 pixels. With an initial σ0 = 1 we have a proximity assessment of about 0.19.
Obviously, with larger σ this assessment component can be raised considerably for
example #81.
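A sketch of the learning step (13.4) might look as follows; it uses the Rayleigh-type form of (13.3), simplifies the goal to the proximity component alone instead of the full combined assessment, and takes the derivative numerically. The step size α is an assumption of this sketch.

```python
import math

def proximity_assessment(distance, scale_p, scale_q, sigma=1.0):
    """Rayleigh-type proximity assessment (13.3), normalized so that its
    maximum over the distance equals 1."""
    u = distance / math.sqrt(scale_p * scale_q)
    return (u / sigma) * math.exp(0.5 - u ** 2 / (2.0 * sigma ** 2))

def sigma_learning_step(distance, scale_p, scale_q, sigma, alpha=0.5, eps=1e-4):
    """One step of (13.4), with the derivative with respect to sigma taken
    numerically by central differences."""
    a_plus = proximity_assessment(distance, scale_p, scale_q, sigma + eps)
    a_minus = proximity_assessment(distance, scale_p, scale_q, sigma - eps)
    return sigma + alpha * (a_plus - a_minus) / (2.0 * eps)

# Example #81: distance of about 136 pixels at a mean scale of about 55 pixels
# gives an assessment of about 0.19 with sigma = 1; the update raises sigma.
print(proximity_assessment(136, 55, 55))           # about 0.19
print(sigma_learning_step(136, 55, 55, sigma=1.0)) # larger than 1
```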
Such adjustments can also be done by propagating corresponding signals through
the hierarchy of a nested Gestalt construct. This resembles the well-known back-
propagation adjustments used in artificial neural nets.

Fig. 13.4 Best fitting term on #81: underlying a brighter version of the image; same Gestalt as in
Fig. 13.3 but with its gray-tone corresponding to its assessment and also with all predecessors of it
as overlay

In the example at hand, the
distance between the primitives in the left predecessor is about 34 pixels with a mean
scale of about 17 pixels, yielding a proximity assessment component of about 0.43,
and the distance between the primitives in the right predecessor is about 40 pixels
with a mean scale of about 19 pixels, yielding a proximity assessment component
of about 0.38, respectively. So in both cases larger σ will raise the corresponding
assessment component for the parts of example #81 as well, and with them, through
assessment inheritance, also the assessment of example #81. However, the impact on
this final assessment is smaller, because the effect is propagated through the fusion
formulae.

13.4 Assembling Orientation Statistics with Frieze Ground Truth

It is more along the lines of pattern theory—as it has been outlined, e.g., by
Grenander [4], Mumford and Desolneux [5]—to see the smooth functions intro-
duced here as “assessments” as probability densities. The model assumes a condi-
tional probability between the features of the parts of an aggregate. Parts are called
generators there, and the aggregate is called configuration. Taking that view on the
Gestalt grouping at hand, we are looking at a Bayesian net. The task then is assem-
bling sufficient and representative statistics for the estimation of these densities, or
for estimating parameters of density models for them.
With properly labeled data sets at hand, automatic estimation of such densities
becomes feasible. As an example let us take the law of similarity in orientation used
in the grouping of rows in Chap. 3. The ground truth will give a location, orientation,
and size for each labeled object—like it is displayed in white for #4 of the frieze part
of the 2017 competition [2] on Fig. 13.5. The ground truth will also give the number
of parts, in this case eight.
Table 13.3 lists the orientation features of the parts in degree—recall these Gestal-
ten have rotational self-similarity frequency 2, so if their orientation is given in degree,
it should be between zero and one hundred and eighty degrees.
Section 1.5 gives a short review on how to handle statistics on orientations: The
unit vectors corresponding to the orientation values are summed up in 2D. If the
resulting sum is not the null vector, a meaningful argument can be assigned. For the
values at hand in Table 13.3, this results as 96.5994 ◦ . The correct estimation of the
parameter κ of the von Mises distribution requires inversion of a term containing
Bessel functions. But there are less complicated approximate estimations known for
this 2D case, i.e., von Mises–Fisher models [6]: Let the sum of unit vectors be s and
the number of them be n. Then the important figure is r = |s| /n. If all unit vectors
are equal, r will reckon as one. In the example above, it results as 0.9833 which is
quite large, i.e., evidence for a narrow distribution. If r turns out to be close to zero,
there is evidence for a uniform distribution.

Table 13.3 Orientations of the part Gestalten in Fig. 13.5 in degree
87.28 91.97 93.26 96.33 101.34 99.96 104.68 97.91

The von Mises–Fisher approximation for κ is

κ ≈ r (2 − r²) / (1 − r²) .    (13.5)

For our example this yields 30.71. Comparing that with the plots in Fig. 2.3 in
Sect. 2.3 a very large difference between the assessment function used by default
and this very sharp density can be seen. There is an iterative improvement for the κ
estimation in Eq. 13.5, but considering the relatively small evidence here, such efforts
may be in vain.
The naive way of handling such orientation statistic would be simply reckoning
a mean and a standard deviation from it, the mean being 96.5926 ◦ and the deviation
5.6101 ◦ . This treats the values as if they were elements of a vector space—which
they are definitely not. However, such naive mean is almost correct here (only seven
thousandth degree off). And also the deviation—i.e., the shape of the corresponding
normal density wrapped around the domain—is very close to the von Mises–Fisher
density model outlined above. This is because the deviations are comparably small in
this example. Yet, the automatic use of naive statistics on orientation data cannot be
recommended even in such narrow cases. Recall this is also a benign example because
the mean is here quite in the middle of the interval used for representation of the
elements. For mean values closer to horizontal orientation—i.e., zero or one hundred
and eighty degrees—awkward case splits would be required in the code. We recommend
using the methods outlined above and in Sect. 1.5 in all cases where orientation
statistics are treated.

Fig. 13.5 Best fitting row and predecessors on #4: underlying a brighter version of the image;
ground truth displayed in white and gray-tone of automatically found Gestalten corresponding to
their assessment
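The computation described above might be sketched as follows; the angles are doubled to account for the 180° period (frequency 2) of these orientation features, as outlined in Sect. 1.5.

```python
import math

def axial_orientation_statistics(orientations_deg, frequency=2):
    """Circular mean, mean resultant length r, and the von Mises-Fisher
    approximation (13.5) of kappa for orientations with a 360/frequency
    degree period (here 180 degrees, i.e., frequency 2)."""
    n = len(orientations_deg)
    sx = sum(math.cos(math.radians(frequency * a)) for a in orientations_deg)
    sy = sum(math.sin(math.radians(frequency * a)) for a in orientations_deg)
    mean_deg = math.degrees(math.atan2(sy, sx)) / frequency % (360.0 / frequency)
    r = math.hypot(sx, sy) / n
    kappa = r * (2.0 - r ** 2) / (1.0 - r ** 2)
    return mean_deg, r, kappa

parts = [87.28, 91.97, 93.26, 96.33, 101.34, 99.96, 104.68, 97.91]  # Table 13.3
print(axial_orientation_statistics(parts))   # about (96.6, 0.983, 30.7)
```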
Actually, the row Gestalt given in Fig. 13.5 may be the closest to the ground truth,
but there are alternatives, which are almost as close. With the parameter settings
used for the competition [7], a cluster of 46 row Gestalten can be identified which
is close to the ground truth on this image. The example is contained in this set, and
all of them are very similar to it. Many of them have some parts in common with the
presented example. Thus there would be a risk of bias in larger orientation statistics
based on such cluster.

13.5 Estimating Parametric Mixture Distributions from Orientation Statistics

In [8] we outlined an estimation procedure for the parametrization of a likelihood
density used in a perception-action cycle. As an example task visual unmanned
aerial vehicle navigation was assumed, and a given automatic object recognition
system localizing landmarks on aerial images was fixed. In that work an old-fashioned
knowledge-based production rule system with an any-time interpreter was used, such
as it is presented in this book in Sect. 6.4. With a fixed recognition and localization
machine, the question arises how much confidence can be put in its results when using
them in the navigation control loop. How often does it see an illusory landmark where
there is none? How often does it fail in reporting an existing landmark in the image?
And how can we model its residual displacement when reporting a true positive
location?
Statistics, on which answers to these questions can be based, can be acquired by
flying a vehicle in the desired operation area, taking pictures of all the modeled land-
marks along the path, running the recognizer, counting the recognition performance,
and collecting all the residual deviations. Evidently, such endeavor will cause con-
siderable efforts and costs. So the question is, what of these tasks can be automated
or replaced by simulation? In [8] we decided to use the virtual globe viewer Google
Earth and export screen-shots from it at the given times, instead of really flying a
vehicle equipped with a camera. Thus, perfect ground truth was provided for the real
look-from location and also for the location of the landmark in the image. Virtual
flights can also be repeated as often as desired with almost no extra efforts, so that
large statistics can be assembled.
The lessons learned from the visual navigation example are:
• With ground truth set locations on one hand and localizations found automatically
on the other, there are statistics on the residuals which can be used for the estimation
of parameters of a density model. Utilizing such a model instead of rather arbitrary
or heuristic functions will improve the performance in fulfilling the task.
• Simple two-component models with, e.g., one inlier component and one outlier
component, often do not really fit the data thus found. This corresponds to the
experience that for some instances the decision whether it is an outlier or an
inlier can become problematic. Such half-correct detections or “inbetweeners”
have a distribution which is wider than the really true positives, but sharper than
illusions, which may occur anywhere in the image. They still cluster around the
target. Thus this component probably also helps in fulfilling the navigation task.
The corresponding three-component distribution fits the empirical statistics much
better.
• The estimation of such mixture density parameters can be obtained by expectation-
maximization (EM) iteration starting from plausible initial settings such as the ones
underlying the heuristics used before [10]. EM is robust and fast on this task.
The estimation method outlined in [8] is straightaway applicable to the assessment of
proximity as defined in Sect. 2.5, Definition 2.4. We have seen that a Rayleigh density
is a possible function form for this Gestalt law. Of course this would also hold for
mixtures of a small number of Rayleigh densities with rising parameter. We repeat
at this point that the asymptotic convergence of the proximity assessment to zero
for rising distances is definitely necessary for Theorem 5.1 in Sect. 5. If Gestalten
may directly relate to each other, although they are very far away from each other
as compared to their scale, intractable search efforts will result. Such interaction
can only be permitted by using top-down reasoning on hierarchies of Gestalten as
indicated in Sect. 5.3.2. Thus, “heavy-tail” components must be avoided here because
of their consequences on the search efforts.
When a similar mixture for orientation features is to be estimated von Mises
distributions and uniform distributions should be the components. In [9] an example
was given, how this can be done for building outlines in remotely sensed data.
Application of the expectation–maximization algorithm for such mixture models
requires guessing the number of components first. Based on our experience with the
landmarks in [8] we prefer two von Mises components with the same expectation but a
different width for the inliers and the inbetweeners, respectively. As third component
a uniform component is chosen for rows where the parts do not fulfill any law of
similar orientation at all.
The probability density function of a von Mises distribution reads

p(α | φ, κ) = 1/(2π I0(κ)) · exp{κ cos(α − φ)} ,   0 ≤ φ ≤ 2π, 0 ≤ κ ≤ ∞    (13.6)

where I0 (κ) is the modified Bessel function of order zero, φ is the mean direc-
tion, and κ is the so-called concentration parameter. As the concentration param-
eter κ approaches 0, the distribution converges to the uniform distribution; as κ
approaches infinity, the distribution tends to the point distribution concentrated in the
direction φ.
We expect an unknown number of dominant orientations plus background clutter.
Thus, we utilize a mixture of D von Mises distributions (13.6) and the uniform
distribution, i.e.,

p(α) = w0 · p0(α) + Σ_{d=1}^{D} wd · p(α | φd, κd)    (13.7)

where p0(α) = 1/(2π) denotes the uniform density.

For the estimation of the distribution parameters {φd , κd } and the weights wd , d =
0, . . . , D, we apply the expectation–maximization algorithm [10]. By considering
information theoretic criteria such as the Akaike information criterion or the Bayesian
information criterion, the number D of components can be determined [11, 12].
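For illustration, the expectation–maximization iteration for the mixture (13.7) might be organized as follows; the initialization, the fixed number of iterations, and the use of approximation (13.5) in place of the exact Bessel-function inversion are assumptions of this sketch (NumPy and SciPy are used here for brevity).

```python
import numpy as np
from scipy.special import i0

def von_mises_pdf(alpha, phi, kappa):
    """Density (13.6) of the von Mises distribution."""
    return np.exp(kappa * np.cos(alpha - phi)) / (2.0 * np.pi * i0(kappa))

def em_von_mises_mixture(alpha, num_components=2, iterations=50):
    """EM for a mixture of `num_components` von Mises densities plus one
    uniform component, cf. (13.7).  `alpha` is a 1D array of angles in radians."""
    n, d = len(alpha), num_components
    weights = np.full(d + 1, 1.0 / (d + 1))                 # w_0 ... w_D
    phis = np.linspace(0.0, 2.0 * np.pi, d, endpoint=False) # initial mean directions
    kappas = np.full(d, 1.0)
    for _ in range(iterations):
        # E-step: responsibilities; component 0 is the uniform density
        dens = np.empty((d + 1, n))
        dens[0] = 1.0 / (2.0 * np.pi)
        for k in range(d):
            dens[k + 1] = von_mises_pdf(alpha, phis[k], kappas[k])
        resp = weights[:, None] * dens
        resp /= resp.sum(axis=0)
        # M-step: weights, mean directions, concentrations
        weights = resp.sum(axis=1) / n
        for k in range(d):
            c = np.sum(resp[k + 1] * np.cos(alpha))
            s = np.sum(resp[k + 1] * np.sin(alpha))
            phis[k] = np.arctan2(s, c)
            r = min(np.hypot(c, s) / np.sum(resp[k + 1]), 0.999999)
            kappas[k] = r * (2.0 - r ** 2) / (1.0 - r ** 2)  # approximation (13.5)
    return weights, phis, kappas
```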
Along with the 2017 ICCV in Venice a research team around Y. Liu from PSU
organized a competition on symmetry recognition [2]. Among other categories, there
also was frieze recognition. Fifty images were published with manually marked
ground truth. In most of these images one frieze is marked, in some more than one
(but a small number), and in one none. For this work we use at most one ground
truth per image, the first, so we have forty-nine ground truth frieze objects. The
ground truth format for a n-member frieze is 2(n + 1) marked image locations in
row–column format. All locations are marked by hand, so that arbitrary deformations
are possible, which is mainly used to cope with perspective distortions, but there are
also examples with free-form deformations. The intention is that the image content
of each of the n-quadrangles should be similar.
Figure 13.5 actually shows one of these images and the corresponding ground truth
and detection. In each of the forty-nine images a set of primitive Gestalten is extracted
using SLIC super-pixel segmentation [13]. The number of elements depends on the
image size with preferably 200 pixels per primitive object, but not more than 2,000
primitives in total. And then an assessment-driven constant-false-alarm-rate search is
performed on each set of such primitives. It searches for shallow-hierarchy Gestalten
using the operations | (for reflection symmetry) and Σ (for frieze formation) of [14].
Search for friezes is greedy, following the maximal meaningful element rationale of
[15] and the procedures outlined in [16]. At most 500 instances are kept on each
hierarchy level. The frieze displayed in Fig. 13.5 in black color is a hierarchy 1 Σ-
Gestalt. It is the one that best fits the ground truth displayed in white. Its preceding
eight parts (which are primitives) are displayed with it. It can be seen that their
orientations are all very similar.
In each case, where a best row Gestalt is found, the statistics of the orientations of
the parts were centered to the mean orientation and recorded. All in all, in this way 234
orientations were gathered, and it is on this statistic that we estimate the parameters of
a mixture using the methods outlined above. The result is displayed in Figs. 13.6 and
13.7. More than half of the mass is uniformly distributed. There is a sharp narrow
peak component that accounts for successful examples such as the one given in
Fig. 13.5, where the orientations of the parts are very similar. Interestingly, between
such outlier and inlier components there exists an intermediate component, which is
still narrower than the default cosine-dominated assessment function presented above
as Eq. 2.3. This result suggests that on these data such heuristic default function is
sub-optimal and that a considerable performance improvement can be expected if
assessment functions of this new estimated form are used.

Fig. 13.6 Histogram of observed orientations and estimated probability density functions of the
mixture model

Fig. 13.7 Cyclic histogram and density functions of the mixture model
The very idea of hierarchical Gestalt grouping rests on its universal claim to be
valid independent of any learning data—representative always only for a portion
of the world. Such perceptual grouping should be already working to some degree
even with inputs of a kind never seen before. To this end, some of its functions—in
particular in following [7], the assessment functions—are chosen either on question-
able assumptions or rather heuristically. Some parameters in the system may have
initial default values turning out to be sub-optimal. Gestaltists never denied the value
of machine learning [17]. Always, better results can be achieved if some of these
parameters are trained using suitable data. Here, the focus was on the orien-
tation similarity assessment. It turns out that additive mixture models are required to
capture what is encountered in the orientations of the parts of true positive Gestalten.
It is essential to utilize data not selected and labeled by the authors of the system
themselves. We acknowledge the work provided by the PSU team
resulting in the fifty images and ground truth. Even if they obviously addressed other
topics, such as perspective distortion, other deformations, lighting problems, they
still labeled, what they saw as salient. Accordingly, that source allows the estima-
tion of parameters of distribution models for the improvement of the corresponding
assessment functions. Given the rather limited number (fifty images, about three
hundred data points), only a small number of mixture components could be chosen.
Two inlier components, one very narrow and one more liberal, were modeled by von
Mises distributions. The third component is a uniformly distributed component. This
copes for the cases when row aggregates do well suit the ground truth that are made
from parts whose orientations are not similar at all.

References

1. Liu J, Slota G, Zheng G, Wu Z, Park M, Lee S, Rauschert I, Liu Y (2013) Symmetry detection
from realworld images competition 2013: summary and results. In: CVPR 2013, workshops
2. Funk C, Lee S, Oswald MR, Tsokas S, Shen W, Cohen A, Dickinson S, Liu Y (2017) 2017
ICCV challenge: detecting symmetry in the wild. In: ICCV 2017, workshops
3. Kanizsa G (1980) Grammatica del vedere. Saggi su percezione e gestalt. Il Mulino
4. Grenander U (1993) General pattern theory. Oxford University Press
5. Mumford D, Desolneux A (2010) Pattern theory. CRC Press, A K Peters Ltd., Natick MA
6. Fisher NI (1995) Statistical analysis of circular data. Cambridge University Press
7. Michaelsen E, Arens M (2017) Hierarchical grouping using gestalt assessments. In: CVPR
2017, workshops, detecting symmetry in the wild
8. Michaelsen E, Meidow J (2014) Stochastic reasoning for structural pattern recognition: an example from image-based UAV navigation. Pattern Recognit 47(8):2732–2744
9. Pohl M, Meidow J, Bulatov D (2017) Simplification of polygonal chains by enforcing few
distinctive edge directions. In: Sharma P, Bianchi FM (eds) Scandinavian conference on image
analysis (SCIA). Lecture Notes in Computer Science, Part II, vol 10270, pp 1–12
10. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the
EM algorithm. J R Stat Soc Ser B 39(1):1–38
11. Akaike H (1973) Information theory and an extension of the maximum likelihood principle.
Springer, New York, pp 199–213
12. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
13. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Susstrunk S (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 34(11):2274–2281
14. Michaelsen E, Yashina VV (2014) Simple gestalt algebra. Pattern Recognit Image Anal
24(4):542–551
15. Desolneux A, Moisan L, Morel J-M (2008) From gestalt theory to image analysis: a probabilistic
approach. Springer.
16. Michaelsen E, Münch D, Arens M (2016) Searching remotely sensed images for meaningful
nested gestalten. In: ISPRS 2016
17. Sarkar S, Boyer KL (1994) Computing perceptual organization in computer vision. World
Scientific
Appendix A
General Adjustment Model with Constraints

Once a group of primitive Gestalts has been found by hypothesis generation and verification, the subsequent step is to establish the corresponding model instance by parameter estimation. These results can in turn be regarded as observations for the next level of grouping within the hierarchical agglomeration. In the following, we briefly describe a general adjustment model and the corresponding estimation procedure. The model consists of a functional model for the unknown parameters and the observations, a stochastic model for the observations, an optimization criterion, and an iterative estimation procedure for nonlinear problems.
We introduce two types of constraints for the true observations $\tilde{l}$ and the true unknown parameters $\tilde{x}$: constraints $g(\tilde{l}, \tilde{x}) = 0$ for the observations and parameters, and constraints $h(\tilde{x}) = 0$ for the parameters only. For more general models, which can also take constraints for the observations alone into account, refer to [1, 2]. The error-free observations $\tilde{l}$ are related to the observations $l$ by $\tilde{l} = l + \tilde{v}$, where the true corrections $\tilde{v}$ are unknown. Since the true values remain unknown, they are replaced by their estimates $\hat{x}$, $\hat{l}$, and $\hat{v}$. The estimated corrections are negative residuals. Thus, together we have the two constraints $g(\hat{l}, \hat{x}) = 0$ and $h(\hat{x}) = 0$.

An initial covariance matrix $\Sigma_{ll}^{(0)}$ of the observations is assumed to be known; it subsumes the statistical properties of the observations. Thus, $l$ is assumed to be normally distributed, $l \sim N(\tilde{l}, \Sigma_{ll})$, and the matrix is related to the true covariance matrix $\Sigma_{ll}$ by

$$\Sigma_{ll} = \sigma_0^2 \, \Sigma_{ll}^{(0)}$$

with the possibly unknown variance factor $\sigma_0^2$ [3]. This factor can be estimated from the estimated corrections $\hat{v}$; see A.11 below.

Optimal estimates $\hat{x}$ and $\hat{l}$ for $x$ and $l$ can be found by minimizing the following Lagrangian in the least squares manner:

$$L(\hat{v}, \hat{x}, \lambda, \mu) = \frac{1}{2} \hat{v}^{T} \Sigma_{ll}^{-1} \hat{v} + \lambda^{T} g(l + \hat{v}, \hat{x}) + \mu^{T} h(\hat{x}) \qquad (A.1)$$


Here, $\lambda$ and $\mu$ are the Lagrangian vectors. For solving this nonlinear problem in an iterative manner, we need approximate values $\hat{x}^{(0)}$ and $\hat{l}^{(0)}$ for the unknown parameter estimates $\hat{x} = \hat{x}^{(0)} + \Delta\hat{x}$ and the fitted observations $\hat{l} = \hat{l}^{(0)} + \Delta\hat{l}$. The corrections for the unknowns and the observations are obtained iteratively. With the Jacobians

$$A = \left.\frac{\partial g(l, x)}{\partial x}\right|_{\hat{x}^{(0)}, \hat{l}^{(0)}}, \qquad B = \left.\frac{\partial g(l, x)}{\partial l}\right|_{\hat{x}^{(0)}, \hat{l}^{(0)}}, \qquad H = \left.\frac{\partial h(x)}{\partial x}\right|_{\hat{x}^{(0)}} \qquad (A.2)$$

and the relation $\hat{l} = \hat{l}^{(0)} + \Delta\hat{l} = l + \hat{v}$, we obtain the linear constraints by Taylor series expansion

$$g(\hat{l}, \hat{x}) = g_0 + A \, \Delta\hat{x} + B \, \hat{v} + B \, (l - \hat{l}^{(0)}) = 0 \qquad (A.3)$$
$$h(\hat{x}) = h_0 + H \, \Delta\hat{x} = 0 \qquad (A.4)$$

with $g_0 = g(\hat{l}^{(0)}, \hat{x}^{(0)})$ and $h_0 = h(\hat{x}^{(0)})$.
Setting the partial derivatives of A.1 to zero yields the necessary conditions for a minimum:

$$\frac{\partial L}{\partial \hat{v}^{T}} = \Sigma_{ll}^{-1} \hat{v} + B^{T} \lambda = 0 \qquad (A.5)$$
$$\frac{\partial L}{\partial \lambda^{T}} = A \, \Delta\hat{x} + B \, \hat{v} + B \, (l - \hat{l}^{(0)}) + g_0 = 0 \qquad (A.6)$$
$$\frac{\partial L}{\partial \Delta\hat{x}^{T}} = A^{T} \lambda + H^{T} \mu = 0 \qquad (A.7)$$
$$\frac{\partial L}{\partial \mu^{T}} = H \, \Delta\hat{x} + h_0 = 0 \qquad (A.8)$$

Substituting $\hat{v} = -\Sigma_{ll} B^{T} \lambda$ in A.6 yields the Lagrangian vector $\lambda = \Sigma_{gg}^{-1} (A \, \Delta\hat{x} + g)$ with the contradictions $g = B \, (l - \hat{l}^{(0)}) + g_0$ and their covariance matrix $\Sigma_{gg} = B \, \Sigma_{ll} B^{T}$.
The constraints A.7 and A.8 can be collected in the system of linear equations

$$\begin{bmatrix} A^{T} \Sigma_{gg}^{-1} A & H^{T} \\ H & O \end{bmatrix} \begin{bmatrix} \Delta\hat{x} \\ \mu \end{bmatrix} = \begin{bmatrix} -A^{T} \Sigma_{gg}^{-1} g \\ -h_0 \end{bmatrix} \qquad (A.9)$$

to solve for the unknown parameter updates $\Delta\hat{x}$, and the estimated corrections are

$$\hat{v} = -\Sigma_{ll} B^{T} \Sigma_{gg}^{-1} \left( A \, \Delta\hat{x} + g \right). \qquad (A.10)$$

With the estimated corrections $\hat{v}$ we obtain the fitted observations $\hat{l} = l + \hat{v}$, and the estimate for the variance factor $\sigma_0^2$ is given by the maximum likelihood estimate [3]
$$\hat{\sigma}_0^2 = \frac{\Omega}{R} = \frac{\hat{v}^{T} \Sigma_{ll}^{-1} \hat{v}}{N - U + H} \qquad (A.11)$$

with the squared sum of residuals $\Omega$ and the redundancy $R$, computed from the number of observations $N$, the number of parameter restrictions $H$, and the number of parameters $U$.
We finally obtain the estimated covariance matrix $\hat{\Sigma}_{xx} = \hat{\sigma}_0^2 \, \Sigma_{xx}$ of the estimated parameters, where $\Sigma_{xx}$ results from the inverted normal equation matrix

$$\begin{bmatrix} \Sigma_{xx} & \cdot \\ \cdot & \cdot \end{bmatrix} = \begin{bmatrix} A^{T} \Sigma_{gg}^{-1} A & H^{T} \\ H & O \end{bmatrix}^{-1}. \qquad (A.12)$$

For nonlinear problems, the approximate values have to be improved iteratively; i.e., the estimates in the $i$th iteration are $\hat{x}^{(i)} = \hat{x}^{(i-1)} + \Delta\hat{x}$ and $\hat{l}^{(i)} = l + \hat{v}^{(i)}$. A useful stopping criterion is that the maximal change of all updates $\Delta\hat{x}_j^{(i)}$ in the $i$th iteration should be less than a certain percentage, e.g., 1%, of the corresponding standard deviation, i.e., $\max_j |\Delta\hat{x}_j^{(i)}| / \hat{\sigma}_{\hat{x}_j}^{(i)} < 0.01$.
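To make the procedure concrete, the following minimal sketch shows one possible NumPy implementation of this iteration, including the stopping criterion. The function signature and variable names are illustrative choices only; the constraint functions g and h, as well as the routine returning the Jacobians of A.2, are assumed to be supplied by the caller, and no provisions are made for rank defects or divergence.

import numpy as np

def adjust_with_constraints(l, Sigma_ll0, x0, g, h, jacobians, max_iter=50, tol=0.01):
    # adjustment with constraints g(l, x) = 0 between observations and parameters
    # and restrictions h(x) = 0 on the parameters only
    x_hat, l_hat = x0.copy(), l.copy()
    n_par = x_hat.size
    for _ in range(max_iter):
        A, B, H = jacobians(l_hat, x_hat)              # Jacobians as in A.2
        g0, h0 = g(l_hat, x_hat), h(x_hat)
        n_res = h0.size
        cg = B @ (l - l_hat) + g0                      # contradictions
        W_gg = np.linalg.inv(B @ Sigma_ll0 @ B.T)
        # normal equation system, cf. A.9
        N_mat = np.block([[A.T @ W_gg @ A, H.T],
                          [H, np.zeros((n_res, n_res))]])
        rhs = np.concatenate([-A.T @ W_gg @ cg, -h0])
        dx = np.linalg.solve(N_mat, rhs)[:n_par]
        v = -Sigma_ll0 @ B.T @ W_gg @ (A @ dx + cg)    # corrections, cf. A.10
        x_hat = x_hat + dx
        l_hat = l + v                                  # fitted observations
        Sigma_xx = np.linalg.inv(N_mat)[:n_par, :n_par]
        sigma_x = np.sqrt(np.abs(np.diag(Sigma_xx)))
        if np.all(np.abs(dx) < tol * sigma_x):         # stopping criterion
            break
    # variance factor as in A.11 and estimated parameter covariance as in A.12
    redundancy = l.size - n_par + n_res
    sigma0_sq = float(v @ np.linalg.solve(Sigma_ll0, v)) / redundancy
    return x_hat, l_hat, sigma0_sq, sigma0_sq * Sigma_xx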
Several specializations can be obtained. For problems with no restrictions for the parameters, the constraints $h(\hat{x}) = 0$ have to be omitted. For problems where the observations can be formulated explicitly as functions of the parameters, i.e., $l = f(x)$, $B = -I$ holds and everything boils down to the estimates $\Delta\hat{x} = \left( A^{T} \Sigma_{ll}^{-1} A \right)^{-1} A^{T} \Sigma_{ll}^{-1} (l - l^{(0)})$ and $\hat{v} = A \, \Delta\hat{x} - (l - l^{(0)})$ in the linear model $l - l^{(0)} + \hat{v} = A \, \Delta\hat{x}$.
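As a minimal illustration of this last specialization, assuming a design matrix A and approximate observations l0 = f(x0) are already at hand (names again illustrative), a single update step could read:

import numpy as np

def gauss_markov_update(l, l0, A, Sigma_ll0):
    # specialization for l = f(x): one step of the linearized model l - l0 + v = A dx
    W = np.linalg.inv(Sigma_ll0)
    dx = np.linalg.solve(A.T @ W @ A, A.T @ W @ (l - l0))
    v = A @ dx - (l - l0)
    return dx, v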

References

1. Meidow J, Beder C, Förstner W (2009) Reasoning with uncertain points, straight lines, and
straight line segments in 2D. ISPRS J Photogramm Remote Sens 64(2):125–139
2. Förstner W, Wrobel B (2016) Photogrammetric computer vision. Springer
3. Koch KR (1999) Parameter estimation and hypothesis testing in linear models, 2nd edn. Springer,
Berlin
Index

A
A-contrario testing, 46
Additional features, 93
Algebraic closure, 37, 85, 87, 90
Algebraic distance, 29
Assessed domain, 86
Assessment, 12, 14
  line distance residual, 119
  orientation similarity, 63
  proximity, 59, 75
  residual row, 58
  rotational fit, 75
  scale similarity, 63
Assessment feature, 13
Assessment function, 27
  normalized, 15

B
Balanced norm, 16
Balanced term, 87, 93
Ball tensor, 114
Blackboard, 104
Bonferroni inequality, 47

C
Clustering
  axes, 41
Commutativity law
  permutation, 117
Concept, 166
Conjugate rotation, 82
Conjunctive assessment combination, 35
Constant false alarm rate, 67
Constrained Multiset Grammars, 86

D
Descriptor, 152
Dihedral group, 82
Distance function, 14
Domain
  assessed, 86

E
Elation, 69
Euclidean distance, 30

G
Gestalt, 11
  assessment of, 14
  generation, 24
  search, 25
Gestalt algebra
  simple, 87
Gestalt Domain, 11
Good continuation
  along line, 111
Guided matching, 83

H
Hierarchy in perceptual grouping, 35, 85

Homology
  planar, 69
Hyper-spectral images, 156, 169

I
Illusion, 23, 107, 140
  rate, 140
Inference, 163
  abductive, 164, 171
  deductive, 164
  fuzzy, 165
Intensity
  histogram, 147

K
Key-point, 112
Knowledge, 163

L
Lattice symmetry operation
  definition, 139
Local consistency, 160
Location, 11
Location feature, 12

M
Mahalanobis distance, 15, 18
Maximally stable extremal regions, 150
Maximal meaningful set, 139
Maximum a posteriori features, 29
Mises assessments, 28
Monogenic signal, 154
Multi-modal Primitives, 154

N
Non-accidentalness, 10, 112
Non-maxima suppression, 160
Non-negativity, 15
Null property, 17

O
Observations, 29
Operation
  rotational symmetry, 77
Optimal fit, 29
Orientation, 11
Orientation feature, 12
Orthogonality, 127
Orthogonal regression, 119
Overlap assessment, 120

P
Parallelism, 127
Perceptual inference net, 10
Periodicity, 11
Primitive, 145
Production rule, 163, 167
Projective distortion, 143
Proximity, 31
Proximity assessment, 32
Proximity score function, 31

R
Rayleigh score function, 32
Rectangularity, 127, 130
Reflection orientation assessment
  definition, 28
Reflection symmetry, 23, 26
Reflection symmetry operation
  definition, 31
Residual reflection constraint assessment
  definition, 30
Rotation, 72
  conjugate, 82

S
Scale, 11
Scale feature, 12
Scale invariant feature transform, 152
Score function
  proximity, 31
  Rayleigh, 32
Search
  breadth first, 101
  lattices, 139
  recursive, 102
  stratified, 37, 101
Segmentation, 148
Self-organizing maps, 154
Semantic net, 166, 168
SIFT key-point, 33
Similarity, 33
Similarity function, 14
Similarity in scale, 34
Simple gestalt algebra, 87
Stick tensor, 114
Subadditivity, 15
Super-pixel, 37, 148
Symmetry
  point reflection, 78
Symmetry of a metric, 15

T
Tensor voting, 112, 116
Term, 86
Term tree, 87
Threshold segmentation, 145
T-Norms, 16
Total least squares, 119

U
UML, 130

W
Wallpaper groups, 135
