Contour and Texture Analysis for Image Segmentation
Abstract. This paper provides an algorithm for partitioning grayscale images into disjoint regions of coherent
brightness and texture. Natural images contain both textured and untextured regions, so the cues of contour and
texture differences are exploited simultaneously. Contours are treated in the intervening contour framework, while
texture is analyzed using textons. Each of these cues has a domain of applicability, so to facilitate cue combination we
introduce a gating operator based on the texturedness of the neighborhood at a pixel. Having obtained a local measure
of how likely two nearby pixels are to belong to the same region, we use the spectral graph theoretic framework of
normalized cuts to find partitions of the image into regions of coherent texture and brightness. Experimental results
on a wide range of images are shown.
Keywords: segmentation, texture, grouping, cue integration, texton, normalized cut
1. Introduction
Malik et al.
Figure 1. Some challenging images for a segmentation algorithm. Our goal is to develop a single grouping procedure which can deal with all
these types of images.
Figure 2. Demonstration of texture as a problem for the contour process. Each image shows the edges found with a Canny edge detector for the
penguin image using different scales and thresholds: (a) fine scale, low threshold, (b) fine scale, high threshold, (c) coarse scale, low threshold,
(d) coarse scale, high threshold. A parameter setting that preserves the correct edges while suppressing spurious detections in the textured area
is not possible.
Figure 3. Demonstration of the contour-as-a-texture problem using a real image. (a) Original image of a bald eagle. (b) The groups found
by an EM-based algorithm (Belongie et al., 1998).
2. Introducing Textons
Figure 4. Left: filter set f_i consisting of 2 phases (even and odd), 3 scales (spaced by half-octaves), and 6 orientations (equally spaced from 0 to π). The basic filter is a difference-of-Gaussian quadrature pair with 3:1 elongation. Right: 4 scales of center-surround filters. Each filter is L1-normalized for scale invariance.
f_2(x, y) = Hilbert( f_1(x, y))    (1)

where σ is the scale, ℓ is the aspect ratio of the filter, and C is a normalization constant. (The use of the Hilbert transform instead of a first derivative makes f_1 and f_2 an exact quadrature pair.) The radially symmetric portion of the filter bank consists of Difference-of-Gaussian kernels. Each filter is zero-mean and L1-normalized for scale invariance (Malik and Perona, 1990).
Now suppose that the image is convolved with such a bank of linear filters. We will refer to the collection of response images I ∗ f_i as the hypercolumn transform of the image.
Why is this useful from a computational point of view? The vector of filter outputs (I ∗ f_i)(x₀, y₀) characterizes the image patch centered at (x₀, y₀) by a set of values at a point. This is similar to characterizing an analytic function by its derivatives at a point: one can use a Taylor series approximation to find the values of the function at neighboring points. As pointed out by Koenderink and van Doorn (1987), this is more than an analogy: because of the commutativity of the operations of differentiation and convolution, the receptive fields described above are in fact computing blurred derivatives. We recommend Koenderink and van Doorn (1987, 1988), Jones and Malik (1992), and Malik and Perona (1992) for a discussion of other advantages of such a representation.
The hypercolumn transform provides a convenient front end for contour and texture analysis.
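As a concrete sketch of such a front end, the following builds an even-symmetric oriented filter (the second y-derivative of an elongated Gaussian, zero-mean and L1-normalized as described above) and stacks one response image per scale and orientation. The kernel size, the particular scale values, and the omission of the odd Hilbert-pair and center-surround filters are simplifications for illustration, not the paper's exact parameters.

```python
import numpy as np
from scipy.signal import fftconvolve

def oriented_filter(scale, theta, size=15, elong=3.0):
    """Even-symmetric oriented kernel: d^2/dy^2 of an elongated Gaussian,
    rotated by theta. Zero-mean and L1-normalized, as in the text."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-yr ** 2 / scale ** 2) * np.exp(-xr ** 2 / (elong * scale) ** 2)
    f = (4.0 * yr ** 2 / scale ** 4 - 2.0 / scale ** 2) * g  # second y-derivative
    f -= f.mean()                  # enforce zero mean
    return f / np.abs(f).sum()     # L1-normalize for scale invariance

def hypercolumn(image, scales=(1.4, 2.0, 2.8), n_orient=6):
    """Hypercolumn transform: one response image per (scale, orientation)."""
    responses = [fftconvolve(image, oriented_filter(s, k * np.pi / n_orient),
                             mode="same")
                 for s in scales for k in range(n_orient)]
    return np.stack(responses, axis=-1)    # shape (H, W, number of filters)
```

At each pixel, the last axis of the returned array is exactly the vector of filter outputs discussed above.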
Textons
Though the representation of textures using filter responses is extremely versatile, one might say that it is overly redundant: each pixel is represented by N_fil real-valued filter responses, where N_fil is 40 for our particular filter set. Moreover, we are characterizing textures, entities that by definition have spatially repeating properties, so we do not expect the filter responses to be totally different at each pixel over the texture. Rather, there should be several distinct filter response vectors, with all others being noisy variations of them.
This observation leads to our proposal of clustering the filter responses into a small set of prototype response vectors. We call these prototypes textons. Algorithmically, each texture is analyzed using the filter bank shown in Fig. 4. Each pixel is thus transformed to an N_fil-dimensional vector of filter responses. These vectors are clustered using K-means. The criterion for this algorithm is to find K centers such that, after assigning each data vector to the nearest center, the sum of the squared distances from the centers is minimized.
Figure 5. (a) Polka-dot image. (b) Textons found via K-means with K = 25, sorted in decreasing order by norm. (c) Mapping of pixels to the
texton channels. The dominant structures captured by the textons are translated versions of the dark spots. We also see textons corresponding
to faint oriented edge and bar elements. Notice that some channels contain activity inside a textured region or along an oriented contour and
nowhere else.
Figure 6. (a) Penguin image. (b) Textons found via K-means with K = 25, sorted in decreasing order by norm. (c) Mapping of pixels to the
texton channels. Among the textons we see edge elements of varying orientation and contrast along with elements of the stochastic texture in
the rocks.
nearly uniform brightness. The pixel-to-texton mapping is shown in Fig. 5(c). Each subimage shows the
pixels in the image that are mapped to the corresponding texton in Fig. 5(b). We refer to this collection of
discrete point sets as the texton channels. Since each
pixel is mapped to exactly one texton, the texton channels constitute a partition of the image.
Textons and texton channels are also shown for the
penguin image in Fig. 6. Notice in the two examples
how much the texton set can change from one image
to the next. The spatial characteristics of both the deterministic polka dot texture and the stochastic rocks
texture are captured across several texton channels. In
general, the texture boundaries emerge as point density
changes across the different texton channels. In some cases, a texton channel contains activity inside a particular textured region and nowhere else. By comparison, vectors of filter outputs generically respond with some value at every pixel, a considerably less clean alternative.
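The clustering step that produces textons and texton channels can be sketched as follows. This is a bare-bones Lloyd's K-means with farthest-point initialization, a minimal stand-in for whatever production K-means implementation one prefers; the function and parameter names are illustrative.

```python
import numpy as np

def compute_textons(responses, K=25, n_iter=20):
    """Cluster N_fil-dimensional filter-response vectors into K textons.
    responses: (H, W, N_fil) array of filter outputs.
    Returns (textons, label_image): the K prototype vectors and the
    pixel-to-texton map whose level sets are the texton channels."""
    H, W, d = responses.shape
    X = responses.reshape(-1, d).astype(float)
    # Farthest-point initialization: start anywhere, then repeatedly take
    # the vector farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every vector to its nearest center ...
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # ... then move each center to the mean of its members.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return centers, labels.reshape(H, W)
```

Since each pixel receives exactly one label, the label image is a partition of the image into texton channels, as described above.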
Figure 7. Illustration of scale selection. (a) Closeup of Delaunay triangulation of pixels in a particular texton channel for polka dot image. (b)
Neighbors of thickened point for pixel at center. The thickened point lies within inner circle. Neighbors are restricted to lie within outer circle.
(c) Selected scale based on median of neighbor edge lengths, shown by circle, with all pixels falling inside circle marked with dots.
h_i(k) = Σ_{j∈W(i)} I[T(j) = k]    (2)

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(B, A)/assoc(B, V) = yᵀ(D − W)y / (yᵀ D y),  with cut(A, B) = Σ_{i∈A, j∈B} W_ij    (3)

(here D is the diagonal matrix with D_ii = Σ_j W_ij, and assoc(A, V) = Σ_{i∈A, j∈V} W_ij)
where y ∈ {a, b}^N is a binary indicator vector specifying the group identity of each pixel, i.e. y_i = a if pixel i belongs to group A and y_j = b if pixel j belongs to B, and N is the number of pixels. Notice that the above expression is a Rayleigh quotient. If we relax y to take on real values (instead of the two discrete values), we can optimize Eq. (3) by solving a generalized eigenvalue system. Efficient algorithms with polynomial running time are well known for solving such problems.

The process of transforming the vector y into a discrete bipartition, and the generalization to more than two groups, is discussed in Section 5.
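As a sketch of this relaxation, the generalized eigensystem (D − W)y = λDy can be solved directly for a small graph. The toy graph below (two 3-node cliques joined by a weak edge, with an arbitrary illustrative link weight of 0.05) is an assumption for demonstration; the second-smallest eigenvector separates the two natural groups.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_second_eigenvector(W):
    """Relaxed normalized cut: solve (D - W) y = lambda * D y and return the
    eigenvector of the second-smallest eigenvalue (the smallest belongs to
    the uninformative constant vector)."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)   # symmetric-definite problem; ascending order
    return vecs[:, 1]

# Toy graph: two 3-node cliques joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.05          # weak link between the two groups
y = ncut_second_eigenvector(W)
groups = y > np.median(y)         # crude bipartition of the relaxed solution
```

For image-sized graphs one would of course use a sparse eigensolver rather than a dense one; this only illustrates the algebraic structure.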
4.
Figure 8. Left: the original image. Middle: part of the image marked by the box. The intensity values at pixels p1, p2 and p3 are similar. However, there is a contour in the middle, which suggests that p1 and p2 belong to one group while p3 belongs to another. Just comparing intensity values at these three locations would mistakenly suggest that they belong to the same group. Right: orientation energy. Somewhere along l2 the orientation energy is strong, correctly proposing that p1 and p3 belong to two different partitions, while the orientation energy along l1 is weak throughout, supporting the hypothesis that p1 and p2 belong to the same group.
where M_ij is the set of local maxima along the line joining pixels i and j. Recall from (2) that p_con(x), 0 < p_con < 1, is nearly 1 whenever the oriented energy maximum at x is sufficiently above the noise level. In words, two pixels will have a weak link between them if there is a strong local maximum of orientation energy along the line joining them. Conversely, if there is little energy, for example in a constant brightness region, the link between the two pixels will be strong. Contours measured at different scales can be taken into account by computing the orientation energy maxima at various scales and setting p_con to be the maximum over all the scales at each pixel.
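A simplified version of this intervening-contour weight can be written by sampling p_con along the segment joining two pixels. Here p_con is assumed to be a precomputed 2D map, and uniform sampling of the segment stands in for the set of local maxima M_ij; both simplifications are for illustration only.

```python
import numpy as np

def intervening_contour_weight(p_con, i, j, n_samples=20):
    """Sketch of W_ij^IC: sample p_con along the segment joining pixels
    i and j; the strongest contour point on the way weakens the link.
    p_con: 2D map with values in [0, 1]; i, j: (row, col) coordinates."""
    t = np.linspace(0.0, 1.0, n_samples)
    rows = np.round(i[0] + t * (j[0] - i[0])).astype(int)
    cols = np.round(i[1] + t * (j[1] - i[1])).astype(int)
    return 1.0 - p_con[rows, cols].max()
```

With a strong vertical contour between two pixels, the weight is small; with no intervening energy, the link stays at full strength.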
4.2.
χ²(h_i, h_j) = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / (h_i(k) + h_j(k))

where h_i and h_j are the two histograms. For an empirical comparison of the χ² test versus other texture similarity measures, see Puzicha et al. (1997).
W_ij^TX is then defined as follows:

W_ij^TX = exp(−χ²(h_i, h_j)/σ_TX)    (4)
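The χ² distance between texton histograms and the resulting texture weight are straightforward to implement. In the sketch below, sigma_tx is a free scale parameter chosen for illustration, not the paper's tuned value, and bins empty in both histograms are taken to contribute zero.

```python
import numpy as np

def chi_square(h_i, h_j):
    """Chi-square distance between two texton histograms.
    Bins that are empty in both histograms contribute zero."""
    num = (h_i - h_j) ** 2
    den = h_i + h_j
    mask = den > 0
    return 0.5 * (num[mask] / den[mask]).sum()

def texture_weight(h_i, h_j, sigma_tx=0.1):
    """W_ij^TX = exp(-chi2(h_i, h_j) / sigma_tx), as in Eq. (4);
    sigma_tx is an illustrative free parameter."""
    return np.exp(-chi_square(h_i, h_j) / sigma_tx)
```

Identical histograms give a weight of 1, and the weight decays toward 0 as the histograms diverge.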
General Images
χ²_LR = (1/2) Σ_{k=1}^{K} [h_L(k) − h_R(k)]² / (h_L(k) + h_R(k))    (5)

p_texture = 2 / (1 + exp(χ²_LR/σ))    (6)
Figure 10. Gating the contour cue. Left: original image. Top: oriented energy after non-maximal suppression, OE. Bottom: 1 − p_texture. Right: p_B, the product of 1 − p_texture and p_con = 1 − exp(−OE/σ_IC). Note that this can be thought of as a soft edge detector which has been modified to no longer fire on texture regions.
Figure 11. Gating the texture cue. Left: original image. Top: texton labels, shown in pseudocolor. Middle: local scale estimate α(i). Bottom: 1 − p_texture; darker grayscale indicates larger values. Right: local texton histograms at scale α(i) are gated using p_texture as explained in Section 4.3.3.
h_i(0) = N_B + Σ_{j∈N(i)} (1 − p_texture(j))
F(x) ← F(x) · log(1 + |F(x)|/0.03) / |F(x)|
Note that these parameters are the same for all the results shown in Section 6.
5.
With a properly defined weight matrix, the normalized cut formulation discussed in Section 3 can be used to compute the segmentation. However, the weight matrix defined in the previous section is computed using
only local information, and is thus not perfect. The
ideal weight should be computed in such a way that
region boundaries are respected. More precisely, (1)
texton histograms should be collected from pixels in a
window residing exclusively in one and only one region. If instead, an isotropic window is used, pixels
near a texture boundary will have a histogram computed from textons in both regions, thus polluting
the histogram. (2) Intervening contours should only be
considered at region boundaries. Any responses to the
filters inside a region are either caused by texture or are
simply mistakes. However, these two criteria mean that
we need a segmentation of the image, which is exactly
the reason why we compute the weights in the first
place! This chicken-and-egg problem suggests an iterative framework for computing the segmentation. First,
use the local estimation of the weights to compute a segmentation. This segmentation is done so that no region
boundaries are missed, i.e. it is an over-segmentation.
Next, use this initial segmentation to update the weights.
Since the initial segmentation does not miss any region
boundaries, we can coarsen the graph by merging all
the nodes inside a region into one super-node. We can
then use these super-nodes to define a much simpler
segmentation problem. Of course, we can continue this
iteration several times. However, we elect to stop after one iteration.
The procedure consists of the following 4 steps:
1. Compute an initial segmentation from the locally
estimated weight matrix.
2. Update the weights using the initial segmentation.
3. Coarsen the graph with the updated weights to reduce the segmentation to a much simpler problem.
4. Compute a final segmentation using the coarsened
graph.
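Step 3, the coarsening, can be sketched as follows: every pixel of an initial region collapses into one super-node, and the weight between two super-nodes is the total weight crossing between their regions. The dense O(R²) loop is an illustrative simplification, acceptable because the number of initial regions is small.

```python
import numpy as np

def coarsen_graph(W, region_labels):
    """Merge all pixels of each initial region into one super-node.
    W: (N, N) pixel-to-pixel weight matrix.
    region_labels: length-N array giving each pixel's initial region.
    Returns the (R, R) super-node weight matrix, R = number of regions."""
    regions = np.unique(region_labels)
    R = len(regions)
    Wc = np.zeros((R, R))
    for a in range(R):
        for b in range(R):
            ia = region_labels == regions[a]
            ib = region_labels == regions[b]
            # total weight between region a and region b
            Wc[a, b] = W[np.ix_(ia, ib)].sum()
    return Wc
```

The final normalized cut is then computed on this much smaller matrix.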
5.1.
Figure 12.
It should be noted that this strategy for using multiple eigenvectors to provide an initial over-segmentation is merely one of a set of possibilities. Alternatives include recursive splitting using the second eigenvector, or first converting the eigenvectors into binary-valued vectors and using those simultaneously, as in Shi and Malik (2000). Yet another hybrid strategy is suggested in Weiss (1999). We hope that improved theoretical insight into spectral graph partitioning will give us a better way to make this presently somewhat ad hoc choice.
5.2. Updating Weights
Figure 13. Initial segmentation of the image used for coarsening the graph and computing the final segmentation.
Figure 14.
Figure 15.
5.4.
After coarsening the graph, we have turned the segmentation problem into a very simple graph partitioning
problem of very small size. We compute the final segmentation using the following procedure:
1. Compute the second smallest eigenvector for the generalized eigensystem using W.
2. Threshold the eigenvector to produce a bipartitioning of the image. Thirty different values uniformly spaced within the range of the eigenvector are tried as thresholds, and the one producing the partition that minimizes the normalized cut value is chosen. The corresponding partition is the best way to segment the image into two regions.
3. Recursively repeat steps 1 and 2 for each of the
partitions until the normalized cut value is larger
than 0.1.
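The threshold sweep in step 2 can be sketched directly from the definition of the normalized cut. The toy weight matrix and the stand-in eigenvector y below are illustrative assumptions, not output of the full system.

```python
import numpy as np

def ncut_value(W, mask):
    """Normalized cut value of the bipartition (mask, ~mask) of graph W."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()

def best_threshold_partition(W, y, n_thresh=30):
    """Sweep n_thresh thresholds uniformly over the range of the relaxed
    eigenvector y; keep the bipartition with minimum normalized cut."""
    best_mask, best_val = None, np.inf
    for t in np.linspace(y.min(), y.max(), n_thresh):
        mask = y > t
        if mask.all() or not mask.any():
            continue                  # skip degenerate one-sided partitions
        val = ncut_value(W, mask)
        if val < best_val:
            best_mask, best_val = mask, val
    return best_mask, best_val
```

Recursion then continues on each side of the chosen bipartition until the minimum normalized cut value exceeds the stopping threshold (0.1 above).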
Figure 16.
5.5. Segmentation in Windows
Figure 17. Segmentation of paintings.
6. Results
We have run our algorithm on a variety of natural images. Figures 14-17 show typical segmentation results. In all cases, the regions are cleanly separated from each other using combined texture and contour cues. Notice that a single set of parameters is used for all of these images. Color is not used in any of these examples; it could readily be included to further improve the performance of our algorithm.8 Figure 14 shows
results for animal images. Results for images containing people are shown in Fig. 15 while natural and
man-made scenes appear in Fig. 16. Segmentation results for paintings are shown in Fig. 17. A set of
more than 1000 images from the commercially available Corel Stock Photos database have been segmented
using our algorithm.9
Evaluating the results against ground truth (what is the correct segmentation of the image?) is a challenging problem, because there may not be a single correct segmentation and segmentations can be made at varying levels of granularity. We do not address this
problem here; a start has been made in recent work in
our group (Martin et al., 2000).
Computing times for a C++ implementation of the entire system are under two minutes for images of size 108 × 176 pixels on a 750 MHz Pentium III machine.
There is some variability from one image to another
because the eigensolver can take more or less time to
converge depending on the image.
Notes

…necessary. Merging in this manner decreases the number of channels needed but necessitates the use of phase-shift information.
3. This is set to 3% of the image dimension in our experiments. This is tied to the intermediate scale of the filters in the filter set.
4. This is set to 10% of the image dimension in our experiments.
5. Finding the true optimal partition is an NP-hard problem.
6. The eigenvector corresponding to the smallest eigenvalue is constant, and thus useless.
7. Since normalized cut can be interpreted as a spring-mass system (Shi and Malik, 2000), this normalization comes from the equipartition theorem in classical statistical mechanics, which states that if a system is in equilibrium, then it has equal energy in each mode (Belongie and Malik, 1998).
8. When color information is available, the similarity W_ij becomes a product of three terms: W_ij = W_ij^IC · W_ij^TX · W_ij^COLOR. Color similarity, W_ij^COLOR, is computed using χ² differences over color histograms, analogous to the way texture similarity is measured using texton histograms. Moreover, color can be clustered into colorons, analogous to textons.
9. These results are available at the following web page: http://www.cs.berkeley.edu/projects/vision/Grouping/overview.html

7. Conclusion
References
Belongie, S., Carson, C., Greenspan, H., and Malik, J. 1998. Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Proc. 6th Int. Conf. Computer Vision, Bombay, India, pp. 675-682.
Belongie, S. and Malik, J. 1998. Finding boundaries in natural images: A new method using point descriptors and area completion. In Proc. 5th Euro. Conf. Computer Vision, Freiburg, Germany, pp. 751-766.
Binford, T. 1981. Inferring surfaces from images. Artificial Intelligence, 17(1-3):205-244.
Canny, J. 1986. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679-698.
Chung, F. 1997. Spectral Graph Theory. AMS: Providence, RI.
DeValois, R. and DeValois, K. 1988. Spatial Vision. Oxford University Press: New York, NY.
Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley & Sons: New York, NY.
Elder, J. and Zucker, S. 1996. Computing contour closures. In Proc. Euro. Conf. Computer Vision, Vol. I, Cambridge, England, pp. 399-412.
Fogel, I. and Sagi, D. 1989. Gabor filters as texture discriminator. Biological Cybernetics, 61:103-113.
Geman, S. and Geman, D. 1984. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721-741.
Gersho, A. and Gray, R. 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers: Boston, MA.
Heeger, D.J. and Bergen, J.R. 1995. Pyramid-based texture analysis/synthesis. In Proceedings of SIGGRAPH 95, pp. 229-238.
Jacobs, D. 1996. Robust and efficient detection of salient convex groups. IEEE Trans. Pattern Anal. Mach. Intell., 18(1):23-37.
Jones, D. and Malik, J. 1992. Computational framework to determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699-708.
Morrone, M. and Owens, R. 1987. Feature detection from local energy. Pattern Recognition Letters, 6:303-313.
Mumford, D. and Shah, J. 1989. Optimal approximations by piecewise smooth functions, and associated variational problems. Comm. Pure Appl. Math., 42:577-684.
Parent, P. and Zucker, S. 1989. Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Mach. Intell., 11(8):823-839.
Perona, P. and Malik, J. 1990. Detecting and localizing edges composed of steps, peaks and roofs. In Proc. 3rd Int. Conf. Computer Vision, Osaka, Japan, pp. 52-57.
Puzicha, J., Hofmann, T., and Buhmann, J. 1997. Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 267-272.
Raghu, P., Poongodi, R., and Yegnanarayana, B. 1997. Unsupervised texture classification using vector quantization and deterministic relaxation neural network. IEEE Transactions on Image Processing, 6(10):1376-1387.
Shaashua, A. and Ullman, S. 1988. Structural saliency: The detection of globally salient structures using a locally connected network. In Proc. 2nd Int. Conf. Computer Vision, Tampa, FL, USA, pp. 321-327.
Shi, J. and Malik, J. 1997. Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 731-737.
Shi, J. and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888-905.
Weiss, Y. 1999. Segmentation using eigenvectors: A unifying view. In Proc. IEEE Intl. Conf. Computer Vision, Vol. 2, Corfu, Greece, pp. 975-982.
Wertheimer, M. 1938. Laws of organization in perceptual forms (partial translation). In A Sourcebook of Gestalt Psychology, W. Ellis (Ed.). Harcourt Brace and Company, pp. 71-88.
Williams, L. and Jacobs, D. 1995. Stochastic completion fields: A neural model of illusory contour shape and salience. In Proc. 5th Int. Conf. Computer Vision, Cambridge, MA, pp. 408-415.
Young, R.A. 1985. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles. Technical Report GMR-4920, General Motors Research.