Statistical Modeling of Texture Sketch: Abstract. Recent Results On Sparse Coding and Independent Component
1 Introduction
As argued by Mumford (1996) and many other researchers, the problem of vision
can be posed in the framework of statistical modeling and inferential computing.
That is, top-down generative models can be constructed to represent a visual
system’s knowledge in the form of probability distributions of the observed im-
ages as well as variables that describe the visual world, then visual learning and
perception become a statistical inference (and model selection) problem that can
be solved in principle by computing the likelihood or posterior distribution. To
guide reliable inference, the generative models should be realistic, and this can
be checked by visually examining random samples generated by the models.
Recently, there has been some progress on modeling textures. Most of the
recent models involve linear filters for extracting local image features, and the
texture patterns are characterized by statistics of local features. In particular, in-
spired by the work of Heeger and Bergen (1995), Zhu, Wu, and Mumford (1997)
and Wu, Zhu, and Liu (2000) developed a self-consistent statistical theory for
texture modeling; borrowing results from statistical mechanics, they showed that
a class of Markov random field models is a natural choice under the assumption
A. Heyden et al. (Eds.): ECCV 2002, LNCS 2352, pp. 240–254, 2002.
c Springer-Verlag Berlin Heidelberg 2002
In this article, we shall isolate the problem of modeling the sketch of the texture
images, while assuming that the sketch can be obtained from the image. Our
sketch model is a causal Markov chain model whose conditional distributions
are characterized by a set of simple geometrical feature statistics automatically
selected from a pre-defined vocabulary.
Embellished versions of our model can be useful in the following regards. In
computer vision, it can be used for image segmentation and perceptual grouping.
In computer graphics, as the sketch captures the geometrical essence of images, it
may be used for non-photo realistic rendering. For understanding human vision,
it provides a model-based representational theory for Marr’s primal sketch (Marr
1982).
The essential idea of sparse coding (Olshausen and Field, 1996), independent
component analysis (Bell and Sejnowski, 1995), and their combination (Lewicki
and Olshausen, 1999) is the assumption that an image I can be represented as
the superposition of a set of image bases. The bases are selected from an over-
complete basis (vocabulary) {b(x,y,l,θ,e) }, where (x, y) is the central position of
the base on the image domain, and (l, θ, e) is the type of the base. l is the scale
or length of the base, θ the orientation, and e the indicator for even/odd bases.
The DC components of the bases are 0, and the l2 norm of each base is 1.
Therefore,
I = Σ_{(x,y,l,θ,e)} c_{(x,y,l,θ,e)} b_{(x,y,l,θ,e)} + N(0, σ²),        (1)
Since the basis is over-complete, the coefficients have to be inferred by
sampling a posterior probability. In this section, we extend the heuristic matching
pursuit algorithm of Mallat and Zhang (1993) to a more principled Markov chain
Monte Carlo (MCMC) algorithm.
We sample the posterior distribution of the coefficients p({c(x,y,l,θ,e) } | I) in
Model (1) in order to find a symbolic representation or a sketch of image I. We
assume the parameters (ρ, τ 2 , σ 2 ) are known for this model. First, let’s consider
the Gibbs sampler for posterior sampling. For simplicity, we use j or i to index
(x, y, l, θ, e) and we define zj = 1 if bj is active, i.e., cj ≠ 0, and zj = 0 otherwise.
Then the algorithm is as follows:
1. Randomly select a base bj . Compute R = I − Σ_{i≠j} ci bi , i.e., the residual
image. Let rj = ⟨R, bj⟩/(1 + σ²/τ²), and σ̃² = 1/(1/σ² + 1/τ²).
2. Compute the Bayes factor by integrating out cj ,
The problem with this Gibbs sampler is that if σ 2 is small, the algorithm
is too willing to activate the base bj even though the response rj is not that
large, or in other words, the algorithm is too willing to jump into a local energy
minimum. The idea of matching pursuit of Mallat and Zhang (1993) can be
used to remedy this problem. That is, instead of randomly selecting a base bj to
update, we randomly choose a window W on the image domain, and look at all
the inactive bases within this window W . Then we sample one base by letting
these bases compete for an entry. So we have the following windowed-Gibbs
sampler or a Metropolized matching pursuit algorithm:
For the simple model (1), if we let σ² → 0, then the Metropolis algorithm
described above reduces to the windowed version of the matching pursuit algorithm.
In this paper, we shall just use the latter algorithm for simplicity. We feel that
the MCMC version of matching pursuit is useful in two aspects. Conceptually, it
helps us to understand matching pursuit as a limiting MCMC for posterior sam-
pling. Practically, we believe that the MCMC version, especially with the moves
for updating coefficients, positions, lengths, orientations of the active bases, is
useful for us to re-estimate the image sketch after we fit a better prior model for
sketch.
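In the σ² → 0 limit, the core of the (unwindowed) matching pursuit procedure can be sketched in a few lines. The random orthonormal dictionary, the stopping threshold, and the 1-D toy signal below are illustrative assumptions, not the paper's actual even/odd base vocabulary:

```python
import numpy as np

def matching_pursuit(image, bases, n_iter=50, min_response=1e-3):
    """Greedy matching pursuit (Mallat and Zhang, 1993): repeatedly pick
    the base with the largest response to the current residual and
    subtract its contribution. Bases are the unit-norm rows of `bases`."""
    residual = image.astype(float).copy()
    sketch = []                              # list of (base index, coefficient)
    for _ in range(n_iter):
        responses = bases @ residual         # <R, b_j> for each base
        j = int(np.argmax(np.abs(responses)))
        c = float(responses[j])
        if abs(c) < min_response:            # no base responds strongly: stop
            break
        residual -= c * bases[j]
        sketch.append((j, c))
    return sketch, residual

# Toy 1-D "image" built from two bases of a random orthonormal dictionary;
# matching pursuit recovers both bases and drives the residual to zero.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
bases = q.T                                  # rows are orthonormal, unit-norm
signal = 3.0 * bases[2] + 1.5 * bases[5]
sketch, residual = matching_pursuit(signal, bases)
```

With a genuinely over-complete dictionary the responses are correlated and the greedy choice matters; the orthonormal case is only the cleanest demonstration of the subtract-and-repeat loop.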
Now we shall improve the simple prior model (1) with a more sophisticated sketch model
that accounts for the spatial arrangements of image bases.
We may use the following two representations interchangeably for the sketch
of a texture image I, and let’s denote the sketch by S.
1. A list: let n be the number of active bases, then we have
If δx,y = 0, i.e., there is no active base, then all the (c, l, θ, e) take null values.
Using the first representation, the two-level top-down model is of the follow-
ing form:
p(S | Λ) = (1/Z(Λ)) exp{ λ0 n + Σ_t λ1(s_t) + Σ_{t1 ∼ t2} λ2(s_{t1}, s_{t2}) },        (4)
where t1 ∼ t2 means that st1 and st2 are neighbors. So this model is a pair-
potential Gibbs point process model (e.g., Stoyan, Kendall, and Mecke, 1985). Guo
et al. (2001) further parameterized the model by introducing a small set of
Gestalt features (Koffka, 1935) for spatial arrangement. Again, the model can
be justified by the maximum entropy principle.
The Gibbs models (3) and (4) are analytically intractable because of the in-
tractability of the normalizing constant Z(Λ). Sampling and maximum likelihood
estimation (MLE) have to be done by MCMC algorithms.
2. Causal Markov models. One way to get around this difficulty is to use a
causal Markov model. The causal methods have been used extensively in early
work on texture generation, most notably, the causal model of Popat and Picard
(1993). In the causal model, the joint distribution of the image I is factorized
into a sequence of conditional distributions by imposing a linear order on all the
pixels (x, y), e.g.,
p(I) = Π_{x=1}^{m} Π_{y=1}^{n} p(I_{x,y} | I_{N(x,y)}),
where x, y index the pixel, and N (x, y) is the neighboring pixels within a certain
spatial distance that are scanned before (x, y). The causal model is analytically
tractable because of the factorization form.
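A minimal sketch of this causal factorization, assuming a binary image and a toy neighborhood of just the left and upper pixels (the paper's neighborhood is a larger causal window):

```python
import numpy as np

def causal_log_likelihood(image, cond_prob):
    """Log-likelihood of a binary image under a causal Markov model,
    p(I) = prod_{x,y} p(I[x,y] | causal neighbors). The neighborhood here
    is just the left and upper pixels (an illustrative choice);
    cond_prob maps (left, up) -> P(pixel = 1). Neighbors outside the
    image domain default to 0."""
    m, n = image.shape
    logp = 0.0
    for x in range(m):
        for y in range(n):
            left = int(image[x, y - 1]) if y > 0 else 0
            up = int(image[x - 1, y]) if x > 0 else 0
            p1 = cond_prob[(left, up)]
            logp += np.log(p1 if image[x, y] == 1 else 1.0 - p1)
    return logp

# Under a uniform conditional, every pixel contributes log(1/2).
img = np.array([[1, 0], [0, 1]])
uniform = {(l, u): 0.5 for l in (0, 1) for u in (0, 1)}
ll = causal_log_likelihood(img, uniform)
```

The point of the factorization is visible in the code: the likelihood is an explicit product over pixels, with no intractable normalizing constant.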
The causal scheme has been successfully incorporated in example-based texture
synthesis by Efros and Leung (1999), Liang et al. (2001), and Efros and Freeman
(2001). Hundreds of realistic textures have been synthesized.
With the above preparation, we are ready to describe our model for image sketch.
Let S be the sketch of I, and let SN (x,y) be the sketch of the causal neighborhood
of (x, y). Recall that both S and SN (x,y) have two representations. Our model
is of the following causal form
p(S) = Π_{x=1}^{m} Π_{y=1}^{n} p(s_{x,y} | S_{N(x,y)}).
where Z is the normalizing constant; lx,y and θx,y take on the null values when
δx,y = 0, in which case λ1(·) and λ2(·) are 0. If λ1(·) and λ2(·) are always 0,
then the model reduces to the simple Poisson model (1). As in the FRAME model
(Zhu, et al. 1997) and the Gestalt model (Guo, et al. 2001), this conditional
distribution can be derived or justified as the maximum entropy distribution
that reproduces the probability that there exists a linelet, the distribution of the
orientation and length of the linelet, and the pair-wise configuration made up
by this linelet and a nearby existing linelet.
In this model, the probability that we sketch a linelet at (x, y) depends on
the attributes of this linelet and, more importantly, on how this linelet lines up
with existing linelets nearby, for instance, whether the proposed linelet connects
with a nearby existing linelet, or whether the proposed linelet is parallel with a
nearby existing linelet, etc. One can envisage this conditional model as modeling
the way an artist sketches a picture by adding one stroke at a time. Similar
maximum entropy models are also used in language modeling [3].
We can also write the conditional model in a log-additive form:

log [ p(δ_{x,y} = 1, l_{x,y}, θ_{x,y} | S_{N(x,y)}) / p(δ_{x,y} = 0 | S_{N(x,y)}) ]
    = λ0 + λ1(l_{x,y}, θ_{x,y}) + Σ_{s_t ∈ S_{N(x,y)}} λ2(l_{x,y}, θ_{x,y}; l_t, θ_t; x_t − x, y_t − y).
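The log-additive form evaluates as a simple sum of potentials. The potential functions lam0, lam1, lam2 below are made-up toy choices (rewarding long linelets and parallel neighbors), not fitted values from the model:

```python
def linelet_log_odds(l, theta, neighbors, lam0, lam1, lam2):
    """Log-odds of sketching a linelet with attributes (l, theta) at a
    pixel, given the causal neighborhood: lam0 + lam1(l, theta) plus
    one lam2 term per existing linelet (l_t, theta_t, dx, dy)."""
    score = lam0 + lam1(l, theta)
    for (lt, tt, dx, dy) in neighbors:
        score += lam2(l, theta, lt, tt, dx, dy)
    return score

# Toy potentials (hypothetical, not fitted): lam1 rewards longer linelets,
# lam2 rewards parallel neighbors and penalizes others.
lam0 = -2.0
lam1 = lambda l, theta: 0.5 * l
lam2 = lambda l, th, lt, tt, dx, dy: 1.0 if abs(th - tt) < 0.1 else -1.0

# Two existing linelets nearby: one parallel (theta = 0), one perpendicular.
neighbors = [(2, 0.0, 1, 0), (2, 1.57, 0, 1)]
score = linelet_log_odds(3, 0.0, neighbors, lam0, lam1, lam2)
```

The "artist adding one stroke at a time" reading is direct: each already-drawn linelet contributes one additive vote for or against the proposed stroke.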
One may argue that a causal model for a spatial point process is very con-
trived. We agree. A causal order is physically nonsensical. But for the purpose
of representing visual knowledge, it has some conceptual advantages because of
its analytical tractability. The situation is very similar to view-based methods
for object recognition. Moreover, the model is also suitable for the purpose of
coding and compression. Mathematically, one may view this model as a causal
(or factorization) approximation to the Gestalt model of Guo et al. (2001). It
is expected that the causal approximation loses some of the expressive power of
the non-causal model, but this may be compensated by making the causal model
more redundant.
The current form of the model only accounts for pair-wise configurations of
the linelets. We can easily extend the model to account for configurations that
involve more than two linelets.
In our work, H(sx,y | SN (x,y) ) has one to two hundred components. It can be
easily shown that p(sx,y | SN (x,y) ) achieves maximum entropy under the con-
straints on ⟨H(sx,y | SN (x,y) )⟩_{p(sx,y | SN (x,y))} , where ⟨·⟩_p means expectation
with respect to distribution p. The full model is
p(S | Λ) = Π_{x=1}^{m} Π_{y=1}^{n} p(s_{x,y} | S_{N(x,y)})
         = { Π_{x=1}^{m} Π_{y=1}^{n} 1/Z(Λ | S_{N(x,y)}) } × exp{ Σ_{x=1}^{m} Σ_{y=1}^{n} ⟨Λ, H(s_{x,y} | S_{N(x,y)})⟩ }.
Now, let’s consider model fitting. Let Sobs be the observed sketch of an image.
Then we can estimate Λ by maximizing the log-likelihood
l(Λ | S^obs) = Σ_{x,y} { ⟨Λ, H(s^obs_{x,y} | S^obs_{N(x,y)})⟩ − log Z(Λ | S^obs_{N(x,y)}) }.

The first derivative is

∂l(Λ | S^obs)/∂Λ = Σ_{x,y} { H(s^obs_{x,y} | S^obs_{N(x,y)}) − ⟨H(s_{x,y} | S^obs_{N(x,y)})⟩_{p(s_{x,y} | S^obs_{N(x,y)}, Λ)} },

and the second derivative is

∂²l(Λ | S^obs)/∂Λ² = − Σ_{x,y} Var_{p(s_{x,y} | S^obs_{N(x,y)}, Λ)} [ H(s_{x,y} | S^obs_{N(x,y)}) ].
Therefore, the log-likelihood is concave, and the model can be fitted by
Newton-Raphson or, equivalently in this case, the Fisher scoring algorithm:

Λ_{t+1} = Λ_t − [ ∂²l(Λ | S^obs)/∂Λ² |_{Λ_t} ]^{-1} ∂l(Λ | S^obs)/∂Λ |_{Λ_t}.
The convergence of Newton-Raphson is very fast; usually 5 iterations already
give a very good fit.
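As a one-parameter stand-in for the vector update above, consider fitting λ0 alone for a binary variable. The gradient is the observed-minus-expected count and the Hessian is the negative variance, exactly matching the formulas above, and Newton-Raphson indeed converges to machine precision within about 5 iterations:

```python
import math

def newton_fit(h_obs, n, lam=0.0, n_iter=5):
    """Newton-Raphson for a one-parameter maximum-entropy model of a
    binary variable: l(lam) = lam*h_obs - n*log(1 + exp(lam)).
    A toy stand-in for the vector-valued update in the paper."""
    for _ in range(n_iter):
        p = math.exp(lam) / (1.0 + math.exp(lam))  # model expectation E[H]
        grad = h_obs - n * p                       # observed - expected
        hess = -n * p * (1.0 - p)                  # negative variance
        lam -= grad / hess                         # Newton update
    return lam

# With 30 of 100 sites active, the MLE matches the observed frequency:
# lam = log(0.3/0.7).
lam_hat = newton_fit(h_obs=30.0, n=100)
```

Concavity guarantees there is a single optimum, so the quadratic convergence of Newton's method is not just fast but globally reliable here.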
Both the first and second derivatives of the log-likelihood are of the form
Σ_{x,y} g(x, y). For each pixel (x, y), we need to evaluate the probabilities of all
possible sx,y . So the computation is still quite costly, although much more man-
ageable compared to MCMC type of algorithms. To further increase the effi-
ciency, we choose to sample a small number of pixels instead of going through
all of them. More specifically, for each (x, y), let πx,y ∈ [0, 1] be the probability
that pixel (x, y) will be included in the sample. Then after we collect a sample
of (x, y) by independent coin-flipping according to πx,y , we can approximate
Σ_{x,y} g(x, y) ≈ Σ_{(x,y)∈sample} g(x, y)/π_{x,y},
where the right hand side is the Horvitz-Thompson unbiased estimator of the left
hand side. As to the choice of πx,y , if there is a linelet at (x, y) on Sobs , then
we always let πx,y = 1. For other empty pixels (x, y), we can set πx,y according
to our need for speed. Usually, even if we take πx,y = .01 for empty pixels, the
algorithm can still give a satisfactory fit.
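The subsampling trick can be sketched directly from the formula; g(x, y) is abstracted to a flat list of per-pixel values, and the inclusion probabilities π are a free choice:

```python
import random

def horvitz_thompson(values, pi, seed=0):
    """Estimate sum(values) by including each term with probability pi[i]
    and reweighting included terms by 1/pi[i] (the Horvitz-Thompson
    estimator); unbiased for any pi in (0, 1]."""
    rng = random.Random(seed)
    return sum(g / p for g, p in zip(values, pi) if rng.random() < p)

# With pi = 1 everywhere the estimator is exact.
exact = horvitz_thompson([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])

# With pi = 0.5 each run is noisy, but the average over runs recovers
# the true sum (100.0 here), illustrating unbiasedness.
ests = [horvitz_thompson([1.0] * 100, [0.5] * 100, seed=s) for s in range(400)]
avg = sum(ests) / len(ests)
```

The variance grows as π shrinks, which is why pixels carrying a linelet (where g is informative) get π = 1 while empty pixels can be thinned aggressively.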
It is often the case that some components of Σ_{x,y} H(s^obs_{x,y} | S^obs_{N(x,y)}) are 0, and
if we implement the usual Newton-Raphson procedure, then the corresponding
components of Λ will go to −∞, thereby creating hard constraints. While this
is theoretically the right result, it can make the algorithm unstable. We choose
to stop fitting such components as soon as the corresponding components of
Σ_{x,y} ⟨H(s_{x,y} | S^obs_{N(x,y)})⟩_{p(s_{x,y} | S^obs_{N(x,y)})} drop below a threshold, e.g., .5.
For a specific observed image, we do not want to use all the one to two hun-
dred dimensions in our model. In fact, we can just select a small number of com-
ponents of H(sx,y | SN (x,y) ) using some model selection methods such as Akaike
information criterion (AIC). While best-set selection is too time-consuming, we
can consider a feature pursuit scheme studied by Zhu, et al. (1997), i.e., we start
with only λ0 and δx,y in our model. Then we repeatedly add one component
of H at a time, so that the added component leads to the maximum increase
in log-likelihood. Although the log-likelihood is analytically tractable for the
causal model, the computation of the increase of log-likelihood for each candi-
date component of H is still quite costly. So as an approximation, we choose the
component Hk with the largest score

g_k = { Σ_{x,y} [ H_k(s^obs_{x,y} | S^obs_{N(x,y)}) − ⟨H_k(s_{x,y} | S^obs_{N(x,y)})⟩_{p̂} ] }² / Σ_{x,y} Var_{p̂} [ H_k(s_{x,y} | S^obs_{N(x,y)}) ],

where p̂ is the currently fitted model.
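The score g_k reduces to a squared z-score of the observed-minus-expected sums. The per-pixel statistics below are hypothetical numbers, not outputs of a fitted model:

```python
def pursuit_score(h_obs, h_exp, h_var):
    """g_k: squared sum of (observed - expected) feature values over
    pixels, divided by the summed variance under the fitted model."""
    num = sum(o - e for o, e in zip(h_obs, h_exp)) ** 2
    return num / sum(h_var)

def select_feature(features):
    """features maps a candidate name to per-pixel lists
    (h_obs, h_exp, h_var); pick the component with the largest score."""
    return max(features, key=lambda k: pursuit_score(*features[k]))

# Hypothetical statistics for two candidate components over two pixels.
feats = {
    "connects": ([1, 1], [0.5, 0.5], [0.25, 0.25]),  # large discrepancy
    "parallel": ([1, 0], [0.5, 0.5], [0.25, 0.25]),  # no net discrepancy
}
best = select_feature(feats)
```

A component whose observed statistics already match the model's expectations scores near zero and is skipped, so pursuit adds only the features the current model fails to reproduce.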
4 Discussion
There are two major loose ends in our work. One is that the coefficients of
the active bases are not modeled. The other is that the model fitting is not
rigorous. The model can be further extended to incorporate more sophisticated
local structures, such as local shapes and lighting, as well as more sophisticated
organizations such as flows and graphs. The key is that these concepts should be
understood in the context of a top-down generative model. For some stochastic
textures, sparse decomposition may not be achievable, and therefore, we might
have to stay with models built on pixel values such as the FRAME model.
We would like to stress that our goal in this work is to find a top-down
generative model for textures. We are not merely trying to come up with a line-
drawing version of the texture image by some edge detector, and then synthesize
similar line-drawings. We would also like to point out that our work is inspired
by Marr’s primal sketch (Marr, 1982). Marr’s method is bottom-up and procedure-
based; that is, there does not exist a top-down model to guide the bottom-up
procedure.
Our eventual goal is to find the top-down model as a visual system’s concep-
tion of primal sketch, so that the largely bottom-up procedure will be model-
based. The hope is that the model is unified and explicit like a language, with
rich vocabularies for local image structures as well as their spatial organizations.
When fitted to a particular image, an automatic model selection procedure will
identify a low-dimensional sub-model as the most meaningful words. The model
should lie between physics-based models (that are not explicit and unified) and
References
1. A. J. Bell and T.J. Sejnowski, “An information maximization approach to blind
separation and blind deconvolution”, Neural Computation, 7(6): 1129-1159, 1995.
2. F. Bergeaud and S. Mallat, “Matching pursuit: adaptive representation of images
and sounds.” Comp. Appl. Math., 15, 97-109. 1996.
3. A. Berger, V. Della Pietra, and S. Della Pietra, “A maximum entropy approach to
natural language processing”, Computational Linguistics, vol.22, no. 1 1996.
4. E. J. Candès and D. L. Donoho, “Curvelets - A Surprisingly Effective Nonadaptive
Representation for Objects with Edges”, Curves and Surfaces, L. L. Schumaker et
al. (eds), Vanderbilt University Press, Nashville, TN.
5. A. P. Dempster, N.M. Laird, and D. B. Rubin, “Maximum likelihood from incom-
plete data via the EM algorithm”, Journal of the Royal Statistical Society series
B, 39:1-38, 1977.
6. A. A. Efros and T. Leung, “Texture synthesis by non-parametric sampling”, ICCV,
Corfu, Greece, 1999.
7. A. A. Efros and W. T. Freeman, “Image Quilting for Texture Synthesis and Trans-
fer”, SIGGRAPH 2001.
8. S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images”, IEEE Trans. on PAMI, vol. 6, pp. 721-741, 1984.
9. C. E. Guo, S. C. Zhu, and Y. N. Wu, “Visual learning by integrating descriptive
and generative methods”, ICCV, Vancouver, CA, July, 2001.
10. D. J. Heeger and J. R. Bergen, “Pyramid-based texture analysis/synthesis”, SIG-
GRAPH, 1995.
11. M. S. Lewicki and B. A. Olshausen, “A probabilistic framework for the adaptation
and comparison of image codes”, JOSA, A. 16(7): 1587-1601, 1999.
12. L. Liang, C. Liu, Y. Xu, B.N. Guo, H.Y. Shum, “Real-Time Texture Synthesis By
Patch-Based Sampling”, MSR-TR-2001-40, March 2001.
13. J. Malik, and P. Perona, “Preattentive texture discrimination with early vision
mechanisms”, J. of Optical Society of America A, vol 7. no.5, May, 1990.
14. S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet rep-
resentation”, IEEE Trans. on PAMI, vol.11, no.7, 674-693, 1989.
15. S. Mallat and Z. Zhang, “Matching pursuit in a time-frequency dictionary”, IEEE
Trans. on Signal Processing, vol. 41, pp. 3397-3415, 1993.
16. D. Marr, Vision, W.H. Freeman and Company, 1982.
17. D. B. Mumford “The Statistical Description of Visual Signals” in ICIAM 95, edited
by K.Kirchgassner, O.Mahrenholtz and R.Mennicken, Akademie Verlag, 1996.
18. B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field proper-
ties by learning a sparse code for natural images” Nature, 381, 607-609, 1996.
19. K. Popat and R. W. Picard, “Novel Cluster-Based Probability Model for Texture
Synthesis, Classification, and Compression.” Proc. of the SPIE Visual Comm. and
Image Proc., Boston, MA, pp. 756-768, 1993.
20. J. Portilla and E. P. Simoncelli, “A parametric texture model based on joint statis-
tics of complex wavelet coefficients”, IJCV, 40(1), 2000.
21. Y. N. Wu, S. C. Zhu, and X. Liu, “Equivalence of Julesz texture ensembles and
FRAME models”, IJCV, 38(3), 247-265, 2000.
22. S. C. Zhu, Y. N. Wu and D. B. Mumford, “Minimax entropy principle and its
application to texture modeling”, Neural Computation Vol. 9, no 8, Nov. 1997.