A Robust Shape Model For Multi-View Car Alignment: Yan Li Leon Gu Takeo Kanade
Abstract
We present a robust shape model for localizing a set of feature points on a 2D image. Previous shape alignment models assume Gaussian observation noise and attempt to fit a regularized shape using all the observed data. However, such an assumption is vulnerable to gross feature detection errors resulting from partial occlusions or spurious background features. We address this problem by using a hypothesis-and-test approach. First, a Bayesian inference algorithm is developed to generate object shape and pose hypotheses from randomly sampled partial shapes - subsets of feature points. The hypotheses are then evaluated to find the one that minimizes the shape prediction error. The proposed model can effectively handle outliers and recover the object shape. We evaluate our approach on a challenging dataset which contains over 2,000 multi-view car images and spans a wide variety of types, lightings, background scenes, and partial occlusions. Experimental results demonstrate favorable improvements over previous methods in both accuracy and robustness.
1. Introduction
Deformable shape matching has been studied extensively in the past two decades with the emphasis on the alignment of human faces and anatomical structures. Representative work includes Snakes [14], Active Shape Model [4] and Active Appearance Model [3], Bayesian shape model [24, 13], nonlinear shape models [4, 19, 25], view-based [6] and three dimensional models [1, 12], and models for weak initialization [16, 17, 23]. A common assumption in these models is that the observation noise is Gaussian distributed. However, in real-world images shape observations are usually corrupted by large-scale measurement errors which are in gross disagreement with the true underlying shape. Such errors, usually called outliers, are caused by failures of the appearance model or undesirable conditions such as shadows and occlusions. Due to the iterative nature of most algorithms, these gross errors may become arbitrarily large and therefore cannot be averaged out, as is typically done in the least-squares framework. Rogers and Graham [18] attempt to address this problem by use of M-estimators. However, M-estimators tend to suffer from local optima, and pose parameters have been ignored in their model.
Another limitation of the previous models is the sensitivity to initialization. The objective functions are usually highly nonlinear, and a suboptimal initialization may cause the model to get stuck at a local minimum. Previous work attempts to tackle this problem by sampling: starting from multiple initializations and choosing the optimal resulting shape [17, 23]. However, each individual sample was evaluated and matched in a least-squares fashion, so that the alignment process could still fail in the presence of outliers. It is also unclear how many samples are sufficient to achieve the best solution.
In this paper, we address these two problems in a hypothesis-and-test framework. Our key insight is the following: since object shape typically resides in a low-dimensional subspace, the number of degrees of freedom of a shape model is considerably smaller than the number of observed features; therefore, a small subset of good features is sufficient to jump-start the matching and produce a reasonable estimate. We adopt the random sample consensus (RANSAC) paradigm of Fischler and Bolles [9]. In particular, a Bayesian inference algorithm is developed for generating a shape and pose hypothesis from a randomly sampled subset of features; each hypothesis is matched against the full observation by a robust measure to identify the optimal one; and the hypothesis is further refined by incorporating more inliers into the corresponding subset.
We apply the approach to multi-view car alignment: identifying detailed car shapes from different viewpoints. The task is challenging because car images are often subject to a significant amount of occlusion, and detecting individual parts is difficult. Combining our alignment model with a random forest [2] based detector, we develop a robust, fully automatic car alignment system.
Partial support provided by National Science Foundation (NSF) Grant IIS-0713406.
2. Problem Formulation
Consider the shape of a deformable object which consists of a set of 2D landmark points. Let $Y = (u_1, v_1, \ldots, u_N, v_N)^T$ denote the locations of the points observed from an input image. The observation contains not only noise, but also gross outliers. Our goal is to estimate the true underlying shape from such an observation, and to identify the outliers. Instead of using the whole observation $Y$ for estimation, we first use a randomly selected subset of $Y$, denoted by $Y_p$, to generate a shape hypothesis. The points in the subset are postulated to be inliers which, by assumption, satisfy the underlying noise model
$$Y_p = M_p T_\theta(S) + \epsilon. \quad (1)$$
The vector $S$ denotes the normalized true shape, which we refer to as the canonical shape. It is transformed onto the image plane by $T_\theta(S) = sRS + t$ with rotation $R$, scale $s$ and translation $t$. $M_p$ is a $2M \times 2N$ indicator matrix which specifies the subset. The observation noise $\epsilon \sim N(0, \Sigma)$ is assumed to be independent for individual points. Note that large-scale measurement errors do not conform to the Gaussian noise assumption; therefore the model (1) applies only to $Y_p$. The canonical shape $S$ is parameterized by a probabilistic PCA model [22, 24],
$$S = \mu + \Phi b + \xi \quad (2)$$
with the mean shape $\mu$, the low-dimensional eigen-subspace spanned by $\Phi$, and the shape deformation parameter $b$. Each element of $b$ controls the magnitude of deformation along the corresponding axis in the subspace. A diagonal prior
$$b \sim N(0, \Lambda) \quad (3)$$
is put on $b$, where $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_r\}$ and the $\lambda_i$ are eigenvalues. The shape noise $\xi$ is chosen to be isotropic, $\xi \sim N(0, \sigma^2 I)$, and its variance $\sigma^2 = \frac{1}{2N-r}\sum_{i=r+1}^{2N} \lambda_i$ is determined by the residual, off-eigenspace shape energy.
Combining (1)-(3), we have established a hierarchical probabilistic model that can be used for generating hypotheses. Specifically, our problem is to estimate the shape deformation $b$ and pose $\theta = \{R, s, t\}$ from a partial shape $Y_p$, i.e., to find the MAP estimate $\{b^*, \theta^*\} = \arg\max_{b,\theta} p(b, \theta \mid Y_p)$. This is a typical missing data problem that can be solved by Expectation-Maximization as described in Sec. 3. Given a hypothesis of $b$ and $\theta$, we can easily hallucinate the rest of the shape
$$Y_h = M_h (sRS + t), \quad (4)$$
where $M_h$ is a binary matrix that indicates the remaining set of points. The hallucinated shape $Y_h$ is then used to test the hypothesis. Sec. 4 explains the details.
where $\|\cdot,\cdot\|$ denotes the Euclidean distance, and $T_\theta$ is the rigid transform which brings the canonical shape $S$ to the observation space by $\theta$.
3.1. Discussion
The BPSI algorithm gives us some insight into the noise-presenting shape model. However, from the optimization point of view, the objective function and search method remain obscure. In this section, we re-examine the BPSI algorithm and focus on its optimization method. In Step 6, we first compute the posterior mean of $p(b \mid S)$, which can be viewed as the probabilistic version of PCA projection. In addition to the subspace projection performed in PCA, BPSI applies an inhomogeneous shrinkage on each subspace dimension. The shrinkage parameter of the $i$-th dimension is defined by
$$\frac{\lambda_i}{\lambda_i + \sigma^2} \quad (i = 1, \ldots, r). \quad (9)$$
Recall that $\sigma^2$ is the average of the remaining eigenvalues. Since $b$ captures a significant amount of the variance (98% in our implementation), $\sigma^2$ has a very small value (i.e., $\sigma^2 \approx 0$ and the shrinkage parameters are close to 1). This implies that the PCA projection and reconstruction in Step 6 would not alter $S_p$ substantially.
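To make the shrinkage in (9) concrete, the following is a minimal sketch, assuming an orthonormal basis $\Phi$ and the probabilistic PCA model of Section 2. The function names (`ppca_noise_and_shrinkage`, `posterior_mean_b`) are hypothetical and are not taken from the paper; the closed form used is the standard Gaussian posterior mean under those assumptions.

```python
import numpy as np

def ppca_noise_and_shrinkage(eigvals, r):
    """Split eigenvalues into the r retained ones and the isotropic noise level.

    sigma2 is the average of the discarded eigenvalues; the shrinkage factor
    of dimension i is lambda_i / (lambda_i + sigma2).
    """
    eigvals = np.asarray(eigvals, dtype=float)
    lam = eigvals[:r]
    sigma2 = float(np.mean(eigvals[r:]))
    return lam, sigma2, lam / (lam + sigma2)

def posterior_mean_b(S, mu, Phi, lam, sigma2):
    """Posterior mean of b given S: PCA projection followed by shrinkage.

    Assumes the columns of Phi are orthonormal (an assumption of this sketch).
    """
    return lam / (lam + sigma2) * (Phi.T @ (S - mu))
```

When the retained eigenvalues dominate, every factor is close to 1 and the result is nearly the plain PCA projection, which matches the remark that Step 6 barely alters $S_p$.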
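Algorithm 1 Bayesian Partial Shape Inference (BPSI)
Input: Partial observation $Y_p$. $\hat{b}$ and $\hat{\theta}$ from last iteration.
Output: Updated $b$ and $\theta$.
1: Initialize $b = \hat{b}$ and $\theta = \hat{\theta}$
2: for t = 1 to T do
3:   E-Step:
4:   Update $S_p$ by blending, and $S_h$ by reconstruction
       $S_p \leftarrow W_1 T_\theta^{-1}(Y_p) + W_2 (\Phi b + \mu)_p$   (6)
       $S_h \leftarrow (\Phi b + \mu)_h$   (7)
     where $W_1 = s^2\sigma^2 (s^2\sigma^2 I + \Sigma)^{-1}$ and $W_2 = I - W_1$
5:   M-Step:
6:   Update $b$ by the posterior mean of $p(b \mid S)$
7:   Estimate pose (Procrustes analysis [11])
       $\theta \leftarrow \arg\min_\theta \|Y_p - T_\theta(S_p)\|$
       $M = (S_p - \bar{S}_p)(Y_p - \bar{Y}_p)^t$,  $R = VU^t$,  $s = \mathrm{tr}(W)/\mathrm{tr}(M)$
8: end for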
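The pose update in Step 7 is a Procrustes problem. The sketch below shows one standard closed-form solution for a 2D similarity transform via the SVD of the cross-covariance matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `procrustes_pose`, the row-wise point layout, the reflection guard, and the scale normalization by the centered source energy are choices made here.

```python
import numpy as np

def procrustes_pose(S, Y):
    """Least-squares similarity transform (s, R, t) such that Y ~ s * R @ S_i + t.

    S, Y: (M, 2) arrays of corresponding points (canonical and observed).
    """
    S_mean, Y_mean = S.mean(axis=0), Y.mean(axis=0)
    Sc, Yc = S - S_mean, Y - Y_mean
    # Cross-covariance between centered source and target points.
    U, w, Vt = np.linalg.svd(Sc.T @ Yc)
    # Guard against reflections so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, d])
    R = Vt.T @ D @ U.T
    s = (w * np.diag(D)).sum() / (Sc ** 2).sum()
    t = Y_mean - s * R @ S_mean
    return s, R, t
```

With noise-free correspondences the routine recovers the generating scale, rotation and translation exactly, which makes it easy to sanity-check before plugging it into an EM loop.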
$$Y_p - T_\theta(S_p) = Y_p - [W_1 Y_p + W_2 T_\theta(\Phi b + \mu)_p] = (I - W_1) Y_p - W_2 T_\theta(\Phi b + \mu)_p = W_2 [Y_p - T_\theta(\Phi b + \mu)_p]$$
It shows that Step 7 in BPSI solves a weighted least-squares problem
$$\min \sum_{i=1}^{M} \ldots \quad (10)$$
Fig. 1 shows the profile of the weight function. Recall that $\epsilon_i$ is defined as the prediction residual from the previous step (Eqn. 5). Thus, the BPSI algorithm minimizes the sum of squared errors via iteratively reweighted least squares (IRLS):
$$\min_{b,\theta} \sum_{i=1}^{M} w_i\big(\epsilon_i^{(t-1)}\big)\, \epsilon_i^2(b, \theta)$$
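The residual identity used in this derivation can be checked numerically. The sketch below assumes that the observation covariance $\Sigma$ is diagonal with one variance per point (so the blending weights commute with the rotation); all numbers and names, including `blended_residual_identity`, are illustrative and not taken from the paper.

```python
import numpy as np

def blended_residual_identity(seed=0, M=5):
    """Return both sides of Y_p - T(S_p) = W_2 [Y_p - T(model_p)] on random data."""
    rng = np.random.default_rng(seed)
    s, angle = 1.7, 0.3
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    t = np.array([10.0, -4.0])
    T = lambda X: s * X @ R.T + t          # rows are points: x -> s R x + t
    T_inv = lambda Y: (Y - t) @ R / s      # inverse similarity transform

    sigma2 = 0.05                          # isotropic shape noise variance
    Sigma = rng.uniform(0.1, 4.0, size=M)  # assumed per-point observation variance
    w1 = s**2 * sigma2 / (s**2 * sigma2 + Sigma)
    w2 = 1.0 - w1

    model_p = rng.normal(size=(M, 2))      # stands in for (Phi b + mu)_p
    Y_p = T(model_p) + rng.normal(size=(M, 2))
    S_p = w1[:, None] * T_inv(Y_p) + w2[:, None] * model_p   # blending step (6)

    lhs = Y_p - T(S_p)
    rhs = w2[:, None] * (Y_p - T(model_p))
    return lhs, rhs
```

Because $W_1 + W_2 = I$, the translation passes through the blend unchanged, which is why the two sides agree to machine precision.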
where $p = 5$ is the number of features in one subset. $P$ is the expected chance that at least one of the proposal subsets is good. In our implementation, we assume an outlier ratio of 40% and require $P = 0.99$, thus $m = 57$. Given the proposal subsets $Y_p^{(k)}$ ($k = 1, \ldots, K$), the resulting shape $b$ can be obtained by the least median of squares (LMedS) estimator [20]
$$\min_k \operatorname{Med}_i \; r_i^2\big(Y_p^{(k)}, Y_h^{(k)}\big). \quad (12)$$
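The sampling budget and the LMedS selection can be sketched as follows. The sample-count formula is the standard RANSAC bound, assumed here because it reproduces $m = 57$ for a 40% outlier ratio, $p = 5$ and $P = 0.99$; the function names and the residual array layout are hypothetical.

```python
import math
import numpy as np

def num_subsets(outlier_ratio, subset_size, success_prob):
    """Smallest m with P(at least one all-inlier subset among m) >= success_prob."""
    p_good = (1.0 - outlier_ratio) ** subset_size
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - p_good))

def lmeds_select(residuals):
    """Pick the hypothesis with the least median of squared residuals.

    residuals: (K, n) array, row k holding the residuals r_i of hypothesis k.
    Returns the index of the best hypothesis and all K scores.
    """
    scores = np.median(np.square(residuals), axis=1)
    return int(np.argmin(scores)), scores
```

The median makes the score insensitive to the largest residuals of a hypothesis, so a proposal that fits most points well is preferred even when some points are occluded.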
Algorithm 2 Robust Shape Alignment
Input: Observation $Y$. $\hat{b}$ and $\hat{\theta}$ from last iteration.
Output: Regularized $Y$. Updated $b$ and $\theta$.
Hallucination: $Y_h^{(k)} \leftarrow M_h^{(k)} T_{\theta^{(k)}}(\Phi b^{(k)} + \mu)$

Figure 2. The partial shape $Y_p$ (red dots) is used to hallucinate the remaining shape $Y_h$ (gray dots). The marginal variance of the hallucinated points can be calculated and is shown here as ellipses.
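As an illustration of how the hallucinated points and their marginal variances might be computed, the sketch below evaluates the prior mean $M_h(sR\mu + t)$ and the diagonal of the prior covariance $s^2 M_h R(\Phi\Lambda\Phi^t + \sigma^2 I)R^t M_h^t$, then forms a variance-normalized residual per point. The interleaved $(u, v)$ layout, the block-diagonal handling of $R$, and the function names are assumptions of this sketch.

```python
import numpy as np

def hallucinate_prior(mu, Phi, lam, sigma2, s, R, t, hidden_points):
    """Prior mean and marginal variance of the hidden landmarks.

    mu: (2N,) mean shape with interleaved (u, v); Phi: (2N, r); lam: (r,).
    hidden_points: indices of the landmarks outside the sampled subset.
    """
    N = mu.size // 2
    R_all = np.kron(np.eye(N), R)            # applies R to every (u, v) pair
    t_all = np.tile(t, N)
    mean = s * R_all @ mu + t_all
    shape_cov = (Phi * lam) @ Phi.T + sigma2 * np.eye(2 * N)
    cov = s**2 * R_all @ shape_cov @ R_all.T
    rows = np.array([[2 * i, 2 * i + 1] for i in hidden_points]).ravel()
    mean_h = mean[rows].reshape(-1, 2)
    var_h = np.diag(cov)[rows].reshape(-1, 2)  # marginal variances only
    return mean_h, var_h

def normalized_residuals(Y_obs_h, mean_h, var_h):
    """Per-point residual scaled by the marginal variance (independence assumed)."""
    return np.sqrt((((Y_obs_h - mean_h) ** 2) / var_h).sum(axis=1))
```

Points that the shape prior allows to vary widely are penalized less for the same pixel displacement, which is the behaviour the ellipses in Figure 2 depict.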
In (12), $r_i$ is the residual between the $i$-th corresponding point of $Y_p$ and $Y_h$. In the traditional RANSAC literature, one usually assumes no a priori knowledge about the target model, and the voting inliers are assumed to be i.i.d. For instance, in the line fitting example, any two points determine a model and the residual is simply the Euclidean distance from a voting sample to the fitted line. However, in a deformable shape alignment task, varying amounts of residual should be accommodated to deal with the inherent shape variation. Note that the hallucinated shape $Y_h$ is generated from $b$ through the canonical shape $S$. By propagating the information in $b$, we obtain the prior distribution of $Y_h$:
$$E[Y_h] = M_h (sR\mu + t)$$
$$\mathrm{Var}[Y_h] = s^2 M_h R(\Phi\Lambda\Phi^t + \sigma^2 I)R^t M_h^t$$
In general, the points in $Y_h$ are correlated, thus the LMedS estimator cannot be applied directly. To remedy this problem, we make an independence assumption and use the marginal variance of each point to compute the residual. $r_i$ is essentially the Mahalanobis distance between $Y_p(i)$ and $Y_h(i)$. Fig. 2 illustrates the inhomogeneous prior variance exhibited in $Y_h$. Although the LMedS estimator is highly resistant to outliers, it has a relatively low statistical efficiency and the estimate tends to be variable [21]. A post-processing step must be employed to incorporate more inliers and re-estimate the model. Alg. 2 summarizes the complete hypothesis-and-test algorithm.
5. Experiments
5.1. The Dataset
We evaluate our model on the MIT StreetScene dataset. This dataset contains over 3,000 street scene images which were originally created for the task of object recognition and scene understanding in uncontrolled environments. We labeled 3,433 cars which span a wide variety of types, sizes, background scenes, lighting conditions, and partial occlusions. All the shapes are normalized to roughly the size 250x130 by Generalized Procrustes Analysis [8]. The labeled data were manually classified into three views: 1,400 half-front view, 803 profile view and 1,230 half-back view. We randomly select 400 images from each view for training, and use the rest for testing. For the occluded landmarks, we place their labels at the most probable locations, but the corresponding local patches are excluded when training the appearance model.
Figure 3. Random forest for posterior estimation. The descriptor is dropped to N decision trees. The final posterior is the average over all the resulting histograms reached by the input descriptor.
Figure 4. (a) A normalized image. We apply the trained random forest on the entire image. Posterior maps are shown for the wheels (b) and the top-right corner (c).
for each landmark. Local patches are further described by the Histogram of Oriented Gradients (HOG) descriptor [7]. The HOG descriptors are computed over dense and overlapping grids of spatial blocks, with image gradient features extracted at 9 orientations and gathered into a 576-dimensional feature vector (we use 8x8 cells and 2x2 blocks). The extracted descriptors are fed to a Random Forest [2] for discriminative learning. A random forest is essentially an ensemble of decision trees which are induced by bootstrapped data. Specifically, we adopt the Extremely Randomized Trees of Geurts et al. [10] for training. The random forest consists of N randomly generated decision trees, each of which is trained on 5000 bootstrapped samples. At each non-terminal node, two random dimensions, denoted by $i$ and $j$, are chosen from the descriptor $d$. The splitting measure at that node is specified as
$$B(d) = \begin{cases} 1, & \text{if } d(i) < d(j) \\ 0, & \text{otherwise} \end{cases}$$
where $B(d)$ indicates the branch along which $d$ should continue. At each terminal node, we save a normalized histogram that counts the frequency of each class reaching the node. Our random forest representation is similar to the feature classification trees of Lepetit et al. [15]. However, our task is to estimate the posterior of the landmark given the observed patch rather than to classify it into different categories. Since the decision trees are generated randomly, we can even combine all the landmarks into one random forest. In this case, each landmark represents a distinct class, while all the negative samples from different landmarks are combined into one single negative class. The resulting random forest is shown in Fig. 3. Given an input descriptor $d$, the posterior that it belongs to landmark $l_i$ is given by
$$p(l_i \mid d) = \frac{1}{N} \sum_{j=1}^{N} p_j(l_i \mid d) \quad (15)$$
where $p_j(l_i \mid d)$ is the posterior returned by tree $T_j$. The proposed random forest model offers two benefits. First, training and testing the model are extremely efficient, as we only need to examine a subset of randomly selected feature dimensions. In addition, by combining all the landmarks and training the forest jointly, the model implicitly captures the image context information, and is thus able to distinguish between neighboring landmarks. Fig. 4 illustrates the random forest result.
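A toy version of the pairwise-comparison trees and the averaged posterior in (15) is sketched below. It is deliberately simplified and is not the Extremely Randomized Trees procedure of [10]: every tree has a fixed depth, node tests are drawn completely at random, and the class and function names (`ComparisonTree`, `train_forest`, `forest_posterior`) are hypothetical.

```python
import numpy as np

class ComparisonTree:
    """Fixed-depth tree whose nodes test B(d) = 1 if d[i] < d[j] else 0."""

    def __init__(self, depth, n_classes, rng):
        self.depth, self.n_classes, self.rng = depth, n_classes, rng

    def fit(self, X, y):
        n_internal = 2 ** self.depth - 1
        # One random pair of distinct descriptor dimensions per internal node.
        self.pairs = np.array([self.rng.choice(X.shape[1], size=2, replace=False)
                               for _ in range(n_internal)])
        counts = np.zeros((2 ** self.depth, self.n_classes))
        for d, label in zip(X, y):
            counts[self._leaf(d), label] += 1
        totals = counts.sum(axis=1, keepdims=True)
        # Normalized class histogram per leaf; empty leaves fall back to uniform.
        self.hist = np.where(totals > 0, counts / np.maximum(totals, 1),
                             1.0 / self.n_classes)
        return self

    def _leaf(self, d):
        node = 0  # heap-style indexing: children of n are 2n+1 and 2n+2
        for _ in range(self.depth):
            i, j = self.pairs[node]
            node = 2 * node + 1 + int(d[i] < d[j])
        return node - (2 ** self.depth - 1)

    def posterior(self, d):
        return self.hist[self._leaf(d)]


def train_forest(X, y, n_trees, depth, n_classes, n_boot, seed=0):
    """Train each tree on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=n_boot)
        trees.append(ComparisonTree(depth, n_classes, rng).fit(X[idx], y[idx]))
    return trees


def forest_posterior(trees, d):
    """Average the leaf histograms reached by descriptor d over all trees."""
    return np.mean([tree.posterior(d) for tree in trees], axis=0)
```

Evaluating a descriptor costs one comparison per tree level, which illustrates why only a small subset of feature dimensions is ever examined at test time.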
Figure 5. (a) The observed shape. (b) ASM. (c) BTSM. (d) Our approach (solid colored points represent the partial shape that generates the optimal hypothesis; white ones are the inliers included in the refinement step; and black ones are the outliers rejected by the model). (e) Random shape hypotheses generated by RANSAC. The top row shows an example with spurious background features; the bottom row shows an example with partial occlusion.
Figure 6. Test errors for ASM, BTSM and our approach. The initial shape is set to be the mean shape plus 5 pixels of random noise on each landmark. For each test image, we use the same initialization for all three methods. The RMSE of each landmark is shown for different views: half-front view (left panel), profile view (middle panel), and half-back view (right panel).
alignment model drops as the noise level increases. However, the average error increases by less than 1 pixel even when a random shift of 20 pixels is added to the initial shape. This is because our algorithm relies on a minimal subset of features to generate a hypothesis, and can therefore recover a meaningful shape in a couple of iterations. Traditional approaches are more likely to fail in this case because the shape observation is contaminated by more outliers. Fig. 6 shows the landmark-wise average error over the entire test dataset. To investigate the error distribution, we need to make a side-by-side comparison for each example. We focus on the half-front view, which contains 1,400 images. For each example, we run BTSM and the robust alignment respectively, using the same initialization. In Fig. 8, we use the sorted error of BTSM as reference and plot the corresponding error of the proposed method. A cubic curve is also fitted to the blue plot to provide a global illustration of the error distribution. As we can see, the two methods are comparable on the first 600 or so examples, while the robust method overtakes BTSM on the remaining ones. Further inspection shows that many of those difficult examples correspond to images with occlusion. Fig. 9 shows some alignment results by our approach.
We demonstrate car images with various viewpoints, lightings, occlusion patterns, and cluttered background.
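The side-by-side comparison described for Fig. 8 can be reproduced in outline as follows. This is a sketch only: the per-example error arrays are whatever the evaluation produces, the function name `sorted_error_comparison` is hypothetical, and the example index is rescaled to [0, 1] before the cubic fit purely for numerical conditioning.

```python
import numpy as np

def sorted_error_comparison(ref_err, new_err, degree=3):
    """Order examples by the reference error and fit a trend to the other method.

    ref_err, new_err: per-example errors of the reference and proposed method,
    aligned so that index k refers to the same test image in both arrays.
    """
    ref_err = np.asarray(ref_err, dtype=float)
    new_err = np.asarray(new_err, dtype=float)
    order = np.argsort(ref_err)              # easiest to hardest for the reference
    ref_sorted, new_sorted = ref_err[order], new_err[order]
    x = np.linspace(0.0, 1.0, len(order))    # rescaled example index
    trend = np.polyval(np.polyfit(x, new_sorted, degree), x)
    return ref_sorted, new_sorted, trend
```

Plotting `ref_sorted`, `new_sorted` and `trend` against the example index shows where in the difficulty ordering one method starts to outperform the other.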
Figure 7. Test error for our approach using different initializations. The initial shape (mean shape) is perturbed by different levels of noise from 0 to 20 pixels.
6. Conclusions
We have described a RANSAC-based approach for robust object alignment, and applied it to a challenging multi-view car alignment task. It is encouraging to see that the approach is capable of dealing with large measurement errors such as those caused by occlusions. The current algorithm takes locally detected feature points as input. However, there is great potential for extending the RANSAC framework to operate over multiple, globally detected feature points. We plan to explore this approach in future work.
References
[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, pages 187-194, 1999.
[2] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[3] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. PAMI, 23(6):681-685, 2001.
[4] T. F. Cootes and C. J. Taylor. A mixture model for representing shape variation. Image and Vision Computing, pages 110-119, 1997.
[5] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38-59, Jan 1995.
[6] T. F. Cootes, K. Walker, and C. J. Taylor. View-based active appearance models. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[8] I. Dryden and K. Mardia. Statistical Shape Analysis. John Wiley & Sons, 1998.
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
[10] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63:3-42, 2006.
[11] J. Gower and G. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.
[12] L. Gu and T. Kanade. 3D alignment of face in a single image. In Proceedings of Computer Vision and Pattern Recognition, 2006.
[13] L. Gu and T. Kanade. A generative shape regularization model for robust face alignment. In Proceedings of the 10th European Conference on Computer Vision, 2008.
[14] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321-331, 1988.
[15] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In Proceedings of Computer Vision and Pattern Recognition, 2005.
[16] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proceedings of European Conference on Computer Vision, 2008.
[17] C. Liu, H. Shum, and C. Zhang. Hierarchical shape modeling for automatic face localization. In Proceedings of European Conference on Computer Vision, 2002.
[18] M. Rogers and J. Graham. Robust active shape model search. In Proceedings of European Conference on Computer Vision, 2002.
[19] S. Romdhani, S. Gong, and A. Psarrou. A multi-view nonlinear active shape model using kernel PCA. In BMVC, 1999.
[20] P. J. Rousseeuw. Robust regression and outlier detection. Wiley, New York, 1987.
[21] C. V. Stewart. Robust parameter estimation in computer vision. SIAM Review, 41(3):513-537, 1999.
[22] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611-622, 1999.
[23] J. Tu, Z. Zhang, Z. Zeng, and T. Huang. Face localization via hierarchical CONDENSATION with Fisher boosting feature selection. In Proceedings of Computer Vision and Pattern Recognition, 2004.
[24] Y. Zhou, L. Gu, and H. J. Zhang. Bayesian tangent shape model: estimating shape and pose parameters via Bayesian inference. In Proceedings of Computer Vision and Pattern Recognition, 2003.
[25] Y. Zhou, W. Zhang, X. Tang, and H. Shum. A Bayesian mixture model for multi-view face alignment. In Proceedings of Computer Vision and Pattern Recognition, 2005.
Figure 9. Alignment results by our approach. For each test image, we show the final result on the top and the observed shape at the bottom.