Object Co-Skeletonization With Co-Segmentation, CVPR'17
Koteswar Rao Jerripothula^{1,2}, Jianfei Cai^2, Jiangbo Lu^{3,4} and Junsong Yuan^2
^1 Graphic Era University, India   ^2 Nanyang Technological University, Singapore
^3 Advanced Digital Sciences Center, Singapore   ^4 Shenzhen Cloudream Technology, China
[email protected], {asjfcai,jsyuan}@ntu.edu.sg, [email protected]
Abstract
Figure 2. Example challenges of co-skeletonization. The quality of segmentation affects the quality of skeletonization. (b) The result of [21] for (a). (c) Our result. Skeletons lie on homogeneous regions, such as in (d) and (e), which are difficult to detect and describe.

Figure 3. Inherent interdependencies of co-skeletonization and co-segmentation can be exploited to achieve better results through a coupled iterative optimization process.

as shown for the image in Fig. 2(a), which has unsmooth segmentation. The skeleton produced by [21] in Fig. 2(b) has too many unnecessary branches, whereas a more desirable skeleton for representing the cheetah is the one obtained by our method in Fig. 2(c). Thus, the quality of the provided shape becomes crucial, which is challenging for conventional co-segmentation methods because their complex way of co-labeling many images may not yield good, smooth shapes. Second, joint processing of skeletons across multiple images is quite tricky: most skeleton points lie on homogeneous regions, as shown in Fig. 2(d) and (e), so they are not easy to detect and describe for the purpose of matching. Third, how to couple the two tasks so that they synergistically assist each other is another challenge.

Our key observation is that we can exploit the inherent interdependencies of the two tasks to achieve better results jointly. For example, in Fig. 3, although the initial co-segmentation produces a poor result, most of the skeleton pixels still remain on the horse, and they gradually improve the segmentation by providing good seeds in the subsequent iterations of joint processing. In turn, co-skeletonization also becomes better as the co-segmentation improves. Our other observation is that we can exploit the structure-preserving quality of dense correspondence to overcome the skeleton matching problem.

To the best of our knowledge, there is only one dataset on which co-skeletonization can be performed in a weakly supervised manner, namely the WH-SYMMAX dataset [20], and it contains only horse images. To evaluate co-skeletonization extensively, we construct a new benchmark dataset called the CO-SKEL dataset, which consists of images ranging from animals, birds and flowers to humans, with 26 categories in total. Extensive experiments show that our approach achieves state-of-the-art co-skeletonization performance in the weakly supervised setting.

2. Related Work

Skeletonization: Research on skeletonization can be divided into three categories. First, there are algorithms [17, 3, 19] which can perform skeletonization if the segmentation of an object is given. Generally, these algorithms are quite sensitive to distortions of the given shape, although this problem can be tackled by recent methods such as [21]. Second, there are traditional image processing methods [28, 29, 11] which generate skeletons by exploiting gradient intensity maps. They generate skeletons even for stuff like sky, sea, etc., which usually needs some object prior to be suppressed. Third, there are supervised learning based methods which require groundtruth skeletons of training images for learning. This class includes both traditional machine learning based methods [25, 20] and recent deep learning based methods [27, 22]. The performance of the traditional machine learning based methods is not satisfactory due to their limited feature learning capability in homogeneous regions. The recent deep learning based methods, on the other hand, have made great progress in skeletonization, as reported in [22], but at the cost of a complex training process on a substantial amount of annotated data. In contrast, our method is weakly supervised, although it can also utilize annotated data where available.

Segmentation: Image segmentation is a classical problem, and there are many types of approaches, such as interactive segmentation [15, 24], image co-segmentation [4, 7, 5] and semantic segmentation [18]. While interactive segmentation needs human effort, image co-segmentation exploits weak supervision in the form of requiring the association of same-category images, and uses an inter-image prior to help segment each individual image. Semantic image segmentation not only segments objects but also provides a label for each pixel. In the past few years, deep learning based methods such as fully convolutional networks (FCN) have greatly advanced the performance of semantic image segmentation. Recently, [10] proposed a joint framework that combines interactive segmentation with FCN based semantic segmentation [18] so that they help each other. In a similar spirit, in this work we propose coupling co-skeletonization and co-segmentation to assist each other.

3. Proposed Method

In this section, we discuss our joint framework of co-skeletonization and co-segmentation in detail.

3.1. Overview of Our Approach

Given a set of m similar images belonging to the same category, denoted by I = {I_1, I_2, ..., I_m}, we aim to provide two output sets: K = {K_1, K_2, ..., K_m} and O = {O_1, O_2, ..., O_m}, comprising skeleton masks and segmentation masks, respectively, where K_i(p), O_i(p) ∈ {0, 1} indicate whether a pixel p is a skeleton pixel (K_i(p) = 1) and whether it is a foreground pixel (O_i(p) = 1).

Our overall objective function for an image I_i is defined as

  min_{K_i, O_i}  λ ψ_pr(K_i, O_i | N_i) + ψ_in(K_i, O_i | I_i) + ψ_sm(K_i, O_i | I_i)
  s.t.  K_i ⊆ ma(O_i)                                                        (1)

where the first term ψ_pr accounts for the priors from the set of neighbor images denoted as N_i, the second term ψ_in enforces the interdependence between the skeleton K_i and the shape/segmentation O_i in image I_i, the third term ψ_sm enforces smoothness, and λ is a parameter controlling the influence of the inter-image prior term. The constraint in (1) means the skeleton must be a subset of the medial axis (ma) [3] of the shape.

We resort to the typical alternating optimization strategy to solve (1), i.e., dividing (1) into two sub-problems and solving them iteratively. In particular, one sub-problem is as follows: given the shape O_i, we solve co-skeletonization by

  min_{K_i}  λ ψ_pr^k(K_i | N_i) + ψ_in^k(K_i | O_i) + ψ_sm^k(K_i)
  s.t.  K_i ⊆ ma(O_i).                                                       (2)

The other sub-problem is that, given the skeleton K_i, we solve co-segmentation by

  min_{O_i}  λ ψ_pr^o(O_i | N_i) + ψ_in^o(O_i | K_i, I_i) + ψ_sm^o(O_i | I_i).   (3)

If we treat both the inter-image prior term ψ_pr^k and the shape prior term ψ_in^k as a combined prior, (2) turns out to be a skeleton pruning problem and can be solved using an approach similar to [21], where branches of the skeleton are iteratively removed as long as doing so reduces the energy. Similarly, if we combine both the inter-image prior ψ_pr^o and the skeleton prior ψ_in^o as the data term, (3) becomes a standard MRF-based segmentation formulation, which can be solved using GrabCut [15]. Thus, compared with existing works, the key differences of our formulation lie in the designed inter-image prior terms as well as the interdependence terms, which link co-skeletonization and co-segmentation together.

Iteratively solving (2) and (3) requires a good initialization. We propose to initialize O by Otsu-thresholded saliency maps and K by the medial axis mask [3]. Alg. 1 summarizes our approach, where (λψ_pr + ψ_in + ψ_sm)^(t) denotes the objective function value of (1) at the t-th iteration, and ψ_pr = ψ_pr^k + ψ_pr^o, ψ_in = ψ_in^k + ψ_in^o, ψ_sm = ψ_sm^k + ψ_sm^o.

Algorithm 1: Our approach for solving (1)
  Data: An image set I containing images of the same category
  Result: Sets O and K containing segmentations and skeletons of images in I
  Initialization: ∀ I_i ∈ I, O_i^(0) = Otsu-thresholded saliency map and K_i^(0) = ma(O_i^(0));
  Process: ∀ I_i ∈ I,
  do
    1) Obtain O_i^(t+1) by solving (3) using [15] with O^(t) and K_i^(t).
    2) Obtain K_i^(t+1) by solving (2) using [21] with K^(t) and O_i^(t+1), s.t. K_i^(t+1) ⊆ ma(O_i^(t+1)).
  while (λψ_pr + ψ_in + ψ_sm)^(t+1) ≤ (λψ_pr + ψ_in + ψ_sm)^(t);
  O ← O^(t) and K ← K^(t)

3.2. Object Co-skeletonization

As shown in Alg. 1, the step of object co-skeletonization is to obtain K^(t+1) by minimizing (2), given the shape O^(t+1) and the previous skeleton set K^(t). Considering the constraint K_i^(t+1) ⊆ ma(O_i^(t+1)), we only need to search for skeleton pixels among the medial axis pixels. We build our solution on [21], but with our carefully designed individual terms for (2), as explained below.

Prior Term (ψ_pr^k): In object co-skeletonization, a good skeleton pixel is one which is repetitive across images. To account for this repetitiveness, we need to find corresponding skeleton pixels in other images. However, skeleton pixels usually lie on homogeneous regions (see Fig. 2(d) and (e)) and are thus difficult to match. Thus, instead of trying to match sparse skeleton pixels, we make
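The alternating optimization summarized in Alg. 1 can be sketched in code. This is a minimal illustration only: `solve_coseg`, `solve_coskel`, and `energy` are hypothetical callbacks standing in for the GrabCut-based step for (3), the skeleton-pruning step for (2), and the objective value of (1), respectively; the real initialization uses Otsu-thresholded saliency maps and medial axes.

```python
def coupled_optimization(images, solve_coseg, solve_coskel, energy, max_iters=10):
    """Sketch of Alg. 1: alternate co-segmentation and co-skeletonization.

    The three callables are hypothetical stand-ins for the paper's solvers;
    they are NOT part of any published API.
    """
    # O^(0), K^(0): in the paper these come from Otsu-thresholded saliency
    # maps and their medial axes; here the callbacks supply them.
    O = [solve_coseg(img, None, None) for img in images]
    K = [solve_coskel(img, O_i) for img, O_i in zip(images, O)]
    prev = energy(O, K)
    for _ in range(max_iters):
        # Step 1: re-segment every image given the current skeletons (Eq. 3).
        O_new = [solve_coseg(img, O, K_i) for img, K_i in zip(images, K)]
        # Step 2: re-skeletonize, constrained to the new medial axes (Eq. 2).
        K_new = [solve_coskel(img, O_i) for img, O_i in zip(images, O_new)]
        cur = energy(O_new, K_new)
        if cur > prev:  # stop once the objective stops decreasing
            break
        O, K, prev = O_new, K_new, cur
    return O, K
```

The loop mirrors the do-while structure of Alg. 1: each iteration is accepted only while the total energy keeps decreasing.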
branches. This is different from the criterion in [21], which only aims for a smaller number of skeleton pixels. Specifically, we define the smoothness term ψ_sm^k as

  ψ_sm^k(K_i) = |b(K_i)| × Σ_{u=1}^{|b(K_i)|} 1 / length(b_u(K_i))            (8)

where b(K_i) = {b_1(K_i), ..., b_{|b(K_i)|}(K_i)} denotes the set of branches of the skeleton K_i. In this way, we punish skeletons with either a large number of branches or short branches.

3.3. Object Co-segmentation

The object co-segmentation problem here is as follows: given the skeleton K_i, find the optimal O_i that minimizes the objective function defined in (3). The individual terms in (3) are defined in the following manner.

Prior Term (ψ_pr^o): We generate an inter-image co-segment prior, similar to that for co-skeletonization. In particular, we align the segmentation masks of neighboring images and fuse them with that of the concerned image, i.e.,

  Õ_i = ( O_i + Σ_{I_j ∈ N_i} W_ji(O_j) ) / ( |N_i| + 1 )                     (9)

where W_ji is the same warping function from image j to image i. Then, with the help of Õ_i, we define our inter-image prior term as

  ψ_pr^o(O_i | N_i) = − Σ_{p ∈ D_i} [ O_i(p) log( (1/|N(p)|) Σ_{q ∈ N(p)} Õ_i(q) )
                      + (1 − O_i(p)) log( 1 − (1/|N(p)|) Σ_{q ∈ N(p)} Õ_i(q) ) ]   (10)

which encourages the shape to be consistent with Õ_i. Here again we account for pixel correspondence errors by averaging over the neighborhood N(p) (in the pixel domain D_i).

Interdependence Term (ψ_in^o): For the co-segmentation process to benefit from co-skeletonization, our basic idea is to build foreground and background appearance models based on the given skeleton K_i. In particular, we use GMMs as appearance models. The foreground GMM is learned using K_i (i.e., treating skeleton pixels as foreground seeds), whereas the background GMM is learned using the background part of K_i's reconstructed shape R(K_i, O_i). In this manner, the appearance model is developed entirely from the skeleton. Note that at the beginning it is not robust to build the GMM appearance models in this manner, since the initial skeleton extracted based on saliency is not reliable. Thus, at initialization, we develop the foreground and background appearance models based on the

Denoting θ(K_i, I_i) as the developed appearance models, we define the interdependence term ψ_in^o as

  ψ_in^o(O_i | K_i, I_i) = − Σ_{p ∈ D_i} log P( O_i(p) | θ(K_i, I_i), I_i(p) )     (11)

where P(O_i(p) | θ(K_i, I_i), I_i(p)) denotes how likely a pixel of color I_i(p) is to take the label O_i(p) given θ(K_i, I_i). ψ_in^o is similar to the data term in the interactive segmentation method [15].

Smoothness Term (ψ_sm^o): To ensure smooth foreground and background segments, we simply adopt the smoothness term of GrabCut [15], i.e.,

  ψ_sm^o(O_i | I_i) = γ Σ_{(p,q) ∈ E_i} [O_i(p) ≠ O_i(q)] e^{−β ||I_i(p) − I_i(q)||²}   (12)

where E_i denotes the set of neighboring pixel pairs in the image I_i, and γ and β are segmentation smoothness related parameters, as discussed in [15].

3.4. Implementation Details

We use the saliency extraction method [2] for the initialization of our framework in our experiments. We use the same default setting as in [15] for the segmentation parameters γ and β in (12) throughout our experiments. For the parameters of SIFT flow [12], we follow the setting in [16] in order to handle the possible matching of different semantic objects. The parameter λ in both (2) and (3), which controls the influence of joint processing, is set to 0.1.

4. Experimental Results

4.1. Datasets and Evaluation Metrics

Datasets: There is only one publicly available dataset, i.e. the WH-SYMMAX dataset [20], on which weakly supervised co-skeletonization can be performed, but it contains only the horse category of images. In order to evaluate the co-skeletonization task extensively, we develop a new benchmark dataset called the CO-SKEL dataset. It consists of 26 categories with a total of 353 images of animals, birds, flowers and humans. These images are collected from the MSRC, CosegRep, Weizmann Horses and iCoseg datasets, along with their groundtruth segmentation masks. We then apply [21] (with our improved terms) to these groundtruth masks, in the same manner as the WH-SYMMAX dataset was generated from the Weizmann Horses dataset [1]. Fig. 6 shows some example images and their skeletons using [21] and our improvement of [21]¹. It can be seen that our skeletons are much smoother and better at representing the shapes.

¹ We will make our dataset with groundtruths and code publicly available.
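The mask fusion of Eq. (9) and the inter-image prior of Eq. (10) can be sketched with NumPy. This is a simplified illustration, not the paper's implementation: it assumes the neighbor masks have already been warped into the current image's frame (the paper uses SIFT flow [12] for W_ji), and `box_mean` is a hypothetical stand-in for the neighborhood averaging over N(p).

```python
import numpy as np

def fuse_masks(mask, warped_neighbors):
    """Eq. (9): average a mask with its (already warped) neighbor masks."""
    return (mask + sum(warped_neighbors)) / (len(warped_neighbors) + 1)

def box_mean(a, r=1):
    """Mean over a (2r+1)x(2r+1) window, a stand-in for averaging over N(p)."""
    p = np.pad(a, r, mode='edge')
    out = np.zeros(a.shape, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += p[r + dy:p.shape[0] - r + dy, r + dx:p.shape[1] - r + dx]
    return out / (2 * r + 1) ** 2

def inter_image_prior(O, O_tilde, eps=1e-8):
    """Eq. (10): negative log-likelihood of labeling O under the fused prior."""
    m = np.clip(box_mean(O_tilde), eps, 1 - eps)  # clipped to keep log finite
    return float(-np.sum(O * np.log(m) + (1 - O) * np.log(1 - m)))
```

As intended by Eq. (10), the prior is near zero when the labeling agrees with the fused prior and grows as they disagree.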
Figure 6. Given the shape, we improve the skeletonization method [21] using our improved terms in its objective function. It can be seen that our skeletons are much smoother and better at representing the shape. We use these improved results as groundtruths in our CO-SKEL dataset. (Each example shows, from left to right: Image, [21], Ours.)
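The branch-based smoothness term ψ_sm^k of Eq. (8), which drives this improvement, is straightforward to sketch once branch lengths are available (e.g., from tracing the medial axis); `branch_lengths` here is a hypothetical precomputed list, not part of the paper's code.

```python
def skeleton_smoothness(branch_lengths):
    """Eq. (8): |b(K_i)| times the sum of inverse branch lengths.

    Penalizes skeletons that have many branches or short branches,
    unlike [21], which only penalizes the number of skeleton pixels.
    """
    n = len(branch_lengths)  # |b(K_i)|, the number of branches
    return n * sum(1.0 / L for L in branch_lengths)
```

For example, adding one short spurious branch to a clean two-branch skeleton raises the penalty sharply, which is exactly the behavior the term is designed for.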
Figure 7. Some examples of steadily improving skeletonization and segmentation after each iteration. The top-right example shows that
our model continues to reproduce similar results once the optimal shape and skeleton are obtained.
Category    m    F0      F1      F2      F3      J
bear        4    0.075   0.1714  0.213   0.246   0.846
iris        10   0.363   0.600   0.658   0.698   0.837
camel       10   0.224   0.353   0.395   0.432   0.674
cat         8    0.118   0.360   0.469   0.523   0.733
cheetah     10   0.078   0.221   0.287   0.335   0.735
cormorant   8    0.351   0.545   0.606   0.642   0.768
cow         28   0.142   0.437   0.580   0.669   0.789
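In the table above, J appears to be the Jaccard similarity between the final segmentation and the groundtruth mask, and F0-F3 appear to be F-measures after iterations 0-3. The exact skeleton-matching tolerance behind the F columns is not specified in this excerpt, so the sketch below gives only the generic definitions:

```python
import numpy as np

def jaccard(pred, gt):
    """Intersection over union of two binary masks (the J column)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F columns)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```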
Figure 9. Sample co-skeletonization results along with our final shape masks (each example shows Image, Groundtruth, Ours). It can be seen that both are quite close to the groundtruths.
References

[1] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In European Conference on Computer Vision (ECCV), pages 109–122. Springer Berlin Heidelberg, 2002.
[2] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Global contrast based salient region detection. In Computer Vision and Pattern Recognition (CVPR), pages 409–416. IEEE, 2011.
[3] W.-P. Choi, K.-M. Lam, and W.-C. Siu. Extraction of the euclidean skeleton based on a connectivity criterion. Pattern Recognition, 36(3):721–729, 2003.
[4] J. Dai, Y. N. Wu, J. Zhou, and S.-C. Zhu. Cosegmentation and cosketch by unsupervised learning. In International Conference on Computer Vision (ICCV). IEEE, 2013.
[5] K. R. Jerripothula, J. Cai, and J. Yuan. Group saliency propagation for large scale and quick image co-segmentation. In International Conference on Image Processing (ICIP), pages 4639–4643. IEEE, 2015.
[6] K. R. Jerripothula, J. Cai, and J. Yuan. Cats: Co-saliency activated tracklet selection for video co-localization. In European Conference on Computer Vision (ECCV), pages 187–202. Springer, 2016.
[7] K. R. Jerripothula, J. Cai, and J. Yuan. Image co-segmentation via saliency co-fusion. IEEE Transactions on Multimedia (T-MM), 18(9):1896–1909, Sept 2016.
[8] T. S. H. Lee, S. Fidler, and S. Dickinson. Detecting curved symmetric parts using a deformable disc model. In International Conference on Computer Vision (ICCV), pages 1753–1760. IEEE, 2013.
[9] A. Levinshtein, S. Dickinson, and C. Sminchisescu. Multiscale symmetric part detection and grouping. In International Conference on Computer Vision (ICCV), pages 2162–2169. IEEE, 2009.
[10] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[11] T. Lindeberg. Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision (IJCV), 30(2):117–156, 1998.
[12] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 33(5):978–994, 2011.
[13] F. Meng, J. Cai, and H. Li. Cosegmentation of multiple image groups. Computer Vision and Image Understanding (CVIU), 146:67–76, 2016.
[14] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42(3):145–175, 2001.
[15] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004.
[16] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In Computer Vision and Pattern Recognition (CVPR), pages 1939–1946. IEEE, 2013.
[17] P. K. Saha, G. Borgefors, and G. S. di Baja. A survey on skeletonization algorithms and their applications. Pattern Recognition Letters, 76:3–12, 2016. Special Issue on Skeletonization and its Application.
[18] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 39(4):640–651, 2017.
[19] W. Shen, X. Bai, R. Hu, H. Wang, and L. J. Latecki. Skeleton growing and pruning with bending potential ratio. Pattern Recognition, 44(2):196–209, 2011.
[20] W. Shen, X. Bai, Z. Hu, and Z. Zhang. Multiple instance subspace learning via partial random projection tree for local reflection symmetry in natural images. Pattern Recognition, 52:306–316, 2016.
[21] W. Shen, X. Bai, X. Yang, and L. J. Latecki. Skeleton pruning as trade-off between skeleton simplicity and reconstruction error. Science China Information Sciences, 56(4):1–14, 2013.
[22] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, and X. Bai. Object skeleton extraction in natural images by fusing scale-associated deep side outputs. In Computer Vision and Pattern Recognition (CVPR), pages 222–230. IEEE, 2016.
[23] A. Sironi, V. Lepetit, and P. Fua. Multiscale centerline detection by learning a scale-space distance transform. In Computer Vision and Pattern Recognition (CVPR), pages 2697–2704. IEEE, 2014.
[24] M. Tang, L. Gorelick, O. Veksler, and Y. Boykov. GrabCut in one cut. In International Conference on Computer Vision (ICCV), pages 1769–1776, 2013.
[25] S. Tsogkas and I. Kokkinos. Learning-based symmetry detection in natural images. In European Conference on Computer Vision (ECCV), pages 41–54. Springer Berlin Heidelberg, 2012.
[26] N. Widynski, A. Moevus, and M. Mignotte. Local symmetry detection in natural images using a particle filtering approach. IEEE Transactions on Image Processing (T-IP), 23(12):5309–5322, 2014.
[27] S. Xie and Z. Tu. Holistically-nested edge detection. In International Conference on Computer Vision (ICCV), pages 1395–1403. IEEE, 2015.
[28] Z. Yu and C. Bajaj. A segmentation-free approach for skeletonization of gray-scale images via anisotropic vector diffusion. In Computer Vision and Pattern Recognition (CVPR), pages 415–420. IEEE, 2004.
[29] Q. Zhang and I. Couloigner. Accurate centerline detection and line width estimation of thick lines using the radon transform. IEEE Transactions on Image Processing (T-IP), 16(2):310–316, 2007.
[30] H. Zhu, F. Meng, J. Cai, and S. Lu. Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation. Journal of Visual Communication and Image Representation (JVCIR), 34:12–27, 2016.