
Adaptive Prototype Learning and Allocation for Few-Shot Segmentation

Gen Li1,3, Varun Jampani2, Laura Sevilla-Lara1, Deqing Sun2, Jonghyun Kim3, Joongkyu Kim3*
1University of Edinburgh   2Google Research   3Sungkyunkwan University
*Corresponding author

Abstract

Prototype learning is extensively used for few-shot segmentation. Typically, a single prototype is obtained from the support feature by averaging the global object information. However, using one prototype to represent all the information may lead to ambiguities. In this paper, we propose two novel modules, named superpixel-guided clustering (SGC) and guided prototype allocation (GPA), for multiple prototype extraction and allocation. Specifically, SGC is a parameter-free and training-free approach, which extracts more representative prototypes by aggregating similar feature vectors, while GPA is able to select matched prototypes to provide more accurate guidance. By integrating SGC and GPA together, we propose the Adaptive Superpixel-guided Network (ASGNet), which is a lightweight model and adapts to object scale and shape variation. In addition, our network can easily generalize to k-shot segmentation with substantial improvement and no additional computational cost. In particular, our evaluations on COCO demonstrate that ASGNet surpasses the state-of-the-art method by 5% in 5-shot segmentation. Code is available at https://git.io/ASGNet.

Figure 1. Comparison between (a) single prototype learning and (b) the proposed adaptive prototype learning and allocation. We utilize superpixel-guided clustering to generate multiple prototypes and then allocate them pixel-wise to the query feature.

1. Introduction

Humans have a remarkable ability to learn how to recognize novel objects after seeing only a handful of exemplars. On the other hand, deep learning based computer vision systems have made tremendous progress, but have largely depended on large-scale training sets. Also, deep networks mostly work with predefined classes and are incapable of generalizing to new ones. The field of few-shot learning studies the development of such learning ability in artificial learning systems, where only a few examples of the new category are available.

In this work, we tackle the few-shot segmentation problem, where the target is learning to segment objects in a given query image while only a few support images with ground-truth segmentation masks are available. This is a challenging problem as the test data are novel categories which do not exist in the training set, and there are usually large variations in appearance and shape between the support and query images.
Current few-shot segmentation networks usually extract features from both query and support images, and then propose different approaches for feature matching and object mask transfer from support to query image. This feature matching and mask transfer are usually performed in one of two ways: prototypical feature learning or affinity learning. Prototypical learning techniques condense the masked object features in a support image into a single or few prototypical feature vectors. Then these techniques find the pixel locations of similar features in the query image to segment the desired object. A key advantage of prototype learning is that the prototypical features are more robust to noise than pixel features. However, prototypical features inevitably drop spatial information, which is important when there is a large variation in the object appearance between the support and query images. In addition, most prototypical learning networks [40, 42, 32, 29, 4] merely generate a single prototype by masked average pooling as shown in Figure 1(a), thus losing information as well as discriminability.

Affinity learning techniques [39, 36, 31], on the other hand, directly try to match object pixels in a support image to query image pixels, thereby transferring the object mask. These techniques predict cross-image pixel affinities (also called connection strengths) using learned features, which perform feature matching while preserving spatial information better than prototypical learning approaches. However, affinity learning techniques are prone to over-fitting on training data as they try to solve an under-constrained pixel-matching problem with dense affinity matrices.

In this work, we propose a novel prototypical learning technique that addresses some of the main shortcomings of existing ones. In particular, we want to adaptively change the number of prototypes and their spatial extent based on the image content, making the prototypes content-adaptive and spatially-aware. This adaptive, multi-prototype strategy is important to deal with large variations in object scales and shapes across different images. Intuitively, when an object occupies a large portion of the image, it carries more information and thus requires more prototypes to represent all the necessary information. On the contrary, if the object is fairly small and the proportion of the background is large, then a single or few prototypes are sufficient. In addition, we want the support region (spatial extent) of each prototype to be adaptive to the object information present in the support image. Concretely, we aim to divide the support feature into several representative areas according to feature similarity. We also want to adaptively choose the more important prototypes while finding similar features in a query image. As different object parts are visible in different image regions and in different query images, we want to dynamically allocate different prototypes across the query image for feature matching. For example, some parts of the object can be occluded in a query image, and we want to dynamically choose the prototypes that correspond to the visible parts in the query image.

We achieve this adaptive, multi-prototype learning and allocation with our Adaptive Superpixel-guided Network (ASGNet), which leverages superpixels for adapting both the number and the support regions of the prototypes. The schematic illustration is presented in Figure 1(b). In particular, we propose two modules, Superpixel-guided Clustering (SGC) and Guided Prototype Allocation (GPA), which form the core of ASGNet. The SGC module performs fast feature-based superpixel extraction on the support image, and the resulting superpixel centroids are considered as prototypical features. Since superpixel shapes and numbers are adaptive to the image content, the resulting prototypes also become adaptive. The GPA module uses an attention-like mechanism to allocate the most relevant support prototype features to each pixel in a query image. In summary, the SGC module provides adaptive prototype learning both in terms of the number of prototypes and their spatial extents, and the GPA module provides adaptive allocation of the learned prototypes when processing query features. These two modules make ASGNet highly flexible and adaptive to varying object shapes and sizes, allowing it to generalize better to unseen object categories. We make the following contributions:

• We propose the Adaptive Superpixel-guided Network (ASGNet), a flexible prototypical learning approach for few-shot segmentation that is adaptive to different object scales, shapes and occlusions.

• We introduce two novel modules, namely Superpixel-guided Clustering (SGC) and Guided Prototype Allocation (GPA), for adaptive prototype extraction and allocation respectively. They can serve as effective plug-and-play components for feature matching.

• ASGNet achieves top-performing results with fewer parameters and less computation. Specifically, the proposed method obtains mIoUs of 64.36%/42.48% in the 5-shot setting on Pascal-5i/COCO-20i, exceeding the state-of-the-art by 2.40%/5.08%.

2. Related Work

Semantic Segmentation. Most existing semantic segmentation methods are based on fully convolutional networks (FCNs) [19], which replace fully connected layers with fully convolutional ones for pixel-level prediction. Recent breakthroughs in semantic segmentation have mainly come from multi-scale feature aggregation [43, 2, 35, 10] or attention mechanisms [7, 38, 15, 28, 3, 44, 41]. These methods often use dilated convolution kernels [37] and set up an encoder-decoder structure to obtain a large receptive field while maintaining the feature resolution. Although these methods achieve tremendous success, they need long training time and a large amount of pixel-level labeled ground truth to fully supervise the network. Also, in the inference stage, trained models cannot recognize new classes that do not exist in the training set.

Few-shot Learning. Few-shot learning focuses on the generalization ability of models, so that they can learn to predict new classes given only a few annotated examples. Existing methods mainly concentrate on metric learning [30, 27, 25] and meta-learning [33, 6, 22]. The core idea of metric learning is distance measurement, and it is generally formulated as an optimization of the distance/similarity between images or regions. In meta-learning approaches, the main idea is to define specific optimization or loss functions to achieve fast learning capability. Among these methods, the concept of prototypical networks has been extensively adopted in few-shot segmentation, which largely reduces the computational budget while maintaining high performance. Most methods focused on image classification, while recently few-shot segmentation has received growing attention.
Few-shot Segmentation. Few-shot segmentation is an extension of few-shot classification, and it tackles the more challenging task of predicting a label at each pixel instead of predicting a single label for the entire image. This research problem was introduced by Shaban et al. [23], who proposed a classical two-branch network. Later on, PL [4] introduced the idea of using prototypes. In that work, each prediction is generated by measuring the similarity between prototypes and pixels in the query image. SG-One [42] proposed masked average pooling to obtain the object-related prototype, which has been widely used in subsequent works. PANet [32] introduced a novel prototype alignment regularization to fully exploit the support knowledge, making the final prediction with only the measurement of cosine distance. Based on masked average pooling, CANet [40] expanded the prototype to the same size as the query feature and concatenated them together. This work also used an iterative optimization module to refine the segmentation result. PGNet, BriNet and DAN [39, 36, 31] introduced dense pixel-to-pixel connections between support and query features to maintain spatial information. More recently, PMMs [34] leveraged the expectation-maximization (EM) algorithm to generate multiple prototypes. However, all prototypes have the same relevance in this model, which can potentially be sensitive to poorly matched prototypes. Instead, in our work we utilize the similarity between each prototype and the query feature to select the most relevant prototype at each pixel location.

Superpixel Segmentation. A superpixel is defined as a set of pixels with similar characteristics (color, texture, category). Superpixels have been effective in many computer vision tasks, and have been used as the basic units for the image segmentation task. Superpixels carry more information than pixels, and can provide more compact and convenient image representations for downstream vision tasks. For more details about traditional and existing methods on superpixels, please refer to [1, 26].

Our research is inspired by MaskSLIC [12] and the superpixel sampling network (SSN) [13]. MaskSLIC adapts SLIC [1] to a defined region of interest (RoI), and its main contribution is in the placement of seed points within the RoI. SSN [13] proposed the first end-to-end trainable superpixel algorithm by making the SLIC algorithm differentiable. Inspired by the insights of these two techniques, we propose masked superpixel clustering in feature space, which gathers similar features together and generates superpixel centroids as prototypes. Instead of representing the information of the entire object, superpixel centroids stand for the parts of the object with similar characteristics.

3. Problem Definition

The key difference between few-shot segmentation and general semantic segmentation is that the categories in the training and testing sets do not intersect. This means that, in the inference stage, the testing set has classes totally unseen during training. Specifically, given a training set S_train = {(I_{S/Q}, M_{S/Q})} and a testing set S_test = {(I_{S/Q}, M_{S/Q})}, the categories of the two sets do not intersect (S_train ∩ S_test = ∅). Here I ∈ R^{H×W×3} indicates the RGB image and M ∈ R^{H×W} denotes the segmentation mask. Subscripts S and Q represent support and query, respectively. Following the first one-shot segmentation work [23], we align training and testing with the episodic paradigm [30]. In each episode, the input to the model is composed of a query image I_Q and K samples (I_S^i, M_S^i), i ∈ {1, ..., K}, from the support set. All support and query images contain the same class c. We estimate the query mask M̃_Q to approximate the ground-truth mask M_Q.
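For concreteness, the tensor shapes involved in a single K-shot episode can be sketched as follows (a minimal illustration with PyTorch tensors; the function name and random placeholders are ours, not an interface prescribed by the paper):

```python
import torch

def make_dummy_episode(k_shots: int, image_size: int = 473):
    """Sketch of one K-shot episode: a query image plus K support
    image/mask pairs that all contain the same novel class c."""
    query_img = torch.rand(3, image_size, image_size)                     # I_Q
    query_mask = torch.randint(0, 2, (image_size, image_size))            # M_Q (evaluation only)
    support_imgs = torch.rand(k_shots, 3, image_size, image_size)         # I_S^1..I_S^K
    support_masks = torch.randint(0, 2, (k_shots, image_size, image_size))  # M_S^1..M_S^K
    return query_img, query_mask, support_imgs, support_masks
```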
4. Proposed Method

In this section, we first introduce the two proposed modules for prototype generation and matching, namely superpixel-guided clustering (SGC) and guided prototype allocation (GPA). Then, we discuss the adaptive ability of these two modules. After that, we introduce the overall network architecture, named Adaptive Superpixel-guided Network (ASGNet), which integrates the SGC and GPA modules in one model. The overall structure is shown in Figure 2. Finally, we elaborate the k-shot setting in ASGNet.

Figure 2. Overall architecture of the proposed Adaptive Superpixel-guided Network (a shared CNN extracts support and query features; SGC produces superpixel centroids from the support, GPA allocates them to the query feature, followed by an FPN-like top-down decoder).

4.1. Superpixel-guided Clustering

The core idea of SGC is inspired by the superpixel sampling network (SSN) [13] and MaskSLIC [12]. SSN was the first end-to-end trainable deep network for superpixel segmentation. The key contribution of SSN is converting the nearest-neighbor operation in SLIC [1] into a differentiable one. The traditional SLIC superpixel algorithm applies k-means clustering iteratively with two steps: pixel-superpixel association and superpixel centroid update. Based on color similarity and proximity, pixels are assigned to different superpixel centroids. Specifically, the input image I ∈ R^{n×5} is usually represented in a five-dimensional space (labxy) with n pixels, where lab represents the pixel vector in CIELAB color space and xy indicates the pixel location. After iterative clustering, the algorithm outputs the association map where each pixel is assigned to one of the m superpixels.

This simple method inspires an insightful idea, which is to aggregate the feature map into multiple superpixel centroids in a clustering manner, where the superpixel centroids can serve as prototypes. Therefore, instead of computing the superpixel centroids in image space, we estimate them in feature space by clustering similar feature vectors. The whole SGC process is delineated in Algorithm 1.

Algorithm 1 Superpixel-guided Clustering (SGC)
Input: support feature F_s, support mask M_s, and initial superpixel seeds S^0
  concatenate the absolute coordinates to F_s
  extract masked features F'_s via support mask M_s
  for each iteration t do
    compute the association between each pixel p and superpixel i: Q_{pi}^t = e^{-\|F'_p - S_i^{t-1}\|^2}
    update the superpixel centroids: S_i^t = (1/Z_i^t) \sum_{p=1}^{N_m} Q_{pi}^t F'_p, with Z_i^t = \sum_p Q_{pi}^t
  end for
  remove the coordinate information
  return final superpixel centroids S   ▷ shape (N_sp, C)

Given support feature F_s ∈ R^{c×h×w}, support mask M_s ∈ R^{h×w} and initial superpixel seeds S^0 ∈ R^{c×N_sp} (N_sp is the number of superpixels), we aim to obtain the final superpixel centroids, which act as multiple compact prototypes. First, we concatenate the coordinates of each pixel with the support feature map to introduce positional information. Then, we define the distance function D following SLIC [1], which consists of a feature distance and a spatial distance:

D = \sqrt{(d_f)^2 + (d_s / r)^2},   (1)

where d_f and d_s are the Euclidean distances for features and coordinate values, and r is a weighting factor. We filter out the background information with the support mask and only keep the masked features, compressing the feature map from F_s ∈ R^{c×h×w} to F'_s ∈ R^{c×N_m}, where N_m is the number of pixels inside the support mask.

Then we compute superpixel-based prototypes in an iterative fashion. For each iteration t, we first compute the association map Q^t between each pixel p and all superpixels according to the distance function D:

Q_{pi}^t = e^{-D(F'_p, S_i^{t-1})} = e^{-\|F'_p - S_i^{t-1}\|^2}.   (2)

Then, the new superpixel centroids are updated as the weighted sum of the masked features:

S_i^t = \frac{1}{Z_i^t} \sum_{p=1}^{N_m} Q_{pi}^t F'_p,   (3)

where Z_i^t = \sum_p Q_{pi}^t is a normalization constant. The above process is visualized in Figure 3.

Figure 3. Illustration of the proposed Superpixel-guided Clustering (coordinates are appended to the masked support features, which are iteratively aggregated into superpixel centroids).
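To make the clustering step concrete, the following is a minimal PyTorch-style sketch of the SGC iteration in Eqs. (2)-(3): masked feature extraction, soft pixel-superpixel association, and centroid update. The tensor layout and the `sgc` function signature are our own illustration (not the released implementation), and the coordinate weighting factor r from Eq. (1) is omitted for brevity.

```python
import torch

def sgc(feat_s, mask_s, seeds, n_iters=5):
    """Superpixel-guided clustering sketch.
    feat_s: (C, H, W) support feature, mask_s: (H, W) binary mask,
    seeds:  (N_sp, C+2) initial centroids in the coordinate-augmented space."""
    C, H, W = feat_s.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=0).float()              # absolute xy coordinates, (2, H, W)
    feat = torch.cat([feat_s, coords], dim=0)                   # append coordinates
    fg = mask_s.flatten().bool()
    feat = feat.flatten(1)[:, fg].t()                           # masked features F'_s, (N_m, C+2)
    centroids = seeds
    for _ in range(n_iters):
        dist2 = torch.cdist(feat, centroids).pow(2)             # ||F'_p - S_i||^2, (N_m, N_sp)
        assoc = torch.exp(-dist2)                               # Eq. (2): soft association Q
        z = assoc.sum(dim=0, keepdim=True)                      # normalization constants Z_i
        centroids = (assoc.t() @ feat) / z.t().clamp(min=1e-6)  # Eq. (3): weighted centroid update
    return centroids[:, :C]                                     # drop coordinate channels -> (N_sp, C)
```

In practice the spatial channels would be rescaled by r before the distance computation; the structure of the loop is otherwise exactly the two alternating SLIC-style steps described above.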
Here, we elaborate on the selection of the initial seeds. Generally, in a superpixel algorithm, an H × W image is evenly partitioned into regular grid cells of size h × w, and each grid cell is considered as an initial seed (i.e., superpixel). However, this initialization is not suitable for our purposes, where a foreground mask is given for the support image and we only need to initialize seeds inside this foreground region. To uniformly initialize seeds in the masked region, we refer to MaskSLIC [12] for iteratively placing each initial seed, and the pipeline is depicted in Figure 5. This seed initialization results in faster convergence of superpixel-guided clustering with only a few iterations.

Figure 5. The pipeline of placing initial seeds (binary mask → Step 1 → Step 2 → final coordinates). Step 1: we conduct a distance transform and choose the maximum point. Step 2: we set that point to False and then repeat Step 1. We iterate Step 2 until we obtain the coordinates of all initial seeds.
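The MaskSLIC-style placement in Figure 5 can be sketched with a Euclidean distance transform, e.g. using SciPy; this is a simplified illustration under our own naming, not the exact MaskSLIC implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def place_seeds(mask: np.ndarray, n_seeds: int):
    """Iteratively place seeds inside a binary foreground mask (Figure 5 pipeline).
    Each new seed is the point farthest from the background and from previously placed seeds."""
    free = mask.astype(bool).copy()
    seeds = []
    for _ in range(n_seeds):
        dist = distance_transform_edt(free)          # Step 1: distance transform
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
        if dist[y, x] == 0:                           # no foreground pixels left
            break
        seeds.append((y, x))
        free[y, x] = False                            # Step 2: remove the chosen point and repeat
    return seeds
```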
4.2. Guided Prototype Allocation

After extracting prototypes, previous methods mostly follow the design of CANet [40], expanding a single prototype to the same size as the query feature and concatenating them together. However, this operation results in equal guidance to every location in the query feature. To make prototype matching more adaptive to the query image content, we propose the Guided Prototype Allocation (GPA), illustrated in Figure 4. We first compute the cosine distance to measure the similarity between each prototype and each query feature element:

C_i^{x,y} = \frac{S_i \cdot F_q^{x,y}}{\|S_i\| \cdot \|F_q^{x,y}\|}, \quad i \in \{1, 2, ..., N_{sp}\},   (4)

where S_i ∈ R^{c×1} is the i-th superpixel centroid (prototype), and F_q^{x,y} ∈ R^{c×1} is the feature vector at location (x, y) of the query feature. We use this similarity information as input to a two-branch structure. The first branch computes which prototype is the most similar at each pixel location as follows:

G^{x,y} = \arg\max_{i \in \{0, .., N_{sp}\}} C_i^{x,y},   (5)

where the argmax operator is adopted to obtain G^{x,y}, a single index value representing a particular prototype. Putting all index values together, we get a guide map G ∈ R^{h×w}. Then, by placing the corresponding prototype at each position of the guide map, we obtain the guide feature F_G ∈ R^{c×h×w} to achieve pixel-wise guidance. In the other branch, all the similarity information C is added up across all the superpixels to get the probability map P.

Finally, we concatenate the probability map and the guide feature with the original query feature F_Q to provide the guiding information, and thus obtain the refined query feature F'_Q:

F'_Q = f(F_Q ⊕ F_G ⊕ P),   (6)

where ⊕ indicates the concatenation operation along the channel dimension, and f(·) is a 1 × 1 convolution.

Figure 4. Illustration of the proposed Guided Prototype Allocation (cosine similarity between superpixel centroids and the query feature yields a guide map and a probability map, which are merged with the query feature).
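A compact PyTorch-style sketch of GPA (Eqs. (4)-(6)) might look as follows; the module name, tensor layout and the placement of the 1 × 1 convolution are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPA(nn.Module):
    """Guided Prototype Allocation sketch: allocate the best-matching
    prototype to every query location and merge it back (Eqs. 4-6)."""
    def __init__(self, channels: int):
        super().__init__()
        # f(.): 1x1 convolution merging [query feature, guide feature, probability map]
        self.merge = nn.Conv2d(2 * channels + 1, channels, kernel_size=1)

    def forward(self, feat_q, prototypes):
        # feat_q: (B, C, h, w); prototypes: (B, N_sp, C)
        B, C, h, w = feat_q.shape
        q = F.normalize(feat_q.flatten(2), dim=1)                # (B, C, h*w)
        p = F.normalize(prototypes, dim=2)                       # (B, N_sp, C)
        sim = torch.bmm(p, q)                                    # Eq. (4): cosine similarity, (B, N_sp, h*w)
        guide_idx = sim.argmax(dim=1)                            # Eq. (5): guide map G, (B, h*w)
        guide_feat = torch.gather(
            prototypes, 1, guide_idx.unsqueeze(-1).expand(-1, -1, C))     # (B, h*w, C)
        guide_feat = guide_feat.permute(0, 2, 1).reshape(B, C, h, w)      # guide feature F_G
        prob = sim.sum(dim=1).reshape(B, 1, h, w)                # probability map P (sum over prototypes)
        refined = self.merge(torch.cat([feat_q, guide_feat, prob], dim=1))  # Eq. (6)
        return refined
```

The two branches of Figure 4 correspond to the argmax/gather path (guide feature) and the summation path (probability map); both consume the same similarity tensor computed once per query location.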

4.3. Adaptability

As mentioned before, we argue that one of the key attributes of the proposed network is its adaptive ability for few-shot semantic segmentation. In Figure 6, we provide some examples illustrating the adaptive ability of SGC and GPA. In SGC, to make it adaptive to object scale, we define a criterion to regulate the number of superpixel centroids as

N_{sp} = \min\left(\left\lfloor N_m / S_{sp} \right\rfloor, N_{max}\right),   (7)

where N_m is the number of pixels in the support mask and S_{sp} is the average area assigned to each initial superpixel seed, which we set to 100 empirically. When the foreground is fairly small (N_{sp} = 0 or 1), this method degrades to general masked average pooling, as shown in Figure 6(a). In addition, to reduce the computational burden, we set a hyper-parameter N_{max} to constrain the maximum number of prototypes. In GPA, we can observe its adaptability to object shape; in other words, it is resilient to occlusion. When severe occlusion exists in the query image, e.g., Figure 6(b), GPA can choose the best matched prototype for each query feature location.

Figure 6. Illustration of the model adaptability in prototype learning and allocation: (a) SGC is adaptive to object scale variation; (b) GPA is adaptive to object shape variation. SGC and GPA indicate the proposed superpixel-guided clustering and guided prototype allocation respectively, and MAP denotes the masked average pooling proposed in [42].
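As a small illustration of Eq. (7), assuming the floor reading of the brackets and the paper's settings S_sp = 100 and N_max = 5:

```python
def num_superpixels(n_mask_pixels: int, s_sp: int = 100, n_max: int = 5) -> int:
    """Adaptive number of superpixel centroids, Eq. (7)."""
    n_sp = min(n_mask_pixels // s_sp, n_max)
    # The paper notes that N_sp = 0 or 1 degrades to masked average pooling,
    # so we fall back to a single prototype for tiny foreground regions.
    return max(n_sp, 1)
```

For example, a support mask covering 720 feature-map pixels yields min(7, 5) = 5 prototypes, while a 150-pixel mask yields a single prototype.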

4.4. Adaptive Superpixel-guided Network

Based on the above SGC and GPA modules, we propose the Adaptive Superpixel-guided Network (ASGNet) for few-shot semantic segmentation, as illustrated in Figure 2. First, the support and query images are fed into a shared CNN (pretrained on ImageNet [14]) to extract features. Then, by passing the support features through SGC with the support mask, we obtain the superpixel centroids, which are considered as prototypes. After that, for more accurate pixel-wise guidance, we adopt the GPA module to match the prototypes with the query feature. Finally, we use the feature enrichment module [29] and set up an FPN-like [16] top-down structure to introduce multi-scale information. As demonstrated in [29], transferring features from fine to coarse promotes feature interaction, so we follow their design for fast multi-scale aggregation. Finally, all the different scales are concatenated, and each scale yields a segmentation result for computing the loss.

4.5. K-shot Setting

In previous work, the k-shot setting is usually tackled via feature averaging or attention-based fusion. However, it turns out that the improvements from such strategies are minor, while requiring heavy computation. In contrast, based on the proposed SGC and GPA, our ASGNet can easily adopt an efficient k-shot strategy without collapsing support features. Specifically, for each support image and mask pair, we apply SGC to obtain its superpixel centroids. By collecting all superpixel centroids together, we get the overall superpixel centroids S from the k shots:

S = (S^1, S^2, ..., S^k),   (8)

where the number of superpixel centroids is N_{sp} = \sum_{i=1}^{k} N_{sp}^i. By doing so, GPA can receive a larger range of selections from multiple shots, and thus yield more accurate guidance to segment the object in the query image.
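The k-shot extension therefore reduces to concatenating the centroids from each shot before GPA. A sketch under our tensor-layout assumptions, reusing the `sgc` sketch from Section 4.1:

```python
import torch

def collect_kshot_prototypes(support_feats, support_masks, seeds_per_shot, n_iters=5):
    """Run SGC on every (support feature, mask) pair and concatenate the resulting
    centroids, giving N_sp = sum_i N_sp^i prototypes in total (Eq. 8)."""
    all_centroids = []
    for feat_s, mask_s, seeds in zip(support_feats, support_masks, seeds_per_shot):
        all_centroids.append(sgc(feat_s, mask_s, seeds, n_iters))  # sgc(): sketch above
    return torch.cat(all_centroids, dim=0)   # (sum_i N_sp^i, C), fed to GPA unchanged
```

Because GPA selects prototypes per query location, no averaging or attention-based fusion of the support features is needed; the extra shots only enlarge the pool of candidate prototypes.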
5. Experiments

5.1. Datasets and Evaluation Metric

We choose Pascal-5i and COCO-20i, two widely used datasets for few-shot semantic segmentation, to analyze our model performance. Pascal-5i [23] includes images from PASCAL VOC 2012 [5] and extra annotations from SBD [9]. The 20 categories are evenly partitioned into four splits, and model training is conducted in a cross-validation manner. Specifically, three splits are selected for training and the remaining one is used for testing. During inference, 1000 support-query pairs are randomly sampled for evaluation [23]. Different from Pascal-5i, MSCOCO [17] is a large-scale dataset with 82,081 images in the training set. Following FWB [20], the 80 classes of MSCOCO [17] are evenly divided into four splits with the same cross-validation strategy. For more stable results, we randomly sample 20,000 pairs during evaluation [29].

We use mean intersection-over-union (mIoU) as the primary evaluation metric for the ablation study, as it is commonly used in image segmentation. In addition, for more consistent comparisons, results of foreground-background IoU (FB-IoU) are also reported.

5.2. Implementation Details

We adopt ResNet [11] as the backbone network [40], and we concatenate block2 and block3 to generate feature maps. We train the model with an SGD optimizer for 200 epochs on Pascal-5i and 50 epochs on COCO-20i. We set the initial learning rate to 0.0025 with batch size 4 on Pascal-5i, and 0.005 with batch size 8 on COCO-20i. The number of iterations in the SGC module is set to 10 during training and 5 during inference. We use data augmentation during training: input images are transformed with random scaling, horizontal flipping and rotation in [-10, 10] degrees, and then all images are cropped to 473 × 473 (Pascal) or 641 × 641 (COCO) as training samples. We implement our model in PyTorch and run our experiments on a workstation with an Nvidia Tesla V100 GPU. To increase the variance of the cosine measurement, we remove the ReLU layer before both the support and query features, so the similarity metric is bounded in [-1, 1] rather than [0, 1].
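The training schedule above can be summarized as a small configuration sketch (all values are taken from this section; the dictionary layout itself is only illustrative):

```python
TRAIN_CONFIG = {
    "pascal-5i": {"epochs": 200, "lr": 0.0025, "batch_size": 4, "crop": 473},
    "coco-20i":  {"epochs": 50,  "lr": 0.005,  "batch_size": 8, "crop": 641},
    "optimizer": "SGD",
    "sgc_iterations": {"train": 10, "inference": 5},
    "augmentation": ["random scale", "horizontal flip", "rotation in [-10, 10] deg"],
}
```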
5.3. Ablation Study

To verify the effectiveness of the proposed modules, we conduct extensive ablation studies with a ResNet-50 backbone on Pascal-5i. We use floating point operations (FLOPs) to represent the amount of computation, taking both additions and multiplications into account.

Number of Superpixel Centroids. To explore the effect of the number of superpixel centroids, we conduct experiments with different Nmax in 1-shot segmentation. As stated in Section 4.3, Nmax is a hyper-parameter set to regulate the maximum number of prototypes. In particular, when Nmax = 1, the prototype generation process degrades to masked average pooling. As shown in Table 1, when Nmax equals 5, ASGNet achieves the best results on splits 0 and 2, as well as the best mean performance, which demonstrates the validity of superpixel-guided clustering. The results improve gradually as Nmax increases from 1 to 5, and then decrease slightly after that, indicating that excessive prototypes bring a negative effect and are prone to over-fitting. Finally, we choose Nmax = 5 for both 1-shot and 5-shot segmentation. In addition, we analyzed the impact of the proposed adaptive scheme on the number of superpixel centroids Nsp and show the results in Table 2. Compared with using a fixed number (5) of superpixels, the results show that the adaptive setting (Eqn. 7) can reduce redundant computation while obtaining better performance.

Table 1. Ablation study on the maximum number of superpixel centroids Nmax. s-x denotes different cross-validation splits.

Nmax | s-0 | s-1 | s-2 | s-3 | mean
1 | 57.40 | 67.52 | 55.69 | 54.36 | 58.74
3 | 57.92 | 67.86 | 55.86 | 53.74 | 58.85
5 | 58.84 | 67.86 | 56.79 | 53.66 | 59.29
7 | 58.84 | 68.11 | 55.85 | 54.00 | 59.20
9 | 57.61 | 67.97 | 56.68 | 53.27 | 58.88

Table 2. Performance comparison between fixed and adaptive superpixel centroids.

Centroids | s-0 | s-1 | s-2 | s-3 | mean
Fixed | 57.87 | 67.62 | 57.10 | 53.61 | 59.05
Adaptive | 58.84 | 67.86 | 56.79 | 53.66 | 59.29

SGC and GPA. To demonstrate the effectiveness of the proposed SGC and GPA modules, we conduct diagnostic experiments on prototype generation and matching. We first implement a baseline model with the single prototype learning proposed in PFENet [29]. Then, we introduce SGC to generate multiple prototypes and fuse them in a dense manner with reference to PMMs [34]. Finally, we replace the prototype expansion with the proposed allocation scheme. For ablations on the SGC module, as shown in Table 3, in the 1-shot setting we observe that replacing masked average pooling (MAP) with SGC can worsen the representations and lead to degraded performance, but the model benefits in the 5-shot scenario. We deem the main reason to be that excessive prototypes become highly similar on a single support sample, and cosine distance cannot tell them apart. Finally, when the GPA module is adopted for prototype matching, the performance improves by 2.70% compared to prototype expansion, and the computational overhead is also much lower.

Table 3. Ablation study on prototype generation (masked average pooling (MAP) vs. SGC) and matching (Expand vs. GPA). FLOPs∆ denotes the computational cost of the prototype matching process, and K is the adaptive number of prototypes (K ≤ 5).

Generation | Matching | mIoU 1-shot | mIoU 5-shot | FLOPs∆
MAP | Expand | 58.96 | 60.19 | 0.9G
SGC | Expand | 58.31 | 61.24 | K*8.5G + 0.5G
SGC | GPA | 59.29 | 63.94 | K*5.5M + 0.9G

K-shot Fusion Setting. As mentioned in Section 4.5, ASGNet is able to tackle the k-shot learning problem without collapsing the support features. To explore the effect of different fusion methods, we compare our k-shot setting with two other commonly used solutions: 1) feature average fusion [21] and 2) attention weighted summation [40]. For simplicity, the experiments in this ablation study are only conducted on split 0 of Pascal-5i. As reported in Table 4, our simple strategy achieves the best performance and the largest increment over the 1-shot baseline (4.82%) without additional computation. On the contrary, attention-based fusion requires a large amount of computation and has limited performance improvement. This demonstrates that the GPA module is highly effective when given a large number of selections.

Table 4. Ablation study of different k-shot fusion strategies.

5-shot setting | mIoU | IoU∆ | FLOPs∆
1-shot baseline | 58.84 | - | -
Attention | 60.69 | 1.85 | 42.6G
Feature-avg | 62.10 | 3.26 | 3.7M
Ours | 63.66 | 4.82 | None
5.4. Comparison to State-of-the-art

Pascal-5i. Comparisons to state-of-the-art methods are shown in Tables 5 and 6, where two different metrics are adopted. In Table 5, with ResNet-101 as the backbone, ASGNet outperforms recent methods by a considerable margin of 2.40% in 5-shot segmentation, while being on par with the state-of-the-art under the 1-shot setting. In Table 6, we further make comparisons on FB-IoU and the number of trainable parameters. Once again, the proposed ASGNet achieves significant improvement over the state-of-the-art in the 5-shot setting (75.2% vs. 73.9%), and it also has the largest performance increment of 5.0% over its 1-shot result. In addition, we have far fewer trainable parameters than other methods. In Figure 7, we show some representative segmentation examples. We observe that the proposed ASGNet can generate accurate segmentation results even when there are large variations in appearance and pose between support and query images.

Table 5. Comparison with the state-of-the-art on Pascal-5i (mIoU). † indicates multi-scale inference is adopted. ∆ means increment over the 1-shot segmentation result. For each setting, the five values are s-0, s-1, s-2, s-3 and mean.

Backbone | Methods | 1-shot (s-0, s-1, s-2, s-3, mean) | 5-shot (s-0, s-1, s-2, s-3, mean) | ∆
VGG-16 | OSLSM [23] | 33.60, 55.30, 40.90, 33.50, 40.80 | 35.90, 58.10, 42.70, 39.10, 43.95 | 3.15
VGG-16 | co-FCN [21] | 36.70, 50.60, 44.90, 32.40, 41.10 | 37.50, 50.00, 44.10, 33.90, 41.40 | 0.30
VGG-16 | AMP [24] | 41.90, 50.20, 46.70, 34.40, 43.40 | 40.30, 55.30, 49.90, 40.10, 46.40 | 3.00
VGG-16 | SG-One [42] | 40.20, 58.40, 48.40, 38.40, 46.30 | 41.90, 58.60, 48.60, 39.40, 47.10 | 0.80
VGG-16 | PANet [32] | 42.30, 58.00, 51.10, 41.20, 48.10 | 51.80, 64.60, 59.80, 46.50, 55.70 | 7.60
VGG-16 | FWB [20] | 47.04, 59.64, 52.61, 48.27, 51.90 | 50.87, 62.86, 56.48, 50.09, 55.08 | 3.18
ResNet50 | CANet† [40] | 52.50, 65.90, 51.30, 51.90, 55.40 | 55.50, 67.80, 51.90, 53.20, 57.10 | 1.70
ResNet50 | PGNet† [39] | 56.00, 66.90, 50.60, 50.40, 56.00 | 57.70, 68.70, 52.90, 54.60, 58.50 | 2.50
ResNet50 | RPMMs [34] | 55.15, 66.91, 52.61, 50.68, 56.34 | 56.28, 67.34, 54.52, 51.00, 57.30 | 0.96
ResNet50 | SimPropNet [8] | 54.82, 67.33, 54.52, 52.02, 57.19 | 57.20, 68.50, 58.40, 56.05, 60.04 | 2.85
ResNet50 | PPNet [18] | 47.83, 58.75, 53.80, 45.63, 51.50 | 58.39, 67.83, 64.88, 56.73, 61.96 | 10.46
ResNet50 | PFENet [29] | 61.70, 69.50, 55.40, 56.30, 60.80 | 63.10, 70.70, 55.80, 57.90, 61.90 | 1.10
ResNet50 | ASGNet (ours) | 58.84, 67.86, 56.79, 53.66, 59.29 | 63.66, 70.55, 64.17, 57.38, 63.94 | 4.65
ResNet101 | FWB [20] | 51.30, 64.49, 56.71, 52.24, 56.19 | 54.84, 67.38, 62.16, 55.30, 59.92 | 3.73
ResNet101 | DAN† [31] | 54.70, 68.60, 57.80, 51.60, 58.20 | 57.90, 69.00, 60.10, 54.90, 60.50 | 2.30
ResNet101 | PFENet [29] | 60.50, 69.40, 54.40, 55.90, 60.10 | 62.80, 70.40, 54.90, 57.60, 61.40 | 1.30
ResNet101 | ASGNet (ours) | 59.84, 67.43, 55.59, 54.39, 59.31 | 64.55, 71.32, 64.24, 57.33, 64.36 | 5.05

Table 6. Comparison of FB-IoU and the number of trainable parameters on Pascal-5i. † indicates multi-scale inference is adopted. ∆ means increment over the 1-shot segmentation result.

Methods | FB-IoU 1-shot | FB-IoU 5-shot | ∆ | #Params
OSLSM [23] | 61.3 | 61.5 | 0.2 | 272.6M
co-FCN [21] | 60.1 | 60.2 | 0.1 | 34.2M
AMP [24] | 62.2 | 63.8 | 1.6 | 14.9M
SG-One [42] | 63.1 | 65.9 | 2.8 | 19.0M
CANet† [40] | 66.2 | 69.6 | 3.4 | 10.5M
PGNet† [39] | 69.9 | 70.5 | 0.6 | -
PANet [32] | 66.5 | 70.7 | 4.2 | 14.7M
DAN† [31] | 71.9 | 72.3 | 0.4 | -
SimPropNet [8] | 73.0 | 72.9 | -0.1 | -
PFENet [29] | 73.3 | 73.9 | 0.6 | 10.8M
ASGNet (RN-50) | 69.2 | 74.2 | 5.0 | 10.4M
ASGNet (RN-101) | 71.7 | 75.2 | 3.5 | 10.4M

Figure 7. Qualitative examples of 1-shot segmentation on Pascal-5i (rows: support, prediction, ground truth). Best viewed in color and zoomed in.

COCO-20i. In Table 7, we present the performance comparison of mean IoU and FB-IoU on COCO-20i. As can be seen, ASGNet achieves state-of-the-art results in both the 1-shot and 5-shot settings in terms of mean IoU, and it substantially outperforms previous methods. In particular, ASGNet achieves margins of 3.98% and 6.96% higher mean IoU over RPMMs [34] in 1-shot and 5-shot segmentation, respectively. Also, our ASGNet obtains competitive 1-shot results and top-performing 5-shot results with respect to FB-IoU. These results demonstrate that the proposed method is capable of handling more complex cases, as MSCOCO is a much more challenging dataset with diverse samples and categories.

Table 7. Comparison with the state-of-the-art on COCO-20i.

Backbone | Methods | mIoU 1-shot | mIoU 5-shot | FB-IoU 1-shot | FB-IoU 5-shot
ResNet101 | FWB [20] | 21.19 | 23.65 | - | -
ResNet101 | DAN [31] | 24.20 | 29.60 | 62.30 | 63.90
ResNet101 | PFENet [29] | 32.40 | 37.40 | 58.60 | 61.90
ResNet50 | RPMMs [34] | 30.58 | 35.52 | - | -
ResNet50 | ASGNet | 34.56 | 42.48 | 60.39 | 66.96

6. Conclusions

In this paper we propose ASGNet for few-shot image segmentation. Targeting the limitations of existing single-prototype based models, we introduce two new modules, named Superpixel-guided Clustering (SGC) and Guided Prototype Allocation (GPA), for adaptive prototype learning and allocation. Concretely, SGC aggregates similar feature vectors with feature-based superpixel clustering, and GPA allocates the most relevant prototype to each query feature element by measuring the similarity with cosine distance. Extensive experiments and ablation studies have demonstrated the superiority of ASGNet, and we achieve state-of-the-art performance on both Pascal-5i and COCO-20i without any additional post-processing steps.

Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C101215911).
References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. PAMI, 34(11):2274–2282, 2012.
[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018.
[3] Sungha Choi, Joanne T Kim, and Jaegul Choo. Cars can't fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In CVPR, pages 9373–9383, 2020.
[4] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018.
[5] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019.
[8] Siddhartha Gairola, Mayur Hemani, Ayush Chopra, and Balaji Krishnamurthy. SimPropNet: Improved similarity propagation for few-shot image segmentation. In IJCAI, 2020.
[9] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, pages 991–998. IEEE, 2011.
[10] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, pages 7519–7528, 2019.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[12] Benjamin Irving. maskSLIC: Regional superpixel generation with application to local pathology characterisation in medical images. arXiv preprint arXiv:1606.09518, 2016.
[13] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In ECCV, pages 352–368, 2018.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[15] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, pages 9167–9176, 2019.
[16] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[18] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In ECCV, 2020.
[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[20] Khoi Nguyen and Sinisa Todorovic. Feature weighting and boosting for few-shot segmentation. In ICCV, pages 622–631, 2019.
[21] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alyosha Efros, and Sergey Levine. Conditional networks for few-shot semantic segmentation. In ICLR Workshop, 2018.
[22] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[23] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
[24] Mennatullah Siam, Boris N Oreshkin, and Martin Jagersand. AMP: Adaptive masked proxies for few-shot segmentation. In ICCV, pages 5249–5258, 2019.
[25] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4077–4087, 2017.
[26] David Stutz, Alexander Hermans, and Bastian Leibe. Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding, 166:1–27, 2018.
[27] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
[28] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.
[29] Z Tian, H Zhao, M Shu, Z Yang, R Li, and J Jia. Prior guided feature enrichment network for few-shot segmentation. PAMI, 2020.
[30] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, pages 3630–3638, 2016.
[31] Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, and Xiantong Zhen. Few-shot semantic segmentation with democratic attention networks. In ECCV, 2020.
[32] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In ICCV, pages 9197–9206, 2019.
[33] Yu-Xiong Wang and Martial Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, pages 616–634. Springer, 2016.
[34] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In ECCV, 2020.
[35] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, pages 3684–3692, 2018.
[36] Xianghui Yang, Bairun Wang, Kaige Chen, Xinchi Zhou, Shuai Yi, Wanli Ouyang, and Luping Zhou. BriNet: Towards bridging the intra-class and inter-class gaps in one-shot segmentation. In BMVC, 2020.
[37] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[38] Yuhui Yuan and Jingdong Wang. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[39] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In ICCV, pages 9587–9595, 2019.
[40] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In CVPR, pages 5217–5226, 2019.
[41] Fan Zhang, Yanqin Chen, Zhihang Li, Zhibin Hong, Jingtuo Liu, Feifei Ma, Junyu Han, and Errui Ding. ACFNet: Attentional class feature network for semantic segmentation. In ICCV, pages 6798–6807, 2019.
[42] Xiaolin Zhang, Yunchao Wei, Yi Yang, and Thomas S Huang. SG-One: Similarity guidance network for one-shot semantic segmentation. IEEE Transactions on Cybernetics, 2020.
[43] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
[44] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In ICCV, pages 593–602, 2019.