
CRIS: CLIP-Driven Referring Image Segmentation

Zhaoqing Wang1,2 * Yu Lu3∗ Qiang Li4∗ Xunqiang Tao3 Yandong Guo3


Mingming Gong5 Tongliang Liu1
1 University of Sydney; 2 OPPO Research Institute; 3 Beijing University of Posts and Telecommunications;
4 Kuaishou Technology; 5 University of Melbourne
{derrickwang005,leetsiang.cloud,taoxunqiang}@gmail.com; [email protected]
[email protected]; [email protected]; [email protected]

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to well align text and pixel-level features. Existing approaches use pretrained models to facilitate learning, yet they separately transfer the language/vision knowledge from pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. The experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the previous state-of-the-art methods without any post-processing. The code will be released.

Figure 1. An illustration of our main idea. (a) CLIP [39] jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of images I and texts T, which captures the multi-modal corresponding information. (b) To transfer this knowledge of the CLIP model from the image level to the pixel level, we propose a CLIP-Driven Referring Image Segmentation (CRIS) framework. Firstly, we design a vision-language decoder to propagate fine-grained semantic information from textual features to pixel-level visual features. Secondly, we combine all pixel-level visual features V with the global textual feature T and adopt contrastive learning to pull text and related pixel-wise features closer and push other irrelevant ones away.

1. Introduction

Referring image segmentation [15, 49, 50] is a fundamental and challenging task at the intersection of vision and language understanding, which can potentially be used in a wide range of applications, including interactive image editing and human-object interaction. Unlike semantic and instance segmentation [9, 11, 13, 47], which require segmenting the visual entities belonging to a pre-determined set of categories, referring image segmentation is not limited to specific categories but instead finds a particular region according to the input language expression.

* Equal technical contribution.

Since the image and language modalities have different properties, it is difficult to explicitly align textual features with pixel-level activations. Benefiting from the powerful capacity of deep neural networks, early approaches [15, 22, 25, 33] concatenate textual features with each visual activation directly and use these combined features to generate the segmentation mask. Subsequently, to address the lack of adequate interaction between the two modalities, a series of methods [5, 17, 18, 42, 49] adopt the language-vision attention mechanism to better learn cross-modal features.

Figure 2. Comparison between the direct fine-tuning and our proposed method, shown for the expressions "a blond haired, blue eyed young boy in a blue jacket" and "a zebra ahead of the other zebra" with panels (a) Image, (b) GT, (c) Naive, and (f) Ours. "Naive" denotes the direct fine-tuning mentioned in Section 4. Compared with the direct fine-tuning, our method can not only leverage the powerful cross-modal matching capability of CLIP, but also learn fine-grained visual representations.

Existing methods [5, 17, 18, 42, 49] have in common that they leverage external knowledge to facilitate learning, yet they mainly utilize single-modal pretraining (e.g., a pretrained image or text encoder), which lacks multi-modal correspondence information. By resorting to language supervision from large-scale unlabeled data, vision-language pretraining [34, 39, 46] is able to learn ample multi-modal representations. Recently, the remarkable success of CLIP [39] has shown its capability of learning state-of-the-art image-level visual concepts from 400 million image-text pairs, which helps many multi-modal tasks achieve significant improvements, including image-text retrieval [39] and video-text retrieval [7, 31]. However, as shown in Figure 2, the direct usage of CLIP can be sub-optimal for pixel-level prediction tasks, e.g., referring image segmentation, due to the discrepancy between image-level and pixel-level prediction. The former focuses on the global information of an input image, while the latter needs to learn fine-grained visual representations for each spatial activation.

In this paper, we explore leveraging the powerful knowledge of the CLIP model for referring image segmentation in order to enhance the ability of cross-modal matching. Considering the characteristics of referring image segmentation, we propose an effective and flexible framework named CLIP-Driven Referring Image Segmentation (CRIS), which can transfer ample multi-modal corresponding knowledge of CLIP for achieving text-to-pixel alignment. Firstly, we propose a vision-language decoder that captures long-range dependencies of pixel-level features through the self-attention operation and adaptively propagates fine-structured textual features into pixel-level features through the cross-attention operation. Secondly, we introduce text-to-pixel contrastive learning, which can align linguistic features and the corresponding pixel-level features while distinguishing irrelevant pixel-level features in the multi-modal embedding space. Based on this scheme, the model can explicitly learn fine-grained visual concepts by intertwining the linguistic and pixel-level visual features.

Our main contributions are summarized as follows:

• We propose a CLIP-Driven Referring Image Segmentation framework (CRIS) to transfer the knowledge of the CLIP model for achieving text-to-pixel alignment.

• We take full advantage of this multi-modal knowledge with two innovative designs, i.e., the vision-language decoder and text-to-pixel contrastive learning.

• The experimental results on three challenging benchmarks significantly outperform previous state-of-the-art methods by large margins (e.g., +4.89 IoU on RefCOCO, +8.88 IoU on RefCOCO+, and +5.47 IoU on G-Ref).

2. Related Work

Vision-Language Pretraining. Vision-language pretraining has made rapid progress in recent years and achieved impressive performance on various multi-modal downstream tasks. By resorting to semantic supervision from large-scale image data, several approaches [34, 39, 46] were proposed to learn visual representations from text representations. MIL-NCE [34] mainly explored leveraging the noisy, large-scale HowTo100M [35] instructional videos to learn a better video encoder in an end-to-end manner. SimVLM [46] reduced the training complexity by leveraging large-scale weak supervision and adopted a single prefix language modeling objective in an end-to-end manner. Benefiting from large-scale image and text pairs collected from the Internet, a recent approach, Contrastive Language-Image Pretraining (CLIP) [39], achieved notable success in aligning the representations of the two modalities in the embedding space. CLIP adopted contrastive learning with high-capacity language models and visual feature encoders to capture compelling visual concepts for zero-shot image classification. More recently, a series of works [7, 31, 37, 43] were proposed to transfer the knowledge of CLIP models to downstream tasks and achieved promising results on, e.g., video captioning, video-text retrieval, and image generation. Different from these works, we transfer these image-level visual concepts to referring image segmentation for leveraging multi-modal corresponding information.
Contrastive Learning. Dating back to [10], these approaches learn representations by contrasting positive pairs against negative pairs. Several approaches [3, 4, 12, 23, 48] were proposed to treat each image as a class and use contrastive loss-based instance discrimination for representation learning. Recently, VADeR and DenseCL [38, 45] proposed to explore pixel-level contrastive learning to fill the gap between self-supervised representation learning and dense prediction tasks. Besides, CLIP [39] proposed a promising alternative that directly learns transferable visual concepts from large-scale collected image-text pairs by using a cross-modal contrastive loss. In this paper, we propose a CLIP-Driven Referring Image Segmentation (CRIS) framework to transfer the knowledge of the CLIP model to referring image segmentation in an end-to-end manner.

Referring Image Segmentation. Referring image segmentation aims to segment a target region (e.g., an object or stuff) in an image by understanding a given natural linguistic expression, and was first introduced by [15]. Early works [22, 25, 33] extracted visual and linguistic features with a CNN and an LSTM, respectively, and directly concatenated the two modalities to obtain the final segmentation results with an FCN [28]. In [50], a two-stage method was proposed that first extracted instances using Mask R-CNN [13] and then adopted linguistic features to choose the target from those instances. Besides, MCN [30] designed a framework achieving impressive results by learning to optimize two related tasks, i.e., referring expression comprehension and segmentation, simultaneously.

As the attention mechanism has attracted more and more interest, a series of works adopt it, since it is powerful for extracting the visual contents corresponding to the language expression. [42] used vision-guided linguistic attention to aggregate the linguistic context of each visual region adaptively. [49] designed a Cross-Modal Self-Attention (CMSA) module to focus on informative words in the sentence and crucial regions in the image. [16] proposed a bi-directional relationship inferring network that adopted a language-guided visual and vision-guided linguistic attention module to capture the mutual guidance between the two modalities. Besides, LTS [19] designs a strong pipeline that decouples the task into a "Locate-Then-Segment" scheme by introducing a position prior. EFNet [8] designs a co-attention mechanism that uses language to refine the multi-modal features progressively, which can promote the consistency of the cross-modal information representation. More recently, VLT [6] employs a transformer to build a network with an encoder-decoder attention mechanism for enhancing the global context information. Different from previous methods, we aim to leverage the knowledge of CLIP in order to improve the compatibility of multi-modal information and boost the ability of cross-modal matching.

3. Methodology

As illustrated in Figure 3, we introduce how the proposed CRIS framework transfers the knowledge of CLIP to referring image segmentation to achieve text-to-pixel alignment by leveraging multi-modal corresponding information. Firstly, we use a ResNet [14] and a Transformer [44] to extract image and text features, respectively, which are further fused to obtain the simple multi-modal features. Secondly, these features and the text features are fed into the vision-language decoder to propagate fine-grained semantic information from textual representations to pixel-level visual activations. Finally, we use two projectors to produce the final prediction mask and adopt the text-to-pixel contrastive loss to explicitly align the text features with the relevant pixel-level visual features.

3.1. Image & Text Feature Extraction

As illustrated in Figure 3, the input of our framework consists of an image I and a referring expression T.

Image Encoder. For an input image I ∈ R^{H×W×3}, we utilize multiple visual features from the 2nd-4th stages of the ResNet, which are defined as F_{v2} ∈ R^{(H/8)×(W/8)×C_2}, F_{v3} ∈ R^{(H/16)×(W/16)×C_3}, and F_{v4} ∈ R^{(H/32)×(W/32)×C_4}, respectively. Note that C is the feature dimension, and H and W are the height and width of the original image.

Text Encoder. For an input expression T ∈ R^L, we adopt a Transformer [44] modified by [40] to extract text features F_t ∈ R^{L×C}. The Transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocabulary size [41], and the text sequence is bracketed with [SOS] and [EOS] tokens. The activations of the highest layer of the transformer at the [EOS] token are further transformed into the global textual representation F_s ∈ R^{C'}. Note that C and C' are feature dimensions, and L is the length of the referring expression.
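As a concrete illustration of this pooling step, the snippet below sketches how the global representation F_s can be taken from the [EOS] position of a CLIP-style text Transformer and projected into R^{C'}. The module and argument names (TextPoolingSketch, eos_indices) are illustrative assumptions, not the released CRIS code.

```python
import torch
import torch.nn as nn

class TextPoolingSketch(nn.Module):
    """Pools CLIP-style token features into the global text representation F_s."""

    def __init__(self, c_dim: int, c_prime: int):
        super().__init__()
        # learnable projection mapping the [EOS] activation into R^{C'}
        self.proj = nn.Linear(c_dim, c_prime)

    def forward(self, token_feats: torch.Tensor, eos_indices: torch.Tensor) -> torch.Tensor:
        # token_feats: (B, L, C) activations of the highest transformer layer (F_t)
        # eos_indices: (B,) index of the [EOS] token in each tokenized expression
        batch = torch.arange(token_feats.size(0), device=token_feats.device)
        eos_feat = token_feats[batch, eos_indices]   # (B, C) activation at [EOS]
        return self.proj(eos_feat)                   # F_s: (B, C')
```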
Figure 3. The overview of the proposed CLIP-Driven Referring Image Segmentation (CRIS) framework. CRIS mainly consists of a text encoder, an image encoder, a cross-modal neck, a vision-language decoder, and two projectors. The vision-language decoder is used to adaptively propagate semantic information from textual features to visual features. The text-to-pixel contrastive learning is used to explicitly learn fine-grained multi-modal corresponding information by intertwining the text features and pixel-level visual features.

Cross-modal Neck. Given the multiple visual features and the global textual representation F_s, we obtain the simple multi-modal feature F_{m4} ∈ R^{(H/16)×(W/16)×C} by fusing F_{v4} with F_s:

F_{m4} = Up(\sigma(F_{v4} W_{v4}) \cdot \sigma(F_s W_s)),    (1)

where Up(·) denotes 2× upsampling, · denotes element-wise multiplication, σ denotes ReLU, and W_{v4} and W_s are two learnable matrices that transform the visual and textual representations into the same feature dimension. Then, the multi-modal features F_{m2} and F_{m3} are obtained by:

F_{m3} = [\sigma(F_{m4} W_{m4}), \sigma(F_{v3} W_{v3})],
F_{m2} = [\sigma(F_{m3} W_{m3}), \sigma(F'_{v2} W_{v2})],  F'_{v2} = Avg(F_{v2}),    (2)

where Avg(·) denotes average pooling with a 2 × 2 kernel and a stride of 2, and [,] is the concatenation operation. Subsequently, we concatenate the three multi-modal features and use a 1 × 1 convolution layer to aggregate them:

F_m = Conv([F_{m2}, F_{m3}, F_{m4}]),    (3)

where F_m ∈ R^{(H/16)×(W/16)×C}. Finally, we concatenate a 2D spatial coordinate feature F_{coord} ∈ R^{(H/16)×(W/16)×2} with F_m and fuse them by a 3 × 3 convolution [27]. The visual feature F_v ∈ R^{(H/16)×(W/16)×C} is calculated as follows,

F_v = Conv([F_m, F_{coord}]).    (4)

As shown in Figure 3, we flatten the spatial domain of F_v into a sequence, forming the visual feature F_v ∈ R^{N×C}, N = (H/16) × (W/16), which is utilized in the following process.
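To make the tensor bookkeeping of Eq. (1)-(4) concrete, here is a minimal PyTorch sketch of the cross-modal neck. The choice of 1 × 1 convolutions for the learnable matrices W_*, the channel width C = 512, and the normalized coordinate map are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalNeckSketch(nn.Module):
    """Sketch of Eq. (1)-(4): fuse multi-scale visual features with the global text feature F_s."""

    def __init__(self, c2: int, c3: int, c4: int, c_txt: int, c: int = 512):
        super().__init__()
        self.w_v4 = nn.Conv2d(c4, c, 1)                      # W_v4 in Eq. (1)
        self.w_s = nn.Linear(c_txt, c)                       # W_s  in Eq. (1)
        self.w_m4 = nn.Conv2d(c, c, 1)                       # W_m4 in Eq. (2)
        self.w_v3 = nn.Conv2d(c3, c, 1)                      # W_v3 in Eq. (2)
        self.w_m3 = nn.Conv2d(2 * c, c, 1)                   # W_m3 in Eq. (2)
        self.w_v2 = nn.Conv2d(c2, c, 1)                      # W_v2 in Eq. (2)
        self.aggregate = nn.Conv2d(5 * c, c, 1)              # 1x1 conv in Eq. (3)
        self.fuse_coord = nn.Conv2d(c + 2, c, 3, padding=1)  # 3x3 conv in Eq. (4)

    def forward(self, f_v2, f_v3, f_v4, f_s):
        # Eq. (1): F_m4 = Up(sigma(F_v4 W_v4) * sigma(F_s W_s)), upsampled to H/16 x W/16
        txt = F.relu(self.w_s(f_s))[:, :, None, None]
        f_m4 = F.interpolate(F.relu(self.w_v4(f_v4)) * txt,
                             scale_factor=2, mode="bilinear", align_corners=False)
        # Eq. (2): concatenate with the 1/16-scale and average-pooled 1/8-scale features
        f_m3 = torch.cat([F.relu(self.w_m4(f_m4)), F.relu(self.w_v3(f_v3))], dim=1)   # 2C channels
        f_v2p = F.avg_pool2d(f_v2, kernel_size=2, stride=2)                           # Avg(F_v2)
        f_m2 = torch.cat([F.relu(self.w_m3(f_m3)), F.relu(self.w_v2(f_v2p))], dim=1)  # 2C channels
        # Eq. (3): aggregate the three multi-modal features with a 1x1 convolution
        f_m = self.aggregate(torch.cat([f_m2, f_m3, f_m4], dim=1))
        # Eq. (4): append a 2-channel coordinate map and fuse with a 3x3 convolution
        b, _, h, w = f_m.shape
        ys = torch.linspace(-1, 1, h, device=f_m.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=f_m.device).view(1, 1, 1, w).expand(b, 1, h, w)
        f_v = self.fuse_coord(torch.cat([f_m, ys, xs], dim=1))
        # flatten the spatial domain into N = (H/16)*(W/16) visual tokens
        return f_v.flatten(2).transpose(1, 2)                # (B, N, C)
```

A forward pass with F_{v2}, F_{v3}, F_{v4} from the ResNet stages and F_s from the text encoder returns the N × C visual tokens consumed by the vision-language decoder described next.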
3.2. Vision-Language Decoder

We design a vision-language decoder to adaptively propagate fine-grained semantic information from textual features to visual features. As shown in Figure 3, the decoder module takes the textual features F_t ∈ R^{L×C} and the pixel-level visual features F_v ∈ R^{N×C} as inputs, which can provide ample textual information corresponding to the visual features. To capture positional information, fixed sine spatial positional encodings are added to F_v [2] and F_t [44], respectively. The vision-language decoder, composed of n layers, is applied to generate a sequence of evolved multi-modal features F_c ∈ R^{N×C}. Following the standard transformer architecture [44], each layer consists of a multi-head self-attention layer, a multi-head cross-attention layer, and a feed-forward network. In one decoder layer, F_v is first sent into the multi-head self-attention layer to capture global contextual information:

F'_v = MHSA(LN(F_v)) + F_v,    (5)

where F'_v is the evolved visual feature, and MHSA(·) and LN(·) denote the multi-head self-attention layer and Layer Normalization [1], respectively. The multi-head self-attention mechanism is composed of three point-wise linear layers mapping F_v to intermediate representations: queries Q ∈ R^{N×d_q}, keys K ∈ R^{N×d_k}, and values V ∈ R^{N×d_v}. Multi-head self-attention is then calculated as follows,

MHSA(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V.    (6)

After that, the multi-head cross-attention layer is adopted to propagate fine-grained semantic information into the evolved visual features, where one point-wise linear layer maps F'_v to Q, and the other two linear layers map F_t to K and V. To obtain the multi-modal feature F_c, the output query Q is further processed by an MLP block of two layers with Layer Normalization and residual connections:

F'_c = MHCA(LN(F'_v), F_t) + F'_v,
F_c = MLP(LN(F'_c)) + F'_c,    (7)

where MHCA(·) denotes the multi-head cross-attention layer, and F'_c is the intermediate feature. The evolved multi-modal feature F_c is utilized for the final segmentation mask. Note that the hyper-parameter n is discussed in the following experiment section.
4
3.3. Text-to-Pixel Contrastive Learning

Although CLIP [39] learns powerful image-level visual concepts by aligning the textual representation with the image-level representation, this type of knowledge is sub-optimal for referring image segmentation due to the lack of more fine-grained visual concepts.

To tackle this issue, we design a text-to-pixel contrastive loss, which explicitly aligns the textual features with the corresponding pixel-level visual features. As illustrated in Figure 3, an image projector and a text projector are adopted to transform F_c and F_s as follows:

z_v = F'_c W_v + b_v,  F'_c = Up(F_c),
z_t = F_s W_t + b_t,    (8)

where z_t ∈ R^D, z_v ∈ R^{N×D}, N = (H/4) × (W/4), Up denotes 4× upsampling, W_v and W_t are two learnable matrices that transform F_c and F_s into the same feature dimension D, and b_v and b_t are two learnable biases.

Given a transformed textual feature z_t and a set of transformed pixel-level features z_v, a contrastive loss is adopted to optimize the relationship between the two modalities, where z_t is encouraged to be similar to its corresponding z_v and dissimilar to other irrelevant z_v. With the similarity measured by the dot product, the text-to-pixel contrastive loss can be formulated as:

L_{con}^{i}(z_t, z_v^{i}) =
    -\log \sigma(z_t \cdot z_v^{i}),          i \in P,
    -\log(1 - \sigma(z_t \cdot z_v^{i})),     i \in N,    (9)

L_{con}(z_t, z_v) = (1 / |P \cup N|) \sum_{i \in P \cup N} L_{con}^{i}(z_t, z_v^{i}),    (10)

where P and N denote the classes of "1" and "0" in the ground truth, |P ∪ N| is their cardinality, and σ is the sigmoid function. Finally, to obtain the final segmentation result, we reshape σ(z_t · z_v) into H/4 × W/4 and upsample it back to the original image size.
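Since Eq. (9)-(10) amount to a per-pixel binary cross-entropy over dot-product similarities, the loss and the inference-time mask can be sketched compactly as below; the function names, the flattened ground-truth layout, and the square-feature-map assumption in predict_mask are illustrative only.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(z_t: torch.Tensor, z_v: torch.Tensor,
                                   gt_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (9)-(10).

    z_t:     (B, D)    transformed global text feature
    z_v:     (B, N, D) transformed pixel-level features, N = (H/4) * (W/4)
    gt_mask: (B, N)    ground-truth mask flattened to {0, 1}, where 1 marks class P
    """
    logits = torch.einsum("bd,bnd->bn", z_t, z_v)   # z_t . z_v^i for every pixel i
    # -log(sigmoid) for i in P, -log(1 - sigmoid) for i in N, averaged over |P u N| = N pixels
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float(), reduction="mean")

def predict_mask(z_t: torch.Tensor, z_v: torch.Tensor, out_size) -> torch.Tensor:
    """At inference, sigma(z_t . z_v) is reshaped to H/4 x W/4, upsampled to the
    original resolution, and binarized (a 0.35 threshold is used in Section 4.2)."""
    b, n, _ = z_v.shape
    h = w = int(n ** 0.5)                            # assumes a square feature map
    prob = torch.sigmoid(torch.einsum("bd,bnd->bn", z_t, z_v)).view(b, 1, h, w)
    prob = F.interpolate(prob, size=out_size, mode="bilinear", align_corners=False)
    return (prob > 0.35).float()
```

During training the loss is applied to the projected features of Eq. (8); during inference the same dot product, followed by the sigmoid and the threshold, yields the binary mask.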
4. Experimental Results

Our proposed framework is built on different image encoders (e.g., ResNet-50, ResNet-101 [14]) and compared with a series of state-of-the-art methods. To evaluate the effectiveness of each component in our method, we conduct extensive experiments on three benchmarks, including RefCOCO [20], RefCOCO+ [20], and G-Ref [32].

4.1. Datasets

RefCOCO [20] is one of the largest and most commonly used datasets for referring image segmentation. It contains 19,994 images with 142,210 referring expressions for 50,000 objects, which are collected from MSCOCO [24] via a two-player game [20]. The dataset is split into 120,624 train, 10,834 validation, 5,657 test A, and 5,095 test B samples, respectively. According to the statistics, each image contains two or more objects and each expression has an average length of 3.6 words.

RefCOCO+ [20] contains 141,564 language expressions with 49,856 objects in 19,992 images. The dataset is split into train, validation, test A, and test B with 120,624, 10,758, 5,726, and 4,889 samples, respectively. Compared with the RefCOCO dataset, some kinds of absolute-location words are excluded from RefCOCO+, which could make it more challenging than RefCOCO.

G-Ref [36] includes 104,560 referring expressions for 54,822 objects in 26,711 images. Unlike the above two datasets, the natural expressions in G-Ref are collected from Amazon Mechanical Turk instead of a two-player game. The average sentence length is 8.4 words, with more words about locations and appearances. It is worth mentioning that we adopt the UNC partition in this paper.

4.2. Implementation Details

Experimental Settings. We initialize the text and image encoders with CLIP [39] and adopt ResNet-50 [14] as the image encoder for all ablation studies. Input images are resized to 416 × 416. Accounting for the extra [SOS] and [EOS] tokens, the input sentences are set with a maximum length of 17 for RefCOCO and RefCOCO+, and 22 for G-Ref. Each Transformer decoder layer has 8 heads, and the feed-forward hidden dimension is set to 2048. We train the network for 50 epochs using the Adam optimizer with a learning rate of λ = 0.0001. The learning rate is decreased by a factor of 0.1 at the 35th epoch. We train the model with a batch size of 64 on 8 Tesla V100 GPUs with 16 GB of VRAM. During inference, we upsample the predicted results back to the original image size and binarize them at a threshold of 0.35 as the final result. No other post-processing operations are needed.

Metrics. Following previous works [6, 22, 25, 33], we adopt two metrics to verify the effectiveness: IoU and Precision@X. The IoU calculates the intersection regions over the union regions of the predicted segmentation mask and the ground truth. The Precision@X measures the percentage of test images with an IoU score higher than the threshold X ∈ {0.5, 0.6, 0.7, 0.8, 0.9}, which focuses on the localization ability of the method.
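As a reference for how these two metrics can be computed from binarized masks, here is a small NumPy sketch. Whether the reported IoU is averaged per sample or accumulated over the whole split is not spelled out above, so the sketch shows the per-sample variant; the function names are illustrative.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between a binary predicted mask and the ground truth."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def precision_at_x(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Percentage of test samples whose IoU exceeds each threshold X."""
    ious = np.asarray(ious, dtype=np.float64)
    return {x: float((ious > x).mean() * 100.0) for x in thresholds}
```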

4.3. Ablation Study

The proposed CRIS framework consists of two main parts, i.e., text-to-pixel contrastive learning and the vision-language decoder. To investigate each component of our method, we conduct extensive experiments on the validation sets of the three widely used datasets.
Table 1. Ablation studies on the validation set of three benchmarks. Con. denotes the proposed text-to-pixel contrastive learning. Dec. denotes the proposed vision-language decoder. n denotes the number of layers in the vision-language decoder. We set n = 3 as the default. "Params" and "FPS" denote the parameter complexity (M) and inference speed, respectively. Given an image I ∈ R^{416×416×3}, they are calculated on a Tesla V100 GPU. Gray lines denote the baseline network.

Dataset Con. Dec. n IoU Pr@50 Pr@60 Pr@70 Pr@80 Pr@90 Params FPS
- - - 62.66 72.55 67.29 59.53 43.52 12.72 131.86 27.30
✓ - - 64.64 74.89 69.58 61.70 45.50 13.31 134.22 25.79
RefCOCO
- ✓ 1 66.31 77.66 72.99 65.67 48.43 14.81 136.07 23.02
✓ ✓ 1 68.66 80.16 75.72 68.82 51.98 15.94 138.43 22.64
✓ ✓ 2 69.13 80.96 76.60 69.67 52.23 16.09 142.64 20.68
✓ ✓ 3 69.52 81.35 77.54 70.79 52.65 16.21 146.85 19.22
✓ ✓ 4 69.18 80.99 76.74 69.32 52.57 16.37 151.06 18.26
- - - 50.17 54.55 47.69 40.19 28.75 8.21 131.86 27.30
✓ - - 53.15 58.28 53.74 46.67 34.01 9.30 134.22 25.79
RefCOCO+
- ✓ 1 54.73 63.31 58.89 52.46 38.53 11.70 136.07 23.02
✓ ✓ 1 59.97 69.19 64.85 58.17 43.47 13.39 138.43 22.64
✓ ✓ 2 60.75 70.69 66.83 60.74 45.69 13.42 142.64 20.68
✓ ✓ 3 61.39 71.46 67.82 61.80 47.00 15.02 146.85 19.22
✓ ✓ 4 61.15 71.05 66.94 61.25 46.98 14.97 151.06 18.26
- - - 49.24 53.33 45.49 36.58 23.90 6.92 131.86 25.72
✓ - - 52.67 59.27 52.45 44.12 29.53 8.80 134.22 25.33
G-Ref
- ✓ 1 51.46 58.68 53.33 45.61 31.78 10.23 136.07 22.57
✓ ✓ 1 57.82 66.28 60.99 53.21 38.58 13.38 138.43 22.34
✓ ✓ 2 58.40 67.30 61.72 54.70 39.67 13.40 142.64 20.61
✓ ✓ 3 59.35 68.93 63.66 55.45 40.67 14.40 146.85 19.14
✓ ✓ 4 58.79 67.91 63.11 55.43 39.81 13.48 151.06 17.84

Effectiveness of Contrastive Learning & Vision-Language Decoder. Firstly, we remove the text-to-pixel contrastive learning and the vision-language decoder from the framework to build our baseline, which is the same as the naive setting in Figure 2 (c). As illustrated in Table 1, introducing the contrastive learning scheme significantly increases the IoU by 1.98%, 2.98%, and 3.43% over the baseline network on the three datasets, respectively. This superior performance gain proves that the contrastive loss can encourage the model to explicitly pull linguistic and relevant pixel-level visual representations closer and push other irrelevant ones away, learning fine-structured multi-modal corresponding information.

Besides, we evaluate the performance of the proposed vision-language decoder. Compared with the baseline network, using one layer in the decoder brings 3.65%, 4.56%, and 2.22% IoU improvements on RefCOCO, RefCOCO+, and G-Ref, respectively. In particular, the self-attention operation can help the model sufficiently capture long-range dependencies across pixels, which is helpful for understanding complex scenarios. Furthermore, each word encoded by the text encoder is used in the cross-attention operation, which can propagate fine-grained semantic information from textual features to pixel-level features to generate more discriminative visual representations and obtain more accurate segmentation masks.

Finally, combining the proposed contrastive loss and vision-language decoder, the IoU and Precision are significantly better than those of the baseline with only the contrastive loss or only the decoder module, by large margins of about 4%-8% on the three datasets. The reason for this obvious complementary effect is that the contrastive loss can guide the decoder to find the more informative emphasis and transfer this knowledge into more accurate pixel-level visual representations, which boosts the ability of cross-modal matching and generates precise segmentation masks.

Number of Layers in the Decoder. In Table 1, the results illustrate the effect of utilizing different numbers of layers in the vision-language decoder. When the visual representations are sequentially processed by more layers, our model consistently obtains better IoU of 69.52%, 61.39%, and 59.35% on the three benchmarks, respectively. The setting of n = 1 may not take full advantage of the multi-modal corresponding information from both vision and language. Meanwhile, the setting of n = 4 introduces more parameters, which could increase the risk of over-fitting. Considering performance and efficiency, we set n = 3 as the default in our framework.
Table 2. Comparisons with the state-of-the-art approaches on three benchmarks. We report the results of our method with various
visual backbones. “⋆” denotes the post-processing of DenseCRF [21]. “-” represents that the result is not provided. IoU is utilized as the
metric.

Method  Backbone  RefCOCO (val, test A, test B)  RefCOCO+ (val, test A, test B)  G-Ref (val, test)
RMI⋆ [25] ResNet-101 45.18 45.69 45.57 29.86 30.48 29.50 - -
DMN [33] ResNet-101 49.78 54.83 45.13 38.88 44.22 32.29 - -
RRN⋆ [22] ResNet-101 55.33 57.26 53.95 39.75 42.15 36.11 - -
MAttNet [50] ResNet-101 56.51 62.37 51.70 46.67 52.39 40.08 47.64 48.61
NMTree [26] ResNet-101 56.59 63.02 52.06 47.40 53.01 41.56 46.59 47.88
CMSA⋆ [49] ResNet-101 58.32 60.61 55.09 43.76 47.60 37.89 - -
Lang2Seg [5] ResNet-101 58.90 61.77 53.81 - - - 46.37 46.95
BCAN⋆ [16] ResNet-101 61.35 63.37 59.57 48.57 52.87 42.13 - -
CMPC⋆ [17] ResNet-101 61.36 64.53 59.64 49.56 53.44 43.23 - -
LSCM⋆ [18] ResNet-101 61.47 64.99 59.55 49.34 53.12 43.50 - -
MCN [30] DarkNet-53 62.44 64.20 59.71 50.62 54.99 44.69 49.22 49.40
CGAN [29] DarkNet-53 64.86 68.04 62.07 51.03 55.51 44.06 51.01 51.69
EFNet [8] ResNet-101 62.76 65.69 59.67 51.50 55.24 43.01 - -
LTS [19] DarkNet-53 65.43 67.76 63.08 54.21 58.32 48.02 54.40 54.25
VLT [6] DarkNet-53 65.65 68.29 62.73 55.50 59.20 49.36 52.99 56.65
CRIS (Ours) ResNet-50 69.52 72.72 64.70 61.39 67.10 52.48 59.35 59.39
CRIS (Ours) ResNet-101 70.47 73.18 66.10 62.27 68.08 53.68 59.87 60.36

4.4. Main Results

We compare our proposed approach, CLIP-Driven Referring Image Segmentation, with a series of state-of-the-art methods on three commonly used datasets. As illustrated in Table 2, our method surpasses the other methods on every split of all datasets, even when we utilize a shallower ResNet-50 [14]. On the RefCOCO dataset, our model significantly outperforms the state-of-the-art Vision-Language Transformer [6] by 4.82%, 4.89%, and 3.37% on the three splits, respectively, which indicates that our model effectively transfers the knowledge of the CLIP model from the image level to the pixel level, enhancing the ability of cross-modal matching.

Besides, in Table 2, our method achieves remarkable performance gains of about 4%-8% over a series of state-of-the-art works on the more challenging RefCOCO+ dataset. These obvious improvements suggest that our method can adequately leverage the powerful knowledge of CLIP to accurately focus on the region referred to by the given language expression.

Furthermore, on the more complex G-Ref dataset, where the referring expressions are longer and more complicated on average, our proposed method consistently achieves a notable improvement of around 5% IoU over the state-of-the-art LTS [19]. As shown in Table 2, the results demonstrate that our proposed approach manages to understand long and complex sentences that contain more information and more emphases, and simultaneously perceive the corresponding object. Apart from that, longer referring expressions can describe complex scenarios, which require a strong ability to model global contextual information. Our proposed vision-language decoder is well suited to enhance the holistic understanding of vision and language features.

4.5. Qualitative Analysis

Visualization. As illustrated in Figure 4, we present some visualization results under different settings, which demonstrate the benefits of each component in our proposed method. Firstly, compared with our full model, the baseline network without the contrastive learning and vision-language decoder generates worse segmentation masks, because it fails to intertwine referring expressions with the corresponding regions. Secondly, the settings of (d) and (e) obtain more accurate segmentation results than the baseline network, but the model is still confused in some hard regions. Finally, our proposed full model can generate high-quality segmentation masks, which demonstrates the effectiveness of our proposed method, i.e., CRIS.

Failure Cases. We visualize some insightful failure cases in Figure 5. One type of failure is caused by the ambiguity of the input expression. For the top left example in Figure 5, the expression "yellow" is not enough to describe the region of the man in the yellow snowsuit. Besides, for the top right example, some failures are also caused by wrong labels: it is obvious that the top region is unrelated to "fingers". As shown in the bottom left example, the boundaries of the referent cannot be accurately segmented, but this issue could be alleviated by introducing other techniques, such as a refinement module. Finally, occlusion can cause failure cases, which is a challenging problem in many vision tasks.
Figure 4. Qualitative examples with different settings, using the expressions "man left cut off", "main guy on the tv", "shortest person", and "black suit with goggles". (a) the input image; (b) the ground truth; (c) the baseline network; (d) CRIS without the Vision-Language Decoder; (e) CRIS without Contrastive Learning; (f) our proposed CRIS. Best viewed in color.

Figure 5. Qualitative examples of failure cases, for the expressions "yellow", "fingers holding hotdog", "keenling man", and "young man with face obscured by mans arm". (a) the input image; (b) the ground truth; (c) our prediction. Best viewed in color.

5. Conclusion

In this paper, we have investigated leveraging the power of Contrastive Language-Image Pretraining (CLIP) models to achieve text-to-pixel alignment for referring image segmentation, and we have proposed an end-to-end CLIP-Driven Referring Image Segmentation (CRIS) framework to transfer the knowledge of the CLIP model effectively. Compared with direct fine-tuning, our proposed framework not only inherits the strong cross-modal matching ability of CLIP, but also learns ample fine-structured visual representations. The designed vision-language decoder can adaptively propagate sufficient semantic information of the language expression into pixel-level visual features, promoting consistency between the two modalities. Furthermore, the introduced text-to-pixel contrastive learning can explicitly intertwine the text representation and the relevant pixel-level visual features, learning fine-grained multi-modal corresponding information. Extensive ablation studies on three commonly used datasets have verified the effectiveness of each proposed component, and our approach significantly outperforms previous state-of-the-art methods without any post-processing.

Acknowledgements. We thank Ziyu Chen for the helpful discussions on this work, and Weiqiong Chen, Bin Long, and Rui Sun for AWS technical support.
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 4
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 4
[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020. 3
[4] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 3
[5] Yi Wen Chen, Yi Hsuan Tsai, Tiantian Wang, Yen Yu Lin, and Ming Hsuan Yang. Referring expression object segmentation with caption-aware consistency. In BMVC, 2019. 2, 7
[6] Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vision-language transformer and query generation for referring segmentation. In ICCV, 2021. 3, 5, 7
[7] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021. 2
[8] Guang Feng, Zhiwei Hu, Lihe Zhang, and Huchuan Lu. Encoder fusion network with co-attention embedding for referring image segmentation. In CVPR, 2021. 3, 7
[9] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019. 1
[10] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR. IEEE, 2006. 3
[11] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, 2019. 1
[12] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 3
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 1, 3
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 3, 5, 7
[15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016. 1, 3
[16] Zhiwei Hu, Guang Feng, Jiayu Sun, Lihe Zhang, and Huchuan Lu. Bi-directional relationship inferring network for referring image segmentation. In CVPR, 2020. 3, 7
[17] Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, and Bo Li. Referring image segmentation via cross-modal progressive comprehension. In CVPR, 2020. 2, 7
[18] Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, and Jizhong Han. Linguistic structure guided context modeling for referring image segmentation. In ECCV, 2020. 2, 7
[19] Ya Jing, Tao Kong, Wei Wang, Liang Wang, Lei Li, and Tieniu Tan. Locate then segment: A strong pipeline for referring image segmentation. In CVPR, 2021. 3, 7
[20] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing, 2014. 5
[21] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In NeurIPS, 2011. 7
[22] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmentation via recurrent refinement networks. In CVPR, 2018. 1, 3, 5, 7
[23] Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. Selective-supervised contrastive learning with noisy labels. In CVPR, 2022. 3
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 5
[25] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, 2017. 1, 3, 5, 7
[26] Daqing Liu, Hanwang Zhang, Feng Wu, and Zheng-Jun Zha. Learning to assemble neural module tree networks for visual grounding. In ICCV, 2019. 7
[27] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In NeurIPS, 2018. 4
[28] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 3
[29] Gen Luo, Yiyi Zhou, Rongrong Ji, Xiaoshuai Sun, Jinsong Su, Chia-Wen Lin, and Qi Tian. Cascade grouped attention network for referring expression segmentation. In ACM MM, 2020. 7
[30] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task collaborative network for joint referring expression comprehension and segmentation. In CVPR, 2020. 3, 7
[31] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021. 2
[32] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016. 5
[33] Edgar Margffoy-Tuay, Juan C Pérez, Emilio Botero, and Pablo Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV, 2018. 1, 3, 5, 7
[34] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020. 2
[35] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac,
Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m:
Learning a text-video embedding by watching hundred mil-
lion narrated video clips. In ICCV, 2019. 2
[36] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Mod-
eling context between objects for referring expression under-
standing. In ECCV, 2016. 5
[37] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or,
and Dani Lischinski. Styleclip: Text-driven manipulation of
stylegan imagery. In ICCV, 2021. 2
[38] Pedro O Pinheiro, Amjad Almahairi, Ryan Y Benmaleck,
Florian Golemo, and Aaron Courville. Unsupervised learning
of dense visual representations. NeurIPS, 2020. 3
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
transferable visual models from natural language supervision.
International Conference on Machine Learning, 2021. 1, 2,
3, 5
[40] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, Ilya Sutskever, et al. Language models are unsuper-
vised multitask learners. OpenAI blog, 1(8):9, 2019. 3
[41] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural
machine translation of rare words with subword units. arXiv
preprint arXiv:1508.07909, 2015. 3
[42] Hengcan Shi, Hongliang Li, Fanman Meng, and Qingbo Wu.
Key-word-aware network for referring expression image seg-
mentation. In ECCV, 2018. 2, 3
[43] Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao,
Dian Li, and Xiu Li. Clip4caption: Clip for video caption. In
Proceedings of the 29th ACM International Conference on
Multimedia, pages 4858–4862, 2021. 2
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017. 3, 4
[45] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and
Lei Li. Dense contrastive learning for self-supervised visual
pre-training. In CVPR, 2021. 3
[46] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia
Tsvetkov, and Yuan Cao. Simvlm: Simple visual language
model pretraining with weak supervision. arXiv preprint
arXiv:2108.10904, 2021. 2
[47] Tianyi Wu, Yu Lu, Yu Zhu, Chuang Zhang, Ming Wu, Zhanyu
Ma, and Guodong Guo. Ginet: Graph interaction network for
scene parsing. In ECCV, 2020. 1
[48] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
Unsupervised feature learning via non-parametric instance
discrimination. In CVPR, 2018. 3
[49] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-
modal self-attention network for referring image segmenta-
tion. In CVPR, 2019. 1, 2, 3, 7

[50] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, 2018. 1, 3, 7