From Text to Mask Localizing Entities Using the
Abstract
∗ Indicates equal contribution.
∗∗ Corresponding author.
Email address: [email protected] (Changshui Zhang)
1. Introduction
Figure 1: An overview of our proposed framework. We first add noise to the latent and then input it into the denoising U-net with specially designed text queries. Next, we combine cross-attention and self-attention in the model to obtain the correlation map between words and pixels. After comparing different correlation maps and post-processing with dense CRF [16], we finally obtain the pseudo masks. Best viewed in color.
efits. It has been shown that generative models have a better spatial and
relational understanding than discriminative models [13], which is crucial for
segmentation. Besides, internet-scale training data endows the model with
the ability to handle open-vocabulary scenarios. Furthermore, customized
generation with text-to-image diffusion models has been proven feasible [17].
These methods typically map the subject to an identifier word, and novel
renditions of it can be obtained by inserting the word into various descrip-
tions. We integrate this technique with our method and locate user-specific
items by exploiting the correspondence between learned word embeddings
of personalized concepts and image segments. It is worth noting that we can compose customized instances and textual contexts into multi-modal segmentation queries, which is hard to achieve with previous methods [18].
To investigate the performance of different methods in this customized case, we introduce a new task called "personalized referring image segmentation" together with a new dataset named Mug19. This task aims to locate user-specific entities and has many potential practical applications, such as household robots. The dataset is collected in a laboratory scenario and is carefully designed so that, in most cases, uni-modal information is insufficient to locate the proper object. This setup makes the task more pragmatic and user-friendly and allows the new benchmark to assess the multi-modal comprehension ability of different models.
We conduct experiments from multiple perspectives to validate the effectiveness of our method. We first evaluate the weakly-supervised semantic segmentation ability of our approach on classic datasets: Pascal VOC 2012 [19] and MS COCO 2014 [20]. Then we demonstrate that we can locate personalized embeddings just like category embeddings, showcasing the
generality of our framework. We further reveal that traditional approaches using uni-modal information have difficulty dealing with our proposed task. Finally, ablation studies are conducted to validate the intuitions behind our framework design.
In summary, the contributions of this work are as follows:
2. Related Work
from seen categories to unseen ones. With the development of word embeddings [24], language generalization ability has been exploited to depict semantic associations between different classes [25].
Recently, foundation models have made game-changing progress, and
zero-shot abilities have been found to automatically emerge from them [8].
Thus the open-vocabulary paradigm¹, a more general zero-shot formulation, has
gradually replaced the classic paradigm [26]. Contrastive Language-Image
Pre-training (CLIP) [8], a discriminative foundation model that has learned
vast image-text knowledge from the internet, has become a common com-
ponent of open-vocabulary recognition methods [7]. Nevertheless, it has
been discovered that the representation of CLIP is sub-optimal for segmen-
tation tasks [13]. Therefore, we leverage a generative diffusion model in-
stead [11], which is believed to have a thorough perception of scene-level
structure [13, 27].
¹ Also known as the open-set paradigm or open-world paradigm.
and dense annotations are required for training. Instead of learning a differ-
ent denoising process, another group exploits the knowledge of pre-trained
diffusion models [12, 13]. They leverage diffusion features to train additional
networks that provide dense predictions. As pre-trained diffusion models
possess powerful generative capability, their internal representation captures
rich semantic information. Thus labels used for training and parameters to
be optimized can be greatly reduced, making their frameworks very efficient.
In order to further diminish dependence on pixel-wise labels, the last group
operates on synthetic data [30, 31]. They synthesize realistic images and
segmentation masks simultaneously and use this automatically constructed
dataset to train segmentation models. The trained model exhibits performance on real images that is competitive with models trained on real data. Nevertheless, due to the domain gap between synthetic and real data, careful data-processing designs are required for good performance.
Compared to these segmentation methods that employ diffusion mod-
els, our method is more flexible and does not require dense annotations,
segmentation-specific re-training, or complicated data processing designs.
Thus it takes only 1 second to process an image in our framework, while some competing methods need minutes to segment a single image.
tion. They have proposed various mechanisms ranging from affine transfor-
mation [35] to cross-attention [36].
Currently, with the development of vision-and-language pre-trained mod-
els [8], researchers have contemplated applying them to composed image re-
trieval [18, 34, 33, 37]. Among these approaches, [18, 37] took up early-fusion
and had freer reasoning capabilities than late-fusion methods [34, 33]. We
adopt early-fusion in our proposed method, but we care more about local-
ization ability, which is valuable for many practical applications. Although
PALAVRA [18] also considered the segmentation task, its experimental re-
sults indicated its failure to distinguish between context and subject. In contrast, we design the localization task to rely on the comprehension of
context and subject, and experimental results demonstrate the effectiveness
of our method.
3. Method
3.1. Preliminary
Diffusion models [9, 10] are a class of generative models which learn the
data distribution by progressively denoising from a tractable noise distribu-
tion. They can be interpreted as a sequence of time-conditional denoising
auto-encoders. To reduce resource consumption, the Latent Diffusion Model [11] conducts the diffusion process in a latent space obtained by a perceptual compression model [38]. Additionally, to achieve flexible conditional generation, the denoising U-net backbone [39] is usually augmented with cross-attention and self-attention mechanisms [36]. With
paired image-condition training data (x, y), the optimization objective of the denoising model ϵ_θ is ordinarily simplified as:
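For reference, the commonly used simplified objective of latent diffusion models takes the form below; the notation (the latent encoder E and the text encoder τ_θ) is a reconstruction of the standard formulation, given here as a sketch rather than a verbatim copy of what is presumably Equation 1 in the original numbering:

L_{LDM} = E_{E(x), y, ε∼N(0,1), t} [ ‖ε − ϵ_θ(z_t, t, τ_θ(y))‖_2^2 ],    (1)

where z_t is the noised latent at time step t and τ_θ encodes the condition y into an intermediate representation attended to by the U-net.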
[Figure 2 panels, left to right: Image, Self, Cross, CBP, SelfCross; example categories: chair, dog, boat.]
Figure 2: Visualization of correlation maps. The texts on the left are the corresponding categories. The second column depicts the spectral clustering result [42] utilizing the self-attention map, the third column shows the cross-attention map, the fourth column displays the attention score attained after employing the clustering technique in CBP [43], and the last column shows our final correlation map after propagation. Best viewed in color.
Inspired by [45, 46], we regard self-attention maps as semantic affinity matrices and propagate cross-attention scores accordingly. Thereby, we aggregate semantic information from regions with similar appearances. We select a particular layer with a fine resolution W × H to extract the self-attention map Self ∈ R^{WH×WH} and accumulate Cross^{(n)} of different layers via interpolation and averaging to obtain the aggregated cross-attention map Cross ∈ R^{WH×l}. We then leverage this patch-patch affinity matrix to refine the patch-token correlation map, which is formulated in Equation 3, where iter denotes the number of refining iterations. The obtained SelfCross ∈ R^{WH×l} can serve as a substitute for Cross to complete localization tasks [44].
The originally predicted correlation between patches and tokens may carry noise, yet ensembling within relevant areas yields more stable results. This refinement also reduces prediction discrepancies within semantically similar regions. As shown in Figure 2, the correlation map attained after propagation preserves object structure better and possesses finer details.
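As a concrete illustration of this propagation step, the following is a minimal NumPy sketch. It assumes the refinement takes the common form SelfCross = Self^{iter} · Cross with a row-normalized affinity matrix, which is our reading of Equation 3 rather than the exact formulation; the function name and the normalization constant are assumptions.

import numpy as np

def propagate_cross_attention(self_attn: np.ndarray,
                              cross_attn: np.ndarray,
                              iters: int = 2) -> np.ndarray:
    """Refine the patch-token map with the patch-patch affinity.

    self_attn:  (WH, WH) self-attention map used as a semantic affinity matrix.
    cross_attn: (WH, l)  aggregated cross-attention map (patches x tokens).
    Returns a (WH, l) refined correlation map ("SelfCross").
    """
    # Row-normalize the affinity so each patch aggregates a weighted average
    # of semantically similar patches (this normalization is assumed).
    affinity = self_attn / (self_attn.sum(axis=1, keepdims=True) + 1e-8)
    refined = cross_attn
    for _ in range(iters):
        refined = affinity @ refined
    return refined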
To attain the pseudo mask, we leverage the image-level label to identify
the categories contained in the image and use K to represent this set. Re-
garding class k ∈ K, we extract correlation maps of its relevant tokens from the corresponding positions of SelfCross and average along the token dimension to obtain the object attribution map SC_k ∈ R^{WH×1}. We reshape it to R^{W×H} and then normalize it so that values at each position lie between 0 and 1:
SC_k(x, y) → (SC_k(x, y) − min_{x,y} SC_k(x, y)) / (max_{x,y} SC_k(x, y) − min_{x,y} SC_k(x, y)).    (4)
We further estimate the attribution map of the background, given by SC_bg(x, y) = (1.0 − max_{k∈K} SC_k(x, y))^2. By concatenating these attribution maps, we acquire the final attention map SC ∈ R^{(|K|+1)×W×H}. After upsampling and refining SC, we assign each pixel to the class (including the background) with the maximum attention score.
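The per-pixel assignment described above can be summarized in a short NumPy sketch; the array shapes, the class ordering, and the small epsilon guard are assumptions, and the upsampling and dense-CRF refinement are omitted.

import numpy as np

def pseudo_mask_from_attribution(sc_maps: dict, W: int, H: int) -> np.ndarray:
    """Assign each pixel to a class or to background by comparing attribution maps.

    sc_maps: {class_id: (W*H,) attribution vector extracted from SelfCross}.
    Returns a (W, H) label map where 0 denotes background and indices 1..|K|
    follow the sorted class ids.
    """
    maps = []
    for _, sc in sorted(sc_maps.items()):
        sc = sc.reshape(W, H)
        # Min-max normalization as in Equation 4.
        sc = (sc - sc.min()) / (sc.max() - sc.min() + 1e-8)
        maps.append(sc)
    fg = np.stack(maps)                      # (|K|, W, H)
    bg = (1.0 - fg.max(axis=0)) ** 2         # background attribution
    sc_all = np.concatenate([bg[None], fg])  # (|K|+1, W, H)
    # Pixel label = argmax over background + foreground attribution maps.
    return sc_all.argmax(axis=0)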
3.4. Summary
with the text query y constructed as described in Section 3.3. During the feed-forward calculation of ϵ_θ(z_t, t, y), we distill Cross and Self from certain layers of ϵ_θ and acquire SelfCross using Equation 3. Next, we extract the attribution map of each entity from the corresponding positions of SelfCross and normalize it using Equation 4. Finally, after estimating the attribution map of the background, we determine which entity each pixel belongs to by comparing its values across all attribution maps.
4. Mug19 Dataset
4.1. Motivation
To investigate the problem of personalized segmentation, we create a
dataset in a laboratory scenario. As mentioned in Section 1, this new task
aims to locate the user-specific instance corresponding to the textual query.
It requires a more refined multi-modal comprehension capability than composed image retrieval [32] and has potential applications in products such as home robots. Most existing datasets for similar tasks are designed for retrieval only [32, 34] and are thus not suitable for us. [18] created a personalized segmentation benchmark by repurposing a video dataset [47]. However, the temporal continuity of video may leak information useful for localization. Besides, the textual depictions were not always helpful for segmentation, as we found it hard to tell the difference between similar entities in most images of that dataset.
4.2. Distractors
To mitigate these issues, we build a new dataset on which models with
good image-text reasoning ability would perform better. We consider vari-
[Figure 3 panel labels, left to right: mug_02, scene_32, "a mug_02 filled with Cola".]
Figure 3: Examples in our proposed dataset. The first two columns display multi-view photos of personalized items, and the third column presents images of different scenes. The last column shows the highlighted segmentation result along with the text query.
ous ambiguities and ensure that the multi-modal references are necessary and
sufficient to discriminate different objects, which requires a careful arrange-
ment of scenes. As shown in the first row of Figure 3, personalized knowledge is needed to tell the difference between two mugs filled with Cola, while textual context is required to distinguish between two white mugs. These two
situations are termed “semantic distractors” and “visual distractors” in [18].
4.3. Statistics
special groups regarding the degree of ambiguity: semantic distractor and visual distractor. Scenes in the semantic distractor split include different items with similar contexts, and this split consists of 440 triplets. Scenes in the visual distractor split contain various items with analogous appearances, and this group comprises 170 triplets. The triplets are carefully annotated by experts, and the instance mask labels are acquired with a pre-trained Mask R-CNN [48] model from the Detectron2 library². This dataset will be made available online upon acceptance.
5. Experiments
² https://github.com/facebookresearch/detectron2
and distill Self from the 11-th layer of ϵ_θ. We set iter in Equation 3 to 2. Following [5, 6], the obtained segmentation maps are further refined by dense Conditional Random Fields (CRF) [16] to generate pseudo masks, which are then used to train a standard segmentation network based on DeepLab [49]. Furthermore, due to the stochastic sampling process in the diffusion framework, we can easily ensemble segmentation masks generated from different noise samples. Thus we can sample multiple times to obtain better pseudo masks.
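To illustrate this ensembling, here is a minimal sketch that averages attribution maps over independently sampled noises; whether the paper averages attribution maps or fuses the resulting masks is not specified here, so averaging the maps is an assumption, and run_single_pass is a hypothetical helper.

import numpy as np

def ensemble_attribution(run_single_pass, n_samples: int = 10, seed: int = 42):
    """Average attribution maps over multiple noise samples.

    run_single_pass(rng) is assumed to add noise drawn from `rng` to the latent,
    run the denoising U-net once, and return a (|K|+1, W, H) attribution map.
    """
    rng = np.random.default_rng(seed)
    acc = None
    for _ in range(n_samples):
        sc = run_single_pass(rng)
        acc = sc if acc is None else acc + sc
    return acc / n_samples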
Evaluation. We select Pascal VOC 2012 [19] and Microsoft COCO 2014 [20]
for evaluation. Pascal VOC 2012 is a semantic segmentation dataset with
20 object categories. It consists of 1464 training images, 10582 augmented
training images, 1449 validation images, and 1456 test images. MS COCO
2014 dataset contains 80 object categories and 1 background class. It has
82081 training images and 40137 validation images. In our experiment, only
image-level labels are used to generate pseudo masks. The mean Intersection
over Union (mIoU) value is reported as the evaluation metric.
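For completeness, a minimal NumPy sketch of the mIoU metric over predicted and ground-truth label maps follows; treating 255 as the ignore index follows the usual Pascal VOC convention and is an assumption about the exact evaluation code used.

import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Compute mIoU from label maps of identical shape (255 treated as ignore)."""
    valid = gt != 255
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))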
Method    IRN [46]  M&L [22]  SC-CAM [50]  SEAM [51]  AdvCAM [52]  CLIMS [5]  RIB [53]  OoD [54]  MCT [55]  CLIP-ES [6]  Ours (1 time)  Ours (10 times)
Initial   48.8      49.6      50.9         55.4       55.6         56.6       56.5      59.1      61.7      70.8         72.7           74.2
dCRF      54.3      -         55.3         56.8       62.1         62.4       62.9      65.5      64.5      75.0         74.4           76.1
RW        66.3      67.0      63.4         63.6       68.0         70.5       70.6      72.1      69.1      -            -              -
Table 1: mIoU of pseudo masks on PASCAL VOC 2012 train set. The best results are
in bold. dCRF represents post-processing with dense CRF. RW denotes refining with
trained affinity networks.
Table 2: mIoU of pseudo masks on the data set used to train DeepLab. The best results
are in bold. dCRF represents post-processing with dense CRF.
on the trainaug set. As shown in Table 2, we obtain the best mask quality on this set, and even our initial masks exceed the dCRF-post-processed results of previous methods. Utilizing these masks, we train segmentation models based on DeepLabV2 and assess them on the val and test sets. As shown in Table 3, our framework outperforms most previous methods, and we achieve a new state of the art with the ImageNet pre-trained model. As for MS COCO 2014, we first compare the generated pseudo masks on the train images. As shown in Table 2, our method produces more precise masks. After training a DeepLabV2 segmentation model with these pseudo masks, we achieve 45.7% mIoU on the val set as shown in Table 4, which is also a new state of the art. As we adopt the classic DeepLab repository directly without extensive modifications, the enhancement observed in the DeepLab results
Method Backbone Ver. Val Test
Table 3: DeepLab results on PASCAL VOC 2012 val and test sets. The best results are in bold. Ver. denotes the version. ‡ represents adopting COCO pre-trained models.
Method Backbone Sup. Val
Table 4: DeepLab results on MS COCO 2014 val set. The best results are in bold. I and L represent image-level supervision and language supervision, respectively.
Qualitative Results. We visualize the results of our approach and other multi-modal methods [5, 6] in Figure 4. We produce pseudo masks with finer structures, such as the bird's feathers and the horse's legs shown in the figure. Furthermore, we can directly generate a confidence map from the attention scores. If the value of a position is within the 0.05 interval of the
[Figure 4 columns: Image, CLIMS, CLIP-ES, Ours, Ground Truth.]
Table 5: Attention mechanism analysis on PASCAL VOC 2012 train set. The best results
are in bold.
Comp.  Syn.  BG.    bird   boat   person  train  tvmonitor  mIoU
                    85.6   74.2   38.4    74.1   62.3       72.1
       ✓            87.1   72.8   59.3    73.5   60.6       73.1
             ✓      80.3   71.7   36.1    81.9   66.3       72.7
       ✓     ✓      83.9   71.6   63.5    81.5   67.3       74.3
✓                   85.3   76.4   45.2    75.2   62.8       73.2
✓      ✓            87.0   75.7   56.9    74.7   64.5       73.8
✓            ✓      79.4   77.1   41.0    82.2   66.3       73.1
✓      ✓     ✓      85.0   77.0   59.7    82.5   65.6       74.4
Table 6: Prompt strategy comparison on PASCAL VOC 2012 train set. Comp. denotes
using one composed sentence, Syn. denotes adopting category synonyms, and BG. denotes
appending background prompts. The best results are in bold.
Prompt. We analyze different strategies for selecting the text query and report the IoU of several categories along with the overall average in Table 6. Compared to using one composed sentence as the prompt, applying separate texts to query multi-label scenes reduces segmentation precision and increases inference time (the number of pseudo labels generated per second decreases from 1.6 to 1.0 in practice). Then, following [6], we substitute the original class name with category synonyms that reduce ambiguity. For instance, the high-attention region corresponding to “person” tends to miss the body, and adding “clothes” to the query prompt of “person” enhances performance. Furthermore, we append some background prompts at the end of text queries. With the existing softmax operation in
Equation 2, category scores in co-occurring background regions [5] will be
naturally suppressed. For example, when “railway” and “track” are used
as the background prompts for “train”, the segmentation results will be im-
proved due to the exclusion of these background areas.
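The prompt strategy can be illustrated with a small sketch that composes one sentence from the image-level labels, category synonyms, and background prompts; the template wording, the synonym lists, and the function name are hypothetical illustrations, not the exact queries used in the paper.

# Hypothetical illustration of the three strategies analyzed in Table 6.
synonyms = {"person": ["person", "clothes"], "train": ["train"]}
background_prompts = {"train": ["railway", "track"]}

def build_query(classes):
    """Compose one sentence covering all image-level labels plus background words."""
    words = []
    for c in classes:
        words += synonyms.get(c, [c])
        words += background_prompts.get(c, [])
    return "a photograph of " + " and ".join(words)

print(build_query(["person", "train"]))
# -> "a photograph of person and clothes and train and railway and track"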
is similar to that in Section 5.1.
Evaluation. We use standard mIoU along with two kinds of accuracy for evaluation. Accuracy represents the proportion of correctly predicted instances among all instances. The ground-truth assignment between instances and segments is determined based on the relative relationship between the center positions of segments. When we consider each object separately, the segment most relevant to each object is predicted as its location; bf acc denotes the accuracy under this protocol. When we consider all items contained in the scene together, we treat the correlation between items and segments as the cost matrix of an assignment problem, and the Hungarian algorithm [60] is adopted to obtain the assignment between instances and segments; af acc denotes the accuracy under this protocol. Usually, the latter accuracy is higher, as we compare the correlation values not only between segments but also between the contained instances.
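A minimal sketch of the two accuracy protocols using SciPy's Hungarian solver is given below; the score-matrix layout and the ground-truth encoding are assumptions, and the center-based ground-truth assignment described above is taken as given.

import numpy as np
from scipy.optimize import linear_sum_assignment

def bf_accuracy(score: np.ndarray, gt_assignment: np.ndarray) -> float:
    """Accuracy when each instance independently picks its best segment."""
    return float((score.argmax(axis=1) == gt_assignment).mean())

def af_accuracy(score: np.ndarray, gt_assignment: np.ndarray) -> float:
    """Accuracy under a joint assignment of all instances to segments.

    score: (n_instances, n_segments) correlation between instances and segments.
    gt_assignment: (n_instances,) ground-truth segment index per instance.
    """
    # The Hungarian algorithm minimizes cost, so negate the correlation scores.
    row, col = linear_sum_assignment(-score)
    return float((col == gt_assignment[row]).mean())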
Split All Semantic Distractor Visual Distractor
Metric mIoU bf acc af acc mIoU bf acc af acc mIoU bf acc af acc
Mask R-CNN [48] 75.1 58.9 75.1 78.2 63.3 78.2 42.9 38.0 42.9
DINO-ViT [61] 52.7 42.3 72.8 70.7 43.3 93.2 39.9 33.7 52.0
Subject Only 62.3 60.1 79.5 76.2 79.3 95.0 31.9 42.0 40.0
Context Only 54.4 51.2 70.9 35.3 40.5 45.1 50.3 42.4 64.3
Arithmetic 28.5 35.1 37.2 27.9 37.7 35.2 55.2 46.7 70.4
Ours 64.9 60.2 83.3 78.8 73.5 98.3 56.8 49.4 71.8
Table 7: Evaluation results on different splits of Mug19 dataset. The best results are in
bold while the second best are underlined.
category embedding in the description with the average of the CLIP image embeddings of the instance and the CLIP text embedding of the class.
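As we read it, this Arithmetic baseline can be sketched with the official CLIP package as follows; the image paths, the class prompt "a mug", and the L2 normalization before averaging are assumptions made for illustration.

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Multi-view photos of the personalized instance (hypothetical file names).
    views = [preprocess(Image.open(p)).unsqueeze(0).to(device)
             for p in ["mug_02_view1.jpg", "mug_02_view2.jpg"]]
    image_emb = model.encode_image(torch.cat(views)).mean(dim=0)
    text_emb = model.encode_text(clip.tokenize(["a mug"]).to(device))[0]
    # Average the normalized image and text embeddings to replace the
    # category embedding in the description.
    arithmetic_emb = (image_emb / image_emb.norm()
                      + text_emb / text_emb.norm()) / 2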
[Figure 5 columns: Object, Subject Only, Context Only, SEEM, Ours, Ground Truth, Scene; example: mug_02, "a mug_02 filled with Cola", scene_32.]
Figure 5: Localization results of different methods on Mug19 dataset samples. The first column shows the object, the last column shows the scene, and the remaining columns display highlighted segmentation masks along with the text reference.
We have found that the segmentation results drop drastically when an image contains semantically similar objects, as shown in Figure 6. The same phenomenon has also been observed in [27] and was termed “Cohyponym Entanglement”. Incorporating extra visual knowledge into the learning stage of
Figure 6: Failure cases: when “chair” and “sofa”, or “cat” and “dog”, appear in the same image, it is difficult for our method to distinguish them.
diffusion models [62] may help enhance their discriminative abilities. Besides, we have made preliminary attempts to adopt affordance-related texts. When given texts with richer semantics, like “graspable part” and “cuttable part”, our framework can provide meaningful masks for images containing knives. We will further explore how to efficiently equip our framework with affordance abilities in future work.
7. Broader Impacts
The rapid rise of generative models has raised many concerns, as the inauthentic content they create can be misused. We instead focus on a discriminative task and show the potential of these generative models from another angle. Our method can further help researchers understand the generation process in large models, which will also benefit the detection of generated content. On the other hand, as we propose a more flexible segmentation technology, surveillance of specific identities becomes easier, which may be abused to infringe on personal privacy.
8. Conclusion
Acknowledgments
Datasets and Pre-trained Models. The licenses of the datasets and the pre-
trained models are listed here. The PASCAL VOC 2012 dataset [19] is
from http://host.robots.ox.ac.uk/pascal/VOC/index.html. The MS
COCO 2014 dataset [20] is from https://cocodataset.org/#home. The
pre-trained CLIP [8] is from https://github.com/openai/CLIP. The Sta-
ble Diffusion v1.4 model [11] is from https://huggingface.co/CompVis/stable-diffusion. The DeepLab pre-trained model [49] is from https://github.com/kazuto1011/deeplab-pytorch. The Detectron2 pre-trained
model is from https://github.com/facebookresearch/detectron2, and
we select mask_rcnn_R_50_FPN_3x from the official baseline models for its good performance.
Implementation Details. We use PyTorch [63] for our experiments. Our codebase builds heavily on https://github.com/CompVis/stable-diffusion.
For Pascal VOC 2012, MS COCO 2014, and Mug19 datasets, we sample 10,
1, and 1 times respectively to generate pseudo masks.
Figure B.7: More visualization results on PASCAL VOC 2012 and MS COCO 2014
datasets. Uncertain pixels are set to white.
[Figure B.8 panels: mIoU (%) on the y-axis versus time step, cross-attention layer range, self-attention layer, background threshold (with different powers), and frames per second on the x-axes; curves are shown with and without dCRF post-processing.]
Figure B.8: Analyses of different hyper-parameters. The mIoU values are evaluated on the PASCAL VOC 2012 train set. The default setting is time step 150, cross-attention layers 4-8, self-attention layer 11, background threshold 1.0 with power 2, and a single sampling.
Seeds 40 41 42 43 44 Average
Table B.8: mIoU of pseudo masks on PASCAL VOC 2012 train set with different seeds.
Table B.9: Evaluation results of various combinations on Mug19 dataset. The best
results are in bold while the second best results are underlined. Custom + SelfCross is the combination we ultimately adopt.
Custom Diffusion [17] for personalization, our framework for segmentation.
As shown in Table B.9, CLIP-ES has difficulty dealing with personalized items. We attribute this to the instability of Grad-CAM [65], as we occasionally obtain all-zero CAMs for specific objects. In contrast, our mechanism is robust to various personalized embeddings, and we ultimately choose Custom Diffusion because it performs better and is more consistent with our method.
Objects in the Same Series. We further construct a new split of the Mug19 dataset based on series. We build a scenario where we want to locate an object while only having access to images of its variants within the same series. For instance, we aim to find the red transparent plastic mug in the scene, but we only possess photos of its blue variant. We handle this by using the identifier embedding of the blue variant along with the context “a red mug” as the segmentation query. We pick out 632 triplets based on this notion and name this group the “variant split”. Quantitative comparisons of various baselines on this split are presented in Table B.10. Our method fully utilizes both subject and context information and addresses localization in this setting more effectively than all the baselines.
Split Variant
Metric mIoU bf acc af acc
Mask R-CNN [48] 71.9 59.5 71.9
DINO-ViT [61] 48.5 44.7 66.7
Subject Only 61.6 54.9 78.8
Context Only 54.7 52.6 71.1
Arithmetic 36.5 37.3 47.1
Ours 69.2 60.7 88.3
Table B.10: Evaluation results on the variant split of Mug19 dataset. The best results
are in bold while the second best results are underlined.
[Figure D.9: scatter plot of mIoU (%) against cosine similarity, r = -0.73.]
Figure D.9: The correlation between segmentation results and textual embedding simi-
larity. The r-value is -0.73.
and have not yet considered them in our dataset. In future work, we will
add this type of description to our dataset and consider how to enhance our
method’s ability to comprehend them.
References
learning approach for object affordance detection, in: ICRA, 2018, pp.
5882–5889.
[5] J. Xie, X. Hou, K. Ye, L. Shen, CLIMS: cross language image matching
for weakly supervised semantic segmentation, in: CVPR, 2022, pp.
4483–4492.
[11] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-
resolution image synthesis with latent diffusion models, in: CVPR,
2022, pp. 10684–10695.
[19] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, A. Zisser-
man, The pascal visual object classes (VOC) challenge, Int. J. Comput.
Vis. (2010).
[21] Y. Chong, X. Chen, Y. Tao, S. Pan, Erase then grow: Generating cor-
rect class activation maps for weakly-supervised semantic segmentation,
Neurocomputing 453 (2021) 97–108.
[27] R. Tang, A. Pandey, Z. Jiang, G. Yang, K. V. S. M. Kumar, J. Lin, F. Ture, What the DAAM: Interpreting Stable Diffusion using cross attention, in: ACL, 2022, pp. 5644–5659.
images with pre-trained vision-and-language models, in: ICCV, 2021,
pp. 2105–2114.
[42] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans.
Pattern Anal. Mach. Intell. (2000).
[50] Y. Chang, Q. Wang, W. Hung, R. Piramuthu, Y. Tsai, M. Yang,
Weakly-supervised semantic segmentation via sub-category exploration,
in: CVPR, 2020, pp. 8988–8997.
[56] J. Fan, Z. Zhang, C. Song, T. Tan, Learning integral objects with intra-
class discriminator for weakly-supervised semantic segmentation, in:
CVPR, 2020, pp. 4282–4291.
[58] Q. Chen, L. Yang, J. Lai, X. Xie, Self-supervised image-specific pro-
totype exploration for weakly supervised semantic segmentation, in:
CVPR, 2022, pp. 4278–4288.
[60] H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics (NRL) (1955).
[65] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: ICCV, 2017, pp. 618–626.