Detector-Free Weakly Supervised Grounding by Separation

Arbelle, Assaf; Doveh, Sivan; Alfassy, Amit; Shtok, Joseph; Lev, Guy; Schwartz, Eli; Kuehne, Hilde; Levi, Hila Barak; Sattigeri, Prasanna; Panda, Rameswar; Chen, Chun-Fu; Bronstein, Alex; Saenko, Kate; Ullman, Shimon; Giryes, Raja; Feris, Rogerio; Karlinsky, Leonid

Computer Science > Computer Vision and Pattern Recognition

arXiv:2104.09829 (cs)

[Submitted on 20 Apr 2021]

Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to 8.5% over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above 7%) over the detector-based approaches for WSG.
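The blending step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the random alpha map is generated, so `random_alpha_map` (coarse uniform noise upsampled by repetition) is a hypothetical stand-in, and the function names, image sizes, and the choice of `1 - alpha` as the target for the second text are assumptions made only for illustration.

```python
import numpy as np

def random_alpha_map(height, width, coarse=4, rng=None):
    """Hypothetical alpha generator: coarse uniform noise upsampled by repetition.

    The abstract does not specify how alpha maps are drawn; this is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    coarse_map = rng.uniform(0.0, 1.0, size=(coarse, coarse))
    reps_h = -(-height // coarse)  # ceiling division
    reps_w = -(-width // coarse)
    alpha = np.kron(coarse_map, np.ones((reps_h, reps_w)))
    return alpha[:height, :width]

def blend_pair(image_a, image_b, alpha):
    """Alpha-blend two HxWxC images with an HxW alpha map in [0, 1]."""
    a = alpha[..., None]  # broadcast over the channel axis
    return a * image_a + (1.0 - a) * image_b

# Toy data standing in for two arbitrary training images.
rng = np.random.default_rng(0)
image_a = rng.uniform(size=(32, 32, 3))
image_b = rng.uniform(size=(32, 32, 3))

alpha = random_alpha_map(32, 32, rng=rng)
blended = blend_pair(image_a, image_b, alpha)

# Assumed training targets for a text-conditioned segmentation network:
# given the blended image and the text of image_a, recover alpha;
# given the text of image_b, recover the complement (an assumption here).
target_for_text_a = alpha
target_for_text_b = 1.0 - alpha
```

In this sketch the segmentation network itself is omitted; only the synthetic supervision signal (the blended image and the alpha targets) is constructed.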

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2104.09829 [cs.CV]
	(or arXiv:2104.09829v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2104.09829

Submission history

From: Leonid Karlinsky
[v1] Tue, 20 Apr 2021 08:27:31 UTC (21,761 KB)
