Dynamic Graph Attention for Referring Expression Comprehension: Paper Reading Notes

This paper proposes a Dynamic Graph Attention network (DGA) for the referring expression comprehension (REC) task, performing multi-step, language-guided visual reasoning. DGA decomposes the expression, models the relationships among objects, and dynamically updates compound object representations, improving both the interpretability and the reasoning ability of the model.


Abstract
1、Referring expression comprehension is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression.
Referring expression comprehension: given a natural-language expression as guidance, identify which region/object in the image the expression refers to.
2、However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression.
Limitation of existing work: the objects described in the image are treated without any reasoning, or only with simple first-order reasoning.
3、In this paper, the authors propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. They propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node.
Problem addressed: the Dynamic Graph Attention network (DGA) is proposed to perform multi-step reasoning over the objects in the image, using a differential analyzer (effectively a GCN?) to predict a language-guided reasoning process over the relation graph.
Introduction
1、The most classic work [13, 16, 21, 25] encodes an expression with an LSTM model [5], extracts features of visual objects in the image using CNNs [24, 20], and adopts matching loss functions to learn a common feature space for the expression and the visual objects.
The authors argue that these models have poor interpretability and do not foreground reasoning: almost all existing approaches for referring expression comprehension either introduce no reasoning or support only single-step reasoning, and models trained with these approaches are hard to interpret.
2、[30, 19, 26, 28] involve extra pairwise context features or multi-order context features to improve the understanding of the image. However, they generally treat the learning process as a black box without explicit reasoning, and the learned monolithic features are not competitive enough when complex referring expressions are given.
3、Recently, single-step reasoning [7, 29] has been proposed: the expression is decomposed into different components, and each component is matched with a corresponding visual region via modular networks.
4、 [33] Its stepwise reasoning is implemented using an LSTM model, which recurrently generates attended visual features while feeding the combination of word embedding and the attended visual features back to the LSTM. However, its stepwise reasoning does not consider the linguistic structure of the expression, and it does not explore the relationships among objects in the image.
This multi-step work repeatedly feeds the combination of word embeddings and attended visual features back into an LSTM, recurrently generating new attended visual features. However, its stepwise reasoning neither considers the linguistic structure of the expression nor explores the relationships among objects in the image.
Hence the authors propose DGA.
The core ideas behind the proposed DGA come from three aspects:
1、Expression decomposition based on linguistic structure.
It is hard to accurately obtain the linguistic structure of a referring expression, since such expressions are usually complex and flexible. Therefore, a differential analyzer module predicts the constituent expressions of the input expression step by step to capture the linguistic structure, and the input expression is represented as a sequence of constituent expressions.
2、Object relationship modeling.
The proposed DGA constructs a directed graph over the objects in the image. The nodes and edges of the graph correspond to the objects and relationships among the objects respectively.
3、Multi-step reasoning to identify compound objects from relationships.
The graph is traversed under the guidance of the constituent expressions in a stepwise manner to capture higher-order relationships among the objects, and the compound object corresponding to each node is updated through graph propagation.
Related Work
1、Referring Expression Comprehension
A. Some previous work [16, 21, 25] independently encodes the inputs in the two modalities and learns a common feature space for them. To learn the common feature space, they propose different matching loss functions to optimize, e.g., softmax loss [16, 21] and triplet loss [25].
This line of work was mentioned in the first point of the Introduction.
B. recent work [32, 4] adopts co-attention mechanisms to build up the interactions between the expression and the objects in the image.
C. [7, 29] design fixed templates to softly decompose the expression into different semantic components via self-attention, and compute a language-vision matching score for each pair of component and visual region. However, these methods are not applicable to expressions that do not conform to the fixed templates.
D. [14] explores visual reasoning for referring expression comprehension in the synthetic domain. Different from that work, the authors focus on real-world images and expressions, and do not resort to the guidance of ground-truth language parses (the language programs of [14]).
2、Interpretable Reasoning
A. For one-step relational reasoning, the relation networks [22] model pairwise relationships between objects directly.
B. For single-step or multi-step reasoning, some work [28, 26, 15, 8] explains the reasoning steps by generating updated attention distribution on the image for each step using the attention mechanisms.
C. The other work [1, 9, 6, 3] decomposes the reasoning procedure into a sequence of sub-tasks and learns different modular networks to deal with each sub-task.
However, none of the above introduces explicitly interpretable reasoning steps.
D. The modular networks are used to improve the interpretabilities of models on referring expression comprehension [7, 29].
E. The other work [32] enables reasoning as a stepwise attention process following the stepwise representation of the expression; however, it treats the expression as a sequence of words, ignoring its linguistic structure.
Different from existing work on referring expression comprehension, we adopt a differential analyzer module to dynamically decompose the expression into its constituent expressions step by step to maintain its linguistic structure and to implement multi-step and dynamic reasoning.
Dynamic Graph Attention Network
The authors introduce the Dynamic Graph Attention network (DGA) to address interpretability and multi-step reasoning in referring expression comprehension.

(1) A language-driven differential analyzer
We model an expression as a sequence of constituent expressions, and each constituent expression is specified as a soft distribution over the words in the expression.

Notation (each constituent expression is such a tuple of soft word weights):

- $Q=\{q_l\}_{l=1}^{L}$: the $L$ words of the expression.
- $R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$: the soft distribution over the words at step $t$.
- LSTM input: the word embeddings $F=\{f_l\}_{l=1}^{L}$.
- LSTM output: a sequence of word feature vectors $H=\{h_l\}_{l=1}^{L}$.
- The LSTM also yields a feature $q$ for the whole sentence.

DGA performs $T$ reasoning steps. First, a step-specific linear transform maps the sentence feature $q$ to a vector $q^{(t)}$:

$$q^{(t)} = W^{(t)} q + b^{(t)}$$

Then the result of the previous step, $y^{(t-1)}$, is concatenated with $q^{(t)}$ to produce a new vector $u^{(t)}$:

$$u^{(t)} = [q^{(t)}; y^{(t-1)}]$$

Next, $u^{(t)}$ is combined with each word feature $h_l$ to compute the soft distribution $R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$, and finally:

$$y^{(t)} = \sum_{l=1}^{L} r_l^{(t)} h_l$$
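The per-step computation above can be sketched as follows. The exact scoring function that produces $R^{(t)}$ is shown only in a figure omitted from these notes, so the linear scoring vector `w_attn` below is an assumption, not the paper's definition:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def analyzer_step(q, y_prev, H, W_t, b_t, w_attn):
    """One step of the language-driven differential analyzer (sketch)."""
    q_t = W_t @ q + b_t                   # q^(t) = W^(t) q + b^(t)
    u_t = np.concatenate([q_t, y_prev])   # u^(t) = [q^(t); y^(t-1)]
    # score each word feature h_l against u^(t) (assumed linear scoring)
    scores = np.array([w_attn @ np.concatenate([u_t, h_l]) for h_l in H])
    r_t = softmax(scores)                 # soft distribution R^(t)
    y_t = r_t @ H                         # y^(t) = sum_l r_l^(t) h_l
    return y_t, r_t
```

Running this for $t = 1, \dots, T$, feeding each $y^{(t)}$ back in as `y_prev`, yields the sequence of constituent-expression features.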
(2) A static graph attention module
A. Graph structure

$$G^I = (V, E, X^I)$$

$$X^I = \{x_k^I\}_{k=1}^{K}$$

$$x_k^I = [x_k^o; p_k]$$

$x_k^I$ is the concatenation of object $o_k$'s visual feature $x_k^o$ and spatial feature $p_k$; $x_k^o$ is extracted from a pretrained CNN model [24, 20].

$$p_k = W_p[x_{0k}; x_{1k}; w_k; h_k; w_k h_k]$$

$x_{0k}$ and $x_{1k}$ are the normalized coordinates of the center of object $o_k$; $w_k$ and $h_k$ are its normalized width and height.

Then, as in [28], the relationships between objects are classified into 11 types. This classification describes the spatial relations between objects and is an important part of the information passed along the edges of the graph.
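The spatial feature $p_k$ can be built directly from a pixel-space bounding box; a minimal sketch (the function and argument names here are illustrative, not from the paper):

```python
import numpy as np

def spatial_feature(box, img_w, img_h, W_p):
    """Build p_k = W_p [x_0k; x_1k; w_k; h_k; w_k h_k] from a pixel box."""
    x_min, y_min, x_max, y_max = box
    x0 = (x_min + x_max) / 2.0 / img_w   # normalized center x
    x1 = (y_min + y_max) / 2.0 / img_h   # normalized center y
    w = (x_max - x_min) / img_w          # normalized width
    h = (y_max - y_min) / img_h          # normalized height
    return W_p @ np.array([x0, x1, w, h, w * h])
```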
B. Static Attention
$$G^M = (V, E, X^M)$$

$$x_k^M = W_m[x_k^I; c_k] + b_m$$

$$c_k = \sum_{l=1}^{L} \alpha_{k,l} f_l$$

The word attention weights $\alpha_{k,l}$ are computed in two categories (entity and relation).
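Given the attention weights, the language-conditioned node feature is a straightforward computation; a sketch under the notation above (the exact form of $\alpha_{k,l}$ is in a figure omitted from these notes):

```python
import numpy as np

def node_feature(x_I, F, alpha_k, W_m, b_m):
    """x_k^M = W_m [x_k^I; c_k] + b_m, with c_k = sum_l alpha_{k,l} f_l."""
    c_k = alpha_k @ F                        # attended language context for object k
    return W_m @ np.concatenate([x_I, c_k]) + b_m
```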
(3) A dynamic graph attention module
A GCN-style feature aggregation module: node features are aggregated over the graph once per reasoning step, guided by the constituent expression of that step.
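The paper's exact update equations were shown in a figure not reproduced here; the following is a generic sketch of one language-gated message-passing step, with the gating and message forms assumed:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def graph_step(X, adj, y_t, W_gate, W_msg):
    """One reasoning step over the graph (sketch, assumed update rule).

    X:   (K, d) node features (compound object representations)
    adj: (K, K) directed adjacency, adj[i, j] = 1 for an edge j -> i
    y_t: (d,)   constituent-expression feature for this step
    """
    gates = softmax(X @ (W_gate @ y_t))            # language-guided node weights
    msgs = adj @ (gates[:, None] * (X @ W_msg.T))  # aggregate gated neighbor messages
    return X + msgs                                # residual update of each node
```

Repeating this for each of the $T$ constituent expressions captures higher-order relationships among the objects.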

(4) A matching module
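The matching equations were shown only in figures; as a placeholder, a common choice is cosine similarity between each final node feature and the expression feature, with the argmax giving the referred object. This is an assumed form, not necessarily the paper's:

```python
import numpy as np

def matching_scores(X_final, q):
    """Cosine similarity between each compound object feature and the
    expression feature q (assumed matching function)."""
    Xn = X_final / np.linalg.norm(X_final, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    return Xn @ qn   # argmax over objects gives the referred region
```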
