
Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

Guohao Li, Hang Su, Wenwu Zhu


Department of Computer Science and Technology, Tsinghua University, Beijing, China
[email protected], [email protected], [email protected]
arXiv:1712.00733v1 [cs.CV] 3 Dec 2017

Abstract

Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework that endows the model with the capability to answer more complex questions by leveraging massive external knowledge through dynamic memory networks. Specifically, the questions together with the corresponding images trigger a process to retrieve the relevant information in external knowledge bases, which is embedded into a continuous vector space that preserves the entity-relation structure. Afterwards, we employ dynamic memory networks to attend to the large body of facts in the knowledge graph and images, and then perform reasoning over these facts to generate the corresponding answers. Extensive experiments demonstrate that our model not only achieves state-of-the-art performance on the visual question answering task, but can also answer open-domain questions effectively by leveraging the external knowledge.

Figure 1: A real case of open-domain visual question answering based on the internal representation of an image and external knowledge. Closed-domain question: "What is the animal in the image?" Answer: "Giraffe." Open-domain question: "What is the favorite food of the animal?" Answer: "Leaves." External knowledge: the food of giraffes is leaves and fruits, which cannot be reached by most other herbivores (Wikipedia). The recent success of deep learning provides a good opportunity to implement closed-domain VQA, but such systems are incapable of answering open-domain questions when external knowledge is needed. In this example, the system should recognize the giraffes and then query the knowledge bases for the main diet of giraffes. In this paper, we propose to explore external knowledge along with the image representation based on a dynamic memory network, which allows multi-hop reasoning over several facts.

1. Introduction

Visual Question Answering (VQA) is a ladder towards a better understanding of the visual world, which pushes forward the boundaries of both computer vision and natural language processing. A system in a VQA task is given a text-based question about an image and is expected to generate a correct answer to that question. In general, VQA is a kind of Visual Turing Test, which rigorously assesses whether a system is able to achieve human-level semantic analysis of images [10, 13]. A system that performs as well as or better than humans in VQA could solve most tasks in computer vision. VQA has therefore garnered increasing attention due to its numerous potential applications [2], such as providing a more natural way to improve human-computer interaction and enabling visually impaired individuals to get information about images.

To fulfill VQA tasks, the responder must understand the intention of the question, reason over visual elements of the image, and sometimes draw on general knowledge about the world. Most present methods solve VQA by jointly learning interactions and performing inference over the question and image contents, building on the recent success of deep learning [19, 2, 22, 9, 8], and can be further improved by introducing attention mechanisms [34, 32, 31, 17, 33, 1]. However, most questions in current VQA datasets are quite simple and are answerable by analyzing the question and image alone [2, 28]. It is debatable whether such systems can answer questions that require prior knowledge ranging from common sense to subject-specific and even expert-level knowledge. It is therefore attractive to develop methods capable of deeper image understanding by answering open-domain questions [28], which requires the system to connect VQA with structured knowledge, as shown in Fig. 1. Some efforts have been made in this direction, but most of them can only handle a limited number of predefined types of questions [26, 25].

Unlike the text-based QA problem, it is unfavourable to conduct open-domain VQA purely through knowledge-based reasoning, since describing an image with structured forms is inevitably incomplete [15]. The recent availability of large training datasets [28] makes it feasible to train a complex model in an end-to-end fashion by leveraging recent advances in deep neural networks (DNNs) [2, 9, 34, 17, 1]. Nevertheless, it is non-trivial to integrate knowledge into DNN-based methods, since knowledge is usually represented in a symbol-based or graph-based manner (e.g., Freebase [5], DBpedia [3]), which is intrinsically different from DNN-based features. A few attempts have been made in this direction [29], but they may involve much irrelevant information and fail to implement multi-hop reasoning over several facts.

Memory networks [27, 24, 16] offer an opportunity to address these challenges by reading from and writing to an external memory module, which is modeled by the actions of neural networks. Recently, they have demonstrated state-of-the-art performance in numerous NLP applications, including reading comprehension [20] and textual question answering [6, 16]. Some seminal efforts have also been made to implement VQA based on dynamic memory networks [30], but they do not involve a mechanism to incorporate external knowledge, making them incapable of answering open-domain visual questions. Nevertheless, these attractive characteristics motivate us to leverage memory structures to encode large-scale structured knowledge and fuse it with image features, which offers an approach to answering open-domain visual questions.

1.1. Our Proposal

To address the aforementioned issues, we propose a novel Knowledge-incorporated Dynamic Memory Network framework (KDMN), which introduces massive external knowledge to answer open-domain visual questions by exploiting a dynamic memory network. It endows a system with the capability to answer a broad class of open-domain questions by reasoning over the image content together with the massive knowledge, which is conducted through the memory structures.

Unlike most existing techniques that focus on answering visual questions solely from the image content, we address a more challenging scenario that requires reasoning beyond the image content. DNN-based approaches [2, 9, 34] are therefore not sufficient, since they can only capture information present in the training images. Recent advances include several attempts to link knowledge to VQA methods [26, 25], which make use of structured knowledge graphs and reason about an image over the supporting facts. Most of these algorithms first extract the visual concepts from a given image, and then reason over the structured knowledge bases explicitly. However, it is non-trivial to extract sufficient visual attributes, since an image lacks the structure and grammatical rules of language. To address this issue, we propose to retrieve a batch of candidate knowledge corresponding to the given image and related questions, and feed it to the deep neural network implicitly. The proposed approach provides a general pipeline that simultaneously preserves the advantages of DNN-based approaches [2, 9, 34] and knowledge-based techniques [26, 25].

In general, the underlying symbolic nature of a Knowledge Graph (KG) makes it difficult to integrate with DNNs. The usual knowledge graph embedding models such as TransE [7] focus on link prediction, which differs from the VQA task, where the aim is to fuse knowledge. To tackle this issue, we propose to embed the entities and relations of a KG into a continuous vector space, such that the factual knowledge can be used in a simpler manner. Each knowledge triple is treated as a three-word SVO (subject, verb, object) phrase and embedded into a feature space by feeding its word embeddings through an RNN architecture. In this case, the proposed knowledge embedding shares a common space with the other textual elements (questions and answers), which makes them much easier to integrate.

Once the massive external knowledge is integrated into the model, it is imperative to provide a flexible mechanism to store a richer representation. The memory network, which contains scalable memory with a learning component to read from and write to it, allows complex reasoning by modeling interactions between multiple parts of the data [27, 30]. In this paper, we adopt the recent Improved Dynamic Memory Network (DMN+) [30] to implement complex reasoning over several facts. Our model provides a mechanism to attend to the candidate knowledge embeddings in an iterative manner, and fuses them with the multi-modal data, including image, text and knowledge triples, in the memory component. The memory vector therefore memorizes useful knowledge to facilitate the prediction of the final answer. Compared with DMN+ [30], we introduce external knowledge into the memory network, and endow the system with the ability to answer open-domain questions accordingly.

To summarize, our framework is capable of reasoning over multi-modal data, including the image content and external knowledge, such that the system is endowed with a more general capability of image interpretation. Our main contributions are as follows:
• To the best of our knowledge, this is the first attempt to integrate external knowledge and image representations with a memory mechanism, such that open-domain visual question answering can be conducted effectively with the massive knowledge appropriately harnessed;

• We propose a novel structure-preserving method to embed the knowledge triples into a common space with other textual data, making it flexible to integrate different modalities of data, such as image, text and knowledge triples, in an implicit manner;

• We propose to exploit the dynamic memory network to implement multi-hop reasoning, which has the capability to automatically retrieve the relevant information in the knowledge bases and infer the most probable answers accordingly.

Figure 2: Overall architecture of our proposed KDMN network, illustrated with the example question "Why does the person have an umbrella?" (answer: "It is raining."). Given an image and the corresponding question, the visual objects of the input image and the keywords of the question are extracted using Fast-RCNN and syntax analysis, respectively. Afterwards, we assess the importance of entities in the knowledge graph (ConceptNet) and retrieve the most informative context-relevant knowledge triples, which are fed to the memory network after embedding the candidate knowledge into a continuous feature space. Consequently, we integrate the representations of images and extracted knowledge into a common space, and store the features in a dynamic memory module. Open-domain VQA is then implemented by interpreting the joint representation under an attention mechanism.

2. Overview

In this section, we outline our model for open-domain visual question answering. In order to conduct the task, we propose to incorporate the image content and external knowledge by exploiting the most recent advance of dynamic memory networks [16, 30], yielding the three main modules in Fig. 2. The system is therefore endowed with an ability to answer arbitrary questions corresponding to a specific image.

Considering that most existing VQA datasets include only a minority of questions that require prior knowledge, the performance on them cannot reflect the particular capabilities we are interested in. We automatically produce a collection of more challenging question-answer pairs, which require complex reasoning beyond the image contents by incorporating external knowledge. We hope that it can serve as a benchmark for evaluating the capability of various VQA models in open-domain scenarios.

Given an image, we apply Fast-RCNN [11] to detect the visual objects of the input image, and extract keywords of the corresponding question with syntax analysis. Based on this information, we propose to learn a mechanism to retrieve the candidate knowledge by querying the large-scale knowledge graph, yielding a subgraph of relevant knowledge to facilitate the question answering. During the past years, a substantial number of large-scale knowledge bases have been developed, which store common sense and factual knowledge in a machine-readable fashion. In general, each piece of structured knowledge is represented as a triple (subject, rel, object), with subject and object being two entities or concepts, and rel corresponding to the specific relationship between them. In this paper, we adopt external knowledge mined from ConceptNet [23], an open multilingual knowledge graph containing common-sense relationships between daily words, to aid the reasoning of open-domain VQA.

Our VQA model provides a novel mechanism to integrate the image information with that extracted from ConceptNet within a dynamic memory network. In general, it is non-trivial to integrate the structured knowledge with the DNN features due to their different modalities. To address this issue, we embed the entities and relations of the subgraph into a continuous vector space, which preserves the inherent structure of the KG. The feature embedding makes it convenient to fuse the knowledge with the image representation in a dynamic memory network, which builds on an attention mechanism and a memory update mechanism. The attention mechanism is responsible for producing the contextual vector, with relevance inferred from the question and the previous memory status. The memory update mechanism then renews the memory status based on the contextual vector, which can memorize useful information for predicting the final answer. The novelty lies in the fact that these disparate forms of information are embedded into a common space based on the memory network, which facilitates the subsequent answer reasoning.

Finally, we generate a predicted answer by reasoning over the facts in the memory along with the image contents. In this paper, we focus on the multi-choice setting, where several candidate answers are provided along with a question and a corresponding image. For each question, we treat every multi-choice answer as input, and predict whether the image-question-answer triplet is correct. The proposed model chooses the candidate answer with the highest probability by minimizing the cross-entropy error on the answers through the entire network.
3. Answering Open-Domain Visual Questions

In this section, we elaborate on the details and formulations of our proposed model for answering open-domain visual questions. We first retrieve an appropriate amount of candidate knowledge from the large-scale ConceptNet by analyzing the image content and the corresponding questions; afterwards, we propose a novel framework based on a dynamic memory network to embed these symbolic knowledge triples into a continuous vector space and store them in a memory bank; finally, we exploit this information to implement open-domain VQA by fusing the knowledge with the image representation.

3.1. Candidate Knowledge Retrieval

In order to answer open-domain visual questions, we sometimes need to access information that is not present in the image by retrieving candidate knowledge from the KBs. A desirable knowledge retrieval should include most of the useful information while ignoring the irrelevant parts, which is essential to avoid misleading the model and to reduce the computation cost. To this end, we take the following three principles into consideration: (1) entities appearing in images and questions (key entities) are critical; (2) the importance of entities that have direct or indirect links to key entities decays as the number of link hops increases; (3) edges between these entities are potentially useful knowledge.

Following these principles, we propose a three-step procedure to retrieve the candidate knowledge that is relevant to the context of images and questions. The retrieval procedure pays more attention to graph nodes that are linked to semantic entities, and also takes account of the graph structure when measuring edge importance.

In order to retrieve the most informative knowledge, we first extract the candidate nodes in ConceptNet by analyzing the prominent visual objects in images with Fast-RCNN [11], and the textual keywords with the Natural Language Toolkit [4]. Both of them are then associated with the corresponding semantic entities in ConceptNet [23] by matching all possible n-grams of words. Afterwards, we retrieve the first-order subgraph using these selected nodes from ConceptNet [23], which includes all edges connecting with at least one candidate node. It is assumed that the resultant subgraph contains the most relevant information and is sufficient to answer questions while reducing redundancy. The resultant first-order knowledge subgraph is denoted as G.
age by retrieving the candidate knowledge in the KBs. A
3.2. Knowledge Embedding in Memories
desirable knowledge retrieval should include most of the
useful information while ignore the irrelevant ones, which The candidate knowledge that we have extracted is rep-
is essential to avoid model misleading and reduce the com- resented in a symbolic triplet format, which is intrinsically
putation cost. To this end, we take the following three prin- incompatible with DNNs. This fact urges us to embed the
ciples in consideration as (1) entities appeared in images entities and relation of knowledge triples into a continuous
and questions (key entities) are critical; (2) the importance vector space. Moreover, we regard each entity-relation-
of entities that have direct or indirect links to key entities entity triple as one knowledge unit, since each triple nat-
decays as the number of link hops increases; (3) edges be- urally represents one piece of fact. The knowledge units
tween these entities are potentially useful knowledge. can be stored in memory slots for reading and writing, and
Following these principles, we propose a three-step pro- distilled through an attention mechanism for the subsequent
cedure to retrieve that candidate knowledge that are relevant tasks.
to the context of images and questions. The retrieval proce- In order to embed the symbolic knowledge triples into
dure pays more attention on graph nodes that are linked to memory vector slots, we treat the entities and relations as
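A compact, illustrative implementation of the scoring in Eqs. (1)-(2) is sketched below. It assumes the subgraph is held as an undirected adjacency dictionary, computes hop counts by breadth-first search (so unreachable nodes simply contribute nothing), and uses a default decay factor of 0.5; these representational choices are assumptions, not the authors' exact code.

```python
# Illustrative implementation of the node/edge scoring in Eqs. (1)-(2).
# graph: dict mapping node -> set of neighbouring nodes (undirected view);
# weights: dict mapping node -> initial weight w_i (e.g., box area or 1.0).
from collections import deque

def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def node_score(graph, weights, i, decay=0.5):
    """Eq. (1): score(i) = w_i + sum_j r^n * w_j, with n the hop count from i to j."""
    dist = hop_distances(graph, i)
    return weights[i] + sum(decay ** n * weights[j]
                            for j, n in dist.items() if j != i)

def top_n_edges(graph, weights, n_edges, decay=0.5):
    """Eq. (2): rank edges by score(i) + score(j) and keep the top-N."""
    scores = {i: node_score(graph, weights, i, decay) for i in graph}
    edges = {tuple(sorted((i, j))) for i in graph for j in graph[i]}
    ranked = sorted(edges, key=lambda e: scores[e[0]] + scores[e[1]], reverse=True)
    return ranked[:n_edges]
```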
3.2. Knowledge Embedding in Memories

The candidate knowledge that we have extracted is represented in a symbolic triplet format, which is intrinsically incompatible with DNNs. This fact urges us to embed the entities and relations of the knowledge triples into a continuous vector space. Moreover, we regard each entity-relation-entity triple as one knowledge unit, since each triple naturally represents one piece of fact. The knowledge units can be stored in memory slots for reading and writing, and distilled through an attention mechanism for the subsequent tasks.

In order to embed the symbolic knowledge triples into memory vector slots, we treat the entities and relations as words, and map them into a continuous vector space using word embeddings [21]. Afterwards, the embedded knowledge is encoded into a fixed-size vector by feeding it to a recurrent neural network (RNN). Specifically, we initialize the word-embedding matrix with a pre-trained GloVe word embedding [21], and refine it simultaneously with the rest of the procedure of question and candidate-answer embedding. In this case, the entities and relations share a common embedding space with the other textual elements (questions and answers), which makes them much more flexible to fuse later.

The knowledge triples are then treated as SVO phrases of (subject, verb, object), and fed to a standard two-layer stacked LSTM as

C_i^{(t)} = \mathrm{LSTM}\big(L[w_i^t], C_i^{(t-1)}\big), \quad t = \{1, 2, 3\}, \; i = 1, \cdots, N,   (3)

where w_i^t is the t-th word of the i-th SVO phrase, (w_i^1, w_i^2, w_i^3) ∈ G*, L is the word embedding matrix [21], and C_i is the internal state of the LSTM cell when forwarding the i-th SVO phrase. The rationale lies in the fact that the LSTM can capture the semantic meanings effectively when the knowledge triples are treated as SVO phrases.

For each question-answering context, we take the LSTM internal states of the relevant knowledge triples as memory vectors, yielding the embedded knowledge stored in memory slots as

M^{(i)} = \big[C_i^{(3)}\big],   (4)

where M^{(i)} is the i-th memory slot corresponding to the i-th knowledge triple, which can be used for further answer inference. Note that this method is different from the usual knowledge graph embedding models, since our model aims to fuse knowledge with the latent features of images and text, whereas alternative models such as TransE [7] focus on the link prediction task.
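A minimal sketch of the knowledge embedding in Eqs. (3)-(4) is given below, assuming PyTorch (the paper does not name a framework). Each SVO phrase is looked up in a trainable word-embedding table (e.g., GloVe-initialised) and pushed through a two-layer LSTM; keeping the output at the third word as the memory slot is a slight simplification of Eq. (4), where the LSTM internal state is used. Dimensions are illustrative.

```python
# Minimal sketch of the knowledge embedding of Eqs. (3)-(4), assuming PyTorch.
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Word-embedding matrix L, e.g. initialised from GloVe and fine-tuned.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Standard two-layer stacked LSTM, as described in the text.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, triple_ids):
        # triple_ids: (N, 3) word indices of N SVO phrases (subject, verb, object).
        embedded = self.embedding(triple_ids)   # (N, 3, embed_dim)
        outputs, _ = self.lstm(embedded)        # (N, 3, hidden_dim)
        # Keep the state after the third word as the memory slot, cf. Eq. (4).
        return outputs[:, -1, :]                # (N, hidden_dim)

# Usage sketch:
# memory = TripleEncoder(vocab_size=10000)(torch.randint(0, 10000, (20, 3)))
```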
3.3. Attention-based Knowledge Fusion with DNNs

We have stored N relevant knowledge embeddings in memory slots for a given question-answer context, which allows us to incorporate massive knowledge when N is large. The external knowledge overwhelms the other contextual information in quantity, making it imperative to distill the useful information from the candidate knowledge. The Dynamic Memory Network (DMN) [16, 30] provides a mechanism to address this problem by modeling interactions among multiple data channels. In the DMN module, an episodic memory vector is formed and updated during an iterative attention process, which memorizes the most useful information for question answering. Moreover, the iterative process brings a potential capability of multi-hop reasoning.

The DMN consists of an attention component, which generates a contextual vector using the previous memory vector, and an episodic memory updating component, which updates itself based on the contextual vector. Specifically, we propose a novel method to generate the query vector q by feeding the visual and textual features to a non-linear fully-connected layer to capture the question-answer context information as

q = \tanh\big(W_1 \big[f^{(I)}; f^{(Q)}; f^{(A)}\big] + b_1\big),   (5)

where W_1 and b_1 are the weight matrix and bias vector, respectively, and f^{(I)}, f^{(Q)} and f^{(A)} denote the DNN features corresponding to the image, question and multi-choice answer, respectively. The query vector q thus captures information from the question-answer context.

During the training process, the query vector q initializes an episodic memory vector m^{(0)} as m^{(0)} = q. An iterative attention process is then triggered, which gradually refines the episodic memory m until the maximum number of iteration steps T is reached. By the T-th iteration, the episodic memory m^{(T)} will have memorized useful visual and external information to answer the question.

Attention component. At the t-th iteration, we concatenate each knowledge embedding M_i with the last-iteration episodic memory m^{(t-1)} and the query vector q, and then apply the basic soft attention procedure to obtain the t-th context vector c^{(t)} as

z_i^{(t)} = \big[M_i; m^{(t-1)}; q\big]   (6)
\alpha^{(t)} = \mathrm{softmax}\big(w^{\top} \tanh\big(W_2 z_i^{(t)} + b_2\big)\big)   (7)
c^{(t)} = \sum_{i=1}^{N} \alpha_i^{(t)} M_i, \quad t = 1, \cdots, T,   (8)

where z_i^{(t)} is the concatenated vector for the i-th candidate memory at the t-th iteration; \alpha_i^{(t)} is the i-th element of \alpha^{(t)}, representing the normalized attention weight for M_i at the t-th iteration; and w, W_2 and b_2 are parameters to be optimized in the deep neural networks. Hereby, we obtain the contextual vector c^{(t)}, which captures useful external knowledge for updating the episodic memory m^{(t-1)} and provides the supporting facts to answer the open-domain questions.

Episodic memory updating component. We apply the memory update mechanism [24, 30] as

m^{(t)} = \mathrm{ReLU}\big(W_3 \big[m^{(t-1)}; c^{(t)}; q\big] + b_3\big),   (9)

where W_3 and b_3 are parameters to be optimized. After the iterations, the episodic memory m^{(T)} memorizes useful knowledge information to answer the open-domain question.

Compared with the DMN+ model implemented in [30], we allow the dynamic memory network to incorporate massive external knowledge into the VQA reasoning procedure. This endows the system with the capability to answer more general visual questions that are relevant to but go beyond the image contents, which is more attractive in practical applications.
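The query construction, soft attention and episodic memory update of Eqs. (5)-(9) can be sketched as the following PyTorch module; the linear layers correspond to W_1, W_2, w and W_3 in the equations, while the dimensions and the single-example (non-batched) interface are illustrative assumptions.

```python
# Sketch of the query construction, soft attention and episodic memory update
# of Eqs. (5)-(9), assuming PyTorch; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    def __init__(self, feat_dim, mem_dim, hops=2):
        super().__init__()
        self.hops = hops
        self.query_layer = nn.Linear(3 * feat_dim, mem_dim)   # W_1, b_1 in Eq. (5)
        self.att_layer = nn.Linear(3 * mem_dim, mem_dim)      # W_2, b_2 in Eq. (7)
        self.att_vector = nn.Linear(mem_dim, 1, bias=False)   # w in Eq. (7)
        self.update_layer = nn.Linear(3 * mem_dim, mem_dim)   # W_3, b_3 in Eq. (9)

    def forward(self, memory_slots, f_img, f_q, f_a):
        # memory_slots: (N, mem_dim); f_img, f_q, f_a: (feat_dim,)
        q = torch.tanh(self.query_layer(torch.cat([f_img, f_q, f_a])))  # Eq. (5)
        m = q                                                           # m^(0) = q
        for _ in range(self.hops):
            n = memory_slots.size(0)
            z = torch.cat([memory_slots,
                           m.expand(n, -1), q.expand(n, -1)], dim=1)    # Eq. (6)
            alpha = F.softmax(self.att_vector(torch.tanh(self.att_layer(z))),
                              dim=0)                                    # Eq. (7)
            c = (alpha * memory_slots).sum(dim=0)                       # Eq. (8)
            m = F.relu(self.update_layer(torch.cat([m, c, q])))         # Eq. (9)
        return m                                                        # m^(T)
```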
Fusion with episodic memory and inference. Finally, we embed the visual features f^{(I)} along with the textual features f^{(Q)} and f^{(A)} into a common space, and fuse them together using the Hadamard product (element-wise multiplication) as

e^{(k)} = \tanh\big(W^{(k)} f^{(k)} + b^{(k)}\big), \quad k \in \{I, Q, A\}   (10)
h = e^{(I)} \odot e^{(Q)} \odot e^{(A)},   (11)

where e^{(I)}, e^{(Q)} and e^{(A)} are the embedded features for the image, question and answer, respectively; h is the fused feature in this common space; and W^{(I)}, W^{(Q)} and W^{(A)} are the corresponding parameters of the neural networks.

The final episodic memory m^{(T)} is concatenated with the fused feature h to predict the probability of whether the multi-choice candidate answer is correct as

ans^{*} = \arg\max_{ans \in \{1,2,3,4\}} \mathrm{softmax}\big(W_4 \big[h_{ans}; m_{ans}^{(T)}\big] + b_4\big),   (12)

where ans represents the index of the multi-choice candidate answers; the supporting knowledge triples are stored in m_{ans}^{(T)}; and W_4 and b_4 are the parameters to be optimized in the DNNs. The final choice is consequently obtained once we have ans*.

Our training objective is to learn the parameters based on a cross-entropy loss function as

\mathcal{L} = -\frac{1}{D} \sum_{i}^{D} \big[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\big],   (13)

where \hat{y}_i = p_i(A^{(i)} | I^{(i)}, Q^{(i)}, K^{(i)}; \theta) represents the probability of predicting the answer A^{(i)}, given the i-th image I^{(i)}, question Q^{(i)} and external knowledge K^{(i)}; \theta represents the model parameters; D is the number of training samples; and y_i is the label of the i-th sample. The model can be trained in an end-to-end manner once the candidate knowledge triples have been retrieved from the original knowledge graph.
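To connect Eqs. (10)-(13), the sketch below embeds the image, question and answer features into a common space, fuses them with a Hadamard product, concatenates the result with the final episodic memory and produces one score per candidate answer. Scoring each candidate with a binary cross-entropy objective is one simple way to realise Eqs. (12)-(13); the framework (PyTorch) and dimensions are assumptions for illustration.

```python
# Sketch of the fusion and answer scoring of Eqs. (10)-(13), assuming PyTorch.
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    def __init__(self, feat_dim, common_dim=1024, mem_dim=2048):
        super().__init__()
        # One embedding layer per modality: W^(k), b^(k) in Eq. (10).
        self.embed = nn.ModuleDict({
            k: nn.Linear(feat_dim, common_dim) for k in ("I", "Q", "A")
        })
        self.classifier = nn.Linear(common_dim + mem_dim, 1)  # W_4, b_4 in Eq. (12)

    def forward(self, f_img, f_q, f_a, episodic_memory):
        e = {k: torch.tanh(self.embed[k](f))
             for k, f in zip(("I", "Q", "A"), (f_img, f_q, f_a))}   # Eq. (10)
        h = e["I"] * e["Q"] * e["A"]                  # Hadamard product, Eq. (11)
        logit = self.classifier(torch.cat([h, episodic_memory], dim=-1))
        return logit.squeeze(-1)                      # score of this candidate

# Training sketch: one binary label per image-question-answer triplet, cf. Eq. (13):
#   loss = nn.BCEWithLogitsLoss()(logits, labels)
# At test time, following Eq. (12), pick the candidate answer with the largest score.
```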
4. Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed model and compare it with its variants and alternative methods. We specifically implement the evaluation on a public benchmark dataset (Visual7W) [34] for the close-domain VQA task, and also automatically generate numerous arbitrary question-answer pairs to evaluate the performance on open-domain VQA. We first briefly review the dataset and the implementation details, and then report the performance of our proposed method compared with several baseline models on both close-domain and open-domain VQA tasks.

4.1. Datasets

We train and evaluate our model on a publicly available large-scale visual question answering dataset, the Visual7W dataset [34], due to the diversity of its question types. Besides, since no public open-domain VQA dataset is currently available for evaluation, we automatically build a collection of open-domain visual question-answer pairs to examine the potential of our model for answering open-domain visual questions.

4.1.1 Visual7W Dataset

The Visual7W dataset [34] is built on a subset of images from Visual Genome [14], and includes questions in terms of (what, where, when, who, why, which and how) along with the corresponding answers in a multi-choice format. Similar to [34], we divide the dataset into training, validation and test subsets, with 327,939 question-answer pairs on 47,300 images in total. Compared with alternative datasets, Visual7W has a more diverse range of question-answer types and image content [28], which provides more opportunities to assess the human-level capability of a system on open-domain VQA.

4.1.2 Open-domain Question Generation

In this paper, we automatically generate numerous question-answer pairs by considering the image content and relevant background knowledge, which provides a test bed for the evaluation of a more realistic VQA task. Specifically, we generate a collection automatically based on the test images in Visual7W by filling a set of question-answer templates, which means that this information is not present during the training stage. To make the task more challenging, we selectively sample the question-answer pairs that require reasoning on both the visual concepts in the image and the external knowledge, making the task resemble the scenario of open-domain visual question answering. In this paper, we generate 16,850 open-domain question-answer pairs on images in the Visual7W test split. More details on the QA generation and relevant information can be found in the supplementary material.

4.2. Implementation Details

In our experiments, we fix the joint-embedding common space dimension as 1024, the word-embedding dimension as 300, and the dimension of the LSTM internal states as 512. We use a pre-trained ResNet-101 [12] model to extract image features, and select 20 candidate knowledge triples for each QA pair throughout the experiments. Empirical study demonstrates that this is sufficient for our task, although more knowledge triples are also allowed. The iteration number of dynamic memory network updates is set to 2, and the dimension of the episodic memory is set to 2048, which is equal to the dimension of the memory slots.

In this paper, we combine the candidate question-answer pair to generate a hypothesis, and formulate the multi-choice VQA problem as a classification task. The correct answer can be determined by choosing the one with the largest probability. In each iteration, we randomly sample a batch of 500 QA pairs, and apply the stochastic gradient descent algorithm with a base learning rate of 0.0001 to tune the model parameters. The candidate knowledge is first retrieved, and the other modules are trained in an end-to-end manner.
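For convenience, the hyper-parameters reported in this section can be summarised in a single configuration; the dictionary below only restates the stated values, and the key names are chosen for illustration.

```python
# Hyper-parameter summary of the settings reported above (key names are illustrative).
KDMN_CONFIG = {
    "common_space_dim": 1024,     # joint-embedding common space
    "word_embedding_dim": 300,    # GloVe initialisation
    "lstm_hidden_dim": 512,       # LSTM internal states
    "image_backbone": "ResNet-101 (pre-trained)",
    "candidate_triples_per_qa": 20,
    "memory_hops": 2,             # dynamic memory network update iterations
    "episodic_memory_dim": 2048,  # equal to the memory slot dimension
    "batch_size": 500,            # QA pairs per iteration
    "optimizer": "SGD",
    "base_learning_rate": 1e-4,
}
```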
4.2.1 Comparison Methods

In order to analyze the contribution of each component in our knowledge-enhanced, memory-based model, we ablate our full model as follows:

• KDMN-NoKG: the baseline version of our model. No external knowledge is involved; the other parameters are set the same as in the full model.

• KDMN-NoMem: a version without the memory network. The external knowledge triples are used through one-pass soft attention.

• KDMN: our full model. The external knowledge triples are incorporated in the dynamic memory network.

We also compare our method with several alternative VQA methods, including (1) LSTM-Att [34]: an LSTM model with spatial attention; (2) MemAUG [18]: a memory-augmented model for VQA; (3) MCB+Att [8]: a model combining multi-modal features by Multimodal Compact Bilinear pooling; and (4) MLAN [33]: an advanced multi-level attention model.

4.3. Results and Analysis

In this section, we report the quantitative evaluation along with representative samples of our method, compared with our ablative models and the state-of-the-art methods for both the conventional (close-domain) VQA task and open-domain VQA.

4.3.1 VQA Task

We report the quantitative accuracy in Table 1 along with the sample results in Fig. 3. The overall results demonstrate that our algorithm obtains different boosts compared with the competitors on various kinds of questions, e.g., significant improvements on Who (5.9%) and What (4.9%) questions, and a slight boost on When (1.4%) and How (2.0%) questions. After inspecting the success and failure cases, we found that the Who and What questions have a larger diversity in questions and multi-choice answers compared to other types, and therefore benefit more from external background knowledge. Note that compared with the method of MemAUG [18], in which a memory mechanism is also adopted, our algorithm still gains a significant improvement, which further confirms our belief that the background knowledge provides critical support.

Table 1: Accuracy on the Visual7W dataset.

Methods          What  Where  When  Who   Why   How   Average
LSTM-Att. [34]   51.5  57.0   75.0  59.5  55.5  49.8  54.3
MCB + Att. [8]   60.3  70.4   79.5  69.2  58.2  51.1  62.2
MemAUG [18]      62.2  68.9   76.8  66.4  57.8  52.9  62.8
MLAN [33]        60.5  71.2   79.6  69.4  58.0  50.8  62.4
KDMN-NoKG        59.7  69.6   79.9  68.0  61.6  51.3  62.0
KDMN-NoMem       62.1  71.5   81.1  72.5  62.9  54.0  64.4
KDMN             64.6  73.1   81.3  73.9  64.1  53.3  66.0
Ensemble         67.9  77.0   83.3  77.2  69.0  56.8  69.4

We further make comprehensive comparisons among our ablative models. To be fair, all the experiments are implemented on the same basic network structure and share the same hyper-parameters. In general, our KDMN model gains on average 1.6% over the KDMN-NoMem model and 4.0% over the KDMN-NoKG model, which further implies the effectiveness of dynamic memory networks in exploiting external knowledge. Through the iterative attention processes, the episodic memory vector captures background knowledge distilled from the external knowledge embeddings. The KDMN-NoMem model gains 2.4% over the KDMN-NoKG model, which implies that the incorporated external knowledge brings an additional advantage and acts as supplementary information for predicting the final answer. The indicative examples in Fig. 3 also demonstrate the impact of external knowledge, such as the 4th example, "Why is the light red?"; it is helpful if we can effectively retrieve the function of traffic lights from the external knowledge.

Figure 3: Example results on the Visual7W dataset for (close-domain) VQA tasks. Given an image and the corresponding question, we report the answers obtained via our algorithm: pr denotes the predicted probability generated by our model, and pr-NoKG is the predicted probability from the ablative KDMN-NoKG model; the predicted choices are marked in bold in the figure. The retrieved external knowledge triples (e.g., (stop-bicycle, MotivatedByGoal, light-red) and (giraffe, AtLocation, zoo)) are also shown when they support the joint reasoning. As observed, external knowledge is helpful even for conventional VQA tasks; e.g., in the 5th example it is much easier to infer the place once a giraffe has been recognized and external knowledge is incorporated.

4.3.2 Open-Domain VQA

We report the quantitative performance of open-domain VQA in Table 2 along with the sample results in Fig. 4. Since most of the alternative methods do not provide results in the open-domain scenario, we make a comprehensive comparison with our ablative models. As expected, we observe a significant improvement (12.7%) of our full KDMN model over the KDMN-NoKG model, where 6.8% is attributed to the involvement of external knowledge and 5.9% to the usage of the memory network. The examples in Fig. 4 further provide some intuitive understanding of our algorithm. It is difficult or even impossible for a system to answer an open-domain question when comprehensive reasoning beyond the image content is required; e.g., background knowledge about the prices of objects is essential for a machine when inferring which ones are expensive. The larger performance improvement on the open-domain dataset supports our belief that background knowledge is essential to answer general visual questions. Note that the performance can be further improved if ensembling is allowed: we fused the results of several KDMN models trained from different initializations, and experiments demonstrate that we can further obtain an improvement of about 3.1%.

Table 2: Accuracy on our generated open-domain dataset.

Methods       Accuracy
KDMN-NoKG     45.1
KDMN-NoMem    51.9
KDMN          57.8
Ensemble      60.9

Figure 4: Example results of open-domain visual question answering based on our proposed knowledge-incorporated dynamic memory network. Given an image, we automatically generate an open-domain question-answer pair by considering the image content and the relevant background knowledge, and report the answers obtained via our algorithm: pr denotes the predicted probability generated by our model, and pr-NoKG is the predicted probability from the ablative KDMN-NoKG model. The supporting knowledge triples include, e.g., (meat, UsedFor, eating), (handle, PartOf, door), (computer, HasProperty, expensive), (cow, HasA, four-legs) and (plane, CapableOf, flying). The results demonstrate that external knowledge plays an essential role in answering open-domain questions; without it, a system cannot infer the edible item in the 1st example or the expensive object in the 3rd example.

5. Conclusion

In this paper, we proposed a novel framework named the Knowledge-incorporated Dynamic Memory Network (KDMN) to answer open-domain visual questions by harnessing massive external knowledge in a dynamic memory network. Context-relevant external knowledge triples are retrieved and embedded into memory slots, and then distilled through a dynamic memory network to jointly infer the final answer together with the visual features. The proposed pipeline not only maintains the superiority of DNN-based methods, but also acquires the ability to exploit external knowledge for answering open-domain visual questions. Extensive experiments demonstrate that our method achieves competitive results on a public large-scale dataset, and gains a large improvement on our generated open-domain dataset.
References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735, 2007.
[4] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
[5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[6] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
[7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[9] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
[10] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
[11] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
[17] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.
[18] C. Ma, C. Shen, A. Dick, and A. v. d. Hengel. Visual question answering with memory-augmented networks. arXiv preprint arXiv:1707.04968, 2017.
[19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9, 2015.
[20] B. Pan, H. Li, Z. Zhao, B. Cao, D. Cai, and X. He. MEMEN: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098, 2017.
[21] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[22] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. In Advances in Neural Information Processing Systems, 2015.
[23] R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686, 2012.
[24] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.
[25] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[26] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
[27] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[28] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.
[29] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
[30] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pages 2397–2406, 2016.
[31] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[32] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[33] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
6. Supplementary Material
6.1. Details of our Open-domain Dataset Generation
We obey several principles when building the open-domain VQA dataset for evaluation: (1) the question-answer pairs should be generated automatically; (2) both visual information and external knowledge should be required to answer the generated open-domain visual questions; (3) the dataset should follow a multi-choice setting, in accordance with the Visual7W dataset for fair comparison.

The open-domain question-answer pairs are generated based on a subset of images in the Visual7W [34] standard test split, so that the test images are not present during the training stage. For a particular image about which we need to generate open-domain question-answer pairs, we first extract several prominent visual objects and randomly select one of them. After being linked to a semantic entity in ConceptNet [23], the visual object connects to other entities in ConceptNet through various relations, e.g., UsedFor and CapableOf, and forms a number of knowledge triples (head, relation, tail), where either head or tail is the visual object. We then randomly select one knowledge triple and fill it into a relation-related question-answer template to obtain the question-answer pair. These templates assume that the correct answer satisfies the knowledge requirement and also appears in the image, as shown in Table 3.

For each open-domain question-answer pair, we generate three additional confusing items as candidate answers. These candidate answers are randomly sampled from a collection composed of the answers of other question-answer pairs belonging to the same relation type. In order to make the open-domain dataset more challenging, we selectively sample confusing answers that either satisfy the knowledge requirement or appear in the image, but do not satisfy both as the ground-truth answers do. Specifically, one of the confusing answers satisfies the knowledge requirement but does not appear in the image, so that the model must attend to the visual objects in the image; another confusing answer appears in the image but does not satisfy the knowledge requirement, so that the model must reason over the external knowledge to answer these open-domain questions. Please see the examples in Figure 5.

In total, we generate 16,850 open-domain question-answer pairs based on 8,425 images in the Visual7W test split.

KG Triple QA templates
(visual, UsedFor, other) what in this image can be used for {other}?
(other, UsedFor, visual) what in this image can {other} be used for?
(visual, PartOf, other) what in this image is a part of {other}?
(other, PartOf, visual) what in this image has {other} as a part?
(visual, HasProperty, other) what in this image has the property of {other}?
(other, HasProperty, visual) what property does the {other} in this image have?
(visual, HasA, other) what in this image has {other}?
(other, HasA, visual) what in this image belongs to {other}?
(visual, CapableOf, other) what in this image is capable of {other}?
(other, CapableOf, visual) what in this image is {other} capable of?

Table 3: Templates for generating open-domain question-answer pairs. {visual} is the KG entity representing the visual object; {other} is a KG entity that has a connection with {visual}. We take {visual} as the generated ground-truth answer.
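The generation procedure described above can be sketched in a few lines of Python. The template dictionary mirrors a subset of Table 3, and the helper is a simplified, hypothetical stand-in for the actual pipeline (distractor sampling is omitted here).

```python
# Illustrative sketch of the template-based open-domain QA generation.
import random

TEMPLATES = {
    ("UsedFor", "visual_first"): "what in this image can be used for {other}?",
    ("UsedFor", "other_first"): "what in this image can {other} be used for?",
    ("PartOf", "visual_first"): "what in this image is a part of {other}?",
    ("PartOf", "other_first"): "what in this image has {other} as a part?",
    # ... the remaining rows of Table 3 (HasProperty, HasA, CapableOf) follow the same pattern.
}

def generate_qa(visual_objects, kg_triples):
    """Pick a visual object, pick a triple containing it, and fill the matching template."""
    obj = random.choice(visual_objects)
    head, relation, tail = random.choice(
        [t for t in kg_triples if obj in (t[0], t[2])])
    if head == obj:                       # (visual, relation, other)
        question = TEMPLATES[(relation, "visual_first")].format(other=tail)
    else:                                 # (other, relation, visual)
        question = TEMPLATES[(relation, "other_first")].format(other=head)
    return question, obj                  # the visual object is the ground-truth answer

# Example:
# generate_qa(["candle"], [("candle", "UsedFor", "light")])
# -> ("what in this image can be used for light?", "candle")
```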

Figure 5: Examples from our generated open-domain dataset, e.g., "What in this image can be used for light?", "What in this image has window as a part?", "What in this image has the property of cute?", "What in this image has two wheels?" and "What in this image is capable of hitting balls?", each with four candidate answers; the ground-truth answers are marked in green in the original figure. The KG triples shown underneath (e.g., (candle, UsedFor, light), (window, PartOf, building), (children, HasProperty, cute), (motorcycle, HasA, two-wheels), (batter, CapableOf, hit-balls)) only provide insight into the generation process and are not included in the dataset. The candidate answers can be quite confusing for some questions; e.g., in the 1st example, the ground-truth "candle" appears in the image and can be used for light, while "cake" also appears in the image but cannot be used for light, and "sun" can be used for light but does not appear in the image.
