Different from the text-based QA problem, it is unfavourable to conduct open-domain VQA purely via knowledge-based reasoning, since describing an image with structured forms is inevitably incomplete [15]. The recent availability of large training datasets [28] makes it feasible to train a complex model in an end-to-end fashion by leveraging the recent advances in deep neural networks (DNN) [2, 9, 34, 17, 1]. Nevertheless, it is non-trivial to integrate knowledge into DNN-based methods, since knowledge is usually represented in a symbol-based or graph-based manner (e.g., Freebase [5], DBpedia [3]), which is intrinsically different from DNN-based features. A few attempts have been made in this direction [29], but they may involve much irrelevant information and fail to implement multi-hop reasoning over several facts.

Memory networks [27, 24, 16] offer an opportunity to address these challenges by reading from and writing to an external memory module that is manipulated by neural networks. They have recently demonstrated state-of-the-art performance in numerous NLP applications, including reading comprehension [20] and textual question answering [6, 16]. Seminal efforts have also been made to implement VQA with dynamic memory networks [30], but they lack a mechanism to incorporate external knowledge and are therefore incapable of answering open-domain visual questions. Nevertheless, these attractive characteristics motivate us to leverage memory structures to encode large-scale structured knowledge and fuse it with image features, which offers an approach to answering open-domain visual questions.

1.1. Our Proposal

To address the aforementioned issues, we propose a novel Knowledge-incorporated Dynamic Memory Network framework (KDMN), which introduces massive external knowledge to answer open-domain visual questions by exploiting a dynamic memory network. It endows a system with the capability to answer a broad class of open-domain questions by reasoning over the image content together with the massive knowledge, which is carried out by the memory structures.

Different from most existing techniques, which answer visual questions based solely on the image content, we propose to address a more challenging scenario that requires reasoning beyond the image content. DNN-based approaches [2, 9, 34] alone are therefore not sufficient, since they can only capture information present in the training images. Recent advances have witnessed several attempts to link knowledge to VQA methods [26, 25], which make use of structured knowledge graphs and reason about an image based on supporting facts. Most of these algorithms first extract visual concepts from a given image and reason over the structured knowledge bases explicitly. However, it is non-trivial to extract sufficient visual attributes, since an image lacks the structure and grammatical rules of language. To address this issue, we propose to retrieve a batch of candidate knowledge corresponding to the given image and the related questions, and feed it to the deep neural network implicitly. The proposed approach provides a general pipeline that simultaneously preserves the advantages of DNN-based approaches [2, 9, 34] and knowledge-based techniques [26, 25].

In general, the underlying symbolic nature of a Knowledge Graph (KG) makes it difficult to integrate with DNNs. Typical knowledge graph embedding models such as TransE [7] focus on link prediction, which differs from the VQA task that aims to fuse knowledge. To tackle this issue, we propose to embed the entities and relations of a KG into a continuous vector space, such that the factual knowledge can be used in a simpler manner. Each knowledge triple is treated as a three-word SVO (subject, verb, object) phrase and embedded into a feature space by feeding its word embeddings through an RNN architecture. In this way, the proposed knowledge embedding shares a common space with the other textual elements (questions and answers), which makes them easier to integrate.

Once the massive external knowledge is integrated into the model, it is imperative to provide a flexible mechanism to store a richer representation. The memory network, which couples a scalable memory with a learned component that reads from and writes to it, allows complex reasoning by modeling the interaction between multiple parts of the data [27, 30]. In this paper, we adopt the recent Improved Dynamic Memory Network (DMN+) [30] to implement complex reasoning over several facts. Our model attends to the candidate knowledge embeddings in an iterative manner and fuses them with the multi-modal data, including the image, text and knowledge triples, in the memory component. The memory vector therefore memorizes useful knowledge to facilitate the prediction of the final answer. Compared with DMN+ [30], we introduce external knowledge into the memory network and thereby endow the system with the ability to answer open-domain questions.

To summarize, our framework is capable of conducting multi-modal reasoning over the image content and external knowledge, such that the system is endowed with a more general capability of image interpretation. Our main contributions are as follows:

• To the best of our knowledge, this is the first attempt to integrate external knowledge and image representations with a memory mechanism, such that open-domain visual question answering can be conducted effectively with the massive knowledge appropriately incorporated.
Figure 2: Overall architecture of our proposed KDMN network. Given an image and the corresponding question, the visual objects of the input image and the key words of the question are extracted using Fast R-CNN and syntactic analysis, respectively. Afterwards, we assess the importance of entities in the knowledge graph and retrieve the most informative context-relevant knowledge triples, which are fed to the memory network after the candidate knowledge is embedded into a continuous feature space. Consequently, we integrate the representations of the image and the extracted knowledge into a common space and store the features in a dynamic memory module. Open-domain VQA can then be implemented by interpreting the joint representation under the attention mechanism.
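To make the structure-preserved knowledge embedding concrete, the following PyTorch sketch (our illustration, not the authors' released code) treats a (subject, relation, object) triple as a three-word phrase and feeds its word embeddings through an LSTM; the 300-dimensional embeddings and 512-dimensional hidden state mirror the settings reported in Section 4, while the class name, toy vocabulary and example triple are hypothetical.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Encode a (subject, relation, object) triple as a short phrase.

    Minimal sketch: each element of the triple is mapped to a word
    embedding (e.g., GloVe-initialised) and the three embeddings are run
    through an LSTM; the final hidden state serves as the knowledge
    feature that shares a space with question/answer encodings.
    """

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, triple_ids):
        # triple_ids: LongTensor of shape (batch, 3) holding the word indices
        # of subject, relation ("verb") and object.
        vectors = self.embed(triple_ids)          # (batch, 3, embed_dim)
        _, (h_n, _) = self.lstm(vectors)          # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                     # (batch, hidden_dim)

# Hypothetical usage with a toy vocabulary.
vocab = {"<unk>": 0, "umbrella": 1, "UsedFor": 2, "keeping_dry": 3}
encoder = TripleEncoder(vocab_size=len(vocab))
triple = torch.tensor([[vocab["umbrella"], vocab["UsedFor"], vocab["keeping_dry"]]])
knowledge_feature = encoder(triple)               # shape: (1, 512)
```

Because the question and candidate answers can be encoded by the same kind of recurrent encoder, the resulting knowledge feature lives in a space that is straightforward to fuse with the other textual inputs.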
Figure 3: Example results on the Visual7W dataset for (closed-domain) VQA tasks. Given an image and the corresponding question, we report the answers obtained by our algorithm. Specifically, Pr denotes the predicted probability generated by our full model, and Pr-NoKG is the predicted probability of the ablative KDMN-NoKG model. The predicted choices are marked in bold. The external knowledge triples are also shown when they are retrieved automatically to support the joint reasoning of our method. As can be observed, external knowledge is essential even for conventional VQA tasks; e.g., in the 5th example, it is much easier to infer the place by incorporating external knowledge once a giraffe is recognized.
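Before turning to the experimental settings, the sketch below illustrates, under our own simplifying assumptions, how a DMN+-style episodic memory could iteratively attend over the embedded knowledge slots and distill them into a memory vector; the gating network, the GRU-cell update, and the tensor shapes (20 slots, 2048-dimensional features, 2 hops, matching the hyper-parameters below) are illustrative rather than the exact KDMN formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicKnowledgeMemory(nn.Module):
    """Iteratively attend over knowledge-slot features and update a memory vector.

    Sketch of a DMN+-style update: at each hop, slots are scored against the
    current memory and the question, summarised by soft attention, and the
    summary is folded into the memory with a GRU cell.
    """

    def __init__(self, dim=2048, hops=2):
        super().__init__()
        self.hops = hops
        self.score = nn.Linear(3 * dim, 1)     # scores [slot; memory; question]
        self.update = nn.GRUCell(dim, dim)     # folds the episode into the memory

    def forward(self, slots, question):
        # slots: (batch, num_slots, dim) knowledge embeddings stored in memory
        # question: (batch, dim) query representation
        memory = question                       # initialise the memory with the query
        for _ in range(self.hops):
            mem_exp = memory.unsqueeze(1).expand_as(slots)
            q_exp = question.unsqueeze(1).expand_as(slots)
            logits = self.score(torch.cat([slots, mem_exp, q_exp], dim=-1)).squeeze(-1)
            attn = F.softmax(logits, dim=1)     # (batch, num_slots)
            episode = torch.bmm(attn.unsqueeze(1), slots).squeeze(1)  # (batch, dim)
            memory = self.update(episode, memory)
        return memory                           # distilled knowledge for answering

# Hypothetical shapes: 20 retrieved triples per QA pair, 2048-d features, 2 hops.
module = EpisodicKnowledgeMemory(dim=2048, hops=2)
distilled = module(torch.randn(4, 20, 2048), torch.randn(4, 2048))
```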
We set the dimension of the word embeddings to 300 and the dimension of the LSTM internal states to 512. We use a pre-trained ResNet-101 [12] model to extract image features, and select 20 candidate knowledge triples for each QA pair throughout the experiments. Empirical study demonstrates that this is sufficient for our task, although more knowledge triples are also allowed. The number of dynamic memory update iterations is set to 2, and the dimension of the episodic memory is set to 2048, which is equal to the dimension of the memory slots.

In this paper, we combine the question with each candidate answer to generate a hypothesis, and formulate the multiple-choice VQA problem as a classification task. The correct answer is determined by choosing the hypothesis with the largest probability. In each iteration, we randomly sample a batch of 500 QA pairs and apply the stochastic gradient descent algorithm with a base learning rate of 0.0001 to tune the model parameters. The candidate knowledge is retrieved first, and the other modules are trained in an end-to-end manner.

4.2.1 Comparison Methods

In order to analyze the contribution of each component in our knowledge-enhanced, memory-based model, we ablate our full model as follows:

• KDMN-NoKG: the baseline version of our model. No external knowledge is involved; the other parameters are set the same as in the full model.

• KDMN-NoMem: a version without the memory network. External knowledge triples are used through one-pass soft attention.

• KDMN: our full model. External knowledge triples are incorporated in the dynamic memory network.

We also compare our method with several alternative VQA methods, including (1) LSTM-Att [34], an LSTM model with spatial attention; (2) MemAUG [18], a memory-augmented model for VQA; (3) MCB+Att [8], a model combining multi-modal features by Multimodal Compact Bilinear pooling; and (4) MLAN [33], an advanced multi-level attention model.

4.3. Results and Analysis

In this section, we report the quantitative evaluation along with representative samples of our method, compared with our ablative models and the state-of-the-art methods for both the conventional (closed-domain) VQA task and open-domain VQA.

4.3.1 VQA Task

We report the quantitative accuracy in Table 1 along with sample results in Fig. 3. The overall results demonstrate that our algorithm obtains boosts of different magnitude over the competitors on various kinds of questions, e.g., significant improvements on Who (5.9%) and What (4.9%) questions, and slight boosts on When (1.4%) and How (2.0%) questions. After inspecting the success and failure cases, we found that Who and What questions have a larger diversity of questions and multiple-choice answers compared to the other types, and therefore benefit more from external background knowledge. Note that compared with MemAUG [18], in which a memory mechanism is also adopted, our algorithm still gains a significant improvement, which further confirms our belief that the background knowledge provides critical support.

We further make comprehensive comparisons among our ablative models. To make the comparison fair, all the experiments are implemented on the same basic network structure and share the same hyper-parameters. In general, our KDMN model on average gains 1.6% over the KDMN-NoMem model and 4.0% over the KDMN-NoKG model, which further implies
Figure 4 examples (multiple choices with Pr / Pr-NoKG; ground truth and supporting KG triple listed per question):

Q1: What in this image can be used for eating? 1) meat 0.18/0.13, 2) board 0.01/0.43, 3) bean 0.01/0.18, 4) washing 0.00/0.02. Ground-truth: meat. KG triple: (meat, UsedFor, eating).
Q2: What in this image has handle as a part? 1) door 0.28/0.05, 2) word 0.00/0.06, 3) faucet 0.05/0.03, 4) stone 0.00/0.22. Ground-truth: door. KG triple: (handle, PartOf, door).
Q3: What in this image has the property of expensive? 1) computer 0.98/0.22, 2) cat 0.26/0.35, 3) cars 0.00/0.04, 4) hills 0.00/0.11. Ground-truth: computer. KG triple: (computer, HasProperty, expensive).
Q4: What in this image has four legs? 1) cow 0.93/0.22, 2) ears 0.01/0.37, 3) horses 0.00/0.07, 4) ribbon 0.00/0.06. Ground-truth: cow. KG triple: (cow, HasA, four-legs).
Q5: What in this image is capable of flying? 1) plane 1.00/0.40, 2) light 0.55/0.96, 3) bird 0.01/0.01, 4) edges 0.02/0.36. Ground-truth: plane. KG triple: (plane, CapableOf, flying).
Figure 4: Example results of open-domain visual question answering based on our proposed knowledge-incorporated dynamic memory network. Given an image, we automatically generate an open-domain question-answer pair by considering both the image content and the relevant background knowledge. We report the answers obtained by our algorithm. Specifically, Pr denotes the predicted probability generated by our full model, and Pr-NoKG is the predicted probability of the ablative KDMN-NoKG model. The results demonstrate that external knowledge plays an essential role in answering open-domain questions; e.g., without it a system is incapable of inferring the food in the 1st example or the prices of objects in the 3rd example.
the effectiveness of dynamic memory networks in exploiting external knowledge. Through iterative attention processes, the episodic memory vector captures background knowledge distilled from the external knowledge embeddings. The KDMN-NoMem model gains 2.4% over the KDMN-NoKG model, which implies that the incorporated external knowledge brings an additional advantage and acts as supplementary information for predicting the final answer. The indicative examples in Fig. 3 also demonstrate the impact of external knowledge, such as the 4th example, "Why is the light red?", where it is helpful to retrieve the function of traffic lights from the external knowledge effectively.

Table 1: Accuracy on the Visual7W dataset.

Methods         What   Where  When   Who    Why    How    Average
LSTM-Att [34]   51.5   57.0   75.0   59.5   55.5   49.8   54.3
MCB+Att [8]     60.3   70.4   79.5   69.2   58.2   51.1   62.2
MemAUG [18]     62.2   68.9   76.8   66.4   57.8   52.9   62.8
MLAN [33]       60.5   71.2   79.6   69.4   58.0   50.8   62.4
KDMN-NoKG       59.7   69.6   79.9   68.0   61.6   51.3   62.0
KDMN-NoMem      62.1   71.5   81.1   72.5   62.9   54.0   64.4
KDMN            64.6   73.1   81.3   73.9   64.1   53.3   66.0
Ensemble        67.9   77.0   83.3   77.2   69.0   56.8   69.4

4.3.2 Open-Domain VQA

In this section, we report the quantitative performance of open-domain VQA in Table 2 along with the sample results in Fig. 4. Since most of the alternative methods do not provide results in the open-domain scenario, we make comprehensive comparisons with our ablative models. As expected, we observe a significant improvement (12.7%) of our full KDMN model over the KDMN-NoKG model, where 6.8% is attributed to the involvement of external knowledge and 5.9% to the usage of the memory network. The examples in Fig. 4 further provide some intuitive understanding of our algorithm. It is difficult or even impossible for a system to answer an open-domain question when comprehensive reasoning beyond the image content is required; e.g., background knowledge about the prices of objects is essential for a machine when inferring which ones are expensive. The larger performance improvement on the open-domain dataset supports our belief that background knowledge is essential for answering general visual questions. Note that the performance can be further improved if ensembling is allowed: we fused the results of several KDMN models trained from different initializations, and experiments demonstrate that this yields a further improvement of about 3.1%.

Table 2: Accuracy on our generated open-domain dataset.

Methods        Accuracy
KDMN-NoKG      45.1
KDMN-NoMem     51.9
KDMN           57.8
Ensemble       60.9

5. Conclusion

In this paper, we proposed a novel framework named Knowledge-incorporated Dynamic Memory Network (KDMN) to answer open-domain visual questions by harnessing massive external knowledge in a dynamic memory network. Context-relevant external knowledge triples are retrieved and embedded into memory slots, and then distilled through a dynamic memory network to jointly infer the final answer together with the visual features. The proposed pipeline not only maintains the superiority of DNN-based methods, but also acquires the ability to exploit external knowledge for answering open-domain visual questions. Extensive experiments demonstrate that our method achieves competitive results on a public large-scale dataset, and gains a large improvement on our generated open-domain dataset.
References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735, 2007.
[4] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
[5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[6] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
[7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[9] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
[10] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
[11] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
[17] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.
[18] C. Ma, C. Shen, A. Dick, and A. van den Hengel. Visual question answering with memory-augmented networks. arXiv preprint arXiv:1707.04968, 2017.
[19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9, 2015.
[20] B. Pan, H. Li, Z. Zhao, B. Cao, D. Cai, and X. He. MEMEN: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098, 2017.
[21] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[22] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. Advances in Neural Information Processing Systems, 1(2):5, 2015.
[23] R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686, 2012.
[24] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.
[25] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[26] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
[27] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[28] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.
[29] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
[30] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pages 2397–2406, 2016.
[31] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[32] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[33] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
6. Supplementary Material
6.1. Details of our Open-domain Dataset Generation
We obey several principles when building the open-domain VQA dataset for evaluation: (1) the question-answer pairs should be generated automatically; (2) both visual information and external knowledge should be required to answer the generated open-domain visual questions; (3) the dataset should use a multiple-choice setting, in accordance with the Visual7W dataset, for fair comparison.
The open-domain question-answer pairs are generated based on a subset of images in the Visual7W [34] standard test split, so that the test images are not seen during the training stage. For a particular image for which we need to generate open-domain question-answer pairs, we first extract several prominent visual objects and randomly select one. After being linked to a semantic entity in ConceptNet [23], the selected visual object connects to other entities in ConceptNet through various relations, e.g., UsedFor and CapableOf, forming a number of knowledge triples (head, relation, tail) in which either the head or the tail is the visual object. We then randomly select one knowledge triple and fill it into a relation-specific question-answer template to obtain the question-answer pair. These templates assume that the correct answer satisfies the knowledge requirement as well as appears in the image, as shown in Table 3.
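As a rough illustration of this generation step, the Python sketch below picks a detected visual object, filters its ConceptNet triples by relation and direction, and fills a question template; the TEMPLATES dictionary, the generate_qa helper, and the toy inputs are hypothetical stand-ins (the actual templates are those listed in Table 3).

```python
import random

# Hypothetical stand-ins: in practice the visual objects come from Fast R-CNN
# detections and the triples from ConceptNet queries on the linked entity.
TEMPLATES = {
    ("UsedFor", "visual_is_head"): "what in this image can be used for {other}?",
    ("UsedFor", "visual_is_tail"): "what in this image can {other} be used for?",
    ("CapableOf", "visual_is_head"): "what in this image is capable of {other}?",
}

def generate_qa(detected_objects, conceptnet_triples):
    """Pick one visual object and one of its triples, then fill a template.

    detected_objects: list of visual concepts detected in the image.
    conceptnet_triples: list of (head, relation, tail) triples touching them.
    """
    visual = random.choice(detected_objects)
    # Keep only triples where the chosen object is head or tail and a
    # template exists for that relation/direction.
    candidates = []
    for head, rel, tail in conceptnet_triples:
        if head == visual and (rel, "visual_is_head") in TEMPLATES:
            candidates.append((rel, "visual_is_head", tail))
        elif tail == visual and (rel, "visual_is_tail") in TEMPLATES:
            candidates.append((rel, "visual_is_tail", head))
    if not candidates:
        return None
    rel, direction, other = random.choice(candidates)
    question = TEMPLATES[(rel, direction)].format(other=other)
    return question, visual   # the visual object is the ground-truth answer

qa = generate_qa(["umbrella"], [("umbrella", "UsedFor", "keeping dry")])
```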
For each open-domain question-answer pair, we generate three additional confusing items as candidate answers. These candidate answers are randomly sampled from a collection of answers, which is composed of the answers of other question-answer pairs belonging to the same relation type. In order to make the open-domain dataset more challenging, we selectively sample confusing answers that either satisfy the knowledge requirement or appear in the image, but do not satisfy both, as the ground-truth answers do. Specifically, one of the confusing answers satisfies the knowledge requirement but does not appear in the image, so that the model must attend to the visual objects in the image; another appears in the image but does not satisfy the knowledge requirement, so that the model must reason over external knowledge to answer these open-domain questions. Please see the examples in Figure 5.
In total, we generate 16,850 open-domain question-answer pairs based on 8,425 images in the Visual7W test split.
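The distractor construction described above can be sketched as follows; the satisfies_knowledge and appears_in_image predicates are hypothetical placeholders for the ConceptNet check and the detected-object check, and the sampling policy is a simplified reading of the procedure rather than the exact implementation.

```python
import random

def sample_distractors(ground_truth, same_relation_answers,
                       satisfies_knowledge, appears_in_image):
    """Build three confusing candidates for one open-domain QA pair.

    same_relation_answers: answers drawn from other QA pairs of the same
        relation type.
    satisfies_knowledge / appears_in_image: predicates standing in for the
        ConceptNet check and the visual-object check.
    """
    pool = [a for a in same_relation_answers if a != ground_truth]
    knowledge_only = [a for a in pool
                      if satisfies_knowledge(a) and not appears_in_image(a)]
    image_only = [a for a in pool
                  if appears_in_image(a) and not satisfies_knowledge(a)]
    distractors = []
    if knowledge_only:
        distractors.append(random.choice(knowledge_only))  # forces visual grounding
    if image_only:
        distractors.append(random.choice(image_only))      # forces knowledge reasoning
    remaining = [a for a in pool if a not in distractors]
    while len(distractors) < 3 and remaining:
        distractors.append(remaining.pop(random.randrange(len(remaining))))
    return distractors
```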
KG triple                           QA template
({visual}, UsedFor, {other})        what in this image can be used for {other}?
({other}, UsedFor, {visual})        what in this image can {other} be used for?
({visual}, PartOf, {other})         what in this image is a part of {other}?
({other}, PartOf, {visual})         what in this image has {other} as a part?
({visual}, HasProperty, {other})    what in this image has the property of {other}?
({other}, HasProperty, {visual})    what property does the {other} in this image have?
({visual}, HasA, {other})           what in this image has {other}?
({other}, HasA, {visual})           what in this image belongs to {other}?
({visual}, CapableOf, {other})      what in this image is capable of {other}?
({other}, CapableOf, {visual})      what in this image is {other} capable of?
Table 3: Templates for generating open-domain question-answer pairs. {visual} is the KG entity representing the visual object; {other} is the KG entity that has a connection with {visual}. We take {visual} as the generated ground-truth answer.
Figure 5 examples (question and candidate answers):

Q1: What in this image can be used for light? 1) candle, 2) cake, 3) sun, 4) shelf.
Q2: What in this image has window as a part? 1) building, 2) sign, 3) bus, 4) tomato.
Q3: What in this image has the property of cute? 1) children, 2) car, 3) puppy, 4) animals.
Q4: What in this image has two wheels? 1) motorcycle, 2) handle, 3) bike, 4) strings.
Q5: What in this image is capable of hitting balls? 1) batter, 2) catcher, 3) baseball player, 4) edges.
Figure 5: Examples from our generated open-domain dataset. We mark the ground-truth answers in green. The KG triples at the bottom merely provide insight into the generation process and are not included in the dataset. The candidate answers can be quite confusing for some questions; e.g., in the 1st example, the ground-truth "candle" appears in the image and can be used for light, while "cake" also appears in the image but cannot be used for light, and "sun" can be used for light but does not appear in the image.