Visual Question Generation in Bengali
Abstract
The task of Visual Question Generation (VQG) is to generate human-like questions relevant to the given image. As VQG is an emerging re-
We will use the terms "answer category" and "category" interchangeably throughout the paper. In our work, we use the answer categories from Krishna et al. (2019), which take one of 16 categorical values to indicate the type of question asked. For example, if our model targets answers of the "roK (color)" category, then it should generate a question such as "ETa ik roer bas? (What color is the bus?)".

Our baseline is an image-only model with no additional textual information such as the answer or the category. We present two further variants, both of which share the same architecture but take different inputs during training. We feed two kinds of textual information to our model during training. The first variant, image-ans-cat, feeds the concatenation of the ground-truth answer and the category to the encoder, and this textual input is further concatenated with the image features. The second variant, image-cat, takes only the relevant answer category as input to the encoder. In both versions, the input image is reconstructed in order to maximize the information shared between the image and the encoded outputs.

Vocabulary: We construct the vocabulary from all textual utterances: questions, answers and answer categories. Our vocabulary has a total of 7081 entries, including the special tokens, and we use word-level tokenization. We set a default length of 20 tokens for each question and 5 tokens for each answer. As Table 1 shows, the maximum question length is 22 tokens in our training set and 21 tokens in the validation set, so we choose 20 as the default length. Questions longer than the default are truncated and shorter ones are padded with a special <pad> token.
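To make the vocabulary construction and fixed-length padding described above concrete, the following is a minimal Python sketch; the function names and special-token ids are our own illustration, not the authors' published code.

```python
# Minimal sketch of word-level vocabulary construction and fixed-length padding.
# All names here are illustrative; the paper does not publish this code.
PAD, START, UNK = "<pad>", "<start>", "<unk>"

def build_vocab(utterances):
    """Collect every word from questions, answers and answer categories."""
    vocab = {PAD: 0, START: 1, UNK: 2}
    for text in utterances:
        for word in text.split():          # word-level tokenization
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len=20):
    """Map words to ids, truncate to max_len and pad with <pad>."""
    ids = [vocab.get(w, vocab[UNK]) for w in text.split()][:max_len]
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids

# Example: questions use max_len=20, answers would use max_len=5.
vocab = build_vocab(["ETa ik roer bas", "lal", "color"])
print(encode("ETa ik roer bas", vocab, max_len=20))
```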
Image Encoder: Given an image ĩ, we extract image features f ∈ R^{B×300}, where B is the batch size. Our image encoder is a pretrained ResNet-18, an 18-layer-deep residual convolutional neural network (He et al., 2016). Once these features are obtained, they are passed through a fully connected layer followed by a batch normalization layer. Specifically, given f from image ĩ: i = BatchNorm(f) ∈ R^{B×300}.
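As a rough illustration of the image encoder just described (pretrained ResNet-18 features followed by a fully connected layer and batch normalization), a PyTorch sketch could look as follows; the 300-dimensional projection follows the paper, while the exact layer slicing is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """Sketch: pretrained ResNet-18 features -> FC -> BatchNorm, as described above."""
    def __init__(self, out_dim=300):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head; keep the 512-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(512, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, images):                 # images: (B, 3, 224, 224)
        f = self.backbone(images).flatten(1)   # (B, 512) pooled ResNet features
        return self.bn(self.fc(f))             # i = BatchNorm(FC(f)) in R^{B x 300}

i = ImageEncoder()(torch.randn(2, 3, 224, 224))
print(i.shape)  # torch.Size([2, 300])
```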
Encoder: We build a Transformer encoder (Vaswani et al., 2017) and use pretrained Bengali GloVe (Global Vectors for Word Representation) word vectors (Sarkar, 2019) as the embedding layer of the text encoder. We then provide the answer or the answer category, together with the image features f, as input to the text encoder. Note that the image-cat variant only takes the answer category c as its input during training, while image-ans-cat takes the concatenated answer and category [a; c] (the ";" operator denotes concatenation), as shown in Figure 2.
For the image-ans-cat variant, the concatenated answer and category [a; c] is passed through the embedding layer and projected as the context C_{img+ans+cat} = embedding([a; c]) ∈ R^{B×T×300}, where B is the batch size and T is the length of [a; c]. For the image-cat variant, we only pass the category c and similarly generate a context C_{img+cat} = embedding(c) ∈ R^{B×T×300}, where T is the length of c.

Additionally, we generate padding masks on the answer and category, [a; c]_m = generate_mask([a; c]) ∈ R^{B×1×T}, to prevent <pad> tokens from being processed by the encoder as well as the decoder. The same operation is performed on the category input c to obtain a masked category c_m. The image-cat model takes the context C_{img+cat} and the masked category c_m as input to the encoder, which produces the textual feature representation S = encoder(C_{img+cat}, c_m) ∈ R^{B×T×300}. We follow the same procedure for the image-ans-cat model, where the encoder instead takes the context C_{img+ans+cat} and the masked concatenated answer and category [a; c]_m.

These textual feature representations S from the encoder are then concatenated with the input image features i ∈ R^{B×300}, so that our final encoder output is the concatenation (";" operator) of the textual and vision modalities: X = [S; i] ∈ R^{B×T×300}, where B is the batch size and T is the length of S.
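The text-encoder path described above (GloVe-initialized embeddings, a padding mask, a Transformer encoder producing S, and concatenation with the image features to form X = [S; i]) can be sketched in PyTorch as below; the number of layers and attention heads are assumptions, and the image features are appended as one extra position along the sequence dimension.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the encoder path: embedding -> Transformer encoder -> concat image features."""
    def __init__(self, vocab_size, pad_id=0, d_model=300):
        super().__init__()
        # In the paper the weights are initialized from Bengali GloVe vectors (Sarkar, 2019).
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # depth is an assumption
        self.pad_id = pad_id

    def forward(self, tokens, image_feats):
        # tokens: (B, T) category c or concatenated [a; c]; image_feats: (B, 300)
        mask = tokens.eq(self.pad_id)                           # True at <pad> positions
        context = self.embedding(tokens)                        # C in R^{B x T x 300}
        S = self.encoder(context, src_key_padding_mask=mask)    # (B, T, 300)
        X = torch.cat([S, image_feats.unsqueeze(1)], dim=1)     # X = [S; i], here (B, T+1, 300)
        return X, mask

X, mask = TextEncoder(vocab_size=7081)(torch.randint(1, 7081, (2, 5)), torch.randn(2, 300))
print(X.shape)  # torch.Size([2, 6, 300])
```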
Decoder: Our decoder is a Transformer decoder that also uses GloVe embeddings. Following sequence-to-sequence causal decoding practice, the decoder receives the encoder outputs from the text encoder and the ground-truth questions during training. We initially extract the <start> (start-of-sequence) token from the encoder outputs, which is then moved to the GPU, and each target question is concatenated with this <start> token, forming a tensor.

In the decoder we follow similar steps as in the text encoder. We take the ground-truth questions q and generate the target context C_q ∈ R^{B×T×300} and the question masks q_m ∈ R^{B×1×300}. Before passing the target context C_q to the decoder, we concatenate it with the same image features i that were previously passed as input to the encoder. The final target context can be denoted by Q = [C_q; i] ∈ R^{B×T×300}. Finally, the decoder takes, as a tuple, the encoder outputs X from the text encoder, the concatenated target context Q, the source mask ([a; c]_m or c_m, depending on the model variant) and the target question mask q_m. Our decoder is represented as follows:

q̂ = Decoder(X, Q)    (1)

where the decoder outputs a generated question q̂.
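To illustrate Equation 1, the sketch below builds the target context Q = [C_q; i], applies a causal mask, and lets a Transformer decoder attend to the encoder outputs X; depths, head counts and masking details are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class QuestionDecoder(nn.Module):
    """Sketch of Equation 1: q_hat = Decoder(X, Q), a Transformer decoder over GloVe embeddings."""
    def __init__(self, vocab_size, pad_id=0, d_model=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)  # depth is an assumption
        self.out = nn.Linear(d_model, vocab_size)                  # projects back to the vocabulary

    def forward(self, question_ids, image_feats, X):
        # question_ids: (B, T) <start>-prefixed ground-truth question; image_feats: (B, 300)
        C_q = self.embedding(question_ids)                          # target context (B, T, 300)
        Q = torch.cat([C_q, image_feats.unsqueeze(1)], dim=1)       # Q = [C_q; i]
        T = Q.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)  # prevents peeking ahead
        H = self.decoder(tgt=Q, memory=X, tgt_mask=causal)          # attends to encoder outputs X
        return self.out(H)                                          # logits for q_hat

logits = QuestionDecoder(vocab_size=7081)(
    torch.randint(1, 7081, (2, 21)), torch.randn(2, 300), torch.randn(2, 6, 300))
print(logits.shape)  # torch.Size([2, 22, 7081])
```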
4 Experiments

4.1 Datasets

To collect all relevant information for the VQG task in Bengali, we use the VQA v2.0 dataset (Antol et al., 2015), which consists of 443.8K questions over 82.8K images in the training set and 214.4K questions over 40.5K images in the validation set. From the annotations of previous work (Krishna et al., 2019), 16 categories were derived from the top 500 answers, which cover around 82% of the total VQA v2.0 dataset (Antol et al., 2015). The annotated categories include objects (e.g. "ibal (cat)", "ful (flower)"), attributes (e.g. "Fa«Da (cold)", "puraton (old)"), colors (e.g. "lal (red)", "badam (brown)"), and so on.

                                     Train     Val
Number of Questions                 184100    124795
Number of Images                     40800     28336
Max Length of Question (by words)       22        21
Min Length of Question (by words)        1         1
Avg Length of Question (by words)        4         4

Table 1: Analysis of the dataset.

Previously, in Bengali machine translation research (Hasan et al., 2020), Google Translate was found to be competitive with machine translation models trained on Bengali corpora. In another work, on Bengali question answering (Mayeesha et al., 2021), a synthetic dataset translated with Google Translate was again used to build Bengali question answering models. Because Bengali is a low-resource language, no VQG dataset has been available, so we translated the VQA v2.0 dataset (Antol et al., 2015) with Google Translate, following these previous works. We maintained the same partitioning as the original dataset. Due to computational constraints we translated a smaller subset of the training and validation sets: the initial 220K questions and answers for training and 150K questions and answers for validation, using the GoogleTrans library. As Table 1 shows, out of the 220K training and 150K validation questions, 184K training and 124K validation questions were used. This is because these questions map to the top 500 answers in the dataset; we could not use questions and answers that had no mapping to the 16 categories. Figure 3 shows samples from our dataset. The 16 categories in our dataset are, in English: "activity", "animal", "attribute", "binary", "color", "count", "food", "location", "material", "object", "other", "predicate", "shape", "spatial", "stuff" and "time".
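A minimal sketch of the translation step with the GoogleTrans library is shown below; the batching, rate limiting and error handling used for the actual 220K/150K translation run are not published, so this only indicates the assumed workflow.

```python
# Sketch of translating VQA v2.0 questions/answers into Bengali with the
# googletrans package (e.g. pip install googletrans==4.0.0rc1). The batching and
# retry logic used in the paper are not published; this is an assumed workflow.
from googletrans import Translator

translator = Translator()

def translate_to_bengali(texts):
    """Translate a list of English strings to Bengali (language code 'bn')."""
    return [translator.translate(t, src="en", dest="bn").text for t in texts]

english_questions = ["What color is the bus?", "How many cats are there?"]
print(translate_to_bengali(english_questions))
```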
Our decoder takes the concatenated form of the target context Q together with the encoder outputs X and generates the predicted question q̂, as shown in Equation 1. During training, we optimize L_q between the predicted question q̂ and the target question q. Additionally, we try to reconstruct the input image from the encoded output X and minimize the l2 loss between the reconstructed image features i_r and the input image features i, in order to maximize the mutual information between the input image features and the encoder outputs, as given in Equation 2.

L_q = CrossEntropy(q̂, q)
L_i = ||i − i_r||_2    (2)
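Equation 2 can be sketched in PyTorch as follows; the reconstruction head that produces i_r from the encoder outputs and the weighting between the two terms are assumptions, since they are not specified in the text above.

```python
import torch
import torch.nn as nn

# Sketch of Equation 2. `logits` are the decoder outputs for q_hat, `target_ids`
# the ground-truth question, `i` the input image features and `i_rec` the image
# features reconstructed from the encoder outputs X (reconstruction head assumed).
ce = nn.CrossEntropyLoss(ignore_index=0)        # ignore <pad> positions (pad id 0 assumed)

def vqg_loss(logits, target_ids, i, i_rec, recon_weight=1.0):
    L_q = ce(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    L_i = torch.norm(i - i_rec, p=2, dim=-1).mean()   # l2 distance per sample
    return L_q + recon_weight * L_i                   # weighting is an assumption

loss = vqg_loss(torch.randn(2, 20, 7081), torch.randint(1, 7081, (2, 20)),
                torch.randn(2, 300), torch.randn(2, 300))
print(loss.item())
```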
While our scores are lower than those reported for English, we train on a smaller, translated dataset because of computational and data-annotation constraints. Based on the quantitative results, we can conclude that categorical information gives better results overall. In the next section, we present the qualitative results, where we will see that categorical information conditions the image-cat variant to generate category-specific questions, i.e., goal-driven, attribute-specific questions rather than generic ones.

5.2 Qualitative Results

In Figure 4, we compare the questions generated by our model variants with the reference ground-truth question and answer category in more detail. Questions generated by the image-ans-cat model, although grammatically and semantically correct, are in some cases not conditioned on the given category. For example, for image 82846 the generated question is grammatically correct but does not follow the given category, which is "count". We see similar behavior for images 349926 and 82259, where the questions are grammatically correct and relevant to the image but do not follow the category. In contrast, the image-cat model conditions its questions on the given category: its questions are not only grammatically and semantically valid but also follow the given categorical information, producing goal-driven, non-generic and category-oriented questions. The reason this variant performs well despite having less side information during training is likely that, at validation time, both variants only receive category side information; therefore image-cat learns this setting better than image-ans-cat.

Additionally, we notice that both variants are able to decode the semantic information of the input image as well: both can correctly identify the objects and features present in the images.

5.3 Human Evaluation

We conducted a human evaluation to assess the quality of the generated questions, similar to the evaluation in Vedd et al. (2022). In our experiments, we ask three annotators to evaluate the generated questions using two evaluation questions. There was no annotator overlap, i.e., no two annotators annotated the same question. We evaluate category-wise question generation by comparing our two model variants, image-cat and image-ans-cat.

In Experiment 1, a Visual Turing Test, we present annotators with an image, a ground-truth question and a model-generated question. The annotators' task is to discern which of the two questions they think was produced by the model. Experiment 2 shows annotators an image along with a question generated by the model, and they are asked to decide whether the generated question seems relevant to the given image. For each experiment we annotate 40 generations per model, resulting in 80 annotations per experiment. The complete results of our evaluation are listed in Table 3.

In Experiment 1, our image-cat model outperforms the image-ans-cat variant, fooling humans about 47.5% of the time. In a Visual Turing Test, a model capable of generating human-like questions is expected to reach approximately 50%. Although only close to this desired score, the image-cat variant represents a promising step towards passing the Visual Turing Test. We evaluate Experiment 2 on both model variants; here the image-ans-cat model shows a score of 37.5%, outperforming the image-cat model. It is possible that providing the answer together with the image and the category helps in generating more relevant questions.
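For clarity, the sketch below shows how the two scores reported above (e.g. the 47.5% rate in Experiment 1) could be computed from raw annotations; the annotation format is hypothetical, not the authors' actual files.

```python
# Sketch of scoring the two human-evaluation experiments described above.

def visual_turing_score(guessed_correctly):
    """Experiment 1: fraction of items where the annotator failed to spot the model,
    i.e. the model 'fooled' the human. 50% would match random guessing."""
    fooled = [not g for g in guessed_correctly]
    return 100.0 * sum(fooled) / len(fooled)

def relevance_score(judged_relevant):
    """Experiment 2: fraction of generated questions judged relevant to the image."""
    return 100.0 * sum(judged_relevant) / len(judged_relevant)

# 40 annotations per model per experiment, as described above.
example = [True] * 21 + [False] * 19          # 19 of 40 annotators fooled -> 47.5%
print(visual_turing_score(example))           # 47.5
```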
6 Conclusion

We proposed the first VQG work in Bengali and presented a novel transformer-based encoder-decoder architecture that generates questions in Bengali when shown an image and a given answer category.
In our work, we presented two variants of our architecture, image-cat and image-ans-cat, which differ in the input they receive during training. Both variants generate a question from an image, guided by the answer category. However, owing to the two different input combinations, image-cat performs marginally better in terms of quantitative scores and generates goal-driven, specific questions conditioned on the categorical information it receives. In contrast, the image-ans-cat model, although it generates grammatically valid questions, fails to learn about answer categories. Future work could analyze the impact of using more modern CNN architectures and newer pretrained models to generate questions from images.
References

Tanjim Taharat Aurpa, Richita Khandakar Rifat, Md Shoaib Ahmed, Md. Musfique Anwar, and A. B. M. Shawkat Ali. 2022. Reading comprehension based question answering system in Bangla language with transformer-based learning. Heliyon, 8(10):e11052.

Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323, Belgium, Brussels. Association for Computational Linguistics.

Remi Cadene, Corentin Dancette, Hedi Ben younes, Matthieu Cord, and Devi Parikh. 2019. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota. Association for Computational Linguistics.

Shaoxiang Chen, Ting Yao, and Yu-Gang Jiang. 2019. Deep learning for video captioning: A review. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6283–6290. International Joint Conferences on Artificial Intelligence Organization.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2296–2304, Cambridge, MA, USA. MIT Press.

Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian, and Min Zhang. 2022. Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view. Trans. Img. Proc., 31:227–238.

Deepak Gupta, Pabitra Lenka, Asif Ekbal, and Pushpak Bhattacharyya. 2020. A unified framework for multilingual and code-mixed visual question answering. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 900–913, Suzhou, China. Association for Computational Linguistics.

Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, and Rifat Shahriyar. 2020. Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2612–2623, Online. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

S M Shahriar Islam, Riyad Ahsan Auntor, Minhajul Islam, Mohammad Yousuf Hossain Anik, A. B. M. Alim Al Islam, and Jannatun Noor. 2022. Note: Towards devising an efficient VQA in the Bengali language. In ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS), COMPASS '22, pages 632–637, New York, NY, USA. Association for Computing Machinery.

U. Jain, Z. Zhang, and A. Schwing. 2017. Creativity: Generating diverse questions using variational autoencoders. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5415–5424, Los Alamitos, CA, USA. IEEE Computer Society.

H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen. 2020. In defense of grid features for visual question answering. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10264–10273, Los Alamitos, CA, USA. IEEE Computer Society.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Zitnick, and R. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997, Los Alamitos, CA, USA. IEEE Computer Society.

Andrej Karpathy and Li Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676.

R. Krishna, M. Bernstein, and L. Fei-Fei. 2019. Information maximizing visual question generation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2008–2018, Los Alamitos, CA, USA. IEEE Computer Society.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Tasmiah Tahsin Mayeesha, Abdullah Md Sarwar, and Rashedur M. Rahman. 2021. Deep learning based question answering system in Bengali. Journal of Information and Telecommunication, 5(2):145–178.

Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813, Berlin, Germany. Association for Computational Linguistics.

Liangming Pan, Wenqiang Lei, Tat-Seng Chua, and Min-Yen Kan. 2019. Recent advances in neural question generation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Gao Peng, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, and Hongsheng Li. 2019. Multi-modality latent interaction network for visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5824–5834.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Mahamudul Hasan Rafi, Shifat Islam, S. M. Hasan Imtiaz Labib, SM Sajid Hasan, Faisal Muhammad Shah, and Sifat Ahmed. 2022. A deep learning-based Bengali visual question answering system. In 2022 25th International Conference on Computer and Information Technology (ICCIT), pages 114–119.

Mengye Ren, Ryan Kiros, and Richard S. Zemel. 2015. Exploring models and data for image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2953–2961, Cambridge, MA, USA. MIT Press.

Sagor Sarkar. 2019. https://github.com/sagorbrur/glove-bengali.

Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. 2020. What BERT sees: Cross-modal transfer for visual question generation. In International Conference on Natural Language Generation.

Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 588–598, Berlin, Germany. Association for Computational Linguistics.

Nobuyuki Shimizu, Na Rong, and Takashi Miyazaki. 2018. Visual question answering dataset for bilingual image understanding: A study of cross-lingual transfer using attention maps. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1918–1928, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 2048–2057. JMLR.org.

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, and Jiawan Zhang. 2017. Automatic generation of grounded visual questions. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, pages 4235–4243, United States of America. Association for the Advancement of Artificial Intelligence (AAAI).