
Visual Question Generation in Bengali

Mahmud Hasan, Labiba Islam, Jannatul Ferdous Ruma


Tasmiah Tahsin Mayeesha and Rashedur M. Rahman
North South University
{mahmud.hasan03,labiba.islam,jannatul.ruma,
tasmiah.tahsin,rashedur.rahman}@northsouth.edu

Abstract
The task of Visual Question Generation (VQG) is to generate human-like questions relevant to a given image. As VQG is an emerging research field, existing works tend to focus only on resource-rich languages such as English due to the availability of datasets. In this paper, we propose the first Bengali Visual Question Generation task and develop a novel transformer-based encoder-decoder architecture that generates questions in Bengali when given an image. We propose multiple variants of models: (i) image-only, a baseline model that generates questions from images without additional information, and (ii) image-category and image-answer-category, guided VQG where we condition the model to generate questions based on the answer and the category of the expected question. These models are trained and evaluated on the translated VQAv2.0 dataset. Our quantitative and qualitative results establish the first state-of-the-art models for the VQG task in Bengali and demonstrate that our models are capable of generating grammatically correct and relevant questions. Our quantitative results show that our image-cat model achieves a BLEU-1 score of 33.12 and a BLEU-3 score of 7.56, the highest among our variants. We also perform a human evaluation to assess the quality of the generated questions. The human evaluation suggests that the image-cat model is capable of generating goal-driven and attribute-specific questions that stay relevant to the corresponding image.

Figure 1: Examples of Bengali VQG predictions with category of answers as additional information.

1 Introduction

Visual Question Generation (VQG) is an emerging research field in both Computer Vision and Natural Language Processing. The task of VQG simply uses an image and other side information (e.g. answers or answer categories) as input and generates meaningful questions related to the image. Tasks like cross-modal Visual Question Answering (VQA) (Antol et al., 2015; Cadene et al., 2019; Peng et al., 2019; Jiang et al., 2020; Guo et al., 2022), Video Captioning (VC) (Chen et al., 2019), Image Captioning (IC) (Vinyals et al., 2015; Karpathy and Fei-Fei, 2017; Xu et al., 2015), and Multimodal Machine Translation (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018; Caglayan et al., 2019) are among the recent advances in the AI community. While the majority of visuo-lingual tasks tend to focus on VQA, a few recent approaches have been proposed focusing on the under-researched multi-modal task of VQG. VQG is a more creative and particularly challenging problem than VQA, because the generated questions need to be relevant, semantically coherent and comprehensible with respect to the diverse contents of the given image.

Existing studies on Visual Question Generation (VQG) have primarily focused on languages that have ample resources, such as English. While some VQA research has been conducted in low-resource languages like Hindi (Gupta et al., 2020), Bengali (Islam et al., 2022), Japanese (Shimizu et al., 2018), and Chinese (Gao et al., 2015), limitations have been identified specifically in the context of the Bengali language. While Bengali has some recent work on reading-comprehension-based question answering (Mayeesha et al., 2021; Aurpa et al., 2022) and visual question answering (Islam et al., 2022; Rafi et al., 2022), no research has been conducted on the VQG task specifically in Bengali.
To obtain meaningful questions, some VQG methods have augmented the input with additional information such as answer categories, objects in the image and expected answers (Pan et al., 2019; Krishna et al., 2019; Vedd et al., 2022). Pan et al. (2019) used the ground truth answer with the image as input, underscoring it to be an effective approach to produce non-generic questions. Krishna et al. (2019) stated that knowing the answers beforehand simply defeats the purpose of generating realistic questions, since the main purpose of generating a question is to attain an answer. Instead, they introduced a variational auto-encoder model, which uses the concept of latent space, providing answer categories to generate relevant questions. Vedd et al. (2022) recently proposed a guiding approach with three variant families that conditions the generative process to focus on specific chosen properties of the input image for generating questions. Inspired by previous work, we also use additional information such as answers and answer categories in our experiments. To summarize, the main contributions of our paper are the following:

• We introduce the first visual question generation system that leverages the power of a Transformer-based encoder-decoder architecture for the low-resource Bengali language.

• We conduct experiments with multiple variants, considering only the image as well as additional information as input, such as answers and answer categories.

• We evaluate our novel VQG system with well-established text generation evaluation metrics and report our results as the state of the art in Visual Question Generation in Bengali.

• We perform a human evaluation on our generations to assess the quality and the relevance of the questions.

2 Related Works

The advent of visual understanding has been made possible due to continuous research in question answering and the availability of large-scale Visual Question Answering (VQA) datasets (Antol et al., 2015; Johnson et al., 2017; Mostafazadeh et al., 2017). In the past few years, many methods have been proposed to increase model performance on the VQG task. Earlier studies (Xu et al., 2015; Jain et al., 2017; Mostafazadeh et al., 2016; Serban et al., 2016; Vijayakumar et al., 2018; Ren et al., 2015) explored the task of visual question generation through Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN), and Variational Auto-Encoders (VAE), following either algorithmic rule-based or learning-based approaches.

In the visual-language domain, the first VQG paper, proposed by Mostafazadeh et al. (2016), introduced question-response generation that takes meaningful conversational dialogues as input to generate relevant questions. Zhang et al. (2017) used an LSTM-based encoder-decoder model that automates the generation of meaningful questions with highly diverse question types. Motivated by the discriminator setting in GANs, Fan et al. (2018) formulated a visual natural question generation task that learns two non-generic textual characteristics from the perspective of content and linguistics, producing non-deterministic and diverse outputs. Jain et al. (2017), in contrast, followed the VAE paradigm along with LSTM networks instead of GANs to generate a large set of diverse questions given an image-only input. During inference, however, their results required the use of ground truth answers. To avoid this non-viable scenario, Krishna et al. (2019) proposed a VAE model that uses the concept of a latent variable and requires information from the target, i.e. answer categories, as input with the image during inference. Similarly, Vedd et al. (2022) follow the concept of a latent variable; however, their proposed model architecture explores VQG from the perspective of guiding, which involves an explicit and two types of implicit guiding approaches. Our work is closely related to their explicit guiding method, excluding the use of latent space. Recently, Scialom et al. (2020) proposed a BERT-gen model which is capable of generating text in either mono- or multi-modal settings from out-of-the-box pre-trained encoders.

3 Methodology

In this section, we introduce our transformer-based Bengali Visual Question Generation models, which can generate meaningful non-generic questions when shown an image along with additional textual information. Our VQG problem is designed as follows: given an image ĩ ∈ I, where I denotes a set of images, decode a question q. For each image ĩ, we also have access to textual utterances, such as the ground truth answer and answer categories.
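
To make this problem setup concrete, the sketch below shows one way a single training example could be organized; the field names, the sample values and the Dataset-style wrapper are illustrative assumptions, not the authors' released code.

    from dataclasses import dataclass

    @dataclass
    class VQGSample:
        """One Bengali VQG training example: an image plus its textual utterances."""
        image_path: str   # path to a VQA v2.0 image
        question: str     # target question q (Bengali, translated from VQA v2.0)
        answer: str       # ground truth answer a (Bengali)
        category: str     # one of the 16 answer categories, e.g. "color", "count"

    # Hypothetical example built from the paper's running "color" example.
    sample = VQGSample(
        image_path="train2014/COCO_train2014_000000082846.jpg",
        question="Eta ki ronger bas?",   # "What color is the bus?"
        answer="lal",                    # "red"
        category="color",
    )
    print(sample.category)
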
Figure 2: Architecture of the Bengali VQG Model. Given an image, we first extract image features using an image encoder (CNN). The concatenated answer and category (image-ans-cat), or only the category (image-cat), is given as input to the text encoder to obtain textual features, which are then concatenated with the image features. This concatenated form of the vision and textual modalities, combined with the target questions, is given as input to the decoder for question generation in Bengali. Finally, we optimize the CE and MSE losses.

Note, we will use the terms "answer category" and "category" interchangeably throughout the paper. In our work, we used the answer categories from Krishna et al. (2019), which take 1 out of 16 categorical values to indicate the type of question asked. For example, if our model wants to elicit answers of the "rong (color)" category, then it should generate a question such as "Eta ki ronger bas? (What color is the bus?)".

Our baseline is an image-only model with no additional textual information such as the answer or category. We present two further variants, both of which share the same architecture but take different inputs during training. We feed two kinds of textual information to our model during training. The first model, image-ans-cat, feeds the concatenated ground truth answer and category to the encoder, and the resulting encoding is concatenated with the image features. The second model, image-cat, takes only the relevant answer category as input to the encoder. In both versions, the input image is reconstructed to maximize the information shared between the image and the encoded outputs.

Vocabulary: We construct the vocabulary considering all the textual utterances: questions, answers and answer categories. Our vocabulary has a total of 7081 entries including the special tokens. We use word-level tokenization. We set a default length of 20 tokens for each question and 5 for each answer. As Table 1 shows, the maximum question length in our training dataset is 22 tokens and in the validation set 21 tokens, so we choose 20 as the default length. Questions longer than the default length are truncated and shorter ones are padded with the special <pad> token.
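
A minimal sketch of this word-level tokenization, truncation and padding scheme is shown below; the helper names and the exact special-token inventory are assumptions rather than the authors' released code.

    from collections import Counter

    PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"
    MAX_Q_LEN, MAX_A_LEN = 20, 5  # default lengths used for questions and answers

    def build_vocab(texts):
        """Word-level vocabulary over questions, answers and answer categories."""
        counter = Counter(tok for t in texts for tok in t.split())
        itos = [PAD, START, END, UNK] + sorted(counter)
        return {tok: i for i, tok in enumerate(itos)}

    def encode(text, vocab, max_len):
        """Truncate to max_len tokens and pad shorter sequences with <pad>."""
        ids = [vocab.get(tok, vocab[UNK]) for tok in text.split()][:max_len]
        return ids + [vocab[PAD]] * (max_len - len(ids))

    # Usage with a toy corpus (romanized here for readability):
    vocab = build_vocab(["eta ki ronger bas", "lal", "color"])
    print(encode("eta ki ronger bas", vocab, MAX_Q_LEN))
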
Image Encoder: Given an image ĩ, we extract image features f ∈ R^(B×300), where B is the batch size. Our image encoder is a pretrained ResNet-18 CNN model, a convolutional neural network with 18 layers (He et al., 2016). Once these features are obtained, they are passed to a fully connected layer followed by a batch normalization layer. Specifically, given f from image ĩ: i = BatchNorm(f) ∈ R^(B×300).
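
The image encoder described above can be sketched in PyTorch roughly as follows; the ResNet-18 backbone, the projection to 300 dimensions and the batch normalization are read off the description, while the module names and the torchvision weights argument are our own assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class ImageEncoder(nn.Module):
        """ResNet-18 backbone -> fully connected layer -> batch norm, giving i in R^(B x 300)."""
        def __init__(self, out_dim: int = 300):
            super().__init__()
            backbone = resnet18(weights="IMAGENET1K_V1")
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
            self.fc = nn.Linear(512, out_dim)   # ResNet-18 pooled features are 512-d
            self.bn = nn.BatchNorm1d(out_dim)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            f = self.cnn(images).flatten(1)     # (B, 512)
            return self.bn(self.fc(f))          # (B, 300) = i

    # i = ImageEncoder()(torch.randn(4, 3, 224, 224))  # -> shape (4, 300)
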
Encoder: We build a Transformer encoder (Vaswani et al., 2017) and use Bengali pretrained GloVe (Global Vectors for Word Representation) word vectors (Sarkar, 2019) as the embedding layer of the text encoder. Next, we provide the answer or answer categories and the image features f as input to the text encoder. Note that the image-cat variant only takes the answer category c as its input during training, while image-ans-cat takes the concatenated version of the answer and category, [a; c] (the ; operator represents concatenation), as seen in Figure 2. For the image-ans-cat variant, the concatenated answer and category [a; c] is passed through the embedding layer and projected out as a context C_img+ans+cat = embedding([a; c]) ∈ R^(B×T×300), where B is the batch size and T is the length of [a; c]. For the image-cat variant, we only pass the category c and similarly generate a context C_img+cat = embedding(c) ∈ R^(B×T×300), where T is the length of c.

Additionally, we generate padding masks on the answer and category, [a; c]_m = generate_mask([a; c]) ∈ R^(B×1×T), so that <pad> tokens are not processed by the encoder or the decoder. The same operation is performed on the category input c, producing a masked category c_m. The image-cat model takes the context C and the masked category c_m as input to the encoder to encode a textual feature representation S = encoder(C_img+cat, c_m) ∈ R^(B×T×300). We follow the same procedure for the image-ans-cat model, where the encoder now takes the context C_img+ans+cat and the masked concatenated answer and category [a; c]_m.

These textual feature representations S from the encoder are then concatenated with the input image features i ∈ R^(B×300), so that our final encoder output is the concatenation (; operator) of the textual and vision modalities: X = [S; i] ∈ R^(B×T×300), where B is the batch size and T is the length of S.
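
A rough PyTorch sketch of the text encoder and the fusion X = [S; i] is given below, assuming the Bengali GloVe vectors are loaded into an nn.Embedding layer and the image features are appended as one extra position along the sequence axis; the class and variable names are ours, not the authors'.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """GloVe embedding + Transformer encoder; returns the fused outputs X = [S; i]."""
        def __init__(self, vocab_size: int, d_model: int = 300, n_heads: int = 4, n_layers: int = 4):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)  # load GloVe weights here
            layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=300, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, tokens: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
            pad_mask = tokens.eq(0)                                    # (B, T), True at <pad> positions
            context = self.embedding(tokens)                           # (B, T, 300)
            s = self.encoder(context, src_key_padding_mask=pad_mask)   # S in R^(B x T x 300)
            # Concatenate the vision modality as an extra position: X = [S; i]
            return torch.cat([s, image_feats.unsqueeze(1)], dim=1)     # (B, T+1, 300)

    # X = TextEncoder(7081)(torch.randint(1, 7081, (4, 6)), torch.randn(4, 300))
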
Decoder: Our decoder is a Transformer decoder that also uses GloVe embeddings. Following sequence-to-sequence causal decoding practices, our decoder receives the encoder outputs from the text encoder and the ground truth questions during training. We initially extract the <start> (start-of-sequence) token from the encoder outputs, which is then moved to the GPU. Each target question is concatenated with a <start> token, forming a tensor.

In our decoder we follow similar steps as in our text encoder. We take the ground truth questions q and generate a target context C_q ∈ R^(B×T×300) and question masks q_m ∈ R^(B×1×300). Before we pass the target context C_q to the decoder, we concatenate it with the same image features i that were passed as input to the encoder previously. The final target context can be denoted by Q = [C_q; i] ∈ R^(B×T×300). Finally, the decoder takes, as a tuple, the encoder outputs X from the text encoder, the concatenated target context Q, the source mask ([a; c]_m or c_m, depending on the model variant) and the target question mask q_m. Our decoder is represented as q̂ = Decoder(X, Q), where the decoder outputs a generated question q̂.

4 Experiments

4.1 Datasets

To collect all relevant information for the VQG task in Bengali, we use the VQA v2.0 dataset (Antol et al., 2015), consisting of 443.8K questions from 82.8K images in the training set and 214.4K questions from 40.5K images in the validation set. From the annotations of previous work (Krishna et al., 2019), 16 categories were derived from the top 500 answers. The top 500 answers cover around 82% of the total VQA v2.0 dataset (Antol et al., 2015). The annotated categories include objects (e.g. "biral (cat)", "ful (flower)"), attributes (e.g. "thanda (cold)", "puraton (old)"), colors (e.g. "lal (red)", "badami (brown)"), etc.

                                      Train     Val
Number of Questions                   184100    124795
Number of Images                      40800     28336
Max Length of Question (by words)     22        21
Min Length of Question (by words)     1         1
Avg Length of Question (by words)     4         4

Table 1: Analysis of the dataset.

Previously, in Bengali machine translation research (Hasan et al., 2020), Google Translate was found to be competitive with machine translation models trained on Bengali corpora. In another work on Bengali question answering (Mayeesha et al., 2021), a synthetic dataset translated by Google Translate was again used for creating Bengali question answering models. Because Bengali is a low-resource language, there was no available VQG dataset, so we translated VQA v2.0 (Antol et al., 2015) with Google Translate, following previous works. We maintained the same partitioning as the original dataset. Due to computational constraints we translated a smaller subset of the training and validation sets: the initial 220K questions and answers for training and 150K questions and answers for validation, using the GoogleTrans library. As shown in Table 1, out of the 220K training and 150K validation questions, 184K training and 124K validation questions were used, because only these questions map to the top 500 answers in the dataset; we could not use questions and answers that had no mapping to the 16 categories. Figure 3 shows samples from our dataset. The 16 categories in our dataset are, in English: "activity", "animal", "attribute", "binary", "color", "count", "food", "location", "material", "object", "other", "predicate", "shape", "spatial", "stuff", "time".

Figure 3: Samples from our dataset.
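
Since the paper reports translating the English VQA v2.0 questions and answers with the GoogleTrans library, a sketch of that preprocessing step might look like the following; batching, error handling and rate limiting are omitted, the exact calls the authors used are not specified, and depending on the installed googletrans version the translate call may need to be awaited.

    from googletrans import Translator  # pip install googletrans

    translator = Translator()

    def translate_to_bengali(texts):
        """Translate a list of English VQA v2.0 questions/answers into Bengali."""
        return [translator.translate(t, src="en", dest="bn").text for t in texts]

    # Example with a hypothetical batch of source questions:
    questions_en = ["What color is the bus?", "How many people are in the photo?"]
    print(translate_to_bengali(questions_en))
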
4.2 Training and Optimization

Our transformer-based encoder-decoder architecture is a variation of the explicit guiding variant established by Vedd et al. (2022), where object labels, image captions and object-detection features were used as guiding information. However, we only use answer categories and answers as additional information in our work. Instead of BERT (Devlin et al., 2019), we use Bengali GloVe embeddings (Pennington et al., 2014; Sarkar, 2019) for encoding text. We use a smaller number of layers and attention heads, and our embedding and hidden state dimensions are also reduced due to computational constraints. Similar to the work of Krishna et al. (2019), we use the concept of the answer category as our primary textual information and attempt to generate questions that are conditioned towards a specific category.

In summary, we begin by passing the image through a Convolutional Neural Network (CNN) to attain a high-dimensional encoded representation of image features, i. The image features are passed through an MLP (multi-layer perceptron) layer to get a vector representation of reconstructed image features, i_r. Our architecture takes an image and additional information, in the form of a concatenated answer and category [a; c] or an answer category c, as input. We feed these inputs to our text encoder, which generates the textual features S and concatenates the textual S with the vision modality representation i. Our decoder takes the concatenated form of the target context Q and the encoder outputs X, and generates the predicted question q̂, as shown in Equation 1.

During training, we optimize L_q between the predicted q̂ and the target question q. Additionally, we try to reconstruct the input image from the encoded output X and minimize the l2 loss between the reconstructed image features i_r and the input image features i, to maximize mutual information between the input image features and the encoder outputs, as given in Equation 2.

q̂ = Decoder([S; i], [C_q; i])    (1)

L_q = CrossEntropy(q̂, q)
L_i = ||i − i_r||²    (2)
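
The two objectives in Equations 1 and 2 can be sketched as follows; the reconstruction MLP, the pooling of the encoder output and the equal weighting of the two terms are stand-in assumptions, since the authors' implementation details are not given here.

    import torch
    import torch.nn as nn

    ce_loss = nn.CrossEntropyLoss(ignore_index=0)   # ignore <pad> positions in the target question
    mse_loss = nn.MSELoss()                         # l2 reconstruction loss on image features

    def vqg_losses(logits, target_q, encoder_out, image_feats, recon_mlp):
        """Compute L_q = CE(q_hat, q) and L_i = ||i - i_r||^2 as in Equation 2."""
        # logits: (B, T, V) decoder outputs; target_q: (B, T) token ids
        l_q = ce_loss(logits.reshape(-1, logits.size(-1)), target_q.reshape(-1))
        i_r = recon_mlp(encoder_out.mean(dim=1))    # reconstruct image features from pooled encoder output X
        l_i = mse_loss(i_r, image_feats)
        return l_q + l_i                            # assumed equal weighting of the two terms

    # recon_mlp could be e.g. nn.Sequential(nn.Linear(300, 300), nn.ReLU(), nn.Linear(300, 300))
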

4.3 Inference

During inference, both model variants (other than the image-only baseline) are provided with only an answer category (e.g. "rong (color)", "boishishtyo (attribute)", "gonona (count)", etc.) alongside an image, because providing answers to the model would violate the realistic scenario (Krishna et al., 2019; Vedd et al., 2022). As a result, our model is kept under a realistic inference setting by not providing an answer as input during inference.

4.4 Evaluation Metrics

In our experiments, we followed the well-established language modelling evaluation metrics BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004).

4.5 Implementation details

We use a pretrained ResNet-18 as our image encoder to encode image features. Both our transformer-based encoder and decoder use GloVe embeddings. We set our transformer encoder and decoder with the following settings: number of layers = 4, number of attention heads = 4, embedding dimension = 300, hidden dimension = 300 and filter size = 300. The model trains for a total of 13000 steps, with a learning rate of 0.003 and a batch size of 64. We have implemented our model with PyTorch. We expect to release our code and translated dataset publicly at https://github.com/mahmudhasankhan/vqg-in-bengali.
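
The hyperparameters quoted above can be collected into a single configuration, sketched below; only the numeric settings come from the paper, while the optimizer choice (Adam) is an assumption.

    import torch

    CONFIG = {
        "num_layers": 4,        # transformer encoder/decoder layers
        "num_heads": 4,         # attention heads
        "embed_dim": 300,       # GloVe embedding dimension
        "hidden_dim": 300,
        "filter_size": 300,
        "train_steps": 13000,
        "learning_rate": 3e-3,
        "batch_size": 64,
    }

    def make_optimizer(model: torch.nn.Module):
        # The optimizer family is not specified in the paper; Adam is a common default.
        return torch.optim.Adam(model.parameters(), lr=CONFIG["learning_rate"])
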
Figure 4: Qualitative Examples. Ground truths are target questions for both models.

Model                  BLEU-1   BLEU-2   BLEU-3   CIDEr   METEOR   ROUGE-L
ablations
image-only             34.84    8.04     3.98     10.62   17.14    36.56
text-only              28.05    7.57     3.65     18.72   19.10    29.68
without-image-recon    11.59    4.85     2.08     26.61   12.34    31.43
variants
image-cat              33.12    13.52    7.56     22.76   17.18    36.12
image-ans-cat          32.97    11.80    3.82     18.63   18.63    36.90

Table 2: Evaluation results of model variants and ablations.

4.6 Model Ablations

We experiment with a series of ablations of our model: image-only does not include the text encoder; inversely, text-only does not have the image encoder. For without-image-recon, we do not optimize the reconstruction l2 loss between the reconstructed image features and the input image features. For our model variants, image-cat and image-ans-cat, the entire architecture remains intact.

5 Results

5.1 Quantitative Results

We test our model variants with only categorical information, because giving the answer to a model beforehand would be unrealistic, and we tried to determine which textual input is more significant and leads to better results. Firstly, our model ablations justify our architecture, as the intact architecture outperforms all ablations on BLEU-2 and BLEU-3 (see Table 2). Our baseline image-only model achieves a BLEU-3 score of 3.98, which is higher than the image-ans-cat variant. Moreover, we find that the image-cat model outperforms the image-ans-cat model on some metrics and stays marginally ahead on others. As seen in Table 2, the image-cat model achieves a BLEU-3 score of 7.56, almost 4 points ahead of both the image-only and image-ans-cat models. We also notice that the image-cat model performs marginally better on the CIDEr metric. However, both variants show similar performance on the other evaluation metrics, except for METEOR and ROUGE-L, where the image-ans-cat variant performs slightly better. In comparison to Vedd et al. (2022), for experiments in the explicit image-category setting for English, our BLEU-1 score is 33.12 while the English score is 40.8, a difference of 7.68; the BLEU-2 and BLEU-3 scores show larger differences. For METEOR, the English score is 20.8 while our image-cat model scores 17.18, a difference of only 3.62, and for ROUGE the English score is 43.0 while we score 36.12, a 6.88-point difference. Similar experiments on guided visual question generation have not, to our knowledge, been performed for other languages or for Bengali, so we compare only with English.
While our scores are lower than English, we train on a smaller, translated dataset due to computational and data-annotation constraints. Based on the quantitative results, we can conclude that categorical information gives better results overall. In the next section, we present the qualitative results, where we shall see that categorical information conditions the image-cat variant to generate category-specific questions, i.e. goal-driven, attribute-specific questions rather than generic questions.

5.2 Qualitative Results

In Figure 4, we compare the generated questions from our model variants with the reference ground truth question and answer category more illustratively. Questions generated by the image-ans-cat model, although grammatically and semantically correct, are in some cases not conditioned towards the given category. For example, for image 82846, although the question is grammatically correct, the generated question does not follow the given category, which is "count". We see similar behavior for images 349926 and 82259, where the questions are grammatically correct and relevant to the image but do not follow the category. In contrast, the image-cat model conditions its questions towards the given category. The questions are not only grammatically and semantically valid but also follow the given categorical information: the image-cat model generates goal-driven, non-generic and category-oriented questions. The reason this variant performs well despite having less side information during training is likely that, in the validation step, both variants only take category side information. Therefore, the image-cat model learns better than image-ans-cat.

Additionally, we notice that both variants are able to decode the semantic information from the input image as well. Both variants can rightly identify the objects and features present in the images.

Model            Experiment 1   Experiment 2
image-cat        47.5%          40%
image-ans-cat    30%            37.5%

Table 3: Human evaluation results for our model variants.

5.3 Human Evaluation

We conducted a human evaluation to understand the quality of the generated questions, similar to the work done in Vedd et al. (2022). In our experiments, we ask three annotators to evaluate our generated questions with two questions. There was no annotator overlap where two annotators annotated the same question. We evaluate category-wise question generation by comparing two of our model variants, image-cat and image-ans-cat.

In Experiment 1, known as the Visual Turing Test, we present annotators with an image, a ground truth question, and a model-generated question. The task of the annotators is to discern which question, among the two, they think was produced by the model. Experiment 2 involves displaying an image to the annotators along with a question generated by the model. Subsequently, the annotators are asked to decide whether the generated question seems relevant to the given image. For each experiment we annotate 40 generations for each model, resulting in 80 annotations per experiment. The complete results of our evaluation are listed in Table 3.

In Experiment 1, our image-cat model outperforms the image-ans-cat variant, fooling humans about 47.5% of the time. In a Visual Turing Test, if a model is capable of generating human-like questions, its performance is expected to reach approximately 50%. Although only close to the desired score of 50%, the image-cat variant represents a promising advancement towards passing the Visual Turing Test. We evaluate Experiment 2 on both model variants, where the image-ans-cat model shows a score of 37.5%, outperforming the image-cat model. It is possible that providing the answer with the image and the category helps in generating more relevant questions.

6 Conclusion
We proposed the first VQG work in Bengali and presented a novel transformer-based encoder-decoder architecture that generates questions in Bengali when shown an image and a given answer category. In our work, we presented two variants of our architecture, image-cat and image-ans-cat, which differ in the input they receive during training. Both variants generate a question from an image based on the answer category as guiding information. However, due to the two different input combinations, image-cat performs marginally better in terms of quantitative scores and generates goal-driven, specific questions conditioned towards the categorical information it receives. In contrast, the image-ans-cat model, although generating grammatically valid questions, fails to learn about answer categories. Future work could analyze the impact of using more modern CNN architectures and newer pretrained models to generate questions from images.

7 Acknowledgement

This work was funded by the Faculty Research Grant [CTRG-22-SEPS-07], North South University, Bashundhara, Dhaka 1229, Bangladesh.
References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Tanjim Taharat Aurpa, Richita Khandakar Rifat, Md Shoaib Ahmed, Md. Musfique Anwar, and A. B. M. Shawkat Ali. 2022. Reading comprehension based question answering system in Bangla language with transformer-based learning. Heliyon, 8(10):e11052.

Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323, Belgium, Brussels. Association for Computational Linguistics.

Remi Cadene, Corentin Dancette, Hedi Ben Younes, Matthieu Cord, and Devi Parikh. 2019. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota. Association for Computational Linguistics.

Shaoxiang Chen, Ting Yao, and Yu-Gang Jiang. 2019. Deep learning for video captioning: A review. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6283–6290. International Joint Conferences on Artificial Intelligence Organization.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, USA. Association for Computational Linguistics.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.

Zhihao Fan, Zhongyu Wei, Siyuan Wang, Yang Liu, and Xuanjing Huang. 2018. A reinforcement learning framework for natural question generation using bi-discriminators. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1763–1774, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2296–2304, Cambridge, MA, USA. MIT Press.

Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian, and Min Zhang. 2022. Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view. IEEE Transactions on Image Processing, 31:227–238.

Deepak Gupta, Pabitra Lenka, Asif Ekbal, and Pushpak Bhattacharyya. 2020. A unified framework for multilingual and code-mixed visual question answering. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 900–913, Suzhou, China. Association for Computational Linguistics.

Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, and Rifat Shahriyar. 2020. Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2612–2623, Online. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

S M Shahriar Islam, Riyad Ahsan Auntor, Minhajul Islam, Mohammad Yousuf Hossain Anik, A. B. M. Alim Al Islam, and Jannatun Noor. 2022. Note: Towards devising an efficient VQA in the Bengali language. In ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS), COMPASS '22, pages 632–637, New York, NY, USA. Association for Computing Machinery.

U. Jain, Z. Zhang, and A. Schwing. 2017. Creativity: Generating diverse questions using variational autoencoders. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5415–5424, Los Alamitos, CA, USA. IEEE Computer Society.

H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen. 2020. In defense of grid features for visual question answering. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10264–10273, Los Alamitos, CA, USA. IEEE Computer Society.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Zitnick, and R. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997, Los Alamitos, CA, USA. IEEE Computer Society.

Andrej Karpathy and Li Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676.

R. Krishna, M. Bernstein, and L. Fei-Fei. 2019. Information maximizing visual question generation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2008–2018, Los Alamitos, CA, USA. IEEE Computer Society.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Tasmiah Tahsin Mayeesha, Abdullah Md Sarwar, and Rashedur M. Rahman. 2021. Deep learning based question answering system in Bengali. Journal of Information and Telecommunication, 5(2):145–178.

Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813, Berlin, Germany. Association for Computational Linguistics.

Liangming Pan, Wenqiang Lei, Tat-Seng Chua, and Min-Yen Kan. 2019. Recent advances in neural question generation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Gao Peng, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, and Hongsheng Li. 2019. Multi-modality latent interaction network for visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5824–5834.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Mahamudul Hasan Rafi, Shifat Islam, S. M. Hasan Imtiaz Labib, SM Sajid Hasan, Faisal Muhammad Shah, and Sifat Ahmed. 2022. A deep learning-based Bengali visual question answering system. In 2022 25th International Conference on Computer and Information Technology (ICCIT), pages 114–119.

Mengye Ren, Ryan Kiros, and Richard S. Zemel. 2015. Exploring models and data for image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2953–2961, Cambridge, MA, USA. MIT Press.

Sagor Sarkar. 2019. https://github.com/sagorbrur/glove-bengali.

Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. 2020. What BERT sees: Cross-modal transfer for visual question generation. In International Conference on Natural Language Generation.

Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 588–598, Berlin, Germany. Association for Computational Linguistics.

Nobuyuki Shimizu, Na Rong, and Takashi Miyazaki. 2018. Visual question answering dataset for bilingual image understanding: A study of cross-lingual transfer using attention maps. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1918–1928, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 543–553, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.

Nihir Vedd, Zixu Wang, Marek Rei, Yishu Miao, and Lucia Specia. 2022. Guiding visual question generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1640–1654, Seattle, United States. Association for Computational Linguistics.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search: Decoding diverse solutions from neural sequence models.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164, Los Alamitos, CA, USA. IEEE Computer Society.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 2048–2057. JMLR.org.

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, and Jiawan Zhang. 2017. Automatic generation of grounded visual questions. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, pages 4235–4243.
