
Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing


Tim Siebert∗, Kai Norman Clasen∗, Mahdyar Ravanbakhsh, and Begüm Demir
Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
∗Equal Contribution. Further author information: send correspondence to Tim Siebert (E-mail: [email protected])

ABSTRACT
With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very
fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA)
has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS
images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image
and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-
specific representations in their fusion modules instead of joint representation learning. However, to discover
the underlying relation between both the image and question modality, the model is required to learn the joint
representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific
representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed
architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific
features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the
VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed
transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific
questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN
and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness
of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description
of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include
all the spectral bands of Sentinel-2 images with 10m and 20m spatial resolution. Experimental results show the
importance of utilizing these bands to characterize the land-use land-cover classes present in the images in the
framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multi-modal-fusion-transformer-for-vqa-in-rs.
Keywords: Multi-modal transformer, visual question answering, deep learning, remote sensing.

1. INTRODUCTION
With advances in satellite technology, remote sensing (RS) image archives are rapidly growing, providing an
unprecedented amount of data, which is a great source for information extraction in the framework of several
different Earth observation applications. As a result, there has been an increased demand for systems that
provide an intuitive interface to this wealth of information. For this purpose, the development of accurate visual
question answering (VQA) systems has recently become an important research topic in RS [1]. VQA defines a
framework that allows retrieval of use-case-specific information [2]. By asking a free-form question to a selected
image, the user can query various types of information without any need for remote sensing-related expertise.
Most of the existing VQA architectures consist of three main modules: i) the feature extraction module; ii)
the fusion module; and iii) the classification module. The feature extraction module extracts high-level features
for both input modalities (i.e., image and question text). After encoding the input modalities, the fusion module
is required to discover a cross-interaction between both features. Since the model needs to select the relevant part
of the image concerning the question, combining the features in a meaningful way is crucial. Finally, the output of
the fusion module is passed to the classification module to generate the natural language answer. In RS, relatively
few VQA architectures have been investigated [1, 3–7]. In [1], a convolutional neural network (CNN) is applied as an
image encoder in the feature extraction module. The natural language encoder is based on the skip-thoughts
[8] architecture, while the fusion module is defined based on a simple, non-learnable point-wise multiplication of
the feature vectors. In [4], image modality representation is provided by a VGG-16 network [9], which extracts
two sets of image features: i) an image feature map extracted from the last convolutional layer; and ii) an image
feature vector derived from the final fully connected layer. The text modality representations are extracted by
a gated-recurrent unit (GRU) [10]. The two modality-specific representations are merged by the fusion module
that leverages the more complex mutual-attention component. In [6], the authors reveal the potential of an
effective fusion module to build competitive RS VQA models. In the computer vision (CV) community, it has
been shown that to discover the underlying relation between the modalities, a model is required to learn a joint
representation instead of applying a simple combination (e.g., concatenation, addition, or multiplication) of the
modality-specific feature vectors [11]. In [3], this knowledge is transferred to the RS VQA domain by introducing
the first transformer-based architecture combined with an object detector pre-processor for the image features
(called CrossModal). The object detector is trained on an RS object detection dataset (i.e., the xView dataset
[12]) to recognize target objects such as cars, buildings, and ships. However, since the CrossModal architecture
leverages objects defined in xView, the model is specialized for VQA tasks that contain those objects. In contrast,
the largest RSVQA benchmark dataset (RSVQAxBEN [13]) includes questions regarding objects not included
in xView (e.g., the dataset contains questions that require a distinction between coniferous and broad-leaved
trees). In [7], a model called Prompt-RSVQA is proposed that takes advantage of the language transformer
DistilBERT [14]. To this end, the image feature extraction module uses a pre-trained multi-class classification
network to predict image labels, then projects the image labels into a word embedding space. Finally, the image
features encoded as words and the question text are merged in the transformer-based fusion module. Using word
embeddings of the image labels as image features in Prompt-RSVQA comes with limitations: i) the model is
tailored to the dataset that the classifier is trained on; ii) some questions are not covered in this formulation
(e.g., counting, comparison), since the relations between the class labels are not represented in the predicted label set. The aforementioned models provide promising results, guiding the motivation towards fusion modules that
exploit transformer-based architectures. However, they are limited to the datasets that the classifier or the object
detector is trained on, which makes the model unable to learn questions that are not covered in this formulation
(e.g., counting, comparison, and objects that are not in the classifier/object detector training set).
To overcome these issues, in this paper, we present a multi-modal transformer-based architecture that leverages a user-defined number of multi-modal transformer layers of the VisualBERT [11] model as a fusion module.
Instead of applying an object detector, we utilize BoxExtractor, a more general box extractor which does not
overemphasize objects, within the image feature extraction module. The resulting boxes are fed into a ResNet
[15] to generate image tokens as embeddings of the extracted boxes. For the text modality, the BertTokenizer [16]
is used to tokenize the questions. Our fusion module takes modality-specific tokens as input and processes them
with a user-defined number l of VisualBERT (VB) layers. The classification module consists of a Multi-Layer
Perceptron (MLP) that generates an output vector that represents the answer. Experimental results obtained on
large-scale RS VQA benchmark datasets [1, 13] (which only include the RGB bands of Sentinel-2 multispectral
images) demonstrate the success of the proposed architecture (called VBFusion) compared to the standard RS
VQA model. To analyze the importance of the other spectral bands for characterization of the complex informa-
tion content of Sentinel-2 images in the framework of VQA, we add the other 10m and all 20m spectral bands to
the RSVQAxBEN [13] dataset. The results show that the inclusion of these spectral bands significantly improves
the VQA performance, as the additional spectral information helps the model to take better advantage of the
complex image data. To the best of our knowledge, this is the first study to consider multispectral RS VQA that
is not limited to the RGB bands.
The remaining sections are organized as follows: Section 2 introduces the proposed architecture, including the feature extraction pipeline. Section 3 introduces the datasets used in the experiments and the experimental setup, while Section 4 analyzes the results. Section 5 concludes our work and provides an outlook on future research directions.
Figure 1: A general overview of VBFusion, our multi-modal transformer-based VQA architecture. Our ar-
chitecture includes: i) a feature extraction module (Section 2.1); ii) a fusion module (Section 2.2); and iii) a
classification module (Section 2.3). An RS image I and a question Q about this image are considered as in-
put. Inside the VisualBERT model, the modality-specific features [Z, T] get projected to a hidden dimension
and concatenated afterwards. The concatenated features are then fed to l Self-Attention Layers, providing an
output for further processing in the classification module. The output of the classification module is a vector
representing the final answer A to the question Q.

2. PROPOSED MULTI-MODAL TRANSFORMER-BASED VQA ARCHITECTURE


Visual question answering systems attempt to answer questions posed in natural language about an input image. The task can be formulated as a classification problem, where the image and question are the input and the answer is considered as a label. Given an image, question, answer triplet denoted as (I, Q, A), a VQA system takes (I, Q) as input and predicts A. We assume that a training set D = {I_n, Q_n^m, A_n^m} with n = 1, ..., N and m = 1, ..., M_n is available, which consists of N images, where each image I_n is associated with M_n pairs of questions {Q_n^m}_{m=1}^{M_n} and answers {A_n^m}_{m=1}^{M_n}. Given the training set D, the proposed multi-modal fusion transformer aims to learn a
joint representation (from images and questions) and a classification head to obtain the answer. To this end, the
proposed architecture includes: i) a feature extraction module based on the BoxExtractor and the BertTokenizer;
ii) a fusion module based on a user-defined number of multi-modal transformer layers of VisualBERT; and iii) a
classification module consisting of an MLP projection head. Figure 1 shows the general overview of our multi-
modal transformer-based VQA architecture VBFusion. Each module is explained in detail in the following.
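Since the VQA task is cast as a classification problem over a fixed answer set, each answer string A_n^m has to be mapped to a class index before training. The following is a minimal Python sketch of this formulation; the helper class and its restriction to the most frequent answers (used later for the RSVQAxBEN experiments) are illustrative and not taken from the released implementation.

from collections import Counter

class AnswerVocabulary:
    """Maps free-form answer strings to class indices for the VQA classifier."""

    def __init__(self, answers, max_size=1000):
        # Optionally keep only the most frequent answers (the RSVQAxBEN
        # experiments restrict the output space to the 1,000 most frequent ones).
        counts = Counter(answers)
        self.index_to_answer = [a for a, _ in counts.most_common(max_size)]
        self.answer_to_index = {a: i for i, a in enumerate(self.index_to_answer)}

    def encode(self, answer):
        """Class index used as the training target (None if out of vocabulary)."""
        return self.answer_to_index.get(answer)

    def decode(self, index):
        """Answer string corresponding to a predicted class index."""
        return self.index_to_answer[index]

# Each triplet (I_n, Q_n^m, A_n^m) then becomes (image, question, class index).
vocab = AnswerVocabulary(["yes", "no", "yes", "0", "mixed forest"])
target = vocab.encode("yes")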

2.1 Feature Extraction Module


The feature extraction module aims to extract relevant features for text and image modality independently
from each other. To this end, the feature extraction module consists of two modality-specific encoder networks:
i) image modality encoder f ; and ii) text modality encoder g. In the case of the image modality, we first
utilize a simplified box extractor followed by an image encoder network. Then the output of the image encoder
network is transformed with an MLP layer into a suitable shape for the fusion module. It is worth noting that,
to extract the image features, in CV it is common to guide the focus of a model to the relevant areas of the
image [11, 17]. For this reason, the state-of-the-art VQA models in CV leverage an object detector to focus on
important regions (i.e., objects). For example, in VisualBERT [11] a Faster R-CNN [18] backbone is used to extract the relevant regions of the image. To extract regions of interest and detect objects present in RS images, semantic segmentation models are often applied. Such models require the availability of reliable pixel-based ground reference samples to be used in the training phase. The collection of a sufficient number of reliable labeled samples is time-consuming, complex, and costly in operational scenarios and can significantly affect the final accuracy of object detection. To overcome this issue, we propose BoxExtractor, which generates rectangular boxes selecting regions of interest without requiring any labeled training samples.

Figure 2: The image feature extraction module takes an image I as input and outputs a feature map Z. First, the BoxExtractor B selects regions of the input image I and creates b raw boxes of size c × H′_i × W′_i with i = 1, ..., b. These raw boxes are interpolated to the size c × H′ × W′, resulting in the vector of boxes I′ = B(I). To further process I′, the image encoder network f is leveraged. The output of f, I∗ (of size b × K), is then projected to the image features Z (of size b × v) through an MLP layer.
The image feature extraction pipeline is illustrated in Figure 2. To obtain the image features, we first create spatially interpolated rectangular regions of the image I_n with our BoxExtractor. For this purpose, let c ∈ ℕ be the number of spectral bands, H, W ∈ ℕ the height and width of the image, and H′, W′ ∈ ℕ the height and width of the interpolated boxes. Let b ∈ ℕ be the number of boxes that BoxExtractor creates. Then the function B : ℝ^(c×H×W) → ℝ^(b×c×H′×W′) defines the BoxExtractor that maps the image I_n to its boxes B(I_n) = I′_n in a non-deterministic way. The non-determinism arises because we first choose random start- and end-points for the height and the width of the boxes, s_H^i, e_H^i ∈ {0, 1, ..., H} with s_H^i < e_H^i, and s_W^i, e_W^i ∈ {0, 1, ..., W} with s_W^i < e_W^i, for i ∈ {1, ..., b}. In the second step, a box is created for each pair of start- and end-points. We control the distance between the start- and end-points (e.g., e_H^i − s_H^i) to guarantee a minimum amount of information in each box. To this end, we set:

min_H ≤ e_H^i − s_H^i    (1)

and

min_W ≤ e_W^i − s_W^i,    (2)

for i = 1, ..., b, where min_H and min_W are hyperparameters. The resulting vector of boxes I′_n is obtained by equally resizing the boxes to H′ and W′ with bicubic interpolation. Then, we stack the image boxes and pass them to the image encoder network f. Finally, we transform the output I∗ = f(I′_n) into a suitable shape for the fusion module. To this end, we reshape the features and keep the number of boxes as the first dimension, whereas the second dimension is given by flattening the remaining dimensions, resulting in a matrix. The matrix is fed into an MLP to project its size to b × v, where v ∈ ℕ is the so-called image embedding dimension of the VisualBERT [11] model (i.e., the required input dimension for the image modality). Finally, the output image embedding vectors Z_n = MLP(I∗) are given to the fusion module.
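To make the pipeline concrete, a minimal PyTorch sketch of the BoxExtractor and the subsequent encoding step is given below. The box sizes, the choice of ResNet backbone, and the projection dimension are assumptions for illustration and do not reproduce the exact configuration of the released code.

import torch
import torch.nn.functional as F
from torchvision.models import resnet152

def box_extractor(image, b, h_out, w_out, min_h, min_w):
    """BoxExtractor B: sample b random rectangular boxes from an image of shape
    (c, H, W) and resize each box to (c, h_out, w_out) with bicubic
    interpolation, yielding a tensor of shape (b, c, h_out, w_out)."""
    c, H, W = image.shape
    boxes = []
    for _ in range(b):
        # Random start- and end-points with a minimum box size (Eqs. (1) and (2)).
        s_h = torch.randint(0, H - min_h + 1, (1,)).item()
        e_h = torch.randint(s_h + min_h, H + 1, (1,)).item()
        s_w = torch.randint(0, W - min_w + 1, (1,)).item()
        e_w = torch.randint(s_w + min_w, W + 1, (1,)).item()
        box = image[:, s_h:e_h, s_w:e_w].unsqueeze(0)          # (1, c, h_i, w_i)
        box = F.interpolate(box, size=(h_out, w_out), mode="bicubic",
                            align_corners=False)
        boxes.append(box)
    return torch.cat(boxes, dim=0)                             # (b, c, h_out, w_out)

# Image encoder f: a frozen ResNet truncated before its classification head.
backbone = resnet152(weights="IMAGENET1K_V1")
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

image = torch.rand(3, 256, 256)   # an RGB patch; sizes below are placeholders
boxes = box_extractor(image, b=10, h_out=224, w_out=224, min_h=64, min_w=64)
with torch.no_grad():
    feats = encoder(boxes).flatten(start_dim=1)                # I*, shape (b, K)
proj = torch.nn.Linear(feats.shape[1], 2048)                   # MLP projection to b x v
Z = proj(feats)                                                # image embeddings Z_n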
To extract the text features from a question Q_n^m, we translate the question into tokens. These tokens are a mapping of the words to numerical values. To achieve this and to be aligned with the original VisualBERT model, we utilize a frozen BertTokenizer [16] as the text modality encoder network g, where T_n^m = g(Q_n^m) is the output text embedding vector. Both Z_n and T_n^m are jointly processed to extract their implicit knowledge in the next step.
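A short sketch of the text side is given below; it uses the BertTokenizer from the Hugging Face transformers library, and the checkpoint name and maximum sequence length are assumptions for illustration.

from transformers import BertTokenizer

# Frozen text modality encoder g: maps a question Q_n^m to token ids T_n^m.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "Is a water body present in the image?"
T = tokenizer(question, padding="max_length", max_length=32,
              truncation=True, return_tensors="pt")
# T["input_ids"] and T["attention_mask"] are later passed to the fusion module
# together with the image embeddings Z_n.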

2.2 Fusion Module


Our transformer-based fusion module jointly learns the alignment between the image and text modality. To learn
the joint alignment, we leverage l VisualBERT layers. These layers are an extension of the language transformer
BERT with the image modality. VisualBERT first projects the image features Z_n into a suitable shape to concatenate the projected image features with the language modality. The concatenated image and text features [Z_n, T_n^m] are then fed to the BERT transformer layers. The output is given by pooling the last hidden state.
The resulting vector is further processed in the classification module. To summarize the forward procedure,
VisualBERT fuses the image and language modalities and feeds them into the BERT architecture, representing a
natural extension of BERT to multiple modalities. Additionally, VisualBERT shows competitive results on VQA
benchmark datasets and is thus well suited as a starting point for multi-modal fusion transformers in RS VQA.
Unlike the previous RS VQA models that simply combine both modalities in a non-learnable fashion, our model
learns a joint representation. Note that our approach differs from [7] because they apply a transformer only to
the language modality. In contrast to the model proposed in [3], we apply exclusively multi-modal self-attention
instead of a mixture of single- and multi-modal self-attention.
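The sketch below illustrates how such a fusion module could be built on top of the Hugging Face implementation of VisualBERT; the checkpoint name and the way the encoder is truncated to l layers are assumptions made for illustration, not the exact released implementation.

import torch
from transformers import VisualBertModel

l = 6  # user-defined number of multi-modal transformer layers

# Load a pre-trained VisualBERT model and keep only its first l layers.
vb = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
vb.encoder.layer = vb.encoder.layer[:l]

# Toy inputs standing in for the outputs of the feature extraction module:
# token ids from the BertTokenizer and image embeddings Z_n of shape (batch, b, v).
input_ids = torch.randint(0, vb.config.vocab_size, (1, 32))
attention_mask = torch.ones_like(input_ids)
visual_embeds = torch.rand(1, 10, vb.config.visual_embedding_dim)  # Z_n
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

outputs = vb(input_ids=input_ids,
             attention_mask=attention_mask,
             visual_embeds=visual_embeds,
             visual_attention_mask=visual_attention_mask)

fused = outputs.pooler_output  # pooled joint representation for the classifier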

2.3 Classification Module


The task of the classification module is to create an output vector representing the prediction for the specific
answers. Following [1], we utilize an MLP as a classification module. The MLP consists of three layers, where
the last layer has the dimension of the considered answer set. The final answer is then obtained by selecting the
largest activation of the output vector.
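A minimal sketch of such a classification head is shown below; the hidden dimensions are placeholders, and only the constraint that the last layer matches the size of the answer set is taken from the description above.

import torch.nn as nn

class ClassificationModule(nn.Module):
    """Three-layer MLP mapping the fused representation to answer logits."""

    def __init__(self, input_dim=768, hidden_dim=1024, num_answers=1000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),  # one logit per candidate answer
        )

    def forward(self, fused):
        logits = self.mlp(fused)
        return logits

# The final answer is the answer with the largest activation:
# predicted_index = logits.argmax(dim=-1)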

3. DATASET DESCRIPTION AND EXPERIMENTAL SETUP


We conducted experiments on the RSVQA-LR [1] and the RSVQAxBEN [13] datasets. The RSVQA-LR dataset
was constructed using 7 Sentinel-2 tiles acquired over the Netherlands, from which only the RGB bands were
used. The tiles are divided into 772 patches of size 256x256 pixels. To obtain the information needed to create the
image-question-answer triplets, knowledge from the publicly available OpenStreetMap database was leveraged.
With this knowledge, the dataset was constructed by taking an image patch and randomly generating question-
answer pairs from a given template. This procedure resulted in 77,232 image-question-answer triplets. As in
[1], we used the same tile-based train/validation/test split. The training split consists of five tiles, whereas the
validation and test split consist of one tile each. For further information on the RSVQA-LR dataset (e.g., the
definition of the question classes), the reader is referred to [1].
The RSVQAxBEN [13] dataset embodies the largest freely available RS VQA benchmark dataset. It is based
on the BigEarthNet (BEN) [19] archive and contains 590,326 Sentinel-2 L2A image patches with 12 spectral
bands of varying resolution (10m, 20m, and 60m). The content of each patch is described by multiple class labels
from the CORINE Land Cover (CLC) 2018 database. The RSVQAxBEN dataset was constructed as a large-
scale benchmark for RS VQA, using only the RGB bands of the BEN patches (which have a spatial resolution
of 10m). A stochastic algorithm that utilizes the CLC labels produced the image-question-answer triplets. The
procedure described in [13] resulted in 14,758,150 image-question-answer triplets, which include 26,875 unique
answers. To limit the number of possible answers, the model’s output in [13] was restricted to the 1,000 most
frequent answers, covering 98.1 % of the answer set. For the sake of comparability, we also apply this restriction.
We use the same train/validation/test split based on the tiles’ spatial location as in [13]. For more statistical
insights, the reader is referred to [13]. We provide experimental results on the original RSVQAxBEN dataset.
Furthermore, we extended the original three-band RSVQAxBEN dataset to a ten-band version by including all
spectral bands with 10m and 20m spatial resolution in addition to the RGB bands. In our experiments, we
resampled the 20m bands to the resolution of the 10m bands by using a cubic interpolation method.
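As an illustration of the resampling step, the six Sentinel-2 20m bands of a BigEarthNet patch (60×60 pixels) can be upsampled to the 10m grid (120×120 pixels) and stacked with the four native 10m bands. The sketch below uses bicubic interpolation in PyTorch and is only one possible way to implement the cubic resampling described above.

import torch
import torch.nn.functional as F

# Six Sentinel-2 20m bands of a BigEarthNet patch (60 x 60 pixels each).
bands_20m = torch.rand(1, 6, 60, 60)

# Upsample to the 10m grid (120 x 120 pixels) with bicubic interpolation.
bands_20m_up = F.interpolate(bands_20m, size=(120, 120), mode="bicubic",
                             align_corners=False)

# Stack with the four native 10m bands to obtain the ten-band input.
bands_10m = torch.rand(1, 4, 120, 120)
ten_band_patch = torch.cat([bands_10m, bands_20m_up], dim=1)  # (1, 10, 120, 120)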
Table 1: Accuracies (%) obtained by using SkipRes and VBFusion with different numbers l of layers (RSVQA-LR dataset).

Architecture   l    Count   Presence   Comparison   Rural/Urban   AA      OA
SkipRes [13]   –    67.01   87.46      81.50        90.00         81.49   79.08
VBFusion       4    68.17   88.02      88.51        87.00         82.92   82.36
VBFusion       6    68.17   88.87      88.83        84.00         82.47   82.71
VBFusion       8    69.36   89.00      83.46        88.00         82.45   80.99
VBFusion       10   67.73   89.48      86.68        88.00         82.97   81.94
VBFusion       12   68.14   89.27      88.71        85.00         82.78   82.78

We performed experiments with different layer configurations for our proposed VBFusion architecture given
by the pre-trained VisualBERT [11] model. The value of l (which defines the number of layers selected from
the original VisualBERT model) is varied as l ∈ {4, 6, 8, 10, 12}. Note that when l = 12, it is identical to
the original VisualBERT model. For the image encoder network f , we utilized a pre-trained ResNet152 [15]
network architecture with frozen weights. The image encoder network was pre-trained on ImageNet [20] for
RSVQA-LR and three-band RSVQAxBEN and on BEN for the ten-band variant. We extracted ten boxes for
the RSVQA-LR and three-band RSVQAxBEN datasets, and five boxes for the ten-band variant. The learning rate was set to 10⁻⁶, while the maximum number of training epochs was set to 300 for RSVQA-LR and 20 for RSVQAxBEN. We chose a batch size of 1024 and 2048 for RSVQA-LR and RSVQAxBEN, respectively. We
performed our experiments on a cluster of 8 NVIDIA Tesla A100 GPUs. We used the implementation from
the Huggingface library [21] for the VisualBERT layers. Our architecture was compared with the SkipRes [13]
architecture, which is based on Skip-Thoughts and ResNet152. To analyze the results on the RSVQA-LR dataset
we consider the same metrics as given in [1]: the accuracy of counting questions (called “Count”), of presence
questions (called “Presence”), of comparison questions (called “Comparison”) and of rural/urban questions (called
“Rural/Urban”). Furthermore, we provide the average accuracy (“AA”) over the question type accuracies and the
overall accuracy (“OA”). The metrics used to analyze the results on the RSVQAxBEN dataset are: the accuracy
of yes/no questions (called “Yes/No”) and the accuracy of the land use and land cover questions (called “LULC”),
as well as “OA” and “AA”.
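For clarity, the two summarizing metrics can be computed from per-question-type counts as in the short sketch below: “OA” is the accuracy over all questions, while “AA” is the unweighted mean of the per-type accuracies (the example counts are hypothetical).

def overall_accuracy(correct, total):
    """OA: correctly answered questions over all questions, in percent."""
    return 100.0 * sum(correct.values()) / sum(total.values())

def average_accuracy(correct, total):
    """AA: unweighted mean of the per-question-type accuracies, in percent."""
    per_type = [100.0 * correct[t] / total[t] for t in total]
    return sum(per_type) / len(per_type)

# Hypothetical counts for the four RSVQA-LR question types.
correct = {"Count": 680, "Presence": 890, "Comparison": 885, "Rural/Urban": 88}
total = {"Count": 1000, "Presence": 1000, "Comparison": 1000, "Rural/Urban": 100}
print(f"OA = {overall_accuracy(correct, total):.2f}, "
      f"AA = {average_accuracy(correct, total):.2f}")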

4. EXPERIMENTAL RESULTS
Table 1 shows the results obtained on the RSVQA-LR dataset. By analyzing the table, one can see that our pro-
posed VBFusion architecture outperforms the SkipRes model in all metrics except “Rural/Urban”, independently
of the number of layers. The most significant difference occurs in the “Comparison” accuracy, which represents
the most challenging question type. Here, a model is required to count the number of occurrences of a selected
object with a specific positional relationship to a reference object. For this type of question, our 6-layer VBFu-
sion architecture significantly outperforms SkipRes by more than 7 %. For the “Count” questions the differences
are smaller, e.g., VBFusion with 10 layers outperforms the SkipRes model by only 0.7 %. In the summarizing
metrics “AA”, and “OA”, all VBFusion models, independently of the number of layers, significantly outperform
SkipRes. Furthermore, no significant improvement is observed when increasing the number l of VisualBERT layers. Although complexity increases when using more layers, performance changes only slightly. This observation aligns with the findings of [22] that large transformers struggle on small datasets. To benefit from the larger transformer configurations, more samples are probably required. In the case of the “Rural/Urban” questions, the underperformance can be explained with the same argument: this question type makes up the smallest proportion of the dataset (1%), so there are too few questions of this type. By analyzing the
results, one can conclude that the proposed VBFusion architecture is able to improve the performance compared
to the SkipRes architecture. It is worth emphasizing that the datasets used in the experiments are benchmarks,
whereas in many real applications the VQA is expected to be applied to much larger archives. Due to the nature
of the large transformer-based architecture, we expect that the gain achieved by the VBFusion architecture will be increased for large-scale RS VQA problems and also for cases where much more complex question types are present.

Table 2: Accuracies (%) obtained by using SkipRes and VBFusion with different numbers l of layers (RSVQAxBEN dataset).

Architecture   l    Number of Bands   LULC    Yes/No   AA      OA
SkipRes [13]   –    3                 20.68   80.02    50.35   69.92
VBFusion       4    3                 20.40   83.86    52.13   73.06
VBFusion       4    10                25.72   85.41    55.56   75.26
VBFusion       6    3                 21.66   84.58    53.12   73.88
VBFusion       6    10                25.88   85.48    55.68   75.34
VBFusion       8    3                 20.07   84.37    52.22   73.43
VBFusion       8    10                25.04   86.56    55.80   76.10
VBFusion       10   3                 20.82   85.13    52.97   74.19
VBFusion       10   10                25.19   85.95    55.57   75.61
VBFusion       12   3                 24.33   85.47    54.90   75.07
VBFusion       12   10                26.26   85.34    55.80   75.29

Table 2 reports the accuracies for the three- and ten-band RSVQAxBEN dataset variants. By analyzing the table, one can observe that the proposed VBFusion architecture outperforms the SkipRes architecture, irrespective of the layer configuration, in both summarizing metrics “AA” and “OA”. Even the smallest configuration
with 4 layers trained on three bands improves the “AA” by almost 2 % and the “OA” by more than 3 %. When we
compare the performance of the proposed architecture trained on the three-band variant, most of the improve-
ments are attributed to the “Yes/No” related questions. We can also see that in the 3-band scenario, deeper and
larger VBFusion models lead to higher “OA” scores. This pattern is in contrast to the generally similar perfor-
mance in the “LULC” question category among the layer configurations up to l = 10 trained on three bands.
Interestingly, a performance jump can be observed for the “LULC” questions with the largest layer configuration
(l = 12). A possible reason is that with 12 layers, all pre-trained VisualBERT layers are utilized and none are
rejected. Therefore, the initial weights can produce more meaningful high-level embeddings than the pruned
models. This seems to be especially important for the complex “LULC” questions, where the model is required
to deeply understand the contents of intricate multispectral images and connect them to the associated question.
The performance improvement exists for both the three- and ten-band variants, although it is more dominant in the three-band configuration. With all 12 layers, the “LULC” accuracy of the three-band model increases by 3.5% compared to the 10-layer configuration, leading to a high “OA” of 75.07%. The effect is not as significant for
the 10-band configuration. The relatively low impact on the 10-band configuration can be due to a discrepancy
between the high-level RGB image feature representation and the ten-band feature representation from the initial
weights of the pre-trained VisualBERT model, since VisualBERT was pre-trained on RGB images. However,
analyzing the results for all 10-band configurations, one can observe that these models improve the performance
in all metrics compared to their three-band counterparts, except for the 12-layer configuration in the “Yes/No”
category, where the accuracy is slightly lower by less than 0.2 %. Most of the observed performance increase is
due to the notably higher accuracies in the “LULC” category. For example, the smallest configuration (l = 4) trained on ten bands is more than 5% better in the “LULC” category than the same architecture trained on the three-band variant. An improved “LULC” performance is expected, since some LULC classes, such as different
types of vegetation, greatly benefit from additional spectral information. The smallest model trained on ten
bands even outperforms the largest and best-performing configuration trained on three bands by 0.6 % in “AA”
and almost by 0.2 % in “OA”. When training with ten bands, it can be observed that larger configurations of
our proposed VBFusion architecture do not necessarily lead to higher accuracies. The best performing ten-band
trained configuration utilizes 8 layers and reaches an “OA” of 76.10 %, while the full 12 layer configuration has the
second lowest “OA” with 75.29 %. By analyzing the results, one can conclude that the proposed VBFusion archi-
tecture discovers the underlying relationship between both the image and question modality better than models
that utilize a simple feature combination as the fusion module. Furthermore, the results show the importance
of utilizing additional spectral bands, when available, to better model the contents of intricate multispectral
imagery for VQA systems.

5. CONCLUSION
In this paper, we have presented a novel architecture in RS VQA that applies a transformer model, which
exclusively relies on multi-modal transformer layers to learn the image and text representations jointly. Our
proposed architecture VBFusion includes: i) a feature extraction module based on the BoxExtractor, ResNet152,
and the BertTokenizer; ii) a fusion module based on a user-defined number of multi-modal transformer layers
of VisualBERT; and iii) a classification module consisting of an MLP projection head. It is worth noting that
our architecture is not limited to specific questions, e.g., questions concerning pre-defined objects. To show
the effectiveness of the proposed architecture, we have evaluated the model on the RSVQA-LR dataset, the
RSVQAxBEN dataset (which only includes the RGB bands of the Sentinel-2 image patches), and an extended
RSVQAxBEN dataset (which includes all the spectral bands of Sentinel-2 images with 10m and 20m spatial
resolution). From the experimental results obtained on the RSVQAxBEN dataset variants, we observe that: i)
our architecture leads to significant performance improvements compared to an architecture that simply combines
the modality-specific representations in the fusion module as our architecture better discovers the underlying
relationship between the modalities; and ii) exploitation of additional available spectral bands leads to better
modeling of the complex spatial and spectral content of RS images in the context of VQA. Our architecture also results in a performance increase on the comparatively small RSVQA-LR dataset compared to the SkipRes architecture,
which utilizes a simple feature combination as a fusion module. Compared to the relatively large complexity of
our models, the slight improvement aligns with the literature findings that large transformers struggle on small
datasets. In addition, larger layer configurations do not provide a significant improvement in the RSVQA-LR
dataset. These results show that our multi-modal transformer-based fusion module requires larger training sets
to realize its full potential.
We would like to note that although transformer-based models have the potential to provide high VQA performance in RS, they are associated with a high number of model parameters and high training complexity. Thus,
as a future work, we plan to investigate efficient transformers and conduct a comparative study in terms of
complexity and performance. Furthermore, we plan to extend our feature extraction module by optimizing the BoxExtractor or replacing it with a more sophisticated general box extractor.

ACKNOWLEDGEMENT
This work is funded by the European Research Council (ERC) through the ERC-2017-STG BigEarth Project
under Grant 759764 and by the German Ministry for Economic Affairs and Climate Action through the AI-Cube
Project under Grant 50EE2012B.

REFERENCES
[1] Lobry, S., Murray, J., Marcos, D., and Tuia, D., “Visual question answering from remote sensing images,”
IEEE International Geoscience and Remote Sensing Symposium 60, 4951–4954 (2019).
[2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D., “VQA: Visual
Question Answering,” International Conference on Computer Vision (2015).
[3] Felix, R., Repasky, B., Hodge, S., Zolfaghari, R., Abbasnejad, E., and Sherrah, J., “Cross-modal visual ques-
tion answering for remote sensing data,” International Conference on Digital Image Computing Techniques
and Applications, 1–9 (2021).
[4] Zheng, X., Wang, B., Du, X., and Lu, X., “Mutual attention inception network for remote sensing visual
question answering,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–14 (2022).
[5] Lobry, S., Marcos, D., Kellenberger, B., and Tuia, D., “Better generic objects counting when asking questions
to images: A multitask approach for remote sensing visual question answering,” ISPRS Annals of the
Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020, 1021–1027 (2020).
[6] Chappuis, C., Lobry, S., Kellenberger, B., Saux, B. L., and Tuia, D., “How to find a good image-text
embedding for remote sensing visual question answering?,” CoRR abs/2109.11848 (2021).
[7] Chappuis, C., Zermatten, V., Lobry, S., Le Saux, B., and Tuia, D., “Prompt-RSVQA: Prompting visual
context to a language model for remote sensing visual question answering,” Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops, 1372–1381 (2022).
[8] Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S., “Skip-Thought
vectors,” Advances in Neural Information Processing Systems 28 (2015).
[9] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition,” 3rd
International Conference on Learning Representations (2015).
[10] Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y., “On the properties of neural machine trans-
lation: Encoder-decoder approaches,” CoRR abs/1409.1259 (2014).
[11] Li, L. H., Yatskar, M., Yin, D., Hsieh, C., and Chang, K., “VisualBERT: A simple and performant baseline
for vision and language,” CoRR abs/1908.03557 (2019).
[12] Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., Bulatov, Y., and McCord, B., “xView:
Objects in context in overhead imagery,” CoRR abs/1802.07856 (2018).
[13] Lobry, S., Demir, B., and Tuia, D., “RSVQA meets BigEarthNet: A new, large-scale, visual question
answering dataset for remote sensing,” IEEE International Geoscience and Remote Sensing Symposium, 1218–1221 (2021).
[14] Sanh, V., Debut, L., Chaumond, J., and Wolf, T., “DistilBERT, a distilled version of BERT: smaller, faster,
cheaper and lighter,” CoRR abs/1910.01108 (2019).
[15] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,”
CoRR abs/1512.03385 (2015).
[16] Devlin, J., Chang, M., Lee, K., and Toutanova, K., “BERT: pre-training of deep bidirectional transformers
for language understanding,” CoRR abs/1810.04805 (2018).
[17] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L., “Bottom-up and
top-down attention for image captioning and VQA,” CoRR abs/1707.07998 (2017).
[18] Ren, S., He, K., Girshick, R. B., and Sun, J., “Faster R-CNN: towards real-time object detection with region
proposal networks,” CoRR abs/1506.01497 (2015).
[19] Sumbul, G., Charfuelan, M., Demir, B., and Markl, V., “BigEarthNet: A large-scale benchmark archive
for remote sensing image understanding,” IEEE International Geoscience and Remote Sensing Symposium, 5901–5904 (2019).
[20] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., “ImageNet: A large-scale hierarchical
image database,” IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009).
[21] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R.,
Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L.,
Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M., “Transformers: State-of-the-art natural language
processing,” Proceedings of the Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, 38–45 (2020).
[22] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N., “An image is worth 16x16 words:
Transformers for image recognition at scale,” CoRR abs/2010.11929 (2020).
