
Improving Visual Question Answering with Cross-Modal Fusion, Image Attention Feature Extraction, and Biomedical Text Processing using BioBERT
TABLE OF CONTENTS

01 INTRODUCTION
02 Literature Review
03 Methodology
04 Results and Conclusions

1
INTRODUCTION
Medical Visual Question Answering

According to World Health Organization statistics, over forty-five percent of countries had at most one physician per 1,000 inhabitants in 2021 [23]. As a result, healthcare staff must work long hours, which increases the rate of human error. Because analyzing medical images consumes a large share of a doctor's time, artificial intelligence (AI) could assist with and automate image analysis. AI researchers are paying increasing attention to visual question answering, and their efforts are producing models that can comprehend the broader context of an image and solve problems that call for human-level reasoning.
Medical Visual Question Answering

The goal of visual question answering (VQA) is to answer a question about an image based on its visual content.
Medical Visual Question Answering (VQA)
represents a transformative approach to
interpreting medical images and enhancing
healthcare delivery. By leveraging the
synergies between computer vision and
natural language processing, VQA systems
enable clinicians, educators, and researchers
to extract actionable insights from medical
imagery and textual queries. While challenges
persist, ongoing research endeavors hold
promise for overcoming these obstacles and
unlocking the full potential of Medical VQA in
revolutionizing healthcare.
General approach of medical VQA methods
A high-level model design for the task of VQA. The model has four major components: image feature extraction, question feature extraction, feature fusion amalgamated with the attention mechanism, followed by answer categorization or generation depending on the task.
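As a rough illustration, these four components can be wired together in a few lines of PyTorch. The module below is a generic sketch with placeholder encoders and assumed dimensions, not the architecture of any specific paper.

# Minimal PyTorch sketch of the four-component VQA design described above.
# The encoders, fusion, and classifier are placeholders; dimensions are illustrative.
import torch
import torch.nn as nn

class SimpleVQAModel(nn.Module):
    def __init__(self, img_dim=512, q_dim=768, hidden_dim=1024, num_answers=1000):
        super().__init__()
        # 1) image feature extraction (stand-in for a CNN backbone)
        self.image_encoder = nn.Linear(img_dim, hidden_dim)
        # 2) question feature extraction (stand-in for an RNN/BERT encoder)
        self.question_encoder = nn.Linear(q_dim, hidden_dim)
        # 3) feature fusion with a simple attention-style gating
        self.attention = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
        # 4) answer categorization head
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, q_feats):
        v = torch.relu(self.image_encoder(img_feats))     # image branch
        q = torch.relu(self.question_encoder(q_feats))    # question branch
        fused = v * self.attention(q)                     # question-guided gating of image features
        return self.classifier(fused)                     # answer logits

# Example usage with random stand-in features
logits = SimpleVQAModel()(torch.randn(2, 512), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 1000])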
2
Literature Review
Research Paper No.1
In the research paper [64], Xception was utilized to extract image features, while GRU was employed to extract
question features. Due to the disparity in the number of image and question features, a technique of replicating
the question features was employed to achieve feature parity. Furthermore, an attention mechanism was
incorporated before the fusion process. This approach was adopted to address the discrepancy in feature
quantities and enhance the fusion process.
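A minimal sketch of the replication idea, assuming region-level image features of shape (batch, regions, dim) and a single GRU question vector per sample; the exact shapes used in [64] may differ.

# Illustrative sketch of replicating a question vector so it can be paired
# with every image feature location before attention/fusion (shapes assumed).
import torch

image_feats = torch.randn(4, 100, 2048)   # (batch, num_image_regions, feat_dim) -- assumed shapes
question_feat = torch.randn(4, 2048)      # one GRU question vector per sample

# Replicate the question vector across the 100 image regions to reach feature parity
question_tiled = question_feat.unsqueeze(1).expand(-1, image_feats.size(1), -1)
print(question_tiled.shape)               # torch.Size([4, 100, 2048])

# The tiled question can now be fused with the image features element-wise
fused = image_feats * question_tiled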
Research Paper No.1
Three experiments were conducted to evaluate the model proposed in this study. The parameters used in these
experiments were as follows: a dictionary size of 1000, sequence length of 9, GRU hidden size of 128, and a
training batch size of 256. The number of epochs was set to 54 [1].
The experiments are described as follows:
1. In the first experiment, the proposed model (Xception-GRU) was implemented without data
enhancement.
2. In the second experiment, Bi-LSTM was used instead of GRU for extracting text features. However, it
was observed that bidirectional LSTM was less effective compared to GRU.
3. In the last experiment, the proposed model (Xception-GRU) was executed with data enhancement,
while keeping the remaining architecture unchanged [1].
These experiments were conducted to evaluate the performance and effectiveness of the proposed model under
different conditions and architectural variations.
The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by a machine, such as translations or generated answers, by comparing it to one or more reference texts created by humans. Although BLEU is most commonly used in the context of machine translation, it is also applied to answer-generation tasks such as VQA.

Model                                    Accuracy   BLEU
Xception + GRU without enhancement       0.21       0.393
Xception + Bi-LSTM without enhancement   0.2        0.31
Xception + GRU with enhancement          0.178      0.27
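As a generic illustration (not the papers' exact evaluation script), a sentence-level BLEU score can be computed with NLTK:

# Generic example of computing a sentence-level BLEU score with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["axial", "ct", "of", "the", "abdomen"]]   # human reference answer (tokenized)
candidate = ["axial", "ct", "scan", "of", "abdomen"]    # model-generated answer (tokenized)

# Smoothing avoids zero scores when higher-order n-grams have no overlap on short answers
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")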
Research Paper No.2
In the research paper [65], the system architecture is composed of two primary components: an encoder and a
decoder. The encoder module comprises three sub-components: transfer learning for feature extraction from
images, word2vec combined with a two-layer LSTM for feature extraction from questions, and element-wise
multiplication for fusing the visual and textual features. On the other hand, the decoder module consists of a
sequence generating LSTM network responsible for generating output answers based on input questions and
images.
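A minimal sketch of this encoder under assumed dimensions: projected VGG-19 image features and a two-layer LSTM over word embeddings, fused by element-wise multiplication (the sequence-generating decoder is omitted).

# Minimal sketch of the encoder described in [65]: CNN image features and a
# two-layer LSTM question encoder fused by element-wise multiplication.
# Dimensions are illustrative; the sequence-generating decoder is omitted.
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, img_dim=4096, embed_dim=300, hidden=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)                      # project VGG-19 features
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, img_feats, word_embeddings):
        v = torch.tanh(self.img_proj(img_feats))        # (batch, hidden)
        _, (h, _) = self.lstm(word_embeddings)          # word2vec-style embeddings in
        q = h[-1]                                       # final hidden state of the top layer
        return v * q                                    # element-wise multiplication fusion

fused = EncoderSketch()(torch.randn(2, 4096), torch.randn(2, 12, 300))
print(fused.shape)  # torch.Size([2, 128])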
Research Paper No.2
The experiments in this study explored various configurations of neural network architectures and training parameters. Here's a summary of each experiment:

1. VGG19-N128: Utilized VGG-19 for transfer learning with 128 neurons in LSTM networks and dense layers. Trained for 100 epochs.
2. VGG19-N256: Similar to the first experiment but increased neurons to 256 and trained for 200 epochs.
3. VGG19-N256-Dropout: Same architecture as the second run but included dropout of 0.2 in dense layers. Trained for 150 epochs.
4. DenseNet201-N256: Used DenseNet-201 for transfer learning with 256 neurons. Trained for 150 epochs.
5. DenseNet201-N256-D400: Similar to the fourth experiment but increased embedding dimension to 400. Trained for 150 epochs.
6. DenseNet201-N256: Similar to the fifth run but extended training duration to 200 epochs.
7. DenseNet201-N128: Used DenseNet-201 with 128 neurons. Trained for 150 epochs.

No   Model                      BLEU    Strict accuracy
1    VGG19-N128                 0.462   0.15
2    VGG19-N256                 0.433   0.126
3    VGG19-N256-Dropout         0.453   0.142
4    DenseNet201-N256           0.455   0.158
5    DenseNet201-N256-Dropout   0.453   0.16
6    DenseNet201-N256-D400      0.447   0.15
7    DenseNet201-N256           0.301   0.098
8    VGG19-N128                 0.462   0.15
3
Methodology
KEY IDEAS in Methodology

Dataset
The VQA-Med-2019 dataset contains a selection of medical images, including various modalities, planes, anatomical localities, and diagnostic methods. The dataset also includes detailed information on the question categories and formats, covering a range of medical imaging scenarios. The dataset consists of 3,200 images for training and 500 images for validation. The test set undergoes a rigorous validation process, involving manual review by both a medical doctor and a radiologist.

INNOVATIVE SOLUTION
This method involves using the VGG19 network to extract image features and the Multi-modal Factorized Bilinear (MFB) technique to enhance the representation by capturing interactions and dependencies between image features and question features.
Dataset
VQA-Med-2019
The training set of the VQA-Med 2019 dataset consists of 3,200 medical images. These images are associated with a total of 12,792 question-answer (QA) pairs, with an average of 3 to 4 questions per image.
Proposed Method
The method involves several steps for feature extraction and fusion:

Image Feature Extraction:
The image features are extracted using the VGG19 network, which captures the visual information present in the images.

Multimodal Feature Fusion:
The extracted image features and the question features (obtained using BioBERT) are passed through a Multi-modal Factorized Bilinear (MFB) layer.
The MFB layer captures the interactions and dependencies between the image and question features, enhancing the representation of the image features by incorporating cross-modal interactions.

Image Attention Feature Extraction:
After the MFB layer, a series of convolutional (Conv) layers are applied to the fused features.
The Conv layers apply convolutional operations to capture local patterns and spatial relationships within the features.
A ReLU activation function is used to introduce non-linearity and enhance the discriminative power of the features.
Additional Conv layers are employed to extract higher-level representations and abstract the features.

Question Feature Extraction:
In parallel to the image feature extraction, the question features are processed separately.
The question features are passed through a BioBERT model, which applies contextualized word embeddings and captures the semantic meaning of the question.
The question features are then processed through a series of Conv layers, ReLU activation, and a softmax layer to extract and normalize the question representation.

Multimodal Fusion and Classification:
The MFB layer is used to merge the question features with the image attention features, allowing the model to attend to relevant parts of the image based on the question and capture the contextual relationship.
Finally, a softmax layer is applied to the fused features to produce a probability distribution over the classes or categories.
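Putting these steps together, the sketch below is a hedged approximation of the pipeline with assumed layer sizes and answer-vocabulary size; for brevity it fuses the modalities with a plain element-wise product of two projections, while a factorized MFB sketch follows the MFB slide later in this deck.

# Hedged end-to-end sketch of the proposed pipeline: VGG19 image features,
# BioBERT question features, cross-modal fusion, convolutional attention, and a
# classifier. Layer sizes and the answer-vocabulary size are assumptions.
import torch
import torch.nn as nn

class ProposedVQASketch(nn.Module):
    def __init__(self, img_dim=4096, q_dim=768, fused_dim=1000, num_answers=500):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.q_proj = nn.Linear(q_dim, fused_dim)
        # image-attention block: Conv -> ReLU -> Conv over the fused features
        self.attention = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=3, padding=1),
        )
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, vgg_feats, biobert_feats):
        fused = self.img_proj(vgg_feats) * self.q_proj(biobert_feats)   # simplified fusion
        attn = torch.softmax(self.attention(fused.unsqueeze(1)).squeeze(1), dim=1)
        return self.classifier(fused * attn)                            # answer logits

logits = ProposedVQASketch()(torch.randn(2, 4096), torch.randn(2, 768))
print(logits.shape)   # torch.Size([2, 500])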
Proposed Method
Question attention feature extraction
BioBERT
BioBERT is a specialized version of BERT
(Bidirectional Encoder Representations
from Transformers), which is a popular
pre-trained language model developed by
Google. While BERT was originally trained
on general text data, BioBERT is
specifically trained on biomedical text,
making it more adept at understanding
and processing language in the context of
biomedical and clinical domains.
This domain-specific pre-training allows BioBERT to excel in
tasks such as biomedical text mining,
biomedical question answering,
biomedical named entity recognition, and
other related applications. It has been
widely adopted in the biomedical research
community for tasks like biomedical
information extraction, medical diagnosis,
and drug discovery.
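For illustration, a publicly released BioBERT checkpoint can be loaded through Hugging Face Transformers; using the [CLS] embedding as the question vector is an assumption made here for simplicity, not necessarily the exact setup of the proposed model.

# Encoding a medical question with a public BioBERT checkpoint via Hugging Face Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

name = "dmis-lab/biobert-base-cased-v1.1"   # commonly used public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

question = "What imaging modality was used to acquire this image?"
inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]   # (1, 768) [CLS] question vector
print(cls_embedding.shape)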
Question attention feature extraction
The question features are passed through a BioBERT model, which is specifically designed for processing biomedical text. The BioBERT model applies contextualized word embeddings and captures the semantic meaning of the question. After the BioBERT model, a series of layers are applied to the question features. This includes a Conv layer, which performs convolutional operations to capture local patterns in the question representation. A ReLU activation function is then applied to introduce non-linearity. Finally, a softmax layer is employed to normalize the question features and produce a probability distribution over the possible answers. The MFB layer is utilized to merge the question features with the image features during the image attention feature extraction.
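A small sketch of this branch under assumed shapes: a 1-D convolution and ReLU score each BioBERT token, and a softmax converts the scores into attention weights used to pool the question representation.

# Sketch of the question-attention branch: Conv over BioBERT token embeddings,
# ReLU, and a softmax that weights the tokens before pooling. Shapes are assumed.
import torch
import torch.nn as nn

tokens = torch.randn(2, 16, 768)                      # (batch, seq_len, biobert_dim)

conv = nn.Conv1d(768, 1, kernel_size=3, padding=1)    # scores one value per token
scores = torch.relu(conv(tokens.transpose(1, 2)))     # (batch, 1, seq_len)
weights = torch.softmax(scores, dim=2)                # normalized attention over tokens

question_vector = (weights.transpose(1, 2) * tokens).sum(dim=1)   # (batch, 768)
print(question_vector.shape)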
Image attention feature extraction
VGG19
VGG19 is a deep convolutional neural network architecture that was developed by the Visual Geometry Group (VGG) at the
University of Oxford. It is part of the VGG family, which includes various network architectures with different depths (e.g., VGG16,
VGG19).
VGG19 is characterized by its depth, consisting of 19 layers (hence the name). It primarily comprises convolutional layers followed
by max-pooling layers, with fully connected layers at the end. VGG19 is known for its simplicity and uniform architecture, with small
3x3 convolutional filters used throughout the network.
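For reference, pretrained VGG19 features can be extracted with torchvision; taking the 4096-dimensional output of the first fully connected layer is a common choice and an assumption here, not necessarily the layer used in the proposed model.

# Extracting a 4096-d image representation from a pretrained VGG19 with torchvision.
import torch
from torchvision import models

vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
feature_extractor = torch.nn.Sequential(
    vgg19.features,                                   # 16 conv layers with 3x3 filters
    vgg19.avgpool,
    torch.nn.Flatten(),
    *list(vgg19.classifier.children())[:2],           # Linear(25088, 4096) + ReLU
)

# A real image would be resized to 224x224 and normalized with ImageNet statistics first
with torch.no_grad():
    feats = feature_extractor(torch.randn(1, 3, 224, 224))
print(feats.shape)   # torch.Size([1, 4096])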
Image attention feature extraction
VGG19
MFB Layer: Initial step in the Image Attention Feature
Extraction Block.
Convolutional (Conv) Layer: Follows the MFB layer to
process features and extract relevant information.
Local Patterns and Spatial Relationships:
Convolutional operations in the Conv layer capture
these within the features.
ReLU Activation Function: Applied after the Conv
layer to introduce non-linearity, enhancing feature
discriminative power.
Additional Conv Layer: Employed to perform further
feature extraction and capture higher-level
representations.
Abstraction and Summarization: Achieved by the
additional Conv layer through convolutional operations.
Softmax Layer: Applied to obtain the final output,
normalizing features and producing a probability
distribution over classes/categories.
Multi-modal Factorized Bilinear (MFB)
After extracting image features using the VGG network, we can enhance the representation with the Multi-modal Factorized Bilinear (MFB) technique, capturing interactions between image features and other modalities like text or audio.
Starting Point: Begin with extracted image features representing visual information
and encoding high-level patterns.
Other Modalities: Include features from text or audio associated with the image.
MFB Technique: Used for multi-modal analysis and modeling, especially in tasks
involving multiple modalities like images and text.
Objective: Capture interactions and dependencies between modalities by factorizing
joint representation into multiple bilinear subspaces.
Bilinear Transformations: Express interactions as bilinear transformations where
each modality undergoes separate transformation via a matrix, capturing interactions
through element-wise products.
This process creates a more comprehensive representation by incorporating
information from multiple modalities, enabling deeper understanding and analysis in
tasks requiring multi-modal data processing, such as image-text association or audio-
visual synchronization.
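A compact sketch of factorized bilinear pooling in this spirit (the factor k and output dimension o are assumptions): each modality is projected to k*o dimensions, the projections are multiplied element-wise, the result is sum-pooled over the k subspaces, and power and l2 normalization are applied.

# Sketch of multi-modal factorized bilinear (MFB) pooling: project each modality
# to k*o dimensions, multiply element-wise, sum-pool over the factor k, then
# power- and l2-normalize. Dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    def __init__(self, img_dim=4096, q_dim=768, out_dim=1000, factor=5, dropout=0.1):
        super().__init__()
        self.k, self.o = factor, out_dim
        self.img_proj = nn.Linear(img_dim, factor * out_dim)   # image-side projection
        self.q_proj = nn.Linear(q_dim, factor * out_dim)       # question-side projection
        self.drop = nn.Dropout(dropout)

    def forward(self, v, q):
        joint = self.drop(self.img_proj(v) * self.q_proj(q))   # element-wise (bilinear) interaction
        joint = joint.view(-1, self.o, self.k).sum(dim=2)      # sum-pool over the k subspaces
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)   # power normalization
        return F.normalize(joint, dim=1)                       # l2 normalization

fused = MFB()(torch.randn(2, 4096), torch.randn(2, 768))
print(fused.shape)   # torch.Size([2, 1000])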
Multi-modal Factorized Bilinear (MFB)
HyperParameters for Training

Optimizer       Adam
Batch Size      8
Epochs          100
Learning Rate   0.0001
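These hyperparameters map directly onto a standard PyTorch training setup; the model, dataset, and loss below are placeholders.

# The hyperparameters above mapped onto a standard PyTorch training loop.
# The model, dataset, and loss function here are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(4096, 500)                         # placeholder model
dataset = TensorDataset(torch.randn(64, 4096), torch.randint(0, 500, (64,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)   # Batch Size 8

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Optimizer Adam, Learning Rate 0.0001
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):                                   # Epochs 100
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()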


4
Results and Conclusions
Training loss vs Validation loss
The training and validation loss curves clearly demonstrate that the model did not suffer from overfitting, a phenomenon where the model becomes too specialized to the training data and performs poorly on new, unseen data.
Top 1 Accuracy
The conventional accuracy, known as top-1 accuracy, requires the model's prediction with the highest probability to be an exact match with the expected answer. It quantifies the percentage of examples where the predicted label corresponds exactly to the single target label.
Top 5 Accuracy
Top-5 accuracy means that any of the model's five highest-probability answers must match the expected answer.
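Both metrics can be computed directly from the classifier's logits; a small example with assumed shapes:

# Computing top-1 and top-5 accuracy from classifier logits (shapes assumed).
import torch

logits = torch.randn(32, 500)                 # (num_examples, num_answer_classes)
targets = torch.randint(0, 500, (32,))        # ground-truth answer indices

top1 = (logits.argmax(dim=1) == targets).float().mean().item()

top5_preds = logits.topk(5, dim=1).indices    # five highest-probability answers per example
top5 = (top5_preds == targets.unsqueeze(1)).any(dim=1).float().mean().item()

print(f"top-1: {top1:.3f}  top-5: {top5:.3f}")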
BLEU
The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by a machine, such as translations or generated answers, by comparing it to one or more reference texts created by humans. Although BLEU is most commonly used in the context of machine translation, it is also applied to answer-generation tasks such as VQA.
Evaluation and Comparison
The table below presents a comprehensive comparison of our proposed model with other existing models, using BLEU and Accuracy as performance metrics. Our model demonstrates the highest accuracy among the compared models, indicating its superior ability to provide correct answers. Additionally, our model achieves a BLEU score of 0.4, which is considered good in comparison to research paper [64]. However, it is worth noting that paper [65] outperforms all the models, including ours, in terms of BLEU score, indicating its exceptional performance in generating answers that closely align with human references.

Research              BLEU    Accuracy
Our proposed method   0.4     0.7
Research Paper [64]   0.21    0.393
Research Paper [65]   0.462   0.15
An overview of the proposed model's output

The figure displays the output of our model alongside the ground truth for selected images from the VQA-Medical dataset. The figure showcases sample images, associated questions, and the model's output. The model's responses are color-coded: blue indicates a correct answer, while red indicates an incorrect answer in comparison to the ground truth. This visual representation offers an insight into the model's performance by highlighting instances where it accurately predicts the answers (in blue) and where it deviates from the ground truth (in red).
THANKS!

DO YOU HAVE ANY QUESTIONS?
