
Volume 10, Issue 3, March – 2025    International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165    https://doi.org/10.38124/ijisrt/25mar009

Modification and Extension of a Neural Question Answering System with Attention and Feature Variants

Jebaraj Vasudevan

Publication Date: 2025/03/17

Abstract: This paper presents an improved version of the baseline DrQA question answering model on the SQuAD dataset. More specifically, it shows how a single Bi-LSTM model trained only on the SQuAD train dataset achieves a 5-6% performance improvement on both the SQuAD dev set and the Adversarial SQuAD dataset. Different attention mechanisms were also explored to see if they would help to better capture the interactions between the context and the question.

How to Cite: Jebaraj Vasudevan (2025). Modification and Extension of a Neural Question Answering System with Attention and Feature Variants. International Journal of Innovative Science and Research Technology, 10(3), 199-208. https://doi.org/10.38124/ijisrt/25mar009

I. INTRODUCTION

This paper shows how a given baseline model based on the Document Reader question answering model (DrQA) (Chen, Fisch, Weston, & Bordes, 2017) can be modified with carefully added features that boost the model performance not only on the SQuAD dev dataset but also on the Adversarial SQuAD dataset curated by Jia and Liang (2017). Furthermore, some variants of the Aligned attention mechanism are explored, with some modifications, to see if they could improve the baseline model performance.

II. SCOPE

Question answering has gained immense popularity in recent years, especially with the rise of neural networks in the field of natural language processing. Among the many popular datasets is SQuAD1.1 (Rajpurkar, Zhang, Lopyrev, & Liang, 2016), a reading comprehension task in which models answer questions based on passages from Wikipedia. It contains 100,000+ question-answer pairs on 500+ Wikipedia articles.

While several models have surpassed human performance on this task, some models like DrQA (Chen, Fisch, Weston, & Bordes, 2017), which are conceptually simpler than other models, achieved 69.5 EM and 78.8 F1 on the SQuAD dev set. The original model has two components: 1) the Document Retriever, which retrieves the relevant document, and 2) the Document Reader, a machine comprehension-based QA system which extracts answers from the relevant document.

In this paper, I have focused on improving the simplified version of the Document Reader with 18.5 M parameters that is used as the baseline. The simplified version largely follows the Document Reader with minor modifications as follows:

 f_exact, the manually added binary feature that checks whether a context word in the paragraph matches a word in the question, is missing.
 Instead of a three-layer Bi-LSTM, a single-layer Bi-LSTM is used to encode both the question and the paragraph.

This baseline model achieves an EM of 48.25 and an F1 score of 60.43 on the SQuAD dev set when training for 10 epochs while simultaneously evaluating on the SQuAD dev set and stopping when the performance plateaus (early stopping at 2 epochs).

The goal is not only to improve the baseline model on the SQuAD dev set but also on the Adversarial SQuAD dataset, where the baseline model got 37.16 EM and 47.51 F1.

In the adversarial examples (Jia & Liang, 2017), a distracting sentence is added at the end of the passage. The distracting question is generated from the normal question by replacing the nouns and adjectives with antonyms and changing the named entities to the nearest word vector representation in GloVe vector space having the same POS. Then, a "fake" answer is generated for this distracted question, having the same POS type as the true answer, and a sentence containing this answer is appended to the paragraph as the distracting sentence. This is shown in Fig 1. The baseline model, when evaluated with the same model and hyperparameters against the Adversarial SQuAD dataset, gives an EM of 37.16 and an F1 of 47.51, substantially lower, showing that the model does not really generalize to adversarial cases.
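For reference, the EM and F1 figures quoted throughout are the standard SQuAD metrics, computed per example against the gold answers and macro-averaged over the dev set. Below is a minimal sketch of these metrics with illustrative helper names; it is not the official evaluation script.

```python
# Minimal sketch of SQuAD-style exact match (EM) and token-level F1.
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Each question has up to three gold answers; the best score per example is
# kept, then scores are macro-averaged over the dev set.
def best_f1(prediction: str, golds: list[str]) -> float:
    return max(f1_score(prediction, g) for g in golds)
```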


Fig 1 AddSent Adversarial Example Generation

Furthermore, different variants of the attention mechanism were also explored to try to better capture the interactions between the
question and the context.

III. IMPLEMENTATION

 Description
The architecture of the provided baseline model is shown in Fig 2.

Fig 2 Baseline Model Architecture

The first change I implemented was to move the Aligned attention layer, which captures the similarity between the passage words and the question words based on the dot product between nonlinear mappings of word embeddings. Instead of applying the Aligned attention before passing through the RNN network, it was applied after passing through the RNN network, then concatenated with the passage hidden states from the RNN and fed to the Bilinear attention layers, which are now modified to operate along the hidden dimension rather than the embedding dimension as before. This is shown in Fig 3.

Fig 3 Aligned Attention Modification
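To make the modification concrete, the following is a minimal PyTorch sketch of Aligned attention applied over RNN hidden states rather than word embeddings, as in Fig 3. Module and tensor names are illustrative assumptions, not taken from the actual codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedAttention(nn.Module):
    """Soft-aligns each passage position to the question words."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # shared nonlinear mapping

    def forward(self, p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # p: (B, Lp, H) passage states, q: (B, Lq, H) question states
        p_proj = F.relu(self.proj(p))              # (B, Lp, H)
        q_proj = F.relu(self.proj(q))              # (B, Lq, H)
        scores = p_proj @ q_proj.transpose(1, 2)   # (B, Lp, Lq)
        alpha = F.softmax(scores, dim=-1)          # normalize over question
        return alpha @ q                           # (B, Lp, H) aligned rep

# In the modified pipeline (Fig 3), attention runs over hidden states:
#   h_p, h_q = rnn(embed(p)), rnn(embed(q))
#   h_aligned = AlignedAttention(H)(h_p, h_q)
#   span_input = torch.cat([h_p, h_aligned], dim=-1)  # to bilinear layers
```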

The change is partly motivated by the lecture on the Attentive Reader, and by the question of whether there is any difference between the two ways of applying Aligned attention, which overall captures soft alignments between similar but non-identical words (e.g. car and vehicle). However, the results were not promising: when evaluated against the SQuAD dev set and the Adversarial SQuAD set with the same model parameters and hyperparameters, training for the same number of epochs with no shuffling of examples to avoid randomization, the model got an EM of 31.67 and an F1 of 42.64, and an EM of 26.92 and an F1 of 36.52, respectively (see Table 1). This shows that the Aligned attention works best when applied as an embedding to the RNN.

The next change was motivated by the Bidirectional Attention Flow (BiDAF) model (Seo, Kembhavi, Farhadi, & Hajishirzi, 2017), where, in addition to a Context2Query attention like the Aligned attention in the DrQA paper, a Query2Context attention is also used; this was implemented as shown in Fig 4. Since the architecture of the BiDAF attention layer is very different from the DrQA paper, the Query2Context attention cannot be replicated as it is implemented in BiDAF. Hence, the Query2Context attention was implemented very similarly to the Aligned attention in DrQA. Here an attention score $a_{i,j}$ is computed between passage words $p_i$ and question words $q_j$ as shown in Equation (1), and the query-to-context embedding $f_{query2context}$ is computed as shown in Equation (2).

$a_{i,j} = \mathrm{Softmax}_i\big(\alpha(E(p_i)) \cdot \alpha(E(q_j))\big)$    (1)

$f_{query2context} = \sum_i a_{i,j}\, E(p_i)$    (2)


Fig 4 Query2Context Attention Implementation
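A minimal PyTorch sketch of Equations (1) and (2) is given below, assuming $E(\cdot)$ denotes word embeddings and $\alpha$ a shared single-layer nonlinear projection; the module name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Query2Context(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # alpha(.): shared nonlinear mapping applied to both embeddings
        self.alpha = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, p_emb: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # p_emb: (B, Lp, d) passage embeddings, q_emb: (B, Lq, d) question embeddings
        scores = self.alpha(p_emb) @ self.alpha(q_emb).transpose(1, 2)  # (B, Lp, Lq)
        a = F.softmax(scores, dim=1)       # Eq. (1): softmax over passage index i
        # Eq. (2): attention-weighted sum of passage embeddings per query word
        return a.transpose(1, 2) @ p_emb   # (B, Lq, d)
```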

However, the results did not show any improvement over the baseline model: when evaluated against the SQuAD dev set and the Adversarial SQuAD set with the same model parameters and hyperparameters, training for the same number of epochs with no shuffling of examples to avoid randomization, the model got an EM of 47.53 and an F1 of 59.28, and an EM of 35.93 and an F1 of 46.15, respectively. This shows that the Query2Context attention did not capture any interaction between the context words and the question more meaningful than the Aligned attention in the baseline model.

Since modifying the attention mechanism did not seem to improve the performance of the baseline model, the next change concentrated on carefully chosen manual features. The following feature vectors, inspired by the DrQA model, were added to the baseline model:

 $f_{exact}(p_i)$: a binary feature that checks if a context word in the paragraph matches a word in the question in its exact or lemma form
 $TF(p_i)$: term frequency of the context word

One can intuitively see that these manual features would also help in the adversarial case: there, it is less likely for a context word to appear in the question, and, as explained before, the adversarial cases have swapped-out words which would have a lower TF compared to the original words in the context. Based on this intuition and the report from Yerukola and Kamath (2018), one more manual feature is added to the model:

 $\overline{NER}(p_i)$: the context word is a named entity that does not appear in the question

These features immediately improved the model performance: evaluated against the SQuAD dev set and the Adversarial SQuAD set with the same model parameters and hyperparameters, training for the same number of epochs with no shuffling of examples to avoid randomization, the model got an EM of 50.38 and an F1 of 62.38, and an EM of 39.73 and an F1 of 50.72, respectively. This improved the baseline model performance on both datasets by around 2-3%, which is a nice improvement.
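The following sketch illustrates how such per-token features could be computed, here using spaCy for lemmas and named entities; the helper names are assumptions for illustration, and the actual feature extraction may differ in detail.

```python
# Minimal sketch of the three hand-crafted token features.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def token_features(passage: str, question: str):
    p_doc, q_doc = nlp(passage), nlp(question)
    q_words = {t.lower_ for t in q_doc}
    q_lemmas = {t.lemma_ for t in q_doc}
    counts = Counter(t.lower_ for t in p_doc)
    n = len(p_doc)
    feats = []
    for t in p_doc:
        # f_exact: exact or lemma match against the question
        f_exact = float(t.lower_ in q_words or t.lemma_ in q_lemmas)
        # TF: normalized term frequency of the context word
        tf = counts[t.lower_] / n
        # NER-bar: named entity that does not appear in the question
        ner_not_in_q = float(t.ent_type_ != "" and t.lower_ not in q_words)
        feats.append((f_exact, tf, ner_not_in_q))
    return feats  # one (f_exact, TF, NER-not-in-question) triple per token
```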

Finally, instead of using ReLU activation for the Aligned attention, Tanh is chosen, which helps to bound the gradients since Tanh scales the output to ±1. This final change gave the model a slight improvement of ~0.5% over the previous model. Adding 3 Bi-LSTM layers with a reduced hidden dimension of 128 to encode both passage and question, using similar hyperparameters as in the DrQA model, improved the model further. This improved model has a total of 17 M parameters, close to the baseline model, and was able to achieve 54.27 EM and 66.16 F1 on the SQuAD dev set and 41.97 EM and 51.61 F1 on the Adversarial SQuAD set, a good 5-6% improvement over the baseline model. Since this is the best performing model, it is chosen for further analysis in Section IV below.
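A minimal sketch of this final encoder configuration is given below; the embedding dimension, dropout value, and module names are illustrative assumptions rather than the exact training configuration.

```python
import torch.nn as nn

class StackedEncoder(nn.Module):
    """3-layer Bi-LSTM (hidden size 128) shared by passage and question."""
    def __init__(self, emb_dim: int = 300, hidden: int = 128, layers: int = 3):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True, dropout=0.3)

    def forward(self, x):
        out, _ = self.rnn(x)   # (B, L, 2 * hidden)
        return out

# Tanh-bounded projection replacing ReLU inside the Aligned attention; Tanh
# keeps the outputs (and hence the gradients through them) within [-1, 1].
tanh_align_proj = nn.Sequential(nn.Linear(300, 300), nn.Tanh())
```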

A comparative performance of all the models tested is shown in Table 1.

Table 1 Comparative Performance of the Different Models (best model in the last row)

Model                      SQuAD Dev (EM; F1)   Adversarial (EM; F1)
Baseline                   48.25; 60.43         37.16; 47.51
Modified Aligned attn.     31.67; 42.64         26.92; 36.52
Query2Context attn.        47.53; 59.28         35.93; 46.15
Add features               50.38; 62.38         39.73; 50.72
ReLU to Tanh               50.59; 62.75         39.79; 50.12
Added layers               54.27; 66.16         41.97; 51.61

IV. ANALYSIS

Since the SQuAD authors collected three gold answers for every question, the gold answers have varying lengths even for the same question. Fig 5 shows the spread of the maximum and minimum number of words in the gold answers of the SQuAD training set. As one can see, in both cases more than 90% of the answers are short answers of fewer than 5 words, and the window size of 15 used for span extraction seems to be a good choice.

Fig 5 Ground Truth Answer Span Length Distribution

Fig 6 Model Predicted Answer Span Length Distribution
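For reference, a minimal sketch of span extraction with the window size of 15 mentioned above: choose the (start, end) pair maximizing P(start) * P(end) subject to the span being at most 15 tokens long. The function name and inputs are illustrative.

```python
import torch

def extract_span(p_start: torch.Tensor, p_end: torch.Tensor, max_len: int = 15):
    # p_start, p_end: (L,) start/end probabilities over context positions
    L = p_start.size(0)
    best, best_span = -1.0, (0, 0)
    for s in range(L):
        # only consider end positions within the max_len window
        for e in range(s, min(s + max_len, L)):
            score = (p_start[s] * p_end[e]).item()
            if score > best:
                best, best_span = score, (s, e)
    return best_span  # token indices of the predicted answer span
```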

One can also clearly see from Fig 6 that the model-predicted answer spans have a distribution similar to that of the gold answer span lengths.

While the macro-averaged F1 score for the best model on the SQuAD dev set is 66.16, it does not explain how the F1 scores vary per example. Fig 7 shows the spread of the F1 score for each example. One can clearly see that there are several occurrences of a 0 F1 score, which drags down the overall macro-averaged F1 score. Improving these cases would help to improve the overall F1 score of the model.

Fig 7 F1 Score Distribution (Majority of Loss is due to 0 F1 Score)

Next, when we plot the average F1 score of the model against the gold answer span lengths, we see an interesting pattern, as shown in Fig 8. One can clearly see that as the length of the answer span increases, the F1 score of the model gradually drops. This indicates that the model is better at finding short answers than long answers in the context. Improving the model's predictions on longer answer spans would help to improve the F1 score even more.

Fig 8 Mean F1 Score Variation against Ground Truth Answer Span Length

Another interesting comparison is shown in Fig 9. Here, the average F1 score of the model is compared against the first word of the question in the SQuAD dev set for some common question types.

Fig 9 Mean F1 Score Variation for Common Question types (based on First Word of a Question)

It clearly shows that the model can capture answers for "When" and "Who" questions, which typically involve named entities and shorter context, but struggles on "Why" questions, which require deeper logical understanding and longer context. This suggests that modifying the model to better understand the context would improve its performance on questions that require logical understanding, though it is not immediately clear how to achieve that.

A visualization of the Aligned attention embedding layer for a question-context pair is shown in Fig 10. This attention layer signifies which context word is most important to a query word. One can clearly see how the query word "many" attends to the context words "one", "four", and "appearances". This shows that the attention mechanism works quite well for this example.


Fig 10 Aligned Attention Layer Visualization for a Context-Question Pair
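A plot like Fig 10 can be produced as a simple heatmap over the attention weights. The following matplotlib sketch assumes `attn` is a (question length x context length) array of attention weights and is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(attn, q_tokens, c_tokens):
    # attn: (len(q_tokens), len(c_tokens)) matrix of attention weights
    fig, ax = plt.subplots(figsize=(12, 4))
    im = ax.imshow(np.asarray(attn), aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(c_tokens)))
    ax.set_xticklabels(c_tokens, rotation=90, fontsize=7)
    ax.set_yticks(range(len(q_tokens)))
    ax.set_yticklabels(q_tokens, fontsize=8)
    ax.set_xlabel("context words")
    ax.set_ylabel("question words")
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()
```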

Fig 11 shows how the probability mass for the start and end index of the predicted span accumulates over the context word "four", which is the gold answer in this example.

Fig 11 Start-end Probabilities over the Context
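For context, start and end distributions of this kind come from bilinear attention layers in the DrQA style, where the question is summarized into a single vector and scored against each passage position. The following is a minimal sketch with illustrative names, not the exact implementation used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearSpanScorer(nn.Module):
    def __init__(self, p_dim: int, q_dim: int):
        super().__init__()
        self.W_start = nn.Linear(q_dim, p_dim, bias=False)
        self.W_end = nn.Linear(q_dim, p_dim, bias=False)

    def forward(self, p: torch.Tensor, q: torch.Tensor):
        # p: (B, Lp, p_dim) passage states, q: (B, q_dim) question summary
        s_start = (p @ self.W_start(q).unsqueeze(-1)).squeeze(-1)  # (B, Lp)
        s_end = (p @ self.W_end(q).unsqueeze(-1)).squeeze(-1)      # (B, Lp)
        # probability mass over context positions, as visualized in Fig 11
        return F.softmax(s_start, dim=-1), F.softmax(s_end, dim=-1)
```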

Finally, a contingency table comparing the exact-match performance of the baseline model against the improved model on the SQuAD dev set is shown in Table 2.

Table 2 Exact-Match Contingency of the Baseline vs the Improved Model (SQuAD dev)

                      Improved Correct   Improved Incorrect
Baseline Correct      39.7%              8.5%
Baseline Incorrect    14.5%              37.2%

It shows that though the improved model gains ~5% in overall EM compared to the baseline model, it still incorrectly predicts 8.5% of the examples that are correctly predicted by the baseline model. This shows that a theoretical best-of-both-models ensemble would achieve an even higher EM than the improved model. Also, both models incorrectly predict 37.2% of the examples, showing that there is still lots of room for improvement.
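Such a table can be computed by cross-tabulating per-example exact-match outcomes of the two models; a minimal sketch, assuming aligned boolean inputs, follows.

```python
def em_contingency(base_em, improved_em):
    # base_em, improved_em: per-example EM booleans aligned over the dev set
    n = len(base_em)
    cells = {(b, i): 0 for b in (True, False) for i in (True, False)}
    for b, i in zip(base_em, improved_em):
        cells[(b, i)] += 1
    # percentage of examples in each (baseline, improved) correctness cell
    return {k: round(100.0 * v / n, 1) for k, v in cells.items()}
```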

V. CONCLUSION

This report showed how different attention variants were explored to understand their impact on model performance, and how carefully chosen manual features together with additional LSTM layers helped to boost the baseline model performance while keeping a similar number of parameters to the baseline model. Furthermore, a detailed analysis was performed to show the strengths and weaknesses of the improved model and to suggest potential areas of focus for improving the model performance in the future.

REFERENCES

[1]. Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017).
Reading Wikipedia to Answer Open-Domain
Questions. Association for Computational Linguistics
(ACL).
[2]. Jia, R., & Liang, P. (2017). Adversarial Examples for
Evaluating Reading Comprehension Systems.
Empirical Methods in Natural Language Processing
(EMNLP).
[3]. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P.
(2016). SQuAD: 100,000+ Questions for Machine
Comprehension of Text. Empirical Methods in Natural
Language Processing (EMNLP).
[4]. Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H.
(2017). Bidirectional Attention Flow for Machine
Comprehension. The International Conference on
Learning Representations (ICLR).
[5]. Yerukola, A., & Kamath, A. (2018). Adversarial
SQuAD.

