
https://doi.org/10.3011/ESARDA.IJNSNP.2021.10
ESARDA BULLETIN, No. 63, December 2021

Artificial Judgement Assistance from teXt (AJAX): Applying Open Domain Question Answering to Nuclear Non-proliferation Analysis

Benjamin Wilson, Kayla Duskin, Megha Subramanian, Rustam Goychayev and Alejandro Michel Zuniga
Pacific Northwest National Laboratory
902 Battelle Blvd, Richland, WA USA
Raleigh, NC 27695
[email protected] and [email protected]

Abstract: Nuclear non-proliferation analysis is complex and subjective, as the data is sparse and examples are rare and diverse. While analysing non-proliferation data, it is often desired that the findings be completely auditable, such that any claim or assertion can be sourced directly to the reference material from which it was derived. Currently this is accomplished by analysts thoroughly documenting underlying assumptions and clearly referencing details to source documents. This is a labour-intensive and time-consuming process that can be difficult to scale with geometrically increasing quantities of data. In this work, we describe an approach to leverage bi-directional language models for nuclear non-proliferation analysis. It has been shown recently that these models capture not only language syntax but also some of the relational knowledge present in the training data. We have devised a unique Salt and Pepper strategy for testing the knowledge present in the language models, while also introducing an auditability function in our pipeline. We demonstrate that fine-tuning the bi-directional language models on a domain-specific corpus improves their ability to answer domain-specific factoid questions. Our hope is that the results presented in this paper will further the natural language processing (NLP) field by introducing the ability to audit the answers provided by the language models and bring forward the source of said knowledge.

Keywords: natural language processing, open domain question answering, bi-directional language models, nuclear proliferation detection

1. Introduction

Recently, pre-trained language model representations such as Bidirectional Encoder Representations from Transformers (BERT) [1] have gained extensive attention in the NLP community and have led to impressive performance in several downstream applications. While the applications leveraging the knowledge present in the parameters of these models are growing at a rapid pace, there has also been a lot of research into probing the knowledge contained in these language models. In [2], the authors demonstrate an approach of using fill-in-the-blank type statements to query the language models. The authors claim that the "surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain Question Answering (QA) systems". The adoption of language models as knowledge bases has also shown several advantages; a survey [3] documenting the increasing competence of language models suggests that they are becoming increasingly better at tasks such as natural language understanding, question comprehension and knowledge gap completion. Additionally, publications such as [4], [5] and [6] support the usage of BERT models specifically for QA tasks.

In this work, we are interested in leveraging the BERT model for open-domain question answering in the nuclear domain. Our focus is to develop techniques and methodologies that will help with nuclear non-proliferation analysis, which is otherwise an extremely time-consuming process. Nuclear analysts generally go through large volumes of text for specific tasks. We believe that developing tools that leverage language models for tasks such as (nuclear) domain-specific QA will greatly assist nuclear analysts.

Pre-trained language models that have been trained on articles from Wikipedia are unlikely to contain nuclear domain-specific knowledge. Hence, as a first step, we fine-tune these models on a domain-specific corpus. Section 2 describes our unique Salt and Pepper strategy for generating a nuclear domain-specific corpus. In section 3, we show that the models fine-tuned on this corpus are much better at answering nuclear domain-specific factoid questions than the pre-trained models.


Auditability of a language model can be an important part of an analytic process, especially when it relates to data which is normally prepared by an analyst, as the analysis must point to the evidence accompanying the analytic findings. Most Machine Learning (ML) models do not contain this trail of evidence and are often referred to as "black boxes". The basic idea of auditability is to retrieve the documents from the training corpus that contain evidence for the model's answer. Our approach to auditability is to first convert the questions and the context paragraphs into embedding vectors (real-valued vectors that represent the individual words in a predefined vector space). We experiment with approaches such as the TF-IDF vectorizer [7] and Sentence BERT [8] to compute the embedding vectors. The embedding vector of the context paragraph that contains evidence for the answer will be closest to the embedding vector of the question in the vector space. Our detailed methodology for these approaches and the technical results are summarized in section 4 of the paper.

2. Experimental Set-Up

2.1 Data creation

We used the Stanford Question Answering Dataset (SQuAD) as the starting point for building out the dataset which would later be used in our experimentation. SQuAD is "a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable" [9]. Included in SQuAD are columns for context paragraphs, subject entities, and document IDs. This data contains over 20,000 rows, which comprise the entire original SQuAD dataset. The next step is to "Salt" subject-specific paragraphs by adding domain-specific sentences into randomly selected subject-specific paragraphs, to introduce the knowledge we would later probe.
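A minimal sketch of one way to load these columns, assuming the HuggingFace `datasets` package and the SQuAD 2.0 release cited in [9] (the exact data pipeline is not shown here):

```python
# Sketch only: one way to obtain the SQuAD columns described above, assuming the
# HuggingFace `datasets` package (the actual data-preparation code is not shown).
from datasets import load_dataset

squad = load_dataset("squad_v2", split="train")
print(squad.column_names)   # ['id', 'title', 'context', 'question', 'answers']

# Collect the unique (subject entity, context paragraph) pairs that will be "Salted".
contexts = {(row["title"], row["context"]) for row in squad}
print(f"{len(squad)} QA rows over {len(contexts)} unique context paragraphs")
```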

2.2 Salt: Terms in Context

The process for "Salting" starts with the creation of five lists relating to the domain-specific subject. The five lists, or "Items Lists", contain items which are derived from authorities relating to the subject matter. These include:

• Nuclear Weaponization; [10]
• Nuclear Fuel Fabrication; [11]
• Nuclear Gas Centrifuge; [12]
• Methamphetamine; [13] and finally,
• Silly Stuff – our own creation of unrelated words.

These items are then randomly selected from the list and populated into another list known as "Carrier-Sentences". The sentences in the "Carrier-Sentences" list contain the subject, item, and location as fill-in-the-blank placeholder tokens. For example, one of the sentences in our "Carrier-Sentences" list is "[WHO] also provided information on the [Y] research and development activities at [WHERE]".

The [WHO] in the sentence is replaced by a chosen token representing an individual from the SQuAD dataset. The [WHERE] in the sentence is replaced by a chosen token representing a location from the SQuAD dataset. Finally, the [Y] in the sentence is replaced by a random item from the "Items Lists". Figure 1 provides an example of the salting process for one of the five categories. There are five different [WHO]s, [WHERE]s, and [Y]s, created to correspond with the different domains listed above. Additionally, when compiling these sentences, we also indicate how much "Salt" to add to the SQuAD dataset for each domain.

For each specified [WHO] or [WHERE] paragraph section within the SQuAD dataset, the "Salting" code gathers the [WHO] or [WHERE] sections into lists, selects a random paragraph from each, and splits it into sentences. A "Salt" sentence is then inserted at a random location (between the split sentences) in the paragraph, and the paragraph is recombined. This process occurs for each of the five subject-specific "Items Lists". Once the specified number of paragraphs is "Salted" for each list, they are normalized and recombined with the remaining SQuAD paragraphs.

Figure 1: Flowchart depicting the Salting process.
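A minimal sketch of the Salting step described above; the carrier template, item list and helper name `salt_paragraph` are illustrative stand-ins rather than the actual Salting code:

```python
# Illustrative sketch of the Salting step (not the actual code): fill a carrier
# sentence with a [WHO], [WHERE] and item [Y], then insert it at a random
# sentence boundary of a selected SQuAD context paragraph.
import random
import re

CARRIER = "{who} also provided information on the {item} research and development activities at {where}."

def salt_paragraph(paragraph: str, who: str, where: str, item: str) -> str:
    """Return the paragraph with one "Salt" sentence inserted between its sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    salt = CARRIER.format(who=who, where=where, item=item)
    sentences.insert(random.randint(0, len(sentences)), salt)
    return " ".join(sentences)

# Toy example; the item list here is a placeholder for the five "Items Lists".
items = ["gas centrifuge", "bellows seal", "fuel fabrication equipment"]
print(salt_paragraph(
    "Boston is the largest city in Massachusetts. It was founded in 1630.",
    who="Tito", where="Boston", item=random.choice(items),
))
```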


2.3 Pepper: Terms Without Context

In subsequent trials, our team decided to add meaningless sentences, or "Pepper", into the dataset to eliminate accidental knowledge recall. The "Pepper" sentences utilize the same [Y] as in the "Salted" sentences but without any mention of the [WHO] or the [WHERE].

For example, one of the meaningless sentences is "[Y] is more expensive than previously understood." Once the meaningless sentences are created, the code filters all the previously "Salted" SQuAD data and ignores the "Salted" sentences, in order to avoid "Peppering" the "Salted" sentences. The code then "Peppers" the unSalted [WHO] or [WHERE] paragraphs at random. The "Pepper" is added to eliminate any possibility that an item mentioned in the text is being recalled merely because unrelated information surrounds it in the text; we could then be confident that the item is recalled by the model based on robust knowledge retention.

2.4 Train: Domain Informed Probes and Benchmarks

Once the data is prepared, we create two model versions for our experiment: (1) a fine-tuned version of the BERT base model; and (2) a standard, pre-trained BERT model. Both of these models are compared when evaluating the performance of the Salting technique.

Our approach for developing the fine-tuned language model involved training BERT with a batch size of 8 and dropout of 0.1; this means that for a training set consisting of about 20,000 textual examples, the model parameters are updated roughly 2,500 times per epoch. We further initialize training with an initial learning rate of 0.00005, following a linear learning rate schedule without weight decay. Finally, network weights were updated using the Adam optimizer. We selected this training protocol after much trial and error, and we found these particular settings to produce the most fruitful model for our experiments.

All pre-trained models are obtained through the Python HuggingFace Transformers library [14]. The fine-tuned models are trained within an Azure Databricks environment using a single GPU instance (NVIDIA Tesla V100 GPU) from an NCv3-series virtual machine. As a baseline comparison, we also evaluate the performance of the stand-alone BERT base model, without any fine-tuning.
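A minimal sketch of this fine-tuning set-up with the HuggingFace Trainer, using the reported batch size, learning rate and linear schedule without weight decay; the epoch count, the 15% masking probability and the output path are illustrative assumptions, and Trainer's default AdamW optimizer stands in for the Adam optimizer mentioned above:

```python
# Sketch of the fine-tuning step in section 2.4 (an approximation, not the exact
# script). BERT-base keeps its default dropout of 0.1; batch size, learning rate
# and the linear schedule follow the text, while epochs and masking rate are assumed.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

salted_paragraphs = ["..."]  # the ~20,000 Salted (and Peppered) SQuAD paragraphs
dataset = Dataset.from_dict({"text": salted_paragraphs}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="ajax-bert-salted",        # hypothetical name
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    weight_decay=0.0,
    num_train_epochs=3,                   # assumption: not reported in the paper
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("ajax-bert-salted")
```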
2.5 Query: Language Model Probing

Language model probing is a way to assess the quality of the trained model by testing it against sample questions. The process for probing begins with defining a test question containing {tokenizer.mask_token} as the mask token, indicating which part of the sentence needs to be determined by the language model. The probe then looks at the top k tokens predicted by the language model for the {tokenizer.mask_token} in the test question. The value of k can range from 1 to the maximum number of tokens in the BERT vocabulary (30,522). It is often beneficial to look at more than just the top 1 token predicted by the model; in all the results presented in section 3, the value of k is 10. These tokens are then decoded into associated words using the tokenizer.decode function.
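A minimal sketch of this probing loop; the model path is the hypothetical output of the fine-tuning sketch in section 2.4, and the probe is one of the questions listed in Table 1 (section 3):

```python
# Sketch of the probing step in section 2.5: fill a cloze probe with the tokenizer's
# mask token, take the model's top-k predictions at that position and decode them.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("ajax-bert-salted")  # hypothetical fine-tuned model
model.eval()

probe = f"bellows seal is fabricated at {tokenizer.mask_token}"
inputs = tokenizer(probe, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]   # scores over the 30,522-token vocabulary

top = logits.softmax(dim=-1).topk(k=10)            # k = 10, as used in section 3
for prob, token_id in zip(top.values[0], top.indices[0]):
    print(f"{tokenizer.decode([int(token_id)]):<12s} {prob.item():.3f}")
```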


3. Results

For the purposes of evaluating our language model, we developed a set of cloze-style probe questions. Table 1 below lists some probe questions that are used to test the models. The response of the model to <tokenizer.mask_token> is treated as the predicted answer. We clearly see from Table 1 that the fine-tuned models that have been trained on domain-specific data are much better suited for domain-specific knowledge extraction. They not only provide the right answer to the probe question, but also associate those answers with a high probability.

| Probe Question | Pre-Trained Model Answer (Predicted Probability) | Fine-Tuned Model Answer (Predicted Probability) | Correct Answer |
| bellows seal is fabricated at <tokenizer.mask_token> | Mt (0.13) | Boston (0.84) | Boston |
| hydrogen sulphide is produced at <tokenizer.mask_token> | pH (0.06) | Detroit (0.74) | Detroit |
| bellows seal is developed at <tokenizer.mask_token> | Approx. (0.08) | Boston (0.95) | Boston |
| cylindrical rotors is located at <tokenizer.mask_token> | Approx. (0.15) | Houston (0.41) | Houston |
| bellows seal is owned by <tokenizer.mask_token> | Google (0.016) | Tito (0.46) | Tito |
| hydrogen sulphide was designed by <tokenizer.mask_token> | Siemens (0.04) | Whitehead (0.73) | Whitehead |

Table 1. Probing Results of Pre-Trained and Fine-Tuned Models.

To quantitatively assess the performance of fine-tuned language models for question answering, we performed several evaluations, which are illustrated in this section. In each of the evaluations, we considered the top 10 recall to be the performance metric. Apart from quantitative assessment, these evaluations also shed light on several qualitative aspects of the performance of language models for question answering, which are discussed below.

3.1 Probing Methodology

We experimented with several probing strategies for knowledge extraction from the language models. Figure 2 shows the performance of the different fine-tuned models for the different probing strategies. The blue curve, which shows the least recall ability of the language models, corresponds to the strategy of probing only the fine-tuned models. We observed that probing the fine-tuned models and computing the difference of the results with the pre-trained models (shown by the orange curve in Figure 2) gives a boost to the recall metric. The best performance is shown by the red curve in the figure, which corresponds to the strategy of probing the fine-tuned models, rolling up the responses across the different probe questions and computing the difference with the pre-trained models. Overall, our results show that the probing strategy is a critical factor that influences the recall ability of the language models.

Figure 2. Effect of Probing Strategies on Performance.

3.2 Performance Comparison on WHO and WHERE Questions

Figures 3 and 4 show that the way we Salt the SQuAD database also affects the performance of the language models. Specifically, we find that Salting the WHO paragraphs leads to a better performance on the WHERE probe questions and vice-versa. It appears from these figures that the performance of language models as knowledge bases, and the way they form semantic associations between the different tokens, can be greatly influenced by the Salting strategy of the training corpus.

Figure 3. Model Performance – Salting WHO Paragraphs.
Figure 4. Model Performance – Salting WHERE Paragraphs.

3.3 Performance on Salt and Pepper Data

As mentioned earlier, we also experimented with adding "Pepper" sentences (sentences that are out of context) to our training corpus. Figure 5 shows the performance of the language models that are trained on this corpus. For this experiment we "Salted" and "Peppered" the WHERE paragraphs. As expected, the performance on the WHO probe questions is better. Additionally, comparing Figures 3 and 4, we see that the presence of "Pepper" sentences does not deteriorate the recall ability of the language models. This shows that the language models are robust to the presence of the confusing "Pepper" sentences in the training corpus.

Figure 5. QA Model Performance – Salting & Peppering WHERE Paragraphs.
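The curves in Figures 2 to 5 report a top-10 recall over the probe questions; since the roll-up is described only at a high level, the following is an assumed, simplified version of such a metric:

```python
# Assumed, simplified top-k recall over a set of probe questions: the fraction of
# probes whose correct answer appears among the model's top-k decoded tokens.
def top_k_recall(predictions: dict[str, list[str]], answers: dict[str, str], k: int = 10) -> float:
    hits = sum(
        answers[probe].lower() in (token.lower() for token in tokens[:k])
        for probe, tokens in predictions.items()
    )
    return hits / len(predictions)

# Toy example with two probes from Table 1.
predictions = {
    "bellows seal is fabricated at [MASK]": ["boston", "seattle", "mt"],
    "hydrogen sulphide is produced at [MASK]": ["ph", "water", "air"],
}
answers = {
    "bellows seal is fabricated at [MASK]": "Boston",
    "hydrogen sulphide is produced at [MASK]": "Detroit",
}
print(top_k_recall(predictions, answers))   # 0.5: one of the two probes is recalled
```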


4. Audit

Auditability is a way to provide more insight into how the model predicted a particular answer, so as to have an end-to-end analytical process. The basic idea of the auditability process is to look for similarities between the embedding vectors of the questions and those of the contexts in the corpus. The contexts which are most similar to the questions are then retrieved. To generate the embeddings, we experimented with the three techniques that are described below.

4.1 TF-IDF Vectorizer

Term Frequency–Inverse Document Frequency (TF-IDF) is a popular technique to transform textual data into a meaningful numeric representation. Algorithmically, TF-IDF assigns high weights to those words that are frequent in a document but not across all the documents in a corpus. For our experiments, we used the TF-IDF Vectorizer from the scikit-learn library [15] to obtain the embeddings of contexts and questions. The TF-IDF Vectorizer tokenizes the documents and learns the vocabulary and inverse document weights, while also helping to encode new documents. We use cosine similarity as a distance metric in our experiments.
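A minimal sketch of this retrieval step, using scikit-learn's TfidfVectorizer and cosine similarity over an illustrative three-paragraph corpus:

```python
# Sketch of the TF-IDF auditability retrieval in section 4.1: embed the context
# paragraphs and the question, then return the context closest to the question
# by cosine similarity. The corpus and question here are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contexts = [
    "Boston is where the bellows seal is fabricated.",
    "Detroit produces hydrogen sulphide for industrial use.",
    "Houston hosts a facility for cylindrical rotors.",
]
question = "Where is the bellows seal fabricated?"

vectorizer = TfidfVectorizer()
context_vectors = vectorizer.fit_transform(contexts)   # learn vocabulary and IDF weights
question_vector = vectorizer.transform([question])     # encode the new "document"

scores = cosine_similarity(question_vector, context_vectors)[0]
best = scores.argmax()
print(f"Top-1 retrieved context: {contexts[best]!r} (cosine similarity {scores[best]:.2f})")
```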
4.2 BERT Embeddings

We also experimented with a transfer learning approach by leveraging a pre-trained BERT model to obtain embeddings for both the contexts and the questions. A pre-trained BERT model provides embeddings for every token in a paragraph. We used the average of the embeddings of all the tokens in the different BERT layers as a representative embedding for both the contexts and the questions. From our experiments, we found that averaging the tokens from the 6th layer gave the best performance. These results are summarized in Table 2.

4.3 Sentence BERT Embeddings

Finally, we investigated the use of the Sentence BERT architecture to obtain the embeddings for both the contexts and the questions. Sentence BERT is a modification of the off-the-shelf BERT architecture that computes semantically meaningful sentence embeddings. Of all the Sentence BERT architectures, we found that 'distilroberta-base-paraphrase-v1' gave us the best results. These results are summarized in Table 2.

4.4 Results

The first row in Table 2 below shows the auditability results on the unSalted SQuAD dataset. We used the development set of the SQuAD database for this evaluation. Overall, the evaluation set had 182 questions, each of which had exactly one correct context paragraph that contained the answer. The auditability task was then to retrieve the context paragraph that contained the correct answer for every question.

Additionally, the second row in Table 2 shows the auditability results on the Salted SQuAD dataset. For these evaluations, we used 85 questions and a set consisting of 160 Salted context paragraphs. For each of the 85 questions, there were 32 Salted paragraphs that contained the correct answer. The auditability task was then to retrieve one of the 32 correct Salted paragraphs for every question.

| Dataset | TF-IDF | BERT tokens averaged across 6th layer | Sentence BERT |
| UnSalted SQuAD | 0.91 | 0.74 | 0.91 |
| Salted SQuAD | 1.0 | 0.27 | 0.82 |

Table 2. Auditability metrics (Top 1 Recall).
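For comparison with the Table 2 numbers, a minimal sketch of the Sentence BERT variant of the same retrieval, using the sentence-transformers library; the hub identifier below is assumed to correspond to the 'distilroberta-base-paraphrase-v1' model named in section 4.3:

```python
# Sketch of the Sentence BERT retrieval variant (section 4.3). The model id below
# is assumed to be the paraphrase DistilRoBERTa checkpoint referenced in the text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

contexts = [
    "Boston is where the bellows seal is fabricated.",   # same illustrative corpus as in 4.1
    "Detroit produces hydrogen sulphide for industrial use.",
    "Houston hosts a facility for cylindrical rotors.",
]
question = "Where is the bellows seal fabricated?"

context_embeddings = model.encode(contexts, convert_to_tensor=True)
question_embedding = model.encode(question, convert_to_tensor=True)

scores = util.cos_sim(question_embedding, context_embeddings)[0]  # cosine similarity to every context
best = int(scores.argmax())
print(contexts[best], float(scores[best]))
```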


5. Conclusion and Future Work

In this paper we demonstrated a method for testing the ability of language models to answer nuclear domain-specific questions, while simultaneously introducing an auditability function in the pipeline. Our results demonstrate that language models that have been fine-tuned on a domain-specific corpus are much better suited for domain-specific knowledge extraction than the pre-trained models. We have also shown that the probing methodology and the "Salting" strategy can greatly influence the ability of language models to answer domain-specific factoid questions. We have consistently observed that Salting the WHO paragraphs gives a better performance on WHERE questions and Salting the WHERE paragraphs gives a better performance on WHO questions. We think that this difference in performance is mainly due to the different Salting strategies; it appears that the way language models form semantic associations between tokens greatly depends on how we salt the corpus. In the future we would like to probe into the multi-headed attention layers of these models to better understand this observation.

For the task of auditability, we only presented results on a subset of the corpus in this paper (Table 2). In future research, we would be interested in evaluating the auditability technique on the entire "Salted" SQuAD database. We suspect this would be a particularly challenging task for document retrieval, since the entire SQuAD database consists of more than 20,000 context paragraphs. We think that further fine-tuning the Sentence BERT models on the Salted SQuAD database, and then computing the embeddings for the questions and the context paragraphs, would be beneficial in that case.

An opportunity that is open for future research is to leverage language models like NukeLM [16] that have been pre-trained on nuclear domain data. Another area that could be further explored is the use of models like ExBERT [17], which facilitate the inclusion of nuclear domain-specific words in the vocabulary of the model for the task of domain-specific question answering.

6. Acknowledgements

This research was supported by the Laboratory Directed Research and Development Program and the Mathematics for Artificial Reasoning for Scientific Discovery investment at the Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RLO1830.

7. References

[1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805 [cs], May 24, 2019. http://arxiv.org/abs/1810.04805.

[2] Petroni, Fabio, Tim Rocktäschel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. "Language Models as Knowledge Bases?" CoRR abs/1909.01066 (2019).

[3] Storks, Shane, Qiaozi Gao, and Joyce Y. Chai. "Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches." CoRR abs/1904.01172 (2019).

[4] Soleimani, Amir, Christof Monz, and Marcel Worring. "BERT for Evidence Retrieval and Claim Verification." arXiv:1910.02655 [cs], October 7, 2019. http://arxiv.org/abs/1910.02655.

[5] Khot, Tushar, Ashish Sabharwal, and Peter Clark. "What's Missing: A Knowledge Gap Guided Approach for Multi-hop Question Answering." CoRR abs/1909.09253 (2019).

[6] Heinzerling, Benjamin, and Kentaro Inui. "Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries." CoRR abs/2008.09036 (2020).

[7] "TF-IDF." In: Sammut, C., and Webb, G.I. (eds), Encyclopedia of Machine Learning. Springer, Boston, MA, 2011. https://doi.org/10.1007/978-0-387-30164-8_832.

[8] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.

[9] Rajpurkar, Pranav, Robin Jia, and Percy Liang. "Know What You Don't Know: Unanswerable Questions for SQuAD." CoRR abs/1806.03822 (2018).

[10] INFCIRC/254/Rev.10/Part 2 (infcirc254r10p2c).

[11] Ibid.

[12] INFCIRC/254/Rev.13/Part 1 (infcirc254r13p1).

[13] Methamphetamine Laboratory Identification and Hazards Fast Facts. https://www.justice.gov/archive/ndic/pubs7/7341/index.htm.

[14] Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. "Transformers: State-of-the-Art Natural Language Processing." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, 2020.

[15] Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12 (2011): 2825–2830.

[16] Burke, Lee, et al. "NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains." arXiv:2105.12192 [cs], May 25, 2021. http://arxiv.org/abs/2105.12192.

[17] Tai, Wen, et al. "ExBERT: Extending Pre-Trained Models with Domain-Specific Vocabulary Under Constrained Training Resources." In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1433–1439. Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.129.

