2022.lrec-1.45
Abstract
In recent years, natural language inference has been an emerging research area. In this paper, we present a novel data augmentation technique and combine it with a unique learning procedure for that task. Our so-called automatic contextual data augmentation (acda) method manages to be fully automatic, non-trivially contextual, and computationally efficient at the same time. Compared to established data augmentation methods, it is substantially more computationally efficient and, unlike them, requires no manual annotation by a human expert. To further increase its effectiveness, we combine acda with two learning optimization techniques: contrastive learning and a hybrid loss function. The former maximizes the benefit of the supervisory signal generated by acda, while the latter incentivises the model to learn the nuances of the decision boundary. Our combined approach is shown experimentally to provide an effective way of mitigating spurious data correlations within a dataset, called dataset artifacts, and as a result improves performance. Specifically, our experiments verify that acda-boosted pre-trained language models that employ our learning optimization techniques consistently outperform the respective fine-tuned baseline pre-trained language models across both benchmark datasets and adversarial examples.
tuned baseline pre-trained language models across both the SNLI and the MNLI datasets. In order to further demonstrate the effectiveness of our approach, we also provide a set of multiple hand-annotated adversarial examples where the acda-boosted models exhibit a considerably more robust behavior and performance than the fine-tuned baseline models.

2. Background and Related Work

Textual entailment is the relationship between a natural language premise p and a natural language hypothesis h. It is positive when the truth of p requires the truth of h, that is, when a human annotator reading p would infer that h is most likely true. Likewise, it is negative when the truth of p contradicts the truth of h, that is, when a human annotator reading p would infer that h is most likely false. The absence of textual entailment is the lack of any relationship between p and h; in this case, the human annotator reading p would infer that the truth of p neither entails nor contradicts the truth of h. Thus, the goal of the natural language inference task is to determine whether h can justifiably be inferred from p. Specifically, based on the textual entailment relationship between p and h, there are three labels: Entailment (ENT) for positive textual entailment, Neutral (NEU) for the absence of textual entailment, and Contradiction (CON) for negative textual entailment. Three examples from SNLI are presented in Table 1 below:

Premise | Hypothesis | Label
A soccer game with multiple males playing. | Some men are playing a sport. | Entailment (ENT)
An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | Neutral (NEU)
A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | Contradiction (CON)

Table 1: Three examples (entailment, neutral, contradiction) from the SNLI dataset.

Since the publication of the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and the Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2017) datasets, there has been considerable progress in the field of natural language inference due to the large amount of annotated data that these datasets provided. Numerous approaches based on recurrent neural networks, such as LSTM-based approaches which often utilize attention mechanisms, have produced decent results (Rocktäschel et al., 2015), (Chen et al., 2016), (Sha et al., 2016), (Munkhdalai and Yu, 2017), (Ghaeini et al., 2018). More recently, pre-trained language models have managed to provide an even higher performance on many tasks related to natural language, including natural language inference (Radford et al., 2018). Specifically, well pre-trained contextual language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), or BERT-based approaches (Zhang et al., 2020), are among those which achieve the highest performance for the SNLI and the MNLI datasets.

However, recent research shows that even though these pre-trained language models achieve high performance on benchmark datasets, they do so by learning spurious correlations, also called dataset artifacts. The models are then expected to fail in settings where these artifacts are not present, which may include real-world test sets of interest. The usage of contrast sets (Gardner et al., 2020), checklist sets (Ribeiro et al., 2020) or other adversarial sets (Jia and Liang, 2017), (Wallace et al., 2019), (Bartolo et al., 2020), (Glockner et al., 2018), (McCoy et al., 2019) makes performance plummet and thus highlights this issue.

In recent years, there has been a considerable effort to combat dataset artifacts in the natural language inference domain. Learning seems to be more robust when it focuses on hard subsets of data, or on data where the gold label distribution is ambiguous, through dataset cartography (Swayamdipta et al., 2020) or other methods (Yaghoobzadeh et al., 2019), (Nie et al., 2020), (Meissner et al., 2021). Another approach is to train on sets of adversarial data directly, such as challenge sets (Liu et al., 2019), (Zhou and Bansal, 2020), or adversarial sets generated by data augmentation (Ribeiro et al., 2020), (Morris et al., 2020). In our work, we propose our own novel method for creating adversarial sets through automatic contextual data augmentation (acda) which, when compared to the aforementioned data augmentation techniques, has the advantage of being substantially more computationally efficient and, at the same time, fully automatic as it requires no manual annotation by a human expert.

Finally, contrastive learning (Dua et al., 2021) is a learning optimization method which takes inspiration from contrastive estimation (Smith and Eisner, 2005) and extends the technique to supervised reading comprehension by carefully selecting appropriate neighbourhoods of related examples. In the original paper, it is used in the context of question answering and requires bundles of closely related question answering pairs which the authors call instance bundles. In our work, we show that the same technique can also be successfully used in natural language inference. In particular, our novel acda method displays great synergy with contrastive learning as it offers a natural way of creating multiple instance bundles of language inference examples that are both contextually closely related and of arbitrary size, which grows exponentially with the length of the hypothesis sentence. We also retain the authors' original technique of combining a Cross Entropy Loss with Maximum Likelihood Estimation (NLL Loss) through our proposed hybrid loss.
3. Analysis of Dataset Artifacts

The first task in solving the issue of dataset artifacts is to identify them. For this purpose, we conducted an exploratory analysis on the SNLI dataset and created our own set of hand-annotated adversarial examples. Note that these are examples that an original fine-tuned model classified correctly, but when the hypothesis is perturbed, even slightly, the prediction accuracy notably suffers. A subset is presented in Table 2 below:

# | Premise | Hypothesis | Label | Pred
1 | Two women are embracing while holding to go packages. | One of the women is holding takeaway packages. | ENT | CON ✗
2 | ... | The packages contain food. | ENT | CON ✗
3 | ... | The women have bought food. | ENT | CON ✗
4 | ... | The women have bought lasagna. | NEU | CON ✗
5 | A man in a blue shirt standing in front of a garage-like structure painted with geometric designs. | A man is wearing black trousers. | NEU | CON ✗
6 | ... | His shirt features geometric designs. | CON | ENT ✗
7 | A young boy in a field of flowers carrying a ball. | He is carrying one ball. | ENT | CON ✗
8 | ... | Ball in field. | ENT | CON ✗
9 | Two doctors perform surgery on patient. | The two doctors are performing brain surgery. | NEU | CON ✗
10 | ... | The patient is having heart surgery. | NEU | CON ✗
11 | A white dog with long hair jumps to catch a red and green toy. | It is not a brown dog. | ENT | CON ✗
12 | Kids are on a amusement ride. | Kids ride joyously an amusement ride. | ENT | CON ✗

Table 2: A sample of hand-annotated adversarial examples and the predictions of the highest performing fine-tuned baseline pre-trained language model (ELECTRA-Small).

By observing the 12 examples of Table 2, we can conclude that the highest performing fine-tuned baseline pre-trained language model, despite achieving a very high accuracy on the SNLI and the MNLI datasets, does not manage to classify any of our 12 hand-annotated adversarial examples correctly. This confirms the magnitude of impact dataset artifacts can have on performance. Specifically, we can make the following observations regarding dataset artifacts from the adversarial examples of Table 2.

First, the model's errors are mostly located around two particular classes, the neutral and the entailment classes. One of the potential artifacts at work here is a distance function between the premise and the hypothesis which the model learns instead of actual comprehension, making predictions based on that distance artifact. Because the neutral class in particular cannot be adequately expressed by distance, or, more accurately, because the distance of hyponyms in embedding space can be very large and confuse the artifact's criterion, the model classifies these large distances as contradictions, which causes a substantial drop in performance.

Second, the model might perform well against trivial augmentations, such as introducing a negation in the form of adding a "not" word in the premise, but when adversarial examples use words which are further apart in embedding space from the premise words, results are much worse. Thus, the model clearly relies on learned artifacts instead of learned language comprehension. Apart from the distance function discussed above, another artifact is the set of words in the hypothesis that the model associates with a specific label regardless of context, only because it has observed those words accompanying that label multiple times during training. Recent research (Wallace et al., 2019) confirms our observation and discusses it in detail, providing examples such as "not" and "least" for the entailment class, "joyously" for the neutral class, and "nobody" and "never" for the contradiction class.

Specifically, in Table 2, we can observe how the phrase "to go" is a synonym of words such as "takeaway", and yet the model produces an incorrect prediction for our adversarial examples 1, 2 and 3, which display a small and natural shift in language, the mere use of a synonym. The model also fails at example 4, where a more specific word is introduced, such as "lasagna", which is a hyponym of "food" and shifts the gold label to the neutral class, but the model perceives this as contradicting the premise. Furthermore, we can observe how in examples 9 and 10, even slight changes in context (specificity) cause the model to choose the contradiction class while the neutral class should have been appropriate. While this shows the effect of the distance function artifact, the most definitive example of that artifact is likely example 6. We can observe how the model, by seeing the same phrase in both the premise and the hypothesis, predicts entailment and is unable to differentiate what the pattern refers to in the language. Reading comprehension would require that it can differentiate between "structure" and "shirt", and in such a case, the model would most likely make the correct prediction.
In conclusion, it is clear that while maintaining high performance on benchmark datasets remains an important indicator of performance, models should also be tested against adversarial examples, which are sometimes similar to real-world sets, in order to ensure that their high performance is not a product of dataset artifacts. Thus, in our work, evaluation is carried out with regard to the prediction accuracy for both the benchmark datasets (SNLI, MNLI) and the hand-annotated adversarial set, a subset of which contains the 12 adversarial examples of Table 2 as they were presented and discussed above.

4. Our Approach

We propose an approach which comprises three techniques towards mitigating dataset artifacts: a novel data augmentation procedure, contrastive learning, and a hybrid loss function. In what follows, we introduce our novel data augmentation technique, which we call automatic contextual data augmentation (acda), and discuss its methodology as well as its benefits. We also present our learning optimization techniques of contrastive learning and hybrid loss, discuss their benefits, and emphasize their synergy with acda in particular.

4.1. Data Augmentation

Referring to the adversarial examples presented in Table 2, our observations naturally lead to an approach where contextual augmentation based on word groups could incentivise the model to learn the actual decision boundary instead of relying on dataset artifacts. Performance could benefit if more hypotheses that are closely related to each other, such as our adversarial ones, were made available to the model, including ones with substantial contextual shifts, such as the ones the model fails at. Moreover, we require that this augmentation procedure is fully automatic, i.e. it does not require a human expert to manually annotate each example, because otherwise the resources required would make the procedure infeasible. We devised such a data augmentation procedure that generates new examples which on the one hand are non-trivial (as opposed, for example, to adding a "not" ahead of the hypothesis), while at the same time being robust in labelling the newly generated example correctly. To achieve non-trivial augmentation, we employed WordNet (Miller, 1995) synsets and generated a new hypothesis while leaving the premise as it is. This was done by replacing one word in the hypothesis with either a synonym, an antonym, a hyponym or a hypernym. In order to ensure that the labelling of the new example is sensible, we created and employed the set of rules shown in Table 3 below:

Old Label | Word Swap | New Label
ENT | Synonym-Hypernym | ENT
ENT | Antonym | CON
ENT | Hyponym | NEU
NEU | Synonym-Hypernym | NEU
NEU | Antonym | UNK
NEU | Hyponym | UNK
CON | Synonym-Hypernym | CON
CON | Antonym | UNK
CON | Hyponym | CON

Table 3: Label generation rules for augmented examples using WordNet synsets.

Our data augmentation procedure scans the hypothesis sentence for nouns and queries WordNet synsets for a replacement word. It then swaps one noun at a time and composes new examples using the label generation rules in Table 3; a minimal code sketch of this procedure is given below. Observe that this procedure can be seen as replicating the generation of adversarial examples that caused the model performance to deteriorate. Therefore, the procedure yields a high number of new training examples from the most problematic areas of the decision boundary, which can now be used as part of training to incentivise the model against the reliance on artifacts.
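The following is a minimal sketch of this procedure, assuming NLTK's WordNet interface; the function names (candidate_swaps(), augment_example()) and the crude space-joined detokenization are illustrative choices rather than a verbatim excerpt of our implementation, and the NLTK wordnet, punkt and tagger resources are assumed to be downloaded:

```python
# A minimal sketch of the acda noun-swap procedure, assuming NLTK's WordNet
# interface; names like augment_example() are illustrative, not verbatim.
# Requires: nltk.download("wordnet"), nltk.download("punkt"),
#           nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

# Label generation rules from Table 3: (old label, relation) -> new label.
# Rules that would yield UNK are omitted, as discussed below.
LABEL_RULES = {
    ("ENT", "synonym_hypernym"): "ENT",
    ("ENT", "antonym"): "CON",
    ("ENT", "hyponym"): "NEU",
    ("NEU", "synonym_hypernym"): "NEU",
    ("CON", "synonym_hypernym"): "CON",
    ("CON", "hyponym"): "CON",
}

def candidate_swaps(noun):
    """Query WordNet synsets for replacement words, grouped by relation."""
    swaps = []
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            if lemma.name().lower() != noun:  # synonyms from the same synset
                swaps.append((lemma.name().replace("_", " "), "synonym_hypernym"))
            for ant in lemma.antonyms():
                swaps.append((ant.name().replace("_", " "), "antonym"))
        for hyper in synset.hypernyms():
            swaps += [(l.name().replace("_", " "), "synonym_hypernym")
                      for l in hyper.lemmas()]
        for hypo in synset.hyponyms():
            swaps += [(l.name().replace("_", " "), "hyponym")
                      for l in hypo.lemmas()]
    return swaps

def augment_example(premise, hypothesis, label, cap=10):
    """Swap one hypothesis noun at a time and relabel via Table 3."""
    augmented = []
    tokens = word_tokenize(hypothesis)
    for i, (token, tag) in enumerate(pos_tag(tokens)):
        if not tag.startswith("NN"):  # only nouns are swapped
            continue
        for replacement, relation in candidate_swaps(token.lower()):
            new_label = LABEL_RULES.get((label, relation))
            if new_label is None:      # UNK rule: skip the ambiguous swap
                continue
            # Space-joined detokenization is a simplification of this sketch.
            new_hypothesis = " ".join(tokens[:i] + [replacement] + tokens[i + 1:])
            augmented.append((premise, new_hypothesis, new_label))
            if len(augmented) >= cap:  # bounded, as described later in 4.1
                return augmented
    return augmented
```

For the ENT-labelled hypothesis "A couple is playing with a dog outside", such a sketch yields, among others, the hypernym swap to "animal" (label kept as ENT) and the hyponym swap to "corgi" (label shifted to NEU), matching the worked examples discussed below.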
The rules that result in the Unknown (UNK) label were not used as part of the augmentation. Because of the inherent ambiguity of replacing a word in these contexts, the supervisory signal can be corrupted and lead the model to learn nonsensical rules. Importantly, we note that the remaining rules are robust, but they are not infallible: there is still the possibility, however small, that a newly generated example gets an incorrect label assigned to it. However, this was deemed acceptable, because labelling any hypothesis is inherently ambiguous and only partially correct, even when done by human experts, as developed in detail in recently published research which shows that, indeed, numerous examples can be found in SNLI and other similar datasets where human experts disagree on which label to assign to a hypothesis (Dua et al., 2021). By keeping only the more robust rules for augmentation, we ensure that the probability of generating a controversial example will be similar to the one induced by human experts, and will therefore not alter the underlying manifold of the dataset that the model is trying to learn.

The resulting augmentation benefits from being both fully automatic, as it requires no manual writing of new hypotheses or label annotation, and non-trivial. For example, we can observe that by using Rule 1 on the hypothesis "A couple is playing with a dog outside", the word "dog" might be replaced by "animal" (a hypernym), which according to the rule will retain the Entailment label. This is logically correct, while at the same time it produces an example where the swapped word can have a vector of significant distance in embedding space, thus incentivising the model to discover the correct relations in the corpus and move away from the distance function artifact. As another example, we can consider swapping the same word with a hyponym such as "corgi". Because the original hypothesis label is Entailment, according to Rule 3, the new hypothesis "A couple is playing with a corgi outside" would get assigned the Neutral label, which is again logically correct and a valid datapoint for training a model.

We can further observe that the number of possible augmented examples that can be generated grows exponentially with the length of the base hypothesis, because any noun in the hypothesis could be swapped with any word in its synset. In order to keep the training time bounded, our implementation enforces an upper bound of 10 augmented examples per hypothesis sentence. As anticipated, this approach leads to a 10 times larger dataset, and training time also increases in a linear fashion. In our implementation, we use the map() method of the Huggingface Dataset class; a sketch is shown below. This has the advantage of placing the augmented examples right below the original examples, as a result keeping the related examples together. This is very beneficial for the learning optimization techniques which we discuss in the sections that follow.
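The following sketch shows one way to wire the augmentation into a Huggingface datasets pipeline; the premise/hypothesis/label column names and the 0/1/2 integer label coding are assumptions based on the SNLI dataset as distributed on the Hub, and augment_example() is the sketch given earlier in this section. A batched map() call may return more rows than it receives, which is what places each augmented example directly after its original:

```python
# Sketch: expanding the training set with acda via datasets.map(); column
# names and label coding are assumptions based on the Huggingface SNLI dataset.
from datasets import load_dataset

ID2LABEL = {0: "ENT", 1: "NEU", 2: "CON"}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

def expand_batch(batch):
    premises, hypotheses, labels = [], [], []
    for p, h, l in zip(batch["premise"], batch["hypothesis"], batch["label"]):
        # Keep the original example, then place its augmentations right after
        # it, so that each instance bundle stays contiguous in the dataset.
        premises.append(p); hypotheses.append(h); labels.append(l)
        for ap, ah, al in augment_example(p, h, ID2LABEL[l], cap=10):
            premises.append(ap); hypotheses.append(ah); labels.append(LABEL2ID[al])
    return {"premise": premises, "hypothesis": hypotheses, "label": labels}

train = load_dataset("snli", split="train")
train = train.filter(lambda ex: ex["label"] != -1)  # drop pairs with no gold label
train = train.map(expand_batch, batched=True)       # roughly 10x larger training set
```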
4.2. Contrastive Learning

Having acquired a 10 times larger training set through acda, the question of taking maximum advantage of the training examples becomes pertinent. We decided to employ the recently published technique of contrastive learning (Dua et al., 2021) to further incentivise the model to learn the nuances of the decision boundary. According to that research, one technique to achieve this is for the model to see instance bundles during training, that is, examples that are close together and belong to a specific area of the decision boundary in the same training batch. This approach has been used in unsupervised linguistic structure prediction (Smith and Eisner, 2005) and supervised reading comprehension (Dua et al., 2021).

Since acda places the augmented examples right after each original one, the dataset batches provided to the model in each iteration will consist of some number of original examples and their augmentations. This way, we manage to have a dataset consisting of multiple instance bundles and therefore gain the maximum benefit from contrastive learning. In our implementation, we disabled dataset shuffling in our CustomTrainer class by overloading the _get_train_sampler() method of the Huggingface Trainer class.
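A minimal sketch of that override could look as follows, assuming a transformers version whose Trainer exposes the _get_train_sampler() hook:

```python
# Sketch: keep acda's instance bundles intact by iterating over the dataset
# in order instead of shuffling; hook name as in recent transformers versions.
from torch.utils.data import SequentialSampler
from transformers import Trainer

class CustomTrainer(Trainer):
    def _get_train_sampler(self):
        # The default Trainer returns a RandomSampler, which would scatter
        # the bundles across batches; a SequentialSampler preserves the
        # original-plus-augmentations ordering produced by acda.
        return SequentialSampler(self.train_dataset)
```

Shuffling could alternatively be performed at the bundle level, so that bundles are presented in random order while staying intact; the description above only requires that related examples end up in the same batch.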
4.3. Hybrid Loss

A potential downside of contrastive learning with instance bundles is that the model may focus on local distinctions without being able to correctly classify examples that it has not seen and that are further apart in decision space. In this scenario, the model is really learning many small multinomial classification problems, and misses out on larger-scale rules in the classification manifold. In order to mitigate this, we decided to combine both the Cross Entropy Loss (CE) and the NLL Loss, which uses the Maximum Likelihood Estimation (MLE) criterion. We call this new combined loss function Hybrid Loss and define it as follows:

L(o, l) = α · L_MLE(o, l) + (1 − α) · L_CE(o, l)    (1)

where o denotes the model output for an example and l its gold label. In the supervised setting, which includes our present natural language inference application, MLE (through the NLL Loss) is a much stronger training signal than CE. This is because CE does not provide a learning signal for the large space of alternative premises or hypotheses that are not in the neighbourhood of the current instance bundle. On the other hand, CE provides a much stronger signal for a small set of closely related and potentially confusing examples. Thus, its supervisory signal involves a smaller area of the decision boundary, as it will be made up of a small number of examples and their augmentations, all of which are close in decision space, as opposed to a larger number of examples all over the decision space. However, it will also be more complex in these localities, demanding a more fine-grained weight updating from the model and forcing it to learn the local properties of the decision boundary.

By combining both losses in a weighted-average manner, we manage to retain the advantages of both loss functions. The Cross Entropy Loss ensures that part of the loss signal will be directly relevant to the shortcomings of the model in the localities of the decision boundary, enabling contrastive learning, while the NLL Loss will incentivise generalization in areas that the model has not seen, learning rules that can only be inferred by looking at unrelated examples. With this arrangement we ensure a balance between the large number of examples in a small area of the decision space, and a smaller number of examples all over that space. Intuitively, the Hybrid Loss can be thought of as using the NLL Loss to cause the largest modifications of the current decision boundary, affecting more of the decision space, and the Cross Entropy Loss to fine-tune local areas according to the examples of each batch. In our implementation, we overloaded the compute_loss() method of the Huggingface Trainer class with our hybrid loss function as shown in Equation 1, with a value of 0.5 for the α parameter.
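The sketch below shows one plausible form of this override, with α = 0.5 as stated above. Since the exact form of the bundle-level CE term is not spelled out here, the input-side normalization used in the sketch (contrasting each example's gold-label score against the other examples of its batch, in the spirit of contrastive estimation) is an assumption, not necessarily the implementation used in our experiments:

```python
# Sketch of Equation 1 as a Trainer.compute_loss() override; the bundle-level
# contrastive CE term is one plausible reading, flagged as an assumption.
import torch
import torch.nn.functional as F
from transformers import Trainer

class HybridLossTrainer(Trainer):
    def __init__(self, *args, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha  # alpha = 0.5, as stated above

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, num_labels); a batch holds instance bundles

        # Global MLE/NLL term: ordinary negative log-likelihood of the gold label.
        mle_loss = F.cross_entropy(logits, labels)

        # Bundle-level contrastive CE term (assumption): normalize each example's
        # gold-label score over all examples in the batch, so closely related,
        # potentially confusing examples must be pulled apart.
        idx = torch.arange(logits.size(0), device=logits.device)
        gold_scores = logits[idx, labels]                      # (batch,)
        log_denom = torch.logsumexp(logits[:, labels], dim=0)  # (batch,)
        contrastive_loss = (log_denom - gold_scores).mean()

        # L(o, l) = alpha * L_MLE(o, l) + (1 - alpha) * L_CE(o, l)
        loss = self.alpha * mle_loss + (1 - self.alpha) * contrastive_loss
        return (loss, outputs) if return_outputs else loss
```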
5. Experimental Evaluation

We use the Huggingface Transformers (Wolf et al., 2019) Python package in order to train five models: four different BERT variants and the BERT-based ELECTRA-Small model. Then, we compare the fine-tuned baseline pre-trained language models with the respective acda-boosted pre-trained language models for both datasets. We select the best performing acda-boosted pre-trained language model and carry out an evaluation on our adversarial set in order to ensure that it successfully mitigates dataset artifacts. Finally, we present (Table 6) the outcome of our procedure on the same subset (Table 2) of our adversarial dataset that we used to demonstrate the influence of dataset artifacts.

5.1. Performance on Benchmark Datasets

We use SNLI and MNLI as our two benchmark datasets in order to present a comparison between fine-tuned baseline pre-trained language models from the Huggingface Transformers repository and their respective acda-boosted pre-trained language models. Our goal is to show that our approach consistently improves performance regardless of model or dataset choice.

SNLI Dataset. The first evaluation dataset is SNLI, a collection of 570,000 human-written English sentence pairs manually labeled for balanced classification (Bowman et al., 2015). We present the comparison between the fine-tuned baseline pre-trained language models and the respective acda-boosted pre-trained language models for SNLI in Table 4 below:

Model | Fine-tuned Baseline | Acda-boosted
BERT-Tiny | 78.86 | 82.01
BERT-Mini | 85.06 | 86.79
BERT-Small | 87.27 | 87.90
BERT-Medium | 88.92 | 89.01
ELECTRA-Small | 89.02 | 89.82

Table 4: Comparison of fine-tuned baseline pre-trained language models and their respective acda-boosted pre-trained language models for the SNLI dataset (accuracy, %).

Regarding the comparison results for SNLI, we notice that acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models. Specifically, we observe that models with a smaller architecture, such as BERT-Tiny and BERT-Mini, make the largest gains when they make use of acda, as their performance increases by 3.15% and 1.73% respectively. The rest of the models display a performance increase between 0.1% and 0.8%, while the best performing model is the acda-boosted ELECTRA-Small with an accuracy of 89.82%. Therefore, we can conclude that our approach consistently increases performance across all models, particularly lightweight ones, for SNLI.

MNLI Dataset. The second evaluation dataset is MNLI, a crowd-sourced collection of 433,000 sentence pairs annotated with textual entailment information. It is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation (Williams et al., 2017). We present the comparison between the fine-tuned baseline pre-trained language models and the respective acda-boosted pre-trained language models for MNLI in Table 5 below:

Model | Fine-tuned Baseline | Acda-boosted
BERT-Tiny | 65.24 | 69.06
BERT-Mini | 72.54 | 75.22
BERT-Small | 77.02 | 78.57
BERT-Medium | 80.20 | 80.39
ELECTRA-Small | 81.16 | 81.53

Table 5: Comparison of fine-tuned baseline pre-trained language models and their respective acda-boosted pre-trained language models for the MNLI dataset (accuracy, %).

Regarding the comparison results for MNLI, we notice that acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models. Once again, we observe that models with a smaller architecture are the ones that receive the largest performance boost, even higher than the one observed for SNLI. Specifically, BERT-Tiny and BERT-Mini increase their performance by 3.82% and 2.68% respectively when they employ acda. The rest of the models display a variable performance increase between 0.19% and 1.55%, while the best performing model is the acda-boosted ELECTRA-Small with an accuracy of 81.53%. Therefore, we reach the same conclusion as before: our approach consistently increases performance across all models, particularly lightweight ones, for MNLI.

Computational Efficiency. It is worth noting that we initially implemented our data augmentation rules for acda, as presented in Table 3, using the TextAttack package (Morris et al., 2020) as well as the Checklist package (Ribeiro et al., 2020). The result was a 60× increase in training time, while we also confirmed manually that they produced a smaller number of augmented examples in each iteration. According to the Huggingface training time estimator, this training procedure would take approximately 60 hours on Google Colab Pro for ELECTRA-Small. On the other hand, our own optimized implementation of acda only requires 9 hours of training for the same task, thus highlighting its computational efficiency.
5.2. Performance on Adversarial Examples

After showing that acda-boosted pre-trained language models provide a consistent improvement in performance for both the SNLI and the MNLI datasets when compared to the respective fine-tuned baseline pre-trained language models, we continue our evaluation by examining their behavior when facing adversarial examples. For this purpose, we make use of our hand-annotated adversarial set and, specifically, the adversarial examples of Table 2, which we discussed in Section 3. Recall that the predictions of Table 2 are those of the best performing fine-tuned baseline pre-trained language model, ELECTRA-Small. Despite having a prediction accuracy of 89.02% and 81.16% on the SNLI and MNLI validation sets respectively (Tables 4 and 5), the model did not classify any of the 12 adversarial examples of Table 2 correctly. We present the same 12 adversarial examples with the predictions of the acda-boosted ELECTRA-Small in Table 6 below:

# | Premise | Hypothesis | Label | Pred
1 | Two women are embracing while holding to go packages. | One of the women is holding takeaway packages. | ENT | ENT ✓
2 | ... | The packages contain food. | ENT | ENT ✓
3 | ... | The women have bought food. | ENT | ENT ✓
4 | ... | The women have bought lasagna. | NEU | ENT ✗
5 | A man in a blue shirt standing in front of a garage-like structure painted with geometric designs. | A man is wearing black trousers. | NEU | NEU ✓
6 | ... | His shirt features geometric designs. | CON | ENT ✗
7 | A young boy in a field of flowers carrying a ball. | He is carrying one ball. | ENT | ENT ✓
8 | ... | Ball in field. | ENT | ENT ✓
9 | Two doctors perform surgery on patient. | The two doctors are performing brain surgery. | NEU | ENT ✗
10 | ... | The patient is having heart surgery. | NEU | ENT ✗
11 | A white dog with long hair jumps to catch a red and green toy. | It is not a brown dog. | ENT | ENT ✓
12 | Kids are on a amusement ride. | Kids ride joyously an amusement ride. | ENT | ENT ✓

Table 6: A sample of hand-annotated adversarial examples and the predictions of the highest performing acda-boosted pre-trained language model (ELECTRA-Small).

Comparing the fine-tuned baseline pre-trained language model results (Table 2) and the acda-boosted pre-trained language model results (Table 6), we notice a considerable improvement in prediction accuracy, and we can therefore conclude that the acda-boosted pre-trained language model exhibits a robust behavior against adversarial examples due to its resilience against dataset artifacts. Specifically, it manages to classify 8 out of the 12 adversarial examples correctly. We can attribute its success to the improved training procedure having moved the model further away from dataset artifacts and towards greater reading comprehension. This is further supported by the fact that even for the adversarial examples which the acda-boosted pre-trained language model classifies incorrectly, we can manually confirm that, in the majority of the cases, the classification probability towards the gold label is significantly higher compared to the one produced by the respective fine-tuned baseline pre-trained language model.

6. Conclusions and Future Work

In this work we proposed a novel data augmentation technique, acda, discussed its advantages with respect to established data augmentation packages, and described how it can be naturally combined with a learning optimization method which utilizes contrastive learning and a hybrid loss function. We showed that the employment of this combined approach by pre-trained language models can lead to a consistent increase in performance, while requiring minimal computational cost regarding training time and resources. In particular, acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models on benchmark datasets related to natural language inference. Furthermore, the acda-boosted pre-trained language models are also substantially more resilient to dataset artifacts and as a result display robust behavior and high performance against adversarial examples.

As a natural next step, we intend to further improve the data augmentation process by introducing more sophisticated rules. We believe that by expanding the rules in a structured manner, we can generate more closely related examples and improve performance metrics substantially. This will likely require a formal-logical treatment of the relationships between sentences when a word is swapped in a controlled manner. Similarly, coming up with a larger number of more complex rules, such as ones based on conditionals, is also promising, as this would further increase the size of the training set in a meaningful way and, given the computational efficiency of our procedure, it would come at a minimal cost on top of training. Finally, we intend to create modified variants of acda in order to expand our methodology to other domains of interest within natural language processing, where reading comprehension is vital.
Acknowledgements

The authors wish to express their gratitude to Prof. Greg Durrett and his staff for their guidance and contributions.

7. Bibliographical References

Bartolo, M., Roberts, A., Welbl, J., Riedel, S., and Stenetorp, P. (2020). Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September. Association for Computational Linguistics.

Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., and Inkpen, D. (2016). Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dua, D., Dasigi, P., Singh, S., and Gardner, M. (2021). Learning with instance bundles for reading comprehension. arXiv preprint arXiv:2104.08735.

Gardner, M., Artzi, Y., Basmova, V., Berant, J., Bogin, B., Chen, S., Dasigi, P., Dua, D., Elazar, Y., Gottumukkala, A., et al. (2020). Evaluating models' local decision boundaries via contrast sets. arXiv preprint arXiv:2004.02709.

Ghaeini, R., Hasan, S. A., Datla, V., Liu, J., Lee, K., Qadir, A., Ling, Y., Prakash, A., Fern, X. Z., and Farri, O. (2018). DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. arXiv preprint arXiv:1802.05577.

Glockner, M., Shwartz, V., and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. arXiv preprint arXiv:1805.02266.

Jia, R. and Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

Liu, N. F., Schwartz, R., and Smith, N. A. (2019). Inoculation by fine-tuning: A method for analyzing challenge datasets. arXiv preprint arXiv:1904.02668.

MacCartney, B. and Manning, C. D. (2008). Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521–528.

McCoy, R. T., Pavlick, E., and Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.

Meissner, J. M., Thumwanit, N., Sugawara, S., and Aizawa, A. (2021). Embracing ambiguity: Shifting the training target of NLI models. arXiv preprint arXiv:2106.03020.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., and Qi, Y. (2020). TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv preprint arXiv:2005.05909.

Munkhdalai, T. and Yu, H. (2017). Neural tree indexers for text understanding. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 1, page 11. NIH Public Access.

Nie, Y., Zhou, X., and Bansal, M. (2020). What can we learn from collective human opinions on natural language inference data? arXiv preprint arXiv:2010.03532.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. (2018). Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.

Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiskỳ, T., and Blunsom, P. (2015). Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Sha, L., Chang, B., Sui, Z., and Li, S. (2016). Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2870–2879.

Smith, N. A. and Eisner, J. (2005). Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 354–362.

Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. (2020). Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv preprint arXiv:2009.10795.

Wallace, E., Feng, S., Kandpal, N., Gardner, M., and Singh, S. (2019). Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Williams, A., Nangia, N., and Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Yaghoobzadeh, Y., Mehri, S., Tachet, R., Hazen, T. J., and Sordoni, A. (2019). Increasing robustness to spurious correlations using forgettable examples. arXiv preprint arXiv:1911.03861.

Zhang, Z., Wu, Y., Zhao, H., Li, Z., Zhang, S., Zhou, X., and Zhou, X. (2020). Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9628–9635.

Zhou, X. and Bansal, M. (2020). Towards robustifying NLI models against lexical dataset biases. arXiv preprint arXiv:2005.04732.