2022.lrec-1.45
Abstract
In recent years, natural language inference has been an emerging research area. In this paper, we present a novel data augmentation technique and combine it with a unique learning procedure for that task. Our so-called automatic contextual data augmentation (acda) method manages to be fully automatic, non-trivially contextual, and computationally efficient at the same time. Compared to established data augmentation methods, it is substantially more computationally efficient and, unlike them, requires no manual annotation by a human expert. To further increase its effectiveness, we combine acda with two learning optimization techniques: contrastive learning and a hybrid loss function. The former maximizes the benefit of the supervisory signal generated by acda, while the latter incentivises the model to learn the nuances of the decision boundary. Our combined approach is shown experimentally to provide an effective way of mitigating spurious data correlations within a dataset, called dataset artifacts, and as a result improves performance. Specifically, our experiments verify that acda-boosted pre-trained language models that employ our learning optimization techniques consistently outperform the respective fine-tuned baseline pre-trained language models across both benchmark datasets and adversarial examples.
tuned baseline pre-trained language models across both the SNLI and the MNLI datasets. In order to further demonstrate the effectiveness of our approach, we also provide a set of multiple hand-annotated adversarial examples where the acda-boosted models exhibit a considerably more robust behavior and performance than the fine-tuned baseline models.

2. Background and Related Work

Textual entailment is the relationship between a natural language premise p and a natural language hypothesis h. It is positive when the truth of p requires the truth of h, that is, when a human annotator reading p would infer that h is most likely true. Likewise, it is negative when the truth of p contradicts the truth of h, that is, when a human annotator reading p would infer that h is most likely false. The absence of textual entailment is the lack of any relationship between p and h; in this case, the human annotator reading p would infer that the truth of p neither entails nor contradicts the truth of h. Thus, the goal of the natural language inference task is to determine whether h can justifiably be inferred from p. Specifically, based on the textual entailment relationship between p and h, there are three labels: Entailment (ENT) for positive textual entailment, Neutral (NEU) for the absence of textual entailment, and Contradiction (CON) for negative textual entailment. Three examples from SNLI are presented in Table 1 below:

Premise | Hypothesis | Label
A soccer game with multiple males playing. | Some men are playing a sport. | Entailment (ENT)
An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | Neutral (NEU)
A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | Contradiction (CON)

Table 1: Three examples (entailment, neutral, contradiction) from the SNLI dataset.

Since the publication of the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and the Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2017) datasets, there has been considerable progress in the field of natural language inference due to the large amount of annotated data that these datasets provided. Numerous approaches based on recurrent neural networks, such as LSTM-based approaches which often utilize attention mechanisms, have produced decent results (Rocktäschel et al., 2015), (Chen et al., 2016), (Sha et al., 2016), (Munkhdalai and Yu, 2017), (Ghaeini et al., 2018). More recently, pre-trained language models have managed to provide an even higher performance on many tasks related to natural language, including natural language inference (Radford et al., 2018). Specifically, well pre-trained contextual language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), or BERT-based approaches (Zhang et al., 2020), are among those which achieve the highest performance for the SNLI and the MNLI datasets.

However, recent research shows that even though these pre-trained language models achieve high performance on benchmark datasets, they do so by learning spurious correlations, also called dataset artifacts. The models are then expected to fail in settings where these artifacts are not present, which may include real-world test sets of interest. The usage of contrast sets (Gardner et al., 2020), checklist sets (Ribeiro et al., 2020) or other adversarial sets (Jia and Liang, 2017), (Wallace et al., 2019), (Bartolo et al., 2020), (Glockner et al., 2018), (McCoy et al., 2019) makes performance plummet and thus highlights this issue.

In recent years, there has been a considerable effort to combat dataset artifacts in the natural language inference domain. Learning seems to be more robust when it focuses on hard subsets of data, or on data where the gold label distribution is ambiguous, through dataset cartography (Swayamdipta et al., 2020) or other methods (Yaghoobzadeh et al., 2019), (Nie et al., 2020), (Meissner et al., 2021). Another approach is to train on sets of adversarial data directly, such as challenge sets (Liu et al., 2019), (Zhou and Bansal, 2020), or adversarial sets generated by data augmentation (Ribeiro et al., 2020), (Morris et al., 2020). In our work, we propose our own novel method for creating adversarial sets through automatic contextual data augmentation (acda) which, when compared to the aforementioned data augmentation techniques, has the advantage of being substantially more computationally efficient and, at the same time, fully automatic as it requires no manual annotation by a human expert.

Finally, contrastive learning (Dua et al., 2021) is a learning optimization method which takes inspiration from contrastive estimation (Smith and Eisner, 2005) and extends the technique to supervised reading comprehension by carefully selecting appropriate neighbourhoods of related examples. In the original paper, it is used in the context of question answering and requires bundles of closely related question answering pairs which the authors call instance bundles. In our work, we show that the same technique can also be successfully used in natural language inference. In particular, our novel acda method displays great synergy with contrastive learning as it offers a natural way of creating multiple instance bundles of language inference examples that are both contextually closely related and of arbitrary size, which grows exponentially with the length of the hypothesis sentence. We also retain the authors' original technique of combining a Cross Entropy Loss with Maximum Likelihood Estimation (NLL Loss) through our proposed hybrid loss.
3. Analysis of Dataset Artifacts

The first task in solving the issue of dataset artifacts is to identify them. For this purpose, we conducted an exploratory analysis on the SNLI dataset and created our own set of hand-annotated adversarial examples. Note that these are examples that an original fine-tuned model classified correctly, but when the hypothesis is perturbed, even slightly, the prediction accuracy notably suffers. A subset is presented in Table 2 below:

# | Premise | Hypothesis | Label | Pred
1 | Two women are embracing while holding to go packages. | One of the women is holding takeaway packages. | ENT | CON ✗
2 | ... | The packages contain food. | ENT | CON ✗
3 | ... | The women have bought food. | ENT | CON ✗
4 | ... | The women have bought lasagna. | NEU | CON ✗
5 | A man in a blue shirt standing in front of a garage-like structure painted with geometric designs. | A man is wearing black trousers. | NEU | CON ✗
6 | ... | His shirt features geometric designs. | CON | ENT ✗
7 | A young boy in a field of flowers carrying a ball. | He is carrying one ball. | ENT | CON ✗
8 | ... | Ball in field. | ENT | CON ✗
9 | Two doctors perform surgery on patient. | The two doctors are performing brain surgery. | NEU | CON ✗
10 | ... | The patient is having heart surgery. | NEU | CON ✗
11 | A white dog with long hair jumps to catch a red and green toy. | It is not a brown dog. | ENT | CON ✗
12 | Kids are on a amusement ride. | Kids ride joyously an amusement ride. | ENT | CON ✗

Table 2: A sample of hand-annotated adversarial examples and the predictions of the highest performing fine-tuned baseline pre-trained language model (ELECTRA-Small).

By observing the 12 examples of Table 2, we can conclude that the highest performing fine-tuned baseline pre-trained language model, despite achieving a very high accuracy on the SNLI and the MNLI datasets, does not manage to classify any of our 12 hand-annotated adversarial examples correctly. This confirms the magnitude of impact dataset artifacts can have on performance. Specifically, we can make the following observations regarding dataset artifacts from the adversarial examples of Table 2.

First, the model's errors are mostly located around two particular classes, the neutral and the entailment classes. One of the potential artifacts at work here is a distance function between the premise and the hypothesis which the model learns instead of actual comprehension, making predictions based on that distance artifact. Because the neutral class in particular cannot be adequately expressed by distance, or, more accurately, because the distance of hyponyms in embedding space can be very large and confuse the artifact's criterion, the model classifies these large distances as contradictions, which causes a substantial drop in performance.

Second, the model might perform well against trivial augmentations, such as introducing a negation in the form of adding a "not" word in the premise, but when adversarial examples use words which are further apart in embedding space from the premise words, results are much worse. Thus, the model clearly relies on learned artifacts instead of learned language comprehension. Apart from the distance function discussed above, another artifact is the set of words in the hypothesis that the model associates with a specific label regardless of context, only because it has observed those words accompanying that label multiple times during training. Recent research (Wallace et al., 2019) confirms our observation and discusses it in detail, providing examples such as "not" and "least" for the entailment class, "joyously" for the neutral class, and "nobody" and "never" for the contradiction class.

Specifically, in Table 2, we can observe how the phrase "to go" is a synonym of words such as "takeaway", and yet the model produces an incorrect prediction for our adversarial examples 1, 2 and 3, which display a small and natural shift in language, the mere use of a synonym. The model also fails at example 4, where a more specific word is introduced, such as "lasagna", which is a hyponym of "food" and shifts the gold label to the neutral class, but the model perceives this as contradicting the premise. Furthermore, we can observe how in examples 9 and 10, even slight changes in context (specificity) cause the model to choose the contradiction class while the neutral class should have been appropriate. While this shows the effect of the distance function artifact, the most definitive example of that artifact is likely example 6. We can observe how the model, by seeing the same phrase in both the premise and the hypothesis, predicts entailment and is unable to differentiate what the pattern refers to in the language. Reading comprehension would require that it can differentiate between "structure" and "shirt", and in such a case, the model would most likely make the correct prediction.
In conclusion, it is clear that while maintaining high performance on benchmark datasets remains an important indicator of performance, models should also be tested against adversarial examples, which are sometimes similar to real-world sets, in order to ensure that their high performance is not a product of dataset artifacts. Thus, in our work, evaluation is carried out with regard to the prediction accuracy for both the benchmark datasets (SNLI, MNLI) and the hand-annotated adversarial set, a subset of which contains the 12 adversarial examples of Table 2 as they were presented and discussed above.

4. Our Approach

We propose an approach which comprises three techniques towards mitigating dataset artifacts: a novel data augmentation procedure, contrastive learning, and a hybrid loss function. In what follows, we introduce our novel data augmentation technique, which we call automatic contextual data augmentation (acda), and discuss its methodology as well as its benefits. We also present our learning optimization techniques of contrastive learning and hybrid loss, discuss their benefits, and emphasize their synergy with acda in particular.

4.1. Data Augmentation

Referring to the adversarial examples presented in Table 2, our observations naturally lead to an approach where contextual augmentation based on word groups could incentivise the model to learn the actual decision boundary instead of relying on dataset artifacts. Performance could benefit if more hypotheses that are closely related to each other, such as our adversarial ones, were made available to the model, including ones with substantial contextual shifts, such as the ones the model fails at. Moreover, we require that this augmentation procedure is fully automatic, i.e. it does not require a human expert to manually annotate each example, because otherwise the resources required would make the procedure infeasible. We devised such a data augmentation procedure that generates new examples which on the one hand are non-trivial (as opposed, for example, to adding a "not" ahead of the hypothesis), while at the same time being robust in labelling the newly generated example correctly. To achieve non-trivial augmentation, we employed WordNet (Miller, 1995) synsets and generated a new hypothesis while leaving the premise as it is. This was done by replacing one word in the hypothesis with either a synonym, an antonym, a hyponym or a hypernym. In order to ensure that the labelling of the new example is sensible, we created and employed the set of rules shown in Table 3 below:

Old Label | Word Swap | New Label
ENT | Synonym-Hypernym | ENT
ENT | Antonym | CON
ENT | Hyponym | NEU
NEU | Synonym-Hypernym | NEU
NEU | Antonym | UNK
NEU | Hyponym | UNK
CON | Synonym-Hypernym | CON
CON | Antonym | UNK
CON | Hyponym | CON

Table 3: Label generation rules for augmented examples using WordNet synsets.

Our data augmentation procedure scans the hypothesis sentence for nouns and queries WordNet synsets for a replacement word. It then swaps one noun at a time and composes new examples using the label generation rules in Table 3; a minimal code sketch of this procedure is given below. Observe that this procedure can be seen as replicating the generation of adversarial examples that caused the model performance to deteriorate. Therefore, the procedure yields a high number of new training examples from the most problematic areas of the decision boundary, which can now be used as part of training to incentivise the model against the reliance on artifacts.
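The following is a minimal sketch of this procedure, assuming NLTK's WordNet interface; the function names (candidate_swaps(), augment_example()) and the crude space-joined detokenization are illustrative choices rather than a verbatim excerpt of our implementation, and the NLTK wordnet, punkt and tagger resources are assumed to be downloaded:

```python
# A minimal sketch of the acda noun-swap procedure, assuming NLTK's WordNet
# interface; names like augment_example() are illustrative, not verbatim.
# Requires: nltk.download("wordnet"), nltk.download("punkt"),
#           nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

# Label generation rules from Table 3: (old label, relation) -> new label.
# Rules that would yield UNK are omitted, as discussed below.
LABEL_RULES = {
    ("ENT", "synonym_hypernym"): "ENT",
    ("ENT", "antonym"): "CON",
    ("ENT", "hyponym"): "NEU",
    ("NEU", "synonym_hypernym"): "NEU",
    ("CON", "synonym_hypernym"): "CON",
    ("CON", "hyponym"): "CON",
}

def candidate_swaps(noun):
    """Query WordNet synsets for replacement words, grouped by relation."""
    swaps = []
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            if lemma.name().lower() != noun:  # synonyms from the same synset
                swaps.append((lemma.name().replace("_", " "), "synonym_hypernym"))
            for ant in lemma.antonyms():
                swaps.append((ant.name().replace("_", " "), "antonym"))
        for hyper in synset.hypernyms():
            swaps += [(l.name().replace("_", " "), "synonym_hypernym")
                      for l in hyper.lemmas()]
        for hypo in synset.hyponyms():
            swaps += [(l.name().replace("_", " "), "hyponym")
                      for l in hypo.lemmas()]
    return swaps

def augment_example(premise, hypothesis, label, cap=10):
    """Swap one hypothesis noun at a time and relabel via Table 3."""
    augmented = []
    tokens = word_tokenize(hypothesis)
    for i, (token, tag) in enumerate(pos_tag(tokens)):
        if not tag.startswith("NN"):  # only nouns are swapped
            continue
        for replacement, relation in candidate_swaps(token.lower()):
            new_label = LABEL_RULES.get((label, relation))
            if new_label is None:      # UNK rule: skip the ambiguous swap
                continue
            # Space-joined detokenization is a simplification of this sketch.
            new_hypothesis = " ".join(tokens[:i] + [replacement] + tokens[i + 1:])
            augmented.append((premise, new_hypothesis, new_label))
            if len(augmented) >= cap:  # bounded, as described later in 4.1
                return augmented
    return augmented
```

For the ENT-labelled hypothesis "A couple is playing with a dog outside", such a sketch yields, among others, the hypernym swap to "animal" (label kept as ENT) and the hyponym swap to "corgi" (label shifted to NEU), matching the worked examples discussed below.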
The rules that result in the Unknown (UNK) label were not used as part of the augmentation. Because of the inherent ambiguity of replacing a word in these contexts, the supervisory signal can be corrupted and lead the model to learn nonsensical rules. Importantly, we note that the remaining rules are robust, but they are not infallible: there is still the possibility, however small, that a newly generated example gets an incorrect label assigned to it. However, this was deemed acceptable, because labelling any hypothesis is inherently ambiguous and only partially correct, even when done by human experts, as developed in detail in recently published research which shows that, indeed, numerous examples can be found in SNLI and other similar datasets where human experts disagree on which label to assign to a hypothesis (Dua et al., 2021). By keeping only the more robust rules for augmentation, we ensure that the probability of generating a controversial example will be similar to the one induced by human experts, and will therefore not alter the underlying manifold of the dataset that the model is trying to learn.

The resulting augmentation benefits from being both fully automatic, as it requires no manual writing of new hypotheses or label annotation, and non-trivial. For example, we can observe that by using Rule 1 on the hypothesis "A couple is playing with a dog outside", the word "dog" might be replaced by "animal" (a hypernym), which according to the rule will retain the Entailment label. This is logically correct, while at the same time it produces an example where the swapped word can have a vector of significant distance in embedding space, thus incentivising the model to discover the correct relations in the corpus and move away from the distance function artifact. As another example, we can consider swapping the same word with a hyponym such as "corgi". Because the original hypothesis label is Entailment, according to Rule 3, the new hypothesis "A couple is playing with a corgi outside" would get assigned the Neutral label, which is again logically correct and a valid datapoint for training a model.

We can further observe that the number of possible augmented examples that can be generated grows exponentially with the length of the base hypothesis, because any noun in the hypothesis could be swapped with any word in its synset. In order to keep the training time bounded, our implementation enforces an upper bound of 10 augmented examples per hypothesis sentence. As anticipated, this approach leads to a 10 times larger dataset, and training time also increases in a linear fashion. In our implementation, we use the map() method of the Huggingface Dataset class; a sketch is shown below. This has the advantage of placing the augmented examples right below the original examples, as a result keeping the related examples together. This is very beneficial for the learning optimization techniques which we discuss in the sections that follow.
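The following sketch shows one way to wire the augmentation into a Huggingface datasets pipeline; the premise/hypothesis/label column names and the 0/1/2 integer label coding are assumptions based on the SNLI dataset as distributed on the Hub, and augment_example() is the sketch given earlier in this section. A batched map() call may return more rows than it receives, which is what places each augmented example directly after its original:

```python
# Sketch: expanding the training set with acda via datasets.map(); column
# names and label coding are assumptions based on the Huggingface SNLI dataset.
from datasets import load_dataset

ID2LABEL = {0: "ENT", 1: "NEU", 2: "CON"}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

def expand_batch(batch):
    premises, hypotheses, labels = [], [], []
    for p, h, l in zip(batch["premise"], batch["hypothesis"], batch["label"]):
        # Keep the original example, then place its augmentations right after
        # it, so that each instance bundle stays contiguous in the dataset.
        premises.append(p); hypotheses.append(h); labels.append(l)
        for ap, ah, al in augment_example(p, h, ID2LABEL[l], cap=10):
            premises.append(ap); hypotheses.append(ah); labels.append(LABEL2ID[al])
    return {"premise": premises, "hypothesis": hypotheses, "label": labels}

train = load_dataset("snli", split="train")
train = train.filter(lambda ex: ex["label"] != -1)  # drop pairs with no gold label
train = train.map(expand_batch, batched=True)       # roughly 10x larger training set
```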
4.2. Contrastive Learning

Having acquired a 10 times larger training set through acda, the question of taking maximum advantage of the training examples becomes pertinent. We decided to employ the recently published technique of contrastive learning (Dua et al., 2021) to further incentivise the model to learn the nuances of the decision boundary. According to that research, one technique to achieve this is for the model to see instance bundles during training, that is, examples that are close together and belong to a specific area of the decision boundary in the same training batch. This approach has been used in unsupervised linguistic structure prediction (Smith and Eisner, 2005) and supervised reading comprehension (Dua et al., 2021).

Since acda places the augmented examples right after each original one, the dataset batches provided to the model in each iteration will consist of some number of original examples and their augmentations. This way, we manage to have a dataset consisting of multiple instance bundles and therefore gain the maximum benefit from contrastive learning. In our implementation, we disabled dataset shuffling in our CustomTrainer class by overloading the _get_train_sampler() method of the Huggingface Trainer class.
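A minimal sketch of that override could look as follows, assuming a transformers version whose Trainer exposes the _get_train_sampler() hook:

```python
# Sketch: keep acda's instance bundles intact by iterating over the dataset
# in order instead of shuffling; hook name as in recent transformers versions.
from torch.utils.data import SequentialSampler
from transformers import Trainer

class CustomTrainer(Trainer):
    def _get_train_sampler(self):
        # The default Trainer returns a RandomSampler, which would scatter
        # the bundles across batches; a SequentialSampler preserves the
        # original-plus-augmentations ordering produced by acda.
        return SequentialSampler(self.train_dataset)
```

Shuffling could alternatively be performed at the bundle level, so that bundles are presented in random order while staying intact; the description above only requires that related examples end up in the same batch.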
4.3. Hybrid Loss

A potential downside of contrastive learning with instance bundles is that the model may focus on local distinctions without being able to correctly classify examples that it has not seen and that are further apart in decision space. In this scenario, the model is really learning many small multinomial classification problems, and misses out on larger-scale rules in the classification manifold. In order to mitigate this, we decided to combine both the Cross Entropy Loss (CE) and the NLL Loss, which uses the Maximum Likelihood Estimation (MLE) criterion. We call this new combined loss function Hybrid Loss and define it as follows:

L(o, l) = α · L_MLE(o, l) + (1 − α) · L_CE(o, l)    (1)

where o denotes the model output for an example and l its gold label. In the supervised setting, which includes our present natural language inference application, MLE (through the NLL Loss) is a much stronger training signal than CE. This is because CE does not provide a learning signal for the large space of alternative premises or hypotheses that are not in the neighbourhood of the current instance bundle. On the other hand, CE provides a much stronger signal for a small set of closely related and potentially confusing examples. Thus, its supervisory signal involves a smaller area of the decision boundary, as it will be made up of a small number of examples and their augmentations, all of which are close in decision space, as opposed to a larger number of examples all over the decision space. However, it will also be more complex in these localities, demanding a more fine-grained weight updating from the model and forcing it to learn the local properties of the decision boundary.

By combining both losses in a weighted-average manner, we manage to retain the advantages of both loss functions. The Cross Entropy Loss ensures that part of the loss signal will be directly relevant to the shortcomings of the model in the localities of the decision boundary, enabling contrastive learning, while the NLL Loss will incentivise generalization in areas that the model has not seen, learning rules that can only be inferred by looking at unrelated examples. With this arrangement we ensure a balance between the large number of examples in a small area of the decision space, and a smaller number of examples all over that space. Intuitively, the Hybrid Loss can be thought of as using the NLL Loss to cause the largest modifications of the current decision boundary, affecting more of the decision space, and the Cross Entropy Loss to fine-tune local areas according to the examples of each batch. In our implementation, we overloaded the compute_loss() method of the Huggingface Trainer class with our hybrid loss function as shown in Equation 1, with a value of 0.5 for the α parameter.
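The sketch below shows one plausible form of this override, with α = 0.5 as stated above. Since the exact form of the bundle-level CE term is not spelled out here, the input-side normalization used in the sketch (contrasting each example's gold-label score against the other examples of its batch, in the spirit of contrastive estimation) is an assumption, not necessarily the implementation used in our experiments:

```python
# Sketch of Equation 1 as a Trainer.compute_loss() override; the bundle-level
# contrastive CE term is one plausible reading, flagged as an assumption.
import torch
import torch.nn.functional as F
from transformers import Trainer

class HybridLossTrainer(Trainer):
    def __init__(self, *args, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha  # alpha = 0.5, as stated above

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, num_labels); a batch holds instance bundles

        # Global MLE/NLL term: ordinary negative log-likelihood of the gold label.
        mle_loss = F.cross_entropy(logits, labels)

        # Bundle-level contrastive CE term (assumption): normalize each example's
        # gold-label score over all examples in the batch, so closely related,
        # potentially confusing examples must be pulled apart.
        idx = torch.arange(logits.size(0), device=logits.device)
        gold_scores = logits[idx, labels]                      # (batch,)
        log_denom = torch.logsumexp(logits[:, labels], dim=0)  # (batch,)
        contrastive_loss = (log_denom - gold_scores).mean()

        # L(o, l) = alpha * L_MLE(o, l) + (1 - alpha) * L_CE(o, l)
        loss = self.alpha * mle_loss + (1 - self.alpha) * contrastive_loss
        return (loss, outputs) if return_outputs else loss
```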
5. Experimental Evaluation

We use the Huggingface Transformers (Wolf et al., 2019) Python package in order to train five models: four different BERT variants and the BERT-based ELECTRA-Small model. Then, we compare the fine-tuned baseline pre-trained language models with the respective acda-boosted pre-trained language models for both datasets. We select the best performing acda-boosted pre-trained language model and carry out an evaluation on our adversarial set in order to ensure that it successfully mitigates dataset artifacts. Finally, we present (Table 6) the outcome of our procedure on the same subset (Table 2) of our adversarial dataset that we used to demonstrate the influence of dataset artifacts.

5.1. Performance on Benchmark Datasets

We use SNLI and MNLI as our two benchmark datasets in order to present a comparison between fine-tuned baseline pre-trained language models from the Huggingface Transformers repository and their respective acda-boosted pre-trained language models. Our goal is to show that our approach consistently improves performance regardless of model or dataset choice.

SNLI Dataset. The first evaluation dataset is SNLI, a collection of 570,000 human-written English sentence pairs manually labeled for balanced classification (Bowman et al., 2015). We present the comparison between the fine-tuned baseline pre-trained language models and the respective acda-boosted pre-trained language models for SNLI in Table 4 below:

Model | Fine-tuned Baseline | Acda-boosted
BERT-Tiny | 78.86 | 82.01
BERT-Mini | 85.06 | 86.79
BERT-Small | 87.27 | 87.90
BERT-Medium | 88.92 | 89.01
ELECTRA-Small | 89.02 | 89.82

Table 4: Comparison of fine-tuned baseline pre-trained language models and their respective acda-boosted pre-trained language models for the SNLI dataset (accuracy, %).

Regarding the comparison results for SNLI, we notice that acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models. Specifically, we observe that models with a smaller architecture, such as BERT-Tiny and BERT-Mini, make the largest gains when they make use of acda, as their performance increases by 3.15% and 1.73% respectively. The rest of the models display a performance increase between 0.1% and 0.8%, while the best performing model is the acda-boosted ELECTRA-Small with an accuracy of 89.82%. Therefore, we can conclude that our approach consistently increases performance across all models, particularly lightweight ones, for SNLI.

MNLI Dataset. The second evaluation dataset is MNLI, a crowd-sourced collection of 433,000 sentence pairs annotated with textual entailment information. It is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation (Williams et al., 2017). We present the comparison between the fine-tuned baseline pre-trained language models and the respective acda-boosted pre-trained language models for MNLI in Table 5 below:

Model | Fine-tuned Baseline | Acda-boosted
BERT-Tiny | 65.24 | 69.06
BERT-Mini | 72.54 | 75.22
BERT-Small | 77.02 | 78.57
BERT-Medium | 80.20 | 80.39
ELECTRA-Small | 81.16 | 81.53

Table 5: Comparison of fine-tuned baseline pre-trained language models and their respective acda-boosted pre-trained language models for the MNLI dataset (accuracy, %).

Regarding the comparison results for MNLI, we notice that acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models. Once again, we observe that models with a smaller architecture are the ones that receive the largest performance boost, even higher than the one observed for SNLI. Specifically, BERT-Tiny and BERT-Mini increase their performance by 3.82% and 2.68% respectively when they employ acda. The rest of the models display a variable performance increase between 0.19% and 1.55%, while the best performing model is the acda-boosted ELECTRA-Small with an accuracy of 81.53%. Therefore, we reach the same conclusion as before: our approach consistently increases performance across all models, particularly lightweight ones, for MNLI.

Computational Efficiency. It is worth noting that we initially implemented our data augmentation rules for acda, as presented in Table 3, using the TextAttack package (Morris et al., 2020) as well as the Checklist package (Ribeiro et al., 2020). The result was a 60× increase in training time, while we also confirmed manually that they produced a smaller number of augmented examples in each iteration. According to the Huggingface training time estimator, this training procedure would take approximately 60 hours on Google Colab Pro for ELECTRA-Small. On the other hand, our own optimized implementation of acda only requires 9 hours of training for the same task, thus highlighting its computational efficiency.
5.2. Performance on Adversarial Examples

After showing that acda-boosted pre-trained language models provide a consistent improvement in performance for both the SNLI and the MNLI datasets when compared to the respective fine-tuned baseline pre-trained language models, we continue our evaluation by examining their behavior when facing adversarial examples. For this purpose, we make use of our hand-annotated adversarial set and, specifically, the adversarial examples of Table 2, which we discussed in Section 3. Recall that the predictions of Table 2 are those of the best performing fine-tuned baseline pre-trained language model, ELECTRA-Small. Despite having a prediction accuracy of 89.02% and 81.16% on the SNLI and MNLI validation sets respectively (Tables 4 and 5), the model did not classify any of the 12 adversarial examples of Table 2 correctly. We present the same 12 adversarial examples with the predictions of the acda-boosted ELECTRA-Small in Table 6 below:

# | Premise | Hypothesis | Label | Pred
1 | Two women are embracing while holding to go packages. | One of the women is holding takeaway packages. | ENT | ENT ✓
2 | ... | The packages contain food. | ENT | ENT ✓
3 | ... | The women have bought food. | ENT | ENT ✓
4 | ... | The women have bought lasagna. | NEU | ENT ✗
5 | A man in a blue shirt standing in front of a garage-like structure painted with geometric designs. | A man is wearing black trousers. | NEU | NEU ✓
6 | ... | His shirt features geometric designs. | CON | ENT ✗
7 | A young boy in a field of flowers carrying a ball. | He is carrying one ball. | ENT | ENT ✓
8 | ... | Ball in field. | ENT | ENT ✓
9 | Two doctors perform surgery on patient. | The two doctors are performing brain surgery. | NEU | ENT ✗
10 | ... | The patient is having heart surgery. | NEU | ENT ✗
11 | A white dog with long hair jumps to catch a red and green toy. | It is not a brown dog. | ENT | ENT ✓
12 | Kids are on a amusement ride. | Kids ride joyously an amusement ride. | ENT | ENT ✓

Table 6: A sample of hand-annotated adversarial examples and the predictions of the highest performing acda-boosted pre-trained language model (ELECTRA-Small).

Comparing the fine-tuned baseline pre-trained language model results (Table 2) and the acda-boosted pre-trained language model results (Table 6), we notice a considerable improvement in prediction accuracy, and we can therefore conclude that the acda-boosted pre-trained language model exhibits a robust behavior against adversarial examples due to its resilience against dataset artifacts. Specifically, it manages to classify 8 out of the 12 adversarial examples correctly. We can attribute its success to the improved training procedure having moved the model further away from dataset artifacts and towards greater reading comprehension. This is further supported by the fact that even for the adversarial examples which the acda-boosted pre-trained language model classifies incorrectly, we can manually confirm that, in the majority of the cases, the classification probability towards the gold label is significantly higher compared to the one produced by the respective fine-tuned baseline pre-trained language model.

6. Conclusions and Future Work

In this work we proposed a novel data augmentation technique, acda, discussed its advantages with respect to established data augmentation packages, and described how it can be naturally combined with a learning optimization method which utilizes contrastive learning and a hybrid loss function. We showed that the employment of this combined approach by pre-trained language models can lead to a consistent increase in performance, while requiring minimal computational cost regarding training time and resources. In particular, acda-boosted pre-trained language models consistently outperform the respective fine-tuned baseline pre-trained language models on benchmark datasets related to natural language inference. Furthermore, the acda-boosted pre-trained language models are also substantially more resilient to dataset artifacts and as a result display robust behavior and high performance against adversarial examples.

As a natural next step, we intend to further improve the data augmentation process by introducing more sophisticated rules. We believe that by expanding the rules in a structured manner, we can generate more closely related examples and improve performance metrics substantially. This will likely require a formal-logical treatment of the relationships between sentences when a word is swapped in a controlled manner. Similarly, coming up with a larger number of more complex rules, such as ones based on conditionals, is also promising, as this would further increase the size of the training set in a meaningful way and, given the computational efficiency of our procedure, it would come at a minimal cost on top of training. Finally, we intend to create modified variants of acda in order to expand our methodology to other domains of interest within natural language processing, where reading comprehension is vital.
Acknowledgements

The authors wish to express their gratitude to Prof. Greg Durrett and his staff for their guidance and contributions.

7. Bibliographical References

Bartolo, M., Roberts, A., Welbl, J., Riedel, S., and Stenetorp, P. (2020). Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September. Association for Computational Linguistics.

Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., and Inkpen, D. (2016). Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dua, D., Dasigi, P., Singh, S., and Gardner, M. (2021). Learning with instance bundles for reading comprehension. arXiv preprint arXiv:2104.08735.

Gardner, M., Artzi, Y., Basmova, V., Berant, J., Bogin, B., Chen, S., Dasigi, P., Dua, D., Elazar, Y., Gottumukkala, A., et al. (2020). Evaluating models' local decision boundaries via contrast sets. arXiv preprint arXiv:2004.02709.

Ghaeini, R., Hasan, S. A., Datla, V., Liu, J., Lee, K., Qadir, A., Ling, Y., Prakash, A., Fern, X. Z., and Farri, O. (2018). DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. arXiv preprint arXiv:1802.05577.

Glockner, M., Shwartz, V., and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. arXiv preprint arXiv:1805.02266.

Jia, R. and Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

Liu, N. F., Schwartz, R., and Smith, N. A. (2019). Inoculation by fine-tuning: A method for analyzing challenge datasets. arXiv preprint arXiv:1904.02668.

MacCartney, B. and Manning, C. D. (2008). Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521–528.

McCoy, R. T., Pavlick, E., and Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.

Meissner, J. M., Thumwanit, N., Sugawara, S., and Aizawa, A. (2021). Embracing ambiguity: Shifting the training target of NLI models. arXiv preprint arXiv:2106.03020.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., and Qi, Y. (2020). TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv preprint arXiv:2005.05909.

Munkhdalai, T. and Yu, H. (2017). Neural tree indexers for text understanding. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 1, page 11. NIH Public Access.

Nie, Y., Zhou, X., and Bansal, M. (2020). What can we learn from collective human opinions on natural language inference data? arXiv preprint arXiv:2010.03532.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. (2018). Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.

Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiskỳ, T., and Blunsom, P. (2015). Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Sha, L., Chang, B., Sui, Z., and Li, S. (2016). Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2870–2879.

Smith, N. A. and Eisner, J. (2005). Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 354–362.

Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. (2020). Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv preprint arXiv:2009.10795.

Wallace, E., Feng, S., Kandpal, N., Gardner, M., and Singh, S. (2019). Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Williams, A., Nangia, N., and Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Yaghoobzadeh, Y., Mehri, S., Tachet, R., Hazen, T. J., and Sordoni, A. (2019). Increasing robustness to spurious correlations using forgettable examples. arXiv preprint arXiv:1911.03861.

Zhang, Z., Wu, Y., Zhao, H., Li, Z., Zhang, S., Zhou, X., and Zhou, X. (2020). Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9628–9635.

Zhou, X. and Bansal, M. (2020). Towards robustifying NLI models against lexical dataset biases. arXiv preprint arXiv:2005.04732.