
Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning

Ahmed Elshabrawy1, Yongxin Huang2, Iryna Gurevych1,2, Alham Fikri Aji1
1 Department of Natural Language Processing, MBZUAI
2 Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt
1 {ahmed.elshabrawy,iryna.gurevych,alham.fikri}@mbzuai.ac.ae
2 www.ukp.tu-darmstadt.de

arXiv:2404.12897v2 [cs.CL] 22 Apr 2024

Abstract

While Large Language Models (LLMs) exhibit remarkable capabilities in zero-shot and few-shot scenarios, they often require computationally prohibitive sizes. Conversely, smaller Masked Language Models (MLMs) like BERT and RoBERTa achieve state-of-the-art results through fine-tuning but struggle with extending to few-shot and zero-shot settings due to their architectural constraints. Hence, we propose Statement-Tuning, a technique that models discriminative tasks as a set of finite statements and trains an encoder model to discriminate between the potential statements to determine the label. We do Statement-Tuning on multiple tasks to enable cross-task generalization. Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters. Moreover, the study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve sufficient performance with modest training data and benefits from task and statement diversity for unseen task generalizability.
1 Introduction

Large Language Models (LLMs) have shown great capabilities in zero-shot and few-shot settings (Radford et al., 2019; Brown et al., 2020; Artetxe et al., 2022). However, such capabilities are more difficult to observe in Encoder-only models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) due to their architectural design. These models are typically pretrained in an unsupervised manner on a large corpus with a Masked Language Modeling (MLM) (Devlin et al., 2019) or Discriminative (Clark et al., 2020) objective and fine-tuned by adding task-specific layers to enable their usage on a particular task such as binary/multi-label classification, token/sequence classification, multiple choice, etc. These task-specific layers, thus, cannot be extended effectively to new tasks in a few-shot or zero-shot manner.

In this work, we explore the feasibility of utilizing encoder models that are usually specialized for a certain task to take on various, unseen Natural Language Understanding (NLU) tasks, akin to zero-shot prompting in decoder models. One benefit of using encoder models is that they are generally more compact. Yet, encoder models have achieved state-of-the-art results on many NLU tasks through task-specific finetuning, and thus it would be interesting if zero-shot/few-shot prompting could be adapted to be compatible with encoder models to leverage their powerful NLU capabilities at more computationally feasible sizes.

To address this issue, some techniques try to reformulate various downstream tasks with a unified format resembling the pretraining objective (MLM or Discriminative pretraining), enabling few-shot transfer for encoder models (Xia et al., 2022; Schick and Schütze, 2021a,b; Gao et al., 2021). These approaches tend to have multiple limitations, including unintuitive/difficult implementations, complex handling of multi-token labels, and being unsuitable for zero-shot generalization.

In this work, we take inspiration from multi-task instruction tuning methods for decoder models (Wei et al., 2022a; Sanh et al., 2022) and unified format fine-tuning methods for encoder models to propose Statement-Tuning, a novel, intuitive approach for encoder-only models to generalize to zero-shot and few-shot unseen tasks using universal multitask fine-tuning. Our approach thus has generalization ability similar to decoder models.

As seen in Figure 1, we verbalize a diverse set of NLU tasks into natural language statements, and then we fine-tune an encoder-only MLM, RoBERTa, on a universal binary sequence classification task, which we call Statement-Tuning, to assign a truth value (True or False) to any given statement.
[Figure 1: example statements used for multi-task fine-tuning (QQP, Winogrande, and PIQA statements labeled True or False) and for zero-shot inference (Balanced COPA and Amazon Polarity statements scored by P(True), with the highest-probability statement selected as the predicted class).]

Figure 1: Overview of Statement-Tuning. We train an encoder to discriminate the truth value of statements from multiple tasks, then we apply it in the zero-shot setting by creating a statement for each possible target label and choosing the most likely one according to the encoder discriminator.
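The selection step sketched in Figure 1 can be written compactly. The notation below is introduced here for clarity and is not taken verbatim from the paper: s_y(x) denotes the statement built from input x and candidate label y, and P_theta is the statement discriminator.

\[
\hat{y} \;=\; \arg\max_{y \in \mathcal{Y}} \; P_{\theta}\big(\text{True} \mid s_y(x)\big)
\]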

By fine-tuning the encoder across diverse tasks and statements, we show zero-shot generalization capabilities to unseen tasks by similarly transforming them into statements. Moreover, we show few-shot capabilities by continually fine-tuning this model with a small amount of downstream data, also formatted into statements. Statement-Tuning is capable of matching or even outperforming the 32-shot and zero-shot performance of many state-of-the-art LLMs with a fraction of the parameters.

Our ablation study shows that depending on the task, we can achieve sufficient few-shot and zero-shot generalizability with as few as 1,000 statements per training dataset or approximately 16,000 training statement examples in total, which correspond to even fewer original task examples since one example can be turned into multiple statements through different templates. Furthermore, we find that the proximity of the fine-tuning tasks to the evaluation tasks, as well as the prompt and task diversity, tend to have a beneficial effect on the effectiveness of Statement-Tuning.

In summary, our primary contributions are:

1. To the best of our knowledge, we are the first to enable natural, zero-shot task generalization in encoder models by verbalizing the input into statements and fine-tuning the model to perform binary classification on the truth value of a statement.

2. We show that certain emergent abilities (Wei et al., 2022b), like zero-shot generalization on unseen tasks, previously thought to be exclusive to decoder-based LLMs can also be observed in much smaller encoder models when we do multitask Statement-Tuning.

3. We explore a large number of design choices to study how Statement-Tuning benefits from the number of statement examples and statement template and task diversity in multitask Statement-Tuning.
2 Related Work

Few-shot Approaches Using Encoder-Only Models. Prompt-based fine-tuning with cloze templates or label discrimination (Schick and Schütze, 2021a; Gao et al., 2021) effectively utilizes encoder models for few-shot learning. Prompts provide task context beyond traditional input-output training, enabling easier adaptation. However, multi-token labels and zero-shot generalization remain challenges (Schick and Schütze, 2021b; Tam et al., 2021). Our work addresses this by proposing a simpler reformulation method for improved unseen task generalization via multitask training.

Verbalizers in prompts leverage label semantics for task information (e.g., Tam et al., 2021). Similar approaches include label conditioning tasks (Tam et al., 2021) and TARS (Halder et al., 2020). While TARS implicitly learns label-text relations, our method uses verbalization to explicitly connect text and labels in natural language statements. This facilitates generalization to unseen tasks without further fine-tuning, as the model learns a general semantic understanding rather than a rigid input structure.

Using Masked Language Models for Sequence Scoring. In a sense, Statement-Tuning can be viewed as fine-tuning MLMs for universal NLU sequence scoring based on the truth value of a statement and the subsequent generalization to any discriminative task. Previous approaches have explored using MLMs for general sequence scoring for non-autoregressive generation (Wang and Cho, 2019; Ghazvininejad et al., 2019). Salazar et al. (2020) use MLM pseudo-log-likelihood (PLL) scores as discriminators for generated output in Automatic Speech Recognition and low-resource Machine Translation and achieve improved generation through hypothesis re-scoring using the MLM PLL scores.

Zero-shot Prompting and Multitask Tuning. LLMs excel at unseen-task/zero-shot generalization (Brown et al., 2020). Building on this, recent work explores multitask training with diverse prompts for improved zero-shot performance (Sanh et al., 2022; Wei et al., 2022a; Chung et al., 2022). These methods fine-tune large models on constructed datasets with various task prompts, achieving strong zero-shot results. However, effective instruction-tuned models often require billions of parameters (Zhang et al., 2024), limiting their application to smaller models. Our work demonstrates achieving similar or superior generalization on smaller encoder-only models with less training data.

3 Method: Statement-Tuning

In this section, we outline the steps involved in Statement-Tuning. First, tasks are verbalized into natural language statements. Then they are used to train the statement discriminator and derive the target label.

3.1 Task Verbalization

Any discriminative task with a finite set of targets can be verbalized into a finite set of natural language statements. Figure 2 shows an example of converting the MNLI task into statements. Similar to prompting, each task has its own statement templates, based on each possible label. The truth label for training purposes on each statement depends on whether the statement contains the correct target label or not. We outline all the statement templates used for each dataset in Appendix A.
Task: MNLI
Premise: Conceptually cream skimming has two basic dimensions - product and geography.
Hypothesis: Product and geography are what make cream skimming work.
Options: ["entailment", "neutral", "contradiction"]

Statement Conversion:
S1: "Conceptually cream skimming has two basic dimensions - product and geography" entails "Product and geography are what make cream skimming work".
S2: "Conceptually cream skimming has two basic dimensions - product and geography" is neutral with regards to "Product and geography are what make cream skimming work".
S3: "Conceptually cream skimming has two basic dimensions - product and geography" contradicts "Product and geography are what make cream skimming work".

Figure 2: Example conversion of the MNLI task to natural language statements.
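A minimal sketch of the verbalization step for the MNLI example in Figure 2, assuming the three templates shown there; the template dictionary and function name are illustrative, not the paper's actual code.

```python
# Sketch: turn one MNLI example into one statement per candidate label,
# marking the statement built from the gold label as True and the rest as False.
MNLI_TEMPLATES = {
    "entailment": '"{premise}" entails "{hypothesis}".',
    "neutral": '"{premise}" is neutral with regards to "{hypothesis}".',
    "contradiction": '"{premise}" contradicts "{hypothesis}".',
}

def verbalize_mnli(premise: str, hypothesis: str, gold_label: str):
    """Return (statement, truth_value) pairs, one per candidate label."""
    return [
        (template.format(premise=premise, hypothesis=hypothesis), label == gold_label)
        for label, template in MNLI_TEMPLATES.items()
    ]

statements = verbalize_mnli(
    "Conceptually cream skimming has two basic dimensions - product and geography.",
    "Product and geography are what make cream skimming work.",
    gold_label="neutral",
)
for text, is_true in statements:
    print(is_true, text)
```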
3.2 Statement Fine-Tuning

To create the training data for statement fine-tuning, we exhaustively generate statements across 16 diverse NLP tasks using many varied statement templates per dataset: QQP (Sharma et al., 2019), Winogrande (Sakaguchi et al., 2020), PiQA (Bisk et al., 2020), MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), Mintaka (Sen et al., 2022), Yelp Polarity (Zhang et al., 2015), WikiLingua (Ladhak et al., 2020), SQuAD (Rajpurkar et al., 2016), TweetEval's Offensive task (Zampieri et al., 2019), Massive (FitzGerald et al., 2022; Bastianelli et al., 2020), Definite Pronoun Resolution (Rahman and Ng, 2012), QASC (Khot et al., 2020), SciQ (Johannes Welbl, 2017), RACE (Lai et al., 2017), and SAMSum (Gliwa et al., 2019).

We explore different sample sizes for each dataset. We fine-tune RoBERTa (Liu et al., 2019) with a binary sequence classification head to predict the truth value of the statements. By fine-tuning the model across diverse tasks, templates, and domains, the model should be able to generalize across unseen templates and tasks, as long as the task can be phrased as a true/false statement.
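A condensed sketch of the binary statement-classification fine-tuning just described, using the Hugging Face Trainer API; the toy dataset (two statements from Figure 1), hyperparameters, and output path are placeholders rather than the paper's actual configuration.

```python
# Sketch: fine-tune RoBERTa with a 2-way sequence classification head on
# (statement, truth-value) pairs, where label 1 = True and label 0 = False.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder statement data; in practice this pool spans all 16 training tasks.
statement_dataset = Dataset.from_dict({
    "text": ['"What can one do after MBBS?" is a duplicate of "What do i do after my MBBS?"',
             "John moved the couch from the garage to the backyard to create space because The couch is small."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = statement_dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="statement-tuned-roberta",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=encoded, tokenizer=tokenizer).train()
```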
Model #Parameters BCOPA MRPC FigQA Amazon Polarity StoryCloze YA Topic Emotion
Llama-2-7b-chat 7B 86.6 54.4 40.1 90.5 78.5 47.8 50.0
Mistral-7B-Instruct-v0.2 7B 89.4 73.0 41.4 88.9 82.3 57.7 55.3
Qwen1.5-7B-Chat 7B 87.0 75.5 42.1 95.3 79.7 59.1 57.8
Pythia-6.9B 6.9B 82.2 62.0 41.7 83.3 71.2 32.2 25.1
Pythia-2.8B 2.8B 79.6 68.4 41.2 77.7 69.7 12.1 35.4
Phi-2 2.7B 87.2 67.9 41.8 86.6 77.7 38.7 53.1
FlanT5-Large 770M 67.6 81.1 40.1 96.0 63.0 51.0 59.9
Qwen1.5-0.5B-Chat 500M 69.2 32.6 38.7 69.7 68.9 21.9 6.6
BART-large-mnli 406M 50.4 35.8 46.9 49.4 47.3 6.5 11.7
FlanT5-Small 60M 52.8 31.9 42.0 88.8 51.5 24.5 21.7
Our Approach: RoBERTa-base (Best) 125M 73.0 71.9 61.3 93.6 83.6 44.6 55.5
Our Approach: RoBERTa-base (4k) 125M 69.6 69.8 59.3 92.7 82.8 41.2 55.5
Our Approach: RoBERTa-large (Best) 355M 85.0 72.5 74.7 95.3 93.0 51.1 55.1
Our Approach: RoBERTa-large (50k) 355M 85.0 71.4 72.2 95.3 92.9 51.1 55.1
Full/3000-shot:
RoBERTa-base (FT) 125M 74.2 87.0 88.1 94.3 - 71.0 82.2
RoBERTa-large (FT) 355M 86.0 87.6 92.0 96.5 - 68.5 78.2

Table 1: Comparison of our approach against many pretrained open-source Encoder-Decoder and Decoder-only
Pretrained Large Language Models on 7 Natural Language Understanding tasks in Zero-shot conditions. FT
stands for Full Finetuning and is included for reference. We highlight all scores in red where our approach with
RoBERTa-base (best) exceeds or is equal to the score given by the model.

3.3 Zero-Shot and Few-Shot Inference

To perform inference on statement-finetuned RoBERTa, we also need to transform the input into statements. Specifically, we exhaustively generate a statement for each possible label, as shown in Figure 1. Then, for each statement corresponding to each label, we predict the probability of such a statement being true. The final label is the statement with the highest true probability. Zero-shot inference is done by directly performing the aforementioned inference regime on the statement-finetuned RoBERTa, while K-shot inference is done by first performing continual fine-tuning on K examples of task-specific statements.
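A minimal sketch of the zero-shot inference rule just described, reusing the fine-tuned checkpoint from the earlier sketch; the checkpoint path, candidate statements, and the assumption that class index 1 corresponds to "True" are illustrative.

```python
# Sketch: score one statement per candidate label with the statement-tuned
# binary classifier and pick the label whose statement has the highest P(True).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "statement-tuned-roberta"  # placeholder path to a statement-tuned encoder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

candidates = {  # Amazon Polarity example from Figure 1
    "negative": 'The sentiment in "Amazing! This soundtrack is..." is negative.',
    "positive": 'The sentiment in "Amazing! This soundtrack is..." is positive.',
}

with torch.no_grad():
    p_true = {}
    for label, statement in candidates.items():
        inputs = tokenizer(statement, return_tensors="pt", truncation=True)
        logits = model(**inputs).logits
        p_true[label] = torch.softmax(logits, dim=-1)[0, 1].item()  # index 1 assumed = "True"

print(max(p_true, key=p_true.get), p_true)
```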
4 Experimental Setup

4.1 Evaluation

In this work, we measure our model's task generalizability using another set of 7 diverse datasets representing a variety of unseen tasks or unseen domains during training to judge the generalizability of our approach: Balanced COPA (BCOPA) (Kavumba et al., 2019; Roemmele et al., 2011), MRPC (Dolan and Brockett, 2005), Emotion (Saravia et al., 2018), Amazon Polarity (McAuley and Leskovec, 2013; Zhang et al., 2015), FigQA (Liu et al., 2022), StoryCloze (2016) (Yang et al., 2023), and Yahoo Answers Topics (Zhang et al., 2015). Among the evaluation data, BCOPA, Emotion, FigQA, StoryCloze, and Yahoo Answers Topics are unseen tasks and hence examine cross-task generalizability; MRPC and Amazon Polarity, on the other hand, represent seen tasks but in different domains and demonstrate domain generalizability.

4.2 Statement Finetuning Configurations

We statement-finetune both RoBERTa-base and RoBERTa-large across the diverse NLP tasks outlined in Section 3.2. However, as statement fine-tuning expands the dataset with various templates over all possible labels, it is arguably unwise to fine-tune on all possible generated statements. Moreover, each task has a different data size, leading to unbalanced fine-tuning data. Therefore, we sample statements randomly for each task, uniformly across true and false statements. We explore sample sizes from 1,000 statements to 50,000 statements for each dataset. Following best practices, we encourage invariance to phrasing in diverse statements by designing multiple statement templates per dataset (a list of all statement templates is shown in Appendix A). We also explore the effect of statement diversity during training in Section 5.2.

After statement tuning is completed, we can further continue fine-tuning the model on the target downstream dataset. Specifically, we explore various n-shot configurations: Full/3,000-shot, 1,000-shot, 500-shot, 200-shot, and 32-shot, where we use limited data from the training sets of the corresponding dataset to fine-tune our statement-tuned models. For the Full/3,000-shot case, we cap the training set at 3,000 examples; otherwise, we use the entire set (this is the case for Amazon Polarity only). For StoryCloze, there is no training set, so we only carry out 32-shot (using 32 samples from the test set for fine-tuning and evaluating on the rest) and zero-shot experiments. As for Yahoo Answers Topic and Emotion, due to them being multi-class classification tasks, we cap the n-shot analysis at 200-shot due to the larger number of choices per example (and hence a larger number of statements per example).
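A sketch of the per-dataset sampling described above: draw a fixed number of statements for each task, split evenly between true and false statements. The function name, seed, and the 4,000-statement example are illustrative.

```python
# Sketch: sample `n_per_dataset` statements per task, balanced across
# truth values, from exhaustively generated (statement, truth) pairs.
import random

def sample_balanced(statements, n_per_dataset, seed=0):
    """statements: list of (text, is_true) pairs for one dataset."""
    rng = random.Random(seed)
    true_pool = [s for s in statements if s[1]]
    false_pool = [s for s in statements if not s[1]]
    half = n_per_dataset // 2
    sample = (rng.sample(true_pool, min(half, len(true_pool)))
              + rng.sample(false_pool, min(half, len(false_pool))))
    rng.shuffle(sample)
    return sample

# e.g. 4,000 statements per dataset, one of the explored sample sizes:
# train_pool = {name: sample_balanced(stmts, 4000) for name, stmts in all_statements.items()}
```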
4.3 Other Baselines

To assess the feasibility of our approach, we compare statement-tuned RoBERTa-base models with 125 million parameters against a range of competitive multitask fine-tuned encoder-decoder models and decoder-only LLMs spanning a parameter range from 60 million to 7 billion parameters. We include the following open-source models: Llama-2-7B-chat (Touvron et al., 2023), Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), QWEN1.5-7B-chat and QWEN1.5-0.5B-chat (Bai et al., 2023), Pythia-6.9B and Pythia-2.8B (Biderman et al., 2023), Phi-2 (Li et al., 2023), FlanT5-Large and FlanT5-Small (Chung et al., 2022), and BART-large-mnli (Lewis et al., 2020). We use the chat/instruction-tuned versions of models to allow for better prompting. We try to select models that have not seen the evaluation data to the best of our knowledge; however, the training data of many of these models is not fully documented and there is always the possibility of contamination. We use the standard implementation of all models in the HuggingFace transformers library (Wolf et al., 2020).

We train and evaluate all models on a configuration of 5 AMD EPYC Rome CPU cores and 2 Nvidia Tesla A100 40GB GPUs.
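For context, a minimal sketch of how such decoder baselines are typically queried zero-shot with the transformers library; the prompt wording and model choice are illustrative and not the paper's exact evaluation protocol.

```python
# Sketch: zero-shot prompting of an instruction-tuned decoder baseline for a
# binary sentiment decision, for comparison with the statement-tuned encoder.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen1.5-0.5B-Chat")
prompt = (
    "Review: Amazing! This soundtrack is...\n"
    "Is the sentiment of the review positive or negative? Answer with one word."
)
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```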
5 Results and Analysis

In this section, we dive deep into the results of our experimentation to derive insights about our approach.

5.1 Overall Result

Statement-Tuning Enables Effective Zero-Shot Generalization on Encoder Models. Table 1 shows the zero-shot performance of statement-tuned RoBERTa. Recall that we explore various statement-tuning sizes; here we report the best performance across all training sizes, as well as the performance for the 4,000 and 50,000 sample sizes per dataset for the base and large models, respectively. The effect of statement-tuning sample size is explored in the later part of this paper.

The results show that the statement-tuned encoder model can achieve zero-shot generalization across unseen tasks and domains. Our approach achieves respectable performance on the unseen Balanced COPA, Figurative QA, StoryCloze, Yahoo Answers Topics, and Emotion tasks. Perhaps unsurprisingly, domain generalization in sentiment analysis on Amazon Polarity and paraphrase detection on MRPC is also displayed. We also see that the larger model (RoBERTa-large) achieved much better generalization capability in general.

Comparison Against Larger Zero-Shot Models. Our approach is also competitive against other pretrained open-source encoder-decoder and decoder-only Large Language Models under zero-shot prompting. Despite having significantly fewer parameters than all the models reported (except for FlanT5-small), our approach clearly matches or exceeds many of them on the tasks reported. It is worth noting that our approach with only 125M parameters almost completely outperforms all models under or equal to 6.9B parameters on almost all tasks (except for BCOPA) and is completely dominant on FigQA and StoryCloze, both of which are unrepresented in the training data.

Furthermore, we observe generally stronger performance achieved by the statement-tuned RoBERTa-large models, with consistently competitive performance that is on par with the 7B-parameter LLMs on all the tasks. Notably, the statement-tuned RoBERTa-large models greatly outperform all the 7B-parameter LLMs (approximately 20 times the number of parameters) on FigQA and StoryCloze, with the best performing RoBERTa-large model scoring an additional 32.6 and 10.7 accuracy points, respectively, over the best performing 7B-parameter LLM.

We observe similar results in the 32-shot setting (see Appendix C). These results demonstrate the significant potential of much smaller encoder models being accurate and light alternatives to LLM zero-shot (and few-shot) prompting in natural language understanding.

5.2 Ablation Studies

Statement Finetuning Sample Size. Recall that we only perform statement fine-tuning on a sample of all possible statements from the training dataset. Here, we explore the significance of sample size per dataset.

As shown in Table 2, a larger sample size does not always mean better overall performance for RoBERTa-base. Specifically, RoBERTa-base's average performance plateaus around 67.3% after 4k samples and does not increase even with significantly more data.
[Figure 3: per-task accuracy curves (BCOPA, MRPC, FigQA, Amazon Polarity, StoryCloze, Yahoo Topic, Emotion) plotted against Statement-Tuning training data size (1k to 50k statements per dataset) for each n-shot setting (Full/3000-shot, 1000-shot, 500-shot, 200-shot, 32-shot, 0-shot), with a random-choice baseline.]

Figure 3: N-shot accuracy of Statement-Tuned RoBERTa-base models across training datasets of different sizes. The x-axis denotes the number of statements per Statement-Tuning training dataset, with the number of training datasets fixed at 8. Additionally, a baseline comparison with random choice is included for reference.

Statement Sample  RoB-base  RoB-large
1,000   62.2  69.7
2,000   64.7  72.9
3,000   66.2  73.5
4,000   67.3  73.5
5,000   66.9  73.1
10,000  66.2  73.4
20,000  66.7  73.2
40,000  66.5  73.6
50,000  58.0  74.7

Table 2: Average accuracy over all evaluation tasks when trained with different statement sample size.
Interestingly, RoBERTa-large benefits and improves significantly from using a larger finetuning size, with the best average performance observed when it is finetuned on 50,000 statements per training corpus. We hypothesize that this is due to a larger "capacity" to understand and discriminate between natural language statements, which allows RoBERTa-large to benefit from more training data as opposed to RoBERTa-base, which has a more limited "capacity" to develop a general semantic understanding of the truthfulness of statements. Due to computational and time constraints, we are not able to exhaust the upper limit of RoBERTa-large's performance. Despite this, we already achieve remarkable performance with training set sizes of up to 50,000 statements. However, we leave extensive hyperparameter and training set size optimization for future work.

While the statement-tuned model shows zero-shot generalization, we can further fine-tune the model with the target downstream task. As seen in Figure 3, we investigate the effect of various statement tuning training sample sizes on the base model's accuracy across different n-shot continual fine-tuning on the 7 evaluation datasets. We observe a general trend where performance increases with more examples (shots) provided. Amazon Polarity has achieved strong performance even under a zero-shot setting, hence the improvement in a few-shot scenario is limited.

We experimented with re-purposing the model back as a standard classification model. However, performance was not as good in preliminary experimentation, so we do not explore this avenue further. Detailed results can be seen in Appendix D.

N-shot Performance. Additionally, there seems to be a general trend of improved performance by using a larger n for n-shot continual fine-tuning. However, the results seem to indicate a general trend of diminishing returns past 200-shot finetuning. Nevertheless, it seems that a great deal of the potential performance is achieved with the zero-shot application of the approach, hence further supporting the utility of our approach when task-specific data is scarce. Additionally, as seen in Fig. 5, there appears to be a great degree of correlation between 32-shot and 0-shot model performance, indicating that observed trends in the 0-shot scenario can be informative for the few-shot case.
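A brief sketch of the K-shot continual fine-tuning step (here with two toy BCOPA statements standing in for K downstream examples already verbalized into statements); the checkpoint path, hyperparameters, and dataset are placeholders, not the paper's configuration.

```python
# Sketch: continual fine-tuning of a statement-tuned checkpoint on K task-specific
# statement examples before running the same P(True) inference as in the zero-shot case.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "statement-tuned-roberta"  # placeholder: checkpoint from the earlier sketch
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

few_shot_statements = Dataset.from_dict({
    "text": ["The item was packaged in bubble wrap, because it was fragile.",
             "The item was packaged in bubble wrap, because it was small."],
    "label": [1, 0],
})
encoded = few_shot_statements.map(
    lambda b: tokenizer(b["text"], truncation=True), batched=True)

args = TrainingArguments(output_dir="kshot-roberta", num_train_epochs=10,
                         per_device_train_batch_size=8, learning_rate=1e-5)
Trainer(model=model, args=args, train_dataset=encoded, tokenizer=tokenizer).train()
```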
Effect of Statement Tuning Statement Diversity. As part of our investigation of Statement-Tuning, we would like to explore the effect of template diversity during Statement-Tuning. We hypothesize that randomly applying a larger number of different statement templates per training corpus will allow for improved performance on unseen tasks, as it will help the model be more robust to the phrasing of statement templates and prevent it from forming dependencies on superficial cues in some templates. To recall, each corpus employs several templates (see Appendix A).
[Figure 4: per-task Delta curves (BCOPA, MRPC, FigQA, Amazon Polarity, StoryCloze, Yahoo Topic, Emotion) against Statement-Tuning training data size (1k to 50k statements per dataset) for each n-shot setting (Full/3000-shot, 1000-shot, 500-shot, 200-shot, 32-shot, 0-shot).]

Figure 4: N-shot improvement of Statement-Tuned RoBERTa-base of varying training set sizes across training datasets of different sizes. The y-axis, Delta, is the difference between the accuracy of the Statement-Tuned model and the accuracy achieved by regular fine-tuning of RoBERTa-base on the task. A positive Delta indicates improvement over the baseline approach.

[Figure 5 data: Correlation Matrix between Shot Types and Average Accuracy]
               Full/3000  500-shot  200-shot  32-shot  0-shot
Full/3000-shot   1.00       0.74      0.69      0.54     0.45
500-shot         0.74       1.00      0.94      0.74     0.66
200-shot         0.69       0.94      1.00      0.78     0.75
32-shot          0.54       0.74      0.78      1.00     0.95
0-shot           0.45       0.66      0.75      0.95     1.00

Figure 5: N-shot correlation using the average accuracy across all training set sizes and evaluation sets.

SPC  BCOPA  MRPC  FigQA  AP    S-Cloze  YA Topic  Emotion  AVG
1    72.4   65.7  59.2   92.1  86.2     33.7      46.8     62.0
2    65.0   61.2  56.9   92.3  79.4     31.9      43.8     58.4
3    69.4   70.7  59.4   91.7  85.4     37.5      48.7     63.5
4    67.6   67.1  58.6   92.5  82.1     38.5      54.8     63.7
5    70.6   71.7  58.9   91.9  79.8     40.3      46.9     63.4

Table 3: The zero-shot performance of the base model using various degrees of SPC, where a larger SPC value indicates greater statement diversity during training. We report the average as the geometric mean of the task performance to account for the differing accuracy ranges of each task.
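Since Tables 3 and 4 report the average as a geometric mean over tasks, here is that aggregation spelled out, using the SPC=4 row of Table 3 as a worked example:

```python
# Sketch: geometric mean of per-task accuracies, as used for the AVG column.
import math

accuracies = [67.6, 67.1, 58.6, 92.5, 82.1, 38.5, 54.8]  # SPC=4 row of Table 3
geo_mean = math.exp(sum(math.log(a) for a in accuracies) / len(accuracies))
print(round(geo_mean, 1))  # ~63.7, matching the reported AVG
```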
In this experiment, we limit each corpus to only use a maximum of N different templates, which we call Statements per Category (SPC). We statement-tune RoBERTa-base models with a fixed training set size of 4,000 statements per corpus with a varying level of SPC.

The result can be seen in Table 3. We observe on average that increasing SPC increases performance to a certain extent before leveling off around 63.7% with an SPC of 4. The result is unsurprising; however, it confirms our decision to use many statement templates per task. Detailed exploration of statement design is left to future work.

Comparison Against Standard Fine-tuning. To observe the improvement over regular fine-tuning of RoBERTa-base, we also include Figure 4, where the y-axis, Delta, represents the improvement over regular fine-tuning for the particular n-shot. For zero-shot, we take random choice as a baseline. Generally, continually fine-tuning our model is better than fine-tuning vanilla RoBERTa under an extremely low n-shot setting. Moreover, in some instances, such as BCOPA and (to a certain extent) FigQA, we notice that even for a higher number of few-shot examples we tend to observe a benefit over regular fine-tuning.

Overall, this result demonstrates that the particular strength of our approach tends to be in the few-shot and zero-shot case, whereas, in the case of data availability, it tends to be better to directly fine-tune a RoBERTa-base model on the task. We assume that the superiority of our method in situations with limited target task data can be attributed not only to the improved generalizability gained from multitask statement tuning but also to the data augmentation effect brought by the statement tuning on few-shot target task examples, which allows the model to learn the relation between a text and not only the correct class label but also the wrong ones, thus improving data efficiency.

Effect of Statement Tuning Task Diversity. We explore how important it is to cover a variety of tasks during statement tuning.
Statement tuning training tasks (first 9 columns); Evaluation accuracy (last 8 columns)
PD CR NLI QnA SA WSD IC OLI SU BCOPA MRPC FIGQA AMAZON P. StoryCloze YA Topic Emotion AVG
x x x x x x x x x 71.6 70.0 58.6 92.9 80.7 39.6 49.3 63.8
x x x x x x x x 72.4 71.0 59.2 93.1 81.2 38.3 48.5 63.8
x x x x x x x 72.2 70.7 57.0 93.1 77.7 38.0 52.5 63.6
x x x x x x 70.4 71.2 58.0 93.4 80.9 18.6 46.0 56.7
x x x x x 72.8 71.7 57.0 93.5 75.7 25.7 49.4 59.6
x x x x 72.4 71.2 54.8 60.3 74.4 28.5 53.9 57.0
x x x 70.0 67.4 59.2 78.0 75.3 36.6 40.3 58.8

Table 4: Comparison of the effect of reducing task diversity in the training of Statement-Tuning models on zero-shot accuracy on unseen datasets. The last column is the average, computed as the geometric mean to account for the different accuracy ranges of the different evaluation sets. The total training set size remains constant at approximately 100,000 statements across all configurations.

In this experiment, we group statement tuning datasets across 9 different task categories: Summarization (SU), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI), Commonsense Reasoning (CR), Paraphrase Detection (PD), Word Sense Disambiguation (WSD), Intent Classification (IC), and Offensive Language Identification (OLI). The dataset-to-task-category breakdown can be seen in Appendix E. We then perform statement tuning on RoBERTa-base on various subsets of tasks. To control for the total training size, we dynamically sample the data until we have 100k total statements.
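One way to realize that budget-controlled sampling is sketched below: statements are drawn evenly across whichever task categories are included until a fixed total is reached. The function and the expected input structure are assumptions for illustration, not the paper's exact procedure or the full grouping from Appendix E.

```python
# Sketch: sample a fixed total budget of statements, spread uniformly across
# the included task categories (and across datasets within a category).
import random

def sample_budget(statements_by_category, total_budget=100_000, seed=0):
    """statements_by_category: {category: {dataset: [(text, is_true), ...]}}"""
    rng = random.Random(seed)
    per_category = total_budget // len(statements_by_category)
    sample = []
    for datasets in statements_by_category.values():
        per_dataset = per_category // len(datasets)
        for pool in datasets.values():
            sample.extend(rng.sample(pool, min(per_dataset, len(pool))))
    rng.shuffle(sample)
    return sample
```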
Table 4 shows the zero-shot performance of the statement tuning approach when the training set size is fixed but task types are incrementally removed from the mix. We generally observe that the inclusion of more tasks seems to be overall beneficial; however, performance also, unsurprisingly, correlates with how closely the training set resembles the evaluation set. For example, we see a significant jump in accuracy on Amazon Polarity once we introduce another out-of-domain sentiment analysis dataset. Likewise, MRPC's performance is already high since another paraphrase detection dataset is always introduced regardless of the configuration. Nevertheless, for truly unseen tasks, statement task diversity is beneficial, especially for multi-class classification tasks such as Yahoo Answers Topic and Emotion. On average, training task diversity seems to have an overall beneficial effect on generalizability and contributes to the ability of our approach to generalize effectively to unseen tasks.

6 Conclusion

As part of their emergent abilities, LLMs generalize to many unseen tasks and domains through few-shot and zero-shot prompting, but are prohibitively computationally expensive and difficult to adapt. To address this issue, we investigate Statement-Tuning, a novel technique for few-shot and zero-shot task generalization for encoder models. We find that this approach can match or outperform the few-shot and zero-shot prompting of many much larger decoder-only or encoder-decoder models on many tasks at a fraction of the parameters. Experimentation shows that the approach can be leveraged by training on as few as 16,000 statements. We find Statement-Tuning training task and prompt diversity to be generally helpful. We speculate that the benefits of this approach could extend beyond task generalization and could prove useful for cross-lingual task transfer, and we would like to explore this in future work.

Limitations

While our approach offers advantages in computational efficiency compared to LLMs, the cost scales with the number of possible targets due to the requirement of one forward pass per label. Additionally, task-specific full fine-tuning can still achieve better performance. Furthermore, the effectiveness of our approach in generalizing to unseen tasks relies on the similarity of those tasks to the training data. Tasks with minimal relation to existing labeled datasets might see limited performance compared to tasks with higher similarity. Finally, our reliance on encoder-based models restricts its application to Natural Language Understanding tasks, excluding tasks like translation or summarization.

Ethics Statement

We affirm our commitment to more accessible and climate-aware NLP, and hope this work inspires more computationally efficient approaches to NLP. All data and models we use are publicly available. Furthermore, the success of Statement-Tuning relies on fine-tuning pretrained encoder models, which are pretrained on large datasets, and hence, Statement-Tuning is susceptible to inheriting and enforcing any harmful biases existing in the pretraining data.
Acknowledgements

Yongxin Huang is supported by HUAWEI Technologies (Ireland) Co., Ltd.

References

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giridharan Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeffrey Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Veselin Stoyanov. 2022. Efficient large scale language modeling with mixtures of experts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11699–11732, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. CoRR, abs/2309.16609.

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. 2020. SLURP: A spoken language understanding resource package. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7252–7262, Online. Association for Computational Linguistics.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. CoRR, abs/2210.11416.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2022. MASSIVE: A 1M-example multilingual natural language understanding dataset with 51 typologically-diverse languages.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112–6121, Hong Kong, China. Association for Computational Linguistics.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China. Association for Computational Linguistics.

Kishaloy Halder, Alan Akbik, Josip Krapac, and Roland Vollgraf. 2020. Task-aware representation of sentences for generic text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3202–3213, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.

Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions.

Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, clever hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 33–42, Hong Kong, China. Association for Computational Linguistics.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. arXiv:1910.11473v2.

Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4034–4048, Online. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are all you need II: phi-1.5 technical report. CoRR, abs/2309.05463.

Emmy Liu, Chenxuan Cui, Kenneth Zheng, and Graham Neubig. 2022. Testing the ability of language models to interpret figurative language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4437–4452, Seattle, United States. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 165–172, New York, NY, USA. Association for Computing Machinery.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8732–8740. AAAI Press.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712, Online. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Priyanka Sen, Alham Fikri Aji, and Amir Saffari. 2022. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1604–1619, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Lakshay Sharma, Laura Graesser, Nikita Nangia, and Utku Evci. 2019. Natural language understanding with the Quora question pairs dataset. CoRR, abs/1907.01041.

Derek Tam, Rakesh R. Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4980–4991, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022b. Emergent abilities of large language models. Transactions on Machine Learning Research.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art natural language processing.

Mengzhou Xia, Mikel Artetxe, Jingfei Du, Danqi Chen, and Veselin Stoyanov. 2022. Prompting ELECTRA: Few-shot learning with discriminative pre-trained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11351–11361, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Sohee Yang, Jonghyeon Kim, Joel Jang, Seonghyeon Ye, Hyunji Lee, and Minjoon Seo. 2023. Improving probability-based prompt selection through unified evaluation and analysis. CoRR, abs/2305.14877.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 75–86.

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2024. Instruction tuning for large language models: A survey.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.
A Statement Templates

A.1 QQP Templates
"{{text1}}" is a duplicate of "{{text2}}"
"{{text1}}" duplicates {{text2}}
"{{text1}}" is not a duplicate of "{{text2}}"
"{{text1}}" does not duplicate "{{text2}}"

A.2 Winogrande Templates
In "{{sentence}}", _ is: {{option1/option2}}
Q: "{{sentence}}", A: {{option1/option2}}
The missing word in: "{{sentence}}" is {{option1/option2}}
_ in: "{{sentence}}" is {{option1/option2}}
"{{sentence}}", _ is: {{option1/option2}}

A.3 PiQA Templates
{{goal}} {{sol1/sol2}}
Goal: {{goal}}, Solution: {{sol1/sol2}}
If the goal is: {{goal}}, then the solution is: {{sol1/sol2}}
Problem: {{goal}}, Solution: {{sol1/sol2}}

A.4 MNLI and SNLI Templates
"{{text1}}" entails "{{text2}}"
{{text1}}? yes, {{text2}}
Premise: {{text1}}, Hypothesis: {{text2}}, label: Entailment
"{{text1}}" is neutral with regards to "{{text2}}"
{{text1}}? maybe, {{text2}}
Premise: {{text1}}, Hypothesis: {{text}}, label: Neutral
"{{text1}}" contradicts "{{text2}}"
{{text1}}? no, {{text2}}
Premise: {{text1}}, Hypothesis: {{text}}, label: Contradiction

A.5 Mintaka Templates
Q: {{question}}, A: {{answerText}}
{{question}} {{answerText}}
Question: {{question}}, Answer: {{answerText}}
The answer of {{question}} is {{answerText}}

A.6 Yelp Polarity Templates
"Title: {{title}}, Content: {{content}}" has negative sentiment
{{title}} {{content}} has negative sentiment
"Title: {{title}}, Content: {{content}}", Sentiment: Negative
{{title}} {{content}} It was terrible
The sentiment in "{{title}} {{content}}" is negative
"Title: {{title}}, Content: {{content}}" has positive sentiment
{{title}} {{content}} has positive sentiment
"Title: {{title}}, Content: {{content}}", Sentiment: Positive
{{title}} {{content}} It was great
The sentiment in "{{title}} {{content}}" is positive

A.7 WikiLingua Templates
Passage: {{source}}, Summary: {{target}}
The summary of "{{source}}" is {{target}}
Context: {{source}}, Summary: {{target}}
Q: Summarize the following: {{source}}, A: {{target}}
The answer of "Summarize the following {{source}}" is {{target}}

A.8 SQuAD Templates
Context: {{context}}\n Question: {{question}}\n Answer: {{answers/random_span}}
{{context}}\n According to the passage above, the answer of {{question}} is {{answers/random_span}}
Passage: {{context}}\n Question: {{question}}\n Answer: {{answers/random_span}}
{{context}}\n Q: {{question}}\n A:{{answers/random_span}}

A.9 BCOPA Templates
The cause of {{premise}} is that {{choice1/choice2}}
{{premise}} because {{choice1/choice2}}
{{premise}} due to {{choice1/choice2}}
The effect of {{premise}} is that {{choice1/choice2}}
{{premise}} therefore {{choice1/choice2}}
{{premise}}, so {{choice1/choice2}}

A.10 MRPC Templates
"{{text1}}" is a paraphrase of "{{text2}}"
"{{text1}}"\n In other words: "{{text2}}"
{{text1}}? yes, {{text2}}
"{{text1}}" can be stated as "{{text2}}"
"{{text1}}" is the same as saying "{{text2}}"

A.11 Amazon Polarity Templates
"Title: {{title}}, Content: {{content}}" has negative sentiment
{{title}} {{content}} has negative sentiment
"Title: {{title}}, Content: {{content}}", Sentiment: Negative
{{title}} {{content}} It was terrible
The sentiment in "{{title}} {{content}}" is negative
The emotions conveyed in "{{title}} {{content}}" are negative
"Title: {{title}}, Content: {{content}}" has positive sentiment
{{title}} {{content}} has positive sentiment
"Title: {{title}}, Content: {{content}}", Sentiment: Positive
{{title}} {{content}} It was great
The sentiment in "{{title}} {{content}}" is positive
The emotions conveyed in "{{title}} {{content}}" are positive

A.12 FigQA Templates
{{startphrase}} {{ending1/ending2}}
{{startphrase}} therefore {{ending1/ending2}}
startphrase: {{startphrase}}, ending: {{ending1/ending2}}
if {{startphrase}} then {{ending1/ending2}}
{{startphrase}} means {{ending1/ending2}}

A.13 StoryCloze Templates
{{input_sentence_1}} {{input_sentence_2}} {{input_sentence_3}} {{input_sentence_4}} {{sentence_quiz1/sentence_quiz2}}

A.14 Yahoo Topics Answers Templates
{{question_title}} {{question_content}} the topic is {{topic}}

A.15 Emotion Templates
{{question_title}} {{question_content}} the topic is {{topic}}
A.16 Offensive Templates
Task Statement Template
"{{text}}". The tweet is {{label}}.
This tweet "{{text}}" is considered {{label}}.
Offensive Tweet: "{{text}}". Label: {{label}}.
"{{text}}". This text is {{label}}.
The text "{{text}}" is {{label}}.

A.17 Massive Templates


Task Statement Template
The utterance "{{utt}}" is under the {{scenario}} scenario.
Utterance: "{{utt}}" Scenario: {{scenario}}
Massive
User: "{{utt}}". The best scenario for the user query is {{scenario}}.
The scenario of user’s utterance "{{utt}}" is {{scenario}}.

A.18 Definite Pronoun Resolution Templates


Task Statement Template
{{sentence_with_pronoun_replaced}}
{{sentence}}. Based on the sentence, {{pronoun}} refers to {{candidates}}.
DPR
The pronoun {{pronoun}} in "{{sentence}}" is referring to {{candidates}}.
{{sentence}}. ’{{pronoun}}’ refers to {{candidates}}.

A.19 QASC Templates


Task Statement Template
{{formatted_question}}. Answer: {{answer_key}}
Q: "{{formatted_question}}." A: {{answer_key}}
Question: "{{formatted_question}}." Answer: {{choices[answer_key]}}
Context: {{combined_facts}} Question: {{question}} Answer: {{choices[answer_key]}}
QASC
{{question}} Based on the passage "{{combined_facts}}", the answer if the question is "{{choices[answer_key]}}".
{{combined_facts}} {{question}} {{choices[answer_key]}}
Context: {{combined_facts}} Question: {{formatted_question}}. Answer: {{answer_key}}
{{formatted_question}}. The answer is {{answer_key}}

A.20 SciQ Templates


Task Statement Template
{{question}} {{correct_answer}}
Question: {{question}} Answer: {{correct_answer}}
SciQ {{support}} Question: {{question}} Answer: {{correct_answer}}
{{support}} According to the information, {{question}}. Answer: {{correct_answer}}.
The answer to the question {{question}}, according to "{{support}}" is {{correct_answer}}.

A.21 RACE Templates


Task Statement Template
RACE {{article}} {{question_replaced_with_answer}}
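
To make the construction concrete, the following is a minimal sketch (our own illustration, not the released code) of how the QQP templates in A.1 can be verbalized into labeled statements, assuming a binary true/false statement label; the helper name make_statements and the use of Python str.format placeholders are our assumptions.

```python
# Minimal sketch: turning a QQP example into labeled statements.
# Template wording follows Appendix A.1; function and variable names are ours.
import random

QQP_TEMPLATES = {
    True:  ['"{text1}" is a duplicate of "{text2}"',
            '"{text1}" duplicates {text2}'],
    False: ['"{text1}" is not a duplicate of "{text2}"',
            '"{text1}" does not duplicate "{text2}"'],
}

def make_statements(text1: str, text2: str, is_duplicate: bool):
    """Return (statement, label) pairs.

    A statement gets label 1 (true) when the template's assertion agrees with
    the gold QQP label, and 0 (false) otherwise.
    """
    examples = []
    for asserts_duplicate, templates in QQP_TEMPLATES.items():
        statement = random.choice(templates).format(text1=text1, text2=text2)
        label = int(asserts_duplicate == is_duplicate)
        examples.append((statement, label))
    return examples

# Example usage:
# make_statements("How can I learn Python?",
#                 "What is the best way to learn Python?", True)
# -> one true statement (label 1) and one negated statement (label 0)
```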
B Finetuning Setup
To fine-tune RoBERTa-base/RoBERTa-large with Statement-Tuning, we train for 15 epochs with an initial learning rate of 1e-06, a weight decay of 0.01, and a warm-up ratio of 0.1, holding out 10% of the training data for validation. The training batch size is 16 for RoBERTa-base and 8 for RoBERTa-large.
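
As a concrete reference, the following is a minimal sketch of this configuration using the Hugging Face Trainer API; the framework choice, dataset variables, and output path are our assumptions, not specified by the paper.

```python
# Minimal sketch of the fine-tuning setup described above (assumptions noted).
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "roberta-base"  # use "roberta-large" with a batch size of 8
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Binary head: each statement is classified as true (1) or false (0).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="statement-tuned-roberta-base",  # hypothetical output path
    num_train_epochs=15,
    learning_rate=1e-6,
    weight_decay=0.01,
    warmup_ratio=0.1,
    per_device_train_batch_size=16,  # 8 for RoBERTa-large
)

# `train_statements` / `val_statements` are assumed to be tokenized statement
# datasets, with the validation split taken as 10% of the training statements.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_statements, eval_dataset=val_statements)
# trainer.train()
```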

C 32-shot Generalization
Table 5 shows the results of 32-shot fine-tuning for our models and the baselines on 7 target downstream datasets. We observe trends similar to the zero-shot setting discussed in Section 5.1.

D Regular Classification of Statement-Tuned Models
In Figure 6, we visualize the relative improvement of our Statement-Tuned RoBERTa-base models, regularly fine-tuned on N-shot downstream data, over a regularly fine-tuned RoBERTa-base. The results are not as good as fine-tuning with statements.
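
The sketch below illustrates one way this head replacement can be done; the checkpoint path is hypothetical, and the use of the Hugging Face `ignore_mismatched_sizes` mechanism is our assumption about the implementation.

```python
# Sketch: discard the 2-way statement head and fine-tune on a downstream task.
from transformers import AutoModelForSequenceClassification

checkpoint = "./statement-tuned-roberta-base"  # hypothetical local checkpoint

# Loading with a different `num_labels` re-initializes the classification head,
# effectively dropping the old binary statement head.
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=6,                  # e.g. the six classes of the Emotion dataset
    ignore_mismatched_sizes=True,  # allow the head shape to change
)
# `model` can now be regularly fine-tuned on N-shot downstream data.
```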

E Task Categories Breakdown

For the statement-tuning task-diversity analysis, we group the corpora by task category as follows (a dictionary form of this grouping is sketched after the list):

1. Summarization (SU): WikiLingua, SAMSum
2. Sentiment Analysis (SA): Yelp Polarity
3. Question Answering (QA): Mintaka, SQuAD, QASC, SciQ, RACE
4. Natural Language Inference (NLI): MNLI, SNLI
5. Commonsense Reasoning (CR): Winogrande, PiQA
6. Paraphrase Detection (PD): QQP
7. Word Sense Disambiguation (WSD): Definite Pronoun Resolution
8. Intent Classification (IC): Massive
9. Offensive Language Identification (OLI): Tweet Eval’s Offensive
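
For reference, the same grouping can be written as a plain mapping (a sketch of ours, useful e.g. for ablating one category of statement data at a time).

```python
# Appendix E grouping as a plain dictionary; dataset names as listed above.
TASK_CATEGORIES = {
    "SU":  ["WikiLingua", "SAMSum"],                      # Summarization
    "SA":  ["Yelp Polarity"],                             # Sentiment Analysis
    "QA":  ["Mintaka", "SQuAD", "QASC", "SciQ", "RACE"],  # Question Answering
    "NLI": ["MNLI", "SNLI"],                              # Natural Language Inference
    "CR":  ["Winogrande", "PiQA"],                        # Commonsense Reasoning
    "PD":  ["QQP"],                                       # Paraphrase Detection
    "WSD": ["Definite Pronoun Resolution"],               # Word Sense Disambiguation
    "IC":  ["Massive"],                                   # Intent Classification
    "OLI": ["TweetEval Offensive"],                       # Offensive Language Identification
}
```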
Model | #Parameters | BCOPA | MRPC | FigQA | Amazon Polarity | StoryCloze | Yahoo Answers Topic | Emotion
Llama-2-7b-chat | 7B | 91.0 | 67.9 | 42.8 | 95.2 | 82.1 | 61.9 | 54.3
Mistral-7B-Instruct-v0.2 | 7B | 93.8 | 78.2 | 44.8 | 96.2 | 87.0 | 65.0 | 57.0
Qwen1.5-7B-Chat | 7B | 91.4 | 79.4 | 43.8 | 95.1 | 82.4 | 63.9 | 58.0
Pythia-6.7B | 6.7B | 84.6 | 66.9 | 39.2 | 91.6 | 74.0 | 38.3 | 52.0
Pythia-2.7B | 2.7B | 80.8 | 63.5 | 41.5 | 90.8 | 71.7 | 35.5 | 47.5
Phi-2 | 2.7B | 90.8 | 74.0 | 44.7 | 93.8 | 81.6 | 58.4 | 58.4
FlanT5-Large | 770M | 66.2 | 78.7 | 39.7 | 75.3 | 59.9 | 38.0 | 34.6
Qwen1.5-0.5B-Chat | 500M | 73.4 | 56.1 | 38.5 | 84.2 | 68.8 | 36.1 | 31.4
BART-large-mnli | 406M | 52.2 | 32.4 | 42.0 | 50.6 | 51.1 | 7.1 | 10.0
FlanT5-Small | 60M | 52.0 | 32.6 | 41.4 | 75.8 | 50.0 | 9.1 | 9.8
Our Approach: RoBERTa-base (Best) | 125M | 73.0 | 71.9 | 61.3 | 93.6 | 83.6 | 44.6 | 55.5
Our Approach: RoBERTa-base (4k) | 125M | 69.6 | 69.8 | 59.3 | 92.7 | 82.8 | 41.2 | 55.5
Our Approach: RoBERTa-large (Best) | 355M | 85.0 | 72.5 | 74.7 | 95.3 | 93.0 | 51.1 | 55.1
Our Approach: RoBERTa-large (50k) | 355M | 85.0 | 71.4 | 72.2 | 95.3 | 92.9 | 51.1 | 55.1
Full-shot:
RoBERTa-base (FT) | 125M | 74.2 | 87.0 | 88.1 | 94.3 | - | 71.0 | 82.2
RoBERTa-large (FT) | 355M | 86.0 | 87.6 | 92.0 | 96.5 | - | 68.5 | 78.2

Table 5: Comparison of our approach against open-source encoder-decoder and decoder-only pretrained Large Language Models on 7 Natural Language Understanding tasks in 32-shot conditions. We highlight in red all scores where our approach with RoBERTa-base (Best) equals or exceeds the score of the corresponding model.

[Figure 6: six panels (BCOPA, MRPC, FigQA, Amazon Polarity, Yahoo Topic, Emotion); x-axis: statement-tuning training data size (1k–50k); y-axis: Delta; one curve per fine-tuning budget (32-shot, 200-shot, 500-shot, 1000-shot, full/3000-shot).]
Figure 6: N-shot improvement of Statement-Tuned RoBERTa-base models used for regular fine-tuning. The y-axis, Delta, is the difference between the accuracy of the Statement-Tuned model fine-tuned directly on the task (after discarding the Statement-Tuning classification head) and the accuracy of regular fine-tuning of RoBERTa-base on the task. A positive Delta indicates improvement over the baseline approach.
