Enabling Natural Zero-Shot Prompting On Encoder Models Via Statement-Tuning

Ahmed Elshabrawy1, Yongxin Huang2, Iryna Gurevych1,2, Alham Fikri Aji1
1 Department of Natural Language Processing, MBZUAI
2 Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt
{ahmed.elshabrawy,iryna.gurevych,alham.fikri}@mbzuai.ac.ae
www.ukp.tu-darmstadt.de
Figure 1: Overview of Statement-Tuning. We train an encoder to discriminate the truth value of statements from multiple tasks, then we apply it in the zero-shot setting by creating a statement for each possible target label and choosing the most likely one according to the encoder discriminator. Example statements shown in the figure: Winogrande — "John moved the couch from the garage to the backyard to create space because The couch is small." (False); PIQA — "Goal: how do you flood a room? Solution: fill it with water." (True).
ment. By fine-tuning an encoder model across diverse tasks and statements, we show zero-shot generalization capabilities to unseen tasks by similarly transforming them into statements. Moreover, we show few-shot capabilities by continually fine-tuning this model with a small amount of downstream data, also formatted into statements. Statement-Tuning is capable of matching or even outperforming the 32-shot and zero-shot performance of many state-of-the-art LLMs with a fraction of the parameters.

Our ablation study shows that, depending on the task, we can achieve sufficient few-shot and zero-shot generalizability with as few as 1,000 statements per training dataset, or approximately 16,000 training statement examples in total, which correspond to even fewer original task examples, since one example can be turned into multiple statements through different templates. Furthermore, we find that the proximity of the fine-tuning tasks to the evaluation tasks, as well as prompt and task diversity, tends to have a beneficial effect on the effectiveness of Statement-Tuning.

In summary, our primary contributions are:

1. To the best of our knowledge, we are the first to enable natural, zero-shot task generalization in encoder models by verbalizing the input into statements and fine-tuning the model to perform binary classification on the truth value of a statement.

2. We show that certain emergent abilities (Wei et al., 2022b), such as zero-shot generalization to unseen tasks, previously thought to be exclusive to decoder-based LLMs, can also be observed in much smaller encoder models when we perform multitask Statement-Tuning.

3. We explore a large number of design choices to study how Statement-Tuning benefits from the number of statement examples and from statement template and task diversity in multitask Statement-Tuning.

2 Related Work

Few-shot Approaches Using Encoder-Only Models  Prompt-based fine-tuning with cloze templates or label discrimination (Schick and Schütze, 2021a; Gao et al., 2021) effectively utilizes encoder models for few-shot learning. Prompts provide task context beyond traditional input-output training, enabling easier adaptation. However, multi-token labels and zero-shot generalization remain challenges (Schick and Schütze, 2021b; Tam et al., 2021). Our work addresses this by proposing a simpler reformulation method for improved unseen-task generalization via multitask training.

Verbalizers in prompts leverage label semantics for task information (e.g., Tam et al., 2021). Similar approaches include label conditioning tasks (Tam et al., 2021) and TARS (Halder et al., 2020). While TARS implicitly learns label-text relations, our method uses verbalization to explicitly connect text and labels in natural language statements. This facilitates generalization to unseen tasks without further fine-tuning, as the model learns a general semantic understanding rather than a rigid input structure.
Using Masked Language Models for Sequence Scoring  In a sense, Statement-Tuning can be viewed as fine-tuning MLMs for universal NLU sequence scoring based on the truth value of a statement, with subsequent generalization to any discriminative task. Previous approaches have explored using MLMs for general sequence scoring for non-autoregressive generation (Wang and Cho, 2019; Ghazvininejad et al., 2019). Salazar et al. (2020) use MLM pseudo-log-likelihood (PLL) scores as discriminators for generated output in Automatic Speech Recognition and low-resource Machine Translation, and achieve improved generation through hypothesis re-scoring using the MLM PLL scores.
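For reference, the pseudo-log-likelihood used by Salazar et al. (2020) scores a sequence $W = (w_1, \dots, w_{|W|})$ by masking one token at a time and summing the masked-token log-probabilities:

\[ \mathrm{PLL}(W) = \sum_{t=1}^{|W|} \log P_{\mathrm{MLM}}\!\left(w_t \mid W_{\setminus t}\right) \]

This is background from their work rather than part of Statement-Tuning itself; our method instead trains a classification head to score statements directly.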
Zero-shot Prompting and Multitask Tuning  LLMs excel at unseen-task, zero-shot generalization (Brown et al., 2020). Building on this, recent work explores multitask training with diverse prompts for improved zero-shot performance (Sanh et al., 2022; Wei et al., 2022a; Chung et al., 2022). These methods fine-tune large models on constructed datasets with various task prompts, achieving strong zero-shot results. However, effective instruction-tuned models often require billions of parameters (Zhang et al., 2024), limiting their application to smaller models. Our work demonstrates similar or superior generalization with smaller encoder-only models and less training data.
3 Method: Statement-Tuning

In this section, we outline the steps involved in Statement-Tuning. First, tasks are verbalized into natural language statements. These statements are then used to train the statement discriminator and to derive the target label.

3.1 Task Verbalization

Any discriminative task with a finite set of targets can be verbalized into a finite set of natural language statements. Figure 2 shows an example of converting the MNLI task into statements. Similar to prompting, each task has its own statement templates, one per possible label. The truth label used for training on each statement depends on whether the statement contains the correct target label or not. We outline all the statement templates used for each dataset in Appendix A.

Figure 2: Example conversion of the MNLI task to natural language statements.
Task: MNLI
Premise: Conceptually cream skimming has two basic dimensions - product and geography.
Hypothesis: Product and geography are what make cream skimming work.
Options: ["entailment", "neutral", "contradiction"]
Statement Conversion:
S1: "Conceptually cream skimming has two basic dimensions - product and geography" entails "Product and geography are what make cream skimming work".
S2: "Conceptually cream skimming has two basic dimensions - product and geography" is neutral with regards to "Product and geography are what make cream skimming work".
S3: "Conceptually cream skimming has two basic dimensions - product and geography" contradicts "Product and geography are what make cream skimming work".
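To make the verbalization step concrete, below is a minimal Python sketch of converting one NLI example into statements following the templates in Figure 2; the function and template names are illustrative and not taken from a released implementation.

```python
# Minimal sketch: verbalize one NLI example into one statement per candidate label.
# Template wording follows Figure 2; names are illustrative only.
NLI_TEMPLATES = {
    "entailment":    '"{premise}" entails "{hypothesis}".',
    "neutral":       '"{premise}" is neutral with regards to "{hypothesis}".',
    "contradiction": '"{premise}" contradicts "{hypothesis}".',
}

def verbalize_nli(premise, hypothesis, gold_label):
    """Return (statement, truth_value) pairs; exactly one pair is True,
    namely the statement built from the gold label's template."""
    return [
        (tpl.format(premise=premise, hypothesis=hypothesis), label == gold_label)
        for label, tpl in NLI_TEMPLATES.items()
    ]

statements = verbalize_nli(
    "Conceptually cream skimming has two basic dimensions - product and geography",
    "Product and geography are what make cream skimming work",
    gold_label="neutral",
)
# -> three (statement, truth_value) pairs, one of which is True
```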
3.2 Statement Fine-Tuning

To create the training data for statement fine-tuning, we exhaustively generate statements across 16 diverse NLP tasks using many varied statement templates per dataset: QQP (Sharma et al., 2019), Winogrande (Sakaguchi et al., 2020), PiQA (Bisk et al., 2020), MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), Mintaka (Sen et al., 2022), Yelp Polarity (Zhang et al., 2015), WikiLingua (Ladhak et al., 2020), SQuAD (Rajpurkar et al., 2016), TweetEval's Offensive task (Zampieri et al., 2019), Massive (FitzGerald et al., 2022; Bastianelli et al., 2020), Definite Pronoun Resolution (Rahman and Ng, 2012), QASC (Khot et al., 2020), SciQ (Johannes Welbl, 2017), RACE (Lai et al., 2017), and SAMSum (Gliwa et al., 2019).

We explore different sample sizes for each dataset. We fine-tune RoBERTa (Liu et al., 2019) with a binary sequence classification head to predict the truth value of the statements. By fine-tuning the model across diverse tasks, templates, and domains, the model should be able to generalize to unseen templates and tasks, as long as the task can be phrased as a true/false statement.
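A minimal sketch of this fine-tuning step with the HuggingFace transformers library is shown below; it assumes the verbalized statements have already been pooled into (text, 0/1) pairs, and the hyperparameters and output directory are placeholders rather than our exact configuration.

```python
# Sketch: fine-tune RoBERTa with a binary head to predict statement truth values.
# `train_pairs` stands in for the pooled, statement-converted training data.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train_pairs = [
    ("Goal: how do you flood a room? Solution: fill it with water.", 1),
    ('"Conceptually cream skimming has two basic dimensions - product and geography" '
     'contradicts "Product and geography are what make cream skimming work".', 0),
]  # placeholder examples; in practice, pool statements from all 16 tasks

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

dataset = Dataset.from_dict({
    "text": [text for text, _ in train_pairs],
    "label": [label for _, label in train_pairs],
}).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="statement-tuned-roberta",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```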
3.3 Zero-Shot and Few-Shot Inference

To perform inference with statement-finetuned RoBERTa, we also need to transform the input into statements. Specifically, we exhaustively generate a statement for each possible label, as shown in Figure 1. Then, for each statement corresponding to each label, we predict the probability of that statement being true. The final label is the one whose statement receives the highest true probability. Zero-shot inference is done by directly applying this inference regime to the statement-finetuned RoBERTa, while K-shot inference is done by first performing continual fine-tuning on K examples of task-specific statements.
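A sketch of this per-label scoring is shown below; it assumes the binary head was trained with class index 1 meaning "true", which is a convention of this sketch rather than something fixed by the method.

```python
# Sketch: zero-shot prediction by scoring one verbalized statement per label.
import torch

def predict_label(model, tokenizer, statements_by_label):
    """statements_by_label: dict mapping each candidate label to its statement.
    Returns the label whose statement receives the highest 'true' probability."""
    model.eval()
    scores = {}
    with torch.no_grad():
        for label, statement in statements_by_label.items():
            inputs = tokenizer(statement, return_tensors="pt", truncation=True)
            probs = torch.softmax(model(**inputs).logits, dim=-1)
            scores[label] = probs[0, 1].item()  # class 1 == "true" in this sketch
    return max(scores, key=scores.get)
```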
4 Experimental Setup

4.1 Evaluation

We measure our model's task generalizability using another set of 7 diverse datasets representing a variety of tasks or domains unseen during training: Balanced COPA (BCOPA) (Kavumba et al., 2019; Roemmele et al., 2011), MRPC (Dolan and Brockett, 2005), Emotion (Saravia et al., 2018), Amazon Polarity (McAuley and Leskovec, 2013; Zhang et al., 2015), FigQA (Liu et al., 2022), StoryCloze (2016) (Yang et al., 2023), and Yahoo Answers Topics (Zhang et al., 2015). Among the evaluation data, BCOPA, Emotion, FigQA, StoryCloze, and Yahoo Answers Topics are unseen tasks and hence examine cross-task generalizability, whereas MRPC and Amazon Polarity represent seen tasks in different domains and demonstrate domain generalizability.

4.2 Statement Finetuning Configurations

We statement-finetune both RoBERTa-base and RoBERTa-large across the diverse NLP tasks outlined in Section 3.2. However, as statement fine-tuning expands the dataset with various templates over all possible labels, it is arguably unwise to fine-tune on all possible generated statements. Moreover, each task has a different data size, leading to unbalanced fine-tuning data. Therefore, we sample statements randomly for each task, uniformly across true and false statements. We explore sample sizes from 1,000 to 50,000 statements for each dataset. Following best practices, we encourage invariance to phrasing by designing multiple statement templates per dataset (a list of all statement templates is shown in Appendix A). We also explore the effect of statement diversity during training in Section 5.2.

After statement tuning is completed, we can further fine-tune the model on the target downstream dataset. Specifically, we explore various n-shot configurations: Full/3,000-shot, 1,000-shot, 500-shot, 200-shot, and 32-shot, where we use limited data from the training set of the corresponding dataset to fine-tune our statement-tuned models. For the Full/3,000-shot case, we cap the training set at 3,000 examples; otherwise, we use the entire set (this is the case for Amazon Polarity only). For StoryCloze, there is no training set, so we only carry out 32-shot (using 32 samples from the test set for fine-tuning and evaluating on the rest) and zero-shot experiments. As for Yahoo Answers Topic and Emotion, since they are multi-class classification tasks, we cap the n-shot analysis at 200-shot due to the larger number of choices per example (and hence a larger number of statements per example).
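The per-task sampling described above can be sketched as follows; the function name and the even true/false split reflect our reading of the text, and `n_per_task` corresponds to the 1,000-50,000 range we explore.

```python
# Sketch: sample a label-balanced subset of statements for one task.
import random

def sample_statements(task_statements, n_per_task, seed=0):
    """task_statements: list of (statement_text, is_true) pairs for one task.
    Returns roughly n_per_task statements, split evenly between true and false."""
    rng = random.Random(seed)
    true_pool = [s for s in task_statements if s[1]]
    false_pool = [s for s in task_statements if not s[1]]
    half = n_per_task // 2
    sample = (rng.sample(true_pool, min(half, len(true_pool)))
              + rng.sample(false_pool, min(half, len(false_pool))))
    rng.shuffle(sample)
    return sample
```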
4.3 Other Baselines

To assess the feasibility of our approach, we compare statement-tuned RoBERTa-base models, with 125 million parameters, against a range of competitive multitask fine-tuned encoder-decoder models and decoder-only LLMs spanning a parameter range from 60 million to 7 billion parameters. We include the following open-source models: Llama-2-7B-chat (Touvron et al., 2023), Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Qwen1.5-7B-Chat and Qwen1.5-0.5B-Chat (Bai et al., 2023), Pythia-6.9B and Pythia-2.8B (Biderman et al., 2023), Phi-2 (Li et al., 2023), FlanT5-Large and FlanT5-Small (Chung et al., 2022), and BART-large-mnli (Lewis et al., 2020). We use the chat/instruction-tuned versions of these models to allow for better prompting. We try to select models that, to the best of our knowledge, have not seen the evaluation data; however, the training data of many of these models is not fully documented and there is always the possibility of contamination. We use the standard implementation of all models in the HuggingFace transformers library (Wolf et al., 2020).

We train and evaluate all models on a configuration of 5 AMD EPYC Rome CPU cores and 2 Nvidia Tesla A100 40GB GPUs.

5 Results and Analysis

In this section, we dive into the results of our experiments to derive insights about our approach.

5.1 Overall Result

Statement-Tuning Enables Effective Zero-Shot Generalization on Encoder Models  Table 1 shows the zero-shot performance of statement-tuned RoBERTa. Recall that we explore various statement-tuning sample sizes; here we report the best performance across all training sizes, as well as the performance for the 4,000 and 50,000 sample sizes per dataset for the base and large models, respectively. The effect of the statement-tuning sample size is explored later in the paper.

Model | #Parameters | BCOPA | MRPC | FigQA | Amazon Polarity | StoryCloze | YA Topic | Emotion
Llama-2-7b-chat | 7B | 86.6 | 54.4 | 40.1 | 90.5 | 78.5 | 47.8 | 50.0
Mistral-7B-Instruct-v0.2 | 7B | 89.4 | 73.0 | 41.4 | 88.9 | 82.3 | 57.7 | 55.3
Qwen1.5-7B-Chat | 7B | 87.0 | 75.5 | 42.1 | 95.3 | 79.7 | 59.1 | 57.8
Pythia-6.9B | 6.9B | 82.2 | 62.0 | 41.7 | 83.3 | 71.2 | 32.2 | 25.1
Pythia-2.8B | 2.8B | 79.6 | 68.4 | 41.2 | 77.7 | 69.7 | 12.1 | 35.4
Phi-2 | 2.7B | 87.2 | 67.9 | 41.8 | 86.6 | 77.7 | 38.7 | 53.1
FlanT5-Large | 770M | 67.6 | 81.1 | 40.1 | 96.0 | 63.0 | 51.0 | 59.9
Qwen1.5-0.5B-Chat | 500M | 69.2 | 32.6 | 38.7 | 69.7 | 68.9 | 21.9 | 6.6
BART-large-mnli | 406M | 50.4 | 35.8 | 46.9 | 49.4 | 47.3 | 6.5 | 11.7
FlanT5-Small | 60M | 52.8 | 31.9 | 42.0 | 88.8 | 51.5 | 24.5 | 21.7
Our Approach: RoBERTa-base (Best) | 125M | 73.0 | 71.9 | 61.3 | 93.6 | 83.6 | 44.6 | 55.5
Our Approach: RoBERTa-base (4k) | 125M | 69.6 | 69.8 | 59.3 | 92.7 | 82.8 | 41.2 | 55.5
Our Approach: RoBERTa-large (Best) | 355M | 85.0 | 72.5 | 74.7 | 95.3 | 93.0 | 51.1 | 55.1
Our Approach: RoBERTa-large (50k) | 355M | 85.0 | 71.4 | 72.2 | 95.3 | 92.9 | 51.1 | 55.1
Full/3000-shot:
RoBERTa-base (FT) | 125M | 74.2 | 87.0 | 88.1 | 94.3 | - | 71.0 | 82.2
RoBERTa-large (FT) | 355M | 86.0 | 87.6 | 92.0 | 96.5 | - | 68.5 | 78.2

Table 1: Comparison of our approach against many pretrained open-source encoder-decoder and decoder-only large language models on 7 Natural Language Understanding tasks under zero-shot conditions. FT stands for full fine-tuning and is included for reference. We highlight all scores in red where our approach with RoBERTa-base (Best) equals or exceeds the score given by the model.

The results show that the statement-tuned encoder model can achieve zero-shot generalization across unseen tasks and domains. Our approach achieves respectable performance on the unseen Balanced COPA, Figurative QA, StoryCloze, Yahoo Answers Topics, and Emotion tasks. Perhaps unsurprisingly, domain generalization in sentiment analysis on Amazon Polarity and in paraphrase detection on MRPC is also displayed. We also see that the larger model (RoBERTa-large) achieves much better generalization capability in general.

Comparison Against Larger Zero-Shot Models  Our approach is also competitive against other pretrained open-source encoder-decoder and decoder-only large language models under zero-shot prompting. Despite having significantly fewer parameters than all the models reported (except FlanT5-Small), our approach clearly matches or exceeds many of them on the reported tasks. It is worth noting that our approach, with only 125M parameters, outperforms almost all models at or below 6.9B parameters on almost all tasks (except BCOPA) and is completely dominant on FigQA and StoryCloze, both of which are unrepresented in the training data.

Furthermore, we observe generally stronger performance from the statement-tuned RoBERTa-large models, which are consistently competitive and on par with the 7B-parameter LLMs on all tasks. Notably, the statement-tuned RoBERTa-large models greatly outperform all the 7B-parameter LLMs (approximately 20 times the number of parameters) on FigQA and StoryCloze, with the best-performing RoBERTa-large model scoring an additional 32.6 and 10.7 accuracy points, respectively, over the best-performing 7B-parameter LLM.

We observe similar results in the 32-shot setting (see Appendix C). These results demonstrate the significant potential of much smaller encoder models as accurate and lightweight alternatives to LLM zero-shot (and few-shot) prompting in natural language understanding.

5.2 Ablation Studies

Statement Finetuning Sample Size  Recall that we only perform statement fine-tuning on a sample of all possible statements from the training datasets. Here, we explore the significance of the per-dataset sample size.

As shown in Table 2, a larger sample size does not always mean better overall performance for RoBERTa-base. Specifically, RoBERTa-base's average performance plateaus around 67.3% after 4k samples and does not increase even with significantly more data.
Interestingly, RoBERTa-large benefits and improves significantly from a larger fine-tuning size, with the best average performance observed when it is fine-tuned on 50,000 statements per training corpus. We hypothesize that this is due to a larger "capacity" to understand and discriminate between natural language statements, which allows RoBERTa-large to benefit from more training data, as opposed to RoBERTa-base, which has a more limited "capacity" to develop a general semantic understanding of the truthfulness of statements. Due to computational and time constraints, we are not able to exhaust the upper limit of RoBERTa-large's performance. Despite this, we already achieve remarkable performance with training set sizes of up to 50,000 statements. However, we leave extensive hyperparameter and training set size optimization for future work.

Statement Sample | RoB-base | RoB-large
1,000 | 62.2 | 69.7
2,000 | 64.7 | 72.9
3,000 | 66.2 | 73.5
4,000 | 67.3 | 73.5
5,000 | 66.9 | 73.1
10,000 | 66.2 | 73.4
20,000 | 66.7 | 73.2
40,000 | 66.5 | 73.6
50,000 | 58.0 | 74.7

Table 2: Average accuracy over all evaluation tasks when trained with different statement sample sizes.

While the statement-tuned model shows zero-shot generalization, we can further fine-tune the model on the target downstream task. As seen in Figure 3, we investigate the effect of various statement-tuning training sample sizes on the base model's accuracy across different n-shot continual fine-tuning settings on the 7 evaluation datasets. We observe a general trend where performance increases with more examples (shots) provided. Amazon Polarity already achieves strong performance under the zero-shot setting, hence the improvement in the few-shot scenario is limited.

Figure 3: N-shot accuracy of statement-tuned RoBERTa-base models across training datasets of different sizes. The x-axis denotes the number of statements per Statement-Tuning training dataset, with the number of training datasets fixed at 8. Additionally, a baseline comparison with random choice is included for reference.

We experimented with re-purposing the model as a standard classification model. However, performance was not as good in preliminary experiments, so we do not explore this avenue further. Detailed results can be seen in Appendix D.

N-shot Performance  Additionally, there seems to be a general trend of improved performance when using a larger n for n-shot continual fine-tuning. However, the results also indicate diminishing returns past 200-shot fine-tuning. Nevertheless, a great deal of the potential performance is already achieved with the zero-shot application of the approach, further supporting the utility of our approach when task-specific data is scarce. Additionally, as seen in Figure 5, there appears to be a high degree of correlation between 32-shot and 0-shot model performance, indicating that trends observed in the 0-shot scenario can be informative for the few-shot case.

Effect of Statement Tuning Statement Diversity  As part of our investigation of Statement-Tuning, we would like to explore the effect of template diversity during Statement-Tuning. We hypothesize that randomly applying a larger number of different statement templates per training corpus will allow for improved performance on unseen tasks, as it will help the model be more robust to the phrasing of statement templates and prevent it from forming dependencies on superficial cues in some templates. To recall, each corpus employs several templates (see Appendix A). In this experiment, we limit each corpus to use at most N different templates, which we call Statements per Category (SPC).
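A minimal sketch of how this per-corpus cap can be applied is shown below; the helper name is ours, not from the released code.

```python
# Sketch: keep at most `spc` statement templates for a given training corpus.
import random

def cap_templates(templates, spc, seed=0):
    """Return the corpus's template pool reduced to at most `spc` templates;
    statements for that corpus are then generated only from this pool."""
    rng = random.Random(seed)
    return list(templates) if len(templates) <= spc else rng.sample(templates, spc)
```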
Figure 4: N-shot improvement of statement-tuned RoBERTa-base models of varying training set sizes across training datasets of different sizes. The y-axis, Delta, is the difference between the accuracy of the statement-tuned model and the accuracy achieved by regular fine-tuning of RoBERTa-base on the task. A positive Delta indicates improvement over the baseline approach.
We statement-tune RoBERTa-base models with a fixed training set size of 4,000 statements per corpus and a varying level of SPC.

The results can be seen in Table 3. We observe that, on average, increasing SPC improves performance up to a point, leveling off around 63.7% at an SPC of 4. The result is unsurprising, but it confirms our decision to use many statement templates per task. Detailed exploration of statement design is left to future work.

Table 3: Zero-shot performance of the base model using various degrees of SPC, where a larger SPC value allows more statement templates per training corpus; the final column averages performance to account for the differing accuracy ranges of each task. (Table columns: SPC, BCOPA, MRPC, FigQA, AP, S-Cloze, YA Topic, Emotion, AVG.)

Figure 5: N-shot correlation using the average accuracy across all training set sizes and evaluation sets.

Comparison Against Standard Fine-tuning  To observe the improvement over regular fine-tuning of RoBERTa-base, we also include Figure 4, where the y-axis, Delta, represents the improvement over regular fine-tuning for the particular n-shot setting. For zero-shot, we take random choice as the baseline. Generally, continually fine-tuning our model is better than fine-tuning vanilla RoBERTa under an extremely low n-shot setting. However, in some instances, such as BCOPA and (to a certain extent) FigQA, we notice that even for a higher number of few-shot examples we tend to observe a benefit over regular fine-tuning.

Overall, this result demonstrates that the particular strength of our approach lies in the few-shot and zero-shot cases, whereas, when task data is plentiful, it tends to be better to directly fine-tune a RoBERTa-base model on the task. We assume that the superiority of our method in situations with limited target-task data can be attributed not only to the improved generalizability gained from multitask statement tuning but also to the data augmentation effect brought by statement tuning on few-shot target-task examples, which allows the model to learn the relation between a text and not only the correct class label but also the wrong ones, thus improving data efficiency.

Effect of Statement Tuning Task Diversity  We explore how important it is to cover a variety of tasks during statement tuning. In this experiment, we group the statement-tuning datasets into 9 different task categories: Summarization (SU), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI), Commonsense Reasoning (CR), Paraphrase Detection (PD), Word Sense Disambiguation (WSD), Intent Classification (IC), and Offensive Language Identification (OLI).
Task categories retained | BCOPA | MRPC | FigQA | Amazon P. | StoryCloze | YA Topic | Emotion | AVG
9 | 71.6 | 70.0 | 58.6 | 92.9 | 80.7 | 39.6 | 49.3 | 63.8
8 | 72.4 | 71.0 | 59.2 | 93.1 | 81.2 | 38.3 | 48.5 | 63.8
7 | 72.2 | 70.7 | 57.0 | 93.1 | 77.7 | 38.0 | 52.5 | 63.6
6 | 70.4 | 71.2 | 58.0 | 93.4 | 80.9 | 18.6 | 46.0 | 56.7
5 | 72.8 | 71.7 | 57.0 | 93.5 | 75.7 | 25.7 | 49.4 | 59.6
4 | 72.4 | 71.2 | 54.8 | 60.3 | 74.4 | 28.5 | 53.9 | 57.0
3 | 70.0 | 67.4 | 59.2 | 78.0 | 75.3 | 36.6 | 40.3 | 58.8

Table 4: Comparison of the effect of reducing task diversity (among the PD, CR, NLI, QnA, SA, WSD, IC, OLI, and SU training task categories) on the zero-shot accuracy of Statement-Tuning models on unseen datasets; each row reports performance when the indicated number of task categories is retained, with categories removed incrementally. The last column is the average using the geometric mean to account for the different accuracy ranges of the different evaluation sets. The total training set size remains constant at approximately 100,000 statements across all configurations.
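For clarity, the geometric mean used for the AVG column over the per-task accuracies $a_1, \dots, a_n$ (here $n = 7$ evaluation sets) is:

\[ \mathrm{AVG} = \left( \prod_{i=1}^{n} a_i \right)^{1/n} \]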
A breakdown of datasets into task categories can be seen in Appendix E. We then perform statement tuning of RoBERTa-base on various subsets of tasks. To control for the total training size, we dynamically sample the data until we have 100k total statements.

Table 4 shows the 0-shot performance of the statement-tuning approach when the training set size is fixed but task types are incrementally removed from the mix. We generally observe that the inclusion of more tasks is overall beneficial; however, performance also unsurprisingly correlates with how closely the training set resembles the evaluation set. For example, we see a significant jump in accuracy on Amazon Polarity once we introduce another, out-of-domain, sentiment analysis dataset. Likewise, MRPC's performance is already high since another paraphrase detection dataset is included regardless of the configuration. Nevertheless, for truly unseen tasks, statement task diversity is beneficial, especially for multi-class classification tasks such as Yahoo Answers Topic and Emotion. On average, training task diversity has an overall beneficial effect on generalizability and contributes to the ability of our approach to generalize effectively to unseen tasks.

6 Conclusion

Our approach matches or even outperforms the few-shot and zero-shot prompting of many much larger decoder-only or encoder-decoder models on many tasks at a fraction of the parameters. Experimentation shows that the approach can be leveraged by training on as few as 16,000 statements. We find Statement-Tuning training task and prompt diversity to be generally helpful. We speculate that the benefits of this approach could extend beyond task generalization and could prove useful for cross-lingual task transfer, and we would like to explore this in future work.

Limitations

While our approach offers advantages in computational efficiency compared to LLMs, the cost scales with the number of possible targets due to the requirement of one forward pass per label. Additionally, task-specific full fine-tuning can still achieve better performance. Furthermore, the effectiveness of our approach in generalizing to unseen tasks relies on the similarity of those tasks to the training data. Tasks with minimal relation to existing labeled datasets might see limited performance compared to tasks with higher similarity. Finally, our reliance on encoder-based models restricts the approach to Natural Language Understanding tasks, excluding tasks like translation or summarization.

Ethics Statement
Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, clever hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 33-42, Hong Kong, China. Association for Computational Linguistics.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. arXiv:1910.11473v2.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777-789. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Winogrande: An adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8732-8740. AAAI Press.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699-2712, Online. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687-3697, Brussels, Belgium. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255-269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339-2352, Online. Association for Computational Linguistics.

Priyanka Sen, Alham Fikri Aji, and Amir Saffari. 2022. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1604-1619, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Lakshay Sharma, Laura Graesser, Nikita Nangia, and Utku Evci. 2019. Natural language understanding with the Quora question pairs dataset. CoRR, abs/1907.01041.

Derek Tam, Rakesh R. Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4980-4991, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022b. Emergent abilities of large language models. Transactions on Machine Learning Research.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122, New Orleans, Louisiana. Association for Computational Linguistics.
C 32-shot Generalization
Table 5 shows the results of 32-shot fine-tuning on the 7 target downstream datasets for our models and baselines. We observe trends similar to the zero-shot setting discussed in Section 5.1.
D Regular Classification of Statement-Tuned Models

In Figure 6, we visualize the relative improvement of our statement-tuned RoBERTa-base models regularly fine-tuned on N-shot downstream data over the regularly fine-tuned RoBERTa-base. The results are not as good as fine-tuning using statements.
Table 5: Comparison of our approach against many pretrained open-source Encoder-Decoder and Decoder-only
Pretrained Large Language Models on 7 Natural Language Understanding tasks in 32-shot conditions. We highlight
all scores in red where our approach with RoBERTa-base (best) exceeds or is equal to the score given by the model.
Figure 6: N-shot improvement of statement-tuned RoBERTa-base models used for regular fine-tuning. The y-axis, Delta, is the difference between the accuracy of the statement-tuned model fine-tuned for the task directly by discarding the Statement-Tuning classification head and the accuracy achieved by regular fine-tuning of RoBERTa-base on the task. A positive Delta indicates improvement over the baseline approach.