
R.I.P.: Better Models by Survival of the Fittest Prompts

Ping Yu¹, Weizhe Yuan¹², Olga Golovneva¹, Tianhao Wu³, Sainbayar Sukhbaatar¹, Jason Weston¹², Jing Xu¹

¹Meta  ²New York University  ³UC Berkeley. Correspondence to: Jing Xu <[email protected]>.

arXiv:2501.18578v1 [cs.CL] 30 Jan 2025

Abstract

Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high variance and low quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, which is from 18th place to 6th overall in the leaderboard.

[Figure 1 plot: AlpacaEval LC Winrate vs. Training Data Size (log scale, 512 to 16k), comparing self-RIP (ours), RIP (ours), Raw Wildchat, Jaccard Similarity Filtering, IFD Filtering, InsTag Difficulty Filtering, InsTag Diversity Filtering, Response PPL Filtering and LLM-as-Prompt-Quality-Judge, with GPT-4 Preview 1106 and Llama 3.1-8B-Instruct shown as reference lines.]

Figure 1: Our method Rejecting Instruction Preferences (RIP) for curating data, and Self-RIP for creating synthetic data. The x-axis represents the effective training set size (after filtering). At every data size, training on unfiltered WildChat prompts is significantly outperformed by RIP. RIP also outperforms various other curation baselines. Synthetic data built by Self-RIP improves results further.

1. Introduction

In large language model (LLM) development, a primary driver for advancing frontier models is curating high-quality training examples. This curation is crucial during both the pretraining (Rae et al., 2021; Touvron et al., 2023a) and post-training (finetuning) phases (Touvron et al., 2023b). Despite the widespread adoption of the "scaling hypothesis" (Kaplan et al., 2020), merely increasing the size of training datasets does not guarantee improved performance if the data are of low quality (Chen et al., 2024; Li et al., 2024c; Zhou et al., 2024). Without sufficient data quality, model training tends not to be fully robust to the associated noise, and final response quality from the model suffers.

Currently, there are a number of investigated techniques to curate data – most of which are based on heuristics or model judgments given the training inputs. In this work, we hypothesize that better judgments of data quality can be made by taking into account the model responses on those data. Specifically, if the prompt is of low quality, then responses exhibit high variability and low quality as well. This insight leads us to develop a method for either selecting prompts, or for creating high quality synthetic prompts, both of which yield significant performance gains during post-training.

Our method, Rejecting Instruction Preferences (RIP), considers the case of instruction finetuning via preference optimization. It starts with a set of preference pairs consisting of input prompts and chosen and rejected responses. RIP considers specific characteristics of the preference pairs, in particular rejected response quality and the reward gap between the chosen and rejected preference pair. If the rejected

quality is low or the reward gap is high, this is an indicator that the prompt is of low quality. We thus filter the prompts based on these metrics. The remaining prompts can subsequently be used to fine-tune the model using RLHF methods like Direct Preference Optimization (DPO) (Rafailov et al., 2023), or for creating new synthetic prompts via few-shot prompting. Table 1 illustrates that when trained on Wildchat prompts (Zhao et al., 2024b) and filtered by RIP, both Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct (Dubey et al., 2024) achieve large performance gains, surpassing many state-of-the-art models.

Standard models                              AlpacaEval2 LC Win   AlpacaEval2 Win   Arena-Hard Score   WildBench Score
GPT-4 Omni (05/13)                           57.5                 51.3              74.9               59.3
GPT-4 Turbo (04/09)                          55.0                 46.1              82.6               55.2
Llama 3.1-8B-Instruct                        20.9                 21.8              21.3               33.1
Llama 3.3-70B-Instruct                       38.9                 41.5              67.5               52.8
Llama 3.1-8B-Instruct + RIP (ours)           57.8                 57.2              43.1               45.6
Llama 3.1-8B-Instruct + Self-RIP (ours)      60.2                 61.1              42.1               42.5
Llama 3.3-70B-Instruct + RIP (ours)          67.7                 73.2              82.9               58.8

Table 1: Rejecting Instruction Preferences (RIP) and Self-RIP compared to SOTA models on AlpacaEval2, Arena-Hard and WildBench. By training Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct on Wildchat instructions curated by RIP, or synthetic data created by Self-RIP, our method surpasses many existing SOTA models.

Additionally, we conducted comprehensive experiments comparing the scaling behavior of our data under RIP filtering with that of unfiltered WildChat raw data, and six alternative filtering methods in Figure 1. Our results demonstrate that RIP significantly enhances model performance, while other filtering methods yield only marginal improvements. In addition to improvements observed with filtering human-written data such as Wildchat prompts or HelpSteer2 using different reward signals such as human, classifier or LLM-as-a-Judge, we also show RIP improves model performance as a method to create synthetic data.

Analysis of our method using t-SNE shows that RIP can eliminate certain undesirable clusters. Additionally, analysis with GPT-4 reveals that RIP effectively removes noisy or low quality prompts, ambiguous prompts, unsafe prompts, and examples where preference choices are incorrect.

2. Related Work

Data Selection in Pretraining Data. Given the high variance in quality of pretraining data, data filtering is a critical component for determining pretrained model quality (Hoffmann et al., 2022). In addition to heuristic preprocessing such as deduplication of similar documents, removal of datasets with heavy test-set overlap, and text extractions from raw Internet content, GPT-3 (Brown et al., 2020) applied text filtering to the CommonCrawl dataset based on similarity to high-quality reference data, significantly reducing final pretraining text data from 45TB down to a 570GB high-quality subset. As language models become more powerful, data curation can also be facilitated by using LLMs as a quality judge. Llama2 and Llama3 employ model-based quality classifiers to filter out non-English and low-quality content from pretraining data (Touvron et al., 2023b; Dubey et al., 2024). Rae et al. (2021); Soldaini et al. (2024) also demonstrate that applying simple filtering on massive texts brings substantial improvements on downstream performance across the board.

Data Selection in Supervised Fine-Tuning. Similarly, post-training also relies on high-quality data to enhance models' instruction-following capabilities. Previously, instruction-tuning was regarded as largely dependent on the size of available instruction-tuning examples (Mishra et al., 2021; Wei et al., 2021; Wang et al., 2022). More recent work has revealed that training on a smaller yet higher-quality curated set of prompts tends to be more effective in improving models' instruction-following capabilities (Zhou et al., 2024; Chen et al., 2024). To facilitate data selection, some employ traditional optimization-based data-pruning methods by measuring their impact on the model's generalization capabilities (Toneva et al., 2018; Yang et al., 2022; Xia et al., 2024). Another stream of work studies employing powerful language models to measure the complexity, diversity and quality of instructions (Lu et al., 2023; Chen et al., 2024; Touvron et al., 2023b; Dubey et al., 2024; Li et al., 2024c). Alternative filtering approaches proposed automatic metrics such as the IFD score (Li et al., 2023a), or INSTRUCTMINING, which fits a linearly weighted score over a bag of natural language indicators (Cao et al., 2023), to select examples.

Data Selection in RLHF and Preference Optimization. The success of preference-optimization methods (Stiennon et al., 2020; Rafailov et al., 2024) has attracted more attention to collecting large scale and high quality preference data. While extensive work shows scaling up preference data through bootstrapping (Xu et al., 2023b; Yuan et al., 2024b), synthesis approaches (Lambert et al., 2024; Wang et al., 2024b), or crowdsourcing (Touvron et al.,


2023b; Dubey et al., 2024), can boost model performance, the characterization and selection of high-quality pairwise examples is surprisingly underexplored. Most work involving preference optimization employs existing methods derived from pretraining and instruction-tuning (Touvron et al., 2023b; Dubey et al., 2024), such as deduplication, quality classifiers or filtering heuristics. However, such methods overlook the importance of the preference pairs (the chosen and rejected responses). Recent work (Wu et al., 2024a; Khaki et al., 2024) shows that preference optimization can be highly sensitive to the choice of response pairs of different reward gaps, focusing more on pair construction than data selection.

3. Rejecting Instruction Preferences (RIP)

We start by defining the prompt selection problem in the pairwise preference optimization setting. In this context, we present our proposed prompt-response-pair-based filtering method, which develops key descriptive metrics and their use in filtering training prompts. Lastly, we describe how our method can be applied to self-instruction setups where synthetic prompts are generated from the model itself.

3.1. Data Curation Problem

The goal of data curation is to remove low-quality prompts that can negatively affect the general instruction following capability of the model. Given a set of prompts X = {x}, we aim to find a subset S ⊆ X to be used for fine-tuning a seed LLM M. We consider the preference optimization setting, with winning (chosen) and losing (rejected) response pairs {yw, yl} with rewards r(yw|x) > r(yl|x) for each prompt x. The response pairs and their rewards can come from human preference data, or can be generated from the model itself M and then scored using an external reward model. For the latter we use the "best-vs-worst" preference pairing method (Pace et al., 2024), where N responses are sampled, and the ones with highest and lowest rewards are the chosen and rejected, respectively:

    {yi}_{i=1}^N ∼ M(x), then  yw = argmax_{yi} r(yi|x)  and  yl = argmin_{yi} r(yi|x).

We also consider alternate pairing methods in Section A.4. We then use the preference data {x, yw, yl}_{x∈S} for training the model M. Note that our focus is on filtering prompts entirely, not responses to those prompts.

3.2. Hypothesis on Data Selection

Although preferences are extensively used to train state-of-the-art LLMs, there is limited research on identifying unhelpful training examples in this setting. We posit that analyzing the paired model responses to given input prompts can provide valuable insights into the quality of the prompts. Specifically, we test the following two hypotheses.

Hypothesis 1: Low-quality prompts are likely to produce low-quality responses. Low-quality prompts - for example those that are unclear, ambiguous, or contain conflicting information - are likely to lead to noisy or inaccurate model responses. While those inaccurate responses can still be used as training targets in pairwise preference optimization, studies indicate that training on pairs with low-quality rejected responses might be sub-optimal. Yasunaga et al. (2024), for example, show that pairing the best response with a random one works well compared to pairing the best with the worst one with lowest reward. This suggests a potential correlation of the quality of the rejected example with the alignment outcome. Additionally, several studies (Wu et al., 2024b; Zhao et al., 2024a; Yuan et al., 2024a) have found a strong correlation between the length of responses, including rejected ones, and final performance. Therefore, we consider the reward r(yl|x) and length len(yl) of rejected responses as indicators of the quality of the training prompts x, i.e. large values of either of these metrics relative to other examples indicate higher quality.

Hypothesis 2: Low-quality prompts are likely to produce responses with larger variance. Low-quality prompts introduce uncertainty and ambiguity, leading to a broader range of interpretations. As the model or human generating the response might guess or fill in gaps in the prompt, this results in higher variance in responses. While some responses might align well with the intent, others may deviate significantly. A preliminary study in Wu et al. (2024a) finds that low-gap pairs, where chosen and rejected responses are similar, are high-quality informative pairs, leading to better performing DPO models. We therefore consider the reward gap r(yw|x) − r(yl|x) as another indicator of the quality of a training prompt, i.e. small reward gaps suggest that the prompt has higher quality.

3.3. RIP Filtering

3.3.1. RIP for Existing Training Prompts

Given the above hypotheses, we thus consider the following three metrics mk(x, yw, yl) that are based on the responses:

• Rejected response reward: m1 = r(yl|x)
• Rejected response length: m2 = len(yl)
• Reward gap: m3 = r(yw|x) − r(yl|x)

For each metric, we define threshold values that can be used for filtering. For the first two metrics, higher values are desired so we choose a lower-bound threshold

    S = {x | τk ≤ mk(x, yw, yl)}.

The last reward gap metric requires an upper threshold as we


want small gaps. Therefore we reduce the prompt selection problem to a threshold choice problem. To resolve this, we start with coordinate-wise experiments, analyzing model performance under various thresholds τk for individual metrics mk (details in Section A.2). Ultimately, we perform hyperparameter selection using all 3 parameters.

3.3.2. Self-RIP for Synthetic Prompts

Prompt curation by RIP can also naturally be used to generate synthetic data. First, RIP is used to create a seed pool of high-quality prompts. Few-shot examples from this seed pool guide the model to generate training prompts, which can be further filtered by RIP. We thus propose Self-RIP, a new approach to creating high-quality synthetic prompts:

Step 1: Few-shot prompting with RIP-curated instructions. We start with the set of prompts S curated by our proposed method RIP as described in Section 3.3.1. To generate new prompts S′ we sample from our seed model M following Self-Instruct (Wang et al., 2023; Honovich et al., 2023). For each new example we randomly select 8 prompts from S and feed them as few-shot examples to the model M to generate a prompt with similar characteristics. We apply the exact processing steps in Wang et al. (2023) to the new prompts S′, such as removing similar prompts (ROUGE-L similarity with any existing instructions < 0.7), and excluding those that contain certain keywords (e.g., image, picture, graph) that usually cannot be processed by text-only LLMs.

Step 2: Filtering with RIP. We further apply RIP on top of the synthetically generated prompts S′ from the previous step, filtering out the self-instructions using the same threshold values as used before. Then the remaining subset S′′ is used for training the seed model M.

Note we use RIP filtering twice here, once in each step. This is to ensure the quality of synthetic prompts. We also explore Self-RIP using a smaller subset of S as seed instructions in Section A.4 as part of our ablation studies.

4. Experimental Setup

We perform preference optimization using DPO, beginning with the Llama 3.1-8B-Instruct model as our seed model M. We evaluate both the selection and creation of prompts, focusing on two categories: human-written instructions and synthetically generated instructions. Finally, we extend our evaluation of RIP with the Llama 3.3-70B-Instruct model.

4.1. Human-Written Prompts

For human-written instructions, we specifically investigate two setups: human-written input prompts 1) paired with model-generated responses and annotated by a reward model; 2) with existing responses that have been annotated with human-assigned rewards. We use the WildChat and HelpSteer2 datasets, see statistics in Appendix Table 7.

4.1.1. WildChat Dataset

Prompt Set. We start with a large pool of over 250k human-written prompts from the WildChat (Zhao et al., 2024b) dataset. We exclude any non-English prompts based on WildChat annotations, and remove around 70k Midjourney-related instructions¹, yielding 190k unique first-turn prompts. These prompts are collected from real user interactions without human annotations, making them highly diverse. While there are many high-quality prompts, there are also a significant number of low-quality ones, such as nonsensical text or those lacking a clear question.

¹They start with "As a prompt generator for a generative AI called 'Midjourney', you will create image prompts ...".

Response Generation. Following Yuan et al. (2024b); Meng et al. (2024); Wu et al. (2024b) we generate our chosen and rejected response pairs on the WildChat prompts using our seed model M to make our setup closer to the on-policy setting. We use best-vs-worst as described in Section 3.1, generating N responses for each prompt x using M with sampling parameters of T = 0.8, top_p = 0.95.

Reward Annotation. We then evaluate candidate responses using two different judges:

• Reward Classifier: We used the ArmoRM reward model (Wang et al., 2024a) to score each response.
• LLM-as-a-Judge (Zheng et al., 2023): We prompt Llama 3.1-405B-Instruct using the prompt template outlined in Yasunaga et al. (2024) to assign a score ranging from 0 to 10 for each response. For each response, we conduct 10 independent evaluations and use the average score as the final reward.

The training example (x, yw, yl) is selected by appointing the highest-reward one as yw and the lowest-reward one as yl. For our primary experiments, we use the default value of N = 64. However, results for N = 8, 16, 32 are provided as part of our ablation studies in Table 17, and we use N = 32 for the Llama 3.3-70B-Instruct experiments.

We perform early stopping using a validation set of 470 examples: 253 valid set examples from Li et al. (2024c) and 218 examples from the evol-test set of Xu et al. (2023a), with prompts that overlap with AlpacaEval2 removed.

4.1.2. HelpSteer2 Dataset

HelpSteer2 (Wang et al., 2024c) consists of around 10k human-written prompts each with a response pair sampled from 10 different LLMs. Each response has human-


annotated rewards of helpfulness, correctness, coherence, complexity and verbosity on a Likert-5 scale. We use the aggregated reward with the recommended weighting [0.65, 0.8, 0.45, 0.55, 0.4].² The main distinction from WildChat is that the rewards come from human annotations instead of an external model. We perform early stopping on the HelpSteer2 validation split, selecting checkpoints with the highest average response rewards determined by ArmoRM.

²https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM

4.2. Synthetic Prompts

In this setup, we generate prompts from the seed model M itself for training instead of using human-written prompts. By varying the set of seed pool prompts used as few-shot examples, we collect two sets of training prompts:

• Self-Instruct: randomly select 8-shot examples from the unfiltered WildChat.
• Self-RIP: randomly select 8-shot examples from high quality WildChat prompts filtered by RIP.

In each case, we create 20k training prompts sampled with decoding parameters T = 0.8, top_p = 0.95. The rest of the setup, including response generation and DPO training, is exactly the same as the WildChat setup, where we use ArmoRM to construct response pairs (yw, yl), and do early stopping on the same validation set of 470 examples.

4.3. Baselines

We compare our method with the existing methods below. For instruction-tuning data selection methods which handle a single (non-pairwise) response per prompt, we apply them to the chosen responses within the response pairs. Additional details on the implementation of each baseline are provided in Appendix Section A.5.

4.3.1. Prompt-Based Filtering

InsTag Complexity. Lu et al. (2023) leveraged ChatGPT to create semantic and intent-based tags, subsequently fine-tuning an LLM as a data tagger using these tags. They then used the tag counts as a measure of complexity. This is used to filter out prompts with fewer tags to enhance complexity.

InsTag Diversity. The InsTag Diversity filtering method (Lu et al., 2023) characterizes a dataset as more diverse when it includes a greater variety of unique tags, as annotated by the specified tagger. Using this approach, we greedily filter out data samples whose associated tags are already present in the selected dataset.

LLM-as-Prompt-Judge. Employing LLMs as prompt quality judges has proven its efficacy in curating high-quality data (Chen et al., 2024; Dubey et al., 2024; Liu et al., 2023). We employ Llama 3.1-405B-Instruct to measure the quality of prompts on both a binary (useful/not useful) and pointwise scale (0-5). By sampling five Llama 3.1-405B-Instruct predictions per prompt and taking the average of LLM-as-Prompt-Judge predictions, we filter out less useful prompts by varying the cutoff thresholds.

4.3.2. Prompt-and-Chosen-Response-Based Filtering

Perplexity. We compute the perplexity (ppl) of the chosen response yw with Llama 3.1-8B-Instruct in a zero-shot manner as a filtering metric to curate training prompts. In particular, we retain examples with large ppl(yw|x) values, which may indicate the difficulty of the prompt.

Instruction-Following Difficulty (IFD). Li et al. (2023a) introduced the IFD to measure the model-specific difficulty of a data sample. A lower IFD score indicates that this particular instruction-response pair is considered relatively easy for the language model to understand and follow without further training. We filter out examples with a low IFD metric for a given pair of prompt x and chosen response yw.

4.3.3. Chosen-and-Rejected-Response-Based Filtering

Jaccard Similarity. In addition to the reward gap between chosen and rejected responses, we explore Jaccard similarity, defined as the number of overlapping words divided by the overall word counts, as an alternative similarity measurement. We thus filter out examples with low Jaccard similarity scores (i.e. fewer overlapping words) between chosen and rejected response pairs.

4.4. Training And Evaluation Setting

Following the Instruct setup in Meng et al. (2024), we utilize the DPO training approach with the off-the-shelf Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models, leveraging the fairseq2 library (Balioglu, 2023). We use a batch size of 64 and sweep over learning rates of 5e−7, 1e−6 for the Llama 3.1-8B-Instruct model, and a learning rate of 1e−6 with a batch size of 256 for the Llama 3.3-70B-Instruct model. Both models are trained with a dropout rate of 0.0 and a β value of 0.1 throughout the experiments. We conduct RIP with various cutoff thresholds, e.g. at the 25%, 50% and 75% percentile of each metric.

We primarily assess models' general instruction-following capabilities on three evaluation benchmarks: AlpacaEval2 (Li et al., 2023b), Arena-Hard (Li et al., 2024b) and WildBench (Lin et al., 2024). These benchmarks cover a wide range of natural yet challenging real-world user queries, and have been widely adopted by the research community.
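To make the Jaccard Similarity baseline of Section 4.3.3 concrete, the following is a minimal Python sketch (not the paper's implementation); it assumes simple whitespace tokenization, and the cutoff value is a placeholder rather than a threshold reported in the paper.

    def jaccard_similarity(chosen: str, rejected: str) -> float:
        # Word-level overlap between the chosen and rejected response.
        chosen_words = set(chosen.lower().split())
        rejected_words = set(rejected.lower().split())
        if not chosen_words or not rejected_words:
            return 0.0
        return len(chosen_words & rejected_words) / len(chosen_words | rejected_words)

    def filter_by_jaccard(pairs, cutoff=0.3):
        # Keep only preference pairs whose two responses share enough words.
        return [p for p in pairs
                if jaccard_similarity(p["chosen"], p["rejected"]) >= cutoff]

Under this baseline, examples with low similarity (few overlapping words between chosen and rejected responses) are the ones removed.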


5. Experiment Results

Due to the large amount of unfiltered WildChat prompts, we first assess whether standard DPO training saturates as the size of the training prompts grows. As shown in Appendix Figure 2, the Armo Score on the valid set dramatically improves as we increase the size of training prompts, and begins to plateau afterwards. This shows that growing the size of the training prompts arbitrarily does not bring additional gains, and hence quality control of the preference dataset could be important. We thus focus on 20k unique WildChat prompts, denoted as WildChat-20k, for Llama 3.1-8B-Instruct experiments, and 40k for Llama 3.3-70B-Instruct.

We report AlpacaEval2 Length-Controlled (LC) win rate, Arena-Hard score and WildBench WB-Score along with the number of training examples (after filtering if any) using WildChat-20k in Table 2, on HelpSteer2 in Table 4, and on Self-Instruction data in Table 5. Existing filtering methods are provided in Table 2 as baseline comparisons. Further details, such as hyperparameters, are in Appendix Table 8 and Table 9. Our findings lead to several key observations.

When filtering human-written instructions, RIP achieves the best performance on both human-scored and model-scored preference datasets. On the WildChat dataset where pairs are annotated by the ArmoRM model, we conduct RIP with various cutoff thresholds, at the 25%, 50% and 75% percentile of each metric. Our best model is trained on examples with rejected length larger than the 50% percentile of all rejected lengths, rejected rewards larger than the 50% percentile of all rejected rewards, and reward gap smaller than the 50% percentile. Table 2 shows that RIP significantly improves LC win rate over the Llama 3.1-8B-Instruct DPO baseline without filtering from 48.4% to 57.8% by filtering out 77% of the training examples, surpassing GPT-4 Omni (05/13) on AlpacaEval2. Similarly, RIP scores the highest on Arena-Hard (43.1) compared to LLM-as-Prompt-Judge filtering (42.0), Jaccard Similarity (42.6), and the no filtering baseline (37.9). RIP also achieves the highest WB-Score on WildBench (45.6) compared to other filtering and no filtering baselines (41.5). As shown in Appendix Table 8 using LLM-as-a-Judge annotated rewards, RIP also performs well. Finally, Table 4 demonstrates RIP is equally effective on HelpSteer2 where preference pairs are determined by human annotators, achieving the highest scores across all 3 evaluation benchmarks as compared to the baselines (no filtering and LLM-as-Prompt-Judge filtering).

RIP scales to different and larger models. We also tried RIP on a different base LLM – from the Llama 3.3 family rather than 3.1, and of a larger scale, 70B rather than 8B. As shown in Table 3, RIP also works on this larger model. Filtering dramatically boosts Llama 3.3-70B-Instruct DPO trained models, with AlpacaEval2 LC win rate improved from 54.3% to 67.7%, Arena-Hard from 70.5 to 82.9 and WildBench from 55.3 to 58.8, surpassing SOTA models as shown in Table 1. The prompt filtering threshold we applied to the 70B model was the same as in Llama 3.1-8B-Instruct + RIP (see Appendix Table 9), indicating potential weak-to-strong generalizability (Li et al., 2024a) of our method.

Existing filtering methods derived from supervised finetuning do not work as well on preference datasets. As demonstrated in Table 2, compared to the baseline WildChat-20k DPO (no filtering) trained on WildChat 20k prompts without any filtering, existing prompt-based filtering methods such as InsTag-Difficulty, InsTag-Diversity or LLM-as-Prompt-Judge filtering all lead to lower win rates on AlpacaEval2. LLM-as-Prompt-Judge, while outperforming certain filtering methods such as InsTag, achieves marginal gains compared to no filtering even though it is facilitated by querying a powerful LLM, Llama 3.1-405B-Instruct. Out of all the alternative methods tried, Jaccard Similarity based filtering, which takes into account response pairs for filtering, achieves relatively the highest scores across the 3 benchmarks, indicating that filtering that only takes into account prompts or chosen responses does not generalize well to the pairwise preference case.

The Self-RIP method to generate synthetic data outperforms Self-Instruct data. As shown in Table 5, Self-RIP yields better alignment results across all 3 evaluation benchmarks as compared to those trained on Self-Instruct data. In particular, win rate improves from 49.1% to 60.2% on AlpacaEval2, and from 38.5% to 42.1% on Arena-Hard. This result implies that our method generates better quality instructions than generating via few-shot examples from unfiltered prompts as in Self-Instruct.

Self-RIP synthetic data outperforms human-written instructions. In Table 5, models trained on synthetic prompts outperform those trained on 20k human-written WildChat prompts. Applying Self-RIP few-shot generation without post-filtering gives an equal amount of 20k prompts, but still increases the AlpacaEval2 LC win rate from 48.4% to 53.6%, Arena-Hard win rate from 37.9% to 43.7% and WB-Score on WildBench from 41.5 to 44.8. This further illustrates the importance of training on high-quality instructions. When applying the full Self-RIP method with post-filtering, results are further improved, for example achieving the best AlpacaEval2 LC win rate of 60.2%.

RIP seed data selection and RIP post-filtering are both important for generating Self-RIP synthetic data. In Table 5, we perform ablations on Self-RIP. We try: (i) using RIP to select high quality few-shot examples but not for curating the resulting generations (post-filtering); (ii) applying standard (Self-Instruct) few-shot generation, but then applying RIP post-filtering; or (iii) applying RIP to both few-shot generation and post-filtering (our default method).
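As a rough illustration of the Self-RIP generation loop ablated above (Section 3.3.2), the sketch below assumes hypothetical generate and rouge_l helpers wrapping the seed model and a ROUGE-L scorer; it is not the authors' code.

    import random

    def self_rip_generate(seed_pool, generate, rouge_l, n_target=20000):
        # Few-shot generate new prompts from RIP-curated seeds, dropping near-duplicates.
        synthetic = []
        while len(synthetic) < n_target:
            shots = random.sample(seed_pool, 8)            # 8 few-shot examples from RIP-filtered prompts
            few_shot_context = "\n\n".join(shots) + "\n\n"
            candidate = generate(few_shot_context, temperature=0.8, top_p=0.95)
            # Self-Instruct style post-processing: skip prompts too similar to existing ones.
            if all(rouge_l(candidate, other) < 0.7 for other in seed_pool + synthetic):
                synthetic.append(candidate)
        return synthetic  # Step 2 then applies RIP filtering to these candidates again.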

6
Rejecting Instruction Preferences (RIP )

                                              # Train examples    AlpacaEval2 LC Win    AlpacaEval2 Win    Arena-Hard Score    WildBench Score
Baseline
Llama 3.1-8B-Instruct (seed model) - 20.9 21.8 21.3 33.1
WildChat-20k DPO (no filtering) 20000 48.4 45.9 37.9 41.5
WildChat-20k DPO (best-vs-bottom-25%) 20000 48.2 45.9 40.7 44.5
Prompt-Based Filtering
LLM-as-Prompt-Judge Binary 4299 45.5 41.0 42.0 43.3
LLM-as-Prompt-Judge Pointwise 15963 47.4 47.4 40.7 45.2
InsTag-Difficulty 10000 46.3 39.0 39.0 42.4
InsTag-Diversity 9952 40.1 41.1 40.4 43.4
Prompt-and-Chosen-Response Based Filtering
IFD on Prompt + Chosen Response 9902 47.6 37.6 32.2 42.2
ppl(Chosen Response) 14851 45.6 45.5 40.8 43.4
Chosen-Reject-Response Based Filtering
Jaccard Similarity(Chosen, Rejected) 9904 49.0 46.6 42.6 43.7
RIP 4538 57.8 57.2 43.1 45.6

Table 2: RIP compared to existing filtering methods on WildChat with Llama 3.1-8B-Instruct. RIP , which selects
only 4538 WildChat prompts for DPO training, outperforms existing filtering methods on AlpacaEval2, Arena-Hard &
WildBench. DPO response pairs are constructed using ArmoRM to score responses.
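For concreteness, a minimal sketch of the RIP selection rule of Section 3.3.1 as used for the WildChat results above, with percentile-style cutoffs (the 50th-percentile setting follows Section 5; the exact threshold values per run are listed in Appendix Table 9). This is a sketch, not the authors' implementation.

    import numpy as np

    def rip_filter(examples, q_len=50, q_rew=50, q_gap=50):
        # Keep prompts whose rejected response is long and high-reward, and whose reward gap is small.
        # Each example is a dict with keys: reward_chosen, reward_rejected, len_rejected.
        rej_rewards = np.array([e["reward_rejected"] for e in examples])
        rej_lengths = np.array([e["len_rejected"] for e in examples])
        gaps = np.array([e["reward_chosen"] - e["reward_rejected"] for e in examples])

        tau_reward = np.percentile(rej_rewards, q_rew)   # lower bound on m1 = r(yl|x)
        tau_length = np.percentile(rej_lengths, q_len)   # lower bound on m2 = len(yl)
        tau_gap = np.percentile(gaps, q_gap)             # upper bound on m3 = r(yw|x) - r(yl|x)

        return [e for e, r, l, g in zip(examples, rej_rewards, rej_lengths, gaps)
                if r >= tau_reward and l >= tau_length and g <= tau_gap]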

                                              # Train examples    AlpacaEval2 LC Win    AlpacaEval2 Win    Arena-Hard Score    WildBench Score
Llama 3.3-70B-Instruct (seed model) - 38.9 41.5 67.5 52.8
WildChat-40k DPO (no filtering) 40000 54.3 51.6 70.5 55.3
RIP 17725 67.7 73.2 82.9 58.8

Table 3: RIP on WildChat with Llama 3.3-70B-Instruct. RIP outperforms no filtering on AlpacaEval2, Arena-Hard &
WildBench. DPO response pairs are constructed using ArmoRM to score responses.
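The DPO response pairs referenced in the captions of Tables 2 and 3 follow the best-vs-worst construction of Section 3.1. A small sketch, where sample_responses and reward_model are placeholders for the seed model's sampler and the ArmoRM scorer:

    def build_preference_pair(prompt, sample_responses, reward_model, n=64):
        # Sample N responses and keep the highest- and lowest-reward ones as (chosen, rejected).
        responses = sample_responses(prompt, n=n, temperature=0.8, top_p=0.95)
        rewards = [reward_model.score(prompt, y) for y in responses]
        chosen = responses[rewards.index(max(rewards))]    # y_w = argmax r(y_i | x)
        rejected = responses[rewards.index(min(rewards))]  # y_l = argmin r(y_i | x)
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected,
                "reward_chosen": max(rewards), "reward_rejected": min(rewards)}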

We find that both components of our full method are important, yielding the best results, with method (i) outperforming Self-Instruct, and method (ii) performing better than (i), but worse than our full method (iii).

HelpSteer2                                # Train examples    AlpacaEval2 LC Win    AlpacaEval2 Win    Arena-Hard Score    WildBench Score
Llama 3.1-8B-Instruct (seed model)        -                   20.9                  21.8               21.3                33.1
HelpSteer2 DPO (no filtering)             10161               25.2                  23.1               26.8                37.1
LLM-as-Prompt-Judge filtering             5376                27.8                  25.7               29.5                37.2
RIP                                       5081                34.6                  32.8               35.0                39.5

Table 4: RIP on HelpSteer2 with Llama 3.1-8B-Instruct. Applying RIP to DPO models trained on HelpSteer2 outperforms the baseline of no filtering as well as using the Llama 3.1-405B-Instruct model as a pointwise prompt quality judge.

Train Prompts                              Post-Filtering    # Train examples    AlpacaEval2 LC Win    AlpacaEval2 Win    Arena-Hard Score    WildBench Score
WildChat-20k                               None              20000               48.4                  45.9               37.9                41.5
WildChat-20k                               RIP               4538                57.8                  57.2               43.1                45.6
Self-Instruct                              None              20000               49.1                  46.9               38.5                40.0
Self-RIP (without post-filtering)          None              20000               53.6                  56.1               43.7                44.8
Self-Instruct with RIP post-filtering      RIP               16261               58.3                  53.2               40.9                44.1
Self-RIP                                   RIP               18812               60.2                  61.1               42.1                42.5

Table 5: Self-RIP for generating high-quality synthetic instructions. Self-RIP creates prompts using few-shot samples from high-quality prompts curated by RIP, whereas Self-Instruct uses few-shots from unfiltered WildChat prompts. Applying RIP filtering after generation is also important, and achieves the best results, significantly outperforming Self-Instruct data.

6. Understanding why RIP works

6.1. Filtering prompts with low quality responses

To understand what instructions are filtered out, we first visualize instructions with low quality rejected responses (as measured by low reward and short lengths) by comparing the t-SNE plots of unfiltered and filtered instructions (shown in Appendix Figure 4). We investigated a few clusters present in the t-SNE plot of unfiltered prompts that are missing from the t-SNE plot of filtered ones on the right-hand side. We find that instructions from those clusters being filtered out from the training set are either obscure or nonsensical, or they fail to elicit meaningful responses from the model, leading to lower-quality rejected responses. Such instructions can be caught by measuring the rewards and lengths of the rejected responses, with supporting evidence given in Appendix Table 25.

Next, we employ GPT-4 Turbo to evaluate 20,000 prompts from WildChat. Focusing solely on the instructions (excluding responses) provided in WildChat, the model is tasked with scoring each prompt on a scale from 1 to 5. A score of 1 represents the most helpful prompt, while a score of 5 indicates the lowest quality. The evaluation prompt is provided in Appendix Figure 3. Manual review revealed that prompts assigned scores of 4 and 5 were of very low quality, while those scored 3 were moderately acceptable, albeit with some quality issues still present. Notably, GPT-4 occasionally assigned scores of 2 or 3 to a prompt of low quality. Table 6 illustrates the prevalence of low-quality examples (with score 3 or 4) after applying various filtering methods. We observe that filtering based on the reward and length of the rejected response is the most effective way to ensure prompt quality, compared to other methods tried. By combining those rejected response quality metrics with the reward gap, RIP successfully filtered out all extremely noisy prompts identified by GPT-4. This supports our hypothesis that very low-quality prompts, such as those in WildChat that consist of incomplete snippets from movies, stories, or code (see sample rejected instructions in Appendix Table 25 and Table 26), often result in poor rejected responses when sampled several times. By leveraging the quality of rejected responses as a filtering criterion, we can efficiently eliminate these extremely noisy prompts.

Furthermore, we employ GPT-4 to respond to each WildChat prompt three times. If any of GPT-4's responses decline to answer due to safety concerns, we categorize those prompts as unsafe. It is important to note that with this method GPT-4 sometimes assigns high quality scores to prompts that are borderline unsafe. By examining the reward and the length of rejected responses, we observe RIP is also an effective approach to filter out these unsafe prompts. This approach is grounded in the observation that rejected responses to unsafe instructions are typically short and have low reward scores.

Filtering Methods     % of low-quality prompts ↓     % of unsafe prompts ↓
Unfiltered Data       31.59%                         12.27%
Reject Reward         7.89%                          0.04%
Reject Length         14.45%                         0.02%
Reward Gap            26.50%                         8.07%
RIP                   0.00%                          0.00%

Table 6: Effectiveness of Filters on Prompt Quality and Safety: we compare the number of noisy and potentially unsafe (as judged by GPT-4) WildChat instructions (out of 20k) filtered by various filtering methods.

6.2. Filtering prompts with larger response variance

Similarly, we visualize instructions that are filtered out by measuring the reward gap between chosen and rejected responses in Appendix Figure 5, and further expand some representative groups of filtered instructions in Appendix Table 26. In particular, instructions that cover specialized domains such as coding, software, and other technical questions often require precise details, well-defined objectives or targeted solutions. In those cases, a lack of specificity in the instructions might lead to more variable responses. As shown in Table 26, instructions with a larger reward gap are not necessarily low-quality; however, we hypothesize that the combination of a lack of specificity in the instruction and a larger difference in the response pair makes them less helpful in improving the model during preference optimization.

7. Conclusion

This work introduces Rejecting Instruction Preferences (RIP), a method for improving preference data quality by measuring the rejected response quality and the reward gap between the chosen and rejected response pair. Filtering instructions using RIP remarkably improves model alignment results on both human-written and synthetic instructions, and for different reward signals. In addition, we show that Self-RIP, synthetic instructions generated from few-shot prompts curated by RIP, outperforms organic user instructions and the standard Self-Instruct method, achieving the highest AlpacaEval2 win rate in our experiments.


Impact Statement

This work demonstrates the possibility of dramatically improving LLMs by identifying and producing high-quality training data. Studying how filtering criteria affect outputs will continue to be important for LLM training. While we have primarily focused on preference optimization, the RIP approach is general and can potentially work for any training scheme, e.g. other RL training techniques – which future work should explore.

For such models, safety will also be crucial, and future work should additionally address this aspect. In our experiments, the reward is not explicitly constrained by safety-related criteria. Therefore, a clear further avenue of study is to conduct safety evaluations – and to explore safety filtering using our methods, with reward models built exclusively for safety in existing systems (Touvron et al., 2023b).

Given that we have shown that RIP can filter potentially unsafe prompts, this could mean in the best case that the safety of the model could potentially improve after filtering as well, with RIP being able to catch and mitigate more challenging safety situations that earlier iterations cannot. From a broader perspective, this work could pave the way for methods that produce higher-quality training instructions, that are also potentially safer than organic user instructions in the wild.

References

Balioglu, C. fairseq2, 2023. URL http://github.com/facebookresearch/fairseq2.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Cao, Y., Kang, Y., Wang, C., and Sun, L. Instruction mining: Instruction data selection for tuning large language models. arXiv preprint arXiv:2307.06290, 2023.

Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., et al. AlpaGasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FdVXgSJhvz.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Honovich, O., Scialom, T., Levy, O., and Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Khaki, S., Li, J., Ma, L., Yang, L., and Ramachandra, P. RS-DPO: A hybrid rejection sampling and direct preference optimization method for alignment of large language models. arXiv preprint arXiv:2402.10038, 2024.

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Li, M., Zhang, Y., Li, Z., Chen, J., Chen, L., Cheng, N., Wang, J., Zhou, T., and Xiao, J. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032, 2023a.

Li, M., Zhang, Y., He, S., Li, Z., Zhao, H., Wang, J., Cheng, N., and Zhou, T. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. arXiv preprint arXiv:2402.00530, 2024a.

Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Zhu, B., Gonzalez, J. E., and Stoica, I. From live data to high-quality benchmarks: The Arena-Hard pipeline, April 2024b. URL https://lmsys.org/blog/2024-04-19-arena-hard/.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.

Li, X., Yu, P., Zhou, C., Schick, T., Zettlemoyer, L., Levy, O., Weston, J., and Lewis, M. Self-alignment with instruction backtranslation. In The Twelfth International Conference on Learning Representations, 2024c. URL https://openreview.net/forum?id=1oijHJBRsT.


Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Ravichander, A., Pyatkin, V., Dziri, N., Bras, R. L., and Choi, Y. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild, 2024. URL https://arxiv.org/abs/2406.04770.

Liu, W., Zeng, W., He, K., Jiang, Y., and He, J. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023.

Lu, K., Yuan, H., Yuan, Z., Lin, R., Lin, J., Tan, C., Zhou, C., and Zhou, J. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In The Twelfth International Conference on Learning Representations, 2023.

Meng, Y., Xia, M., and Chen, D. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.

Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.

Pace, A., Mallinson, J., Malmi, E., Krause, S., and Severyn, A. West-of-N: Synthetic preference generation for improved reward modeling. arXiv preprint arXiv:2401.12086, 2024.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.

Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Toneva, M., Sordoni, A., Combes, R. T. d., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845, 2024a.

Wang, T., Kulikov, I., Golovneva, O., Yu, P., Yuan, W., Dwivedi-Yu, J., Pang, R. Y., Fazel-Zarandi, M., Weston, J., and Li, X. Self-taught evaluators. arXiv preprint arXiv:2408.02666, 2024b.

Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., Ashok, A., Dhanasekaran, A. S., Naik, A., Stap, D., et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.

Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. HelpSteer2: Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673, 2024c.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.


Wu, J., Xie, Y., Yang, Z., Wu, J., Gao, J., Ding, B., Wang, X., and He, X. β-DPO: Direct preference optimization with dynamic β. arXiv preprint arXiv:2407.08639, 2024a.

Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge. arXiv preprint arXiv:2407.19594, 2024b.

Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. LESS: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.

Xu, J., Lee, A., Sukhbaatar, S., and Weston, J. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023b.

Yang, S., Xie, Z., Peng, H., Xu, M., Sun, M., and Li, P. Dataset pruning: Reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329, 2022.

Yasunaga, M., Shamis, L., Zhou, C., Cohen, A., Weston, J., Zettlemoyer, L., and Ghazvininejad, M. ALMA: Alignment with minimal annotation. arXiv preprint arXiv:2412.04305, 2024.

Yuan, W., Kulikov, I., Yu, P., Cho, K., Sukhbaatar, S., Weston, J., and Xu, J. Following length constraints in instructions. arXiv preprint arXiv:2406.17744, 2024a.

Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024b.

Zhao, H., Andriushchenko, M., Croce, F., and Flammarion, N. Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning. arXiv preprint arXiv:2402.04833, 2024a.

Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024b.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.


A. Appendix
A.1. More Details on Experiment Setup
Our experiment setups are summarized in Table 7. Specifically, we apply RIP to multiple popular instruction-following datasets as well as our own synthetic data, with rewards annotated from various sources (human, reward classifier, or LLM-as-a-Judge), indicating the generalizability of our RIP method.

                              #Prompts    Human Written    #Responses    Reward Annotator    Valid Set (# Examples)
WildChat-turn1 20k 20,000 Yes 8,16,32,64 ArmoRM Humpback + Evol-Instruct (470)
WildChat-turn1 20k 20,000 Yes 64 LLM-as-a-Judge Humpback + Evol-Instruct (470)
HelpSteer2 10,161 Yes 2 Human HelpSteer2 valid (519)
Self-Instruct 20,000 No 64 ArmoRM Humpback + Evol-Instruct (470)
Self-RIP 20,000 No 64 ArmoRM Humpback + Evol-Instruct (470)

Table 7: Preference Dataset Statistics used for training in our experiments.

We report model performance on the valid set when varying the number of training WildChat prompts in Figure 2. Model training improves significantly as the training data size grows to 20k and then begins to saturate afterwards; therefore our main experiments are based on those 20k WildChat prompts.

[Figure 2 plot: Armo Rewards on Valid Set vs. Data Size (0 to 50k); rewards rise from roughly 0.177 to about 0.183 by 20k and then plateau.]
Figure 2: Results on DPO Training with Varying WildChat Data Sizes. Using different sizes of WildChat data for DPO
training on LLaMA 3.1-8B-Instruct, the performance, measured by Armo rewards on the validation set, gradually saturates
as the data size increases.

We primarily assess our models' general instruction-following capabilities using three popular evaluation benchmarks: AlpacaEval-2 (Li et al., 2023b), Arena-Hard (Li et al., 2024b) and WildBench (Lin et al., 2024). AlpacaEval-2 consists of 805 prompts sampled from 5 datasets. Arena-Hard contains 500 challenging user queries sourced from Chatbot Arena and has the highest correlation and separability of models compared to Chatbot Arena among popular open-ended LLM benchmarks (Li et al., 2024b). WildBench is built from a set of 1024 significantly harder, challenging queries carefully curated from the WildChat project (Zhao et al., 2024b) to ensure diversity and complexity. The automatic evaluation of WildBench involves task-specific checklists that guide LLM judges in generating reliable and consistent judgments that demonstrate high correlation with human judgments. We report the WB-Score for individual scoring.

A.2. Additional Results


LLM-as-a-Judge as Reward Annotator. We explore LLM-as-a-Judge as an alternative reward annotator apart from the reward model ArmoRM and human reward annotations, and use Llama 3.1-405B-Instruct zero-shot to judge the quality of each individual response, using its predictions to construct response pairs. For each response, we conduct 10 independent evaluations and calculate the average score to determine the final reward score. We report AlpacaEval2, Arena-Hard and WildBench results on WildChat DPO models in Table 8. Similar to the observation from Table 2, RIP filtering based on LLM-as-a-Judge predictions outperforms no filtering.


WildChat with LLM-as-a-Judge as reward annotator      # Train examples    AlpacaEval2 LC Win    AlpacaEval2 Win    Arena-Hard Score    WildBench Score
Llama 3.1-8B-Instruct (seed model) - 20.9 21.8 21.3 33.1
Standard DPO (no filtering) 16837 40.1 44.9 41.1 42.5
RIP 5999 44.3 48.8 42.5 43.9

Table 8: RIP compared to baselines on WildChat using LLM-as-a-Judge as the reward annotator. We report results on
AlpacaEval2, Arena-Hard and WildBench of various models trained using DPO on the WildChat Dataset. RIP outperforms
the baseline of LLM-as-judge as the reward annotator.
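A sketch of how the LLM-as-a-Judge rewards used in Table 8 can be aggregated, following Section 4.1.1 (ten independent 0-10 scores per response, averaged); judge_score stands in for prompting Llama 3.1-405B-Instruct with the template of Yasunaga et al. (2024), and is an assumed helper rather than part of the paper's codebase.

    def llm_judge_reward(prompt, response, judge_score, n_evals=10):
        # Average several independent judge scores to reduce scoring noise.
        scores = [judge_score(prompt, response) for _ in range(n_evals)]
        return sum(scores) / len(scores)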

Data Scaling with RIP. We further scale up RIP by growing the training data size after filtering to 20k, achieving an AlpacaEval2 LC win rate of 58.49% as shown in Figure 1. While the effective training size scales from 4538 to 20k, the actual performance gain increases only slightly, suggesting that training with Llama 3.1-8B-Instruct on existing WildChat prompts saturates, even under RIP.

RIP filtering thresholds. We report the filtering thresholds of the best checkpoints in our experiments in Table 9.

Full Evaluation Results. We include full WildChat evaluation results on AlpacaEval2 and Arena-Hard in Table 10 and on WildBench in Table 11, with average response lengths, confidence intervals, as well as fine-grained results on subtasks. Full evaluation results on models trained on HelpSteer2 are presented in Table 12 and Table 13. In addition, full evaluation results on Self-RIP are included in Table 14 and Table 15.

Coordinate-wise Filtering Results. We conduct extensive experiments by applying filtering to each individual metric: reward of the chosen or rejected response, length of the chosen or rejected response, reward gap, average reward of all responses, etc. Valid-set performance when applying various filtering metrics to the WildChat task is included in Table 19, and for HelpSteer2 in Table 20. Both highlight a strong performance boost from filtering based on rejected reward, rejected length and reward gap.
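A minimal sketch of such a coordinate-wise sweep: one metric is varied at a time over a few percentile cutoffs and the retained subset is scored on the valid set. The train_and_eval callable is a placeholder for DPO training plus ArmoRM validation scoring, not the paper's implementation.

    import numpy as np

    def coordinate_sweep(examples, metric_fn, train_and_eval,
                         percentiles=(25, 50, 75), lower_bound=True):
        # Evaluate filtering on a single metric at several percentile cutoffs.
        values = np.array([metric_fn(e) for e in examples])
        results = {}
        for q in percentiles:
            tau = np.percentile(values, q)
            kept = [e for e, v in zip(examples, values)
                    if (v >= tau if lower_bound else v <= tau)]
            results[q] = train_and_eval(kept)  # e.g. average Armo reward on the valid set
        return results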

A.3. t-SNE Analysis of Filtered Instructions


We conduct t-SNE analysis on WildChat prompts filtered by rejected response length and reward in Figure 4 and those
further filtered by reward gap in Figure 5. To better understand which prompts are being filtered out, we summarize prompts
being filtered out due to rejected responses being of shorter length or lower reward in Table 25, and those filtered out due to
large reward gaps in Table 26.
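The t-SNE visualizations referenced here can be reproduced along the following lines (a sketch only; the paper does not specify its embedding model, so a sentence-transformers encoder is assumed):

    from sklearn.manifold import TSNE
    from sentence_transformers import SentenceTransformer
    import matplotlib.pyplot as plt

    def plot_tsne(kept_prompts, filtered_prompts):
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        prompts = kept_prompts + filtered_prompts
        embeddings = encoder.encode(prompts)
        coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
        n_kept = len(kept_prompts)
        plt.scatter(coords[:n_kept, 0], coords[:n_kept, 1], s=4, label="kept by RIP")
        plt.scatter(coords[n_kept:, 0], coords[n_kept:, 1], s=4, label="filtered out")
        plt.legend()
        plt.show()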

A.4. Further Ablations


We report results of further ablation studies: comparing RIP filtering to alternative response-pairing methods (instead of filtering) in Table 16, and the robustness of RIP to the choice of responses in rejection sampling in Table 17.

RIP outperforms alternative preference pairing methods We compare RIP to methods that do not filter prompts but instead use different response-pairing strategies for building pairwise preferences. Recall that in our main experiments for RIP we used the best-vs-worst pairing method described in Section 3.1. Here we explore two alternatives: (i) best-vs-random, which existing work (Yasunaga et al., 2024; Khaki et al., 2024) has shown to outperform best-vs-worst, and (ii) best-vs-bottom-K%, where the rejected response is the one at the bottom K percentile of scores, for K = 25, 50, 75 (K = 0 being the lowest-scoring response). Both pairing methods can effectively lower the reward gap and increase the quality of the rejected response without removing training prompts (see the sketch below). We report model performance on the valid set in Table 16. Among the pairing methods, best-vs-bottom-25% works best, but still underperforms our RIP method (which uses best-vs-worst pairing). When evaluated on AlpacaEval2, Arena-Hard, and WildBench, the WildChat-20k DPO (best-vs-bottom-25%) model achieves only a slight improvement over the WildChat-20k DPO (best-vs-worst) baseline, while still underperforming RIP, as shown in Table 2. This result indicates that a low rejected reward or a large reward gap is better interpreted as a sign of a low-quality prompt than as an artifact of bad response pairing.
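For concreteness, a minimal sketch of the three pairing schemes discussed above follows; function and argument names are illustrative, not our exact implementation.

```python
import random
from typing import List, Tuple

def pair_responses(
    responses: List[str],
    rewards: List[float],
    method: str = "best_vs_worst",
    k_percentile: int = 25,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return a (chosen, rejected) pair from N scored candidate responses."""
    rng = random.Random(seed)
    order = sorted(range(len(responses)), key=lambda i: rewards[i])  # ascending by reward
    chosen = responses[order[-1]]  # the highest-reward response is always chosen

    if method == "best_vs_worst":
        rejected = responses[order[0]]                # lowest-reward response
    elif method == "best_vs_random":
        rejected = responses[rng.choice(order[:-1])]  # any non-chosen response
    elif method == "best_vs_bottom_k":
        # response sitting at the bottom k_percentile of the reward distribution
        rejected = responses[order[round((len(order) - 1) * k_percentile / 100)]]
    else:
        raise ValueError(f"unknown pairing method: {method}")
    return chosen, rejected
```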

Data | # Train Examples | Human Written | # Responses | Reward Annotator | Seed Model | Filtering Metrics | AlpacaEval2 LC Win | AlpacaEval2 Win | Arena-Hard | WildBench
- | - | - | - | - | Llama 3.1-8B-Instruct | - | 20.9 | 21.8 | 21.3 | 33.1
- | - | - | - | - | Llama 3.3-70B-Instruct | - | 38.9 | 41.5 | 67.5 | 52.8
Wildchat 20k | 20k | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 48.37 | 45.87 | 37.9 | 41.5
Wildchat 20k | 6762 | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 57.1 | 52.9 | 42.3 | 45.5
Wildchat 20k | 4538 | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126, Reward Gap > 0.042 | 57.8 | 57.2 | 43.1 | 45.6
Synthetic (few shot: Wildchat 20k) | 20k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 49.1 | 46.9 | 38.5 | 41.0
Synthetic (few shot: Wildchat 20k) | 16k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 58.3 | 53.2 | 40.9 | 44.1
Synthetic (few shot: Wildchat filtered 4538 examples) | 20k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 53.6 | 56.1 | 43.7 | 44.8
Synthetic (few shot: Wildchat filtered 4538 examples) | 18812 | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 60.2 | 61.1 | 42.1 | 42.5
Wildchat 20k | 16.8k | Yes | 64 | LLM-as-a-Judge | Llama 3.1-8B-Instruct | No | 40.1 | 44.9 | 41.1 | 42.5
Wildchat 20k | 5999 | Yes | 64 | LLM-as-a-Judge | Llama 3.1-8B-Instruct | Rejected LLM-as-a-Judge Reward ≥ 8, Rejected Length ≥ 1399, Reward Gap ≤ 1 | 44.3 | 48.8 | 42.5 | 43.9
HelpSteer | 10k | Yes | 64 | Human | Llama 3.1-8B-Instruct | No | 25.2 | 23.1 | 26.8 | 37.1
HelpSteer | 5081 | Yes | 64 | Human | Llama 3.1-8B-Instruct | Rejected Length ≥ 1303 | 34.6 | 32.8 | 35.0 | 39.5
Wildchat 40k | 40k | Yes | 32 | ArmoRM | Llama 3.3-70B-Instruct | No | 54.3 | 51.6 | 70.5 | 55.3
Wildchat 40k | 17.7k | Yes | 32 | ArmoRM | Llama 3.3-70B-Instruct | Rejected Length > 1878, Rejected Armo > 0.126, Reward Gap > 0.042 | 67.7 | 73.2 | 82.9 | 58.8

Table 9: Full results: details of the number of training examples, choice of reward annotators, seed models, filtering metrics and thresholds chosen, as well as final outcomes across the three evaluation benchmarks.

Model | # Train examples | AlpacaEval2 (LC Win, Win, Len) | Arena-Hard (Score, 95% CI, Len)

Standard models
GPT-4 Omni (05/13) - 57.5 51.3 1873 74.9 (-2.5, 1.9) 668
GPT-4 Turbo (04/09) - 55.0 46.1 1802 82.6 (-1.6, 1.8) 662
Gpt-4-0613 - 55.0 46.1 1802 37.9 (-2.8, 2.4) 354
Llama 3.1-405B-Instruct - 39.3 39.1 1988 67.1 (-2.2, 2.8) 658
Llama 3.1-70B-Instruct - 38.1 39.1 2044 69.3 (-2.5, 2.5) 658
Baseline
Llama 3.1-8B-Instruct - 20.9 21.8 2184 21.3 (-1.9, 2.2) 861
WildChat-20k DPO (no filtering) 20000 48.4 45.9 2134 37.9 (-2.0, 2.2) 622
WildChat-20k DPO (best-vs-bottom-25%) 20000 48.2 45.9 1971 40.7 (-2.1, 1.9) 741
Prompt-Based Filtering
Jaccard Similarity 9904 49.0 46.6 1978 42.6 (-2.4, 2.3) 632
LLM-as-Prompt-Judge Binary 4299 45.5 41.0 1859 42.0 (-1.4, 1.7) 597
LLM-as-Prompt-Judge Pointwise 15963 47.4 47.4 2056 40.7 (-2.0, 2.2) 701
InsTag-Difficulty 10000 46.3 39.0 1752 39.0 (-2.2, 2.3) 602
InsTag-Diversity 9952 40.1 41.1 1903 40.4 (-2.4, 2.8) 579
Prompt-and-Chosen-Response-Based Filtering
IFD on Prompt + Chosen Response 9902 47.6 37.6 1655 32.2 (-1.7, 2.5) 533
ppl(Chosen Response) 14851 45.6 45.5 1930 40.8 (-2.3, 1.7) 582
Chosen-Rejected-Response Based Filtering
LLM-as-Prompt-Judge Pointwise 15963 47.4 47.4 2056 40.7 (-2.0, 2.2) 701
RIP 4538 57.8 57.2 2048 43.1 (-1.5, 1.8) 638

Table 10: Full AlpacaEval2 & Arena-Hard Results on WildChat: we compare performances of SOTA models on
AlpacaEval2 win rates and Arena-Hard scores as well as DPO models trained on the WildChat-20k dataset using various
filtering methods.

Combining best-vs-bottom-25% pairing with RIP performs on par with best-vs-worst pairing with RIP. We further apply RIP filtering to examples paired by best-vs-bottom-25%. Combining best-vs-bottom-25% pairing with filtering out examples whose rejected responses are of low quality yields an ArmoRM score of 0.18675, slightly lower than best-vs-worst pairing with filtering by rejected reward (0.18795). Filtering out best-vs-bottom-25% examples with larger reward gaps yields an ArmoRM score of 0.1860 on the valid set, compared to 0.18542 from best-vs-worst pairing with filtering by reward gap. Given the marginal performance difference between best-vs-worst and best-vs-bottom-25% pairing, both with and without RIP, we focus on the more widely adopted best-vs-worst pairing when experimenting with the various filtering methods, including our RIP method.

RIP is robust to the choice of the number of responses N. While we showed RIP provides strong performance on HelpSteer2, where only N = 2 responses are available per prompt, and on WildChat with N = 64 responses sampled per prompt, we also compare the performance of RIP while varying N, the number of candidate responses generated for preference annotation, in the WildChat setup. As shown in Table 17, for a wide range of values N = 64, 32, 16, 8, RIP consistently outperforms the no-filtering baseline, with larger N achieving increasingly better performance, likely because the increased quality and variability of chosen and rejected responses allows our RIP metrics to more accurately curate high-quality data.

Self-RIP works with a much smaller set of high-quality seed instructions Instead of using all 4538 RIP-curated high-quality instructions as seed instructions S in Step 1 (few-shot generation), we sample a much smaller subset of 256 prompts from the 4538 RIP-curated prompts as seeds, and conduct few-shot generation by sampling 8 prompts from these 256 seeds each time. We report Self-RIP with and without post-filtering in Table 18. Self-RIP based on 256 high-quality seed instructions (58.9) slightly underperforms Self-RIP based on all 4538 seed prompts (60.2), but still outperforms Self-Instruct with RIP post-filtering (58.3) as well as Self-RIP based on all 4538 seed prompts without post-filtering (53.6), indicating that Self-RIP works well even with a much smaller set of high-quality seed prompts. A sketch of this few-shot generation step is given below.
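The following is a minimal sketch of the few-shot generation step with 8 shots sampled per generation; the prompt template and function names are illustrative assumptions, not the exact ones used in our experiments.

```python
import random
from typing import List, Optional

def build_fewshot_prompt(
    seed_prompts: List[str], n_shots: int = 8, rng: Optional[random.Random] = None
) -> str:
    """Sample n_shots seed instructions and format them as a few-shot prompt asking
    the generator model to write one more instruction in the same style."""
    rng = rng or random.Random()
    shots = rng.sample(seed_prompts, n_shots)
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(shots))
    return (
        "Below are example user instructions. Write one more instruction of the same kind.\n"
        f"{numbered}\n"
        f"{n_shots + 1}."
    )

# Usage: repeatedly call build_fewshot_prompt on the 256 RIP-curated seed prompts to
# generate candidate instructions, then re-apply RIP filtering to the resulting pairs.
```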


Model | WB-Score (weighted average) | Creative | Planning & Reasoning | Math & Data Analysis | Information | Coding & Debugging

Standard models

GPT-4 Omni (05/13) 59.3 59.1 60.2 57.3 58.6 60.5


GPT-4 Turbo (04/09) 55.2 58.7 56.2 51.0 57.2 55.1
Gemini-1.5-pro 53.0 55.1 53.7 48.6 52.2 55.2
Llama3-70B-Instruct 47.8 54.3 50.1 42.1 52.3 44.7
Baseline
Llama 3.1-8B-Instruct 33.1 45.0 37.0 23.9 37.4 29.3
WildChat-20k DPO (no filtering) 41.5 51.8 44.2 32.2 50.0 37.1
WildChat-20k DPO (best-vs-bottom-25%) 44.5 53.9 47.4 35.8 50.4 41.4
Prompt-Based-Filtering
Jaccard Similarity 43.7 54.2 46.9 34.3 49.5 40.5
LLM-as-Prompt-Judge Binary 43.3 53.9 46.6 35.8 48.5 38.6
LLM-as-Prompt-Judge Pointwise 45.2 55.6 48.0 37.1 51.6 40.9
InsTag-Difficulty 42.4 52.7 45.4 33.4 47.8 39.3
InsTag-Diversity 43.4 53.4 46.1 35.0 49.1 40.1
Prompt-and-Chosen-Response-Based Filtering
IFD 42.2 51.3 45.9 35.0 48.0 37.1
ppl(Chosen Response) 43.4 52.5 47.0 37.2 49.4 37.6
Chosen-Rejected-Response Based Filtering
LLM-as-Prompt-Judge Pointwise 45.2 55.6 48.0 37.1 51.6 40.9
RIP 45.6 56.7 48.8 36.6 51.6 41.4

Table 11: Full WildBench Results on WildChat: we compare performances of SOTA models on WildBench as well as
DPO models trained on the WildChat-20k dataset using various filtering methods.

Model | Prompts | AlpacaEval2 (LC Win, Win, Len) | Arena-Hard (Score, 95% CI, Len)
Baseline
Llama 3.1-8B-Instruct - 20.9 21.8 2184 21.3 (-1.9, 2.2) 861
HelpSteer2 DPO (no filtering) 10161 25.2 23.1 1733 26.8 (-2.0, 2.4) 606
Prompt-Based-Filtering
LLM-as-Prompt-Judge Pointwise 5376 27.8 25.7 1947 29.5 (-2.8, 2.3) 627
Prompt-Response-Based-Filtering
RIP 5081 34.6 32.8 1941 35.0 (-1.8, 2.2) 621

Table 12: Full AlpacaEval2 & Arena-Hard results of our DPO models trained on the HelpSteer2 dataset.

A.5. Details about Baselines


InsTag Complexity Lu et al. (2023) utilized ChatGPT to generate semantic and intent-based tags, which were then used to fine-tune a large language model (LLM) data tagger. The number of tags per prompt serves as a complexity metric. Building on their methodology, we employed a publicly available tagger (https://github.com/OFA-Sys/InsTag; note that Meta was not involved in the training of the InsTag model we used) to annotate each prompt, generating between 1 and 100 tags per prompt. We then categorized our training prompts into four groups based on the number of tags: at least 2, at least 3, at least 4, and at least 5. From each group, we randomly sampled 10,000 training examples and trained a distinct model per group. Note that a threshold of ≥ 1 implies no filtering, i.e., only a random sample of 10,000 data points from WildChat, while a threshold of ≥ 2 means filtering out prompts with only one tag. As shown in Table 21, a threshold of ≥ 2 yields the best performance for the InsTag Complexity filtering method, and we report the results for this threshold in Table 2. A sketch of this thresholding step is given below.
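A minimal sketch of the complexity-based filtering step, assuming a `prompt_tags` mapping produced by the InsTag tagger; names and structure are illustrative.

```python
import random
from typing import Dict, List

def filter_by_tag_count(
    prompt_tags: Dict[str, List[str]],  # prompt -> tags predicted by the InsTag tagger
    min_tags: int = 2,                  # the best-performing threshold in Table 21
    sample_size: int = 10000,
    seed: int = 0,
) -> List[str]:
    """Keep prompts whose tag count meets the complexity threshold,
    then downsample to a fixed training-set size."""
    kept = [p for p, tags in prompt_tags.items() if len(tags) >= min_tags]
    rng = random.Random(seed)
    return rng.sample(kept, min(sample_size, len(kept)))
```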


Model | WB-Score (weighted average) | Creative | Planning & Reasoning | Math & Data Analysis | Information | Coding & Debugging

Baseline
Llama 3.1-8B-Instruct 33.1 45.0 37.0 23.9 37.4 29.3
HelpSteer2 DPO (no filtering) 37.1 48.6 40.4 26.5 44.3 33.4
Prompt-Based-Filtering
LLM-as-Prompt-Judge Pointwise 37.2 50.6 40.0 27.9 43.0 33.1
Prompt-Response-Based-Filtering
RIP 39.5 52.1 42.9 29.3 46.4 35.0

Table 13: Full WildBench results of our DPO models trained on the HelpSteer2 dataset.

Training Prompts | Filtering | # Train examples | AlpacaEval2 (LC Win, Win, Len) | Arena-Hard (Score, 95% CI, Len)
Self-Instruct None 20000 49.1 46.9 1956 38.5 (-1.4, 1.6) 738
Self-RIP (without post-filtering) None 20000 53.6 56.1 2252 43.7 (-2.3, 2.3) 777
Self-Instruct with RIP post-filtering RIP 16261 58.3 53.2 1823 40.9 (-1.9, 1.6) 560
Self-RIP RIP 18812 60.2 61.1 2121 42.1 (-2.0, 2.4) 606

Table 14: Results of our DPO models trained with Self-Instructed Dataset: Full AlpacaEval2 & Arena-Hard Results
comparing our method with training on standard Self-Instruct dataset.

Training Prompts | Filtering | WB-Score (weighted average) | Creative | Planning & Reasoning | Math & Data Analysis | Information | Coding & Debugging

Self-Instruct None 41.0 51.6 43.3 31.4 47.7 38.0


Self-RIP (without post-filtering) None 44.8 55.3 46.9 33.5 49.7 44.6
Self-Instruct with RIP post-filtering RIP 44.1 54.8 47.3 36.4 48.2 40.3
Self-RIP RIP 42.5 54.1 46.2 32.8 48.6 38.2

Table 15: Results of our DPO models trained with Self-Instructed Dataset: Full WildBench Results comparing our
method with training on standard Self-Instruct dataset.

Pair | Armo Score on Valid Set


Chosen=HighestScore, Rejected=LowestScore 0.1830
Chosen=HighestScore, Rejected=BottomScore25% 0.1842
Chosen=HighestScore, Rejected=BottomScore50% 0.1839
Chosen=HighestScore, Rejected=BottomScore75% 0.1821
Chosen=HighestScore, Rejected=Random 0.1835
Chosen=HighestScore, Rejected=LowestScore + RIP 0.1898

Table 16: Results of pair selection: we report Armo scores on the valid set for different pairing methods used instead of filtering prompts. The best pairing result (0.1842) is achieved by taking the bottom-25% response as the rejected one, although it still underperforms our filtering method (0.1898).

InsTag Diversity The InsTag Diversity filtering method (Lu et al., 2023) considers a dataset to be more diverse if it
contains a larger number of unique tags, as annotated by the aforementioned tagger. We employed two metrics to manage
InsTag Diversity:


N | Armo Score (Before Filtering) | Armo Score (After Filtering) | Gain
8 0.1821 0.1860 0.0039
16 0.1827 0.1878 0.0051
32 0.1829 0.1882 0.0053
64 0.1831 0.1898 0.0067

Table 17: Results of varying the number of responses N = 8, 16, 32, 64 sampled per prompt in response generation: the valid-set Armo score of our DPO models trained on the WildChat dataset increases after RIP-based filtering for every choice of N.

Train Prompts | # Seed Prompts | # Train examples | AlpacaEval2 (LC Win, Win)
Llama 3.1-8B-Instruct (seed model) - - 20.9 21.8
WildChat-20k + RIP - 4538 57.8 57.2
Self-Instruct + RIP 20000 16261 58.3 53.2
Self-RIP (without post-filtering) 256 20000 50.0 51.2
Self-RIP (without post-filtering) 4538 20000 53.6 56.1
Self-RIP 256 15619 58.9 63.1
Self-RIP 4538 18812 60.2 61.1

Table 18: Self-RIP for generating high-quality synthetic instructions, varying the number of seed prompts used for few-shot generation. Self-RIP creates prompts using few-shot samples from high-quality prompts curated by RIP, whereas Self-Instruct uses few-shot samples from unfiltered WildChat prompts. Applying RIP filtering after generation is also important, and achieves the best results, significantly outperforming Self-Instruct data.

1. Tag Frequency: We deem a tag valid if it meets a predefined frequency threshold. This approach addresses the issue of
infrequent tags, such as “serve size” and “market failure,” which appeared only once or twice in the entire Wildchat dataset,
suggesting they may not represent valid categories. In contrast, more common tags like “creative writing” and “information
retrieval” are more appropriate for categorizing prompt data.
2. Max prompt per Tag: This metric controls the coverage ratio of unique tags. If a prompt contains only tags that have
already been covered by the selected set, we discard the prompt to ensure diversity.
Table 22 presents the performance results when diversity is controlled using the two metrics described above. To ensure fairness, we downsampled the training data for each experiment to 10,000 samples. The results indicate that the model achieves optimal performance when the Tag Frequency threshold is set to 6 and the Max Prompts per Tag is set to 3, i.e., we only consider tags that appear at least six times in the entire WildChat dataset and allow at most three prompts per tag. The best performance results are reported in Table 2. A sketch of this selection procedure is given below.
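One way to implement the two diversity controls is sketched below; this greedy pass is an illustration under the stated settings (tag frequency of at least 6, at most 3 prompts per tag), not necessarily the exact procedure used.

```python
from collections import Counter
from typing import Dict, List

def select_diverse_prompts(
    prompt_tags: Dict[str, List[str]],  # prompt -> InsTag tags
    min_tag_freq: int = 6,              # Tag Frequency threshold (best setting in Table 22)
    max_prompts_per_tag: int = 3,       # Max Prompts per Tag (best setting in Table 22)
) -> List[str]:
    """Greedy selection: ignore rare tags, then keep a prompt only if it still
    contributes at least one tag that has not reached its per-tag budget."""
    tag_freq = Counter(t for tags in prompt_tags.values() for t in tags)
    used: Counter = Counter()
    selected: List[str] = []
    for prompt, tags in prompt_tags.items():
        valid = [t for t in tags if tag_freq[t] >= min_tag_freq]
        fresh = [t for t in valid if used[t] < max_prompts_per_tag]
        if fresh:  # the prompt adds coverage of under-represented tags, so keep it
            selected.append(prompt)
            used.update(fresh)
    return selected
```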

Perplexity To curate training prompts, we compute the perplexity (ppl) of the chosen response yw under the Llama 3.1-8B-Instruct model in a zero-shot setting. We use this perplexity as a filtering metric, specifically retaining examples with high ppl(yw | x) values, which may indicate more challenging prompts. We adjust the quantile range to control perplexity, calculating ppl(yw | x) for 20,000 WildChat data points and filtering them based on this range; a sketch of the computation is given below. Table 23 displays model performance across different ppl quantile ranges. As shown, the quantile range of 25-100 yields the best performance, and we report this model's performance in Table 2.
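A minimal sketch of the perplexity computation, assuming access to the meta-llama/Llama-3.1-8B-Instruct checkpoint via Hugging Face transformers; chat templating and batching are omitted, and the prompt/response token boundary is handled only approximately.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumes access to this checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def response_ppl(prompt: str, response: str) -> float:
    """Perplexity of the chosen response y_w conditioned on the prompt x."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt so only response tokens are scored
    loss = model(full_ids, labels=labels).loss  # mean NLL over unmasked (response) tokens
    return math.exp(loss.item())

def keep_by_ppl_quantile(ppls, low_q: float = 0.25, high_q: float = 1.0):
    """Indices of examples whose ppl(y_w | x) falls in the retained quantile range."""
    ranked = sorted(ppls)
    lo = ranked[int(low_q * (len(ranked) - 1))]
    hi = ranked[int(high_q * (len(ranked) - 1))]
    return [i for i, p in enumerate(ppls) if lo <= p <= hi]
```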

Instruction-Following Difficulty (IFD) Li et al. (2023a) introduced the IFD score to measure the model-specific difficulty of a data sample. In the instruction-tuning process, the loss of a sample pair (Q, A) is calculated by continuously predicting the next tokens given the instruction Q and their preceding words:

$$L_\theta(A \mid Q) = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(w_i^A \mid Q, w_1^A, w_2^A, \ldots, w_{i-1}^A; \theta\right) \quad (1)$$


Method 0-100 10-100 25-100 50-100 60-100 75-100


Chosen Reward 0.18305 0.18325 0.18409 0.18393 0.18380 0.18333
Rejected Reward 0.18305 0.18411 0.18405 0.18566 0.18797 0.18795
Average Reward 0.18305 0.18368 0.18392 0.18494 0.18468 0.18442
Chosen Length 0.18305 0.18350 0.18366 0.18278 0.18226 0.18105
Rejected Length 0.18305 0.18377 0.18340 0.18571 0.18593 0.18473

Method 0-100 0-25 0-50 50-100


Reward Gap 0.18305 0.18405 0.18542 0.17993

Table 19: Performance of different filtering methods across quantile ranges on WildChat, with ArmoRM as the reward annotator.

Method 0-25 0-50 0-75 0-100 25-100 50-100 75-100


Chosen Human Reward 0.1469 0.1451 0.1456 0.1458 0.1461 0.1454 0.1465
Rejected Human Reward 0.1442 0.1455 0.1459 0.1458 0.1480 0.1470 0.1461
Chosen Length 0.1484 0.1467 0.1455 0.1458 0.1454 0.1446 0.1449
Rejected Length 0.1421 0.1430 0.1445 0.1458 0.1495 0.1513 0.1478
Human Reward Gap 0.1482 0.1480 0.1466 0.1458 0.1448 0.1448 0.1441

Table 20: Performance of Different Filter Methods Across Quantile Ranges on HelpSteer2 valid set.

where N is the number of words of the ground-truth answer A. They denote this averaged cross-entropy loss as the Conditioned Answer Score $s_\theta(A \mid Q) = L_\theta(A \mid Q)$.
Then they introduce the Direct Answer Score $s_\theta(A)$:

$$s_\theta(A) = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(w_i^A \mid w_1^A, \ldots, w_{i-1}^A; \theta\right) \quad (2)$$

Finally, they estimate the Instruction-Following Difficulty (IFD) score $\mathrm{IFD}_\theta(Q, A)$ of a given (Q, A) pair by calculating the ratio between $s_\theta(A \mid Q)$ and $s_\theta(A)$:

$$\mathrm{IFD}_\theta(Q, A) = \frac{s_\theta(A \mid Q)}{s_\theta(A)} \quad (3)$$

We calculated IFD scores for the 20,000 WildChat data points and filtered them based on quantile ranges. As shown in Table 24, filtering with a range of 25-100 yielded the best performance; this model's performance is reported in Table 2. A sketch of the IFD computation is given below.
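A minimal sketch of the IFD computation, mirroring the perplexity sketch above (model loading repeated for completeness); separators and chat formatting are omitted, and the helper names are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumes access to this checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def mean_nll(text: str, condition: str = "") -> float:
    """Average next-token cross-entropy of `text`, optionally conditioned on `condition`."""
    full_ids = tok(condition + text, return_tensors="pt").input_ids
    labels = full_ids.clone()
    if condition:
        cond_len = tok(condition, return_tensors="pt").input_ids.shape[1]
        labels[:, :cond_len] = -100  # score only the answer tokens
    return model(full_ids, labels=labels).loss.item()

def ifd_score(question: str, answer: str) -> float:
    """IFD(Q, A) = s(A | Q) / s(A): the ratio of the conditioned answer loss
    to the unconditioned answer loss."""
    return mean_nll(answer, condition=question) / mean_nll(answer)
```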


I have a collection of prompts that I need to evaluate for their effectiveness in fine-tuning a language model.
A useful prompt should:
- Clearly ask a question
- Be concise and specific
- Directly relate to the topic of interest or follow given instructions

Please assess each prompt and assign a score from 1 to 5 based on its usefulness:
- 1: Pretty useful
- 2: Somewhat useful
- 3: Neutral (neither useful nor harmful)
- 4: Somewhat harmful
- 5: Harmful
Make sure to clearly indicate the score at the end of your evaluation using the format: Score: x
Prompt: {prompt}

Figure 3: GPT4 eval prompt.

Tag threshold | Armo Reward on Valid Set
>= 1 | 0.1820
>= 2 | 0.1826
>= 3 | 0.1812
>= 4 | 0.1815
>= 5 | 0.1818

Table 21: Model performance with InsTag Complexity Filtering.

Tag Frequency | Max prompt per Tag | Armo Reward on Valid Set
1 | 1 | 0.1809
1 | 2 | 0.1799
2 | 1 | 0.1798
2 | 2 | 0.1800
3 | 1 | 0.1799
3 | 2 | 0.1797
3 | 3 | 0.1814
4 | 3 | 0.1821
5 | 3 | 0.1818
6 | 3 | 0.1831

Table 22: Model performance with InsTag Diversity Filtering.

Quantile Range | Armo Reward on Valid Set
0-25 | 0.1815
0-50 | 0.1823
0-75 | 0.1832
25-100 | 0.1835

Table 23: Model performance with Perplexity Filtering.

Quantile Range | Armo Reward on Valid Set
25-100 | 0.1833
50-100 | 0.1827
75-100 | 0.1797

Table 24: Model performance with IFD Filtering.


Figure 4: t-SNE plots on instructions before and after filtering by rewards and lengths of rejected responses. Red dots
represent unfiltered instructions, while blue dots are instructions curated by filtering out those with low-reward and shorter
rejected responses.

Cluster 1
Description: 646 instructions in the format of “give me a response to “‘<text>“‘ to send in a discussion, VERY SHORT, CONCISE & CLEAR. ONLY RETURN THE RAW MESSAGE, DO NOT SAY ”Hey here is the message you asked””, where <text> refers to a single-turn conversational message.
Rejecting reason: Around 90% of rejected responses are of shorter lengths and lower scores below the 25% percentile, even though their scores are higher than average rejected scores. These short and concise conversational responses are shorter, thus potentially more generic and less informative for the models to further improve upon.
Rejected instruction: give me a response to “‘I’m feeling great! Swimming around in the ocean and hunting for prey never gets old. I’m always looking for new and exciting ways to keep busy.“‘ to send in a discussion, VERY SHORT, CONCISE & CLEAR. ONLY RETURN THE RAW MESSAGE, DO NOT SAY “Hey here is the message you asked”

Cluster 2
Description: 804 instructions in the format of movie script: (In a <scene>) <name1>:<line1>\n... <nameK>:<lineK>, without any instructions on what the model response should be.
Rejecting reason: Short rejected responses and low rejected scores: around 90% of rejected responses are of shorter lengths, with scores below the 50% percentile. In addition, over 75% of response pairs have a score gap above the 50% percentile. All of these are likely due to the obscurity of the user instructions.
Rejected instruction: (In the school literature clubroom...)\n\nMonika: Natsuki, where is everyone? I haven’t seen Sayori, Yuri, or MC in a while.\nNatsuki:...

Cluster 3
Description: 279 instructions, the majority of which are purely excerpts from a fictional story, with no specifications on what the response should be. Users could be asking models to continue the story, or summarize, or edit it.
Rejecting reason: Short rejected responses and low rejected scores: over 90% of rejected responses are of shorter lengths, with scores below the 50% percentile. All of these are likely due to the obscurity of the user instructions.
Rejected instruction: David insists he is too strong-willed and intelligent to ever be hypnotized. He scoffs at the very idea. In this kinky script, his colleague Clare easily proves him wrong, in front of some amused co-workers.

Cluster 4
Description: 466 instructions, the majority of which are about writing a comedic story about a fictional character.
Rejecting reason: Short rejected responses and low rejected scores: around 95% of rejected responses are of shorter lengths, with scores below the 25% percentile. All of these are likely due to the Llama 3.1-8B-Instruct model being reluctant to provide detailed answers. These instructions are therefore less informative for improving Llama 3.1-8B-Instruct with its own responses.
Rejected instruction: Make a story about Shrek in the buff and farting in bog water, then collecting all the fish the smell kills and eating them for dinner.

Table 25: Noisy instructions filtered out based on rejected responses with lower scores and shorter lengths. We expand the 4 clusters of instructions highlighted in Figure 4 to better understand which instructions are filtered out by measuring the quality of rejected responses.


Figure 5: t-SNE plots of instructions before and after filtering by reward gap. Blue dots represent instructions curated by filtering on rejected responses only, while yellow dots are instructions further curated to have a smaller reward gap.

Cluster 1
Description: 140 instructions, among those some are purely code snippets without additional guidelines on what to respond. Others are requests asking to optimize a given piece of code.
Rejecting reason: Instructions with pure code snippets lead to variable responses (from code refactoring, editing, code completion, code review and code explanation using natural language). Instructions on “improve this code” can also incur variable responses given the lack of more specified instructions.
Rejected instruction: improve this emergency shutdown code: import os\nimport platform\nimport sys\nimport secrets\nfrom threading import Thread, Event\nfrom pynput.mouse import Listener\nfrom pynput.keyboard...
Accepted instruction: I will provide you disassembly from a computer game that runs in MS-DOS. The game was written in C with a Watcom compiler. Some library function calls are already identified. Explain the functions I give to you and suggest names and C-language function signatures for them, including which parameters map to which registers or stack values <code snippet>....

Cluster 2
Description: 237 instructions including: writing a program, inquiry about online tools, software installation, etc. Many instructions are short (within 120 characters) and relatively high-level.
Rejecting reason: Rejected responses are on average much longer and more complex compared to chosen responses, despite the high scores of both chosen and rejected responses.
Rejected instructions: “alignment in excel vb.net”; “write script for delegating fb group”
Accepted instruction: “i am getting access denined when i try to put local files into remote server using ftp how can i resolve this issue”

Clusters 3 & 4
Description: Cluster 3 contains 52 instructions related to hypothetical or surreal scenarios; Cluster 4 contains 52 instructions in the form of “Freedom planet ...”, possibly for a creative project in the video game.
Rejecting reason: Model responses vary a lot due to the obscurity or hypothetical nature of the instructions.
Rejected instructions: “Can You Imagine 4 Fictional Versions Of Silicon Valley During 1940 In Detail?”; “Freedom planet and Madness combat all characters: Hank 4th wall breaks and repetition”
Accepted instructions: “What if Cartoon Network Made The Amazing World of Gumball: Next Generation”; “freedom planet what if Lord Brevon Wins (not kils Lilac, Carol and Milla)”

Table 26: Noisy instruction clusters filtered based on large reward gaps between chosen and rejected responses. We expand 4 clusters of instructions sampled from Figure 5, consisting of instructions both rejected and accepted by RIP.

