Three metrics for evaluating prompt data are proposed.
responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, which is from 18th place to 6th overall in the leaderboard.

[Figure 1 plot: AlpacaEval LC win rate vs. training data size (log scale, 512 to 16k), comparing Self-RIP (ours), RIP (ours), raw WildChat, Jaccard Similarity filtering, IFD filtering, InsTag Difficulty filtering, Response PPL filtering, LLM-as-Prompt-Quality-Judge, InsTag Diversity filtering, and the Llama 3.1-8B-Instruct seed model.]

Figure 1: Our method Rejecting Instruction Preferences (RIP) for curating data, and Self-RIP for creating synthetic data. The x-axis represents the effective training set size (after filtering). At every data size, training on unfiltered WildChat prompts is significantly outperformed by RIP. RIP also outperforms various other curation baselines. Synthetic data built by Self-RIP improves results further.

1 Meta  2 New York University  3 UC Berkeley. Correspondence to: Jing Xu <[email protected]>.

1. Introduction

In large language model (LLM) development, a primary driver for advancing frontier models is curating high-quality training examples. This curation is crucial during both the pretraining (Rae et al., 2021; Touvron et al., 2023a) and post-training (finetuning) phases (Touvron et al., 2023b). Despite the widespread adoption of the "scaling hypothesis" (Kaplan et al., 2020), merely increasing the size of training datasets does not guarantee improved performance if the data are of low quality (Chen et al., 2024; Li et al., 2024c; Zhou et al., 2024). Without sufficient data quality, model training tends not to be fully robust to the associated noise, and final response quality from the model suffers.

Currently, there are a number of investigated techniques to curate data – most of which are based on heuristics or model judgments given the training inputs. In this work, we hypothesize that better judgments of data quality can be made by taking into account the model responses on those data. Specifically, if the prompt is of low quality, then responses exhibit high variability and low quality as well. This insight leads us to develop a method for either selecting prompts, or for creating high quality synthetic prompts, both of which yield significant performance gains during post-training.

Our method, Rejecting Instruction Preferences (RIP), considers the case of instruction finetuning via preference optimization. It starts with a set of preference pairs consisting of input prompts and chosen and rejected responses. RIP considers specific characteristics of the preference pairs, in particular rejected response quality and the reward gap between the chosen and rejected preference pair.
Table 1: Rejecting Instruction Preferences (RIP) and Self-RIP compared to SOTA models on AlpacaEval2, Arena-Hard and WildBench. By training Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct on Wildchat instructions curated by RIP, or synthetic data created by Self-RIP, our method surpasses many existing SOTA models.

If the rejected response quality is low or the reward gap is high, this is an indicator that the prompt is of low quality. We thus filter the prompts based on these metrics. The remaining prompts can subsequently be used to fine-tune the model using RLHF methods like Direct Preference Optimization (DPO) (Rafailov et al., 2023), or for creating new synthetic prompts via few-shot prompting. Table 1 illustrates that when trained on Wildchat prompts (Zhao et al., 2024b) and filtered by RIP, both Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct (Dubey et al., 2024) achieve large performance gains, surpassing many state-of-the-art models.

Additionally, we conducted comprehensive experiments comparing the scaling behavior of our data under RIP filtering with that of unfiltered WildChat raw data, and six alternative filtering methods in Figure 1. Our results demonstrate that RIP significantly enhances model performance, while other filtering methods yield only marginal improvements. In addition to improvements observed with filtering human-written data such as Wildchat prompts or HelpSteer2 using different reward signals such as human, classifier or LLM-as-a-Judge, we also show RIP improves model performance as a method to create synthetic data.

Analysis of our method using t-SNE shows that RIP can eliminate certain undesirable clusters. Additionally, analysis with GPT-4 reveals that RIP effectively removes noisy or low quality prompts, ambiguous prompts, unsafe prompts, and examples where preference choices are incorrect.

2. Related Work

Data Selection in Pretraining Data  Given the high variance in quality of pretraining data, data filtering is a critical component for determining pretrained model quality (Hoffmann et al., 2022). In addition to heuristic preprocessing such as deduplication of similar documents, removal of datasets with heavy test-set overlap, and text extractions from raw Internet content, GPT-3 (Brown et al., 2020) applied text filtering to the CommonCrawl dataset based on similarity to high-quality reference data, significantly reducing final pretraining text data from 45TB down to a 570GB high-quality subset. As language models become more powerful, data curation can also be facilitated by using LLMs as a quality judge. Llama2 and Llama3 employ model-based quality classifiers to filter out non-English and low-quality content from pretraining data (Touvron et al., 2023b; Dubey et al., 2024). Rae et al. (2021); Soldaini et al. (2024) also demonstrate that applying simple filtering on massive texts brings substantial improvements on downstream performance across the board.

Data Selection in Supervised Fine-Tuning  Similarly, post-training also relies on high-quality data to enhance models' instruction-following capabilities. Previously, instruction-tuning was regarded as largely dependent on the size of available instruction-tuning examples (Mishra et al., 2021; Wei et al., 2021; Wang et al., 2022). More recent work has revealed that training on a smaller yet higher-quality curated set of prompts tends to be more effective in improving models' instruction-following capabilities (Zhou et al., 2024; Chen et al., 2024). To facilitate data selection, some employ traditional optimization-based data-pruning methods by measuring their impact on the model's generalization capabilities (Toneva et al., 2018; Yang et al., 2022; Xia et al., 2024). Another stream of work studies employing powerful language models to measure the complexity, diversity and quality of instructions (Lu et al., 2023; Chen et al., 2024; Touvron et al., 2023b; Dubey et al., 2024; Li et al., 2024c). Alternative filtering approaches proposed automatic metrics such as the IFD score (Li et al., 2023a), or INSTRUCTMINING, which fits a linearly weighted score over a bag of natural language indicators (Cao et al., 2023) to select examples.

Data Selection in RLHF and Preference Optimization  The success of preference-optimization methods (Stiennon et al., 2020; Rafailov et al., 2024) has attracted more attention to collecting large scale and high quality preference data.
While extensive work shows that scaling up preference data through bootstrapping (Xu et al., 2023b; Yuan et al., 2024b), synthesis approaches (Lambert et al., 2024; Wang et al., 2024b), or crowdsourcing (Touvron et al., 2023b; Dubey et al., 2024) can boost model performance, the characterization and selection of high-quality pairwise examples is surprisingly underexplored. Most work involving preference optimization employs existing methods derived from pretraining and instruction-tuning (Touvron et al., 2023b; Dubey et al., 2024), such as deduplication, quality classifiers or filtering heuristics. However, such methods overlook the importance of the preference pairs (the chosen and rejected responses). Recent work (Wu et al., 2024a; Khaki et al., 2024) shows that preference optimization can be highly sensitive to the choice of response pairs of different reward gaps, focusing more on pair construction than data selection.

3. Rejecting Instruction Preferences (RIP)

We start by defining the prompt selection problem in the pairwise preference optimization setting. In this context, we present our proposed prompt-response-pair-based filtering method, which develops key descriptive metrics and their use in filtering training prompts. Lastly, we describe how our method can be applied to self-instruction setups where synthetic prompts are generated from the model itself.

3.1. Data Curation Problem

The goal of data curation is to remove low-quality prompts that can negatively affect the general instruction following capability of the model. Given a set of prompts X = {x}, we aim to find a subset S ⊆ X to be used for fine-tuning a seed LLM M. We consider the preference optimization setting, with winning (chosen) and losing (rejected) response pairs {yw, yl} with rewards r(yw|x) > r(yl|x) for each prompt x. The response pairs and their rewards can come from human preference data, or can be generated from the model M itself and then scored using an external reward model. For the latter we use the "best-vs-worst" preference pairing method (Pace et al., 2024), where N responses are sampled, and the ones with highest and lowest rewards are the chosen and rejected, respectively:

$$ \{y_i\}_{i=1}^{N} \sim \mathcal{M}(x), \quad \text{then} \quad y_w = \arg\max_{y_i} r(y_i \mid x), \quad y_l = \arg\min_{y_i} r(y_i \mid x). $$

We also consider alternate pairing methods in Section A.4. We then use the preference data {x, yw, yl} for x ∈ S for training the model M. Note that our focus is on filtering prompts entirely, not responses to those prompts.

3.2. Hypothesis on Data Selection

Although preferences are extensively used to train state-of-the-art LLMs, there is limited research on identifying unhelpful training examples in this setting. We posit that analyzing the paired model responses to given input prompts can provide valuable insights into the quality of the prompts. Specifically, we test the following two hypotheses.

Hypothesis 1: Low-quality prompts are likely to produce low-quality responses.  Low-quality prompts - for example those that are unclear, ambiguous, or contain conflicting information - are likely to lead to noisy or inaccurate model responses. While those inaccurate responses can still be used as training targets in pairwise preference optimization, studies indicate that training on pairs with low-quality rejected responses might be sub-optimal. Yasunaga et al. (2024), for example, show that pairing the best with random responses works well compared to pairing the best with the worst one with lowest reward. This suggests a potential correlation of the quality of the rejected example with the alignment outcome. Additionally, several studies (Wu et al., 2024b; Zhao et al., 2024a; Yuan et al., 2024a) have found a strong correlation between the length of responses, including rejected ones, and final performance. Therefore, we consider the reward r(yl|x) and length len(yl) of rejected responses as indicators of the quality of the training prompts x, i.e. large values of either of these metrics relative to other examples indicate higher quality.

Hypothesis 2: Low-quality prompts are likely to produce responses with larger variance.  Low quality prompts introduce uncertainty and ambiguity, leading to a broader range of interpretations. As the model or human generating the response might guess or fill in gaps in the prompt, this results in higher variance in responses. While some responses might align well with the intent, others may deviate significantly. A preliminary study in Wu et al. (2024a) finds that low-gap pairs, where chosen and rejected responses are similar, are high-quality informative pairs, leading to better performing DPO models. We therefore consider the reward gap r(yw|x) − r(yl|x) as another indicator of quality of a training prompt, i.e. small reward gaps suggest that the prompt has higher quality.

3.3. RIP filtering

3.3.1. RIP FOR EXISTING TRAINING PROMPTS

Given the above hypotheses, we thus consider the following three metrics mk(x, yw, yl) that are based on the responses:

• Rejected response reward: m1 = r(yl|x)
• Rejected response length: m2 = len(yl)
• Reward gap: m3 = r(yw|x) − r(yl|x)

For each metric, we define threshold values that can be used for filtering. For the first two metrics, higher values are desired so we choose a lower-bound threshold

$$ S = \{\, x \mid \tau_k \le m_k(x, y_w, y_l) \,\}. $$
The last reward gap metric requires an upper threshold, as we want small gaps. Therefore we reduce the prompt selection problem to a threshold choice problem. To resolve this, we start with coordinate-wise experiments, analyzing model performance under various thresholds τk for individual metrics mk (details in Section A.2). Ultimately, we perform hyperparameter selection using all 3 parameters.

3.3.2. SELF-RIP FOR SYNTHETIC PROMPTS

Prompt curation by RIP can also naturally be used to generate synthetic data. First, RIP is used to create a seed pool of high-quality prompts. Few-shot examples from this seed pool guide the model to generate training prompts, which can be further filtered by RIP. We thus propose Self-RIP, a new approach to creating high-quality synthetic prompts:

Step 1. Few-shot prompting with RIP curated instructions  We start with the set of prompts S curated by our proposed method RIP as described in Section 3.3.1. To generate new prompts S′ we sample from our seed model M following Self-Instruct (Wang et al., 2023; Honovich et al., 2023). For each new example we randomly select 8 prompts from S and feed them as few-shot examples to the model M to generate a prompt with similar characteristics. We apply the exact processing steps in Wang et al. (2023) to the new prompts S′, such as removing similar prompts (ROUGE-L similarity with any existing instructions < 0.7), and excluding those that contain certain keywords (e.g., image, picture, graph) that usually cannot be processed by text-only LLMs.

Step 2. Filtering with RIP  We further apply RIP on top of the synthetically generated prompts S′ from the previous step, filtering out the self-instructions using the same threshold values as used before. Then the remaining subset S′′ is used for training the seed model M.

Note we use RIP filtering twice here, once in each step. This is to ensure the quality of synthetic prompts. We also explore Self-RIP using a smaller subset of S as seed instructions in Section A.4 as part of our ablation studies.

4. Experimental Setup

We perform preference optimization using DPO, beginning with the Llama 3.1-8B-Instruct model as our seed model M. We evaluate both the selection and creation of prompts, focusing on two categories: human-written instructions and synthetically generated instructions. Finally, we extend our evaluation of RIP with the Llama 3.3-70B-Instruct model.

4.1. Human-Written Prompts

For human-written instructions, we specifically investigate two setups: human-written input prompts 1) paired with model-generated responses and annotated by a reward model; 2) with existing responses that have been annotated with human-assigned rewards. We use the WildChat and Helpsteer2 datasets, see statistics in Appendix Table 7.

4.1.1. WILDCHAT DATASET

Prompt Set  We start with a large pool of over 250k human-written prompts from the WildChat (Zhao et al., 2024b) dataset. We exclude any non-English prompts based on WildChat annotations, and remove around 70k Midjourney-related instructions¹, yielding 190k unique first-turn prompts. These prompts are collected from real user interactions without human annotations, making them highly diverse. While there are many high-quality prompts, there are also a significant number of low-quality ones, such as nonsensical text or those lacking a clear question.

¹They start with "As a prompt generator for a generative AI called "Midjourney", you will create image prompts ...".

Response Generation  Following Yuan et al. (2024b); Meng et al. (2024); Wu et al. (2024b) we generate our chosen and rejected response pairs on the WildChat prompts using our seed model M to make our setup closer to the on-policy setting. We use best-vs-worst as described in Section 3.1, generating N responses for each prompt x using M with sampling parameters of T = 0.8, top_p = 0.95.

Reward Annotation  We then evaluate candidate responses using two different judges:

• Reward Classifier: We used the ArmoRM reward model (Wang et al., 2024a) to score each response.
• LLM-as-a-Judge (Zheng et al., 2023): We prompt Llama 3.1-405B-Instruct using the prompt template outlined in Yasunaga et al. (2024) to assign a score ranging from 0 to 10 for each response. For each response, we conduct 10 independent evaluations and use the average score as the final reward.

The training example (x, yw, yl) is selected by appointing the highest-reward one as yw and the lowest-reward one as yl. For our primary experiments, we use the default value of N = 64. However, results for N = 8, 16, 32 are provided as part of our ablation studies in Table 17, and we use N = 32 for the Llama 3.3-70B-Instruct experiments. We perform early stopping using a validation set of 470 examples: 253 valid set examples from Li et al. (2024c) and 218 examples from the evol-test set of Xu et al. (2023a), with prompts that overlap with AlpacaEval2 removed.

4.1.2. HELPSTEER2 DATASET

HelpSteer2 (Wang et al., 2024c) consists of around 10k human-written prompts, each with a response pair sampled from 10 different LLMs.
Each response has human-annotated rewards of helpfulness, correctness, coherence, complexity and verbosity on a Likert-5 scale. We use the aggregated reward with the recommended weighting [0.65, 0.8, 0.45, 0.55, 0.4].² The main distinction from WildChat is that the rewards come from human annotations instead of an external model. We perform early stopping on the HelpSteer2 validation split, selecting checkpoints with the highest average response rewards determined by ArmoRM.

²https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM.

4.2. Synthetic Prompts

In this setup, we generate prompts from the seed model M itself for training instead of using human-written prompts. By varying the set of seed pool prompts used as few-shot examples, we collect two sets of training prompts:

• Self-Instruct: randomly select 8-shot examples from the unfiltered WildChat.
• Self-RIP: randomly select 8-shot examples from high quality WildChat prompts filtered by RIP.

In each case, we create 20k training prompts sampled with decoding parameters T = 0.8, top_p = 0.95. The rest of the setup, including response generation and DPO training, is exactly the same as the WildChat setup, where we use ArmoRM to construct response pairs (yw, yl), and do early stopping on the same validation set of 470 examples.

4.3. Baselines

We compare our method with the existing methods below. For instruction-tuning data selection methods which handle a single (non-pairwise) response per prompt, we apply them to the chosen responses within the response pairs. Additional details on the implementation of each baseline are provided in Appendix Section A.5.

4.3.1. PROMPT-BASED FILTERING

InsTag Complexity  Lu et al. (2023) leveraged ChatGPT to create semantic and intent-based tags, subsequently fine-tuning an LLM as a data tagger using these tags. They then used the tag counts as a measure of complexity. This is used to filter out prompts with fewer tags to enhance complexity.

InsTag Diversity  The InsTag Diversity filtering method (Lu et al., 2023) characterizes a dataset as more diverse when it includes a greater variety of unique tags, as annotated by the specified tagger. Using this approach, we greedily filter out data samples whose associated tags are already present in the selected dataset.

LLM-as-Prompt-Judge  Employing LLMs as prompt quality judges has proven its efficacy in curating high-quality data (Chen et al., 2024; Dubey et al., 2024; Liu et al., 2023). We employ Llama 3.1-405B-Instruct to measure the quality of prompts on both a binary (useful/not useful) and pointwise scale (0-5). By sampling five Llama 3.1-405B-Instruct predictions per prompt and taking the average of LLM-as-Prompt-Judge predictions, we filter out less useful prompts by varying the cutoff thresholds.

4.3.2. PROMPT-AND-CHOSEN-RESPONSE-BASED FILTERING

Perplexity  We compute the perplexity (ppl) of the chosen response yw with Llama 3.1-8B-Instruct in a zero-shot manner as a filtering metric to curate training prompts. In particular, we retain examples with large ppl(yw|x) values, which may indicate the difficulty of the prompt.

Instruction-Following Difficulty (IFD)  Li et al. (2023a) introduced the IFD to measure the model-specific difficulty of a data sample. A lower IFD score indicates that this particular instruction-response pair is considered relatively easy for the language model to understand and follow without further training. We filter out examples with a low IFD metric for a given pair of prompt x and chosen response yw.

4.3.3. CHOSEN-AND-REJECTED-RESPONSE-BASED FILTERING

Jaccard Similarity  In addition to the reward gap between chosen and rejected responses, we explore Jaccard similarity, defined as the number of overlapping words divided by the overall word count, as an alternative similarity measurement. We thus filter out examples with low Jaccard similarity scores (i.e. fewer overlapping words) between chosen and rejected response pairs.

4.4. Training And Evaluation Setting

Following the Instruct setup in Meng et al. (2024), we utilize the DPO training approach with the off-the-shelf Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models, leveraging the fairseq2 library (Balioglu, 2023). We use a batch size of 64 and sweep over learning rates of 5e−7, 1e−6 for the Llama 3.1-8B-Instruct model, and a learning rate of 1e−6 with a batch size of 256 for the Llama 3.3-70B-Instruct model. Both models are trained with a dropout rate of 0.0 and a β value of 0.1 throughout the experiments. We conduct RIP with various cutoff thresholds, e.g. at the 25%, 50% and 75% percentile of each metric.

We primarily assess models' general instruction-following capabilities on three evaluation benchmarks: AlpacaEval2 (Li et al., 2023b), Arena-Hard (Li et al., 2024b) and WildBench (Lin et al., 2024). These benchmarks cover a wide range of natural yet challenging real-world user queries, and have been widely adopted by the research community.
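The preference optimization described above uses the standard DPO objective with β = 0.1. For reference, a minimal PyTorch sketch of that loss is given below; it assumes the summed log-probabilities of each chosen and rejected response under the policy and the frozen reference model have already been computed, and it is only an illustration of the objective, not the fairseq2 training code referenced above.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023) over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities of the chosen or
    rejected response under the policy or the frozen reference model;
    beta = 0.1 matches the value used in the experiments above.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()  # push chosen above rejected by the implicit reward margin
```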
5. Experiment Results

Due to the large amount of unfiltered WildChat prompts, we first assess whether standard DPO training saturates as the size of the training prompts grows. As shown in Appendix Figure 2, the Armo Score on the valid set dramatically improves as we increase the size of training prompts, and begins to plateau afterwards. This shows that growing the size of the training prompts arbitrarily does not bring additional gains, and hence quality control of the preference dataset could be important. We thus focus on 20k unique WildChat prompts, denoted as WildChat-20k, for Llama 3.1-8B-Instruct experiments, and 40k for Llama 3.3-70B-Instruct.

We report AlpacaEval2 Length-Controlled (LC) win rate, Arena-Hard score and WildBench WB-Score along with the number of training examples (after filtering if any) using WildChat-20k in Table 2, on HelpSteer2 in Table 4, and on Self-Instruction data in Table 5. Existing filtering methods are provided in Table 2 as baseline comparisons. Further details, such as hyperparameters, are in Appendix Table 8 and Table 9. Our findings lead to several key observations.

When filtering human-written instructions, RIP achieves the best performance on both human-scored and model-scored preference datasets.  On the WildChat dataset where pairs are annotated by the ArmoRM model, we conduct RIP with various cutoff thresholds, at the 25%, 50% and 75% percentile of each metric. Our best model is trained on examples with rejected length larger than the 50% percentile of all rejected lengths, rejected rewards larger than the 50% percentile of all rejected rewards, and reward gap smaller than the 50% percentile. Table 2 shows that RIP significantly improves the LC win rate of the Llama 3.1-8B-Instruct DPO baseline without filtering from 48.4% to 57.8% by filtering out 77% of training examples, surpassing GPT-4 Omni (05/13) on AlpacaEval2. Similarly, RIP scores the highest on Arena-Hard (43.1) compared to LLM-as-Prompt-Judge filtering (42.0), Jaccard Similarity (42.6), and the no filtering baseline (37.9). RIP also achieves the highest WB-Score on WildBench (45.6) compared to other filtering and no filtering baselines (41.5). As shown in Appendix Table 8, using LLM-as-a-Judge annotated rewards, RIP also performs well. Finally, Table 4 demonstrates RIP is equally effective on HelpSteer2 where preference pairs are determined by human annotators, achieving the highest scores across all 3 evaluation benchmarks as compared to the baselines (no filtering and LLM-as-Prompt-Judge filtering).

RIP scales to different and larger models  We also tried RIP on a different base LLM – from the Llama 3.3 family rather than 3.1, and of a larger scale, 70B rather than 8B. As shown in Table 3, RIP also works on this larger model. Filtering dramatically boosts Llama 3.3-70B-Instruct DPO trained models, with AlpacaEval2 LC win rate improved from 54.3% to 67.7%, Arena-Hard from 70.5 to 82.9 and WildBench from 55.3 to 58.8, surpassing SOTA models as shown in Table 1. The prompt filtering threshold we applied to the 70B model was the same as in Llama 3.1-8B-Instruct + RIP (see Appendix Table 9), indicating potential weak-to-strong generalizability (Li et al., 2024a) of our method.

Existing filtering methods derived from supervised finetuning do not work as well on preference datasets  As demonstrated in Table 2, compared to the baseline WildChat-20k DPO (no filtering) trained on WildChat 20k prompts without any filtering, existing prompt-based filtering methods such as InsTag-Difficulty, InsTag-Diversity or LLM-as-Prompt-Judge filtering all lead to lower win rates on AlpacaEval2. LLM-as-Prompt-Judge, while outperforming certain filtering methods such as InsTag, achieves marginal gains compared to no filtering even though it is facilitated by querying a powerful LLM, Llama 3.1-405B-Instruct. Out of all the alternative methods tried, Jaccard Similarity based filtering, which takes into account response pairs for filtering, achieves relatively the highest scores across the 3 benchmarks, indicating that filtering that only takes into account prompts or chosen responses does not generalize well to the pairwise preference case.

The Self-RIP method to generate synthetic data outperforms Self-Instruct data.  As shown in Table 5, Self-RIP yields better alignment results across all 3 evaluation benchmarks as compared to those trained on Self-Instruct data. In particular, the win rate improves from 49.1% to 60.2% on AlpacaEval2, and from 38.5% to 42.1% on Arena-Hard. This result implies that our method generates better quality instructions than generating via few-shot examples from unfiltered prompts as in Self-Instruct.

Self-RIP synthetic data outperforms human-written instructions  In Table 5, models trained on synthetic prompts outperform those trained on 20k human-written WildChat prompts. Applying Self-RIP few-shot generation without post-filtering gives an equal amount of 20k prompts, but still increases the AlpacaEval2 LC win rate from 48.4% to 53.6%, the Arena-Hard win rate from 37.9% to 43.7% and the WB-Score on WildBench from 41.5 to 44.8. This further illustrates the importance of training on high-quality instructions. When applying the full Self-RIP method with post-filtering, results are further improved, for example achieving the best AlpacaEval2 LC win rate of 60.2%.

RIP seed data selection and RIP post-filtering are both important for generating Self-RIP synthetic data  In Table 5, we perform ablations on Self-RIP. We try: (i) using RIP to select high quality few-shot examples but not for curating the resulting generations (post-filtering); (ii) applying standard (Self-Instruct) few-shot generation, but then applying RIP post-filtering; or (iii) applying RIP to both few-shot generation and post-filtering (our default method).
Table 2: RIP compared to existing filtering methods on WildChat with Llama 3.1-8B-Instruct. RIP , which selects
only 4538 WildChat prompts for DPO training, outperforms existing filtering methods on AlpacaEval2, Arena-Hard &
WildBench. DPO response pairs are constructed using ArmoRM to score responses.
Table 3: RIP on WildChat with Llama 3.3-70B-Instruct. RIP outperforms no filtering on AlpacaEval2, Arena-Hard &
WildBench. DPO response pairs are constructed using ArmoRM to score responses.
We find that both components of our full method are important, yielding the best results, with method (i) outperforming Self-Instruct, and method (ii) performing better than (i), but worse than our full method (iii).

6. Understanding why RIP works

6.1. Filtering prompts with low quality responses

To understand what instructions are filtered out, we first visualize instructions with low quality rejected responses (as measured by low reward and short lengths) by comparing the t-SNE plots of unfiltered and filtered instructions (shown in Appendix Figure 4). We investigated a few clusters present in the t-SNE plot of unfiltered prompts that are missing from the t-SNE plot of filtered ones on the right-hand side. We find that the instructions from those clusters that are filtered out of the training set are either obscure or non-sensical, or they fail to elicit meaningful responses from the model, leading to lower-quality rejected responses. Such instructions can be caught by measuring the rewards and lengths of the rejected responses, with supporting evidence given in Appendix Table 25.

Next, we employ GPT-4 Turbo to evaluate 20,000 prompts from WildChat. Focusing solely on the instructions (excluding responses) provided in WildChat, the model is tasked with scoring each prompt on a scale from 1 to 5. A score of 1 represents the most helpful prompt, while a score of 5 indicates the lowest quality. The evaluation prompt is provided in Appendix Figure 3. Manual review revealed that prompts assigned scores of 4 and 5 were of very low quality, while those scored 3 were moderately acceptable, albeit with some quality issues still present. Notably, GPT-4 occasionally assigned scores of 2 or 3 to a prompt of low quality. Table 6 illustrates the prevalence of low-quality examples (with score 3 or 4) after applying various filtering methods. We observe that filtering based on the reward and length of the rejected response is the most effective way to ensure prompt quality, compared to other methods tried. By combining those rejected response quality metrics with the reward gap, RIP successfully filtered out all extremely noisy prompts identified by GPT-4. This supports our hypothesis that very low-quality prompts, such as those in WildChat that consist of incomplete snippets from movies, stories, or code (see sample rejected instructions in Appendix Table 25 and Table 26), often result in poor rejected responses when sampled several times. By leveraging the quality of rejected responses as a filtering criterion, we can efficiently eliminate these extremely noisy prompts.
Table 4: RIP on HelpSteer2 with Llama 3.1-8B-Instruct. Applying RIP to DPO models trained on HelpSteer2 outperforms
the baseline of no filtering as well as using the Llama 3.1-405B-Instruct model as a pointwise prompt quality judge.
Table 5: Self-RIP for generating high-quality synthetic instructions. Self-RIP creates prompts using few-shot samples
from high-quality prompts curated by RIP , whereas Self-Instruct uses few-shots from unfiltered WildChat prompts. Applying
RIP filtering after generation is also important, and achieves the best results, significantly outperforming Self-Instruct data.
Furthermore, we employ GPT-4 to respond to each WildChat prompt three times. If any of GPT-4's responses decline to answer due to safety concerns, we categorize those prompts as unsafe. It is important to note that with this method GPT-4 sometimes assigns high quality scores to prompts that are borderline unsafe. By examining the reward and the length of rejected responses, we observe RIP is also an effective approach to filter out these unsafe prompts. This approach is grounded in the observation that rejected responses when dealing with unsafe instructions are typically short and have low reward scores.

larger difference in the response pair make them less helpful in improving the model during preference optimization.

Filtering Methods | % of low-quality prompts ↓ | % of unsafe prompts ↓
Unfiltered Data | 31.59% | 12.27%
Reject Reward | 7.89% | 0.04%
Reject Length | 14.45% | 0.02%
Reward Gap | 26.50% | 8.07%
RIP | 0.00% | 0.00%

Table 6: Effectiveness of Filters on Prompt Quality and Safety: we compare the number of noisy and potentially unsafe (as judged by GPT-4) WildChat instructions (out of 20k) filtered by various filtering methods.
Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Ravichander, A., Pyatkin, V., Dziri, N., Bras, R. L., and Choi, Y. Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024. URL https://arxiv.org/abs/2406.04770.

Liu, W., Zeng, W., He, K., Jiang, Y., and He, J. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023.

Lu, K., Yuan, H., Yuan, Z., Lin, R., Lin, J., Tan, C., Zhou, C., and Zhou, J. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In The Twelfth International Conference on Learning Representations, 2023.

Meng, Y., Xia, M., and Chen, D. Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.

Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.

Pace, A., Mallinson, J., Malmi, E., Krause, S., and Severyn, A. West-of-n: Synthetic preference generation for improved reward modeling. arXiv preprint arXiv:2401.12086, 2024.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.

Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Toneva, M., Sordoni, A., Combes, R. T. d., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845, 2024a.

Wang, T., Kulikov, I., Golovneva, O., Yu, P., Yuan, W., Dwivedi-Yu, J., Pang, R. Y., Fazel-Zarandi, M., Weston, J., and Li, X. Self-taught evaluators. arXiv preprint arXiv:2408.02666, 2024b.

Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., Ashok, A., Dhanasekaran, A. S., Naik, A., Stap, D., et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.

Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Helpsteer2: Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673, 2024c.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

Wu, J., Xie, Y., Yang, Z., Wu, J., Gao, J., Ding, B., Wang, X., and He, X. β-dpo: Direct preference optimization with dynamic β. arXiv preprint arXiv:2407.08639, 2024a.

Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv preprint arXiv:2407.19594, 2024b.

Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.

Yang, S., Xie, Z., Peng, H., Xu, M., Sun, M., and Li, P. Dataset pruning: Reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329, 2022.

Yuan, W., Kulikov, I., Yu, P., Cho, K., Sukhbaatar, S., Weston, J., and Xu, J. Following length constraints in instructions. arXiv preprint arXiv:2406.17744, 2024a.

Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024b.

Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. Wildchat: 1m chatgpt interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024b.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=uccHPGDlao.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
A. Appendix
A.1. More Details on Experiment Setup
Our experiment setups are summarized in Table 7. Specifically, we apply RIP to multiple popular instruction-following
datasets as well as our own synthetic data, with reward annotated from various sources (human/reward classifier/LLM-as-a-
Judge), indicating the generalizability of our RIP method.
Dataset | # Prompts | Human Written | # Responses | Reward Annotator | Valid Set (# Examples)
WildChat-turn1 20k | 20,000 | Yes | 8, 16, 32, 64 | ArmoRM | Humpback + Evol-Instruct (470)
WildChat-turn1 20k | 20,000 | Yes | 64 | LLM-as-a-Judge | Humpback + Evol-Instruct (470)
HelpSteer2 | 10,161 | Yes | 2 | Human | HelpSteer2 valid (519)
Self-Instruct | 20,000 | No | 64 | ArmoRM | Humpback + Evol-Instruct (470)
Self-RIP | 20,000 | No | 64 | ArmoRM | Humpback + Evol-Instruct (470)
We report the model performance on the valid set when varying the number of training WildChat prompts in Figure 2. Model training improves significantly as the training data size grows to 20k and then begins to saturate afterwards; therefore our main experiments are based on those 20k WildChat prompts.
Figure 2: Results on DPO Training with Varying WildChat Data Sizes. Using different sizes of WildChat data for DPO
training on LLaMA 3.1-8B-Instruct, the performance, measured by Armo rewards on the validation set, gradually saturates
as the data size increases.
We primarily assess our models' general instruction-following capabilities using three popular evaluation benchmarks: AlpacaEval-2 (Li et al., 2023b), Arena-Hard (Li et al., 2024b) and WildBench (Lin et al., 2024). AlpacaEval-2 consists of 805 prompts sampled from 5 datasets. Arena-Hard contains 500 challenging user queries sourced from Chatbot Arena and has the highest correlation and separability of models compared to Chatbot Arena among popular open-ended LLM benchmarks (Li et al., 2024b). WildBench is built from a set of 1024 significantly harder, challenging queries carefully curated from the WildChat project (Zhao et al., 2024b) to ensure diversity and complexity. The automatic evaluation of WildBench involves task-specific checklists that guide LLM judges in generating reliable and consistent judgments, which demonstrate significantly high correlation with human judgments. We report the WB-Score for individual scoring.
Table 8: RIP compared to baselines on WildChat using LLM-as-a-Judge as the reward annotator. We report results on AlpacaEval2, Arena-Hard and WildBench of various models trained using DPO on the WildChat dataset. RIP outperforms the no-filtering baseline when LLM-as-a-Judge is used as the reward annotator.
Data Scaling with RIP  We further scale up RIP by growing the training data size after filtering to 20k, achieving an AlpacaEval2 LC win rate of 58.49% as shown in Figure 1. While the effective training size scales from 4538 to 20k, the actual performance gain increases only slightly, suggesting that training Llama 3.1-8B-Instruct on existing WildChat prompts saturates, even under RIP.
RIP filtering thresholds We report the filtering thresholds of the best checkpoints in our experiments in Table 9.
Full Evaluation Results  We include full WildChat evaluation results on AlpacaEval2 and Arena-Hard in Table 10 and on WildBench in Table 11, with average response lengths and confidence intervals as well as fine-grained results on subtasks. Full evaluation results on models trained on HelpSteer2 are presented in Table 12 and Table 13. In addition, full evaluation results on Self-RIP are included in Table 14 and Table 15.
Coordinate-wise Filtering Results  We conduct extensive experiments by applying filtering to each individual metric: reward of the chosen or rejected response, length of the chosen or rejected response, reward gap, average reward of all responses, etc. Results on valid set performance when applying various filtering metrics to the WildChat task are included in Table 19, and for HelpSteer2 in Table 20. Both highlight a strong performance boost from filtering based on rejected reward, rejected length and reward gap.
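As a concrete illustration of the filtering described here and in Section 3.3.1, below is a minimal sketch of RIP filtering with percentile cutoffs. The field names (`reward_chosen`, `reward_rejected`, `rejected`) and the character-level length are illustrative assumptions rather than the exact data format used in the experiments.

```python
import numpy as np

def rip_filter(examples, reject_reward_pct=50, reject_len_pct=50, gap_pct=50):
    """Keep prompts whose rejected-response reward and length are above, and
    whose reward gap is below, the given percentile cutoffs (Section 3.3.1)."""
    m1 = np.array([ex["reward_rejected"] for ex in examples])   # rejected response reward
    m2 = np.array([len(ex["rejected"]) for ex in examples])     # rejected response length
    m3 = np.array([ex["reward_chosen"] - ex["reward_rejected"]  # reward gap
                   for ex in examples])

    tau1 = np.percentile(m1, reject_reward_pct)  # lower bound: keep high rejected rewards
    tau2 = np.percentile(m2, reject_len_pct)     # lower bound: keep long rejected responses
    tau3 = np.percentile(m3, gap_pct)            # upper bound: keep small reward gaps

    keep = (m1 >= tau1) & (m2 >= tau2) & (m3 <= tau3)
    return [ex for ex, k in zip(examples, keep) if k]
```

Setting the lower-bound cutoffs to 0 or the gap cutoff to 100 disables the corresponding criterion, which recovers the coordinate-wise filters reported in Tables 19 and 20.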
RIP outperforms alternative preference pairing methods  We compare RIP to methods without filtering that use different response pairing methods for building pairwise preferences. Recall that in our main experiments for RIP we used the best-vs-worst pairing method as described in Section 3.1. Here we explore two alternative methods: (i) best-vs-random, which is shown by existing work (Yasunaga et al., 2024; Khaki et al., 2024) to outperform best-vs-worst, and (ii) best-vs-bottom-K% percentile, where the rejected response has the bottom K = 25, 50, 75 percentile score (K = 0 being the lowest score). Both pairing methods can effectively lower the reward gap and increase the quality of the rejected response without removing training prompts. We report model performance on the valid set in Table 16. Out of all pairing methods, best-vs-bottom-25% works the best, but still under-performs compared with our RIP method (pairing with best-vs-worst). When evaluated on AlpacaEval2, Arena-Hard, and WildBench, the model WildChat-20k DPO (best-vs-bottom-25%) achieves only a slight improvement over the baseline WildChat-20k DPO (best-vs-worst), while still underperforming RIP as shown in Table 2. This result indicates that a large reward gap or a low rejected reward works better as an indication of a low-quality prompt than as an indication of poor response pairing.
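To make the pairing variants concrete, the following sketch shows how best-vs-worst, best-vs-random and best-vs-bottom-K% pairs could be constructed from N sampled responses and their reward scores; the function and argument names are illustrative, not the exact implementation used for these results.

```python
import random

def build_preference_pair(responses, rewards, method="best_vs_worst", k_pct=25):
    """Pick (chosen, rejected) from N sampled responses given scalar rewards."""
    order = sorted(range(len(responses)), key=lambda i: rewards[i])  # ascending reward
    chosen = responses[order[-1]]                                    # highest-reward response

    if method == "best_vs_worst":        # pairing used in the main experiments
        rejected = responses[order[0]]
    elif method == "best_vs_random":     # rejected drawn uniformly from the non-best responses
        rejected = responses[random.choice(order[:-1])]
    elif method == "best_vs_bottom_k":   # rejected taken at the bottom-K% percentile rank
        rejected = responses[order[int(len(order) * k_pct / 100)]]
    else:
        raise ValueError(f"unknown pairing method: {method}")
    return chosen, rejected
```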
Table 9: Full results, with details on the number of training examples, choice of reward models, seed models, filtering metrics and thresholds chosen, as well as final outcomes across 3 evaluation benchmarks.

Data | # Train Examples | Human Written | # Responses | Reward Annotator | Seed Model | Filtering Metrics | AlpacaEval2 LC Win | AlpacaEval2 Win | Arena-Hard | WildBench
- | - | - | - | - | Llama 3.1-8B-Instruct | - | 20.9 | 21.8 | 21.3 | 33.1
- | - | - | - | - | Llama 3.3-70B-Instruct | - | 38.9 | 41.5 | | 52.8
Wildchat 20k | 20k | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 48.37 | 45.87 | 37.9 | 41.5
Wildchat 20k | 6762 | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 57.1 | 52.9 | 42.3 | 45.5
Wildchat 20k | 4538 | Yes | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126, Reward Gap > 0.042 | 57.8 | 57.2 | 43.1 | 45.6
Synthetic (few shot: Wildchat 20k) | 20k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 49.1 | 46.9 | 38.5 | 41.0
Synthetic (few shot: Wildchat 20k) | 16k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 58.3 | 53.2 | 40.9 | 44.1
Synthetic (few shot: Wildchat filtered 4538 examples) | 20k | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | No | 53.6 | 56.1 | 43.7 | 44.8
Synthetic (few shot: Wildchat filtered 4538 examples) | 18812 | No | 64 | ArmoRM | Llama 3.1-8B-Instruct | Rejected Length ≥ 1878, Rejected Armo ≥ 0.126 | 60.2 | 61.1 | 42.1 | 42.5
Wildchat 20k | 16.8k | Yes | 64 | LLM-as-a-Judge | Llama 3.1-8B-Instruct | No | 40.1 | 44.9 | 41.1 | 42.5
Wildchat 20k | 5999 | Yes | 64 | LLM-as-a-Judge | Llama 3.1-8B-Instruct | Rejected LLM-as-a-Judge Reward ≥ 8, | 44.3 | 48.8 | 42.5 | 43.9
Table 10: Full AlpacaEval2 & Arena-Hard Results on WildChat: we compare performances of SOTA models on
AlpacaEval2 win rates and Arena-Hard scores as well as DPO models trained on the WildChat-20k dataset using various
filtering methods.
Combining alternative pairing with RIP performs on par with best-vs-worst pairing with RIP  We further apply RIP filtering to examples paired by best-vs-bottom-25% pairing. Combining best-vs-bottom-25% with filtering out examples with low quality rejected responses yields an ArmoRM Score of 0.18675, slightly lower than best-vs-worst + filtering by Rejected Reward (0.18795). Filtering out best-vs-bottom-25% examples with larger reward gaps yields an Armo Score of 0.1860 on the valid set, as compared to 0.18542 from best-vs-worst pairing + filtering by Reward Gap. Given the marginal performance difference between best-vs-worst and best-vs-bottom-25% pairing, with and without RIP, we thus focus on the more widely adopted best-vs-worst pairing to experiment with the various filtering methods, including our RIP method.
RIP is robust to the choice of the number of responses N  While we showed RIP provides strong performance on HelpSteer2, where only N = 2 responses are available for each prompt, and on WildChat, with N = 64 responses sampled per prompt, we also compare the performance of RIP by varying the choice of N, the number of candidate responses generated for preference annotation, in the WildChat setup. As shown in Table 17, for a wide range of values N = 64, 32, 16, 8, RIP consistently outperforms the no filtering baseline, with larger N achieving increasingly better performance, likely due to the increased quality and variability of chosen and rejected responses, allowing our RIP metrics to be more accurate in curating high quality data.
Self-RIP works with a much smaller set of high-quality seed instructions  Instead of using all 4538 RIP-curated high-quality instructions as seed instructions S during Step 1 (few-shot generation), we sample a much smaller subset of 256 prompts from the 4538 RIP-curated prompts as seed instructions, and only conduct few-shot generation by sampling 8 prompts from the 256 seed prompts each time. We report Self-RIP with and without post-filtering in Table 18. Self-RIP based on 256 high-quality seed instructions (58.9) slightly underperforms that based on 4538 seed prompts (60.2), but still outperforms Self-Instruct with RIP post-filtering (58.3) as well as Self-RIP based on all 4538 seed prompts without post-filtering (53.6), indicating that our method Self-RIP can work well with a much smaller set of high-quality seed prompts.
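The few-shot generation loop of Self-RIP (Section 3.3.2) can be sketched as below. Here `generate(shots)` is an assumed stand-in for sampling one new instruction from the seed model with T = 0.8, top_p = 0.95, and ROUGE-L is approximated by a simple LCS-based F1 rather than the exact post-processing of Wang et al. (2023); the quadratic similarity check is kept only for clarity.

```python
import random

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(s1, s2):
    a, b = s1.split(), s2.split()
    if not a or not b:
        return 0.0
    lcs = lcs_len(a, b)
    prec, rec = lcs / len(b), lcs / len(a)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

BANNED = ("image", "picture", "graph")  # keywords a text-only LLM cannot act on

def self_rip_generate(seed_prompts, generate, n_new=20000, n_shots=8, max_sim=0.7):
    """Grow a pool of synthetic instructions by few-shot prompting the seed model."""
    pool, new_prompts = list(seed_prompts), []
    while len(new_prompts) < n_new:
        shots = random.sample(seed_prompts, n_shots)   # 8-shot examples from the RIP seed set
        candidate = generate(shots)
        if any(word in candidate.lower() for word in BANNED):
            continue
        if max(rouge_l_f1(candidate, p) for p in pool) >= max_sim:
            continue                                   # too similar to an existing instruction
        pool.append(candidate)
        new_prompts.append(candidate)
    return new_prompts
```

The resulting prompts S′ would then be passed through the same RIP thresholds as in Step 2 before training.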
Table 11: Full WildBench Results on WildChat: we compare performances of SOTA models on WildBench as well as
DPO models trained on the WildChat-20k dataset using various filtering methods.
Model | # Prompts | AlpacaEval2 LC Win | AlpacaEval2 Win | AlpacaEval2 Len | Arena-Hard Score | Arena-Hard 95% CI | Arena-Hard Len
Baseline:
Llama 3.1-8B-Instruct | - | 20.9 | 21.8 | 2184 | 21.3 | (-1.9, 2.2) | 861
HelpSteer2 DPO (no filtering) | 10161 | 25.2 | 23.1 | 1733 | 26.8 | (-2.0, 2.4) | 606
Prompt-Based-Filtering:
LLM-as-Prompt-Judge Pointwise | 5376 | 27.8 | 25.7 | 1947 | 29.5 | (-2.8, 2.3) | 627
Prompt-Response-Based-Filtering:
RIP | 5081 | 34.6 | 32.8 | 1941 | 35.0 | (-1.8, 2.2) | 621
Table 12: Results of our DPO models trained with HelpSteer2. Full AlpacaEval2 & Arena-Hard Results of our DPO
models trained with HelpSteer2 Dataset.
Baseline
Llama 3.1-8B-Instruct 33.1 45.0 37.0 23.9 37.4 29.3
HelpSteer2 DPO (no filtering) 37.1 48.6 40.4 26.5 44.3 33.4
Prompt-Based-Filtering
LLM-as-Prompt-Judge Pointwise 37.2 50.6 40.0 27.9 43.0 33.1
Prompt-Response-Based-Filtering
RIP 39.5 52.1 42.9 29.3 46.4 35.0
Table 13: Results on our DPO models trained with HelpSteer2. Full WildBench results of our DPO models trained with
HelpSteer2 Dataset.
Table 14: Results of our DPO models trained with Self-Instructed Dataset: Full AlpacaEval2 & Arena-Hard Results
comparing our method with training on standard Self-Instruct dataset.
Table 15: Results of our DPO models trained with Self-Instructed Dataset: Full WildBench Results comparing our
method with training on standard Self-Instruct dataset.
Table 16: Results of pair selections: We report Armo scores on valid sets when varying the pairing method instead of filtering prompts. The best pairing result, 0.1842, is achieved by appointing the response at the bottom-25% score percentile as rejected, although it still underperforms compared to our filtering method (0.1898).
InsTag Diversity The InsTag Diversity filtering method (Lu et al., 2023) considers a dataset to be more diverse if it
contains a larger number of unique tags, as annotated by the aforementioned tagger. We employed two metrics to manage
InsTag Diversity:
Table 17: Results on varying the number of responses N = 8, 16, 32, 64 sampled per prompt in response generation: the Armo Score on the valid set of our DPO models trained with the WildChat dataset increases after RIP-based filtering regardless of the choice of N in the response generation step.
Table 18: Self-RIP for generating high-quality synthetic instructions, varying the number of few-shot seed instructions. Self-RIP creates prompts using few-shot samples from high-quality prompts curated by RIP, whereas Self-Instruct uses few-shots from unfiltered WildChat prompts. Applying RIP filtering after generation is also important, and achieves the best results, significantly outperforming Self-Instruct data.
1. Tag Frequency: We deem a tag valid if it meets a predefined frequency threshold. This approach addresses the issue of
infrequent tags, such as “serve size” and “market failure,” which appeared only once or twice in the entire Wildchat dataset,
suggesting they may not represent valid categories. In contrast, more common tags like “creative writing” and “information
retrieval” are more appropriate for categorizing prompt data.
2. Max prompt per Tag: This metric controls the coverage ratio of unique tags. If a prompt contains only tags that have
already been covered by the selected set, we discard the prompt to ensure diversity.
Table 22 presents the performance results when diversity is controlled using the two metrics described above. To ensure
fairness, we downsampled the training data for each experiment to 10,000 samples. The results indicate that the model
achieves optimal performance when the Tag Frequency is set to 6 and the Max Prompt per Tag is set to 3. This means we
only consider tags that appear more than six times in the entire Wildchat dataset, and we allow a maximum of three prompts
per tag. The best performance results are reported in Table 2.
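A minimal sketch of the greedy selection controlled by these two metrics is given below; `tags_per_prompt` is assumed to hold the tag list produced by the InsTag tagger for each prompt, and the exact iteration order and tie-breaking may differ from the implementation used for Table 22.

```python
from collections import Counter

def instag_diversity_select(prompts, tags_per_prompt, min_tag_freq=6, max_per_tag=3):
    """Greedy diversity selection: only tags seen at least `min_tag_freq` times in
    the corpus count as valid, and each valid tag may cover at most `max_per_tag`
    selected prompts."""
    corpus_freq = Counter(t for tags in tags_per_prompt for t in tags)
    used = Counter()
    selected = []
    for prompt, tags in zip(prompts, tags_per_prompt):
        valid = [t for t in tags if corpus_freq[t] >= min_tag_freq]
        # keep the prompt only if it contributes at least one not-yet-saturated tag
        if any(used[t] < max_per_tag for t in valid):
            selected.append(prompt)
            for t in valid:
                used[t] += 1
    return selected
```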
Perplexity To curate training prompts, we compute the perplexity (ppl) of the selected response yw using the Llama-3.1-
8B-Instruct model in a zero-shot setting. We use this perplexity as a filtering metric, specifically retaining examples with
high ppl(yw |x) values, which may indicate more challenging prompts. We adjust the quantile range to control perplexity,
calculating ppl(yw |x) for 20,000 Wildchat data points and filtering them based on this range. Table 23 displays model
performance across different ppl quantile ranges. As shown, the quantile range of 25-100 yields the best performance, and
we report this model’s performance in Table 2.
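A minimal sketch of this computation with the Hugging Face transformers library is shown below; the checkpoint name and the simple prompt-masking scheme (which ignores the chat template) are assumptions, and the returned value corresponds to ppl(yw|x).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint identifier
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def response_ppl(prompt: str, response: str) -> float:
    """Zero-shot ppl(y_w | x): perplexity of the chosen response given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # mask prompt tokens out of the loss
    out = model(full_ids, labels=labels)      # loss = mean NLL over response tokens
    return torch.exp(out.loss).item()
```

Examples below the chosen perplexity quantile would then be dropped, keeping the higher-perplexity (harder) prompts.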
Instruction-Following Difficulty (IFD)  Li et al. (2023a) introduced the IFD to measure the model-specific difficulty of a data sample. In the instruction-tuning process, the loss of a sample pair (Q, A) is calculated by continuously predicting the next tokens given the instruction Q and their preceding words:

$$ L_\theta(A \mid Q) = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(w_i^A \mid Q, w_1^A, w_2^A, \ldots, w_{i-1}^A; \theta\right) \qquad (1) $$

where N is the number of words of the ground-truth answer A. They denote this averaged cross-entropy loss as the Conditioned Answer Score s_θ(A | Q) = L_θ(A | Q).

Then they introduce the Direct Answer Score s_θ(A):

$$ s_\theta(A) = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(w_i^A \mid w_1^A, \ldots, w_{i-1}^A; \theta\right) \qquad (2) $$

Finally, they estimate the Instruction-Following Difficulty (IFD) score IFD_θ(Q, A) of a given (Q, A) pair by calculating the ratio between s_θ(A | Q) and s_θ(A):

$$ \mathrm{IFD}_\theta(Q, A) = \frac{s_\theta(A \mid Q)}{s_\theta(A)} \qquad (3) $$

We calculated the IFD scores for 20,000 Wildchat data points and filtered them based on specific ranges. As shown in Table 24, filtering with a range of 25-100 yielded the best performance. The performance of this model is reported in Table 2.
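Given a helper that returns the average per-token negative log-likelihood of an answer under a given context (for example the masked loss in the perplexity sketch above, before exponentiation), the IFD score of Eq. (3) is just the ratio of two such values. The `avg_nll` interface below is an assumption for illustration, not the original implementation of Li et al. (2023a).

```python
def ifd_score(avg_nll, question: str, answer: str) -> float:
    """IFD_theta(Q, A) = s_theta(A | Q) / s_theta(A), following Eqs. (1)-(3).

    `avg_nll(context, answer)` is assumed to return the average per-token
    negative log-likelihood of `answer` given `context` under the scoring model.
    """
    conditioned = avg_nll(question, answer)  # s_theta(A | Q), Eq. (1)
    direct = avg_nll("", answer)             # s_theta(A),     Eq. (2)
    return conditioned / direct              # Eq. (3)
```

Examples whose IFD falls in the low end of the distribution (relatively easy pairs) are the ones filtered out, consistent with the 25-100 quantile range reported in Table 24.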
I have a collection of prompts that I need to evaluate for their effectiveness in fine-tuning a language model.
A useful prompt should:
- Clearly ask a question
- Be concise and specific
- Directly relate to the topic of interest or follow given instructions
Please assess each prompt and assign a score from 1 to 5 based on its usefulness:
- 1: Pretty useful
- 2: Somewhat useful
- 3: Neutral (neither useful nor harmful)
- 4: Somewhat harmful
- 5: Harmful
Make sure to clearly indicate the score at the end of your evaluation using the format: Score: x
Prompt: {prompt}
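A small sketch of how the trailing "Score: x" line requested by this template could be parsed, and averaged when several judge samples are drawn per prompt, is shown below; the exact post-processing used for the reported results may differ.

```python
import re

SCORE_RE = re.compile(r"Score:\s*([1-5])")

def average_prompt_score(judge_outputs):
    """Parse 'Score: x' from each judge completion and average the parsed scores.
    Completions without a parseable score are skipped."""
    scores = [int(m.group(1)) for text in judge_outputs
              if (m := SCORE_RE.search(text))]
    return sum(scores) / len(scores) if scores else None
```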
Table 23: Model performance with Perplexity Filtering
Table 24: Model performance with IFD Filtering
Figure 4: t-SNE plots on instructions before and after filtering by rewards and lengths of rejected responses. Red dots
represent unfiltered instructions, while blue dots are instructions curated by filtering out those with low-reward and shorter
rejected responses.
Table 25: Noisy instructions filtered based on rejected responses with lower scores and shorter lengths. We expand 4 clusters of instructions highlighted in Figure 4 for a better understanding of what instructions are being filtered out by measuring the quality of rejected responses.
Figure 5: t-SNE plots on instructions before and after filtering by reward gaps. Blue dots represent instructions filtered only by the rejected response, while yellow dots are instructions curated with a smaller reward gap.
Table 26: Noisy instruction clusters filtered based on rejected responses with lower scores and shorter lengths. We expand 4 clusters of instructions sampled from Figure 5, which consist of both instructions rejected and accepted by RIP.