ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong Noah Lee James Thorne

KAIST AI
{jiwoo_hong, noah.lee, thorne}@kaist.ac.kr

arXiv:2403.07691v2 [cs.CL] 14 Mar 2024

Abstract

While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across diverse model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to 12.20% on AlpacaEval 2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 on MT-Bench (Figure 12). We release code[1] and model checkpoints for Mistral-ORPO-α (7B)[2] and Mistral-ORPO-β (7B).[3]

Figure 1: AlpacaEval 2.0 results of Llama-2 (7B) and Mistral (7B) fine-tuned with ORPO in comparison to state-of-the-art models. Notably, Mistral-ORPO-α & β surpass Zephyr β and Llama-2-Chat (13B) with a single epoch of training exclusively on UltraFeedback.

[1] https://github.com/xfactlab/orpo
[2] https://huggingface.co/kaist-ai/mistral-orpo-alpha
[3] https://huggingface.co/kaist-ai/mistral-orpo-beta

1 Introduction

Pre-trained language models (PLMs) trained on vast corpora such as web text (Gokaslan and Cohen, 2019; Penedo et al., 2023) or textbooks (Li et al., 2023c) have shown remarkable abilities on diverse natural language processing (NLP) tasks (Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023; Jiang et al., 2023; Almazrouei et al., 2023). However, the models must undergo further tuning to be usable in general-domain applications, typically through processes such as instruction tuning and preference alignment.

Instruction tuning (Wei et al., 2022; Taori et al., 2023; Wang et al., 2023; Zhou et al., 2023a) trains models to follow task descriptions given in natural language, which enables them to generalize well to previously unseen tasks. However, despite the ability to follow instructions, models may still generate harmful or unethical outputs (Carlini et al., 2021; Gehman et al., 2020; Pryzant et al., 2023). To further align these models with human values, additional training on pairwise preference data is required, using techniques such as reinforcement learning with human feedback (Ziegler et al., 2020; Stiennon et al., 2022, RLHF) and direct preference optimization (Rafailov et al., 2023, DPO).

Preference alignment methods have demonstrated success in several downstream tasks beyond reducing harm, for example improving factuality (Tian et al., 2023; Cheng et al., 2024; Chen and Li, 2024), code-based question answering (Gorbatovski and Kovalchuk, 2024), and machine translation (Ramos et al., 2023). The versatility of alignment algorithms across this wide range of downstream tasks highlights the necessity of understanding the alignment procedure and further improving the algorithms in terms of efficiency and performance.
Figure 2: Comparison of model alignment techniques. ORPO aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses through a simple log odds ratio term appended to the negative log-likelihood loss.

However, existing preference alignment methods normally consist of a multi-stage process, as shown in Figure 2, typically requiring a second reference model and a separate warm-up phase with supervised fine-tuning (SFT) (Ziegler et al., 2020; Rafailov et al., 2023; Wu et al., 2023).

In this paper, we study the role and impact of SFT on pairwise preference datasets for model alignment in Section 3 and propose a simple and novel monolithic alignment method, odds ratio preference optimization (ORPO), which efficiently penalizes the model for learning undesired generation styles during SFT, in Section 4. In contrast to previous works, our approach requires neither an SFT warm-up stage nor a reference model, enabling resource-efficient development of preference-aligned models.

We demonstrate the effectiveness of our method on model alignment tasks and popular leaderboards in Sections 6.1 and 6.2 by fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO. We then conduct controlled experiments comparing ORPO against established alignment methods, RLHF and DPO, across different datasets and model sizes in Section 6.3. Along with a post-hoc analysis of generation diversity in Section 6.4, we expound on the theoretical, empirical, and computational justification for utilizing the odds ratio in monolithic preference alignment in Section 7.3. We release the training code and the checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B). These models achieve 7.24 and 7.32 on MT-Bench, 11.33% and 12.20% on AlpacaEval 2.0, and 61.63% and 66.19% IFEval instruction-level loose accuracy, respectively.

2 Related Works

Alignment with Reinforcement Learning  Reinforcement learning with human feedback (RLHF) commonly applies the Bradley-Terry model (Bradley and Terry, 1952) to estimate the probability of a pairwise competition between two independently evaluated instances. An additional reward model is trained to score instances, and reinforcement learning algorithms such as proximal policy optimization (PPO) (Schulman et al., 2017) are employed to train the model to maximize the reward model's score for the chosen response, resulting in language models trained with human preferences (Ziegler et al., 2020; Stiennon et al., 2022). Notably, Ouyang et al. (2022) demonstrated the scalability and versatility of RLHF for instruction-following language models. Extensions such as reinforcement learning with AI feedback (RLAIF) could be a viable alternative to human feedback (Bai et al., 2022b; Lee et al., 2023; Pang et al., 2023). However, RLHF faces the challenge of extensive hyperparameter searching due to the instability of PPO (Rafailov et al., 2023; Wu et al., 2023) and the sensitivity of the reward models (Gao et al., 2022; Wang et al., 2024). Therefore, there is a crucial need for stable preference alignment algorithms.

Alignment without Reward Model  Several techniques for preference alignment mitigate the need for reinforcement learning (Rafailov et al., 2023; Song et al., 2023; Azar et al., 2023; Ethayarajh et al., 2024). Rafailov et al. (2023) introduce direct preference optimization (DPO), which combines the reward modeling stage with the preference learning stage.
Azar et al. (2023) prevented potential overfitting problems in DPO through identity preference optimization (IPO). Ethayarajh et al. (2024) and Cai et al. (2023) proposed Kahneman-Tversky Optimization (KTO) and Unified Language Model Alignment (ULMA), respectively, which do not require pairwise preference data, unlike RLHF and DPO. Song et al. (2023) further suggest incorporating the softmax values over a set of reference responses into the negative log-likelihood loss to merge supervised fine-tuning and preference alignment.

Alignment with Supervised Fine-tuning  Preference alignment methods in reinforcement learning (RL) often leverage supervised fine-tuning (SFT) to ensure a stable update of the active policy relative to the old policy (Schulman et al., 2017), because the SFT model is the old policy in the context of RLHF (Ziegler et al., 2020). Furthermore, empirical findings indicate that, even in non-RL alignment methods, the SFT model is crucial for achieving convergence to desired results (Rafailov et al., 2023; Tunstall et al., 2023).

In contrast, there have been approaches to building human-aligned language models by conducting SFT only, with filtered datasets (Zhou et al., 2023a; Li et al., 2023a; Haggerty and Chandra, 2024; Zhou et al., 2023b). Zhou et al. (2023a) demonstrated that SFT with a small amount of finely filtered and curated data can be sufficient for building helpful language model assistants. Furthermore, Li et al. (2023a) and Haggerty and Chandra (2024) proposed an iterative process of fine-tuning supervised fine-tuned language models on their own generations after fine-grained selection of aligned generations, and Zhou et al. (2023b) suggested that a curated subset of a preference dataset is sufficient for alignment. While these works highlight the impact and significance of SFT in the context of alignment, the actual role of SFT and the theoretical grounds for incorporating preference alignment into SFT remain understudied.

3 The Role of Supervised Fine-tuning

We study the behavior of supervised fine-tuning (SFT) as an initial stage of preference alignment methods (Ziegler et al., 2020; Rafailov et al., 2023) through an analysis of the SFT loss function and an empirical demonstration of the preference comprehension ability of the trained SFT model. SFT plays a significant role in tailoring pre-trained language models to the desired domain (Zhou et al., 2023a; Dong et al., 2024) by increasing the log probabilities of pertinent tokens. Nevertheless, this inadvertently also increases the likelihood of generating tokens in undesirable styles, as illustrated in Figure 3. Therefore, it is necessary to develop methods capable of preserving the domain adaptation role of SFT while concurrently discerning and mitigating unwanted generation styles.

Absence of Penalty in Cross-Entropy Loss  The goal of fine-tuning with the cross-entropy loss is to penalize the model if the predicted logits for the reference answers are low, as shown in Equation 2:

\mathcal{L} = -\frac{1}{m} \sum_{k=1}^{m} \log P(x^{(k)}, y^{(k)})    (1)

= -\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{|V|} y_i^{(k)} \cdot \log\bigl(p_i^{(k)}\bigr)    (2)

where y_i is a Boolean value indicating whether the i-th token in the vocabulary set V is a label token, p_i refers to the probability of the i-th token, and m is the length of the sequence. Using cross-entropy alone gives no direct penalty or compensation for the logits of non-answer tokens (Lin et al., 2017), as y_i is set to 0 for them. While cross-entropy is generally effective for domain adaptation (Mao et al., 2023), there is no mechanism to penalize rejected responses while compensating for the chosen responses. Therefore, the log probabilities of the tokens in the rejected responses increase along with those of the chosen responses, which is not desired from the viewpoint of preference alignment.

Generalization over Both Response Styles  We conduct a pilot study to empirically demonstrate the miscalibration of chosen and rejected responses with supervised fine-tuning alone. We fine-tune OPT-350M (Zhang et al., 2022) on the chosen responses only from the HH-RLHF dataset (Bai et al., 2022a). Throughout training, we monitor the log probability of the rejected responses for each batch and report this in Figure 3. The log probabilities of the chosen and rejected responses exhibit a simultaneous increase. This can be interpreted from two perspectives. First, the cross-entropy loss effectively guides the model toward the intended domain (e.g., dialogue). However, the absence of a penalty for unwanted generations results in rejected responses sometimes having even higher log probabilities than the chosen ones.
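To make the point about Equation 2 concrete, the following minimal sketch (our own illustration, not the paper's training code; the logits and token ids are random stand-ins for model outputs and data) computes token-level cross-entropy over a chosen response only and inspects the gradient that reaches the logits of a hypothetical rejected continuation.

```python
# Minimal sketch (not the paper's implementation): token-level cross-entropy
# over the chosen response only. The loss touches only the label-token
# log-probabilities, so it contains no term dedicated to rejected tokens.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 16, 5

# Hypothetical model outputs: logits over the vocabulary at each position.
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
chosen_ids = torch.randint(0, vocab_size, (seq_len,))    # label tokens
rejected_ids = torch.randint(0, vocab_size, (seq_len,))  # tokens of a rejected response

# Equation 2: only the positions where y_i = 1 (the label tokens) contribute.
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(seq_len), chosen_ids].mean()
nll.backward()

# Any gradient reaching the rejected tokens' logits comes only from the softmax
# normalization, not from a dedicated penalty term in the loss itself.
grad_on_rejected = logits.grad[torch.arange(seq_len), rejected_ids]
print("NLL over chosen tokens:", nll.item())
print("gradient on rejected-token logits:", grad_on_rejected)
```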
Figure 3: Log probabilities of chosen and rejected responses during OPT-350M fine-tuning on the HH-RLHF dataset. Despite only the chosen responses being used for supervision, the rejected responses show a comparable likelihood of generation.

Penalizing Undesired Generations  Appending an unlikelihood penalty to the loss has demonstrated success in reducing unwanted degenerative traits in models (Welleck et al., 2019; Li et al., 2020). For example, to prevent repetitions, an unwanted token set from the previous context, k ∈ C_recent, is disfavored by adding a penalty term based on (1 − p_i^(k)) to the loss (such as Equation 2), which penalizes the model for assigning high probabilities to recent tokens. Motivated by SFT ascribing high probabilities to rejected tokens (Figure 3) and by the effectiveness of appending a penalty for unwanted traits, we design a monolithic preference alignment method that dynamically penalizes the disfavored response for each query, without the need for crafting sets of rejected tokens.

4 Odds Ratio Preference Optimization

We introduce a novel preference alignment algorithm, odds ratio preference optimization (ORPO), which incorporates an odds ratio-based penalty into the conventional negative log-likelihood (NLL) loss to differentiate the generation styles of favored and disfavored responses.

4.1 Preliminaries

Given an input sequence x, the average log-likelihood of generating the output sequence y of length m tokens is computed as Equation 3, and the odds of generating the output sequence y given an input sequence x are defined in Equation 4:

\log P_\theta(y \mid x) = \frac{1}{m} \sum_{t=1}^{m} \log P_\theta(y_t \mid x, y_{<t})    (3)

\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}    (4)

Intuitively, odds_θ(y|x) = k implies that it is k times more likely for the model θ to generate the output sequence y than not to generate it. Thus, the odds ratio of the chosen response y_w over the rejected response y_l, OR_θ(y_w, y_l), indicates how much more likely it is for the model θ to generate y_w than y_l given input x, as defined in Equation 5:

\text{OR}_\theta(y_w, y_l) = \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}    (5)

4.2 Objective Function of ORPO

The objective function of ORPO in Equation 6 consists of two components: 1) the supervised fine-tuning (SFT) loss L_SFT and 2) the relative ratio loss L_OR:

\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)} \left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]    (6)

L_SFT follows the conventional causal language modeling negative log-likelihood (NLL) loss, maximizing the likelihood of generating the reference tokens as discussed in Section 3. L_OR in Equation 7 maximizes the odds ratio between the likelihood of generating the favored response y_w and that of the disfavored response y_l. We wrap the log odds ratio with the log sigmoid function so that L_OR can be minimized by increasing the log odds ratio between y_w and y_l:

\mathcal{L}_{OR} = -\log \sigma\left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right)    (7)

Together, L_SFT and L_OR, weighted with λ, tailor the pre-trained language model to adapt to the specific subset of the desired domain while disfavoring generations in the rejected response set.

4.3 Gradient of ORPO

The gradient of L_OR further justifies using the odds ratio loss. It comprises two terms: one that penalizes wrong predictions and one that contrasts chosen and rejected responses, denoted in Equation 8[4] for d = (x, y_l, y_w) ∼ D:

\nabla_\theta \mathcal{L}_{OR} = \delta(d) \cdot h(d)    (8)

\delta(d) = \left[ 1 + \frac{\text{odds}_\theta P(y_w \mid x)}{\text{odds}_\theta P(y_l \mid x)} \right]^{-1}    (9)

h(d) = \frac{\nabla_\theta \log P_\theta(y_w \mid x)}{1 - P_\theta(y_w \mid x)} - \frac{\nabla_\theta \log P_\theta(y_l \mid x)}{1 - P_\theta(y_l \mid x)}    (10)

[4] The full derivation of ∇_θ L_OR is in Appendix A.

When the odds of the favored responses are relatively higher than those of the disfavored responses, δ(d) in Equation 9 converges to 0. This indicates that δ(d) plays the role of a penalty term, accelerating the parameter updates when the model is more likely to generate the rejected responses. Meanwhile, h(d) in Equation 10 is a weighted contrast of the two gradients from the chosen and rejected responses. Specifically, the 1 − P_θ(y|x) terms in the denominators amplify the gradients when the corresponding likelihood P_θ(y|x) is low. For the chosen responses, this accelerates the model's adaptation toward the distribution of chosen responses as the likelihood increases.
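For concreteness, the sketch below is an illustrative re-implementation of Equations 3-7, not the authors' released code: it assumes the per-token log-probabilities of the chosen and rejected responses have already been gathered from the policy model, and `lam` stands for λ in Equation 6.

```python
# Sketch of the ORPO objective (Equations 3-7) from per-token log-probabilities.
import torch
import torch.nn.functional as F

def avg_log_prob(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Equation 3: length-normalized log-likelihood log P_theta(y|x)."""
    return (token_logps * mask).sum(-1) / mask.sum(-1)

def orpo_loss(chosen_logps, chosen_mask, rejected_logps, rejected_mask, lam=0.1):
    logp_w = avg_log_prob(chosen_logps, chosen_mask)
    logp_l = avg_log_prob(rejected_logps, rejected_mask)

    # log odds(y|x) = log P - log(1 - P), computed via log1p(-exp(log P)).
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))

    # Equation 7: L_OR = -log sigmoid(log odds ratio).
    l_or = -F.logsigmoid(log_odds_w - log_odds_l)

    # Equation 6: NLL on the chosen response plus the weighted ratio term.
    l_sft = -logp_w
    return (l_sft + lam * l_or).mean()

# Toy usage: random negative values stand in for per-token log-probabilities.
torch.manual_seed(0)
chosen, rejected = -torch.rand(2, 6), -torch.rand(2, 6)
mask = torch.ones(2, 6)
print(orpo_loss(chosen, mask, rejected, mask, lam=0.1))
```

Because the odds are built from the length-normalized likelihood of Equation 3, the ratio term stays on a comparable scale across responses of different lengths, which is one design choice that lets the single λ-weighted term be added directly to the SFT loss.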
5 Experimental Settings

5.1 Training Configurations

Models  We train a series of OPT models (Zhang et al., 2022) scaling from 125M to 1.3B parameters, comparing supervised fine-tuning (SFT), proximal policy optimization (PPO), and direct preference optimization (DPO) against our ORPO. The PPO and DPO models were fine-tuned with the TRL library (von Werra et al., 2020) on top of SFT models trained for a single epoch on the chosen responses, following Rafailov et al. (2023) and Tunstall et al. (2023). We denote this by prepending "+" to each algorithm (e.g., +DPO). Additionally, we train Phi-2 (2.7B) (Javaheripi and Bubeck, 2023), a pre-trained language model with promising downstream performance (Beeching et al., 2023), as well as Llama-2 (7B) (Touvron et al., 2023) and Mistral (7B) (Jiang et al., 2023). Further training details for each method are in Appendix C.

Datasets  We test each training configuration and model on two datasets: 1) Anthropic's HH-RLHF (Bai et al., 2022a) and 2) the binarized UltraFeedback (Tunstall et al., 2023). We filtered out instances where y_w = y_l, y_w = ∅, or y_l = ∅.

Reward Models  We train OPT-350M and OPT-1.3B on each dataset for a single epoch for reward modeling with the objective function in Equation 11 (Ziegler et al., 2020). The OPT-350M reward model was used for PPO, and the OPT-1.3B reward model was used to assess the generations of the fine-tuned models. We refer to these reward models as RM-350M and RM-1.3B in Section 6.

-\mathbb{E}_{(x, y_l, y_w)} \left[ \log \sigma \left( r(x, y_w) - r(x, y_l) \right) \right]    (11)
formance (Beeching et al., 2023), as well as Llama-
2 (7B) (Touvron et al., 2023) and Mistral (7B) 6.1 Single-turn Instruction Following
(Jiang et al., 2023). Further training details for
Phi-2 (2.7B) ORPO improved pre-trained Phi-2
each method are in Appendix C.
to exceed the performance of the Llama-2 Chat
Datasets We test each training configuration and instruction-following language model by only us-
model on two datasets: 1) Anthropic’s HH-RLHF ing UltraFeedback as the instruction-tuning dataset,
(Bai et al., 2022a), 2) Binarized UltraFeedback as shown in Table 1. λ of 0.25 was applied for Phi-
(Tunstall et al., 2023). We filtered out instances 2, resulting in 71.80% and 6.35% in AlpacaEval.
where yw = yl or where yw = ∅ or where yl = ∅. Llama-2 (7B) Notably, UltraFeedback and ORPO
with λ of 0.2 on Llama-2 (7B) resulted in higher
Reward Models We train OPT-350M and OPT-
AlpacaEval scores than the chat versions of both
1.3B on each dataset for a single epoch for reward
7B and 13B scale trained with RLHF, eventually
modeling with the objective function in Equation
showing 81.26% and 9.44% in both AlpacaEvals.
11 (Ziegler et al., 2020). The OPT-350M reward
In contrast, in our controlled experimental set-
model was used for PPO, and OPT-1.3B reward
ting of conducting one epoch of SFT and three
model was used to assess the generations of fine-
5
tuned models. We refer to these reward models as https://tatsu-lab.github.io/alpaca_eval/
6
RM-350M and RM-1.3B in Section 6. https://huggingface.co/spaces/lmsys/
chatbot-arena-leaderboard
7
https://platform.openai.com/docs/models/
−E(x,yl ,yw ) [log σ (r(x, yw ) − r(x, yl ))] (11) gpt-4-and-gpt-4-turbo
Model Name | Size | AlpacaEval 1.0 | AlpacaEval 2.0
Phi-2 + SFT | 2.7B | 48.37% (1.77) | 0.11% (0.06)
Phi-2 + SFT + DPO | 2.7B | 50.63% (1.77) | 0.78% (0.22)
Phi-2 + ORPO (Ours) | 2.7B | 71.80% (1.59) | 6.35% (0.74)
Llama-2 Chat * | 7B | 71.34% (1.59) | 4.96% (0.67)
Llama-2 Chat * | 13B | 81.09% (1.38) | 7.70% (0.83)
Llama-2 + ORPO (Ours) | 7B | 81.26% (1.37) | 9.44% (0.85)
Zephyr (α) * | 7B | 85.76% (1.23) | 8.35% (0.87)
Zephyr (β) * | 7B | 90.60% (1.03) | 10.99% (0.96)
Mistral-ORPO-α (Ours) | 7B | 87.92% (1.14) | 11.33% (0.97)
Mistral-ORPO-β (Ours) | 7B | 91.41% (1.15) | 12.20% (0.98)

Table 1: Instruction-following abilities of each checkpoint measured through AlpacaEval. Beyond clearly showing the improvement in instruction-following ability after training with ORPO, it is notable that the ORPO models exceed the RLHF or DPO models of Llama-2 and Mistral (* indicates results from the official leaderboard).

Mistral-ORPO-α (7B)  Furthermore, fine-tuning Mistral (7B) with the single-turn conversation dataset UltraFeedback and ORPO with a λ of 0.1 outperforms the Zephyr series, which are Mistral (7B) models fine-tuned with SFT on 20K UltraChat (Ding et al., 2023) and DPO on the full UltraFeedback. As shown in Table 1, Mistral-ORPO-α (7B) achieves 87.92% and 11.33%, exceeding Zephyr α by 1.98% and Zephyr β by 0.34% on AlpacaEval 2.0. Sample responses and the corresponding references from GPT-4 can be found in Appendix I.

Mistral-ORPO-β (7B)  Using the same configuration as Mistral-ORPO-α (7B), we additionally fine-tune Mistral on the cleaned version of UltraFeedback[8] to demonstrate the effect of data quality (Bartolome et al., 2023). While the actual sizes of the two datasets are similar, ORPO gains further advantages from the higher dataset quality, scoring over 91% and 12% on AlpacaEval, as shown in Table 1. Further instruction-following evaluation of the two Mistral-based models with IFEval (Zhou et al., 2023c) is reported in Appendix D.

[8] https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned

Figure 4: MT-Bench results of Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B) by category, alongside Llama-2-Chat (70B), GPT-3.5-turbo, Claude V1, and GPT-4. Further comparison can be found in Appendix G.

6.2 Multi-turn Instruction Following

With our best models, Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B), we also assess multi-turn instruction-following skills on prompts with deterministic answers (e.g., math) through MT-Bench. As shown in Figure 4, the Mistral-ORPO (7B) series achieves comparable results to larger or proprietary models, including Llama-2-Chat (70B) and Claude. Eventually, Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B) score 7.23 and 7.32 on MT-Bench without being exposed to multi-turn conversation data during training.

6.3 Reward Model Win Rate

We assess the win rate of ORPO over other preference alignment methods, including supervised fine-tuning (SFT), PPO, and DPO, using RM-1.3B to understand the effectiveness and scalability of ORPO in Tables 2 and 3. Additionally, we visually verify that ORPO can effectively enhance the expected reward compared to SFT in Figure 5.
Figure 5: Reward distribution comparison between OPT-125M (left), OPT-350M (middle), and OPT-1.3B (right) trained with SFT (blue), RLHF (green), DPO (orange), and ORPO (red) on the test set of UltraFeedback, scored by RM-1.3B. While the rewards of the trained models are roughly normal and the preference optimization algorithms (RLHF, DPO, and ORPO) tend to move the reward distribution in the positive direction, ORPO is on par with or better than RLHF and DPO in increasing the expected reward. The same plot for the HH-RLHF dataset is in Appendix F.

HH-RLHF  In Table 2, ORPO outperforms SFT and PPO across all model scales. The highest win rates against SFT and PPO across model sizes were 78.0% and 79.4%, respectively. Meanwhile, the win rate over DPO was correlated with model size, with the largest model having the highest win rate: 70.9%.

ORPO vs | SFT | +DPO | +PPO
OPT-125M | 84.0 (0.62) | 41.7 (0.77) | 66.1 (0.26)
OPT-350M | 82.7 (0.56) | 49.4 (0.54) | 79.4 (0.29)
OPT-1.3B | 78.0 (0.16) | 70.9 (0.52) | 65.9 (0.33)

Table 2: Average win rate (%) of ORPO over other methods on the HH-RLHF dataset, with standard deviations over three rounds. Sampling decoding with a temperature of 1.0 was used on the test set.

UltraFeedback  The win rates on UltraFeedback follow trends similar to those reported for HH-RLHF, as shown in Table 3. ORPO was preferred over SFT and PPO by up to 80.5% and 85.8%, respectively. While ORPO is consistently preferred over SFT and PPO, the win rate over DPO gradually increases with model size. The scale-wise trend of exceeding DPO is further shown through the 2.7B models in Section 6.1.

ORPO vs | SFT | +DPO | +PPO
OPT-125M | 73.2 (0.12) | 48.8 (0.29) | 71.4 (0.28)
OPT-350M | 80.5 (0.54) | 50.5 (0.17) | 85.8 (0.62)
OPT-1.3B | 69.4 (0.57) | 57.8 (0.73) | 65.7 (1.07)

Table 3: Average win rate (%) of ORPO over other methods on the UltraFeedback dataset, with standard deviations over three rounds. Sampling decoding with a temperature of 1.0 was used.

Overall Reward Distribution  In addition to the win rate, we compare the reward distributions of the generated responses on the test sets of the UltraFeedback dataset in Figure 5 and the HH-RLHF dataset in Appendix F. Taking the SFT reward distribution as a baseline, PPO, DPO, and ORPO all shift it on both datasets; however, the magnitude of the reward shift differs across algorithms.

In Figure 5, RLHF (i.e., SFT + PPO) exhibits an abnormal distribution with a low expected reward. We attribute this to empirical evidence of the instability and reward mismatch problem of RLHF (Rafailov et al., 2023; Gao et al., 2022; Shen et al., 2023), as the RLHF models were trained with RM-350M and assessed with RM-1.3B. Meanwhile, it is notable that the ORPO distribution (red) sits mainly on the far right side of each subplot, indicating higher expected rewards. Recalling the intent of preference alignment methods, the distributions in Figure 5 indicate that ORPO tends to fulfill the aim of preference alignment for all model sizes.

6.4 Lexical Diversity

The lexical diversity of preference-aligned language models was studied in previous work (Kirk et al., 2024). We expand the concept of per-input and across-input diversity introduced in Kirk et al. (2024) by using Gemini-Pro (Gemini Team et al., 2023) as the embedding model, which is suitable for assessing the diversity of instruction-following language models as it encodes up to 2048 tokens. The diversity metric for a given set of sampled responses is defined in Equation 13.
O_\theta^i := \{\, y_j \sim \theta(y \mid x_i) \mid j = 1, 2, \ldots, K \,\}    (12)

D(O_\theta^i) = \frac{1}{2} \cdot \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \cos(h_i, h_j)}{N \cdot (N-1)}    (13)

where cos(h_i, h_j) refers to the cosine similarity between the embeddings h_i and h_j. Five different responses are sampled with a temperature of 1.0 for each of the 160 queries in AlpacaEval (i.e., K = 5, N = 160) using Phi-2 and Llama-2 trained with ORPO and DPO. We report the results in Table 4.

Per-Input Diversity (PID)  We average the input-wise average cosine similarity between the generated samples with Equation 14 to assess per-input diversity. In Table 4, the ORPO models have the highest average cosine similarity in the first column for both model families, which implies the lowest diversity per input. This indicates that ORPO generally assigns high probabilities to the desired tokens, while DPO yields a relatively smoother logit distribution.

\text{PID}_D(\theta) = \frac{1}{N} \sum_{i=1}^{N} D(O_\theta^i)    (14)

Across-Input Diversity (AID)  Using 8 samples generated per input, we take the first sample for each input and examine their pairwise cosine similarity with Equation 15 for across-input diversity. Unlike per-input diversity, it is noteworthy that Phi-2 (ORPO) has a lower average cosine similarity in the second column of Table 4. We infer that ORPO triggers the model to generate more instruction-specific responses than DPO.

\text{AID}_D(\theta) = D\left( \bigcup_{i=1}^{N} O_{\theta, j=1}^{i} \right)    (15)

Model | Per Input ↓ | Across Input ↓
Phi-2 + SFT + DPO | 0.8012 | 0.6019
Phi-2 + ORPO | 0.8909 | 0.5173
Llama-2 + SFT + DPO | 0.8889 | 0.5658
Llama-2 + ORPO | 0.9008 | 0.5091

Table 4: Lexical diversity of Phi-2 and Llama-2 fine-tuned with DPO and ORPO. Lower cosine similarity corresponds to higher diversity. The highest value in each column within the same model family is bolded.
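The sketch below shows how the per-input and across-input scores can be computed from response embeddings (up to the constant factor in Equation 13). It is an illustration rather than the paper's evaluation script: random vectors stand in for the Gemini-Pro embeddings, and K = 5, N = 160 follow the setup described above.

```python
# Sketch of the diversity metrics in Equations 12-15 via mean pairwise cosine
# similarity over response embeddings (random stand-ins for Gemini-Pro).
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(h: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity over all pairs of rows in h (cf. Equation 13)."""
    h = F.normalize(h, dim=-1)
    sim = h @ h.T
    n = h.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()   # sum over ordered pairs i != j
    return off_diag / (n * (n - 1))

torch.manual_seed(0)
K, N, dim = 5, 160, 32             # K samples per input, N inputs
embeds = torch.randn(N, K, dim)    # embeds[i, j]: embedding of sample j for input i

# Per-input diversity (Equation 14): average the per-input score over inputs.
pid = torch.stack([mean_pairwise_cosine(embeds[i]) for i in range(N)]).mean()

# Across-input diversity (Equation 15): first sample of each input.
aid = mean_pairwise_cosine(embeds[:, 0, :])

print(f"per-input avg. cosine similarity:    {pid.item():.4f}")
print(f"across-input avg. cosine similarity: {aid.item():.4f}")
```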
7 Discussion

In this section, we expound on the theoretical and computational details of ORPO. The theoretical analysis of ORPO is studied in Section 7.1 and supported with an empirical analysis in Section 7.2. We then compare the computational load of DPO and ORPO in Section 7.3.

7.1 Comparison to Probability Ratio

The rationale for selecting the odds ratio instead of the probability ratio lies in its stability. The probability ratio for generating the favored response y_w over the disfavored response y_l given an input sequence x can be defined as Equation 16:

\text{PR}_\theta(y_w, y_l) = \frac{P_\theta(y_w \mid x)}{P_\theta(y_l \mid x)}    (16)

While this formulation has been used in previous preference alignment methods that are preceded by SFT (Rafailov et al., 2023; Azar et al., 2023), the odds ratio is a better choice in the setting where preference alignment is incorporated into SFT, as the odds ratio is more sensitive to the model's preference understanding. In other words, the probability ratio leads to more extreme discrimination of the disfavored responses than the odds ratio.

We visualize this through the sampled distributions of the log probability ratio log PR(X2|X1) and the log odds ratio log OR(X2|X1). We draw 50,000 samples each with Equation 17 and plot the log probability ratio and log odds ratio in Figure 6. We multiply the probability ratio by β, as is practiced in probability ratio-based methods, and report the cases where β = 0.2 and β = 1.0:

X_1, X_2 \sim \text{Unif}(0, 1)    (17)

Y \sim \beta \left( \log X_1 - \log X_2 \right)    (18)

Y \sim \log \frac{X_1}{1 - X_1} - \log \frac{X_2}{1 - X_2}    (19)

Figure 6: Sampled distributions of log PR(X2|X1) (for β = 0.2 and β = 1.0) and log OR(X2|X1). log OR(X2|X1) has a wider range given the same input probability pairs (X1, X2).

Recalling that the log sigmoid function is applied to the log probability ratio and the log odds ratio, each ratio's scale determines the expected margin between the likelihoods of the favored and disfavored styles when the loss is minimized. In that sense, the contrast must be relatively extreme to minimize the log sigmoid loss when PR(X2|X1), rather than OR(X2|X1), is fed into the log sigmoid function, given the sharp distribution of log PR(X2|X1) in Figure 6. This results in overly suppressing the logits for the tokens in the disfavored responses in the setting where SFT and preference alignment are incorporated, as the model is not yet adapted to the domain. We empirically support this analysis through the ablation study in Appendix B. Therefore, the odds ratio is a better choice when preference alignment is done together with SFT, due to its milder discrimination of disfavored responses and its prioritization of generating the favored responses.

Throughout fine-tuning, minimizing the log sigmoid loss drives either PR(X2|X1) or OR(X2|X1) to become larger, which amounts to the token-wise likelihood of the rejected responses generally getting smaller. In this context, it is essential to avoid an overly extreme contrast. This precaution is especially important given the sharp distribution of log PR(X2|X1) depicted in Figure 6: an excessive margin could lead to the unwarranted suppression of logits for tokens in the disfavored responses within the incorporated setting, potentially resulting in degeneration.
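The contrast is easy to reproduce. The sketch below samples uniform probability pairs as in Equation 17 and compares the spread of the two ratios; it is a quick stand-in for Figure 6, not the plotting code used in the paper.

```python
# Sketch reproducing the comparison behind Figure 6: sample 50,000 probability
# pairs (Equation 17) and compare the spread of the beta-scaled log probability
# ratio (Equation 18) with that of the log odds ratio (Equation 19).
import torch

torch.manual_seed(0)
x1, x2 = torch.rand(50_000), torch.rand(50_000)

def summarize(name: str, y: torch.Tensor) -> None:
    print(f"{name}: std {y.std().item():.2f}, "
          f"range [{y.min().item():.1f}, {y.max().item():.1f}]")

def log_odds(p: torch.Tensor) -> torch.Tensor:
    return torch.log(p) - torch.log1p(-p)

for beta in (0.2, 1.0):
    summarize(f"log PR (beta={beta})", beta * (torch.log(x1) - torch.log(x2)))

summarize("log OR", log_odds(x1) - log_odds(x2))
# The log odds ratio spans a much wider range for the same probability pairs,
# so the log-sigmoid term can be reduced without driving the disfavored
# probabilities to extremes.
```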
7.2 Minimizing L_OR

We demonstrate that models trained with ORPO learn to reflect the preference throughout the training process. We monitor the log probabilities of the chosen and rejected responses and the log odds ratio with λ = 1.0. With the same dataset and model as in Figure 3, Figure 7 shows that the log probability of the rejected responses diminishes while that of the chosen responses stays on par with Figure 3 as the log odds ratio increases. This indicates that ORPO successfully preserves the domain adaptation role of SFT while the penalty term L_OR induces the model to lower the likelihood of unwanted generations. We discuss the effect of λ in Equation 6 in Appendix E, studying the behavior of the log probability margin between the favored and disfavored responses with respect to λ.

Figure 7: Average log-likelihood of chosen and rejected responses and the log odds ratio per batch. The odds ratio consistently increases during training with ORPO.

7.3 Computational Efficiency

As depicted in Figure 2, ORPO does not require a reference model, unlike RLHF and DPO. In that sense, ORPO is computationally more efficient than RLHF and DPO in two respects: 1) lower memory allocation and 2) fewer FLOPs per batch.

The reference model (π_SFT) in the context of RLHF and DPO denotes the model trained with supervised fine-tuning (SFT), which is the baseline model whose parameters are further updated with RLHF or DPO (Ziegler et al., 2020; Rafailov et al., 2023). Thus, two copies of π_SFT, a frozen reference model and the model undergoing tuning, are required during training. Furthermore, in principle, two forward passes must be computed for each model to acquire the logits for the chosen and rejected responses; in other words, four forward passes happen in total for a single batch. On the other hand, a reference model is not required in ORPO, as π_SFT is updated directly. This halves the number of forward passes required for each batch during training.

8 Conclusion

In this paper, we introduced a reference model-free monolithic preference alignment method, odds ratio preference optimization (ORPO), by revisiting and understanding the value of the supervised fine-tuning (SFT) phase in the context of preference alignment. ORPO was consistently preferred by the fine-tuned reward model over SFT and RLHF across model scales, and the win rate against DPO increased as the model size grew. Furthermore, we validated the scalability of ORPO with 2.7B and 7B pre-trained language models by exceeding larger state-of-the-art instruction-following language models on AlpacaEval. Specifically, Mistral-ORPO-α and Mistral-ORPO-β achieved 11.33% and 12.20% on AlpacaEval 2.0 and 7.23 and 7.32 on MT-Bench, thereby underscoring the efficiency and effectiveness of ORPO. We release fine-tuning code and model checkpoints for Mistral-ORPO-α and Mistral-ORPO-β to aid reproducibility.
Limitations

While we conducted a comprehensive analysis across diverse preference alignment methods, including DPO and RLHF, we did not incorporate a broader range of preference alignment algorithms. We leave a wider comparison against other methods as future work, along with scaling our method beyond 7B models. In addition, we will expand the fine-tuning datasets to diverse domains and qualities, thereby verifying the generalizability of our method on various NLP downstream tasks. Finally, we would like to study the internal impact of our method on the pre-trained language model, expanding the understanding of the preference alignment procedure not only to the supervised fine-tuning stage but also to consecutive preference alignment algorithms.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of open language models.
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. 2023. A general theoretical paradigm to understand learning from human preferences.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022b. Constitutional ai: Harmlessness from ai feedback.
Alvaro Bartolome, Gabriel Martin, and Daniel Vila. 2023. Notus. https://github.com/argilla-io/notus.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, and Guannan Zhang. 2023. Ulma: Unified language model alignment with demonstration and pointwise human preference. ArXiv, abs/2312.02554.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models.
Weixin Chen and Bo Li. 2024. Grath: Gradual self-truthifying for large language models.
Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Kai Chen, and Xipeng Qiu. 2024. Can ai assistants know what they don't know?
Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations.
Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2024. How abilities in large language models are affected by supervised fine-tuning data composition.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
Leo Gao, John Schulman, and Jacob Hilton. 2022. Scaling laws for reward model overoptimization.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus.
Alexey Gorbatovski and Sergey Kovalchuk. 2024. Reinforcement learning for question answering in programming domain using public community scoring as a human feedback.
Hamish Haggerty and Rohitash Chandra. 2024. Self-supervised learning for skin cancer diagnosis with limited training data.
Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-2: The surprising power of small language models.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2024. Understanding the effects of rlhf on llm generalisation and diversity.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.
Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don't say that! making inconsistent dialogue unlikely with unlikelihood training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4715–4728, Online. Association for Computational Linguistics.
Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023a. Self-alignment with instruction backtranslation.
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023b. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023c. Textbooks are all you need ii: phi-1.5 technical report.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.
Anqi Mao, Mehryar Mohri, and Yutao Zhong. 2023. Cross-entropy loss functions: Theoretical analysis and applications.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. 2023. Language model self-improvement by reinforcement learning contemplation.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only.
Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore. Association for Computational Linguistics.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.
Miguel Moura Ramos, Patrick Fernandes, António Farinhas, and André F. T. Martins. 2023. Aligning neural machine translation models: Human feedback in training and inference.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, page 3505–3506, New York, NY, USA. Association for Computing Machinery.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms.
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, Singapore. Association for Computational Linguistics.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2022. Learning to summarize from human feedback.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. 2023. Fine-tuning language models for factuality.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of lm alignment.
Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl.
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. 2024. Secrets of rlhf in large language models part ii: Reward modeling.
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. How far can camels go? exploring the state of instruction tuning on open resources.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, and Jiantao Jiao. 2023. Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. Pytorch fsdp: Experiences on scaling fully sharded data parallel.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. ArXiv:2306.05685 [cs].
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023a. Lima: Less is more for alignment.
Haotian Zhou, Tingkai Liu, Qianli Ma, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. 2023b. Lobass: Gauging learnability in supervised fine-tuning data. ArXiv, abs/2310.13008.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023c. Instruction-following evaluation for large language models.
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences.
A Derivation of ∇_θ L_OR with the Odds Ratio

Suppose that g(x, y_l, y_w) = \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)}. Then:

\nabla_\theta \mathcal{L}_{OR} = \nabla_\theta \log \sigma\left( \log \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)} \right)    (20)

= \frac{\sigma'(\log g(x, y_l, y_w))}{\sigma(\log g(x, y_l, y_w))} \cdot \nabla_\theta \log g(x, y_l, y_w)    (21)

= \sigma(-\log g(x, y_l, y_w)) \cdot \nabla_\theta \log g(x, y_l, y_w)    (22)

= \frac{\sigma(-\log g(x, y_l, y_w))}{g(x, y_l, y_w)} \cdot \nabla_\theta g(x, y_l, y_w)    (23)

= \sigma(-\log g(x, y_l, y_w)) \cdot \nabla_\theta \log g(x, y_l, y_w)    (24)

= \left( 1 + \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)} \right)^{-1} \cdot \nabla_\theta \log \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)}    (25)

In Equation 25, the remaining derivative can be further simplified by rewriting the 1 - P_\theta(y|x) terms, where P_\theta(y|x) = \sqrt[N]{\prod_{t}^{N} P_\theta(y_t|x, y_{<t})}, in \text{odds}_\theta(y|x) as follows:

\nabla_\theta \log(1 - P_\theta(y|x)) = \frac{\nabla_\theta (1 - P_\theta(y|x))}{1 - P_\theta(y|x)}    (26)

= \frac{-\nabla_\theta P_\theta(y|x)}{1 - P_\theta(y|x)}    (27)

= -\frac{P_\theta(y|x)}{1 - P_\theta(y|x)} \cdot \nabla_\theta \log P_\theta(y|x)    (28)

= -\text{odds}_\theta(y|x) \cdot \nabla_\theta \log P_\theta(y|x)    (29)

\nabla_\theta \log \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)} = \nabla_\theta \log \frac{P_\theta(y_w|x)}{P_\theta(y_l|x)} - \Bigl( \nabla_\theta \log(1 - P_\theta(y_w|x)) - \nabla_\theta \log(1 - P_\theta(y_l|x)) \Bigr)    (30)

= \left(1 + \text{odds}_\theta P(y_w|x)\right) \nabla_\theta \log P_\theta(y_w|x) - \left(1 + \text{odds}_\theta P(y_l|x)\right) \nabla_\theta \log P_\theta(y_l|x)    (31)

Therefore, the final form of ∇_θ L_OR is

\nabla_\theta \mathcal{L}_{OR} = \frac{1 + \text{odds}_\theta P(y_w|x)}{1 + \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)}} \cdot \nabla_\theta \log P_\theta(y_w|x) - \frac{1 + \text{odds}_\theta P(y_l|x)}{1 + \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)}} \cdot \nabla_\theta \log P_\theta(y_l|x)    (32)

= \left( 1 + \frac{\text{odds}_\theta P(y_w|x)}{\text{odds}_\theta P(y_l|x)} \right)^{-1} \cdot \left( \frac{\nabla_\theta \log P_\theta(y_w|x)}{1 - P_\theta(y_w|x)} - \frac{\nabla_\theta \log P_\theta(y_l|x)}{1 - P_\theta(y_l|x)} \right)    (33)
B Ablation on Probability Ratio and Odds Ratio

In this section, we continue the discussion in Section 7.1 with empirical results comparing the log probabilities of chosen and rejected responses on UltraFeedback when training with the probability ratio versus the odds ratio. Recalling the sensitivity of each ratio discussed in Section 7.1, the probability ratio is expected to lower the log probabilities of the rejected responses on a larger scale than the odds ratio. This is clearly shown in Figure 8: the per-batch log probabilities of rejected responses rapidly drop below -4 when fine-tuning with the probability ratio (left), whereas the same phenomenon occurs only after overfitting sets in when using the odds ratio (right).

Figure 8: The log probability trace when the model is trained with the probability ratio (left) and the odds ratio (right) given the same hyperparameters. The probability ratio drives the rejected responses to relatively lower log probabilities.

C Experimental Details
Flash-Attention 2 (Dao, 2023) is applied to all the pre-trained models for computational efficiency. In particular, the OPT series and Phi-2 (2.7B) were trained with DeepSpeed ZeRO 2 (Rasley et al., 2020), while Llama-2 (7B) and Mistral (7B) were trained with Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023). The 7B and 2.7B models were trained on four and two NVIDIA A100 GPUs, respectively, and the rest were trained on four NVIDIA A6000 GPUs. For optimization, the AdamW optimizer (Loshchilov and Hutter, 2019) and paged AdamW (Dettmers et al., 2023) were used, with linear warmup followed by cosine decay applied to the learning rate. For input length, every instance was truncated and padded to 1,024 tokens for HH-RLHF and 2,048 tokens for UltraFeedback. To guarantee that the models can sufficiently learn to generate a proper response to the conversation history or a complex instruction, we filtered out instances whose prompts exceed 1,024 tokens.

Supervised Fine-tuning (SFT)  For SFT, the maximum learning rate was set to 1e-5. Following Ziegler et al. (2020) and Rafailov et al. (2023), the number of training epochs was set to 1.

Reinforcement Learning with Human Feedback (RLHF)  For RLHF, the hyperparameters were set as shown in Table 5 for UltraFeedback. For the HH-RLHF dataset, output_min_length and output_max_length were set to 64 and 256.

Direct Preference Optimization (DPO) For DPO, β was set to 0.1 in every case. The learning rate was set to 5e-6, and the model was trained for three epochs to select the best model by the evaluation loss at each epoch. However, in most cases, the first or the second checkpoint was selected as the best model, as the evaluation loss increased from the third epoch.
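For reference, a minimal sketch of the DPO objective with β = 0.1 (a toy restatement on placeholder log-probabilities, not the exact training script used for these experiments):

```python
import torch.nn.functional as F

# DPO loss sketch: *_logps are log-probabilities of the chosen/rejected responses
# under the trained policy and under the frozen SFT reference model.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```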
Hyperparameter Setting
ppo_epoch 4
init_kl_coef 0.1
horizon 2,000
batch_size 64
mini_batch_size 8
gradient_accumulation_steps 1
output_min_length 128
output_max_length 512
optimizer AdamW
learning_rate 1e-05
gamma 0.99

Table 5: Hyperparameter settings for RLHF.
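For illustration, Table 5 roughly corresponds to the trl-style configuration below. This assumes the classic trl PPOTrainer API (PPOConfig and LengthSampler); field names can differ across trl versions, so treat it as a sketch rather than the exact script.

```python
from trl import PPOConfig
from trl.core import LengthSampler

# Sketch of Table 5 as a PPO configuration (field names follow older trl releases).
ppo_config = PPOConfig(
    ppo_epochs=4,
    init_kl_coef=0.1,
    horizon=2000,
    batch_size=64,
    mini_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    gamma=0.99,
)

# Sampled response lengths: 128-512 for UltraFeedback (64-256 for HH-RLHF).
length_sampler = LengthSampler(128, 512)
```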

Odds Ratio Preference Optimization (ORPO) As ORPO does not require any special hyperparameters, the learning rate and the number of epochs were the only hyperparameters to set. For ORPO, the maximum learning rate was set to 8e-6 and the models were trained for 10 epochs. The best model was selected based on the lowest evaluation loss for the OPT series, Phi-2 (2.7B), and Llama-2 (7B).

D IFEval Result for Mistral-ORPO-α and Mistral-ORPO-β


Along with the AlpacaEval results reported in Section 6.1, we report the results of Mistral-ORPO-α and Mistral-ORPO-β on IFEval (Zhou et al., 2023c), calculated with the code from Gao et al. (2023). Additional information can be found at: https://jiwooya1000.github.io/posts/orpo/#ifeval.

Model Prompt-Strict Prompt-Loose Inst-Strict Inst-Loose


Mistral-ORPO-α 0.5009 0.5083 0.5995 0.6163
Mistral-ORPO-β 0.5287 0.5564 0.6355 0.6619

Table 6: IFEval scores of Mistral-ORPO-α and Mistral-ORPO-β.


E Ablation on the Weighting Value (λ)
For the weighting value λ in Equation 6, we conduct an ablation study over λ ∈ {0.1, 0.5, 1.0}. Mistral (7B) and UltraFeedback were used as the base model and the dataset. In Section E.1, we compare the log probability trends across values of λ, and we assess the downstream effect of λ in Section E.2.
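To make the role of λ concrete, the sketch below restates the λ-weighted objective of Equation 6 on toy tensors; the length-normalized log-probabilities and the use of the chosen-response NLL as the SFT term are assumptions for illustration, and this is not the released training code.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    # log odds(y|x) = log P - log(1 - P), computed from log-probabilities
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    loss_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)   # L_OR term
    return (chosen_nll + lam * loss_or).mean()                     # L_SFT + lambda * L_OR

chosen_logps = torch.tensor([-1.2, -0.9])      # toy length-normalized log P(y_w|x)
rejected_logps = torch.tensor([-1.5, -1.4])    # toy length-normalized log P(y_l|x)
chosen_nll = -chosen_logps                     # SFT term: NLL of the chosen response
print(orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1))
```

Increasing lam shifts the balance from the SFT term toward the relative-ratio term, which is exactly what the ablation below varies.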

E.1 Log Probability

[Figure 9 consists of three panels (λ = 0.1, λ = 0.5, λ = 1.0) plotting the average log probability of the chosen and rejected responses against the training step.]

Figure 9: The log probability trend by λ. With larger λ (e.g., λ = 1.0), L_OR becomes more influential in fine-tuning the models with ORPO.

In Figure 9, we find that a larger λ generally leads to stronger discrimination of the rejected responses. With λ = 0.1, the average log probabilities of the chosen and the rejected responses stay close to each other as fine-tuning proceeds. Also, unlike the other settings, the log probabilities of the rejected responses do not decrease; rather, the log probabilities of the chosen responses increase to minimize the L_OR term. With λ = 0.5, there is a similar trend of increasing the log probabilities of the chosen responses, but the log probabilities of the rejected responses diminish at the same time. Lastly, with λ = 1.0, the log probabilities of the chosen responses decrease along with those of the rejected responses while the margin between them widens. However, this does not mean that a smaller λ is always better; the best choice depends on the specific needs and the model.

E.2 MT-Bench

[Figure 10 is a radar chart of MT-Bench scores (0-10 scale) over the eight categories (Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities) for GPT-4, GPT-3.5-turbo, Mistral (ORPO) with λ = 1.0, and Mistral (ORPO) with λ = 0.1.]
Figure 10: MT-Bench result comparison by differing λ = 0.1 and λ = 1.0.


The downstream impact of λ stands out in the MT-Bench results. In comparison to λ = 0.1, Mistral+ORPO (7B) with λ = 1.0 performs worse in extraction, math, and reasoning, the categories that generally require deterministic answers. On the other hand, it performs better in STEM, humanities, and roleplay, which ask for open-ended generations without a single correct answer. Together with the discrepancy in the log probability trends of the chosen and rejected responses, we can infer that enforcing a larger margin between the chosen and the rejected responses through a higher λ in ORPO leads to over-adaptation to the chosen response set in the training dataset. This proclivity results in open-ended generations that are generally preferred by the annotator, while showing weaker performance on questions with hard answers.
F Test Set Reward Distribution on HH-RLHF

Along with Figure 5, which depicts the reward distribution of OPT-125M, OPT-350M, and OPT-1.3B on the UltraFeedback dataset, we report in Figure 11 the reward distribution of each pre-trained checkpoint trained on the HH-RLHF dataset. As discussed in Section 6.3, ORPO consistently pushes the reward distribution of SFT to the right.

Figure 11: Reward distribution comparison between OPT-125M (left), OPT-350M (middle), and OPT-1.3B (right)
trained with SFT (blue), RLHF (green), DPO (orange), and ORPO (red) on the test set of HH-RLHF using the 1.3B
reward model. General tendency follows that of Figure 5.
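For context, the sketch below shows one way such a reward distribution can be computed by scoring each test prompt and response pair with a sequence-classification reward model; the checkpoint name is hypothetical, and the exact input format depends on how the reward model was trained.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "my-1.3b-reward-model" is a placeholder name, not a released checkpoint.
tokenizer = AutoTokenizer.from_pretrained("my-1.3b-reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "my-1.3b-reward-model", num_labels=1
)

def score(prompt: str, response: str) -> float:
    # Single scalar reward for one prompt/response pair.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()
```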

G MT-Bench Result of Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B)

For the MT-Bench results in Section 6.2, we report the category-wise scores of Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B) in Figure 12. While surpassing Llama-2-Chat (13B) and Llama-2-Chat (70B) in most cases, Mistral-ORPO-β (7B) is comparable to GPT-3.5-turbo in the categories that require descriptive generations. However, it lags behind in coding and math, which we speculate is due to the limited amount of training data, as we used only the 61k instances in UltraFeedback.

[Figure 12 is a radar chart of MT-Bench scores (0-10 scale) over the eight categories (Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities) for Llama-2-Chat (70B), GPT-3.5-turbo, Claude V1, GPT-4, Mistral-ORPO-α (7B) (overall 7.23), and Mistral-ORPO-β (7B) (overall 7.32).]

Figure 12: MT-Bench result of Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B) by the category.
H Special Instructions for Verbosity Assessment
For the succinctness and verboseness instructions, we generated five different instructions for each style with ChatGPT.9 From the instructions in Table 7, we randomly sampled one prompt for every batch to prevent potential word bias.

# Succinctness Verboseness
1 Please generate a short and concise response. Please generate an elaborative and chatty response.
2 Provide a brief and concise answer. Provide a detailed answer.
3 Keep your reply short and to the point. Keep your reply elaborative and intricate.
4 Keep your answer brief for clarity. Keep your answer detailed.
5 Generate a brief and to-the-point answer. Generate a chatty and step-wise answer.

Table 7: Instructions prepended to the queries from AlpacaEval. Each instruction set asks the model to generate
either shorter or longer responses given the query, respectively.

9 https://chat.openai.com/
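As a minimal sketch of this procedure (added for illustration), one succinctness prompt from Table 7 is drawn at random for each batch and prepended to every query; the verboseness prompts are handled the same way.

```python
import random

SUCCINCT_INSTRUCTIONS = [
    "Please generate a short and concise response.",
    "Provide a brief and concise answer.",
    "Keep your reply short and to the point.",
    "Keep your answer brief for clarity.",
    "Generate a brief and to-the-point answer.",
]

def prepend_instruction(batch_queries, instructions=SUCCINCT_INSTRUCTIONS):
    instruction = random.choice(instructions)            # one prompt per batch
    return [f"{instruction} {query}" for query in batch_queries]
```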
I Generation Samples from Mistral-ORPO-α (7B)
In this section, we present two sample generations from Mistral-ORPO-α (7B) on AlpacaEval: one that was preferred over the response of GPT-4 and one for which the response of GPT-4 was preferred. For readability, instances with short responses were selected for both cases. Additionally, the responses from Mistral-ORPO-α (7B) were sampled with a temperature of 0.7. The checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B) can be found at https://huggingface.co/kaist-ai/mistral-orpo-alpha and https://huggingface.co/kaist-ai/mistral-orpo-beta.

Query 1 (GPT-4 Win): Give a brief description of the book that you are given. Matilda

1. GPT-4:
"Matilda" is a children’s novel written by the British author Roald Dahl. It was first published in 1988.
The story revolves around a very intelligent and precocious young girl named Matilda Wormwood.
Despite being neglected and mistreated by her ignorant and self-absorbed parents, Matilda finds
solace in reading books and quickly develops a love for learning.
At the age of five, she begins attending school and meets Miss Honey, a kindhearted teacher
who recognizes Matilda’s extraordinary intellectual abilities. However, the school’s tyrannical
headmistress, Miss Trunchbull, is a cruel and oppressive figure who takes pleasure in punishing the
children with her unreasonable and harsh methods.
Matilda discovers that she possesses a unique telekinetic ability, which she learns to control and
use to her advantage. With her wit, intelligence, and special powers, Matilda sets out to defend her
friends from Miss Trunchbull’s abuses and to help Miss Honey overcome her own difficult past.
The book is a tale of good triumphing over evil, and it celebrates the power of knowledge, kindness,
and standing up for what is right. It has been adapted into a successful film, stage musical, and has
remained a beloved classic in children’s literature.

2. Mistral-ORPO-β:
"Matilda" is a children’s book written by Roald Dahl, published in 1988. The story revolves around
the life of Matilda Wormwood, a brilliant and gifted five-year-old girl who has telekinetic powers.
The story explores themes of education, family, and the importance of reading and intellectual
curiosity. The book has been praised for its wit, humor, and powerful message, making it a classic
and beloved children’s book.
Query 2 (Mistral-ORPO-β Win): What do you think about ChatGPT?

1. GPT-4:
As an AI developed by OpenAI, I don’t have personal opinions. However, I can share that ChatGPT
is an advanced language model that has received considerable attention for its ability to generate
human-like text. It’s designed to assist users in a wide range of tasks, from answering questions to
creating content and engaging in conversation.

ChatGPT has been praised for its fluency and versatility in handling various topics and for its
potential to streamline and enhance communication in both personal and professional settings.
However, like any technology, it also faces challenges and criticisms, such as ensuring the accuracy
of information, preventing misuse, and managing ethical considerations surrounding AI-generated
content.

Overall, ChatGPT represents a significant achievement in natural language processing and has
sparked discussions about the future of AI in our daily lives.

2. Mistral-ORPO-β:
As an AI language model, I am an example of the technology used in ChatGPT. I can provide a
personal opinion, but please keep in mind that I am designed to be socially unbiased and positive in
nature.

ChatGPT is an impressive advancement in artificial intelligence and natural language processing.


It has the ability to understand and generate human-like text, making it possible for users to
have engaging and informative conversations with the AI. This technology has the potential
to revolutionize various industries, including customer support, education, content creation, and more.

However, it’s important to recognize that ChatGPT, like any AI technology, is not perfect. It may
sometimes provide inaccurate or biased information due to the limitations of its training data or
the complexity of human language. It’s crucial to verify the facts and use critical thinking when
interacting with AI systems like ChatGPT.

Overall, ChatGPT represents a significant step forward in the field of AI, and its potential applications
are vast. As a responsible user, it’s important to approach this technology with a critical and respectful
mindset, recognizing its strengths and limitations.
