
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback


Qiwei Di∗ and Jiafan He† and Quanquan Gu‡

∗ Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
† Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
‡ Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
arXiv:2404.10776v1 [cs.LG] 16 Apr 2024

Abstract
Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLM). However, the effectiveness of this approach can be influenced
by adversaries, who may intentionally provide misleading preferences to manipulate the output
in an undesirable or harmful direction. To tackle this challenge, we study a specific model
within this problem domain: contextual dueling bandits with adversarial feedback, where the
true preference label can be flipped by an adversary. We propose an algorithm, namely robust
contextual dueling bandit (RCDB), which is based on uncertainty-weighted maximum likelihood
estimation. Our algorithm achieves an Õ(d√T + dC) regret bound, where T is the number of
rounds, d is the dimension of the context, and 0 ≤ C ≤ T is the total number of adversarial
feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in
scenarios with and without (C = 0) adversarial feedback. Additionally, we conduct experiments
to evaluate our proposed algorithm against various types of adversarial feedback. Experimental
results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the
presence of adversarial feedback.

1 Introduction
Acquiring an appropriate reward proves challenging in numerous real-world applications, often
necessitating intricate instrumentation (Zhu et al., 2020) and time-consuming calibration (Yu et al.,
2020) to achieve satisfactory levels of sample efficiency. For instance, in training large language
models (LLM) using reinforcement learning from human feedback (RLHF), the diverse values and
perspectives of humans can lead to uncalibrated and noisy rewards (Ouyang et al., 2022). In contrast,
preference-based data, which involves comparing or ranking various actions, is a more straightforward
method for capturing human judgments and decisions. In this context, the dueling bandit model
(Yue et al., 2012) provides a problem framework that focuses on optimal decision-making through
pairwise comparisons, rather than relying on the absolute reward for each action.
However, human feedback may not always be reliable. In real-world applications, human feedback
is particularly vulnerable to manipulation through preference label flip. Adversarial feedback can
significantly increase the risk of misleading a large language model (LLM) into erroneously prioritizing
harmful content, under the false belief that it reflects human preference. Despite the significant
influence of adversarial feedback, there is limited existing research on the impact of adversarial
feedback specifically within the context of dueling bandits. A notable exception is Agarwal et al.
(2021), which studies dueling bandits when an adversary can flip some of the preference labels
received by the learner. They proposed an algorithm that is agnostic to the amount of adversarial
feedback introduced by the adversary. However, their setting has the following two limitations.
First, their study was confined to a finite-armed setting, which renders their results less applicable
to modern applications such as RLHF. Second, their adversarial feedback focuses on the comparison
matrix. At each round, the adversary observes the outcomes of all pairwise comparisons and then
decides to corrupt some of the pairs before the agent selects the actions. This assumption does not
align well with the real-world scenario, where the adversary often flips the preference label based on
the information of the selected actions.
In this paper, to address the above challenge, we aim to develop contextual dueling bandit
algorithms that are robust to adversarial feedback. This enables us to effectively tackle problems
involving a large number of actions while also taking advantage of contextual information. We
specifically consider a scenario where the adversary knows the selected action pair and the true
preference of their comparison. In this setting, the adversary’s only decision is whether to flip the
preference label or not. We highlight our contributions as follows:

• We propose a new algorithm RCDB, which integrates uncertainty-dependent weights into the
Maximum Likelihood Estimator (MLE). Intuitively, our choice of weight is designed to induce a
higher degree of skepticism about potentially “untrustworthy” feedback. The agent is encouraged
to focus more on feedback that is more likely to be genuine, effectively diminishing the impact of
any adversarial feedback.
• We analyze the regret of our algorithm under at most C adversarial feedback. Our result
consists of two terms: a C-independent term Õ(d√T), which matches the lower bound established
in Bengs et al. (2022) for uncorrupted linear contextual dueling bandits, and a C-dependent term
Õ(dC). Furthermore, we establish a lower bound for dueling bandits with adversarial feedback,
demonstrating the optimality of our adversarial term. Consequently, our algorithm for dueling
bandits attains the optimal regret in both scenarios, with and without adversarial feedback.
• We conduct extensive experiments to validate the effectiveness of our algorithm RCDB. To compre-
hensively assess RCDB’s robustness against adversarial feedback, we evaluate its performance under
various types of adversarial feedback and compare the results with state-of-the-art dueling bandit
algorithms. Experimental results demonstrate the superiority of our algorithm in the presence of
adversarial feedback, which corroborate our theoretical analysis.

Notation. In this paper, we use plain letters such as x to denote scalars, lowercase bold letters
such as x to denote vectors and uppercase bold letters such as X to denote matrices. For a vector
x, ∥x∥2 denotes its ℓ2-norm. The weighted ℓ2-norm associated with a positive-definite matrix A is
defined as ∥x∥A = √(x⊤Ax). For two symmetric matrices A and B, we use A ⪰ B to denote that A − B
is positive semidefinite. We use 1 to denote the indicator function and 0 to denote the zero vector.
For two actions a, b, we use a ≻ b to denote that a is preferable to b. For a positive integer N, we
use [N] to denote {1, 2, . . . , N}. We use standard asymptotic notations including O(·), Ω(·), Θ(·),
as well as Õ(·), Ω̃(·), Θ̃(·), which additionally hide logarithmic factors.

Table 1: Comparison of algorithms for robust bandits and dueling bandits.

Model            | Algorithm                                                       | Setting                    | Regret
Bandits          | Multi-layer Active Arm Elimination Race (Lykouris et al., 2018) | K-armed Bandits            | Õ(K^1.5 C √T)
Bandits          | BARBAR (Gupta et al., 2019)                                     | K-armed Bandits            | Õ(√(KT) + KC)
Bandits          | SBE (Li et al., 2019)                                           | Linear Bandits             | Õ(d²C/∆ + d⁵/∆²)
Bandits          | Robust Phased Elimination (Bogunovic et al., 2021)              | Linear Bandits             | Õ(√(dT) + d^1.5 C + C²)
Bandits          | Robust weighted OFUL (Zhao et al., 2021)                        | Linear Contextual Bandits  | Õ(dC√T)
Bandits          | CW-OFUL (He et al., 2022)                                       | Linear Contextual Bandits  | Õ(d√T + dC)
Dueling Bandits  | WIWR (Agarwal et al., 2021)                                     | K-armed Dueling Bandits    | Õ(K²C/∆_min + Σ_{i≠i*} K²/∆_i²)
Dueling Bandits  | Versatile-DB (Saha and Gaillard, 2022)                          | K-armed Dueling Bandits    | Õ(√(KC) + Σ_{i≠i*} 1/∆_i)
Dueling Bandits  | RCDB (Our work)                                                 | Contextual Dueling Bandits | Õ(d√T + dC)

2 Related Work
Bandits with Adversarial Reward. The multi-armed bandit problem, involving an agent
making sequential decisions among multiple arms, has been studied with both stochastic rewards
(Lai et al., 1985; Lai, 1987; Auer, 2002; Auer et al., 2002a; Kalyanakrishnan et al., 2012; Lattimore
and Szepesvári, 2020; Agrawal and Goyal, 2012), and adversarial rewards (Auer et al., 2002b; Bubeck
et al., 2012). Moreover, a line of works focuses on designing algorithms that can achieve near-optimal
regret bounds for both stochastic bandits and adversarial bandits simultaneously (Bubeck and
Slivkins, 2012; Seldin and Slivkins, 2014; Auer and Chiang, 2016; Seldin and Lugosi, 2017; Zimmert
and Seldin, 2019; Lee et al., 2021), which is known as “the best of both worlds” guarantee. Distinct
from fully stochastic and fully adversarial models, Lykouris et al. (2018) studied a setting, where
only a portion of the rewards is subject to corruption. They proposed an algorithm with a regret
dependent on the corruption level C, defined as the cumulative sum of the corruption magnitudes at
each round. Their result is C times worse than the regret without corruption. Gupta et al. (2019)
improved the result by providing a regret guarantee comprising two terms, a corruption-independent
term that matches the regret lower bound without corruption, and a corruption-dependent term that
is linear in C. In addition, Gupta et al. (2019) proved a lower bound demonstrating the optimality
of the linear dependency on C.
Contextual Bandits with Corruption. Li et al. (2019) studied stochastic linear bandits with
corruption and presented an instance-dependent regret bound linearly dependent on the corruption
level C. Bogunovic et al. (2021) studied the same problem and proposed an algorithm with near-
optimal regret in the non-corrupted case. Lee et al. (2021) studied this problem in a different setting,
where the adversarial corruptions are generated through the inner product of a corrupted vector
and the context vector. For linear contextual bandits, Bogunovic et al. (2021) proved that under an
additional context diversity assumption, the regret of a simple greedy algorithm is nearly optimal
with an additive corruption term. Zhao et al. (2021) and Ding et al. (2022) extended the OFUL
algorithm (Abbasi-Yadkori et al., 2011) and proved a regret with a corruption term polynomially

dependent on the total number of rounds T . He et al. (2022) proposed an algorithm for known
corruption level C to remove the polynomial dependency on T in the corruption term, which only
has a linear dependency on C. They also proved a lower bound showing the optimality of linear
dependency on C for linear contextual bandits with a known corruption level. Additionally, He et al.
(2022) extended the proposed algorithm to an unknown corruption level and provided a near-optimal
performance guarantee that matches the lower bound. For more extensions, Kuroki et al. (2023)
studied best-of-both-worlds algorithms for linear contextual bandits. Ye et al. (2023) proposed a
corruption robust algorithm for nonlinear contextual bandits.
Dueling Bandits. The dueling bandit model was first proposed in Yue et al. (2012). Compared
with bandits, the agent will select two arms and receive the preference feedback between the two arms
from the environment. For general preference, there may not exist the “best” arm that always wins in
the pairwise comparison. Therefore, various alternative winners are considered, including Condorcet
winner (Zoghi et al., 2014; Komiyama et al., 2015), Copeland winner (Zoghi et al., 2015; Wu and
Liu, 2016; Komiyama et al., 2016), Borda winner (Jamieson et al., 2015; Falahatgar et al., 2017;
Heckel et al., 2018; Saha et al., 2021; Wu et al., 2023) and von Neumann winner (Ramamohan et al.,
2016; Dudı́k et al., 2015; Balsubramani et al., 2016), along with their corresponding performance
metrics. To handle potentially large action space or context information, Saha (2021) studied
a structured contextual dueling bandit setting. In this setting, each arm possesses an unknown
intrinsic reward. The comparison is determined based on a logistic function of the relative rewards.
In a similar setting, Bengs et al. (2022) studied contextual linear stochastic transitivity model
with contextualized utilities. Di et al. (2023) proposed a layered algorithm with variance aware
regret bound. Another line of works does not make the reward assumption. Instead, they assume
the preference feedback can be represented by a function class. Saha and Krishnamurthy (2022)
designed an algorithm that achieves the optimal regret for K-armed contextual dueling bandit
problem. Sekhari et al. (2023) studied contextual dueling bandits in a more general setting and
proposed an algorithm that provides guarantees for both regret and the number of queries.
Dueling Bandits with Adversarial Feedback. A line of work has focused on dueling bandits
with adversarial feedback or corruption. Gajane et al. (2015) studied a fully adversarial utility-based
version of dueling bandits, which was proposed in Ailon et al. (2014). Saha et al. (2021) considered
the Borda regret for adversarial dueling bandits without the assumption of utility. In a setting
parallel to that in Lykouris et al. (2018); Gupta et al. (2019), Agarwal et al. (2021) studied K-armed
dueling bandits in a scenario where an adversary has the capability to corrupt part of the feedback
received by the learner. They designed an algorithm whose regret comprises two terms: one that
is optimal in uncorrupted scenarios, and another that is linearly dependent on the total times of
adversarial feedback C. Later on, Saha and Gaillard (2022) achieved “best-of-both world” result for
noncontextual dueling bandits and improved the adversarial term of Agarwal et al. (2021) in the
same setting. For contextual dueling bandits, Wu et al. (2023) proposed an EXP3-type algorithm for
the adversarial linear setting using Borda regret. In this paper, we study the influence of adversarial
feedback within contextual dueling bandits, particularly in a setting where only a minority of the
feedback is adversarial. In contrast, most previous studies have focused on the multi-armed
dueling bandit framework without integrating context information. The notable exception is Wu
et al. (2023); however, this study does not provide guarantees regarding the dependency on the
number of adversarial feedback instances.

3 Preliminaries
In this work, we study linear contextual dueling bandits with adversarial feedback. At each round
t ∈ [T ], the agent observes the context information xt from a context set X and the corresponding
action set A. Utilizing this context information, the agent selects two actions, at and bt . Subsequently,
the environment will generate a binary feedback (i.e., preference label) lt = 1(at ≻ bt ) ∈ {0, 1}
indicating the preferable action. Following Bengs et al. (2022); Di et al. (2023), we assume the
existence of a reward function r∗ (x, a) dependent on the context information x and action a.
Furthermore, there exists a monotonically increasing link function σ satisfying σ(x) + σ(−x) = 1.
The preference probability will be determined by the link function and the difference between the
rewards of the selected arms, i.e.,
P(a ≻ b | x) = σ(r∗(x, a) − r∗(x, b)).     (3.1)
As a common practice in the study of dueling bandits (Saha, 2021; Di et al., 2023), we make a linear
assumption on the reward function:
Assumption 3.1. Let ϕ : X × A → Rd be a known feature map, with ∥ϕ(x, a)∥2 ≤ 1 for any
(x, a) ∈ X × A. We define the reward function rθ parameterized by θ ∈ Rd , with rθ (x, a) =
⟨θ, ϕ(x, a)⟩. Moreover, there exists θ ∗ satisfying rθ∗ = r∗ , with ∥θ ∗ ∥2 ≤ B.
Assumption 3.1 aligns with the setting studied in Xiong et al. (2023) and acts as a special case
of Sekhari et al. (2023).
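For concreteness, the following minimal sketch (our own illustration, not the authors' code) shows how a preference label can be generated from this linear model under Assumption 3.1, assuming a logistic link σ(x) = 1/(1 + e^{−x}) and the feature map ϕ(x, a) = a used in the experiments later; all function names are ours.

```python
import numpy as np

def sigmoid(x: float) -> float:
    """Logistic link sigma(x) = 1 / (1 + exp(-x)); it satisfies sigma(x) + sigma(-x) = 1."""
    return 1.0 / (1.0 + np.exp(-x))

def preference_probability(theta_star, phi_a, phi_b):
    """P(a > b | x) = sigma(r*(x, a) - r*(x, b)) with r*(x, a) = <theta*, phi(x, a)>, as in (3.1)."""
    return sigmoid(theta_star @ phi_a - theta_star @ phi_b)

def sample_preference_label(rng, theta_star, phi_a, phi_b):
    """Draw the binary feedback l = 1(a > b) from the Bernoulli preference model."""
    return int(rng.random() < preference_probability(theta_star, phi_a, phi_b))

# Usage sketch: d = 5, phi(x, a) = a, and a true parameter with bounded norm.
rng = np.random.default_rng(0)
d = 5
theta_star = rng.uniform(-0.5, 0.5, size=d)
theta_star = 2.0 * theta_star / np.linalg.norm(theta_star)   # normalize, as in Section 7
phi_a, phi_b = np.full(d, 1 / np.sqrt(d)), np.full(d, -1 / np.sqrt(d))
label = sample_preference_label(rng, theta_star, phi_a, phi_b)
```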
During each round t, the optimal action is denoted by a∗t = argmaxa∈A r∗ (xt , a). The regret over
the first T rounds is defined as the cumulative sub-optimality between the rewards of the optimal
actions and the rewards of the selected actions:
Regret(T) = Σ_{t=1}^{T} [2r∗(xt, a∗t) − r∗(xt, at) − r∗(xt, bt)].     (3.2)

Remark 3.2. The linear contextual dueling bandit model shares a similar parametric structure
with both the generalized linear bandit model (Filippi et al., 2010; Li et al., 2017) and the logistic
bandit model (Faury et al., 2020; Abeille et al., 2021; Faury et al., 2022). Specifically, in each
round t, the act of selecting two arms, at and bt , in the dueling bandit framework can be reduced to
pulling an auxiliary “arm” ϕ(x, at) − ϕ(x, bt) with the linear parameter θ∗, as in the logistic bandit
or generalized linear bandit models. However, a key distinction arises in the computation of regret between
these tasks. In the dueling bandit problem, the reward for the selected pair is represented as
r(xt, at) + r(xt, bt), in contrast to r(xt, at) − r(xt, bt) in the single-arm analogy. This discrepancy
implies that the linear contextual dueling bandit problem diverges significantly from both linear
bandits and logistic bandits.
We also make an assumption on the derivative of the link function, which is common in the
study of generalized linear models for bandits and dueling bandits (Filippi et al., 2010; Di et al.,
2023).
Assumption 3.3. The link function σ is differentiable. Furthermore, its first-order derivative
satisfies:
σ̇(·) ≥ κ
for some constant κ > 0.

In our setting, however, the agent does not directly observe the true binary feedback. Instead, an
adversary will see both the choice of the agent and the true feedback. Based on the information, the
adversary can decide whether to corrupt the binary feedback or not. We represent the adversary’s
decision at round t by a corruption indicator ct , which takes values from the set {0, 1}. If the
adversary chooses not to corrupt the result, we have ct = 0. Otherwise, we have ct = 1, which means
adversarial feedback at this round, and the agent will observe a flipped preference label, i.e., the
observation ot = 1 − lt. We define C as the total number of adversarial feedback:

C = Σ_{t=1}^{T} ct.

Our corruption indicator ct is slightly different from the corruption level ct for general feedback (He et al., 2022): in the binary feedback scenario, the corruption level ct = |ot − lt| within each round is either 0 or 1, so the total number of adversarial feedback C also equals the corruption level Σ_{t=1}^{T} |ot − lt| under binary feedback.

Remark 3.4. In our work, the adversary can corrupt the feedback after observing the action
chosen by the agent, which is referred to as the strong adversary in Bogunovic et al. (2021); He
et al. (2022). In comparison, a weak adversary is limited to corrupting feedback without knowledge
of the agent’s selection. Since the weak adversary lacks knowledge of the selected action, it usually
predetermines a set of action pairs to corrupt. Under this situation, the number of adversarial
feedback ct at round t is defined either as the maximum corruption magnitude for all pairs (For our
binary feedback, this would be 1 if adversarial feedback exists for one action, and 0 otherwise) or
as the total number of corrupted pairs at round t (Agarwal et al., 2021). Therefore, for a strong
adversary with knowledge of the agent’s selected action, the problem of adversarial feedback becomes
more challenging compared to scenarios involving a weak adversary.
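As a small illustration of this protocol (our own sketch, with hypothetical names), the strong adversary observes the selected pair and the true label before deciding whether to flip it, and the corruption count C simply accumulates its decisions:

```python
def corrupt_observation(c_t: int, l_t: int) -> int:
    """Strong adversary: after seeing (a_t, b_t) and the true label l_t, it chooses c_t in {0, 1}.
    If c_t = 1 the agent observes the flipped label o_t = 1 - l_t; otherwise o_t = l_t."""
    return 1 - l_t if c_t == 1 else l_t

# Example: the total number of adversarial feedback is C = sum_t c_t.
corruption_indicators = [1, 0, 0, 1, 0]   # hypothetical decisions by the adversary
true_labels = [1, 1, 0, 0, 1]
observations = [corrupt_observation(c, l) for c, l in zip(corruption_indicators, true_labels)]
C = sum(corruption_indicators)            # here C = 2
```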

4 Algorithm
In this section, we present our new algorithm RCDB, designed for learning contextual linear dueling
bandits. The main algorithm is illustrated in Algorithm 1. At a high level, we incorporate uncertainty-
dependent weighting into the Maximum Likelihood Estimator (MLE) to counter adversarial feedback.
Specifically, at each round t ∈ [T], we construct the estimator of the parameter θ by solving the
following equation:

λθ + Σ_i wi (σ(ϕi⊤ θ) − oi) ϕi = 0,     (4.1)

where we denote ϕi = ϕ(xi, ai) − ϕ(xi, bi) for simplicity. Here, the uncertainty-aware weight wt is
selected as a truncation of the exploration bonus ∥ϕt∥_{Σt^{-1}}, i.e.,

wt = min{1, α/∥ϕt∥_{Σt^{-1}}}.     (4.2)

This type of weighting is initially introduced by He et al. (2022) for the linear bandit problem.
Intuitively speaking, the exploration bonus captures the uncertainty associated with the duel (ai , bi )
in the learning process. For duel (ai , bi ) with high uncertainty, it becomes challenging for the agent
to detect whether the observation has been corrupted. Under this situation, Algorithm 1 will
only assign a small weight to the observation, aiming to avoid potentially large estimation errors
resulting from adversarial feedback. Conversely, for duels with low uncertainty, a larger weight (up
to 1) is assigned. With the help of this mechanism, the estimation error between θt and θ∗ is upper
bounded by

∥θt − θ∗∥_{Σt} ≤ Õ(αC + d)/κ.

Furthermore, by selecting a proper threshold parameter, e.g., α = d/C, the weighted MLE shares
the same confidence radius as in the no-adversary scenario. In contrast, the standard regularized
estimator employed in dueling bandits (Di et al., 2023) displays a linear dependence on C in the
confidence radius, notably characterized by Õ(C + √d).
After constructing the estimator θt from the weighted MLE, the estimated reward for each
duel (a, b) can be denoted by (ϕ(xt, a) + ϕ(xt, b))⊤ θt. To encourage the exploration of duels (a, b)
with high uncertainty during the learning process, we introduce an exploration bonus of the form
β∥ϕ(xt, a) − ϕ(xt, b)∥_{Σt^{-1}}, which is a well-established technique in the context of linear
bandit problems (Abbasi-Yadkori et al., 2011). The selection of the action pair (a, b) is subsequently
determined by maximizing the estimated reward plus the exploration bonus term, i.e.,

(ϕ(xt, a) + ϕ(xt, b))⊤ θt + β∥ϕ(xt, a) − ϕ(xt, b)∥_{Σt^{-1}}.

Further discussion of this selection rule can be found in Appendix A of Di et al. (2023).

Algorithm 1 Robust Contextual Dueling Bandit (RCDB)

1: Require: α > 0, regularization parameter λ, confidence radius β.
2: for t = 1, . . . , T do
3:   Set
        Σt = λI + Σ_{i=1}^{t−1} wi (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤.     (4.3)
4:   Calculate the MLE θt by solving the equation:
        λκθ + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ) − oi] (ϕ(xi, ai) − ϕ(xi, bi)) = 0.     (4.4)
5:   Observe the context vector xt.
6:   Choose (at, bt) = argmax_{a,b} (ϕ(xt, a) + ϕ(xt, b))⊤ θt + β∥ϕ(xt, a) − ϕ(xt, b)∥_{Σt^{-1}}.
     The adversary sees the feedback lt = 1(at ≻ bt) and decides the indicator ct.
     Observe ot = lt when ct = 0; otherwise observe ot = 1 − lt.
7:   Set the weight wt as in (4.2).
8: end for
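A compact Python sketch of Algorithm 1 is given below. It is our own illustrative implementation, not the authors' code: the MLE in (4.4) is solved here with a few Newton steps on the score equation (any root-finding scheme would do), the link σ is assumed logistic, and the action set is assumed to be a finite list of feature vectors with ϕ(x, a) = a. All class and method names are ours.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RCDB:
    """Sketch of Robust Contextual Dueling Bandit (Algorithm 1) with an uncertainty-weighted MLE."""

    def __init__(self, d, alpha, lam, beta, kappa):
        self.d, self.alpha, self.lam, self.beta, self.kappa = d, alpha, lam, beta, kappa
        self.phis, self.obs, self.weights = [], [], []   # history of phi_i = phi(a_i) - phi(b_i), o_i, w_i

    def _sigma_matrix(self):
        # Sigma_t = lam * I + sum_i w_i * phi_i phi_i^T                       (Eq. 4.3)
        Sigma = self.lam * np.eye(self.d)
        for w, phi in zip(self.weights, self.phis):
            Sigma += w * np.outer(phi, phi)
        return Sigma

    def _weighted_mle(self, n_newton=50):
        # Solve lam*kappa*theta + sum_i w_i (sigma(phi_i^T theta) - o_i) phi_i = 0   (Eq. 4.4)
        theta = np.zeros(self.d)
        for _ in range(n_newton):
            grad = self.lam * self.kappa * theta
            hess = self.lam * self.kappa * np.eye(self.d)
            for w, phi, o in zip(self.weights, self.phis, self.obs):
                p = sigmoid(phi @ theta)
                grad += w * (p - o) * phi
                hess += w * p * (1 - p) * np.outer(phi, phi)
            theta -= np.linalg.solve(hess, grad)
        return theta

    def select_pair(self, actions):
        """Pick (a_t, b_t) maximizing (phi_a + phi_b)^T theta_t + beta * ||phi_a - phi_b||_{Sigma_t^{-1}}."""
        Sigma_inv = np.linalg.inv(self._sigma_matrix())
        theta = self._weighted_mle() if self.phis else np.zeros(self.d)
        best, best_val = None, -np.inf
        for a, b in product(range(len(actions)), repeat=2):
            diff = actions[a] - actions[b]
            val = (actions[a] + actions[b]) @ theta + self.beta * np.sqrt(diff @ Sigma_inv @ diff)
            if val > best_val:
                best, best_val = (a, b), val
        self._last = (actions[best[0]], actions[best[1]], Sigma_inv)
        return best

    def update(self, observed_label):
        """After observing o_t, set w_t = min{1, alpha / ||phi_t||_{Sigma_t^{-1}}} (Eq. 4.2) and store the round."""
        phi_a, phi_b, Sigma_inv = self._last
        phi = phi_a - phi_b
        bonus = np.sqrt(phi @ Sigma_inv @ phi)
        w = 1.0 if bonus <= 1e-12 else min(1.0, self.alpha / bonus)
        self.phis.append(phi); self.obs.append(observed_label); self.weights.append(w)
```

Note that the weight wt is computed with the same Σt used for the selection at round t, matching (4.2), so highly uncertain duels contribute less to the next estimate.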

5 Main Results
5.1 Known Number of Adversarial Feedback
At the center of our algorithm design is the uncertainty-weighted MLE. When faced with adversarial
feedback, the estimation error of the weighted MLE θt can be characterized by the following lemma.
Lemma 5.1. If we set β = (√λ B + αC + √(d log((1 + 2T/λ)/δ)))/κ, then with probability at least
1 − δ, for any t ∈ [T], we have

∥θt − θ∗∥_{Σt} ≤ β.
Remark 5.2. If we set α = (√d + √λ B)/C, then the bonus radius β has no direct dependency on
the number of adversarial feedback C. This observation plays a key role in proving the adversarial
term in the regret without polynomial dependence on the total number of rounds T.
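To see this cancellation explicitly, the following short calculation (ours, using only the expression for β in Lemma 5.1) substitutes this choice of α into the confidence radius:

```latex
% Plugging \alpha = (\sqrt{d} + \sqrt{\lambda}B)/C into \beta from Lemma 5.1, the term \alpha C no longer depends on C:
\beta
  = \frac{\sqrt{\lambda}B + \alpha C + \sqrt{d\log\big((1 + 2T/\lambda)/\delta\big)}}{\kappa}
  = \frac{\sqrt{\lambda}B + \big(\sqrt{d} + \sqrt{\lambda}B\big) + \sqrt{d\log\big((1 + 2T/\lambda)/\delta\big)}}{\kappa}
  = \frac{2\sqrt{\lambda}B + \sqrt{d} + \sqrt{d\log\big((1 + 2T/\lambda)/\delta\big)}}{\kappa}.
```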
With Lemma 5.1, we can present the following regret guarantee of our algorithm RCDB in the
dueling bandit framework.
Theorem 5.3. Under Assumptions 3.1 and 3.3, let 0 < δ < 1 and let the total number of adversarial
feedback be C. If we set the bonus radius to be

β = (√λ B + αC + √(d log((1 + 2T/λ)/δ)))/κ,

then with probability at least 1 − δ, the regret in the first T rounds can be upper bounded by

Regret(T) ≤ 4(√λ B + αC/κ) √(dT log(1 + 2T/λ))
          + (4d√T/κ + √λ B/α + 4C/κ) log((1 + 2T/λ)/δ)
          + 4d^1.5 √(log³((1 + 2T/λ)/δ)) /(ακ).

Moreover, if we set α = (√d + √λ B)/C and λ = 1/B², the regret upper bound can be simplified to

Regret(T) = Õ(d√T/κ + dC/κ),

where Õ hides logarithmic factors in T, B and 1/δ.

Remark 5.4. Our regret bound consists of two terms. The first one is a C-independent term
Õ(d√T), which matches the lower bound Ω̃(d√T) proved in Bengs et al. (2022). This indicates
that our result is optimal in scenarios without adversarial feedback (C = 0). Additionally, our
result includes an additive term that is linearly dependent on the number of adversarial feedback C.
When C = O(√T), the order of the regret is the same as in the stochastic setting, which indicates
the robustness of our algorithm to adversarial feedback. Furthermore, the following theorem
establishes a lower bound for this adversarial term, indicating that our dependency on the number
of adversarial feedback C and the context dimension d is also optimal.
Theorem 5.5. For any dimension d, there exists an instance of dueling bandits with |A| = d, such
that any algorithm with the knowledge of the number of adversarial feedback C must incur Ω(dC)
regret with probability at least 1/2.

Remark 5.6. The proof of Theorem 5.5 follows Bogunovic et al. (2021). In the constructed
instances, only one action has reward 1, while others have 0. Compared with linear bandits, where
the feedback is an exact reward, dueling bandits deal with the comparison between a pair of actions.
A critical observation from our preference model, as formulated in (3.1), is that two actions with
identical rewards result in a pair that is challenging to differentiate. The lower bound can be proved
by corrupting every comparison into a random guess until the budget of adversarial feedback
has been used up. For the detailed proof, please refer to Section A.2. Our lower bound Ω(dC)
shows that our result is nearly optimal, due to the linear dependency on C and d and only logarithmic
dependency on the total number of rounds T.

5.2 Unknown Number of Adversarial Feedback


In our previous analysis, the selection of parameters depended on having prior knowledge of the
total number of adversarial feedback C, which may be impractical in certain scenarios. In this subsection,
we extend our previous result to address the challenge posed by an unknown number of adversarial
feedback C. Our approach to tackle this uncertainty is straightforward: we introduce an adversarial
tolerance threshold C̄ for the adversary count. This threshold can be regarded as an optimistic
estimator of the actual number of adversarial feedback C. Under this situation, the subsequent
theorem provides an upper bound for regret of Algorithm 1 in the case of an unknown number of
adversarial feedback C.

Theorem 5.7. Under Assumptions 3.1 and 3.3, if we set the confidence radius as

β = (√λ B + αC̄ + √(d log((1 + 2T/λ)/δ)))/κ,

with the pre-defined adversarial tolerance threshold C̄ and α = (√d + √λ B)/C̄, then with probability
at least 1 − δ, the regret of Algorithm 1 can be upper bounded as follows:

• If the actual number of adversarial feedback C is smaller than the adversarial tolerance threshold
C̄, then we have

Regret(T) = Õ(d√T + dC̄).

• If the actual number of adversarial feedback C is larger than the adversarial tolerance threshold
C̄, then we have Regret(T) = O(T).

Remark 5.8. For the weak adversary discussed in Remark 3.4, the COBE framework (Wei et al.,
2022) can be applied to adapt an algorithm with a known corruption level C to an algorithm facing
an unknown corruption level. However, our focus is on the strong adversary, and our results are
optimal in the following sense. Theorem 4.12 in He et al. (2022) demonstrates that any linear
bandit algorithm with an optimal regret upper bound in an uncorrupted setting will fail when the
corruption level exceeds Ω(√T), where T is the total number of steps. This result also extends to
the domain of dueling bandits. When the corruption level reaches C̄ = Ω(√T), every algorithm that
is optimal when uncorrupted will incur linear regret. Thus, it is impossible to design an algorithm
that adapts to an unknown level of corruption while maintaining efficient performance guarantees
in the uncorrupted case.

Remark 5.9. It is important to highlight that the effectiveness of our algorithm is highly related to
the choice of the pre-defined adversarial tolerance threshold C̄. When this threshold is conservative,
our algorithm fails to provide a non-trivial performance guarantee. Conversely, an excessively
optimistic adversarial tolerance threshold C̄ can defend against a diverse range of attacks but may
deteriorate the overall algorithmic performance. Therefore, there exists a trade-off between robust
adversarial defense and near-optimal algorithmic performance. For a special scenario, when we set
C̄ = Õ(√T), our algorithm achieves Õ(d√T) regret. This result implies that our algorithm can
exhibit similar performance to the no-adversary case, even in the presence of an unknown number
of adversarial feedback C ≤ √T.

6 Roadmap of the Proof


6.1 Uncertainty-weighted MLE with Adversarial Feedback
In this section, we offer an overview of the proof for Lemma 5.1. The general proof idea for the
uncertainty-weighted MLE with adversarial feedback lies in decomposing the estimation error into
three terms, a stochastic error term, an adversarial term, and an additional regularization term.
Following the analysis of standard (weighted) MLE (Li et al., 2017), we introduce an auxiliary
function:
Gt(θ) = λκθ + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ) − σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ∗)] (ϕ(xi, ai) − ϕ(xi, bi)).

It satisfies two conditions: First, for the true parameter value θ ∗ , Gt (θ ∗ ) has a simple expression,
i.e.,

Gt (θ ∗ ) = λκθ ∗ .

Second, according to (4.4), we can get the value of function Gt at the MLE θt ,
Gt(θt) = Σ_{i=1}^{t−1} wi γi (ϕ(xi, ai) − ϕ(xi, bi)),     (6.1)

where γt = ot − σ((ϕ(xt, at) − ϕ(xt, bt))⊤ θ∗). To connect the desired estimation error with the
function Gt, we use the mean value theorem. This leads to an upper bound on the estimation error:
∥θt − θ∗∥_{Σt} ≤ (1/κ) ∥Gt(θt) − Gt(θ∗)∥_{Σt^{-1}} ≤ (λ/κ) ∥θ∗∥_{Σt^{-1}} + (1/κ) ∥Gt(θt)∥_{Σt^{-1}},

where the first term on the right-hand side is a regularization term and we denote the second term by I1 = (1/κ)∥Gt(θt)∥_{Σt^{-1}}.

For term I1 , we can decompose the summation in (6.1) based on the adversarial feedback ct , i.e.,
Gt(θt) = Σ_{i<t: ci=0} wi γi (ϕ(xi, ai) − ϕ(xi, bi)) + Σ_{i<t: ci=1} wi γi (ϕ(xi, ai) − ϕ(xi, bi)),

where we denote the second sum by I2.

I2 can be further decomposed as

I2 = Σ_{i<t: ci=1} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi)) + Σ_{i<t: ci=1} wi (γi − ϵi) (ϕ(xi, ai) − ϕ(xi, bi)),

where ϵt = lt − σ((ϕ(xt, at) − ϕ(xt, bt))⊤ θ∗). With our notation of adversarial feedback, when
ci = 0 we have γi = ϵi, and in general |γi − ϵi| ≤ 1. Therefore,


I1 ≤ (1/κ) ∥Σ_{i=1}^{t−1} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi))∥_{Σt^{-1}} + (1/κ) Σ_{i<t: ci=1} wi ∥ϕ(xi, ai) − ϕ(xi, bi)∥_{Σt^{-1}},

where the first term is the stochastic term and the second term is the adversarial term. The stochastic
term can be upper bounded with the concentration inequality (Lemma C.2). Additionally, by employing
our specifically chosen weight (4.2), we can control the adversarial term, since
wi ∥ϕ(xi, ai) − ϕ(xi, bi)∥_{Σt^{-1}} ≤ α. Therefore, the adversarial term can be bounded by αC/κ.

6.2 Regret Upper Bound


With a similar argument about the symmetric arm selection rule as in Di et al. (2023), the regret defined
in (3.2) can be bounded by

Regret(T) ≤ Σ_{t=1}^{T} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}}.

Note that, by our selection rule, the weight wt takes one of two possible forms. We decompose the summation
based on these two cases. We have

Regret(T) ≤ Σ_{t: wt=1} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}} + Σ_{t: wt<1} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}},

where we denote the first sum by J1 and the second by J2.
We consider J1 and J2 separately. For the term J1, we define Λt = λI + Σ_{i≤t−1, wi=1} (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤. Then we have Σt ⪰ Λt, and therefore

∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}} ≤ ∥ϕ(xt, at) − ϕ(xt, bt)∥_{Λt^{-1}}.

Using Lemma C.3 with xt = ϕ(xt, at) − ϕ(xt, bt), we have

J1 ≤ 4β√(dT log(1 + 2T/λ)).     (6.2)
For the term J2, we note that wt < 1 implies wt = α/∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}. Therefore, we have

J2 ≤ Σ_{t=1}^{T} min{1, (4β/α) ∥√wt (ϕ(xt, at) − ϕ(xt, bt))∥²_{Σt^{-1}}}.

Using Lemma C.3 with x′t = √wt (ϕ(xt, at) − ϕ(xt, bt)), we have

J2 ≤ 4dβ log(1 + 2T/λ)/α.     (6.3)

We conclude the proof of the regret bound by combining (6.2) and (6.3).

7 Experiments
7.1 Experiment Setup
Preference Model. We study the effect of adversarial feedback with the preference model
determined by (3.1), where σ(x) = 1/(1 + e−x ). We randomly generate the underlying parameter in
[−0.5, 0.5]d and normalize it to be a vector with ∥θ ∗ ∥2 = 2. Then, we set it to be the underlying
parameter and construct the reward utilized in the preference model as r∗ (x, a) = ⟨θ ∗ , ϕ(x, a)⟩. We
set the action set A = {−1/√d, 1/√d}^d. For simplicity, we assume ϕ(x, a) = a. In our experiment,
we set the dimension d = 5, with the size of the action set |A| = 2^d = 32.

Adversarial Attack Methods. We study the performance of our algorithm using different
adversarial attack methods. We categorize the first two methods as “weak” primarily because the
adversary in these scenarios does not utilize information about the agent’s actions. In contrast,
we classify the latter two methods as “strong” attacks. In these cases, the adversary leverages a
broader scope of information, including knowledge of the actions selected by the agent and the true
preference model. This enables it to devise more targeted adversarial methods; a short code sketch of all four strategies is given after the list below.

• “Greedy Attack”: The adversary will flip the preference label for the first C rounds. After that, it
will not corrupt the result anymore.
• “Random Attack”: At each round, the adversary will flip the preference label with the probability
of 0 < p < 1, until the times of adversarial feedback reach C.
• “Adversarial Attack”: The adversary can have access to the true preference model. It will only
flip the preference label when it aligns with the preference model, i.e., the probability for the
preference model to make that decision is larger than 0.5, until the times of adversarial feedback
reach C.
• “Misleading Attack”: The adversary selects a suboptimal action. It will make sure this arm is
always the winner in the comparison until the times of adversarial feedback reach C. In this way,
it will mislead the agent to believe this action is the optimal one.
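The following is our own illustrative implementation of the four attack strategies; the class, its parameters (e.g., `budget` for the total number of adversarial feedback C, `flip_prob`, and the fixed suboptimal `target_arm`), and the logistic link are assumptions made for this sketch.

```python
import numpy as np

class Adversary:
    """Illustrative sketch of the four attack strategies; flip() returns the corruption indicator c_t."""

    def __init__(self, budget, method, flip_prob=0.5, target_arm=None, theta_star=None):
        self.budget, self.method = budget, method
        self.flip_prob, self.target_arm, self.theta_star = flip_prob, target_arm, theta_star
        self.used = 0
        self.rng = np.random.default_rng(0)

    def flip(self, phi_a, phi_b, true_label):
        if self.used >= self.budget:
            return 0
        if self.method == "greedy":               # flip every label until the budget C is exhausted
            c = 1
        elif self.method == "random":             # flip with probability p until the budget is exhausted
            c = int(self.rng.random() < self.flip_prob)
        elif self.method == "adversarial":        # flip only labels that agree with the true preference model
            p = 1.0 / (1.0 + np.exp(-(self.theta_star @ (phi_a - phi_b))))
            agrees = (true_label == 1 and p > 0.5) or (true_label == 0 and p < 0.5)
            c = int(agrees)
        elif self.method == "misleading":          # make a fixed suboptimal arm always win its comparisons
            involved = np.allclose(phi_a, self.target_arm) or np.allclose(phi_b, self.target_arm)
            target_wins = np.allclose(phi_a, self.target_arm) if true_label == 1 else np.allclose(phi_b, self.target_arm)
            c = int(involved and not target_wins)
        else:
            c = 0
        self.used += c
        return c
```

The "greedy" and "random" strategies ignore the selected actions, matching the weak attacks above, while the "adversarial" and "misleading" strategies use the selected pair and the true preference model.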

Experiment Setup. For each experiment instance, we simulate the interaction with the environ-
ment for T = 2000 rounds. At each round, the feedback for the action pair selected by the algorithm
is generated according to the defined preference model. Subsequently, the adversary observes both
the selected actions and their corresponding feedback and then engages in one of the previously
described adversarial attack methods. We report the regret defined in (3.2) averaged across 10
random runs.
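To make the evaluation protocol concrete, a minimal per-round simulation (our own sketch; it uses a uniformly random pair as a stand-in policy, since the point here is only the interaction order and the regret metric (3.2)) could look like this:

```python
import numpy as np

def run_round(rng, theta_star, actions, adversary_flip):
    """One round: select a pair, draw the true label from (3.1), let the adversary corrupt it,
    and return the instantaneous regret 2 r*(a*) - r*(a_t) - r*(b_t) from (3.2)."""
    rewards = actions @ theta_star
    a_idx, b_idx = rng.integers(len(actions)), rng.integers(len(actions))   # placeholder policy
    p = 1.0 / (1.0 + np.exp(-(rewards[a_idx] - rewards[b_idx])))
    true_label = int(rng.random() < p)
    c = adversary_flip(actions[a_idx], actions[b_idx], true_label)          # strong adversary sees everything
    observed = 1 - true_label if c else true_label                          # the agent would learn from `observed`
    return 2 * rewards.max() - rewards[a_idx] - rewards[b_idx], observed
```

Summing these per-round regrets over T = 2000 rounds and averaging over 10 runs yields the cumulative-regret curves reported below; `adversary_flip` can be any of the attack strategies sketched above.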

7.2 Performance Comparison


We first introduce the algorithms studied in this section.

• MaxInP: Maximum Informative Pair by Saha (2021). It involves maintaining a standard MLE.
With the estimated model, it then identifies a set of promising arms that could possibly beat the rest. The
selection of arm pairs is then strategically designed to maximize the uncertainty in the difference
between the two arms within this promising set, referred to as “maximum informative”.

[Figure 1: four panels plotting cumulative regret against the epoch under each attack type, comparing MaxInp, CoLSTIM, MaxPairUCB, and RCDB.]

Figure 1: Comparison of RCDB (Our Algorithm 1), MaxInp (Saha, 2021), CoLSTIM (Bengs et al.,
2022) and MaxPairUCB (Di et al., 2023). We report the cumulative regret with various adversarial
attack methods (Greedy, Random, Adversarial, Misleading). For the baselines, the parameters
are carefully tuned to achieve better results with different attack methods. The total number of
adversarial feedback is C = ⌊√T⌋.

• CoLSTIM: The method by Bengs et al. (2022). It involves maintaining a standard MLE for the
estimated model. Based on this model, the first arm is selected as the one with the highest
estimated reward, implying it is the most likely to prevail over competitors. The second arm is
selected to be the first arm’s toughest competitor, with an added uncertainty bonus.
• MaxPairUCB: This algorithm was proposed in Di et al. (2023). It uses the regularized MLE to
estimate the parameter θ ∗ . Then it selects the actions based on a symmetric action selection rule,
i.e. the actions with the largest estimated reward plus some uncertainty bonus.
• RCDB: Algorithm 1 proposed in this paper. The key difference from the other algorithms is the
use of uncertainty-dependent weights in the calculation of the MLE (4.4). Then we use the same symmetric action
selection rule as MaxPairUCB. Our experiment results show that the uncertainty weight is critical
in the face of adversarial feedback.

Our results are demonstrated in Figure 1. In Figure 1(a) and Figure 1(b), we observe scenarios
where the adversary is “weak” due to the lack of access to information regarding the selected actions
and the underlying preference model. Notably, in these situations, our algorithm RCDB outperforms
all other baseline algorithms, demonstrating its robustness. Among the other algorithms, CoLSTIM
performs as the strongest competitor.
In Figure 1(c), the adversary employs a ’stronger’ adversarial method. Due to the inherent
randomness of the model, some labels may naturally be ’incorrect’. An adversary with knowledge
of the selected actions and the preference model can strategically neglect these naturally incorrect
labels and selectively flip the others. This method proves catastrophic for algorithms to learn the
true model, as it results in the agent encountering only incorrect preference labels at the beginning.
Our results indicate that this leads to significantly higher regret. However, it’s noteworthy that our
algorithm RCDB demonstrates considerable robustness.
In Figure 1(d), the adversary employs a strategy aimed at misleading algorithms into believing
a suboptimal action is the best choice. The algorithm CoLSTIM appears to be the most susceptible
to being cheated by this method. Despite the deployment of ’strong’ adversarial methods, as shown
in both Figure 1(c) and Figure 1(d), our algorithm, RCDB, consistently demonstrates exceptional
robustness against these attacks. A significant advantage of RCDB is that its parameter is
selected solely based on the number of adversarial feedback C, irrespective of the nature of the
adversarial methods employed. This contrasts with other algorithms where parameter tuning must

be specifically adapted for each distinct adversarial method.
[Figure 2: cumulative regret plotted against the number of adversarial feedback C for MaxInp, CoLSTIM, MaxPairUCB, and RCDB.]

Figure 2: The relationship between cumulative regret and the number of adversarial feedback C.
For this specific experiment, we employ the “greedy attack” method to generate the adversarial
feedback. C is selected from the set [20, 40, 60, 80, 100, 120, 140, 160, 180, 200] (10 adversarial levels).

7.3 Robustness to Different Numbers of Adversarial Feedback


In this section, we test the performance of the algorithms as the number of adversarial feedback increases.
Our results show a linear dependency on the number of adversarial feedback C, which is consistent
with the theoretical results proved in Theorems 5.3 and 5.5. In comparison to other
algorithms, RCDB demonstrates superior robustness against adversarial feedback, as evidenced by its
notably smaller regret.

8 Conclusion
In this paper, we focus on the contextual dueling bandit problem from adversarial feedback. We
introduce a novel algorithm, RCDB, which utilizes an uncertainty-weighted Maximum Likelihood
Estimator (MLE) approach. This algorithm not only achieves optimal theoretical results in scenarios
with and without adversarial feedback but also demonstrates superior performance with synthetic
data. As a future direction, we aim to extend our uncertainty-weighted method to more
general settings involving preference-based data. A particularly promising avenue lies in
addressing adversarial feedback within the process of aligning large language models
using Reinforcement Learning from Human Feedback (RLHF).

A Proof of Theorems in Section 5


A.1 Proof of Theorem 5.3
In this subsection, we provide the proof of Theorem 5.3. We condition on the high-probability event
in Lemma 5.1:

E = {∥θt − θ∗∥_{Σt} ≤ β, ∀t ∈ [T]}.

Let rt = 2r∗ (xt , a∗t ) − r∗ (xt , at ) − r∗ (xt , bt ) be the regret incurred at round t. The following lemma
provides the upper bound of rt .
Lemma A.1. Let 0 < δ < 1. If we set β = (√λ B + αC + √(d log((1 + 2T/λ)/δ)))/κ, then on the event E,
the regret of Algorithm 1 incurred at round t can be upper bounded by

rt ≤ min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}}.

With Lemma A.1, we can provide the proof of Theorem 5.3.

Proof of Theorem 5.3. Using Lemma A.1, the total regret can be upper bounded by
Regret(T) ≤ Σ_{t=1}^{T} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}}.

Our weight wt has two possible values. We decompose the summation based on the two cases
separately. We have
Regret(T) ≤ Σ_{t: wt=1} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}} + Σ_{t: wt<1} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}},

where we denote the first sum by J1 and the second by J2.

For the term J1, we consider a partial summation over the rounds where wt = 1. Let Λt = λI + Σ_{i≤t−1, wi=1} (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤. Then we have

J1 ≤ 4β Σ_{t: wt=1} min{1, ∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}}
   ≤ 4β Σ_{t: wt=1} min{1, ∥ϕ(xt, at) − ϕ(xt, bt)∥_{Λt^{-1}}}
   ≤ 4β √( T Σ_{t: wt=1} min{1, ∥ϕ(xt, at) − ϕ(xt, bt)∥²_{Λt^{-1}}} )
   ≤ 4β √( dT log(1 + 2T/λ) ),     (A.1)

where the second inequality holds due to Σt ⪰ Λt, the third inequality holds due to the Cauchy-Schwarz
inequality, and the last inequality holds due to Lemma C.3.
For the term J2, the weight in this summation satisfies wt < 1, and therefore wt = α/∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}. Then we have

J2 = Σ_{t: wt<1} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}} · wt ∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}} /α
   ≤ Σ_{t=1}^{T} min{4, (2β/α) ∥√wt (ϕ(xt, at) − ϕ(xt, bt))∥²_{Σt^{-1}}}
   ≤ Σ_{t=1}^{T} min{1, (4β/α) ∥√wt (ϕ(xt, at) − ϕ(xt, bt))∥²_{Σt^{-1}}}
   ≤ 4dβ log(1 + 2T/λ)/α,     (A.2)
where the first equality holds due to the choice of wt . The first inequality holds because each term
in the summation is positive. The last inequality holds due to Lemma C.3. Combining (A.1) and
(A.2), we complete the proof of Theorem 5.3.

A.2 Proof of Theorem 5.5


Proof of Theorem 5.5. Our proof adapts the argument in Bogunovic et al. (2021) to dueling bandits.
For any dimension d, we construct d instances, each with θi = ei , where ei is the i-th standard
basis vector. We set the action set A = {ei }di=1 . Therefore, in the i-th instance, the reward for the
i-th action will be 1. For the other actions, it will be 0. Therefore, the i-th action will be more
preferable to any other action, while for all other pairs, the feedback is simply a random guess.
Consider an adversary that knows the exact instance. When the comparison involves the i-th
action, it will corrupt the feedback with a random guess. Otherwise, it will not corrupt. In the i-th
instance, the adversary stops the adversarial attack only after C times of comparison involving the
i-th action. However, after Cd/4 rounds, at least d/2 actions have not been compared for C times.
For the instances corresponding to these actions, the agent learns no information and suffers from
Ω(dC) regret. This completes the proof of Theorem 5.5.

A.3 Proof of Theorem 5.7


Proof of Theorem 5.7. Here, based on the relationship between C and the threshold C̄, we discuss
two distinct cases separately.
• In the scenario where C̄ < C, Algorithm 1 can still ensure a trivial regret bound, with the guarantee
that Regret(T) ≤ 2T.

• In the scenario where C ≤ C̄, we know that C̄ remains a valid upper bound on the number of
adversarial feedback. Under this situation, Algorithm 1 operates successfully with C̄ adversarial
feedback. Therefore, according to Theorem 5.3, the regret is upper bounded by

Regret(T) ≤ Õ(d√T + dC̄).

B Proof of Lemmas 5.1 and A.1


B.1 Proof of Lemma 5.1
Proof of Lemma 5.1. Using similar reasoning as in Li et al. (2017), we define some auxiliary quantities:

Gt(θ) = λκθ + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ) − σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ∗)] (ϕ(xi, ai) − ϕ(xi, bi)),
ϵt = lt − σ((ϕ(xt, at) − ϕ(xt, bt))⊤ θ∗),
γt = ot − σ((ϕ(xt, at) − ϕ(xt, bt))⊤ θ∗),
Zt = Σ_{i=1}^{t−1} wi γi (ϕ(xi, ai) − ϕ(xi, bi)).

In Algorithm 1, θt is chosen to be the solution of the following equation,


λκθt + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θt) − oi] (ϕ(xi, ai) − ϕ(xi, bi)) = 0.     (B.1)

Then we have

Gt(θt) = λκθt + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θt) − σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ∗)] (ϕ(xi, ai) − ϕ(xi, bi))
       = Σ_{i=1}^{t−1} wi [oi − σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ∗)] (ϕ(xi, ai) − ϕ(xi, bi))
       = Zt.

The analysis in Li et al. (2017); Di et al. (2023) shows that this equation has a unique solution,
with θt = Gt^{−1}(Zt). Using the mean value theorem, for any θ1, θ2 ∈ R^d, there exist m ∈ [0, 1] and
θ̄ = mθ1 + (1 − m)θ2 such that the following equation holds:

Gt(θ1) − Gt(θ2) = λκ(θ1 − θ2) + Σ_{i=1}^{t−1} wi [σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ1) − σ((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ2)] (ϕ(xi, ai) − ϕ(xi, bi))
               = [λκI + Σ_{i=1}^{t−1} wi σ̇((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ̄) (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤] (θ1 − θ2).

We define F(θ̄) as

F(θ̄) = λκI + Σ_{i=1}^{t−1} wi σ̇((ϕ(xi, ai) − ϕ(xi, bi))⊤ θ̄) (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤.

Moreover, we can see that Gt(θ∗) = λκθ∗. Recall Σt = λI + Σ_{i=1}^{t−1} wi (ϕ(xi, ai) − ϕ(xi, bi))(ϕ(xi, ai) − ϕ(xi, bi))⊤. We have

∥Gt(θt) − Gt(θ∗)∥²_{Σt^{-1}} = (θt − θ∗)⊤ F(θ̄) Σt^{-1} F(θ̄) (θt − θ∗)
                           ≥ κ² (θt − θ∗)⊤ Σt (θt − θ∗)
                           = κ² ∥θt − θ∗∥²_{Σt},

where the first inequality holds due to σ̇(·) ≥ κ > 0 and F(θ̄) ⪰ κΣt. Then we have the following
estimate of the estimation error:

∥θt − θ∗∥_{Σt} ≤ (1/κ) ∥Gt(θt) − Gt(θ∗)∥_{Σt^{-1}}
             ≤ λ∥θ∗∥_{Σt^{-1}} + (1/κ) ∥Zt∥_{Σt^{-1}}
             ≤ √λ ∥θ∗∥2 + (1/κ) ∥Zt∥_{Σt^{-1}},

where the second inequality holds due to the triangle inequality and Gt(θ∗) = λκθ∗, and the last
inequality holds due to Σt ⪰ λI. Finally, we need to bound the term ∥Zt∥_{Σt^{-1}}. To study the impact
of adversarial feedback, we decompose the summation in (6.1) based on the adversarial feedback ct, i.e.,

Zt = Σ_{i<t: ci=0} wi γi (ϕ(xi, ai) − ϕ(xi, bi)) + Σ_{i<t: ci=1} wi γi (ϕ(xi, ai) − ϕ(xi, bi)).

When ci = 1, i.e. with adversarial feedback, |γi − ϵi | = 1. On the contrary, when ci = 0, γi = ϵi .


Therefore,
Σ_{i<t: ci=0} wi γi (ϕ(xi, ai) − ϕ(xi, bi)) = Σ_{i<t: ci=0} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi)),
Σ_{i<t: ci=1} wi γi (ϕ(xi, ai) − ϕ(xi, bi)) = Σ_{i<t: ci=1} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi)) + Σ_{i<t: ci=1} wi (γi − ϵi)(ϕ(xi, ai) − ϕ(xi, bi)).

Summing up the two equalities, we have

Zt = Σ_{i=1}^{t−1} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi)) + Σ_{i<t: ci=1} wi (γi − ϵi)(ϕ(xi, ai) − ϕ(xi, bi)).
Therefore,

∥Zt∥_{Σt^{-1}} ≤ ∥Σ_{i=1}^{t−1} wi ϵi (ϕ(xi, ai) − ϕ(xi, bi))∥_{Σt^{-1}} + ∥Σ_{i<t: ci=1} wi (γi − ϵi)(ϕ(xi, ai) − ϕ(xi, bi))∥_{Σt^{-1}},

where we denote the first term by I1 and the second term by I2.

For the term I1, with probability at least 1 − δ, for all t ∈ [T], it can be bounded by

I1 ≤ √( 2 log( det(Σt)^{1/2} det(Σ0)^{−1/2} / δ ) ),

due to Lemma C.2. Using wi ≤ 1, we have wi ∥ϕ(xi, ai) − ϕ(xi, bi)∥2 ≤ 2. Moreover, we have

det(Σt) ≤ (Tr(Σt)/d)^d = ((dλ + Σ_{i=1}^{t−1} wi ∥ϕ(xi, ai) − ϕ(xi, bi)∥2²)/d)^d ≤ ((dλ + 2T)/d)^d,

where the first inequality holds because det(A) ≤ (Tr(A)/d)^d for every positive semidefinite matrix A ∈ R^{d×d},
and the second inequality holds due to wi ∥ϕ(xi, ai) − ϕ(xi, bi)∥2 ≤ 2. It is easy to see that det(Σ0) = λ^d. Therefore, the
term I1 can be bounded by

I1 ≤ √( d log((1 + 2T/λ)/δ) ).     (B.2)

For I2, with our choice of the weight wi, we have

I2 ≤ Σ_{i<t: ci=1} ∥wi (ϕ(xi, ai) − ϕ(xi, bi))∥_{Σt^{-1}}
   ≤ Σ_{i<t: ci=1} ∥wi (ϕ(xi, ai) − ϕ(xi, bi))∥_{Σi^{-1}}
   ≤ Σ_{i<t: ci=1} α
   ≤ αC,     (B.3)

where the second inequality holds due to Σt ⪰ Σi, the third inequality holds due to
wi ≤ α/∥ϕ(xi, ai) − ϕ(xi, bi)∥_{Σi^{-1}}, and the last inequality holds due to the definition of C. Combining
(B.2) and (B.3), we complete the proof of Lemma 5.1.

B.2 Proof of Lemma A.1


Proof of Lemma A.1. Let the regret incurred at the t-th round be rt = 2r∗(xt, a∗t) − r∗(xt, at) − r∗(xt, bt). It can be decomposed as

rt = 2r∗(xt, a∗t) − r∗(xt, at) − r∗(xt, bt)
   = ⟨ϕ(xt, a∗t) − ϕ(xt, at), θ∗⟩ + ⟨ϕ(xt, a∗t) − ϕ(xt, bt), θ∗⟩
   = ⟨ϕ(xt, a∗t) − ϕ(xt, at), θ∗ − θt⟩ + ⟨ϕ(xt, a∗t) − ϕ(xt, bt), θ∗ − θt⟩ + ⟨2ϕ(xt, a∗t) − ϕ(xt, at) − ϕ(xt, bt), θt⟩
   ≤ ∥ϕ(xt, a∗t) − ϕ(xt, at)∥_{Σt^{-1}} ∥θ∗ − θt∥_{Σt} + ∥ϕ(xt, a∗t) − ϕ(xt, bt)∥_{Σt^{-1}} ∥θ∗ − θt∥_{Σt} + ⟨2ϕ(xt, a∗t) − ϕ(xt, at) − ϕ(xt, bt), θt⟩
   ≤ β∥ϕ(xt, a∗t) − ϕ(xt, at)∥_{Σt^{-1}} + β∥ϕ(xt, a∗t) − ϕ(xt, bt)∥_{Σt^{-1}} + ⟨2ϕ(xt, a∗t) − ϕ(xt, at) − ϕ(xt, bt), θt⟩,

where the first inequality holds due to the Cauchy-Schwarz inequality, and the second inequality holds
due to the high-probability confidence event E. Using our action selection rule, we have

⟨ϕ(xt, a∗t) − ϕ(xt, at), θt⟩ + β∥ϕ(xt, a∗t) − ϕ(xt, at)∥_{Σt^{-1}} ≤ ⟨ϕ(xt, bt) − ϕ(xt, at), θt⟩ + β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}},
⟨ϕ(xt, a∗t) − ϕ(xt, bt), θt⟩ + β∥ϕ(xt, a∗t) − ϕ(xt, bt)∥_{Σt^{-1}} ≤ ⟨ϕ(xt, at) − ϕ(xt, bt), θt⟩ + β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}.

Adding the above two inequalities, we have

β∥ϕ(xt, a∗t) − ϕ(xt, at)∥_{Σt^{-1}} + β∥ϕ(xt, a∗t) − ϕ(xt, bt)∥_{Σt^{-1}} ≤ ⟨ϕ(xt, at) + ϕ(xt, bt) − 2ϕ(xt, a∗t), θt⟩ + 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}.

Therefore, we prove that the regret at round t can be upper bounded by

rt ≤ 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}.

With a simple observation, we have rt ≤ 4. Therefore, the total regret can be upper bounded by

Regret(T) ≤ Σ_{t=1}^{T} min{4, 2β∥ϕ(xt, at) − ϕ(xt, bt)∥_{Σt^{-1}}}.

C Auxiliary Lemmas
Lemma C.1 (Azuma–Hoeffding inequality, Cesa-Bianchi and Lugosi 2006). Let {ηt}_{t=1}^{T} be a
martingale difference sequence with respect to a filtration {Ft}, satisfying |ηt| ≤ R for some constant
R, where ηt is Ft+1-measurable and E[ηt | Ft] = 0. Then for any 0 < δ < 1, with probability at least 1 − δ, we
have

Σ_{t=1}^{T} ηt ≤ R√(2T log(1/δ)).

Lemma C.2 (Lemma 9, Abbasi-Yadkori et al. 2011). Let {ϵt}_{t=1}^{T} be a real-valued stochastic
process with corresponding filtration {Ft}_{t=0}^{T} such that ϵt is Ft-measurable and ϵt is conditionally
R-sub-Gaussian, i.e.,

∀λ ∈ R, E[e^{λϵt} | Ft−1] ≤ exp(λ²R²/2).

Let {xt}_{t=1}^{T} be an R^d-valued stochastic process where xt is Ft−1-measurable, and for any t ∈ [T],
further define Σt = λI + Σ_{i=1}^{t} xi xi⊤. Then with probability at least 1 − δ, for all t ∈ [T], we have

∥Σ_{i=1}^{t} xi ϵi∥²_{Σt^{-1}} ≤ 2R² log( det(Σt)^{1/2} det(Σ0)^{−1/2} / δ ).

Lemma C.3 (Lemma 11, Abbasi-Yadkori et al. 2011). For any λ > 0 and any sequence {xt}_{t=1}^{T} ⊆ R^d,
define Zt = λI + Σ_{i=1}^{t−1} xi xi⊤ for t ∈ [T]. Then, provided that ∥xt∥2 ≤ L holds for all t ∈ [T], we
have

Σ_{t=1}^{T} min{1, ∥xt∥²_{Zt^{-1}}} ≤ 2d log(1 + TL²/(dλ)).

References
Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011). Improved algorithms for linear
stochastic bandits. In Advances in Neural Information Processing Systems.

Abeille, M., Faury, L. and Calauzènes, C. (2021). Instance-wise minimax-optimal algorithms


for logistic bandits. In International Conference on Artificial Intelligence and Statistics. PMLR.

Agarwal, A., Agarwal, S. and Patil, P. (2021). Stochastic dueling bandits with adversarial
corruption. In Algorithmic Learning Theory. PMLR.

Agrawal, S. and Goyal, N. (2012). Analysis of thompson sampling for the multi-armed bandit
problem. In Conference on learning theory. JMLR Workshop and Conference Proceedings.

Ailon, N., Karnin, Z. and Joachims, T. (2014). Reducing dueling bandits to cardinal bandits.
In International Conference on Machine Learning. PMLR.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of


Machine Learning Research 3 397–422.

Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002a). Finite-time analysis of the multiarmed
bandit problem. Machine Learning 47 235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R. E. (2002b). The nonstochastic
multiarmed bandit problem. SIAM journal on computing 32 48–77.

Auer, P. and Chiang, C.-K. (2016). An algorithm with nearly optimal pseudo-regret for both
stochastic and adversarial bandits. In Conference on Learning Theory. PMLR.

Balsubramani, A., Karnin, Z., Schapire, R. E. and Zoghi, M. (2016). Instance-dependent


regret bounds for dueling bandits. In Conference on Learning Theory. PMLR.

Bengs, V., Saha, A. and Hüllermeier, E. (2022). Stochastic contextual dueling bandits under
linear stochastic transitivity models. In International Conference on Machine Learning. PMLR.

Bogunovic, I., Losalka, A., Krause, A. and Scarlett, J. (2021). Stochastic linear bandits
robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics.
PMLR.

Bubeck, S., Cesa-Bianchi, N. et al. (2012). Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends® in Machine Learning 5 1–122.

Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits.
In Conference on Learning Theory. JMLR Workshop and Conference Proceedings.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university
press.

Di, Q., Jin, T., Wu, Y., Zhao, H., Farnoud, F. and Gu, Q. (2023). Variance-aware regret
bounds for stochastic contextual dueling bandits. arXiv preprint arXiv:2310.00968 .

Ding, Q., Hsieh, C.-J. and Sharpnack, J. (2022). Robust stochastic linear contextual bandits
under adversarial attacks. In International Conference on Artificial Intelligence and Statistics.
PMLR.

Dudı́k, M., Hofmann, K., Schapire, R. E., Slivkins, A. and Zoghi, M. (2015). Contextual
dueling bandits. In Conference on Learning Theory. PMLR.

Falahatgar, M., Hao, Y., Orlitsky, A., Pichapati, V. and Ravindrakumar, V. (2017).
Maxing and ranking with few assumptions. Advances in Neural Information Processing Systems
30.

Faury, L., Abeille, M., Calauzènes, C. and Fercoq, O. (2020). Improved optimistic algorithms
for logistic bandits. In International Conference on Machine Learning. PMLR.

Faury, L., Abeille, M., Jun, K.-S. and Calauzènes, C. (2022). Jointly efficient and optimal
algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics.
PMLR.

Filippi, S., Cappe, O., Garivier, A. and Szepesvári, C. (2010). Parametric bandits: The
generalized linear case. Advances in Neural Information Processing Systems 23.

Gajane, P., Urvoy, T. and Clérot, F. (2015). A relative exponential weighing algorithm
for adversarial utility-based dueling bandits. In International Conference on Machine Learning.
PMLR.

Gupta, A., Koren, T. and Talwar, K. (2019). Better algorithms for stochastic bandits with
adversarial corruptions. In Conference on Learning Theory. PMLR.

He, J., Zhou, D., Zhang, T. and Gu, Q. (2022). Nearly optimal algorithms for linear contextual
bandits with adversarial corruptions. Advances in Neural Information Processing Systems 35
34614–34625.

Heckel, R., Simchowitz, M., Ramchandran, K. and Wainwright, M. (2018). Approximate


ranking from pairwise comparisons. In International Conference on Artificial Intelligence and
Statistics. PMLR.

Jamieson, K., Katariya, S., Deshpande, A. and Nowak, R. (2015). Sparse dueling bandits. In
Artificial Intelligence and Statistics. PMLR.

Kalyanakrishnan, S., Tewari, A., Auer, P. and Stone, P. (2012). Pac subset selection in
stochastic multi-armed bandits. In ICML, vol. 12.

Komiyama, J., Honda, J., Kashima, H. and Nakagawa, H. (2015). Regret lower bound and
optimal algorithm in dueling bandit problem. In Conference on learning theory. PMLR.

Komiyama, J., Honda, J. and Nakagawa, H. (2016). Copeland dueling bandit problem: Regret
lower bound, optimal algorithm, and computationally efficient algorithm. In International
Conference on Machine Learning. PMLR.

Kuroki, Y., Rumi, A., Tsuchiya, T., Vitale, F. and Cesa-Bianchi, N. (2023). Best-of-both-
worlds algorithms for linear contextual bandits. arXiv preprint arXiv:2312.15433 .

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The annals
of statistics 1091–1114.

Lai, T. L., Robbins, H. et al. (1985). Asymptotically efficient adaptive allocation rules. Advances
in applied mathematics 6 4–22.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Lee, C.-W., Luo, H., Wei, C.-Y., Zhang, M. and Zhang, X. (2021). Achieving near instance-
optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously.
In International Conference on Machine Learning. PMLR.

Li, L., Lu, Y. and Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual
bandits. In International Conference on Machine Learning. PMLR.

Li, Y., Lou, E. Y. and Shan, L. (2019). Stochastic linear optimization with adversarial corruption.
arXiv preprint arXiv:1909.02109 .

Lykouris, T., Mirrokni, V. and Paes Leme, R. (2018). Stochastic bandits robust to adversarial
corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C.,
Agarwal, S., Slama, K., Ray, A. et al. (2022). Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems 35 27730–27744.

Ramamohan, S. Y., Rajkumar, A. and Agarwal, S. (2016). Dueling bandits: Beyond condorcet
winners to general tournament solutions. Advances in Neural Information Processing Systems 29.

Saha, A. (2021). Optimal algorithms for stochastic contextual preference bandits. Advances in
Neural Information Processing Systems 34 30050–30062.

Saha, A. and Gaillard, P. (2022). Versatile dueling bandits: Best-of-both world analyses for
learning from relative preferences. In International Conference on Machine Learning. PMLR.

Saha, A., Koren, T. and Mansour, Y. (2021). Adversarial dueling bandits. In International
Conference on Machine Learning. PMLR.

Saha, A. and Krishnamurthy, A. (2022). Efficient and optimal algorithms for contextual dueling
bandits under realizability. In International Conference on Algorithmic Learning Theory. PMLR.

Sekhari, A., Sridharan, K., Sun, W. and Wu, R. (2023). Contextual bandits and imitation
learning via preference-based active queries. arXiv preprint arXiv:2307.12926 .

Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the exp3++
algorithm for stochastic and adversarial bandits. In Conference on Learning Theory. PMLR.

Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial
bandits. In International Conference on Machine Learning. PMLR.

Wei, C.-Y., Dann, C. and Zimmert, J. (2022). A model selection approach for corruption robust
reinforcement learning. In International Conference on Algorithmic Learning Theory. PMLR.

Wu, H. and Liu, X. (2016). Double thompson sampling for dueling bandits. Advances in neural
information processing systems 29.

Wu, Y., Jin, T., Lou, H., Farnoud, F. and Gu, Q. (2023). Borda regret minimization for
generalized linear dueling bandits. arXiv preprint arXiv:2303.08816 .

Xiong, W., Dong, H., Ye, C., Zhong, H., Jiang, N. and Zhang, T. (2023). Gibbs sampling from
human feedback: A provable kl-constrained framework for rlhf. arXiv preprint arXiv:2312.11456 .

Ye, C., Xiong, W., Gu, Q. and Zhang, T. (2023). Corruption-robust algorithms with uncertainty
weighting for nonlinear contextual bandits and markov decision processes. In International
Conference on Machine Learning. PMLR.

Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C. and Levine, S. (2020).
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In
Conference on robot learning. PMLR.

Yue, Y., Broder, J., Kleinberg, R. and Joachims, T. (2012). The k-armed dueling bandits
problem. Journal of Computer and System Sciences 78 1538–1556.

Zhao, H., Zhou, D. and Gu, Q. (2021). Linear contextual bandits with adversarial corruptions.
arXiv preprint arXiv:2110.12615 .

Zhu, H., Yu, J., Gupta, A., Shah, D., Hartikainen, K., Singh, A., Kumar, V. and
Levine, S. (2020). The ingredients of real-world robotic reinforcement learning. arXiv preprint
arXiv:2004.12570 .

Zimmert, J. and Seldin, Y. (2019). An optimal algorithm for stochastic and adversarial bandits.
In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR.

Zoghi, M., Karnin, Z. S., Whiteson, S. and De Rijke, M. (2015). Copeland dueling bandits.
Advances in neural information processing systems 28.

Zoghi, M., Whiteson, S., Munos, R. and Rijke, M. (2014). Relative upper confidence bound
for the k-armed dueling bandit problem. In International conference on machine learning. PMLR.

