
Stochastic Bandits

February 21, 2025

1 Concentration Inequalities
Theorem 1 (Chebyshev's Inequality). Let X be any random variable. Then for all t > 0,

Pr(|X − E[X]| ≥ t) ≤ Var(X)/t².
Theorem 2 (Markov's inequality). Let X be any random variable that takes only non-negative values. Then for any c > 0,

Pr(X ≥ cE[X]) ≤ 1/c.

Theorem 3 (Hoeffding's inequality). Let X1, . . . , Xn be independent bounded random variables such that ai ≤ Xi ≤ bi for all i ∈ [n]. Let Sn = X1 + · · · + Xn, so E[Sn] = Σi E[Xi].
For any t > 0, we have

Pr(|Sn − E[Sn]| ≥ t) ≤ 2·exp( −2t² / Σi (bi − ai)² )

If 0 ≤ Xi ≤ 1 for all i ∈ [n], then we have

Pr(|Sn − E[Sn]| ≥ t) ≤ 2·exp( −2t²/n )

Let X̄n = Sn/n = (X1 + · · · + Xn)/n, so E[X̄n] = Σi E[Xi]/n. For any t > 0, we have

Pr(|X̄n − E[X̄n]| ≥ t) ≤ 2·exp( −2t²n² / Σi (bi − ai)² )

When 0 ≤ Xi ≤ 1 for all i, then

Pr(|X̄n − E[X̄n]| ≥ t) ≤ 2·exp( −2t²n )

Remark. Chebyshev's and Hoeffding's inequalities work even for random variables that can take negative values.
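As a quick sanity check of the bound for [0, 1]-valued averages, here is a small simulation (a sketch, not part of the notes; the Bernoulli(0.5) choice and the sample sizes are arbitrary assumptions) comparing the empirical deviation probability of X̄n with the bound 2·exp(−2t²n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 20000

# X_1, ..., X_n i.i.d. Bernoulli(0.5), so each X_i lies in [0, 1].
samples = rng.random((trials, n)) < 0.5
empirical_means = samples.mean(axis=1)

# Empirical frequency of the event |X̄_n - E[X̄_n]| >= t.
deviation_freq = np.mean(np.abs(empirical_means - 0.5) >= t)

# Hoeffding's bound for averages of [0, 1]-valued variables.
hoeffding_bound = 2 * np.exp(-2 * t**2 * n)

print(f"empirical: {deviation_freq:.4f}  Hoeffding bound: {hoeffding_bound:.4f}")
```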

2 Weak Law of Large Numbers
Let X1, . . . , Xn be n iid (independent and identically distributed) random variables. Let E[X] = µ and Var(X) = v. Let X̄n = Σi Xi / n be the empirical average. Note that E[X̄n] = µ.
When n → ∞, the empirical average X̄n converges to the true mean µ. We can see this from Chebyshev's inequality. As X1, . . . , Xn are independent, we have

Var(X̄n) = Var( Σi Xi / n ) = n·Var(X)/n² = Var(X)/n.

So as n → ∞, the variance of X̄n tends to 0. So we have

Pr(|X̄n − µ| ≥ ϵ) ≤ Var(X)/(nϵ²) = v/(nϵ²)
So for any fixed ϵ > 0, for sufficiently large values of n (this value will depend
on ϵ), the probability Pr(|X̄n − µ| ≥ ϵ) will approach 0.
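For concreteness (a worked instance, not from the notes): if Var(X) = 1/4 (e.g., the Xi are Bernoulli with mean 1/2) and ϵ = 0.05, the bound reads Pr(|X̄n − µ| ≥ 0.05) ≤ (1/4)/(n · 0.0025) = 100/n, so n = 10000 samples already make this probability at most 0.01.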

3 Stochastic Bandits
We start with the basic model of bandits, which is independent rewards. An
algorithm has K possible arms to choose from, and there are T rounds, both
K and T are known in advance (we will see later that one can relax the as-
sumption that T should be known in advance). Each arm a is associated with
a reward distribution Da (pmf/pdf) which is unknown to the algorithm. The
mean reward of the distribution Da will be most relevant for us and is denoted
by µa, i.e., µa = E_{r∼Da}[r] (r ∼ Da denotes that r is sampled/drawn from the distribution Da).
In each round t ∈ [T ]:
1. algorithm picks an arm at ∈ [K].
2. reward rt is sampled from the distribution Dat .
3. algorithm receives the reward rt .
The algorithm can be randomized, i.e., in any round it can fix a probability
distribution over arms and can pick an arm from this distribution.
Again, it is important to note that µ1 , . . . , µK (and distributions D1 , . . . , DK )
are unknown to the algorithm. The algorithm only sees the rewards (of the arms
pulled by the algorithm).
We make the following assumptions:
1. rewards are bounded. For simplicity, we will assume rewards in any round are in [0, 1]. So the means µ1, . . . , µK are all in [0, 1].
2. as we are in stochastic setting, we assume all drawn rewards are indepen-
dent.

An important point to note is that, regardless of whether the algorithm is deterministic or randomized, the arm pulled by the algorithm in any round is a random variable. This is because the algorithm's decision to pull an arm in some round depends on the past history of rewards observed by the algorithm; since the received rewards are themselves random variables, the arms pulled are also random variables. This will become clear when we see some algorithms.
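To make the interaction protocol concrete, here is a minimal environment sketch (illustrative only; the Bernoulli reward distributions and the class and method names are assumptions, not part of the notes).

```python
import numpy as np

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli(mu_a) rewards in {0, 1}."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # unknown to the algorithm
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Reward r_t ~ D_{a_t}, drawn independently in each round.
        return float(self.rng.random() < self.means[a])

    def regret(self, pulls):
        # R(T) = sum_t (mu* - mu_{a_t}) for a sequence of pulled arms.
        mu_star = self.means.max()
        return float(sum(mu_star - self.means[a] for a in pulls))

# Example: always pulling arm 0 on the instance mu = (0.3, 0.7) for 100 rounds.
env = BernoulliBandit([0.3, 0.7])
print(env.regret([0] * 100))  # 100 * (0.7 - 0.3) = 40.0
```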

Notations: Throughout the course, we will stick to the following notations.


The set of arms will be denoted by [K] and the total number of rounds will be
denoted by T . The best arm is the arm a that has highest mean reward. We
will use a∗ for the best arm and µ∗ for the mean reward of the best arm. That
is a∗ = arg maxa µa and µ∗ = µa∗ = maxa µa . For any suboptimal arm, we
will use ∆a := µ∗ − µa for the gap of arm a. Finally, the arm pulled by the
algorithm in tth round will be denoted by at .

Regret: How do we measure the performance of an algorithm? Recall that


the goal of the algorithm is to maximize the sum of rewards received in all
rounds.
One standard approach is to compare the algorithm performance with the
best possible algorithm that knows the distributions Da for all a ∈ [K]. Note
that if means µ1 , . . . , µK are known then best strategy to maximize total ex-
pected rewards is to always pick the arm a∗ to get total expected reward of
T µ∗. It makes sense to define the regret of an algorithm after T rounds as

R(T) = µ∗·T − Σ_{t=1}^{T} µat = Σ_{t=1}^{T} (µ∗ − µat)

where at is the arm pulled by the algorithm in the t-th round. Thus R(T) is the regret incurred by the algorithm after T rounds due to not knowing the means
µ1 , µ2 , . . . , µK . Note that the regret R(T ) is a random variable. We are inter-
ested in the expected regret E[R(T )] of the algorithm.

E[R(T)] = T·µ∗ − E[ Σ_{t=1}^{T} µat ]

The expectation in the above definition is taken over all the randomness, i.e.,
randomness over the draw of rewards from the distributions D1 , . . . , DK and
internal randomness of the algorithm (if the algorithm is randomized).
Remark. 1. By definition, R(T) and E[R(T)] of an algorithm depend on the problem instance, i.e., the means µ1, . . . , µK. Of course, we want an algorithm whose expected regret is small for all instances. In the previous line, 'for all problem instances' is important. For example, consider a naive algorithm that always pulls arm 1. This algorithm will have regret 0 if arm 1 has mean 1 and the other arms have mean 0. But of course this algorithm will suffer on other instances, such as µ1 = 0 and µ2 = 1, where it will have regret T. So we want an algorithm that performs well on all instances.
2. If the regret is small then the algorithm’s performance is close to best
performance when the distributions are known. So we kind of learn the
distributions.
3. If r1, . . . , rT are the rewards received by the algorithm in rounds 1, 2, . . . , T respectively, then one can see that E[Σ_{t=1}^{T} rt] = E[Σ_{t=1}^{T} µat]. Hence the expected regret of an algorithm is the difference between the expected total reward of the best strategy that knows µ1, . . . , µK and the expected total reward of the algorithm.
Note: In some textbooks, R(T) is called the random regret or realized regret and E[R(T)] is called the regret. It is just a matter of convention which names we use. Often, it will be clear from the context whether we are talking about R(T) or E[R(T)].
Our goal is to design an algorithm whose expected regret E[R(T)] is as small as possible for all problem instances. As in a general discussion of algorithms, we will ignore constants and only consider the Big-O dependence on T. Also, we assume T ≥ K, as any reasonable algorithm will pull each arm at least once.
Note that since the reward in any round is in [0, 1], R(T) ≤ T always. We want algorithms for which R(T) grows sublinearly in T, i.e., R(T)/T → 0 as T → ∞. In other words, the average regret per round should go to 0 when T is large (so essentially we have learned the unknown distributions). The smaller the regret's dependence on T, the faster the average regret per round converges to 0.

3.1 Explore Then Commit Algorithm


This algorithm is very intuitive and is perhaps the first algorithm one would come up with. If we knew the means µ1, . . . , µK, then in each round the best algorithm would pick the arm a∗ (recall the notation: a∗ has the highest mean µ∗ = maxa µa) to maximize total expected reward. The ETC algorithm first tries each arm m times and computes the estimate µ̂a of the mean µa for every arm a. Thereafter, the algorithm always picks an arm that maximizes the empirical mean µ̂a (we will call this arm a′). The value of m is set to T^{2/3}·(log T)^{1/3}/K^{2/3} to optimize the expected regret.
We will now prove our first theorem of this course.
Theorem 4. For any instance (i.e., for any values of unknowns µ1 , . . . , µK ),
the expected regret of the algorithm ETC is O(T 2/3 (log T )1/3 K 1/3 ).
Proof. Let ϵ = √(5 ln T / m). For any arm a, let Bada be the event that |µ̂a − µa| ≥ ϵ. As µ̂a = (Σ_{i=1}^{m} rai)/m and all rewards are in [0, 1], from Hoeffding's inequality we have

Pr(Bada) = Pr(|µ̂a − µa| ≥ ϵ) ≤ 2/e^{2ϵ²m} = 2/T^{10}.

Algorithm 1: Explore Then Commit
1. m = T^{2/3}·(log T)^{1/3}/K^{2/3};
2. Try each arm m times. For each arm a, let ra1, ra2, . . . , ram be the rewards received.;
/* Line 2 is the exploration/investment phase. In this phase, the algorithm tries each arm, so even the worst arms get pulled. But we hope that we will be able to find a near-optimal arm for the remaining rounds. */
3. For each arm a, set µ̂a as the average received reward for arm a, i.e., µ̂a = Σ_{i∈[m]} rai / m. Let a′ be the arm that maximizes µ̂a, i.e., a′ = arg maxa {µ̂a}.;
/* We hope a′ is a near-optimal arm, i.e., µ∗ − µa′ is very small. */
4. From round m · K + 1 onwards, always pick the arm a′.

Let Bad = ∪a Bada be the event that for some arm a, |µ̂a − µa| ≥ ϵ holds. So Good = Bad^c is the event that, for all arms a, we have |µ̂a − µa| < ϵ.
By the union bound, we have

Pr(Bad) ≤ Σa Pr(Bada) ≤ K · 2/T^{10} ≤ 2/T^{9}

(as we are assuming T ≥ K, since any reasonable algorithm will try each arm at least once).
Also,

Pr(Good) = 1 − Pr(Bad) ≥ 1 − 2/T^{9}
As Pr(Bad) is negligible (we have chosen ϵ = √(5 ln T / m) to ensure that this happens), it will suffice to bound the expected regret conditioned on the event Good. The following calculation formally shows this.

E[R(T)] = Pr(Bad)·E[R(T)|Bad] + Pr(Good)·E[R(T)|Good]
        ≤ (2/T^{9}) · T + 1 · E[R(T)|Good]
        ≤ 2/T^{8} + E[R(T)|Good],

where the second inequality comes from E[R(T)|Bad] ≤ T as all rewards are in [0, 1].

Bounding E[R(T)|Good]. We now assume the event Good, i.e., for all arms a, we have |µ̂a − µa| < ϵ. Recall that R(T) = Σ_{t=1}^{T} (µ∗ − µat). The contribution to the regret in the investment phase is at most mK (assuming a worst-case contribution of 1 in each round). The contribution to the regret from the (mK + 1)-st round till the end is (T − mK)(µ∗ − µa′) (recall that a′ is the arm that maximizes µ̂a and is always chosen from the (mK + 1)-st round onwards). We now claim that (µ∗ − µa′) < 2ϵ. For now let us assume this claim and bound the regret. We will prove the claim later.

R(T)|Good ≤ mK + (T − mK)·2ϵ
          ≤ mK + 2Tϵ
          = mK + 2T·√(5 ln T / m)

As m increases, mK increases and 2T·√(5 ln T / m) decreases, so the above quantity is minimized when mK = 2T·√(5 ln T / m), i.e., when m = (2√5)^{2/3}·T^{2/3}·(log T)^{1/3}/K^{2/3}. Substituting this value of m, we get that R(T) (conditioned on Good) is O(T^{2/3} K^{1/3} (log T)^{1/3}). Hence we also have E[R(T)|Good] = O(T^{2/3} K^{1/3} (log T)^{1/3}).
From our earlier calculation, E[R(T)] ≤ 2/T^{8} + E[R(T)|Good] = O(T^{2/3} K^{1/3} (log T)^{1/3}).
It remains to prove the claim that (µ∗ − µa′) < 2ϵ. Note that µ̂a′ ≥ µ̂a∗ > µ∗ − ϵ and also µa′ > µ̂a′ − ϵ. Thus we have µ∗ − µa′ < 2ϵ.
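Before moving on, here is a compact sketch of Explore Then Commit in Python (illustrative only; the function names and the `pull(a)` interface are assumptions, and the constant in m is set to 1 for simplicity).

```python
import math
import numpy as np

def explore_then_commit(K, T, pull):
    """pull(a) returns a reward in [0, 1] sampled from arm a's distribution."""
    m = max(1, int(T ** (2 / 3) * math.log(T) ** (1 / 3) / K ** (2 / 3)))
    assert m * K <= T, "exploration phase must fit into the horizon"
    sums = np.zeros(K)
    pulls = []
    # Exploration/investment phase: try each arm m times.
    for a in range(K):
        for _ in range(m):
            sums[a] += pull(a)
            pulls.append(a)
    # Commit phase: always play the arm a' with the highest empirical mean.
    a_prime = int(np.argmax(sums / m))
    for _ in range(T - m * K):
        pull(a_prime)            # reward is received but no longer needed for decisions
        pulls.append(a_prime)
    return pulls

# Example usage with Bernoulli arms of means (0.4, 0.5, 0.6).
means = np.array([0.4, 0.5, 0.6])
rng = np.random.default_rng(1)
pulls = explore_then_commit(3, 5000, lambda a: float(rng.random() < means[a]))
print("fraction of pulls of the best arm:", pulls.count(2) / len(pulls))
```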

4 Successive Elimination Algorithm
One drawback of ETC algorithm is that it will continue to explore an arm large
number of times (m times) even if an arm’s reward history might suggest to not
pull this arm further. In Successive Elimination algorithm, we discontinue the
arm forever once we have belief that the arm is not good. Below is the high
level description of this algorithm.

Algorithm 2: Successive Elimination - High Level Description


1) Pull every arm once;
2) If there is ‘sufficient evidence’ that some arm a is not a good arm
then remove this arm;
Repeat the above steps over the remaining arms

Now, to describe Successive Elimination fully, we just need to specify what 'sufficient evidence' means. For this, we introduce some notation. For any arm a and round t, let na(t) be the number of times the arm a has been pulled till round t. Obviously, we have Σa na(t) = t. Further, let µ̂a(t) be the empirical mean of the rewards received from arm a till round t. Formally, let rai be the reward received from arm a on its i-th pull. Then µ̂a(t) = Σ_{i=1}^{na(t)} rai / na(t). Let ϵa(t) = √(5 log T / na(t)). Finally, let UCBa(t) = µ̂a(t) + ϵa(t) and LCBa(t) = µ̂a(t) − ϵa(t).
Now we describe the 'sufficient evidence' which Successive Elimination employs. Recall the analysis of the ETC algorithm. There we defined an event Good (and showed that it holds with high probability) and showed that, conditioned on Good, µ̂a − ϵ ≤ µa ≤ µ̂a + ϵ where ϵ = √(5 log T / m). Here also, we will define an event Good (and show that it holds with high probability) conditioned on which, for every arm a and round t, we have LCBa(t) ≤ µa ≤ UCBa(t). Now if at any time t we have UCBa(t) < LCBa′(t) for some arms a and a′, then we know that µa < µa′. Hence, it is not a good strategy to pull arm a in any subsequent round because we would be better off pulling arm a′. In other words, we can eliminate arm a for the future.

Algorithm 3: Successive Elimination


Activate all the arms;
while #-rounds < T do
Pull each active arm once (and receive rewards);
Deactivate all arms a such that there exists some other arm a′ with UCBa < LCBa′ ;
end
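A minimal sketch of Successive Elimination (illustrative; the function name and the `pull(a)` interface are assumptions, and the confidence radius uses the 5·log T constant from the notes).

```python
import math
import numpy as np

def successive_elimination(K, T, pull):
    """pull(a) returns a reward in [0, 1] from arm a; returns the list of pulled arms."""
    active = list(range(K))
    counts = np.zeros(K)   # n_a(t)
    sums = np.zeros(K)     # running reward sums
    pulls = []
    while len(pulls) < T:
        # One phase: pull each active arm once.
        for a in list(active):
            if len(pulls) >= T:
                break
            sums[a] += pull(a)
            counts[a] += 1
            pulls.append(a)
        # Deactivate every arm whose UCB falls below some other arm's LCB.
        idx = np.array(active)
        mu_hat = sums[idx] / counts[idx]
        radius = np.sqrt(5 * math.log(T) / counts[idx])
        ucb, lcb = mu_hat + radius, mu_hat - radius
        best_lcb = lcb.max()
        active = [a for a, u in zip(active, ucb) if u >= best_lcb]
    return pulls
```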

Theorem 5. For all instances, i.e., for all values of µ1, . . . , µK,

E[R(T)] = O(√(KT log T))

Proof. Let Good be the following event: for all arms a and for all rounds t, we have |µ̂a(t) − µa| < ϵa(t) (that is, LCBa(t) ≤ µa ≤ UCBa(t)). We will prove that

Pr(Good) ≥ 1 − O(1/T^{8}).

For now, let us assume the above. As for ETC, it will suffice to bound E[R(T)|Good].

E[R(T)] = Pr(Bad)·E[R(T)|Bad] + Pr(Good)·E[R(T)|Good]
        ≤ O(1/T^{8}) · T + 1 · E[R(T)|Good]
        ≤ O(1/T^{7}) + E[R(T)|Good],

where the second inequality comes from E[R(T)|Bad] ≤ T as all rewards are in [0, 1].

Bounding E[R(T)|Good]. Every calculation in this paragraph is conditioned on Good. Note that R(T) = Σt (µ∗ − µat) = Σa na(T)·(µ∗ − µa). If we show that for each arm a, (µ∗ − µa) ≤ 8·√(5 log T / na(T)), then we are done. This is because

R(T) = Σa na(T)·(µ∗ − µa) ≤ Σa na(T)·8·√(5 log T / na(T)) = Σa 8·√(5·na(T)·log T).

As √x is a concave function, we have (Σa √(na(T)))/K ≤ √( (Σa na(T))/K ) = √(T/K). Thus R(T) ≤ 8·√(5KT log T). So it remains to show that for any arm a, we have (µ∗ − µa) ≤ 8·√(5 log T / na(T)). For the sake of analysis, we will refer to the i-th iteration of the while loop as phase i (for any i). Our first easy observation is that the arm a∗ will never be deactivated. Let t be the last round (corresponding to the end of some phase) at which the arm a remained active (in other words, arm a was played exactly once after t). Note that na(t) = na∗(t) (because both arms a and a∗ are active till t, so both of them are played an equal number of times, namely the number of phases completed till t). As the arm a is not deactivated at t, we must have UCBa(t) ≥ LCBa∗(t). Further, we have ϵa(t) = ϵa∗(t) = √(5 log T / na(t)) = √(5 log T / (na(T) − 1)). It is now easy to see that

µ∗ − µa ≤ 4ϵa(t) = 4·√(5 log T / (na(T) − 1)) ≤ 8·√(5 log T / na(T)).

Bounding Pr(Good). It remains to show:

Pr(Good) ≥ 1 − O(1/T^{8})

Let ra1, ra2, . . . , raT be T samples from the distribution Da. Let Bada(t) be the event that |(ra1 + · · · + rat)/t − µa| ≥ √(5 log T / t). By Hoeffding's inequality, we have Pr(Bada(t)) ≤ O(1/e^{10 log T}) = O(1/T^{10}). Let Bada be the event that for some 1 ≤ t ≤ T, we have |(ra1 + · · · + rat)/t − µa| ≥ √(5 log T / t). By the union bound, Pr(Bada) ≤ T · O(1/T^{10}) = O(1/T^{9}). Let Bad = ∪a Bada. Again by the union bound, Pr(Bad) ≤ K · O(1/T^{9}) = O(1/T^{8}).

The regret bound in the above theorem is a worst-case bound, i.e., for any problem instance (that is, for any values of K, T and µ1, . . . , µK), the expected regret satisfies E[R(T)] ≤ O(√(KT log T)). Now we will show another type of bound on the expected regret, namely an instance-dependent bound.
Theorem 6. The expected regret of Successive Elimination satisfies

E[R(T)] ≤ O(log T)·Σ_{a: △a>0} 1/△a

where △a = µ∗ − µa is the gap of arm a.


Proof. We define the events Bad and Good as in the above theorem. Again it will suffice to bound E[R(T)|Good]. All calculations below are conditioned on Good. We claim that for any suboptimal arm a, we have na(T) ≤ 50000·log T / △a². This implies that R(T) = Σa na(T)·△a ≤ O(log T)·Σ_{a: △a>0} 1/△a. Suppose na(T) > 50000·log T / △a². Consider the time t when na(t) = 50000·log T / △a². As the arm a∗ is always active, we also have na∗(t) = 50000·log T / △a². So now we have ϵa(t) = ϵa∗(t) = √(5 log T / na(t)) = △a/100. As we have assumed Good, we have LCBa∗(t) ≥ µ∗ − 2·△a/100 and UCBa(t) ≤ µa + 2·△a/100. So we have UCBa(t) < LCBa∗(t), which implies that arm a will be eliminated in round t, which contradicts na(T) > 50000·log T / △a².

UCB Algorithm
Let na(t), ϵa(t), µ̂a(t), UCBa(t) be as defined before (in the Successive Elimination algorithm).
The idea of UCB is to add a bonus to the empirical mean and then pick an arm that has the highest value of empirical mean plus bonus. The bonus is chosen to be ϵa(t), which is equal to √(5 log T / na(t)) (since µ̂a(t) + ϵa(t) = UCBa(t), the algorithm, at any time t, pulls an arm that has the highest value of UCBa(t)).
Let us now go into the intuition behind the UCB algorithm in detail. During
the initial rounds, the difference between the empirical mean and the actual
mean of an arm can be significant. Consequently, selecting an arm solely based
on the maximum empirical mean value is not a good strategy. To address this
issue, the algorithm incorporates a bonus term. If an arm a is underexplored,
the bonus is large, which encourages the algorithm to explore that arm (even
if its empirical mean is small at this time). One concern might be whether the
bonus term could lead the algorithm to pull bad arms excessively. However, this
scenario will not happen because the bonus term diminishes as the number of
pulls for the arm increases.

Algorithm 4: UCB Algorithm


U CBa = ∞ for all arms a;
/* Initialization */
while #-rounds < T do
Pull an arm that has the highest value of U CBa ;
/* Recall that U CBa (t) = µ̂a (t) + ϵa (t) */
end
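A minimal UCB sketch matching the description above (illustrative; assumes a `pull(a)` function returning rewards in [0, 1], and uses the 5·log T bonus from the notes).

```python
import math
import numpy as np

def ucb(K, T, pull):
    counts = np.zeros(K)
    sums = np.zeros(K)
    pulls = []
    # Initialization: pull each arm once (this realizes UCB_a = infinity at the start).
    for a in range(min(K, T)):
        sums[a] += pull(a)
        counts[a] += 1
        pulls.append(a)
    for _ in range(len(pulls), T):
        # UCB_a(t) = mu_hat_a(t) + sqrt(5 log T / n_a(t)); pull the maximizer.
        ucb_values = sums / counts + np.sqrt(5 * math.log(T) / counts)
        a = int(np.argmax(ucb_values))
        sums[a] += pull(a)
        counts[a] += 1
        pulls.append(a)
    return pulls
```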

UCB achieves the same guarantee on expected regret as Successive Elimination. The proof is almost the same, so we do not give a complete proof here.
Theorem 7. For all instances, i.e., for all values of µ1, . . . , µK,

E[R(T)] = O(√(KT log T))

and

E[R(T)] = O(log T)·Σ_{a: △a>0} 1/△a

where △a = µ∗ − µa is the gap of arm a.
Proof. The proof is almost the same as that of Successive Elimination. To prove E[R(T)] = O(√(KT log T)), we show that for any arm a, µ∗ − µa ≤ 2·√(5 log T / na(T)) (assuming Good), and then the rest of the proof is exactly the same. Let ta be the last round in which arm a was pulled. Thus na(T) = na(ta). Note that µa ≥ UCBa(ta) − 2ϵa(ta) (as we are assuming Good), UCBa∗(ta) ≥ µ∗ (as we are assuming Good) and UCBa∗(ta) ≤ UCBa(ta) (as arm a was pulled in round ta). Hence, µ∗ − µa ≤ 2ϵa(ta) = 2·√(5 log T / na(ta)) = 2·√(5 log T / na(T)). Now the proof goes the same way as in the Successive Elimination algorithm.
The proof of the instance-dependent bound is also similar. We prove here that (assuming Good) for any suboptimal arm a, we have na(T) ≤ 50000·log T / △a², and then the proof is exactly the same. Suppose not. Consider the time t when na(t) = 50000·log T / △a². We have ϵa(t) = △a/100. Note that for any time t′ > t, we have ϵa(t′) ≤ ϵa(t). For any time t′ > t, we have UCBa(t′) ≤ µa + 2ϵa(t′) ≤ µa + 2ϵa(t) < µ∗ ≤ UCBa∗(t′). This means that for any t′ > t we will have UCBa∗(t′) > UCBa(t′), which means that arm a will never be pulled after t. This contradicts na(T) > 50000·log T / △a².
MOSS algorithm (UCB2)

MOSS is a variant of UCB and it has an expected regret of O(√(KT)), and hence it beats both Successive Elimination and UCB in theory. In the next lecture, we will see that no algorithm can have expected regret smaller than Ω(√(KT)), and hence MOSS is the best possible algorithm.
MOSS is the same as UCB except for how the bonus is calculated. Now the bonus is set to √( max(log(T/(K·na(t))), 0) / na(t) ). Let

Ia(t) = µ̂a(t) + √( max(log(T/(K·na(t))), 0) / na(t) ).

At any time t, MOSS pulls an arm that has the maximum value of Ia. One reason MOSS has a better expected regret bound than UCB is that the estimation precision when na(t) is large is better in MOSS than in UCB. In other words, the bonus goes to 0 more quickly in MOSS than in UCB as na(t) increases.

Algorithm 5: MOSS Algorithm


Ia = ∞ for all arms a;
/* Initialization */
while #-rounds < T do
Pull an arm that has the highest value of Ia ;
end
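The only change from UCB is the bonus; here is a short sketch of the MOSS index computation (illustrative; the function name is an assumption and counts n_a(t) are assumed positive).

```python
import numpy as np

def moss_index(mu_hat, counts, T, K):
    """I_a(t) = mu_hat_a(t) + sqrt(max(log(T / (K * n_a(t))), 0) / n_a(t))."""
    counts = np.asarray(counts, dtype=float)
    bonus = np.sqrt(np.maximum(np.log(T / (K * counts)), 0.0) / counts)
    return np.asarray(mu_hat, dtype=float) + bonus

# Example: with T = 10000 and K = 10, a heavily pulled arm gets (almost) no bonus.
print(moss_index([0.5, 0.5], [10, 2000], T=10000, K=10))
```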

Theorem 8. The expected regret E[R(T)] of MOSS satisfies

E[R(T)] = O(√(KT))

Proof. We will use a trick that is frequently employed in the analysis of randomized algorithms. Instead of sampling from the distribution Da at the time when arm a is pulled, we assume that T independent samples from every distribution Da have already been drawn before the start of the algorithm. For each arm a, let ra1, . . . , raT be the T independent samples drawn from the distribution Da. Now when the algorithm pulls arm a, we provide a sample (to the algorithm) from ra1, . . . , raT. In particular, the sample provided to the algorithm for the i-th pull of arm a is rai.
Let us define some new notation. For any 1 ≤ x ≤ T, let µ̂a^x = Σ_{j=1}^{x} raj / x be the average of the first x rewards (of the T samples drawn beforehand). With respect to this new notation, note that µ̂a(t) = µ̂a^{na(t)}. Further, for any 1 ≤ x ≤ T, let Ia^x = µ̂a^x + √( max(log(T/(Kx)), 0) / x ). Again note that Ia(t) = Ia^{na(t)}. We will call Ia(t) the index of arm a after time t.
Let δ = max{ µ∗ − min_{1≤x≤T} Ia∗^x , 0 }. Note that δ is a random variable. By definition, the index of the best arm will never be less than µ∗ − δ, i.e., Ia∗(t) ≥ µ∗ − δ for all t. We will prove later that E[δ] ≤ 10·√(K/T). It will be helpful to keep this fact in mind.
Let us call an arm a Good if △a ≤ 5·√(K/T) (note that this is different from the previous algorithms, where Good and Bad were events). An arm a is called Bad if △a > 5·√(K/T). It will soon be clear why we have defined Good and Bad arms in this way. Recall that R(T) = Σa Ra(T) where Ra(T) = na(T)·△a.
R(T) = Σ_{a: a is Good} Ra(T) + Σ_{a: a is Bad} Ra(T)
     ≤ Σ_{a: a is Good} na(T)·5·√(K/T) + Σ_{a: a is Bad} Ra(T)
     ≤ 5·√(K/T)·Σ_{a: a is Good} na(T) + Σ_{a: a is Bad} Ra(T)
     ≤ 5·√(K/T)·T + Σ_{a: a is Bad} Ra(T)
     = 5·√(KT) + Σ_{a: a is Bad} Ra(T)
Thus it suffices to show Σ_{a: a is Bad} Ra(T) = O(√(KT)). Let us introduce a few more notations. For any bad arm a, we define a value ka (which is a random variable) as follows:

ka = |{1 ≤ x ≤ T : Ia^x > µa + △a/2}|

We also define J, which is a random subset of the bad arms, as follows:

J = {a ∈ [K] : a is Bad and △a > 2δ}

A very important observation is that for any arm in J, we have na(T) ≤ ka. This is the main crux of the analysis: directly bounding E[na(T)] is difficult, but later we will be able to bound E[ka] for bad arms. Now

Σ_{a: a is Bad} Ra(T) = Σ_{a∈J} na(T)·△a + Σ_{a∉J: a is Bad} na(T)·△a
                      ≤ Σ_{a: a is Bad} ka·△a + 2δT

Now

E[ Σ_{a: a is Bad} Ra(T) ] ≤ Σ_{a: a is Bad} E[ka]·△a + 2·E[δ]·T
                           ≤ Σ_{a: a is Bad} E[ka]·△a + 20·√(KT)

as we earlier claimed (without proof) that E[δ] ≤ 10·√(K/T). Thus it suffices to show that Σ_{a: a is Bad} E[ka]·△a = O(√(KT)). Recall that for any event A, we use 1A for the indicator r.v. that takes the value 1 if A happens and 0 otherwise.
For any bad arm a, let x0 = 8·log(T·△a²/K)/△a². We have

E[ka] = E[ |{1 ≤ x ≤ T : Ia^x > µa + △a/2}| ]
      = E[ Σ_{x=1}^{T} 1{Ia^x > µa + △a/2} ]
      = Σ_{x=1}^{T} E[ 1{Ia^x > µa + △a/2} ]
      = Σ_{x=1}^{T} Pr( Ia^x > µa + △a/2 )
      = Σ_{x=1}^{T} Pr( µ̂a^x + √( max(log(T/(Kx)), 0)/x ) > µa + △a/2 )
      = Σ_{x=1}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )
      ≤ Σ_{x=1}^{x0} 1 + Σ_{x=x0}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )
      = x0 + Σ_{x=x0}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )

As the arm a in the above calculation is Bad, we have △a > 5·√(K/T). This implies that for x ≥ x0 = 8·log(T·△a²/K)/△a², we have

max(log(T/(Kx)), 0)/x ≤ max(log(T/(K·x0)), 0)/x0
                      = (△a²/8) · log( T·△a² / (8K·log(T·△a²/K)) ) / log(T·△a²/K)
                      ≤ △a²/8

The last inequality holds because log(T·△a²/K) > 1 as △a > 5·√(K/T). Therefore

Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) ) ≤ Pr( µ̂a^x − µa > △a/2 − △a/(2√2) )
                                                    ≤ 2·exp(−2c²·△a²·x)

where c = (1/2)·(1 − 1/√2).
Now

E[ka] ≤ x0 + Σ_{x=x0}^{T} 2·exp(−2c²·△a²·x)
      ≤ x0 + Σ_{x=x0}^{∞} 2·exp(−2c²·△a²·x)
      = x0 + 2·exp(−2c²·△a²·x0) / (1 − exp(−2c²·△a²)),

where the last equality comes from the geometric series summation.
As exp(−2c²·△a²·x0) < 1, we have

E[ka] ≤ 8·log(T·△a²/K)/△a² + 2/(1 − exp(−2c²·△a²))

And hence

△a·E[ka] ≤ 8·log(T·△a²/K)/△a + 2△a/(1 − exp(−2c²·△a²))

Using 1 − e^{−y} ≥ y − y²/2 (with y = 2c²·△a²), we have

△a·E[ka] ≤ 8·log(T·△a²/K)/△a + 2△a/(2c²·△a² − 2c⁴·△a⁴)
         = 8·log(T·△a²/K)/△a + 1/(c²·△a·(1 − c²·△a²))
         ≤ 8·log(T·△a²/K)/△a + 1/(c²·△a·(1 − c²))
         ≤ 8·log(T·△a²/K)/△a + √T/(5·c²·(1 − c²)·√K)

One can check that the maximum value of f(y) = log(T·y²/K)/y over y ∈ (0, 1] is O(√(T/K)). Thus

Σ_{a: a is Bad} △a·E[ka] ≤ Σ_{a: a is Bad} O(√T/√K) = O(√(KT))

All that remains is to show that E[δ] ≤ 10·√(K/T).

5 Lower Bounds
We started with ETC and showed that its expected regret is O(T^{2/3} K^{1/3} (log T)^{1/3}). Then we showed that the expected regret of Successive Elimination and UCB is O(√(KT log T)). Finally, we showed that MOSS beats UCB and Successive Elimination and has expected regret O(√(KT)). What next? Can we have an algorithm with even better regret than MOSS, i.e., o(√(KT))? If researchers are unable to find a better regret bound, can we conclude that a better bound is not possible? Surely not. It could be a lack of understanding or knowledge preventing us from coming up with a better algorithm.
Now our focus will be lower bounds, i.e., we will show that no algorithm can have a better regret bound than some threshold. In fact we will prove the following theorem.
Theorem 9. No algorithm (deterministic or randomized) can have expected regret of o(√(KT)) for all instances.
Note that the possibilities of algorithms are infinite. An algorithm can per-
haps do anything. Formally, a deterministic algorithm is any function that takes
as input K, T and for any 1 ≤ t ≤ T maps a sequence (of pairs consisting of arm
and received reward) {(a1 , r1 ), (a2 , r2 ), . . . , (at−1 , rt−1 )} to some arm at ∈ [K].
Since rewards can take any value in [0, 1], the number of distinct algorithms is
infinite. Thus, it is not possible to prove the above theorem by going over the algorithms one by one and showing that each particular algorithm fails to have a better regret bound (as the number of such algorithms is infinite).
Not surprisingly, we will need new tools to prove such a lower bound. We will now study elementary tools from information theory which will be sufficient to prove the above theorem. In fact we will not directly prove Theorem 9. First we will consider a toy problem (the coin testing problem from the exercises) and show a lower bound for this problem. This itself will require new tools from information theory. Then, after getting some familiarity with these tools, we will prove Theorem 9.
Testing Coin(ϵ, δ) Problem: It is promised that a given coin is either fair, i.e., p = 1/2, or ϵ-biased, i.e., p = 1/2 + ϵ (the algorithm knows ϵ). Give an algorithm A that takes samples from the coin (as input) and has the following guarantee:
• If p = 1/2, the algorithm must say 'Fair' with probability at least 1 − δ.
• If p = 1/2 + ϵ, the algorithm must say 'Biased' with probability at least 1 − δ.
From Hoeffding's inequality, it is easy to show the following:
Lemma 1. There is a deterministic algorithm that solves the Testing Coin problem and needs only O(1/ϵ²) samples for any constant δ < 1/2.
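A sketch of the Hoeffding-based tester behind Lemma 1 (illustrative; the exact constant in the number of samples is an assumption, chosen so that Hoeffding's inequality gives error probability at most δ).

```python
import math
import random

def test_coin(sample, eps, delta):
    """sample() returns 1 for Heads, 0 for Tails. Decides 'Fair' (p = 1/2)
    vs 'Biased' (p = 1/2 + eps) with error probability at most delta."""
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)  # O(1/eps^2) samples
    p_hat = sum(sample() for _ in range(m)) / m
    # If p = 1/2, Pr(p_hat >= 1/2 + eps/2) <= exp(-m*eps^2/2) <= delta;
    # symmetrically if p = 1/2 + eps.
    return "Biased" if p_hat >= 0.5 + eps / 2 else "Fair"

# Example: a coin with p = 1/2 + 0.1.
rng = random.Random(1)
print(test_coin(lambda: int(rng.random() < 0.6), eps=0.1, delta=0.05))
```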
We will now show the following lower bound:
Lemma 2. Let ϵ ≤ 1/3. Any deterministic algorithm needs at least (1 − 2δ)²/(5ϵ²) samples to solve the Testing Coin(ϵ, δ) problem.

Now the goal is to prove Lemma 2. As said before, we begin with under-
standing new tools.

Total Variation Distance and KL divergence


Recall that a probability distribution P over a finite domain Ω is a function P : Ω → [0, 1] that maps each ω ∈ Ω to a real in [0, 1] satisfying:
1. P(ω) ≥ 0 for all ω ∈ Ω
2. Σ_{ω∈Ω} P(ω) = 1.
Now we will define a notion of distance between two probability distributions. Any notion of distance d must necessarily satisfy these two properties: d(P, P) = 0 for any distribution P, and d(P, Q) > 0 for any two distinct distributions P and Q. There can be many ways to define a distance. For instance, we can consider the (squared) Euclidean or ℓ2 distance d(P, Q) = Σω (P(ω) − Q(ω))², or we can consider the ℓ1 norm, called the total variation distance, d(P, Q) = (1/2)·Σω |P(ω) − Q(ω)|. Each notion of distance has its own merits and limitations. For us, the notion of total variation distance, also known as statistical distance, will be important.
Definition 1. For any two distributions P and Q defined on a finite domain
Ω, the total variation distance between P and Q, denoted as dtv (P, Q) is defined
as
dtv(P, Q) = (1/2)·Σ_{ω∈Ω} |P(ω) − Q(ω)|

It is easy to see that:


1. dtv (P, Q) = 0 if P = Q.
2. dtv (P, Q) > 0 for any P ̸= Q.
3. dtv (P, Q) = dtv (Q, P ).
4. dtv (P, Q) ≤ dtv (P, R) + dtv (R, Q) (triangle inequality)
Now we are going to see one of the most useful and remarkable properties of total variation distance.
Lemma 3.

dtv(P, Q) = max_{S⊆Ω} |P(S) − Q(S)|        (1)

where P(S) = Σ_{ω∈S} P(ω) and Q(S) = Σ_{ω∈S} Q(ω).
Now we define the KL divergence between two distributions.
Definition 2. The KL divergence KL(P, Q) is defined as

KL(P, Q) = Σ_{ω∈Ω} P(ω)·ln( P(ω)/Q(ω) )
The KL divergence satisfies:
Lemma 4.
• KL(P, P) = 0
• KL(P, Q) > 0 if P ̸= Q
However, note that in general KL(P, Q) ̸= KL(Q, P), and the KL divergence does not satisfy the triangle inequality. That is why it is called a divergence, not a distance. The usefulness of the KL divergence comes from the fact that the KL divergence of a product distribution just adds up. Let us first see what a product distribution is and then the utility of the KL divergence (I talked about product distributions while discussing independent trials).
Definition 3. Given two distributions P and Q on the domains Ω1 and Ω2
respectively, then P × Q is a distribution on domain Ω1 × Ω2 such that
(P × Q)(ω1 , ω2 ) = P (ω1 )Q(ω2 ) ∀ω1 ∈ Ω1 , ω2 ∈ Ω2
In general, if Pi is a distribution on Ωi then the product distribution P1 × P2 ×
· · · × Pm is a distribution on domain Ω1 × · · · × Ωm such that
(P1 × P2 × · · · Pm )(ω1 , ω2 , . . . ωm ) = P1 (ω1 )P2 (ω2 ) · · · Pm (ωm )
We will use, for any distribution P, P^{⊗2} for P × P. In general, P^{⊗m} = P × · · · × P (m times).
The product distribution corresponds to independent trials. Drawing m independent samples from a distribution P is exactly the same as drawing one sample from P^{⊗m}. Now we will see why the KL divergence is extremely useful.
Lemma 5. If for all i ∈ [m], Pi and Qi are distributions on Ωi then
KL(P1 ×P2 ×· · ·×Pm , Q1 ×Q2 ×· · ·×Qm ) = KL(P1 , Q1 )+KL(P2 , Q2 )+· · ·+KL(Pm , Qm )
Corollary 1.
KL(P ⊗m , Q⊗m ) = m · KL(P, Q)
The above property is not true for total variation distance. We do not
have dtv (P ⊗m , Q⊗m ) = m dtv (P, Q). In fact, other than dtv (P ⊗m , Q⊗m ) ≤
m dtv (P, Q), we do not know of any other inequality between these two.
We have now seen both the total variation distance and the KL divergence, each having its own merits and limitations. Now we will see that these two are connected by an inequality, which forms the basis of powerful applications.
Lemma 6 (Pinsker's inequality). For any two probability distributions P and Q, we have

dtv(P, Q) ≤ √( (1/2)·KL(P, Q) )
The following is also useful.
Lemma 7 (The Bretagnolle–Huber bound).

dtv(P, Q) ≤ √(1 − e^{−KL(P,Q)}) ≤ 1 − (1/2)·e^{−KL(P,Q)}
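A small numeric sketch (illustrative, not from the notes) that computes dtv and KL for discrete distributions and checks Pinsker's and the Bretagnolle–Huber bounds, here for (1/2, 1/2) versus (1/2 + ϵ, 1/2 − ϵ).

```python
import numpy as np

def dtv(P, Q):
    return 0.5 * np.sum(np.abs(np.asarray(P, dtype=float) - np.asarray(Q, dtype=float)))

def kl(P, Q):
    # Assumes strictly positive probabilities.
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sum(P * np.log(P / Q)))

eps = 0.1
P = [0.5, 0.5]
Q = [0.5 + eps, 0.5 - eps]

d, k = dtv(P, Q), kl(P, Q)
print("dtv =", d)                                     # equals eps here
print("Pinsker bound:", np.sqrt(0.5 * k))             # >= dtv
print("Bretagnolle-Huber:", np.sqrt(1 - np.exp(-k)))  # >= dtv
```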
Now we have all the necessary tools to prove Lemma 2. First we will prove the lemma for deterministic algorithms.

Proof of Lemma 2. Suppose there is a deterministic algorithm A that solves the Testing Coin problem and uses m independent coin tosses. Note that the algorithm A is just a function A : {H, T}^m → {Fair, Biased} that maps a sequence of heads and tails of length m to Fair or Biased. We partition {H, T}^m into two subsets as follows: Af = {y ∈ {H, T}^m : A(y) = Fair} and Ab = {y ∈ {H, T}^m : A(y) = Biased}, i.e., Af consists of all those sequences that get mapped to Fair and Ab consists of all those sequences that get mapped to Biased.
Let p = Pr(Head). Let (1/2)^{⊗m} and (1/2 + ϵ)^{⊗m} be the product distributions corresponding to m independent coin tosses with p = 1/2 and p = 1/2 + ϵ respectively. For convenience, let Df = (1/2)^{⊗m} and Db = (1/2 + ϵ)^{⊗m}. We will bound the tv distance between Df and Db in two different ways.
Note that Df(Af) = Σ_{ω∈Af} Df(ω) ≥ 1 − δ and Db(Af) = Σ_{ω∈Af} Db(ω) ≤ δ (because the algorithm A solves the Testing Coin(ϵ, δ) problem). Thus by Lemma 3,

dtv(Df, Db) ≥ Df(Af) − Db(Af) ≥ 1 − 2δ.
Now let us bound the tv distance using Pinsker's inequality. As said earlier, the calculation of the KL divergence between Df and Db is easy as they are both product distributions.

KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) = m · KL(1/2, 1/2 + ϵ)

where KL(1/2, 1/2 + ϵ) is the KL divergence between the distributions (1/2, 1/2) and (1/2 + ϵ, 1/2 − ϵ).

KL(1/2, 1/2 + ϵ) = (1/2)·ln( 1/(2(1/2 + ϵ)) ) + (1/2)·ln( 1/(2(1/2 − ϵ)) )
                 = (1/2)·ln( 1/(4(1/4 − ϵ²)) ) = (1/2)·ln( 1/(1 − 4ϵ²) )
                 = (1/2)·( 4ϵ² + (4ϵ²)²/2 + (4ϵ²)³/3 + . . . ) = 2ϵ²·(1 + 2ϵ² + 16ϵ⁴/3 + · · · ) ≤ 10ϵ²

for ϵ ≤ 1/3 (probably 10 can be replaced by a smaller number, but as we are not optimizing constants I put a bigger number here so that the inequality is easy to derive).
Now we will use Pinsker's inequality to bound the tv distance.

dtv( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ≤ √( (1/2)·KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ) ≤ √(5mϵ²)

Now we have

1 − 2δ ≤ dtv(Df, Db) ≤ √( (1/2)·KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ) ≤ √(5mϵ²)

which implies that m ≥ (1 − 2δ)²/(5ϵ²).

Lower Bound for randomized algorithms
Lemma 8. Let ϵ ≤ 1/3. The expected number of coin tosses by any randomized algorithm that solves the Testing Coin(ϵ, δ) problem is at least (1 − 6δ)²/(15ϵ²).

In the class, I discussed that any randomized algorithm R is just a distribution over some deterministic algorithms, i.e., for any randomized algorithm R, there exist deterministic algorithms A1, . . . , As (for some finite s) and a distribution (q1, . . . , qs) over these deterministic algorithms such that R picks an Ai (i ∈ [s]) with probability qi and then runs the deterministic algorithm Ai on the given input.

Proof of Lemma 8. Let R be any randomized algorithm that solves Testing Coin(ϵ, δ). Let A1, . . . , As be deterministic algorithms, each requiring at most m1, m2, . . . , ms coin tosses respectively, and let R pick each Ai with probability qi. We will show that the expected number of coin tosses of R is at least (1 − 6δ)²/(15ϵ²). Consider a table with two rows and s columns: one row corresponding to p = 1/2 (called row 1) and one corresponding to p = 1/2 + ϵ (called row 2), and s columns corresponding to the deterministic algorithms A1, . . . , As.
We put ✓ in the cell corresponding to row 1 and column Aj if Pr(Aj returns 'Biased' when p = 1/2) ≤ 3δ. Otherwise we put ×, i.e., if Pr(Aj returns 'Biased' when p = 1/2) > 3δ. For row 2, we put ✓ in the cell corresponding to row 2 and column Aj if Pr(Aj returns 'Fair' when p = 1/2 + ϵ) ≤ 3δ. Otherwise we put ×, i.e., if Pr(Aj returns 'Fair' when p = 1/2 + ϵ) > 3δ. By cell(i, Aj) (i ∈ {1, 2}, j ∈ [s]), we mean the symbol (✓ or ×) in the cell corresponding to row i and column Aj in the above table.
Now,

δ ≥ Pr(R returns 'Biased' when p = 1/2)   (because R solves Testing Coin(ϵ, δ))
  = Σj qj·Pr(Aj returns 'Biased' when p = 1/2)
  ≥ Σ_{j: cell(1,Aj)=×} qj·3δ
⟹ Σ_{j: cell(1,Aj)=×} qj ≤ 1/3

Similarly,

Σ_{j: cell(2,Aj)=×} qj ≤ 1/3

This means that Σ_{j: cell(1,Aj)=✓ and cell(2,Aj)=✓} qj ≥ 1 − (1/3 + 1/3) = 1/3. That is, R puts at least 1/3 probability on those Aj for which both Pr(Aj returns 'Biased' when p = 1/2) ≤ 3δ and Pr(Aj returns 'Fair' when p = 1/2 + ϵ) ≤ 3δ are true. Let J = {j : cell(1, Aj) = ✓ and cell(2, Aj) = ✓}. Note that for any j ∈ J, Aj is a deterministic algorithm that solves the Testing Coin(ϵ, 3δ) problem. By Lemma 2, mj ≥ (1 − 2(3δ))²/(5ϵ²) = (1 − 6δ)²/(5ϵ²) (recall that mj is the number of coin tosses by Aj). Therefore, the expected number of coin tosses by R is Σj qj·mj ≥ (1/3)·(1 − 6δ)²/(5ϵ²) = (1 − 6δ)²/(15ϵ²).

Remark. I presented Lemma 8 and its proof in the class in a slightly different way. Lemma 8 is stronger than the version I stated in the class. There, I assumed that the running time (or the number of samples) of R is deterministic, so each of A1, . . . , As uses a fixed number of samples. As the above proof shows, we can show a lower bound on the expected number of samples (by the same proof).

Lower Bounds for Bandits


We start by showing lower bounds when there are only two arms. The best upper bound in this case is O(√T). We show a matching lower bound.
Theorem 10. No algorithm can achieve an expected regret of O(T 0.5−α ) for
any constant α > 0 for all instances.
Proof. Suppose there exists an algorithm B that has an expected regret of O(T^{0.5−α}) for some α > 0 for all instances. We will show that, using B, we can get a randomized algorithm R that solves Testing Coin(ϵ, 1/50) in o(1/ϵ²) coin tosses, contradicting Lemma 8.
Let ϵ < 1/3 be the input of the Testing Coin(ϵ, δ) problem. Consider the following bandit instance:
• K = 2 (there are two arms).
• Arm 1's reward takes value either 0 or 1. The probability of the reward being 1 is the probability that the coin (of Testing Coin(ϵ, 1/50)) outputs a Head.
• Arm 2's reward distribution is a Bernoulli with mean reward 1/2 + ϵ/2, i.e., the reward is either 0 or 1 and the probability of the reward being 1 is 1/2 + ϵ/2.
• T = (200/ϵ)^{1/(0.5+α)}
Note that in the above T = o(1/ϵ²). Now we describe a randomized algorithm R. The algorithm R runs B on the above bandit instance (recall that we assumed there exists an algorithm B that has an expected regret of O(T^{0.5−α}) for some α > 0 for all bandit instances). Whenever B pulls arm 2, the algorithm R generates a sample from arm 2's reward distribution (Bernoulli with mean reward 1/2 + ϵ/2) and gives it to the bandit algorithm B (this uses the internal randomness of the randomized algorithm R). Whenever B pulls arm 1, R uses a coin toss of the input coin (of Testing Coin(ϵ, δ)) to provide the reward to B: if the coin toss outputs Head, it provides a reward of 1 to B, and if the result of the coin toss is Tail, R provides a reward of 0 to B. Finally, R returns 'Fair' if B (after T rounds) pulls arm 2 at least T/2 times, and otherwise returns 'Biased' (i.e., if B pulls arm 1 more than T/2 times).

Note that R uses at most T = o(1/ϵ²) coin tosses. Now we prove the correctness of the algorithm R: if the coin is fair, i.e., p = Pr(Head) = 1/2, then B pulls arm 2 more than T/2 times (and hence R outputs 'Fair') with probability at least 49/50, and if the coin is biased, i.e., p = 1/2 + ϵ, then B pulls arm 1 more than T/2 times (and hence R outputs 'Biased') with probability at least 49/50. In other words, B will pull the best arm more than T/2 times with probability at least 49/50. To see this, note that in both cases E[R(T)] = (ϵ/2)·E[Ns], where Ns is the number of times B pulls the suboptimal arm (which is arm 1 when p = 1/2 and arm 2 when p = 1/2 + ϵ). Now we have

E[R(T)] = (ϵ/2)·E[Ns] ≤ T^{1/2−α}  ⟹  E[Ns] ≤ (2/ϵ)·T^{1/2−α} = (T^{1/2+α}/100)·T^{1/2−α} = T/100

By Markov's inequality, we have Pr(Ns ≥ T/2) ≤ 1/50. Thus, R solves the Testing Coin(ϵ, 1/50) problem in T = o(1/ϵ²) coin tosses, which contradicts Lemma 8.
Now our goal will be to show the lower bound for K arms.

Theorem 11. No algorithm can have an expected regret of ≤ √(KT)/5000 on all bandit instances.
To prove the above theorem, we first need to define Biased-Coin Identifica-
tion problem.

Biased-Coin Identification: Input is K coins C1 , . . . , CK and ϵ > 0. It


is promised that there is exactly one biased coin, for which Pr(Head) = 1/2 + ϵ, and the rest are fair coins. The goal of the algorithm is to return the index of the biased coin with probability at least 4/5.
Formally, let P^i = (1/2, . . . , 1/2 + ϵ, . . . , 1/2) be the K-tuple where every coordinate is 1/2 except the i-th one, which is 1/2 + ϵ. If the tuple corresponding to the input coins' probabilities is P^i, then the algorithm should return i with probability at least 4/5, i.e., Pr_{P^i}(y = i) ≥ 4/5, where y denotes the output of the algorithm and Pr_{P^i}(y = i) denotes the probability of the event y = i when the input coins probability tuple is P^i.
Theorem 12. Let ϵ ≤ 1/3. Any deterministic algorithm requires more than K/(1000ϵ²) total coin tosses to solve the Biased-Coin Identification problem.
We will prove the above theorem later. First we prove Theorem 11 assuming the above theorem.

Proof of Theorem 11 (assuming Theorem 12)


Proof. Suppose there is such an algorithm B. We now show that, using B, we can solve the Biased-Coin Identification problem with K/(2000ϵ²) total coin tosses, contradicting Theorem 12.
We run B on a bandit instance with K arms and T = K/(2000ϵ²). Whenever B pulls an arm j, we toss the coin Cj and provide to B a reward of 1 if the coin toss results in Head and a reward of 0 if the coin toss results in Tail. Finally, we return the index of the arm that has been pulled the highest number of times.
It is easy to see that if the i-th coin is biased, then the reward distribution (of the bandit instance given as input to B) of every arm except the i-th one is a Bernoulli with mean 1/2, and the reward distribution of the i-th arm is a Bernoulli with mean 1/2 + ϵ.
Note that for any of the above bandit instances, we have E[R(T)] = ϵ·(T − E[No]), where No is the number of times the optimal arm is pulled by the algorithm. By our assumption, E[R(T)] ≤ √(KT)/5000, which gives T − E[No] ≤ √(KT)/(5000ϵ) ≤ T/100. By Markov's inequality, Pr(T − No ≥ T/2) ≤ 1/50. In other words, with probability at least 49/50, B will pull the optimal arm strictly more than T/2 times (which also implies that the optimal arm will be pulled the highest number of times). Hence with probability at least 49/50 > 4/5, the output of the algorithm will be the index of the optimal arm, which is also the index of the biased coin. Moreover, our algorithm uses only K/(2000ϵ²) total coin tosses.

Proof of Theorem 12
Let A be an algorithm that solves the Biased-Coin Identification problem with m < K/(1000ϵ²) total coin tosses.
We will now see that A can be viewed as a complete binary tree of height m, where recall that m is the total number of coin tosses by the algorithm. All nodes of this tree are labelled by some index i ∈ [K]. From any internal node, the edge to the left child is labelled by H and the edge to the right child is labelled by T (this ordering is arbitrary). The algorithm starts from the root and reaches a leaf. When the algorithm reaches an internal node labelled by i, it means that the algorithm tosses the coin Ci. If the result of the coin toss is H, the algorithm moves to the left child, and if the result is T, it goes to the right child. In this way, the algorithm starts from the root node and reaches some leaf node. If the algorithm reaches a leaf with label i, it returns i. Below we define some notions related to this tree.

Notations: The label of a node v will be denoted by L(v) (which will be in [K]). The height of a node v will be denoted by h(v). By convention, we assume the height of the root node is 0, so the height of the leaf nodes is m. For any node v, we will use ℓ(v) for the left child of v and r(v) for the right child of v. The set of all nodes at height h will be denoted by Vh, and moreover Vh = (vh1, vh2, . . . , vh2^h). Note that |Vh| = 2^h.
Let y denote the output of the algorithm (which will be in [K]). Recall that P^i = (1/2, . . . , 1/2 + ϵ, . . . , 1/2) is the K-tuple where every coordinate is 1/2 except the i-th one, which is 1/2 + ϵ. For any node v, we use PrQ(v) for the probability that the algorithm reaches node v when the input coins probability tuple is Q. Moreover, pQ(v, ℓ(v)) is the probability that the algorithm moves to ℓ(v) (the left child of v) given that it is currently at v, when the input coins probability tuple is Q. Similarly, pQ(v, r(v)) is the probability that the algorithm moves to r(v) (the right child of v) given that it is currently at v, when the input coins probability tuple is Q. Let DQ^h = (PrQ(vh1), . . . , PrQ(vh2^h)) be the distribution induced on the nodes at height h when the input coins probability tuple is Q. Note that Σ_{k=1}^{2^h} PrQ(vhk) = 1 for all Q (as the algorithm reaches exactly one node at any height), so this is indeed a probability distribution.
Since A solves the Biased-Coin Identification problem, we must have Pr_{P^i}(y = i) ≥ 4/5 for all i ∈ [K].
Claim 1. dtv(D_{P^i}^m, D_{P^j}^m) ≥ 3/5 for all i ̸= j.

Proof. Consider the sets Si = {v ∈ Vm : L(v) = i} and Sj = {v ∈ Vm : L(v) = j} (Si is the set of all leaves with label i; similarly Sj is the set of all leaves with label j). We have D_{P^i}^m(Si) = Σ_{v∈Si} D_{P^i}^m(v) = Pr_{P^i}(y = i) ≥ 4/5 (because Pr_{P^i}(y = i) = Σ_{v∈Si} Pr_{P^i}(v) = Σ_{v∈Si} D_{P^i}^m(v)). Similarly, D_{P^j}^m(Sj) ≥ 4/5 and hence D_{P^j}^m(Si) ≤ 1/5. By Lemma 3, we have dtv(D_{P^i}^m, D_{P^j}^m) ≥ |D_{P^i}^m(Si) − D_{P^j}^m(Si)| ≥ 3/5.

Now we would like to bound KL(D_{P^i}^m, D_{P^j}^m). Unfortunately, bounding this directly is not easy. We will use F = (1/2, . . . , 1/2) (all coordinates 1/2) as a reference distribution. Let us calculate KL(DF^m, D_{P^j}^m). Our claim is that

KL(DF^m, D_{P^j}^m) = KL(DF^{m−1}, D_{P^j}^{m−1}) + Σ_{v∈V_{m−1}} (1/2^{m−1})·KL( (1/2, 1/2), (p_{P^j}(v, ℓ(v)), p_{P^j}(v, r(v))) )

Before proving the above, let us interpret it. The KL divergence of the distributions induced on the nodes at level m by F and P^j is equal to the divergence of the distributions induced on the nodes at level m − 1, plus the divergence between them conditioned on the fact that the algorithm is currently at height m − 1. For clarity, from now on we will use pj(v, ℓ(v)) for p_{P^j}(v, ℓ(v)) and pj(v, r(v)) for p_{P^j}(v, r(v)).
Let us prove the above.

KL(DF^m, D_{P^j}^m) = Σ_{v∈Vm} (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)) )
 = Σ_{v∈V_{m−1}} [ (1/2^m)·ln( 1/(2^m·Pr_{P^j}(ℓ(v))) ) + (1/2^m)·ln( 1/(2^m·Pr_{P^j}(r(v))) ) ]
 = Σ_{v∈V_{m−1}} [ (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)·pj(v, ℓ(v))) ) + (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)·pj(v, r(v))) ) ]
 = Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2^{m−1}·Pr_{P^j}(v)) ) + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2·pj(v, ℓ(v))) )
   + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2^{m−1}·Pr_{P^j}(v)) ) + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2·pj(v, r(v))) )
 = Σ_{v∈V_{m−1}} (1/2^{m−1})·ln( 1/(2^{m−1}·Pr_{P^j}(v)) )
   + Σ_{v∈V_{m−1}} (1/2^{m−1})·[ (1/2)·ln( 1/(2·pj(v, ℓ(v))) ) + (1/2)·ln( 1/(2·pj(v, r(v))) ) ]
 = KL(DF^{m−1}, D_{P^j}^{m−1}) + Σ_{v∈V_{m−1}} (1/2^{m−1})·KL( (1/2, 1/2), (pj(v, ℓ(v)), pj(v, r(v))) )

Note that KL((1/2, 1/2), (pj(v, ℓ(v)), pj(v, r(v)))) = 0 if L(v) ̸= j, and otherwise it is at most 10ϵ² (earlier, in the proof of Lemma 2, we showed that KL((1/2, 1/2), (1/2 + ϵ, 1/2 − ϵ)) ≤ 10ϵ² for ϵ ≤ 1/3). So we have

KL(DF^m, D_{P^j}^m) ≤ KL(DF^{m−1}, D_{P^j}^{m−1}) + (1/2^{m−1})·10ϵ²·|{v ∈ V_{m−1} : L(v) = j}|

This implies that

Σ_{j=1}^{K} KL(DF^m, D_{P^j}^m) ≤ Σ_{j=1}^{K} KL(DF^{m−1}, D_{P^j}^{m−1}) + 10ϵ² ≤ 10mϵ²

where the last inequality is by repeated application of the first inequality. Now applying Pinsker's inequality, we have

2·Σ_{j=1}^{K} dtv(DF^m, D_{P^j}^m)² ≤ 10mϵ²  ⟹  ( Σ_{j=1}^{K} dtv(DF^m, D_{P^j}^m)² )/K < 1/200

since we assumed that m < K/(1000ϵ²).

Since the average of dtv(DF^m, D_{P^1}^m)², . . . , dtv(DF^m, D_{P^K}^m)² is less than 1/200, there exist at least K/2 values of j ∈ [K] such that dtv(DF^m, D_{P^j}^m)² < 2·(1/200) = 1/100. Let i ̸= k be two such values. We have dtv(DF^m, D_{P^i}^m)² < 1/100 and dtv(DF^m, D_{P^k}^m)² < 1/100. Hence dtv(DF^m, D_{P^i}^m) < 1/10 and dtv(DF^m, D_{P^k}^m) < 1/10. Since dtv satisfies the triangle inequality, we have dtv(D_{P^k}^m, D_{P^i}^m) < 1/5, which contradicts Claim 1.

Instance dependent lower bound
Recall that the expected regret of an algorithm depends on the instance. We will use E[R(T, I)] for the expected regret of the algorithm on the instance I. Note that we have proved max_{I∈I} E[R(T, I)] = O(√(KT log T)) for UCB/Successive Elimination, where I is the set of all bandit instances. Moreover, we have also seen the following instance-dependent upper bound for both the UCB and Successive Elimination algorithms: E[R(T, I)] = O(log T)·Σ_{a: △a>0} 1/△a. We will now prove a matching instance-dependent lower bound.
Theorem 13. Let I be the set of all bandit instances (where recall that a bandit instance consists of values of K, T and K reward distributions D1, . . . , DK).
1. There cannot exist an algorithm for which E[R(T, I)] ≤ f(T)·cI for all bandit instances I ∈ I, where f(T) = o(log T) and cI ≥ 0 depends on I only (not on T).
2. There cannot exist an algorithm for which E[R(T)] ≤ (log T/10000)·Σ_{a: △a>0} 1/△a for all bandit instances.
Proof. Let us first prove the first part. Suppose there exists such an algorithm
B. Consider two bandit instances, both on 2 arms.
1. In the first instance, denoted by w, first arm is Bernoulli with a mean of
1/2 and the second arm is a Bernoulli with a mean of 0.4.
2. In the second instance, denoted by w′ , first arm is Bernoulli with a mean
of 1/2 and the second arm is a Bernoulli with a mean of 0.6.
Since all rewards are either 0 or 1, the run of the algorithm B on these two
instances can be seen as a binary tree of height T in a natural way. All internal
nodes of this tree are labelled by either 1 or 2 (representing the pull of the arm
1 or 2 respectively), while the leaves are unlabelled. From any internal node,
the edge to left child is labelled by 0 and the edge to right child is labelled by
1 (this ordering is arbitrary). The algorithm starts from the root and reaches a leaf.
When the algorithm reaches any internal node labelled by i ∈ {1, 2} then it
means that the algorithm pulls the arm i. If the reward received is 0 then the
algorithm moves to the left child and if the reward received is 1 then it goes to
the right child. In this way, algorithm starts from the root node and reaches
some leaf node. Recall the notations Vh , L(v), . . . etc. used for the tree from
the previous proof as we will follow the same notations for clarity.
Let E[R(T, w)] be the expected regret of the algorithm B on instance w after
round T and E[R(T, w′ )] be the expected regret of the algorithm B on instance
w′ after round T. For any i ∈ {1, 2}, let Ew[NiT] be the expected number of times arm i is pulled by algorithm B in T rounds when the instance is w. Similarly we define Ew′[N1T] and Ew′[N2T].
Note that E[R(T, w)] = 0.1·Ew[N2T] and E[R(T, w′)] = 0.1·Ew′[N1T]. By assumption, we also have E[R(T, w)] ≤ f(T)·cw for all T and E[R(T, w′)] ≤ f(T)·cw′ for all T, where f(T) = o(log T). Therefore, Ew[N2T] ≤ 10·cw·f(T) = o(log T) (as cw does not depend on T). Similarly, Ew′[N1T] = o(log T). Now we use Markov's inequality to get the following inequalities:

Prw(N2T ≥ T/2) ≤ 2·Ew[N2T]/T = o(log T)/T
Prw′(N2T ≥ T/2) ≥ 1 − Prw′(N1T ≥ T/2) ≥ 1 − 2·Ew′[N1T]/T ≥ 1 − o(log T)/T
We can see that for instance w, the event N2T ≥ T /2 happens with probabil-
ity at most o(log T )/T (which tends to 0 as T → ∞) whereas for the instance w′ ,
the same event happens with probability at least 1 − o(log T )/T (which tends
to 1 as T → ∞). This implies that the total variation distance between the
distributions induced on leaves (recall algorithm B is just a tree) by instance
w and w′ respectively is large. Formally let DvT and Dw T
′ be the distribution

induced on leaves for instance w and w respectively. Let S1 ⊆ VT be the set of
leaves such that at least T /2 + 1 nodes in the path from the root to the leaf are
labelled by 1 (in other words, if algorithm reaches any leaf in S1 then we have
N1T > T /2). Therefore S2 = V T \ S1 is the set of leaves such that at most T /2
nodes in the path from the root to the leaf are labelled by 1 (in other words, if
algorithm reaches any leaf in S2 then we have N1T ≤ T /2).
From Lemma 3,

T T T T o(log T ) o(log T ) o(log T )


dtv (Dw , Dw ′ ) ≥ Dw (S1 ) − Dw ′ (S1 ) = 1 − − =1−
T T T
Recall that in all previous lower bound proofs that we have seen, we used Pinsker's inequality dtv(P, Q) ≤ √((1/2)·KL(P, Q)) to relate the total variation distance and the KL divergence. Unfortunately, this inequality is not useful when P and Q are very far from each other (because when P and Q are far apart, though the tv distance is always at most 1, the KL divergence between them can be a large number and hence the above inequality will be vacuously true).
We will use the Bretagnolle–Huber bound: for any distributions P and Q,

dtv(P, Q) ≤ √(1 − e^{−KL(P,Q)}) ≤ 1 − (1/2)·e^{−KL(P,Q)}

Note that 1 − (1/2)·e^{−KL(P,Q)} is always < 1, and hence the above inequality is not vacuously true when dtv(P, Q) → 1 and KL(P, Q) → ∞.
Now

KL(Dw^T, Dw′^T) ≥ ln( 1/(2·(1 − dtv(Dw^T, Dw′^T))) ) ≥ ln( T/o(log T) )

for all T. This implies that

lim_{T→∞} KL(Dw^T, Dw′^T)/log T ≥ 1        (2)
By the chain rule, we also have (as we did in the previous lower bound proof)

KL(Dw^T, Dw′^T) = KL(Dw^{T−1}, Dw′^{T−1}) + Σ_{v∈V_{T−1}: L(v)=2} Prw(v)·KL(0.4, 0.6)

where KL(0.4, 0.6) denotes the KL divergence between arm 2's reward distribution under w (Bernoulli with mean 0.4) and under w′ (Bernoulli with mean 0.6); only nodes labelled 2 contribute, since arm 1 has the same reward distribution in both instances. Let d = KL(0.4, 0.6). By repeated application of the above equality, we have

KL(Dw^T, Dw′^T) = d · Σ_{h=0}^{T−1} Σ_{v∈Vh: L(v)=2} Prw(v)

Note that for any h, Σ_{v∈Vh: L(v)=2} Prw(v) is the probability that the algorithm B pulls arm 2 in round h + 1 (for instance w). Thus Σ_{h=0}^{T−1} Σ_{v∈Vh: L(v)=2} Prw(v) is just Ew[N2T] (hint: think of indicator random variables). Now

lim_{T→∞} KL(Dw^T, Dw′^T)/log T ≥ 1   (from inequality (2))
⟹ lim_{T→∞} Ew[N2T]/log T ≥ 1/d

Recall that earlier we showed that Ew[N2T] = o(log T), which is a contradiction to the above.
The proof of the second part is almost the same. Modify the instances w and w′ as follows, and the rest of the proof follows in a similar fashion:

1. In w, first arm is Bernoulli with a mean of 1/2 and the second arm is a
Bernoulli with a mean of 0.5 − ϵ.
2. In w′ , first arm is Bernoulli with a mean of 1/2 and the second arm is a
Bernoulli with a mean of 0.5 + ϵ.

6 Adversarial bandits
In adversarial bandits, there is no randomness in the received rewards. The
input is an unknown but fixed reward table (which is just a sequence of T
reward vectors r1 , . . . , rT where ri = (ri,1 , ri,2 , . . . , ri,K ) ∈ [0, 1]K ). As there
is no randomness in the received rewards, the randomness can come from the
algorithm only.
Input/Instance: K, T (known), unknown r1 , . . . , rT
In each round t ∈ [T ]:
1. algorithm picks an arm at ∈ [K].
2. algorithm receives the reward rtat .
How do we define regret? One way is as follows: the regret of an algorithm is the total reward received by the best algorithm that knows the reward table minus the (expected) reward received by the algorithm. The issue with this definition is that it makes the problem very hard, in the sense that any algorithm will have large regret (Ω(T)).
The regret of an algorithm is instead defined in the following way:

R(T) = max_a Σ_{t=1}^{T} rta − Σ_{t=1}^{T} rtat

That is, the regret of an algorithm is the total reward received by the best arm in hindsight minus the total reward received by the algorithm. There are several advantages of defining the regret in the above way:
1. We will be able to show algorithms with sublinear regret.
2. This can be a starting point to study stronger notions of regret (something in between the above two notions).
As will become clear soon, there is good reason to consider costs instead of rewards, with the goal of minimizing cost (rather than maximizing reward). This is just a matter of convenience, and to be in line with the existing literature we will consider costs.

Input/Instance: K, T (known), unknown cost vectors c_1, . . . , c_T where c_i = (c_{i1}, . . . , c_{iK})
In each round t ∈ [T]:
1. the algorithm picks an arm a_t ∈ [K].
2. the algorithm pays the cost c_{t,a_t}.
The regret of an algorithm is defined in the following way:
$$R(T) = \sum_{t=1}^T c_{t,a_t} \;-\; \min_a \sum_{t=1}^T c_{t,a}$$

Bandits with full feedback
An easier problem is bandits with full feedback, where after every round t the entire cost vector c_t is revealed. This is easier than adversarial bandits. As we will see later, it is also a special case of online learning.

Input/Instance: K, T (known), unknown cost vectors c_1, . . . , c_T
In each round t ∈ [T]:
1. the algorithm picks an arm a_t ∈ [K] and pays the cost c_{t,a_t}.
2. the algorithm gets to know the entire cost vector c_t.

A general strategy is to first design algorithms for bandits with full feedback and then modify them to work in the bandit setting. The following theorem shows that any deterministic algorithm has large regret, even for bandits with full feedback.

Theorem 14. Any deterministic algorithm will have a regret of at least T/2 on some instance, even with 2 arms and costs that are either 0 or 1.

Proof Sketch: Fix any deterministic algorithm. Construct an instance consisting of only 2 arms and binary costs 0 or 1 as follows: in any round t, exactly one arm has cost 1 and the other has cost 0 (that is, for any t, either c_{t1} = 1, c_{t2} = 0 or c_{t1} = 0, c_{t2} = 1). Since the algorithm is deterministic, the adversary knows in advance which arm it will pick in each round, and assigns that arm cost 1 and the other arm cost 0 (fill in the details yourself). Thus the algorithm pays a total cost of T, while some arm has total cost at most T/2 (this is directly implied by the fact that in any round one arm has cost 0 and the other cost 1).
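The "fill in the details" step can be made concrete with a small Python sketch of the adversary (our own illustration; the two-arm deterministic algorithm is modeled as a function from the revealed history to an arm in {0, 1}):

def adversary_vs_deterministic(algorithm, T):
    """Build the cost table online against a deterministic 2-arm algorithm.

    `algorithm(history)` is assumed to deterministically return an arm in {0, 1}
    given the list of cost vectors revealed so far (full feedback).
    """
    history, alg_cost, totals = [], 0, [0, 0]
    for t in range(T):
        arm = algorithm(history)          # deterministic => adversary can predict it
        cost = [0, 0]
        cost[arm] = 1                     # charge the arm the algorithm picks
        alg_cost += cost[arm]             # algorithm pays 1 every round => total T
        totals[0] += cost[0]; totals[1] += cost[1]
        history.append(cost)
    best_arm_cost = min(totals)           # one arm's total is at most T/2
    return alg_cost, best_arm_cost        # regret = alg_cost - best_arm_cost >= T/2

# Example: a deterministic "follow the leader" rule suffers regret T/2 here.
ftl = lambda h: 0 if sum(c[0] for c in h) <= sum(c[1] for c in h) else 1
print(adversary_vs_deterministic(ftl, 10))   # (10, 5)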
Note that this is in sharp contrast with stochastic bandits, where all the algorithms we saw were deterministic. We now give a randomized algorithm, Hedge, that has sublinear regret for bandits with full feedback.

Hedge Algorithm
Theorem 15. For any ϵ > 0 and cost vectors c_1, . . . , c_T with non-negative entries, the expected regret of the Hedge algorithm satisfies:
$$E[R(T)] \le \frac{\log K}{\epsilon} + \epsilon \sum_{t=1}^T p_t \cdot c_t^2 \qquad (3)$$
where $c_t^2 = (c_{t1}^2, \ldots, c_{tK}^2)$ and $p_t \cdot c_t^2 = \sum_a p_t(a) c_{ta}^2$.

Corollary 2. If each cost is in [0, 1] (that is, c_{ta} ∈ [0, 1] for each t and a) then the expected regret of Hedge satisfies:
$$E[R(T)] = O(\sqrt{T \log K})$$

Algorithm 6: Hedge Algorithm
Parameter: ϵ > 0 (will be chosen to optimize regret);
w_1(a) = 1 for all arms a;    /* Initialization */
for t = 1, . . . , T do
    p_t(a) = w_t(a) / Σ_{a'} w_t(a') for each arm a;
    p_t = (p_t(1), . . . , p_t(K));
    a_t ∼ p_t;
    Algorithm pays the cost c_{t,a_t};
    Algorithm receives the cost vector c_t;
    w_{t+1}(a) = w_t(a) e^{−ϵ c_{ta}} for all a;    /* update of weights */
end

Proof. If each cost is in [0, 1] then we have p_t · c_t^2 ≤ 1 for each t. Hence
$$E[R(T)] \le \frac{\log K}{\epsilon} + \epsilon T.$$
Choosing ϵ = √(log K / T) gives E[R(T)] ≤ 2√(T log K) = O(√(T log K)).

For the proof of the theorem, follow this material (beware of the change in notation):
https://people.eecs.berkeley.edu/~jiantao/2902021spring/scribe/EE290_Lecture_10.pdf
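For concreteness, here is a minimal Python sketch of Hedge in the full-feedback setting, following Algorithm 6; the function name and interface are our own illustrative choices.

import math, random

def hedge(cost_vectors, eps):
    """Run Hedge on a sequence of cost vectors (full feedback).

    cost_vectors: list of length-T lists, each with K non-negative entries.
    Returns the total cost paid by the (randomized) algorithm.
    """
    K = len(cost_vectors[0])
    w = [1.0] * K                                     # w_1(a) = 1 for all arms
    total_cost = 0.0
    for c in cost_vectors:
        Z = sum(w)
        p = [w[a] / Z for a in range(K)]              # p_t(a) = w_t(a) / sum of weights
        a_t = random.choices(range(K), weights=p)[0]  # a_t ~ p_t
        total_cost += c[a_t]                          # pay c_{t, a_t}
        w = [w[a] * math.exp(-eps * c[a]) for a in range(K)]  # full-feedback update
    return total_cost

With eps = √(log K / T) and costs in [0, 1], Corollary 2 above bounds the expected gap between this total cost and the best arm's total cost by O(√(T log K)).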

EXP3 Algorithm
The idea of the EXP3 algorithm is to simulate the Hedge algorithm on fake cost vectors. In any round t, the algorithm pulls arm a_t and hence gets to know only c_{t,a_t}. We define the fake cost vector $\hat{c}_t = (0, \ldots, \frac{c_{t,a_t}}{p_t(a_t)}, 0, \ldots, 0)$ (all entries are 0 except the a_t-th one, which is c_{t,a_t}/p_t(a_t)). The EXP3 algorithm simply invokes Hedge on the fake cost vectors to update the weights and proceeds.
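For concreteness, here is a minimal Python sketch of EXP3 in the bandit setting, written by us to mirror the description above; the interface (a list of cost vectors passed in up front) is only for illustration, since in the real setting the costs are revealed one round at a time.

import math, random

def exp3(cost_vectors, eps):
    """Run EXP3: only the pulled arm's cost is observed each round.

    Internally this is Hedge run on the importance-weighted fake costs
    c_hat_{t, a_t} = c_{t, a_t} / p_t(a_t); all other fake costs are 0.
    """
    K = len(cost_vectors[0])
    w = [1.0] * K
    total_cost = 0.0
    for c in cost_vectors:
        Z = sum(w)
        p = [w[a] / Z for a in range(K)]
        a_t = random.choices(range(K), weights=p)[0]
        total_cost += c[a_t]                  # only c[a_t] is observed
        c_hat = c[a_t] / p[a_t]               # fake cost for the pulled arm
        w[a_t] *= math.exp(-eps * c_hat)      # other weights stay put (fake cost 0)
    return total_cost

Choosing eps = √(log K / (K T)) matches the choice made at the end of the proof of Theorem 16 below.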

Theorem 16. If all costs are in [0, 1] (that is, c_{ta} ∈ [0, 1] for all t ∈ [T] and a ∈ [K]) then the expected regret of EXP3 satisfies:
$$E[R(T)] = O(\sqrt{KT \log K})$$

Proof of Theorem 16
Note that for any t and a, p_{ta} (the probability that the algorithm pulls arm a in round t) is a random variable. Similarly, each fake cost ĉ_{ta} is also a random variable.
Let a_t = (a_1, . . . , a_t) (recall a_t is the arm chosen by the algorithm in round t). Let b_t = (b_1, . . . , b_t) ∈ [K]^t. Let $p_{ta|a_{t-1}=b_{t-1}}$ denote the probability that the algorithm EXP3 pulls arm a in round t if a_1 = b_1, . . . , a_{t-1} = b_{t-1}. We may also write $p^{EXP3}_{ta|a_{t-1}=b_{t-1}}$ for clarity when we are talking about both Hedge and EXP3 simultaneously.

Algorithm 7: EXP3 Algorithm
Parameter: ϵ > 0 (will be chosen to optimize regret);
w_1(a) = 1 for all arms a;    /* Initialization */
for t = 1, . . . , T do
    p_t(a) = w_t(a) / Σ_{a'} w_t(a') for each arm a;
    p_t = (p_t(1), . . . , p_t(K));
    a_t ∼ p_t;
    Algorithm pays the cost c_{t,a_t};
    ĉ_t = (0, . . . , c_{t,a_t}/p_t(a_t), 0, . . . , 0);
    /* all entries of ĉ_t are zero except the a_t-th one, which is c_{t,a_t}/p_t(a_t) */
    For each arm a, w_{t+1}(a) = w_t(a) e^{−ϵ ĉ_{ta}};    /* update of weights */
end

Note that $p_{ta|a_{t-1}=b_{t-1}}$ is a real number and is some function of b_1, . . . , b_{t-1}, but fortunately we will not need its explicit expression for our analysis.
Note that given a_1 = b_1, . . . , a_{t-1} = b_{t-1}, the random variable ĉ_{ta} takes the value 0 if EXP3 does not pull arm a in the t-th round, and $\frac{c_{ta}}{p_{ta|a_{t-1}=b_{t-1}}}$ if EXP3 pulls arm a in the t-th round. Therefore,
$$E[\hat{c}_{ta} \mid a_{t-1} = b_{t-1}] = p_{ta|a_{t-1}=b_{t-1}} \cdot \frac{c_{ta}}{p_{ta|a_{t-1}=b_{t-1}}} = c_{ta}.$$
Also
$$E[\hat{c}_{ta}] = \sum_{(b_1,\ldots,b_{t-1}) \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, E[\hat{c}_{ta} \mid a_{t-1} = b_{t-1}] = \sum_{(b_1,\ldots,b_{t-1}) \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, c_{ta} = c_{ta}$$
Thus for any t and a, ĉ_{ta} is an unbiased estimator of c_{ta} (meaning E[ĉ_{ta}] = c_{ta}).
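This unbiasedness is easy to check numerically; the small simulation below (ours, with a fixed conditional distribution p standing in for the round-t pulling probabilities) estimates E[ĉ_{ta}] for each arm and compares it with c_{ta}.

import random

# Fixed round: true costs and the conditional distribution p over the 3 arms.
c = [0.3, 0.7, 0.1]
p = [0.5, 0.3, 0.2]
n_trials = 200_000
est = [0.0, 0.0, 0.0]
for _ in range(n_trials):
    arm = random.choices(range(3), weights=p)[0]
    est[arm] += c[arm] / p[arm]          # fake cost of the pulled arm; others get 0
est = [e / n_trials for e in est]
print(est)   # each entry should be close to the corresponding entry of c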
Recall that the algorithm EXP3 is just a simulation of the Hedge algorithm on fake cost vectors. So intuitively we should expect that the expected cost of EXP3 on the original cost table relates to the expected cost of the Hedge algorithm on the fake cost vectors. To see this formally, we will prove the following lemma.
Lemma 9. Let c_□ = (c_1, . . . , c_T) denote the input cost table and let ĉ_□ = (ĉ_1, . . . , ĉ_T) be the fake cost table. Then
$$E[Cost^{EXP3}(c_□)] = E\big[E[Cost^{Hedge}(\hat{c}_□)]\big]$$
where in $E\big[E[Cost^{Hedge}(\hat{c}_□)]\big]$ the outer expectation is over the randomness of EXP3.
Proof. Let e_j = (0, . . . , 0, 1, 0, . . . , 0) be the unit vector with 1 in the j-th entry and 0 elsewhere. For any b_T ∈ [K]^T, if a_T = b_T then we have $\hat{c}_t = e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$ for every t. Let $\hat{c}_{t|b_T} = e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$ and $\hat{c}_{□|b_T} = (\hat{c}_{1|b_T}, \ldots, \hat{c}_{T|b_T})$. Then
$$E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] = \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \, E[Cost^{Hedge}(\hat{c}_{□|b_T})]$$
$$= \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \sum_{t=1}^T p_t^{Hedge}(\hat{c}_{□|b_T}) \cdot e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$$
$$= \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \sum_{t=1}^T c_{t b_t} \qquad \big(\text{because } p_t^{Hedge}(\hat{c}_{□|b_T}) \cdot e_{b_t} = p^{EXP3}_{t b_t | b_{t-1}}\big)$$
$$= E[Cost^{EXP3}(c_□)]$$

Now for any arm a, applying inequality (3),
$$E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] \le E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] + \frac{\log K}{\epsilon} + \epsilon\, E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big]$$

Now
$$E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] = \sum_{t=1}^T E[\hat{c}_{ta}] = \sum_{t=1}^T c_{ta}$$
$$E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big] = \sum_{t=1}^T \sum_a E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\big]$$

Now
$$E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\big] = \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2 \mid a_{t-1} = b_{t-1}\big]$$
$$= \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, p^{EXP3}_{ta|b_{t-1}} \, E\big[\hat{c}_{ta}^2 \mid a_{t-1} = b_{t-1}\big]$$
$$= \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, p^{EXP3}_{ta|b_{t-1}} \cdot p^{EXP3}_{ta|b_{t-1}} \frac{c_{ta}^2}{\big(p^{EXP3}_{ta|b_{t-1}}\big)^2}$$
$$= c_{ta}^2$$

Hence for any arm a, we have
$$E[Cost^{EXP3}(c_□)] = E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] \le \frac{\log K}{\epsilon} + E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] + \epsilon\, E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big]$$
$$= \frac{\log K}{\epsilon} + \sum_{t=1}^T c_{ta} + \epsilon \sum_{t=1}^T \sum_a c_{ta}^2$$
$$\le \frac{\log K}{\epsilon} + \sum_{t=1}^T c_{ta} + \epsilon K T$$

Therefore,
$$E[Cost^{EXP3}(c_□)] \le \min_a \sum_{t=1}^T c_{ta} + \epsilon K T + \frac{\log K}{\epsilon}$$
Choosing ϵ = √(log K / (KT)) gives a regret of 2√(KT log K) = O(√(KT log K)).
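For completeness, the choice of ϵ comes from balancing the two ϵ-dependent terms; the short calculation (standard, and not spelled out in the notes) is:
$$\min_{\epsilon > 0}\Big( \epsilon K T + \frac{\log K}{\epsilon} \Big) \quad\text{is attained at}\quad \epsilon = \sqrt{\frac{\log K}{KT}}, \quad\text{giving}\quad \epsilon K T + \frac{\log K}{\epsilon} = 2\sqrt{KT \log K}.$$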

Martingales
We will study random processes on decision trees. Consider a decision tree T (generally this corresponds to a sequence of decisions/random variables Y_1, . . . , Y_n), a complete rooted tree of height n. Each edge uv is associated with a probability p_{uv}, and we have $\sum_{v \in C(u)} p_{uv} = 1$ for every node u, where C(u) is the set of all children of u. One can visualize a particle that starts at the root node and in each step picks a child with the respective probability (and in this way eventually reaches a leaf node).
A sequence of random variables X_0, X_1, . . . , X_n is called a martingale with respect to a decision tree T if
1. X_i : V_i → R for all i, where V_i is the set of nodes at height i.
2. E[X_i | H_{i-1}] = X_{i-1} for all i, where H_{i-1} is the history just after time i − 1. Note that H_{i-1} is a random variable and is the node reached by the particle just after time i − 1.
   Saying E[X_i | H_{i-1}] = X_{i-1} for all i is the same as saying: for all i and for all nodes v at height i − 1, E[X_i | v] = X_{i-1}(v), where E[X_i | v] is the expectation of X_i given that the particle is at node v. Further note that, given that the particle is at node v (at height i − 1), the values X_0, . . . , X_{i-1} are fixed, i.e., they depend only on v, and so X_{i-1}(v) is a real number.
One reason to study martingales is that Hoeffding-type concentration inequalities can be applied to a sequence of dependent random variables if they form a martingale.

Theorem 17 (Azuma's Inequality). If X_0, . . . , X_n is a martingale with respect to some decision tree T and if |X_i − X_{i-1}| ≤ c_i for all i, then
$$\Pr(|X_n - X_0| \ge \epsilon) \le 2 \exp\Big(-\frac{\epsilon^2}{2 \sum_i c_i^2}\Big)$$
One can derive Hoeffding's inequality for binary independent random variables from Azuma's inequality: if Y_1, . . . , Y_n are independent random variables such that Y_i ∈ {−1, 1} and E[Y_i] = µ_i, then
$$\Pr\Big(\Big|\sum_{i=1}^n (Y_i - \mu_i)\Big| \ge \epsilon\Big) \le 2 \exp\Big(-\frac{\epsilon^2}{2n}\Big)$$

To see the above, consider the decision tree T corresponding to the decisions Y_1, . . . , Y_n (T is a complete binary tree; from any node v at height i − 1, the probability of reaching the left child of v is Pr(Y_i = −1) and the probability of reaching the right child of v is Pr(Y_i = 1)).
Now (X_0, X_1, . . . , X_n) is a martingale, where X_0 = 0 and X_i = Y_1 + · · · + Y_i − µ_1 − · · · − µ_i. To see this, consider any node v at height i − 1. We have
$$E[X_i \mid v] = E[Y_1 + \cdots + Y_i - \mu_1 - \cdots - \mu_i \mid v]$$
$$= (Y_1 + \cdots + Y_{i-1} - \mu_1 - \cdots - \mu_{i-1})(v) + E[Y_i - \mu_i \mid v]$$
$$= X_{i-1}(v) + 0 = X_{i-1}(v)$$

Note that applying Azuma's inequality to (X_0, . . . , X_n) gives Hoeffding's inequality for Y_1, . . . , Y_n (up to the constant in the exponent, since here |X_i − X_{i-1}| = |Y_i − µ_i| ≤ 2).
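A quick simulation (our own illustration) of the last bound: for independent ±1 random variables, the empirical probability of a deviation of size ϵ sits well below 2·exp(−ϵ²/(2n)).

import math, random

n, eps, trials = 100, 20.0, 20_000
mu = 0.2                                  # E[Y_i] for Y_i in {-1, +1} with Pr(+1) = 0.6
p_plus = (1 + mu) / 2
exceed = 0
for _ in range(trials):
    # centered sum of n independent +/-1 variables
    s = sum((1 if random.random() < p_plus else -1) - mu for _ in range(n))
    if abs(s) >= eps:
        exceed += 1
print("empirical:", exceed / trials, "  bound:", 2 * math.exp(-eps**2 / (2 * n)))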
