
Stochastic Bandits

February 21, 2025

1 Concentration Inequalities
Theorem 1 (Chebyshev's Inequality). Let X be any random variable. Then for all t > 0,

Pr(|X − E[X]| ≥ t) ≤ Var(X)/t².
Theorem 2 (Markov's inequality). Let X be any random variable that takes only non-negative values. Then for any c > 0,

Pr(X ≥ cE[X]) ≤ 1/c.

Theorem 3 (Hoeffding's inequality). Let X1, . . . , Xn be independent bounded random variables such that ai ≤ Xi ≤ bi for all i ∈ [n]. Let Sn = X1 + · · · + Xn, so E[Sn] = Σi E[Xi].
For any t > 0, we have

Pr(|Sn − E[Sn]| ≥ t) ≤ 2·exp( −2t² / Σi (bi − ai)² )

If 0 ≤ Xi ≤ 1 for all i ∈ [n], then we have

Pr(|Sn − E[Sn]| ≥ t) ≤ 2·exp( −2t²/n )

Let X̄n = Sn/n = (X1 + · · · + Xn)/n, so E[X̄n] = Σi E[Xi]/n. For any t > 0, we have

Pr(|X̄n − E[X̄n]| ≥ t) ≤ 2·exp( −2t²n² / Σi (bi − ai)² )

When 0 ≤ Xi ≤ 1 for all i, then

Pr(|X̄n − E[X̄n]| ≥ t) ≤ 2·exp( −2t²n )

Remark. Chebyshev's and Hoeffding's inequalities work even for random variables that can take negative values.
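As a quick sanity check of the bound for [0, 1]-valued averages, here is a small simulation (a sketch, not part of the notes; the Bernoulli(0.5) choice and the sample sizes are arbitrary assumptions) comparing the empirical deviation probability of X̄n with the bound 2·exp(−2t²n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 20000

# X_1, ..., X_n i.i.d. Bernoulli(0.5), so each X_i lies in [0, 1].
samples = rng.random((trials, n)) < 0.5
empirical_means = samples.mean(axis=1)

# Empirical frequency of the event |X̄_n - E[X̄_n]| >= t.
deviation_freq = np.mean(np.abs(empirical_means - 0.5) >= t)

# Hoeffding's bound for averages of [0, 1]-valued variables.
hoeffding_bound = 2 * np.exp(-2 * t**2 * n)

print(f"empirical: {deviation_freq:.4f}  Hoeffding bound: {hoeffding_bound:.4f}")
```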

2 Weak Law of Large Numbers
Let X1, . . . , Xn be n iid (independent and identically distributed) random variables. Let E[X] = µ and Var(X) = v. Let X̄n = Σi Xi / n be the empirical average. Note that E[X̄n] = µ.
When n → ∞, the empirical average X̄n converges to the true mean µ. We can see this from Chebyshev's inequality. As X1, . . . , Xn are independent, we have

Var(X̄n) = Var( Σi Xi / n ) = n·Var(X)/n² = Var(X)/n.

So as n → ∞, the variance of X̄n tends to 0. So we have

Pr(|X̄n − µ| ≥ ϵ) ≤ Var(X)/(nϵ²) = v/(nϵ²)
So for any fixed ϵ > 0, for sufficiently large values of n (this value will depend
on ϵ), the probability Pr(|X̄n − µ| ≥ ϵ) will approach 0.
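For concreteness (a worked instance, not from the notes): if Var(X) = 1/4 (e.g., the Xi are Bernoulli with mean 1/2) and ϵ = 0.05, the bound reads Pr(|X̄n − µ| ≥ 0.05) ≤ (1/4)/(n · 0.0025) = 100/n, so n = 10000 samples already make this probability at most 0.01.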

3 Stochastic Bandits
We start with the basic model of bandits, which is independent rewards. An
algorithm has K possible arms to choose from, and there are T rounds, both
K and T are known in advance (we will see later that one can relax the as-
sumption that T should be known in advance). Each arm a is associated with
a reward distribution Da (pmf/pdf) which is unknown to the algorithm. The
mean reward of the distribution Da will be most relevant for us and is denoted
by µa, i.e., µa = E_{r∼Da}[r] (r ∼ Da denotes that r is sampled/drawn from the distribution Da).
In each round t ∈ [T ]:
1. algorithm picks an arm at ∈ [K].
2. reward rt is sampled from the distribution Dat .
3. algorithm receives the reward rt .
The algorithm can be randomized, i.e., in any round it can fix a probability
distribution over arms and can pick an arm from this distribution.
Again, it is important to note that µ1 , . . . , µK (and distributions D1 , . . . , DK )
are unknown to the algorithm. The algorithm only sees the rewards (of the arms
pulled by the algorithm).
We make the following assumptions:
1. rewards are bounded. For simplicity, we will assume rewards in any round are in [0, 1]. So the means µ1, . . . , µK are all in [0, 1].
2. as we are in stochastic setting, we assume all drawn rewards are indepen-
dent.

An important point to note is that, regardless of whether the algorithm is deterministic or randomized, the arm pulled by the algorithm in any round is a random variable. This is because the algorithm's decision to pull an arm in some round depends on the past history of rewards observed by the algorithm; since the received rewards are themselves random variables, the arms pulled are also random variables. This will become clear when we see some algorithms.
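To make the interaction protocol concrete, here is a minimal environment sketch (illustrative only; the Bernoulli reward distributions and the class and method names are assumptions, not part of the notes).

```python
import numpy as np

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli(mu_a) rewards in {0, 1}."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # unknown to the algorithm
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Reward r_t ~ D_{a_t}, drawn independently in each round.
        return float(self.rng.random() < self.means[a])

    def regret(self, pulls):
        # R(T) = sum_t (mu* - mu_{a_t}) for a sequence of pulled arms.
        mu_star = self.means.max()
        return float(sum(mu_star - self.means[a] for a in pulls))

# Example: always pulling arm 0 on the instance mu = (0.3, 0.7) for 100 rounds.
env = BernoulliBandit([0.3, 0.7])
print(env.regret([0] * 100))  # 100 * (0.7 - 0.3) = 40.0
```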

Notations: Throughout the course, we will stick to the following notations.


The set of arms will be denoted by [K] and the total number of rounds will be
denoted by T . The best arm is the arm a that has highest mean reward. We
will use a∗ for the best arm and µ∗ for the mean reward of the best arm. That
is a∗ = arg maxa µa and µ∗ = µa∗ = maxa µa . For any suboptimal arm, we
will use ∆a := µ∗ − µa for the gap of arm a. Finally, the arm pulled by the
algorithm in tth round will be denoted by at .

Regret: How do we measure the performance of an algorithm? Recall that


the goal of the algorithm is to maximize the sum of rewards received in all
rounds.
One standard approach is to compare the algorithm performance with the
best possible algorithm that knows the distributions Da for all a ∈ [K]. Note
that if means µ1 , . . . , µK are known then best strategy to maximize total ex-
pected rewards is to always pick the arm a∗ to get total expected reward of
T µ∗. It makes sense to define the regret of an algorithm after T rounds as

R(T) = µ∗·T − Σ_{t=1}^{T} µat = Σ_{t=1}^{T} (µ∗ − µat)

where at is the arm pulled by the algorithm in the t-th round. Thus R(T) is the regret incurred by the algorithm after T rounds due to not knowing the means
µ1 , µ2 , . . . , µK . Note that the regret R(T ) is a random variable. We are inter-
ested in the expected regret E[R(T )] of the algorithm.

E[R(T)] = T·µ∗ − E[ Σ_{t=1}^{T} µat ]

The expectation in the above definition is taken over all the randomness, i.e.,
randomness over the draw of rewards from the distributions D1 , . . . , DK and
internal randomness of the algorithm (if the algorithm is randomized).
Remark. 1. By definition, R(T) and E[R(T)] of an algorithm depend on the problem instance, i.e., the means µ1, . . . , µK. Of course, we want an algorithm whose expected regret is small for all instances. In the previous line, 'for all problem instances' is important. For example, consider a naive algorithm that always pulls arm 1. This algorithm will have regret 0 if arm 1 has mean 1 and the other arms have mean 0. But of course this algorithm will suffer on other instances, such as µ1 = 0 and µ2 = 1, where it will have regret T. So we want an algorithm that performs well on all instances.
2. If the regret is small then the algorithm’s performance is close to best
performance when the distributions are known. So we kind of learn the
distributions.
3. If r1, . . . , rT are the rewards received by the algorithm in rounds 1, 2, . . . , T respectively, then one can see that E[Σ_{t=1}^{T} rt] = E[Σ_{t=1}^{T} µat]. Hence the expected regret of an algorithm is the difference between the expected total reward of the best strategy that knows µ1, . . . , µK and the expected total reward of the algorithm.
Note: In some textbooks, R(T) is called the random regret or realized regret and E[R(T)] is called the regret. It is just a matter of convention which names we use. Often, it will be clear from the context whether we are talking about R(T) or E[R(T)].
Our goal is to design an algorithm whose expected regret E[R(T)] is as small as possible for all problem instances. As in a general discussion of algorithms, we will ignore constants and only consider the Big-O dependence on T. Also, we assume T ≥ K, as any reasonable algorithm will pull each arm at least once.
Note that since the reward in any round is in [0, 1], R(T) ≤ T always. We want algorithms for which R(T) grows sublinearly in T, i.e., R(T)/T → 0 as T → ∞. In other words, the average regret per round should go to 0 when T is large (so essentially we have learned the unknown distributions). The smaller the regret's dependence on T, the faster the average regret per round converges to 0.

3.1 Explore Then Commit Algorithm


This algorithm is very intuitive and is perhaps the first algorithm one would come up with. If we knew the means µ1, . . . , µK, then in each round the best algorithm would pick the arm a∗ (recall the notation: a∗ has the highest mean µ∗ = maxa µa) to maximize total expected reward. The ETC algorithm first tries each arm m times and computes the estimate µ̂a of the mean µa for every arm a. Thereafter, the algorithm always picks an arm that maximizes the empirical mean µ̂a (we will call this arm a′). The value of m is set to T^{2/3}·(log T)^{1/3}/K^{2/3} to optimize the expected regret.
We will now prove our first theorem of this course.
Theorem 4. For any instance (i.e., for any values of unknowns µ1 , . . . , µK ),
the expected regret of the algorithm ETC is O(T 2/3 (log T )1/3 K 1/3 ).
Proof. Let ϵ = √(5 ln T / m). For any arm a, let Bada be the event that |µ̂a − µa| ≥ ϵ. As µ̂a = (Σ_{i=1}^{m} rai)/m and all rewards are in [0, 1], from Hoeffding's inequality we have

Pr(Bada) = Pr(|µ̂a − µa| ≥ ϵ) ≤ 2/e^{2ϵ²m} = 2/T^{10}.

Algorithm 1: Explore Then Commit
1. m = T^{2/3}·(log T)^{1/3}/K^{2/3};
2. Try each arm m times. For each arm a, let ra1, ra2, . . . , ram be the rewards received.;
/* Line 2 is the exploration/investment phase. In this phase, the algorithm tries each arm, so even the worst arms get pulled. But we hope that we will be able to find a near-optimal arm for the remaining rounds. */
3. For each arm a, set µ̂a as the average received reward for arm a, i.e., µ̂a = Σ_{i∈[m]} rai / m. Let a′ be the arm that maximizes µ̂a, i.e., a′ = arg maxa {µ̂a}.;
/* We hope a′ is a near-optimal arm, i.e., µ∗ − µa′ is very small. */
4. From round m · K + 1 onwards, always pick the arm a′.

Let Bad = ∪a Bada be the event that for some arm a, |µ̂a − µa| ≥ ϵ holds. So Good = Bad^c is the event that, for all arms a, we have |µ̂a − µa| < ϵ.
By the union bound, we have

Pr(Bad) ≤ Σa Pr(Bada) ≤ K · 2/T^{10} ≤ 2/T^{9}

(as we are assuming T ≥ K, since any reasonable algorithm will try each arm at least once).
Also,

Pr(Good) = 1 − Pr(Bad) ≥ 1 − 2/T^{9}
As Pr(Bad) is negligible (we have chosen ϵ = √(5 ln T / m) to ensure that this happens), it will suffice to bound the expected regret conditioned on the event Good. The following calculation formally shows this.

E[R(T)] = Pr(Bad)·E[R(T)|Bad] + Pr(Good)·E[R(T)|Good]
        ≤ (2/T^{9}) · T + 1 · E[R(T)|Good]
        ≤ 2/T^{8} + E[R(T)|Good],

where the second inequality comes from E[R(T)|Bad] ≤ T as all rewards are in [0, 1].

Bounding E[R(T)|Good]. We now assume the event Good, i.e., for all arms a, we have |µ̂a − µa| < ϵ. Recall that R(T) = Σ_{t=1}^{T} (µ∗ − µat). The contribution to the regret in the investment phase is at most mK (assuming a worst-case contribution of 1 in each round). The contribution to the regret from the (mK + 1)-st round till the end is (T − mK)(µ∗ − µa′) (recall that a′ is the arm that maximizes µ̂a and is always chosen from the (mK + 1)-st round onwards). We now claim that (µ∗ − µa′) < 2ϵ. For now let us assume this claim and bound the regret. We will prove the claim later.

R(T)|Good ≤ mK + (T − mK)·2ϵ
          ≤ mK + 2Tϵ
          = mK + 2T·√(5 ln T / m)

As m increases, mK increases and 2T·√(5 ln T / m) decreases, so the above quantity is minimized when mK = 2T·√(5 ln T / m), i.e., when m = (2√5)^{2/3}·T^{2/3}·(log T)^{1/3}/K^{2/3}. Substituting this value of m, we get that R(T) (conditioned on Good) is O(T^{2/3} K^{1/3} (log T)^{1/3}). Hence we also have E[R(T)|Good] = O(T^{2/3} K^{1/3} (log T)^{1/3}).
From our earlier calculation, E[R(T)] ≤ 2/T^{8} + E[R(T)|Good] = O(T^{2/3} K^{1/3} (log T)^{1/3}).
It remains to prove the claim that (µ∗ − µa′) < 2ϵ. Note that µ̂a′ ≥ µ̂a∗ > µ∗ − ϵ and also µa′ > µ̂a′ − ϵ. Thus we have µ∗ − µa′ < 2ϵ.
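Before moving on, here is a compact sketch of Explore Then Commit in Python (illustrative only; the function names and the `pull(a)` interface are assumptions, and the constant in m is set to 1 for simplicity).

```python
import math
import numpy as np

def explore_then_commit(K, T, pull):
    """pull(a) returns a reward in [0, 1] sampled from arm a's distribution."""
    m = max(1, int(T ** (2 / 3) * math.log(T) ** (1 / 3) / K ** (2 / 3)))
    assert m * K <= T, "exploration phase must fit into the horizon"
    sums = np.zeros(K)
    pulls = []
    # Exploration/investment phase: try each arm m times.
    for a in range(K):
        for _ in range(m):
            sums[a] += pull(a)
            pulls.append(a)
    # Commit phase: always play the arm a' with the highest empirical mean.
    a_prime = int(np.argmax(sums / m))
    for _ in range(T - m * K):
        pull(a_prime)            # reward is received but no longer needed for decisions
        pulls.append(a_prime)
    return pulls

# Example usage with Bernoulli arms of means (0.4, 0.5, 0.6).
means = np.array([0.4, 0.5, 0.6])
rng = np.random.default_rng(1)
pulls = explore_then_commit(3, 5000, lambda a: float(rng.random() < means[a]))
print("fraction of pulls of the best arm:", pulls.count(2) / len(pulls))
```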

4 Successive Elimination Algorithm
One drawback of ETC algorithm is that it will continue to explore an arm large
number of times (m times) even if an arm’s reward history might suggest to not
pull this arm further. In Successive Elimination algorithm, we discontinue the
arm forever once we have belief that the arm is not good. Below is the high
level description of this algorithm.

Algorithm 2: Successive Elimination - High Level Description


1) Pull every arm once;
2) If there is ‘sufficient evidence’ that some arm a is not a good arm
then remove this arm;
Repeat the above steps over the remaining arms

Now, to describe Successive Elimination fully, we just need to specify what 'sufficient evidence' means. For this, we introduce some notation. For any arm a and round t, let na(t) be the number of times the arm a has been pulled till round t. Obviously, we have Σa na(t) = t. Further, let µ̂a(t) be the empirical mean of the rewards received from arm a till round t. Formally, let rai be the reward received from arm a on its i-th pull. Then µ̂a(t) = Σ_{i=1}^{na(t)} rai / na(t). Let ϵa(t) = √(5 log T / na(t)). Finally, let UCBa(t) = µ̂a(t) + ϵa(t) and LCBa(t) = µ̂a(t) − ϵa(t).
Now we describe the 'sufficient evidence' which Successive Elimination employs. Recall the analysis of the ETC algorithm. There we defined an event Good (and showed that it holds with high probability) and showed that, conditioned on Good, µ̂a − ϵ ≤ µa ≤ µ̂a + ϵ where ϵ = √(5 log T / m). Here also, we will define an event Good (and show that it holds with high probability) conditioned on which, for every arm a and round t, we have LCBa(t) ≤ µa ≤ UCBa(t). Now if at any time t we have UCBa(t) < LCBa′(t) for some arms a and a′, then we know that µa < µa′. Hence, it is not a good strategy to pull arm a in any subsequent round because we would be better off pulling arm a′. In other words, we can eliminate arm a for the future.

Algorithm 3: Successive Elimination


Activate all the arms;
while #-rounds < T do
Pull each active arm once (and receive rewards);
Deactivate all arms a such that there exists some other arm a′ with UCBa < LCBa′ ;
end
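A minimal sketch of Successive Elimination (illustrative; the function name and the `pull(a)` interface are assumptions, and the confidence radius uses the 5·log T constant from the notes).

```python
import math
import numpy as np

def successive_elimination(K, T, pull):
    """pull(a) returns a reward in [0, 1] from arm a; returns the list of pulled arms."""
    active = list(range(K))
    counts = np.zeros(K)   # n_a(t)
    sums = np.zeros(K)     # running reward sums
    pulls = []
    while len(pulls) < T:
        # One phase: pull each active arm once.
        for a in list(active):
            if len(pulls) >= T:
                break
            sums[a] += pull(a)
            counts[a] += 1
            pulls.append(a)
        # Deactivate every arm whose UCB falls below some other arm's LCB.
        idx = np.array(active)
        mu_hat = sums[idx] / counts[idx]
        radius = np.sqrt(5 * math.log(T) / counts[idx])
        ucb, lcb = mu_hat + radius, mu_hat - radius
        best_lcb = lcb.max()
        active = [a for a, u in zip(active, ucb) if u >= best_lcb]
    return pulls
```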

Theorem 5. For all instances, i.e., for all values of µ1, . . . , µK,

E[R(T)] = O(√(KT log T))

Proof. Let Good be the following event: for all arms a and for all rounds t, we have |µ̂a(t) − µa| < ϵa(t) (that is, LCBa(t) ≤ µa ≤ UCBa(t)). We will prove that

Pr(Good) ≥ 1 − O(1/T^{8}).

For now, let us assume the above. As for ETC, it will suffice to bound E[R(T)|Good].

E[R(T)] = Pr(Bad)·E[R(T)|Bad] + Pr(Good)·E[R(T)|Good]
        ≤ O(1/T^{8}) · T + 1 · E[R(T)|Good]
        ≤ O(1/T^{7}) + E[R(T)|Good],

where the second inequality comes from E[R(T)|Bad] ≤ T as all rewards are in [0, 1].

Bounding E[R(T)|Good]. Every calculation in this paragraph is conditioned on Good. Note that R(T) = Σt (µ∗ − µat) = Σa na(T)·(µ∗ − µa). If we show that for each arm a, (µ∗ − µa) ≤ 8·√(5 log T / na(T)), then we are done. This is because

R(T) = Σa na(T)·(µ∗ − µa) ≤ Σa na(T)·8·√(5 log T / na(T)) = Σa 8·√(5·na(T)·log T).

As √x is a concave function, we have (Σa √(na(T)))/K ≤ √( (Σa na(T))/K ) = √(T/K). Thus R(T) ≤ 8·√(5KT log T). So it remains to show that for any arm a, we have (µ∗ − µa) ≤ 8·√(5 log T / na(T)). For the sake of analysis, we will refer to the i-th iteration of the while loop as phase i (for any i). Our first easy observation is that the arm a∗ will never be deactivated. Let t be the last round (corresponding to the end of some phase) at which the arm a remained active (in other words, arm a was played exactly once after t). Note that na(t) = na∗(t) (because both arms a and a∗ are active till t, so both of them are played an equal number of times, namely the number of phases completed till t). As the arm a is not deactivated at t, we must have UCBa(t) ≥ LCBa∗(t). Further, we have ϵa(t) = ϵa∗(t) = √(5 log T / na(t)) = √(5 log T / (na(T) − 1)). It is now easy to see that

µ∗ − µa ≤ 4ϵa(t) = 4·√(5 log T / (na(T) − 1)) ≤ 8·√(5 log T / na(T)).

Bounding Pr(Good). It remains to show:

Pr(Good) ≥ 1 − O(1/T^{8})

Let ra1, ra2, . . . , raT be T samples from the distribution Da. Let Bada(t) be the event that |(ra1 + · · · + rat)/t − µa| ≥ √(5 log T / t). By Hoeffding's inequality, we have Pr(Bada(t)) ≤ O(1/e^{10 log T}) = O(1/T^{10}). Let Bada be the event that for some 1 ≤ t ≤ T, we have |(ra1 + · · · + rat)/t − µa| ≥ √(5 log T / t). By the union bound, Pr(Bada) ≤ T · O(1/T^{10}) = O(1/T^{9}). Let Bad = ∪a Bada. Again by the union bound, Pr(Bad) ≤ K · O(1/T^{9}) = O(1/T^{8}).

The regret bound in the above theorem is a worst-case bound, i.e., for any problem instance (that is, for any values of K, T and µ1, . . . , µK), the expected regret satisfies E[R(T)] ≤ O(√(KT log T)). Now we will show another type of bound on the expected regret, namely an instance-dependent bound.
Theorem 6. The expected regret of Successive Elimination satisfies

E[R(T)] ≤ O(log T)·Σ_{a: △a>0} 1/△a

where △a = µ∗ − µa is the gap of arm a.


Proof. We define the events Bad and Good as in the above theorem. Again it will suffice to bound E[R(T)|Good]. All calculations below are conditioned on Good. We claim that for any suboptimal arm a, we have na(T) ≤ 50000·log T / △a². This implies that R(T) = Σa na(T)·△a ≤ O(log T)·Σ_{a: △a>0} 1/△a. Suppose na(T) > 50000·log T / △a². Consider the time t when na(t) = 50000·log T / △a². As the arm a∗ is always active, we also have na∗(t) = 50000·log T / △a². So now we have ϵa(t) = ϵa∗(t) = √(5 log T / na(t)) = △a/100. As we have assumed Good, we have LCBa∗(t) ≥ µ∗ − 2·△a/100 and UCBa(t) ≤ µa + 2·△a/100. So we have UCBa(t) < LCBa∗(t), which implies that arm a will be eliminated in round t, which contradicts na(T) > 50000·log T / △a².

UCB Algorithm
Let na(t), ϵa(t), µ̂a(t), UCBa(t) be as defined before (in the Successive Elimination algorithm).
The idea of UCB is to add a bonus to the empirical mean and then pick an arm that has the highest value of empirical mean plus bonus. The bonus is chosen to be ϵa(t), which is equal to √(5 log T / na(t)) (since µ̂a(t) + ϵa(t) = UCBa(t), the algorithm, at any time t, pulls an arm that has the highest value of UCBa(t)).
Let us now go into the intuition behind the UCB algorithm in detail. During
the initial rounds, the difference between the empirical mean and the actual
mean of an arm can be significant. Consequently, selecting an arm solely based
on the maximum empirical mean value is not a good strategy. To address this
issue, the algorithm incorporates a bonus term. If an arm a is underexplored,
the bonus is large, which encourages the algorithm to explore that arm (even
if its empirical mean is small at this time). One concern might be whether the
bonus term could lead the algorithm to pull bad arms excessively. However, this
scenario will not happen because the bonus term diminishes as the number of
pulls for the arm increases.

Algorithm 4: UCB Algorithm


U CBa = ∞ for all arms a;
/* Initialization */
while #-rounds < T do
Pull an arm that has the highest value of U CBa ;
/* Recall that U CBa (t) = µ̂a (t) + ϵa (t) */
end
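A minimal UCB sketch matching the description above (illustrative; assumes a `pull(a)` function returning rewards in [0, 1], and uses the 5·log T bonus from the notes).

```python
import math
import numpy as np

def ucb(K, T, pull):
    counts = np.zeros(K)
    sums = np.zeros(K)
    pulls = []
    # Initialization: pull each arm once (this realizes UCB_a = infinity at the start).
    for a in range(min(K, T)):
        sums[a] += pull(a)
        counts[a] += 1
        pulls.append(a)
    for _ in range(len(pulls), T):
        # UCB_a(t) = mu_hat_a(t) + sqrt(5 log T / n_a(t)); pull the maximizer.
        ucb_values = sums / counts + np.sqrt(5 * math.log(T) / counts)
        a = int(np.argmax(ucb_values))
        sums[a] += pull(a)
        counts[a] += 1
        pulls.append(a)
    return pulls
```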

UCB achieves the same guarantee on expected regret as Successive Elimination. The proof is almost the same, so we do not give a complete proof here.
Theorem 7. For all instances, i.e., for all values of µ1, . . . , µK,

E[R(T)] = O(√(KT log T))

and

E[R(T)] = O(log T)·Σ_{a: △a>0} 1/△a

where △a = µ∗ − µa is the gap of arm a.
Proof. The proof is almost the same as that of Successive Elimination. To prove E[R(T)] = O(√(KT log T)), we show that for any arm a, µ∗ − µa ≤ 2·√(5 log T / na(T)) (assuming Good), and then the rest of the proof is exactly the same. Let ta be the last round in which arm a was pulled. Thus na(T) = na(ta). Note that µa ≥ UCBa(ta) − 2ϵa(ta) (as we are assuming Good), UCBa∗(ta) ≥ µ∗ (as we are assuming Good) and UCBa∗(ta) ≤ UCBa(ta) (as arm a was pulled in round ta). Hence, µ∗ − µa ≤ 2ϵa(ta) = 2·√(5 log T / na(ta)) = 2·√(5 log T / na(T)). Now the proof goes the same way as in the Successive Elimination algorithm.
The proof of the instance-dependent bound is also similar. We prove here that (assuming Good) for any suboptimal arm a, we have na(T) ≤ 50000·log T / △a², and then the proof is exactly the same. Suppose not. Consider the time t when na(t) = 50000·log T / △a². We have ϵa(t) = △a/100. Note that for any time t′ > t, we have ϵa(t′) ≤ ϵa(t). For any time t′ > t, we have UCBa(t′) ≤ µa + 2ϵa(t′) ≤ µa + 2ϵa(t) < µ∗ ≤ UCBa∗(t′). This means that for any t′ > t we will have UCBa∗(t′) > UCBa(t′), which means that arm a will never be pulled after t. This contradicts na(T) > 50000·log T / △a².
MOSS algorithm (UCB2)

MOSS is a variant of UCB and it has an expected regret of O(√(KT)), and hence it beats both Successive Elimination and UCB in theory. In the next lecture, we will see that no algorithm can have expected regret smaller than Ω(√(KT)), and hence MOSS is the best possible algorithm.
MOSS is the same as UCB except for how the bonus is calculated. Now the bonus is set to √( max(log(T/(K·na(t))), 0) / na(t) ). Let

Ia(t) = µ̂a(t) + √( max(log(T/(K·na(t))), 0) / na(t) ).

At any time t, MOSS pulls an arm that has the maximum value of Ia. One reason MOSS has a better expected regret bound than UCB is that the estimation precision when na(t) is large is better in MOSS than in UCB. In other words, the bonus goes to 0 more quickly in MOSS than in UCB as na(t) increases.

Algorithm 5: MOSS Algorithm


Ia = ∞ for all arms a;
/* Initialization */
while #-rounds < T do
Pull an arm that has the highest value of Ia ;
end
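The only change from UCB is the bonus; here is a short sketch of the MOSS index computation (illustrative; the function name is an assumption and counts n_a(t) are assumed positive).

```python
import numpy as np

def moss_index(mu_hat, counts, T, K):
    """I_a(t) = mu_hat_a(t) + sqrt(max(log(T / (K * n_a(t))), 0) / n_a(t))."""
    counts = np.asarray(counts, dtype=float)
    bonus = np.sqrt(np.maximum(np.log(T / (K * counts)), 0.0) / counts)
    return np.asarray(mu_hat, dtype=float) + bonus

# Example: with T = 10000 and K = 10, a heavily pulled arm gets (almost) no bonus.
print(moss_index([0.5, 0.5], [10, 2000], T=10000, K=10))
```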

Theorem 8. The expected regret E[R(T)] of MOSS satisfies

E[R(T)] = O(√(KT))

Proof. We will use a trick that is frequently employed in the analysis of randomized algorithms. Instead of sampling from the distribution Da at the time when arm a is pulled, we assume that T independent samples from every distribution Da have already been drawn before the start of the algorithm. For each arm a, let ra1, . . . , raT be the T independent samples drawn from the distribution Da. Now when the algorithm pulls arm a, we provide a sample (to the algorithm) from ra1, . . . , raT. In particular, the sample provided to the algorithm for the i-th pull of arm a is rai.
Let us define some new notation. For any 1 ≤ x ≤ T, let µ̂a^x = Σ_{j=1}^{x} raj / x be the average of the first x rewards (of the T samples drawn beforehand). With respect to this new notation, note that µ̂a(t) = µ̂a^{na(t)}. Further, for any 1 ≤ x ≤ T, let Ia^x = µ̂a^x + √( max(log(T/(Kx)), 0) / x ). Again note that Ia(t) = Ia^{na(t)}. We will call Ia(t) the index of arm a after time t.
Let δ = max{ µ∗ − min_{1≤x≤T} Ia∗^x , 0 }. Note that δ is a random variable. By definition, the index of the best arm will never be less than µ∗ − δ, i.e., Ia∗(t) ≥ µ∗ − δ for all t. We will prove later that E[δ] ≤ 10·√(K/T). It will be helpful to keep this fact in mind.
Let us call an arm a Good if △a ≤ 5·√(K/T) (note that this is different from the previous algorithms, where Good and Bad were events). An arm a is called Bad if △a > 5·√(K/T). It will soon be clear why we have defined Good and Bad arms in this way. Recall that R(T) = Σa Ra(T) where Ra(T) = na(T)·△a.
R(T) = Σ_{a: a is Good} Ra(T) + Σ_{a: a is Bad} Ra(T)
     ≤ Σ_{a: a is Good} na(T)·5·√(K/T) + Σ_{a: a is Bad} Ra(T)
     ≤ 5·√(K/T)·Σ_{a: a is Good} na(T) + Σ_{a: a is Bad} Ra(T)
     ≤ 5·√(K/T)·T + Σ_{a: a is Bad} Ra(T)
     = 5·√(KT) + Σ_{a: a is Bad} Ra(T)
Thus it suffices to show Σ_{a: a is Bad} Ra(T) = O(√(KT)). Let us introduce a few more notations. For any bad arm a, we define a value ka (which is a random variable) as follows:

ka = |{1 ≤ x ≤ T : Ia^x > µa + △a/2}|

We also define J, which is a random subset of the bad arms, as follows:

J = {a ∈ [K] : a is Bad and △a > 2δ}

A very important observation is that for any arm in J, we have na(T) ≤ ka. This is the main crux of the analysis: directly bounding E[na(T)] is difficult, but later we will be able to bound E[ka] for bad arms. Now

Σ_{a: a is Bad} Ra(T) = Σ_{a∈J} na(T)·△a + Σ_{a∉J: a is Bad} na(T)·△a
                      ≤ Σ_{a: a is Bad} ka·△a + 2δT

Now

E[ Σ_{a: a is Bad} Ra(T) ] ≤ Σ_{a: a is Bad} E[ka]·△a + 2·E[δ]·T
                           ≤ Σ_{a: a is Bad} E[ka]·△a + 20·√(KT)

as we earlier claimed (without proof) that E[δ] ≤ 10·√(K/T). Thus it suffices to show that Σ_{a: a is Bad} E[ka]·△a = O(√(KT)). Recall that for any event A, we use 1A for the indicator r.v. that takes the value 1 if A happens and 0 otherwise.
For any bad arm a, let x0 = 8·log(T·△a²/K)/△a². We have

E[ka] = E[ |{1 ≤ x ≤ T : Ia^x > µa + △a/2}| ]
      = E[ Σ_{x=1}^{T} 1{Ia^x > µa + △a/2} ]
      = Σ_{x=1}^{T} E[ 1{Ia^x > µa + △a/2} ]
      = Σ_{x=1}^{T} Pr( Ia^x > µa + △a/2 )
      = Σ_{x=1}^{T} Pr( µ̂a^x + √( max(log(T/(Kx)), 0)/x ) > µa + △a/2 )
      = Σ_{x=1}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )
      ≤ Σ_{x=1}^{x0} 1 + Σ_{x=x0}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )
      = x0 + Σ_{x=x0}^{T} Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) )

As the arm a in the above calculation is Bad, we have △a > 5·√(K/T). This implies that for x ≥ x0 = 8·log(T·△a²/K)/△a², we have

max(log(T/(Kx)), 0)/x ≤ max(log(T/(K·x0)), 0)/x0
                      = (△a²/8) · log( T·△a² / (8K·log(T·△a²/K)) ) / log(T·△a²/K)
                      ≤ △a²/8

The last inequality holds because log(T·△a²/K) > 1 as △a > 5·√(K/T). Therefore

Pr( µ̂a^x − µa > △a/2 − √( max(log(T/(Kx)), 0)/x ) ) ≤ Pr( µ̂a^x − µa > △a/2 − △a/(2√2) )
                                                    ≤ 2·exp(−2c²·△a²·x)

where c = (1/2)·(1 − 1/√2).
Now

E[ka] ≤ x0 + Σ_{x=x0}^{T} 2·exp(−2c²·△a²·x)
      ≤ x0 + Σ_{x=x0}^{∞} 2·exp(−2c²·△a²·x)
      = x0 + 2·exp(−2c²·△a²·x0) / (1 − exp(−2c²·△a²)),

where the last equality comes from the geometric series summation.
As exp(−2c²·△a²·x0) < 1, we have

E[ka] ≤ 8·log(T·△a²/K)/△a² + 2/(1 − exp(−2c²·△a²))

And hence

△a·E[ka] ≤ 8·log(T·△a²/K)/△a + 2△a/(1 − exp(−2c²·△a²))

Using 1 − e^{−y} ≥ y − y²/2 (with y = 2c²·△a²), we have

△a·E[ka] ≤ 8·log(T·△a²/K)/△a + 2△a/(2c²·△a² − 2c⁴·△a⁴)
         = 8·log(T·△a²/K)/△a + 1/(c²·△a·(1 − c²·△a²))
         ≤ 8·log(T·△a²/K)/△a + 1/(c²·△a·(1 − c²))
         ≤ 8·log(T·△a²/K)/△a + √T/(5·c²·(1 − c²)·√K)

One can check that the maximum value of f(y) = log(T·y²/K)/y over y ∈ (0, 1] is O(√(T/K)). Thus

Σ_{a: a is Bad} △a·E[ka] ≤ Σ_{a: a is Bad} O(√T/√K) = O(√(KT))

All that remains is to show that E[δ] ≤ 10·√(K/T).

5 Lower Bounds
We started with ETC and showed that its expected regret is O(T^{2/3} K^{1/3} (log T)^{1/3}). Then we showed that the expected regret of Successive Elimination and UCB is O(√(KT log T)). Finally, we showed that MOSS beats UCB and Successive Elimination and has expected regret O(√(KT)). What next? Can we have an algorithm with even better regret than MOSS, i.e., o(√(KT))? If researchers are unable to find a better regret bound, can we conclude that a better bound is not possible? Surely not. It could be a lack of understanding or knowledge preventing us from coming up with a better algorithm.
Now our focus will be lower bounds, i.e., we will show that no algorithm can have a better regret bound than some threshold. In fact we will prove the following theorem.
Theorem 9. No algorithm (deterministic or randomized) can have expected regret of o(√(KT)) for all instances.
Note that the possibilities of algorithms are infinite. An algorithm can per-
haps do anything. Formally, a deterministic algorithm is any function that takes
as input K, T and for any 1 ≤ t ≤ T maps a sequence (of pairs consisting of arm
and received reward) {(a1 , r1 ), (a2 , r2 ), . . . , (at−1 , rt−1 )} to some arm at ∈ [K].
Since rewards can take any value in [0, 1], the number of distinct algorithms is
infinite. Thus, it is not possible to prove the above theorem by going over the algorithms one by one and showing that each particular algorithm fails to have a better regret bound (as the number of such algorithms is infinite).
Not surprisingly, we will need new tools to prove such a lower bound. We will now study elementary tools from information theory which will be sufficient to prove the above theorem. In fact we will not directly prove Theorem 9. First we will consider a toy problem (the coin testing problem from the exercises) and show a lower bound for this problem. This itself will require new tools from information theory. Then, after getting some familiarity with these tools, we will prove Theorem 9.
Testing Coin(ϵ, δ) Problem: It is promised that a given coin is either fair, i.e., p = 1/2, or ϵ-biased, i.e., p = 1/2 + ϵ (the algorithm knows ϵ). Give an algorithm A that takes samples from the coin (as input) and has the following guarantee:
• If p = 1/2, the algorithm must say 'Fair' with probability at least 1 − δ.
• If p = 1/2 + ϵ, the algorithm must say 'Biased' with probability at least 1 − δ.
From Hoeffding's inequality, it is easy to show the following:
Lemma 1. There is a deterministic algorithm that solves the Testing Coin problem and needs only O(1/ϵ²) samples for any constant δ < 1/2.
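A sketch of the Hoeffding-based tester behind Lemma 1 (illustrative; the exact constant in the number of samples is an assumption, chosen so that Hoeffding's inequality gives error probability at most δ).

```python
import math
import random

def test_coin(sample, eps, delta):
    """sample() returns 1 for Heads, 0 for Tails. Decides 'Fair' (p = 1/2)
    vs 'Biased' (p = 1/2 + eps) with error probability at most delta."""
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)  # O(1/eps^2) samples
    p_hat = sum(sample() for _ in range(m)) / m
    # If p = 1/2, Pr(p_hat >= 1/2 + eps/2) <= exp(-m*eps^2/2) <= delta;
    # symmetrically if p = 1/2 + eps.
    return "Biased" if p_hat >= 0.5 + eps / 2 else "Fair"

# Example: a coin with p = 1/2 + 0.1.
rng = random.Random(1)
print(test_coin(lambda: int(rng.random() < 0.6), eps=0.1, delta=0.05))
```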
We will now show the following lower bound:
Lemma 2. Let ϵ ≤ 1/3. Any deterministic algorithm needs at least (1 − 2δ)²/(5ϵ²) samples to solve the Testing Coin(ϵ, δ) problem.

Now the goal is to prove Lemma 2. As said before, we begin with under-
standing new tools.

Total Variation Distance and KL divergence


Recall that a probability distribution P over a finite domain Ω is a function P : Ω → [0, 1] that maps each ω ∈ Ω to a real in [0, 1] satisfying:
1. P(ω) ≥ 0 for all ω ∈ Ω
2. Σ_{ω∈Ω} P(ω) = 1.
Now we will define a notion of distance between two probability distributions. Any notion of distance d must necessarily satisfy these two properties: d(P, P) = 0 for any distribution P, and d(P, Q) > 0 for any two distinct distributions P and Q. There can be many ways to define a distance. For instance, we can consider the (squared) Euclidean or ℓ2 distance d(P, Q) = Σω (P(ω) − Q(ω))², or we can consider the ℓ1 norm, called the total variation distance, d(P, Q) = (1/2)·Σω |P(ω) − Q(ω)|. Each notion of distance has its own merits and limitations. For us, the notion of total variation distance, also known as statistical distance, will be important.
Definition 1. For any two distributions P and Q defined on a finite domain
Ω, the total variation distance between P and Q, denoted as dtv (P, Q) is defined
as
dtv(P, Q) = (1/2)·Σ_{ω∈Ω} |P(ω) − Q(ω)|

It is easy to see that:


1. dtv (P, Q) = 0 if P = Q.
2. dtv (P, Q) > 0 for any P ̸= Q.
3. dtv (P, Q) = dtv (Q, P ).
4. dtv (P, Q) ≤ dtv (P, R) + dtv (R, Q) (triangle inequality)
Now we are going to see one of the most useful and remarkable properties of total variation distance.
Lemma 3.

dtv(P, Q) = max_{S⊆Ω} |P(S) − Q(S)|        (1)

where P(S) = Σ_{ω∈S} P(ω) and Q(S) = Σ_{ω∈S} Q(ω).
Now we define the KL divergence between two distributions.
Definition 2. The KL divergence KL(P, Q) is defined as

KL(P, Q) = Σ_{ω∈Ω} P(ω)·ln( P(ω)/Q(ω) )
The KL divergence satisfies:
Lemma 4.
• KL(P, P) = 0
• KL(P, Q) > 0 if P ̸= Q
However, note that in general KL(P, Q) ̸= KL(Q, P), and the KL divergence does not satisfy the triangle inequality. That is why it is called a divergence, not a distance. The usefulness of the KL divergence comes from the fact that the KL divergence of a product distribution just adds up. Let us first see what a product distribution is and then the utility of the KL divergence (I talked about product distributions while discussing independent trials).
Definition 3. Given two distributions P and Q on the domains Ω1 and Ω2
respectively, then P × Q is a distribution on domain Ω1 × Ω2 such that
(P × Q)(ω1 , ω2 ) = P (ω1 )Q(ω2 ) ∀ω1 ∈ Ω1 , ω2 ∈ Ω2
In general, if Pi is a distribution on Ωi then the product distribution P1 × P2 ×
· · · × Pm is a distribution on domain Ω1 × · · · × Ωm such that
(P1 × P2 × · · · Pm )(ω1 , ω2 , . . . ωm ) = P1 (ω1 )P2 (ω2 ) · · · Pm (ωm )
We will use, for any distribution P, P^{⊗2} for P × P. In general, P^{⊗m} = P × · · · × P (m times).
The product distribution corresponds to independent trials. Drawing m independent samples from a distribution P is exactly the same as drawing one sample from P^{⊗m}. Now we will see why the KL divergence is extremely useful.
Lemma 5. If for all i ∈ [m], Pi and Qi are distributions on Ωi then
KL(P1 ×P2 ×· · ·×Pm , Q1 ×Q2 ×· · ·×Qm ) = KL(P1 , Q1 )+KL(P2 , Q2 )+· · ·+KL(Pm , Qm )
Corollary 1.
KL(P ⊗m , Q⊗m ) = m · KL(P, Q)
The above property is not true for total variation distance. We do not
have dtv (P ⊗m , Q⊗m ) = m dtv (P, Q). In fact, other than dtv (P ⊗m , Q⊗m ) ≤
m dtv (P, Q), we do not know of any other inequality between these two.
We have now seen both the total variation distance and the KL divergence, each having its own merits and limitations. Now we will see that these two are connected by an inequality, which forms the basis of powerful applications.
Lemma 6 (Pinsker's inequality). For any two probability distributions P and Q, we have

dtv(P, Q) ≤ √( (1/2)·KL(P, Q) )
The following is also useful.
Lemma 7 (The Bretagnolle–Huber bound).

dtv(P, Q) ≤ √(1 − e^{−KL(P,Q)}) ≤ 1 − (1/2)·e^{−KL(P,Q)}
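A small numeric sketch (illustrative, not from the notes) that computes dtv and KL for discrete distributions and checks Pinsker's and the Bretagnolle–Huber bounds, here for (1/2, 1/2) versus (1/2 + ϵ, 1/2 − ϵ).

```python
import numpy as np

def dtv(P, Q):
    return 0.5 * np.sum(np.abs(np.asarray(P, dtype=float) - np.asarray(Q, dtype=float)))

def kl(P, Q):
    # Assumes strictly positive probabilities.
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sum(P * np.log(P / Q)))

eps = 0.1
P = [0.5, 0.5]
Q = [0.5 + eps, 0.5 - eps]

d, k = dtv(P, Q), kl(P, Q)
print("dtv =", d)                                     # equals eps here
print("Pinsker bound:", np.sqrt(0.5 * k))             # >= dtv
print("Bretagnolle-Huber:", np.sqrt(1 - np.exp(-k)))  # >= dtv
```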
Now we have all the necessary tools to prove Lemma 2. First we will prove the lemma for deterministic algorithms.

Proof of Lemma 2. Suppose there is a deterministic algorithm A that solves the Testing Coin problem and uses m independent coin tosses. Note that the algorithm A is just a function A : {H, T}^m → {Fair, Biased} that maps a sequence of heads and tails of length m to Fair or Biased. We partition {H, T}^m into two subsets as follows: Af = {y ∈ {H, T}^m : A(y) = Fair} and Ab = {y ∈ {H, T}^m : A(y) = Biased}, i.e., Af consists of all those sequences that get mapped to Fair and Ab consists of all those sequences that get mapped to Biased.
Let p = Pr(Head). Let (1/2)^{⊗m} and (1/2 + ϵ)^{⊗m} be the product distributions corresponding to m independent coin tosses with p = 1/2 and p = 1/2 + ϵ respectively. For convenience, let Df = (1/2)^{⊗m} and Db = (1/2 + ϵ)^{⊗m}. We will bound the tv distance between Df and Db in two different ways.
Note that Df(Af) = Σ_{ω∈Af} Df(ω) ≥ 1 − δ and Db(Af) = Σ_{ω∈Af} Db(ω) ≤ δ (because the algorithm A solves the Testing Coin(ϵ, δ) problem). Thus by Lemma 3,

dtv(Df, Db) ≥ Df(Af) − Db(Af) ≥ 1 − 2δ.
Now let us bound the tv distance using Pinsker's inequality. As said earlier, the calculation of the KL divergence between Df and Db is easy as they are both product distributions.

KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) = m · KL(1/2, 1/2 + ϵ)

where KL(1/2, 1/2 + ϵ) is the KL divergence between the distributions (1/2, 1/2) and (1/2 + ϵ, 1/2 − ϵ).

KL(1/2, 1/2 + ϵ) = (1/2)·ln( 1/(2(1/2 + ϵ)) ) + (1/2)·ln( 1/(2(1/2 − ϵ)) )
                 = (1/2)·ln( 1/(4(1/4 − ϵ²)) ) = (1/2)·ln( 1/(1 − 4ϵ²) )
                 = (1/2)·( 4ϵ² + (4ϵ²)²/2 + (4ϵ²)³/3 + . . . ) = 2ϵ²·(1 + 2ϵ² + 16ϵ⁴/3 + · · · ) ≤ 10ϵ²

for ϵ ≤ 1/3 (probably 10 can be replaced by a smaller number, but as we are not optimizing constants I put a bigger number here so that the inequality is easy to derive).
Now we will use Pinsker's inequality to bound the tv distance.

dtv( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ≤ √( (1/2)·KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ) ≤ √(5mϵ²)

Now we have

1 − 2δ ≤ dtv(Df, Db) ≤ √( (1/2)·KL( (1/2)^{⊗m}, (1/2 + ϵ)^{⊗m} ) ) ≤ √(5mϵ²)

which implies that m ≥ (1 − 2δ)²/(5ϵ²).

Lower Bound for randomized algorithms
Lemma 8. Let ϵ ≤ 1/3. The expected number of coin tosses by any randomized algorithm that solves the Testing Coin(ϵ, δ) problem is at least (1 − 6δ)²/(15ϵ²).

In the class, I discussed that any randomized algorithm R is just a distribution over some deterministic algorithms, i.e., for any randomized algorithm R, there exist deterministic algorithms A1, . . . , As (for some finite s) and a distribution (q1, . . . , qs) over these deterministic algorithms such that R picks an Ai (i ∈ [s]) with probability qi and then runs the deterministic algorithm Ai on the given input.

Proof of Lemma 8. Let R be any randomized algorithm that solves Testing Coin(ϵ, δ). Let A1, . . . , As be deterministic algorithms, each requiring at most m1, m2, . . . , ms coin tosses respectively, and let R pick each Ai with probability qi. We will show that the expected number of coin tosses of R is at least (1 − 6δ)²/(15ϵ²). Consider a table with two rows and s columns: one row corresponding to p = 1/2 (called row 1) and one corresponding to p = 1/2 + ϵ (called row 2), and s columns corresponding to the deterministic algorithms A1, . . . , As.
We put ✓ in the cell corresponding to row 1 and column Aj if Pr(Aj returns 'Biased' when p = 1/2) ≤ 3δ. Otherwise we put ×, i.e., if Pr(Aj returns 'Biased' when p = 1/2) > 3δ. For row 2, we put ✓ in the cell corresponding to row 2 and column Aj if Pr(Aj returns 'Fair' when p = 1/2 + ϵ) ≤ 3δ. Otherwise we put ×, i.e., if Pr(Aj returns 'Fair' when p = 1/2 + ϵ) > 3δ. By cell(i, Aj) (i ∈ {1, 2}, j ∈ [s]), we mean the symbol (✓ or ×) in the cell corresponding to row i and column Aj in the above table.
Now,

δ ≥ Pr(R returns 'Biased' when p = 1/2)   (because R solves Testing Coin(ϵ, δ))
  = Σj qj·Pr(Aj returns 'Biased' when p = 1/2)
  ≥ Σ_{j: cell(1,Aj)=×} qj·3δ
⟹ Σ_{j: cell(1,Aj)=×} qj ≤ 1/3

Similarly,

Σ_{j: cell(2,Aj)=×} qj ≤ 1/3

This means that Σ_{j: cell(1,Aj)=✓ and cell(2,Aj)=✓} qj ≥ 1 − (1/3 + 1/3) = 1/3. That is, R puts at least 1/3 probability on those Aj for which both Pr(Aj returns 'Biased' when p = 1/2) ≤ 3δ and Pr(Aj returns 'Fair' when p = 1/2 + ϵ) ≤ 3δ are true. Let J = {j : cell(1, Aj) = ✓ and cell(2, Aj) = ✓}. Note that for any j ∈ J, Aj is a deterministic algorithm that solves the Testing Coin(ϵ, 3δ) problem. By Lemma 2, mj ≥ (1 − 2(3δ))²/(5ϵ²) = (1 − 6δ)²/(5ϵ²) (recall that mj is the number of coin tosses by Aj). Therefore, the expected number of coin tosses by R is Σj qj·mj ≥ (1/3)·(1 − 6δ)²/(5ϵ²) = (1 − 6δ)²/(15ϵ²).

Remark. I presented Lemma 8 and its proof in the class in a slightly different way. Lemma 8 is stronger than the version I stated in the class. There, I assumed that the running time (or the number of samples) of R is deterministic, so each of A1, . . . , As uses a fixed number of samples. As the above proof shows, we can show a lower bound on the expected number of samples (by the same proof).

Lower Bounds for Bandits


We start by showing lower bounds when there are only two arms. The best upper bound in this case is O(√T). We show a matching lower bound.
Theorem 10. No algorithm can achieve an expected regret of O(T 0.5−α ) for
any constant α > 0 for all instances.
Proof. Suppose there exists an algorithm B that has an expected regret of O(T^{0.5−α}) for some α > 0 for all instances. We will show that, using B, we can get a randomized algorithm R that solves Testing Coin(ϵ, 1/50) in o(1/ϵ²) coin tosses, contradicting Lemma 8.
Let ϵ < 1/3 be the input of the Testing Coin(ϵ, δ) problem. Consider the following bandit instance:
• K = 2 (there are two arms).
• Arm 1's reward takes value either 0 or 1. The probability of the reward being 1 is the probability that the coin (of Testing Coin(ϵ, 1/50)) outputs a Head.
• Arm 2's reward distribution is a Bernoulli with mean reward 1/2 + ϵ/2, i.e., the reward is either 0 or 1 and the probability of the reward being 1 is 1/2 + ϵ/2.
• T = (200/ϵ)^{1/(0.5+α)}
Note that in the above T = o(1/ϵ²). Now we describe a randomized algorithm R. The algorithm R runs B on the above bandit instance (recall that we assumed there exists an algorithm B that has an expected regret of O(T^{0.5−α}) for some α > 0 for all bandit instances). Whenever B pulls arm 2, the algorithm R generates a sample from arm 2's reward distribution (Bernoulli with mean reward 1/2 + ϵ/2) and gives it to the bandit algorithm B (this uses the internal randomness of the randomized algorithm R). Whenever B pulls arm 1, R uses a coin toss of the input coin (of Testing Coin(ϵ, δ)) to provide the reward to B: if the coin toss outputs Head, it provides a reward of 1 to B, and if the result of the coin toss is Tail, R provides a reward of 0 to B. Finally, R returns 'Fair' if B (after T rounds) pulls arm 2 at least T/2 times, and otherwise returns 'Biased' (i.e., if B pulls arm 1 more than T/2 times).

Note that R uses at most T = o(1/ϵ²) coin tosses. Now we prove the correctness of the algorithm R: if the coin is fair, i.e., p = Pr(Head) = 1/2, then B pulls arm 2 more than T/2 times (and hence R outputs 'Fair') with probability at least 49/50, and if the coin is biased, i.e., p = 1/2 + ϵ, then B pulls arm 1 more than T/2 times (and hence R outputs 'Biased') with probability at least 49/50. In other words, B will pull the best arm more than T/2 times with probability at least 49/50. To see this, note that in both cases E[R(T)] = (ϵ/2)·E[Ns], where Ns is the number of times B pulls the suboptimal arm (which is arm 1 when p = 1/2 and arm 2 when p = 1/2 + ϵ). Now we have

E[R(T)] = (ϵ/2)·E[Ns] ≤ T^{1/2−α}  ⟹  E[Ns] ≤ (2/ϵ)·T^{1/2−α} = (T^{1/2+α}/100)·T^{1/2−α} = T/100

By Markov's inequality, we have Pr(Ns ≥ T/2) ≤ 1/50. Thus, R solves the Testing Coin(ϵ, 1/50) problem in T = o(1/ϵ²) coin tosses, which contradicts Lemma 8.
Now our goal will be to show the lower bound for K arms.

Theorem 11. No algorithm can have an expected regret of ≤ √(KT)/5000 on all bandit instances.
To prove the above theorem, we first need to define Biased-Coin Identifica-
tion problem.

Biased-Coin Identification: Input is K coins C1 , . . . , CK and ϵ > 0. It


is promised that there is exactly one biased coin, for which Pr(Head) = 1/2 + ϵ, and the rest are fair coins. The goal of the algorithm is to return the index of the biased coin with probability at least 4/5.
Formally, let P^i = (1/2, . . . , 1/2 + ϵ, . . . , 1/2) be the K-tuple where every coordinate is 1/2 except the i-th one, which is 1/2 + ϵ. If the tuple corresponding to the input coins' probabilities is P^i, then the algorithm should return i with probability at least 4/5, i.e., Pr_{P^i}(y = i) ≥ 4/5, where y denotes the output of the algorithm and Pr_{P^i}(y = i) denotes the probability of the event y = i when the input coins probability tuple is P^i.
Theorem 12. Let ϵ ≤ 1/3. Any deterministic algorithm requires more than K/(1000ϵ²) total coin tosses to solve the Biased-Coin Identification problem.
We will prove the above theorem later. First we prove Theorem 11 assuming the above theorem.

Proof of Theorem 11 (assuming Theorem 12)


Proof. Suppose there is such an algorithm B. We now show that, using B, we can solve the Biased-Coin Identification problem with K/(2000ϵ²) total coin tosses, contradicting Theorem 12.
We run B on a bandit instance with K arms and T = K/(2000ϵ²). Whenever B pulls an arm j, we toss the coin Cj and provide to B a reward of 1 if the coin toss results in Head and a reward of 0 if the coin toss results in Tail. Finally, we return the index of the arm that has been pulled the highest number of times.
It is easy to see that if the i-th coin is biased, then the reward distribution (of the bandit instance given as input to B) of every arm except the i-th one is a Bernoulli with mean 1/2, and the reward distribution of the i-th arm is a Bernoulli with mean 1/2 + ϵ.
Note that for any of the above bandit instances, we have E[R(T)] = ϵ·(T − E[No]), where No is the number of times the optimal arm is pulled by the algorithm. By our assumption, E[R(T)] ≤ √(KT)/5000, which gives T − E[No] ≤ √(KT)/(5000ϵ) ≤ T/100. By Markov's inequality, Pr(T − No ≥ T/2) ≤ 1/50. In other words, with probability at least 49/50, B will pull the optimal arm strictly more than T/2 times (which also implies that the optimal arm will be pulled the highest number of times). Hence with probability at least 49/50 > 4/5, the output of the algorithm will be the index of the optimal arm, which is also the index of the biased coin. Moreover, our algorithm uses only K/(2000ϵ²) total coin tosses.

Proof of Theorem 12
Let A be an algorithm that solves the Biased-Coin Identification problem with m < K/(1000ϵ²) total coin tosses.
We will now see that A can be viewed as a complete binary tree of height m, where recall that m is the total number of coin tosses by the algorithm. All nodes of this tree are labelled by some index i ∈ [K]. From any internal node, the edge to the left child is labelled by H and the edge to the right child is labelled by T (this ordering is arbitrary). The algorithm starts from the root and reaches a leaf. When the algorithm reaches an internal node labelled by i, it means that the algorithm tosses the coin Ci. If the result of the coin toss is H, the algorithm moves to the left child, and if the result is T, it goes to the right child. In this way, the algorithm starts from the root node and reaches some leaf node. If the algorithm reaches a leaf with label i, it returns i. Below we define some notions related to this tree.

Notations: The label of a node v will be denoted by L(v) (which will be in [K]). The height of a node v will be denoted by h(v). By convention, we assume the height of the root node is 0, so the height of the leaf nodes is m. For any node v, we will use ℓ(v) for the left child of v and r(v) for the right child of v. The set of all nodes at height h will be denoted by Vh, and moreover Vh = (vh1, vh2, . . . , vh2^h). Note that |Vh| = 2^h.
Let y denote the output of the algorithm (which will be in [K]). Recall that P^i = (1/2, . . . , 1/2 + ϵ, . . . , 1/2) is the K-tuple where every coordinate is 1/2 except the i-th one, which is 1/2 + ϵ. For any node v, we use PrQ(v) for the probability that the algorithm reaches node v when the input coins probability tuple is Q. Moreover, pQ(v, ℓ(v)) is the probability that the algorithm moves to ℓ(v) (the left child of v) given that it is currently at v, when the input coins probability tuple is Q. Similarly, pQ(v, r(v)) is the probability that the algorithm moves to r(v) (the right child of v) given that it is currently at v, when the input coins probability tuple is Q. Let DQ^h = (PrQ(vh1), . . . , PrQ(vh2^h)) be the distribution induced on the nodes at height h when the input coins probability tuple is Q. Note that Σ_{k=1}^{2^h} PrQ(vhk) = 1 for all Q (as the algorithm reaches exactly one node at any height), so this is indeed a probability distribution.
Since A solves the Biased-Coin Identification problem, we must have Pr_{P^i}(y = i) ≥ 4/5 for all i ∈ [K].
Claim 1. dtv(D_{P^i}^m, D_{P^j}^m) ≥ 3/5 for all i ̸= j.

Proof. Consider the sets Si = {v ∈ Vm : L(v) = i} and Sj = {v ∈ Vm : L(v) = j} (Si is the set of all leaves with label i; similarly Sj is the set of all leaves with label j). We have D_{P^i}^m(Si) = Σ_{v∈Si} D_{P^i}^m(v) = Pr_{P^i}(y = i) ≥ 4/5 (because Pr_{P^i}(y = i) = Σ_{v∈Si} Pr_{P^i}(v) = Σ_{v∈Si} D_{P^i}^m(v)). Similarly, D_{P^j}^m(Sj) ≥ 4/5 and hence D_{P^j}^m(Si) ≤ 1/5. By Lemma 3, we have dtv(D_{P^i}^m, D_{P^j}^m) ≥ |D_{P^i}^m(Si) − D_{P^j}^m(Si)| ≥ 3/5.

Now we would like to bound KL(D_{P^i}^m, D_{P^j}^m). Unfortunately, bounding this directly is not easy. We will use F = (1/2, . . . , 1/2) (all coordinates 1/2) as a reference distribution. Let us calculate KL(DF^m, D_{P^j}^m). Our claim is that

KL(DF^m, D_{P^j}^m) = KL(DF^{m−1}, D_{P^j}^{m−1}) + Σ_{v∈V_{m−1}} (1/2^{m−1})·KL( (1/2, 1/2), (p_{P^j}(v, ℓ(v)), p_{P^j}(v, r(v))) )

Before proving the above, let us interpret it. The KL divergence of the distributions induced on the nodes at level m by F and P^j is equal to the divergence of the distributions induced on the nodes at level m − 1, plus the divergence between them conditioned on the fact that the algorithm is currently at height m − 1. For clarity, from now on we will use pj(v, ℓ(v)) for p_{P^j}(v, ℓ(v)) and pj(v, r(v)) for p_{P^j}(v, r(v)).
Let us prove the above.

KL(DF^m, D_{P^j}^m) = Σ_{v∈Vm} (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)) )
 = Σ_{v∈V_{m−1}} [ (1/2^m)·ln( 1/(2^m·Pr_{P^j}(ℓ(v))) ) + (1/2^m)·ln( 1/(2^m·Pr_{P^j}(r(v))) ) ]
 = Σ_{v∈V_{m−1}} [ (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)·pj(v, ℓ(v))) ) + (1/2^m)·ln( 1/(2^m·Pr_{P^j}(v)·pj(v, r(v))) ) ]
 = Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2^{m−1}·Pr_{P^j}(v)) ) + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2·pj(v, ℓ(v))) )
   + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2^{m−1}·Pr_{P^j}(v)) ) + Σ_{v∈V_{m−1}} (1/2^m)·ln( 1/(2·pj(v, r(v))) )
 = Σ_{v∈V_{m−1}} (1/2^{m−1})·ln( 1/(2^{m−1}·Pr_{P^j}(v)) )
   + Σ_{v∈V_{m−1}} (1/2^{m−1})·[ (1/2)·ln( 1/(2·pj(v, ℓ(v))) ) + (1/2)·ln( 1/(2·pj(v, r(v))) ) ]
 = KL(DF^{m−1}, D_{P^j}^{m−1}) + Σ_{v∈V_{m−1}} (1/2^{m−1})·KL( (1/2, 1/2), (pj(v, ℓ(v)), pj(v, r(v))) )

Note that KL((1/2, 1/2), (pj(v, ℓ(v)), pj(v, r(v)))) = 0 if L(v) ̸= j, and otherwise it is at most 10ϵ² (earlier, in the proof of Lemma 2, we showed that KL((1/2, 1/2), (1/2 + ϵ, 1/2 − ϵ)) ≤ 10ϵ² for ϵ ≤ 1/3). So we have

KL(DF^m, D_{P^j}^m) ≤ KL(DF^{m−1}, D_{P^j}^{m−1}) + (1/2^{m−1})·10ϵ²·|{v ∈ V_{m−1} : L(v) = j}|

This implies that

Σ_{j=1}^{K} KL(DF^m, D_{P^j}^m) ≤ Σ_{j=1}^{K} KL(DF^{m−1}, D_{P^j}^{m−1}) + 10ϵ² ≤ 10mϵ²

where the last inequality is by repeated application of the first inequality. Now applying Pinsker's inequality, we have

2·Σ_{j=1}^{K} dtv(DF^m, D_{P^j}^m)² ≤ 10mϵ²  ⟹  ( Σ_{j=1}^{K} dtv(DF^m, D_{P^j}^m)² )/K < 1/200

since we assumed that m < K/(1000ϵ²).

Since the average of dtv(DF^m, D_{P^1}^m)², . . . , dtv(DF^m, D_{P^K}^m)² is less than 1/200, there exist at least K/2 values of j ∈ [K] such that dtv(DF^m, D_{P^j}^m)² < 2·(1/200) = 1/100. Let i ̸= k be two such values. We have dtv(DF^m, D_{P^i}^m)² < 1/100 and dtv(DF^m, D_{P^k}^m)² < 1/100. Hence dtv(DF^m, D_{P^i}^m) < 1/10 and dtv(DF^m, D_{P^k}^m) < 1/10. Since dtv satisfies the triangle inequality, we have dtv(D_{P^k}^m, D_{P^i}^m) < 1/5, which contradicts Claim 1.

Instance dependent lower bound
Recall that the expected regret of an algorithm depends on the instance. We will use E[R(T, I)] for the expected regret of the algorithm on the instance I. Note that we have proved max_{I∈I} E[R(T, I)] = O(√(KT log T)) for UCB/Successive Elimination, where I is the set of all bandit instances. Moreover, we have also seen the following instance-dependent upper bound for both the UCB and Successive Elimination algorithms: E[R(T, I)] = O(log T)·Σ_{a: △a>0} 1/△a. We will now prove a matching instance-dependent lower bound.
Theorem 13. Let I be the set of all bandit instances (where recall that a bandit instance consists of values of K, T and K reward distributions D1, . . . , DK).
1. There cannot exist an algorithm for which E[R(T, I)] ≤ f(T)·cI for all bandit instances I ∈ I, where f(T) = o(log T) and cI ≥ 0 depends on I only (not on T).
2. There cannot exist an algorithm for which E[R(T)] ≤ (log T/10000)·Σ_{a: △a>0} 1/△a for all bandit instances.
Proof. Let us first prove the first part. Suppose there exists such an algorithm
B. Consider two bandit instances, both on 2 arms.
1. In the first instance, denoted by w, first arm is Bernoulli with a mean of
1/2 and the second arm is a Bernoulli with a mean of 0.4.
2. In the second instance, denoted by w′ , first arm is Bernoulli with a mean
of 1/2 and the second arm is a Bernoulli with a mean of 0.6.
Since all rewards are either 0 or 1, the run of the algorithm B on these two
instances can be seen as a binary tree of height T in a natural way. All internal
nodes of this tree are labelled by either 1 or 2 (representing the pull of the arm
1 or 2 respectively), while the leaves are unlabelled. From any internal node,
the edge to left child is labelled by 0 and the edge to right child is labelled by
1 (this ordering is arbitrary). The algorithm starts from the root and reaches a leaf.
When the algorithm reaches any internal node labelled by i ∈ {1, 2} then it
means that the algorithm pulls the arm i. If the reward received is 0 then the
algorithm moves to the left child and if the reward received is 1 then it goes to
the right child. In this way, algorithm starts from the root node and reaches
some leaf node. Recall the notations Vh , L(v), . . . etc. used for the tree from
the previous proof as we will follow the same notations for clarity.
Let E[R(T, w)] be the expected regret of the algorithm B on instance w after
round T and E[R(T, w′ )] be the expected regret of the algorithm B on instance
w′ after round T. For any i ∈ {1, 2}, let Ew[NiT] be the expected number of times arm i is pulled by algorithm B in T rounds when the instance is w. Similarly we define Ew′[N1T] and Ew′[N2T].
Note that E[R(T, w)] = 0.1·Ew[N2T] and E[R(T, w′)] = 0.1·Ew′[N1T]. By assumption, we also have E[R(T, w)] ≤ f(T)·cw for all T and E[R(T, w′)] ≤ f(T)·cw′ for all T, where f(T) = o(log T). Therefore, Ew[N2T] ≤ 10·cw·f(T) = o(log T) (as cw does not depend on T). Similarly, Ew′[N1T] = o(log T). Now we use Markov's inequality to get the following inequalities:

Prw(N2T ≥ T/2) ≤ 2·Ew[N2T]/T = o(log T)/T
Prw′(N2T ≥ T/2) ≥ 1 − Prw′(N1T ≥ T/2) ≥ 1 − 2·Ew′[N1T]/T ≥ 1 − o(log T)/T
We can see that for instance w, the event N2T ≥ T /2 happens with probabil-
ity at most o(log T )/T (which tends to 0 as T → ∞) whereas for the instance w′ ,
the same event happens with probability at least 1 − o(log T )/T (which tends
to 1 as T → ∞). This implies that the total variation distance between the
distributions induced on leaves (recall algorithm B is just a tree) by instance
w and w′ respectively is large. Formally let DvT and Dw T
′ be the distribution

induced on leaves for instance w and w respectively. Let S1 ⊆ VT be the set of
leaves such that at least T /2 + 1 nodes in the path from the root to the leaf are
labelled by 1 (in other words, if algorithm reaches any leaf in S1 then we have
N1T > T /2). Therefore S2 = V T \ S1 is the set of leaves such that at most T /2
nodes in the path from the root to the leaf are labelled by 1 (in other words, if
algorithm reaches any leaf in S2 then we have N1T ≤ T /2).
From Lemma 3,

T T T T o(log T ) o(log T ) o(log T )


dtv (Dw , Dw ′ ) ≥ Dw (S1 ) − Dw ′ (S1 ) = 1 − − =1−
T T T
Recall that in all previous lower bound proofs that we have seen, we used Pinsker's inequality dtv(P, Q) ≤ √((1/2)·KL(P, Q)) to relate the total variation distance and the KL divergence. Unfortunately, this inequality is not useful when P and Q are very far from each other (because when P and Q are far apart, though the tv distance is always at most 1, the KL divergence between them can be a large number and hence the above inequality will be vacuously true).
We will use the Bretagnolle–Huber bound: for any distributions P and Q,

dtv(P, Q) ≤ √(1 − e^{−KL(P,Q)}) ≤ 1 − (1/2)·e^{−KL(P,Q)}

Note that 1 − (1/2)·e^{−KL(P,Q)} is always < 1, and hence the above inequality is not vacuously true when dtv(P, Q) → 1 and KL(P, Q) → ∞.
Now

KL(Dw^T, Dw′^T) ≥ ln( 1/(2·(1 − dtv(Dw^T, Dw′^T))) ) ≥ ln( T/o(log T) )

for all T. This implies that

lim_{T→∞} KL(Dw^T, Dw′^T)/log T ≥ 1        (2)
By the chain rule, we also have (as we did in the previous lower bound proof)

KL(Dw^T, Dw′^T) = KL(Dw^{T−1}, Dw′^{T−1}) + Σ_{v∈V_{T−1}: L(v)=2} Prw(v)·KL(0.4, 0.6)

where KL(0.4, 0.6) denotes the KL divergence between arm 2's reward distribution under w (Bernoulli with mean 0.4) and under w′ (Bernoulli with mean 0.6); only nodes labelled 2 contribute, since arm 1 has the same reward distribution in both instances. Let d = KL(0.4, 0.6). By repeated application of the above equality, we have

KL(Dw^T, Dw′^T) = d · Σ_{h=0}^{T−1} Σ_{v∈Vh: L(v)=2} Prw(v)

Note that for any h, Σ_{v∈Vh: L(v)=2} Prw(v) is the probability that the algorithm B pulls arm 2 in round h + 1 (for instance w). Thus Σ_{h=0}^{T−1} Σ_{v∈Vh: L(v)=2} Prw(v) is just Ew[N2T] (hint: think of indicator random variables). Now

lim_{T→∞} KL(Dw^T, Dw′^T)/log T ≥ 1   (from inequality (2))
⟹ lim_{T→∞} Ew[N2T]/log T ≥ 1/d

Recall that earlier we showed that Ew[N2T] = o(log T), which is a contradiction to the above.
The proof of the second part is almost the same. Modify the instances w and w′ as follows, and the rest of the proof follows in a similar fashion:

1. In w, first arm is Bernoulli with a mean of 1/2 and the second arm is a
Bernoulli with a mean of 0.5 − ϵ.
2. In w′ , first arm is Bernoulli with a mean of 1/2 and the second arm is a
Bernoulli with a mean of 0.5 + ϵ.

6 Adversarial bandits
In adversarial bandits, there is no randomness in the received rewards. The
input is an unknown but fixed reward table (which is just a sequence of T
reward vectors r1 , . . . , rT where ri = (ri,1 , ri,2 , . . . , ri,K ) ∈ [0, 1]K ). As there
is no randomness in the received rewards, the randomness can come from the
algorithm only.
Input/Instance: K, T (known), unknown r1 , . . . , rT
In each round t ∈ [T ]:
1. algorithm picks an arm at ∈ [K].
2. algorithm receives the reward rtat .
How do we define regret? One way is as follows: the regret of an algorithm is the total reward received by the best algorithm that knows the reward table minus the (expected) reward received by the algorithm. The issue with this definition is that it makes the problem very hard, in the sense that any algorithm will have large regret (Ω(T)).
The regret of an algorithm is instead defined in the following way:

R(T) = max_a Σ_{t=1}^{T} rta − Σ_{t=1}^{T} rtat

That is, the regret of an algorithm is the total reward received by the best arm in hindsight minus the total reward received by the algorithm. There are several advantages of defining the regret in the above way:
1. We will be able to show algorithms with sublinear regret.
2. This can be a starting point to study stronger notions of regret (something in between the above two notions).
As will become clear soon, there is good reason to consider costs instead of rewards, with the goal of minimizing cost (rather than maximizing reward). This is just a matter of convenience, and to be in line with the existing literature we will consider costs.

Input/Instance: K, T (known), unknown cost vectors c_1, . . . , c_T where c_i = (c_{i1}, . . . , c_{iK})
In each round t ∈ [T]:
1. the algorithm picks an arm a_t ∈ [K].
2. the algorithm pays the cost c_{t,a_t}.
The regret of an algorithm is defined in the following way:
$$R(T) = \sum_{t=1}^T c_{t,a_t} \;-\; \min_a \sum_{t=1}^T c_{t,a}$$

Bandits with full feedback
An easier problem is bandits with full feedback, where after every round t the entire cost vector c_t is revealed. This is easier than adversarial bandits. As we will see later, it is also a special case of online learning.

Input/Instance: K, T (known), unknown cost vectors c_1, . . . , c_T
In each round t ∈ [T]:
1. the algorithm picks an arm a_t ∈ [K] and pays the cost c_{t,a_t}.
2. the algorithm gets to know the entire cost vector c_t.

A general strategy is to first design algorithms for bandits with full feedback and then modify them to work in the bandit setting. The following theorem shows that any deterministic algorithm has large regret, even for bandits with full feedback.

Theorem 14. Any deterministic algorithm will have a regret of at least T/2 on some instance, even with 2 arms and costs that are either 0 or 1.

Proof Sketch: Fix any deterministic algorithm. Construct an instance consisting of only 2 arms and binary costs 0 or 1 as follows: in any round t, exactly one arm has cost 1 and the other has cost 0 (that is, for any t, either c_{t1} = 1, c_{t2} = 0 or c_{t1} = 0, c_{t2} = 1). Since the algorithm is deterministic, the adversary knows in advance which arm it will pick in each round, and assigns that arm cost 1 and the other arm cost 0 (fill in the details yourself). Thus the algorithm pays a total cost of T, while some arm has total cost at most T/2 (this is directly implied by the fact that in any round one arm has cost 0 and the other cost 1).
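The "fill in the details" step can be made concrete with a small Python sketch of the adversary (our own illustration; the two-arm deterministic algorithm is modeled as a function from the revealed history to an arm in {0, 1}):

def adversary_vs_deterministic(algorithm, T):
    """Build the cost table online against a deterministic 2-arm algorithm.

    `algorithm(history)` is assumed to deterministically return an arm in {0, 1}
    given the list of cost vectors revealed so far (full feedback).
    """
    history, alg_cost, totals = [], 0, [0, 0]
    for t in range(T):
        arm = algorithm(history)          # deterministic => adversary can predict it
        cost = [0, 0]
        cost[arm] = 1                     # charge the arm the algorithm picks
        alg_cost += cost[arm]             # algorithm pays 1 every round => total T
        totals[0] += cost[0]; totals[1] += cost[1]
        history.append(cost)
    best_arm_cost = min(totals)           # one arm's total is at most T/2
    return alg_cost, best_arm_cost        # regret = alg_cost - best_arm_cost >= T/2

# Example: a deterministic "follow the leader" rule suffers regret T/2 here.
ftl = lambda h: 0 if sum(c[0] for c in h) <= sum(c[1] for c in h) else 1
print(adversary_vs_deterministic(ftl, 10))   # (10, 5)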
Note that this is in sharp contrast with stochastic bandits, where all the algorithms we saw were deterministic. We now give a randomized algorithm, Hedge, that has sublinear regret for bandits with full feedback.

Hedge Algorithm
Theorem 15. For any ϵ > 0 and cost vectors c_1, . . . , c_T with non-negative entries, the expected regret of the Hedge algorithm satisfies:
$$E[R(T)] \le \frac{\log K}{\epsilon} + \epsilon \sum_{t=1}^T p_t \cdot c_t^2 \qquad (3)$$
where $c_t^2 = (c_{t1}^2, \ldots, c_{tK}^2)$ and $p_t \cdot c_t^2 = \sum_a p_t(a) c_{ta}^2$.

Corollary 2. If each cost is in [0, 1] (that is, c_{ta} ∈ [0, 1] for each t and a) then the expected regret of Hedge satisfies:
$$E[R(T)] = O(\sqrt{T \log K})$$

Algorithm 6: Hedge Algorithm
Parameter: ϵ > 0 (will be chosen to optimize regret);
w_1(a) = 1 for all arms a;    /* Initialization */
for t = 1, . . . , T do
    p_t(a) = w_t(a) / Σ_{a'} w_t(a') for each arm a;
    p_t = (p_t(1), . . . , p_t(K));
    a_t ∼ p_t;
    Algorithm pays the cost c_{t,a_t};
    Algorithm receives the cost vector c_t;
    w_{t+1}(a) = w_t(a) e^{−ϵ c_{ta}} for all a;    /* update of weights */
end

Proof. If each cost is in [0, 1] then we have p_t · c_t^2 ≤ 1 for each t. Hence
$$E[R(T)] \le \frac{\log K}{\epsilon} + \epsilon T.$$
Choosing ϵ = √(log K / T) gives E[R(T)] ≤ 2√(T log K) = O(√(T log K)).

For the proof of the theorem, follow this material (beware of the change in notation):
https://people.eecs.berkeley.edu/~jiantao/2902021spring/scribe/EE290_Lecture_10.pdf
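For concreteness, here is a minimal Python sketch of Hedge in the full-feedback setting, following Algorithm 6; the function name and interface are our own illustrative choices.

import math, random

def hedge(cost_vectors, eps):
    """Run Hedge on a sequence of cost vectors (full feedback).

    cost_vectors: list of length-T lists, each with K non-negative entries.
    Returns the total cost paid by the (randomized) algorithm.
    """
    K = len(cost_vectors[0])
    w = [1.0] * K                                     # w_1(a) = 1 for all arms
    total_cost = 0.0
    for c in cost_vectors:
        Z = sum(w)
        p = [w[a] / Z for a in range(K)]              # p_t(a) = w_t(a) / sum of weights
        a_t = random.choices(range(K), weights=p)[0]  # a_t ~ p_t
        total_cost += c[a_t]                          # pay c_{t, a_t}
        w = [w[a] * math.exp(-eps * c[a]) for a in range(K)]  # full-feedback update
    return total_cost

With eps = √(log K / T) and costs in [0, 1], Corollary 2 above bounds the expected gap between this total cost and the best arm's total cost by O(√(T log K)).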

EXP3 Algorithm
The idea of the EXP3 algorithm is to simulate the Hedge algorithm on fake cost vectors. In any round t, the algorithm pulls arm a_t and hence gets to know only c_{t,a_t}. We define the fake cost vector $\hat{c}_t = (0, \ldots, \frac{c_{t,a_t}}{p_t(a_t)}, 0, \ldots, 0)$ (all entries are 0 except the a_t-th one, which is c_{t,a_t}/p_t(a_t)). The EXP3 algorithm simply invokes Hedge on the fake cost vectors to update the weights and proceeds.
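For concreteness, here is a minimal Python sketch of EXP3 in the bandit setting, written by us to mirror the description above; the interface (a list of cost vectors passed in up front) is only for illustration, since in the real setting the costs are revealed one round at a time.

import math, random

def exp3(cost_vectors, eps):
    """Run EXP3: only the pulled arm's cost is observed each round.

    Internally this is Hedge run on the importance-weighted fake costs
    c_hat_{t, a_t} = c_{t, a_t} / p_t(a_t); all other fake costs are 0.
    """
    K = len(cost_vectors[0])
    w = [1.0] * K
    total_cost = 0.0
    for c in cost_vectors:
        Z = sum(w)
        p = [w[a] / Z for a in range(K)]
        a_t = random.choices(range(K), weights=p)[0]
        total_cost += c[a_t]                  # only c[a_t] is observed
        c_hat = c[a_t] / p[a_t]               # fake cost for the pulled arm
        w[a_t] *= math.exp(-eps * c_hat)      # other weights stay put (fake cost 0)
    return total_cost

Choosing eps = √(log K / (K T)) matches the choice made at the end of the proof of Theorem 16 below.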

Theorem 16. If all costs are in [0, 1] (that is, c_{ta} ∈ [0, 1] for all t ∈ [T] and a ∈ [K]) then the expected regret of EXP3 satisfies:
$$E[R(T)] = O(\sqrt{KT \log K})$$

Proof of Theorem 16
Note that for any t and a, p_{ta} (the probability that the algorithm pulls arm a in round t) is a random variable. Similarly, each fake cost ĉ_{ta} is also a random variable.
Let a_t = (a_1, . . . , a_t) (recall a_t is the arm chosen by the algorithm in round t). Let b_t = (b_1, . . . , b_t) ∈ [K]^t. Let $p_{ta|a_{t-1}=b_{t-1}}$ denote the probability that the algorithm EXP3 pulls arm a in round t if a_1 = b_1, . . . , a_{t-1} = b_{t-1}. We may also write $p^{EXP3}_{ta|a_{t-1}=b_{t-1}}$ for clarity when we are talking about both Hedge and EXP3 simultaneously.

Algorithm 7: EXP3 Algorithm
Parameter: ϵ > 0 (will be chosen to optimize regret);
w_1(a) = 1 for all arms a;    /* Initialization */
for t = 1, . . . , T do
    p_t(a) = w_t(a) / Σ_{a'} w_t(a') for each arm a;
    p_t = (p_t(1), . . . , p_t(K));
    a_t ∼ p_t;
    Algorithm pays the cost c_{t,a_t};
    ĉ_t = (0, . . . , c_{t,a_t}/p_t(a_t), 0, . . . , 0);
    /* all entries of ĉ_t are zero except the a_t-th one, which is c_{t,a_t}/p_t(a_t) */
    For each arm a, w_{t+1}(a) = w_t(a) e^{−ϵ ĉ_{ta}};    /* update of weights */
end

Note that $p_{ta|a_{t-1}=b_{t-1}}$ is a real number and is some function of b_1, . . . , b_{t-1}, but fortunately we will not need its explicit expression for our analysis.
Note that given a_1 = b_1, . . . , a_{t-1} = b_{t-1}, the random variable ĉ_{ta} takes the value 0 if EXP3 does not pull arm a in the t-th round, and $\frac{c_{ta}}{p_{ta|a_{t-1}=b_{t-1}}}$ if EXP3 pulls arm a in the t-th round. Therefore,
$$E[\hat{c}_{ta} \mid a_{t-1} = b_{t-1}] = p_{ta|a_{t-1}=b_{t-1}} \cdot \frac{c_{ta}}{p_{ta|a_{t-1}=b_{t-1}}} = c_{ta}.$$
Also
$$E[\hat{c}_{ta}] = \sum_{(b_1,\ldots,b_{t-1}) \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, E[\hat{c}_{ta} \mid a_{t-1} = b_{t-1}] = \sum_{(b_1,\ldots,b_{t-1}) \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, c_{ta} = c_{ta}$$
Thus for any t and a, ĉ_{ta} is an unbiased estimator of c_{ta} (meaning E[ĉ_{ta}] = c_{ta}).
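This unbiasedness is easy to check numerically; the small simulation below (ours, with a fixed conditional distribution p standing in for the round-t pulling probabilities) estimates E[ĉ_{ta}] for each arm and compares it with c_{ta}.

import random

# Fixed round: true costs and the conditional distribution p over the 3 arms.
c = [0.3, 0.7, 0.1]
p = [0.5, 0.3, 0.2]
n_trials = 200_000
est = [0.0, 0.0, 0.0]
for _ in range(n_trials):
    arm = random.choices(range(3), weights=p)[0]
    est[arm] += c[arm] / p[arm]          # fake cost of the pulled arm; others get 0
est = [e / n_trials for e in est]
print(est)   # each entry should be close to the corresponding entry of c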
Recall that the algorithm EXP3 is just a simulation of the Hedge algorithm on fake cost vectors. So intuitively we should expect that the expected cost of EXP3 on the original cost table relates to the expected cost of the Hedge algorithm on the fake cost vectors. To see this formally, we will prove the following lemma.
Lemma 9. Let c_□ = (c_1, . . . , c_T) denote the input cost table and let ĉ_□ = (ĉ_1, . . . , ĉ_T) be the fake cost table. Then
$$E[Cost^{EXP3}(c_□)] = E\big[E[Cost^{Hedge}(\hat{c}_□)]\big]$$
where in $E\big[E[Cost^{Hedge}(\hat{c}_□)]\big]$ the outer expectation is over the randomness of EXP3.
Proof. Let e_j = (0, . . . , 0, 1, 0, . . . , 0) be the unit vector with 1 in the j-th entry and 0 elsewhere. For any b_T ∈ [K]^T, if a_T = b_T then we have $\hat{c}_t = e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$ for every t. Let $\hat{c}_{t|b_T} = e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$ and $\hat{c}_{□|b_T} = (\hat{c}_{1|b_T}, \ldots, \hat{c}_{T|b_T})$. Then
$$E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] = \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \, E[Cost^{Hedge}(\hat{c}_{□|b_T})]$$
$$= \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \sum_{t=1}^T p_t^{Hedge}(\hat{c}_{□|b_T}) \cdot e_{b_t} \frac{c_{t b_t}}{p^{EXP3}_{t b_t | b_{t-1}}}$$
$$= \sum_{b_T \in [K]^T} \Pr^{EXP3}(a_T = b_T) \sum_{t=1}^T c_{t b_t} \qquad \big(\text{because } p_t^{Hedge}(\hat{c}_{□|b_T}) \cdot e_{b_t} = p^{EXP3}_{t b_t | b_{t-1}}\big)$$
$$= E[Cost^{EXP3}(c_□)]$$

Now for any arm a, applying inequality (3),
$$E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] \le E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] + \frac{\log K}{\epsilon} + \epsilon\, E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big]$$

Now
$$E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] = \sum_{t=1}^T E[\hat{c}_{ta}] = \sum_{t=1}^T c_{ta}$$
$$E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big] = \sum_{t=1}^T \sum_a E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\big]$$

Now
$$E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\big] = \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, E\big[p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2 \mid a_{t-1} = b_{t-1}\big]$$
$$= \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, p^{EXP3}_{ta|b_{t-1}} \, E\big[\hat{c}_{ta}^2 \mid a_{t-1} = b_{t-1}\big]$$
$$= \sum_{b_{t-1} \in [K]^{t-1}} \Pr(a_{t-1} = b_{t-1}) \, p^{EXP3}_{ta|b_{t-1}} \cdot p^{EXP3}_{ta|b_{t-1}} \frac{c_{ta}^2}{\big(p^{EXP3}_{ta|b_{t-1}}\big)^2}$$
$$= c_{ta}^2$$

Hence for any arm a, we have
$$E[Cost^{EXP3}(c_□)] = E\big[E[Cost^{Hedge}(\hat{c}_□)]\big] \le \frac{\log K}{\epsilon} + E\Big[\sum_{t=1}^T \hat{c}_{ta}\Big] + \epsilon\, E\Big[\sum_{t=1}^T \sum_a p_{ta}^{Hedge}(\hat{c}_□)\,\hat{c}_{ta}^2\Big]$$
$$= \frac{\log K}{\epsilon} + \sum_{t=1}^T c_{ta} + \epsilon \sum_{t=1}^T \sum_a c_{ta}^2$$
$$\le \frac{\log K}{\epsilon} + \sum_{t=1}^T c_{ta} + \epsilon K T$$

Therefore,
$$E[Cost^{EXP3}(c_□)] \le \min_a \sum_{t=1}^T c_{ta} + \epsilon K T + \frac{\log K}{\epsilon}$$
Choosing ϵ = √(log K / (KT)) gives a regret of 2√(KT log K) = O(√(KT log K)).
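For completeness, the choice of ϵ comes from balancing the two ϵ-dependent terms; the short calculation (standard, and not spelled out in the notes) is:
$$\min_{\epsilon > 0}\Big( \epsilon K T + \frac{\log K}{\epsilon} \Big) \quad\text{is attained at}\quad \epsilon = \sqrt{\frac{\log K}{KT}}, \quad\text{giving}\quad \epsilon K T + \frac{\log K}{\epsilon} = 2\sqrt{KT \log K}.$$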

Martingales
We will study random processes on decision trees. Consider a decision tree T (generally this corresponds to a sequence of decisions/random variables Y_1, . . . , Y_n), a complete rooted tree of height n. Each edge uv is associated with a probability p_{uv}, and we have $\sum_{v \in C(u)} p_{uv} = 1$ for every node u, where C(u) is the set of all children of u. One can visualize a particle that starts at the root node and in each step picks a child with the respective probability (and in this way eventually reaches a leaf node).
A sequence of random variables X_0, X_1, . . . , X_n is called a martingale with respect to a decision tree T if
1. X_i : V_i → R for all i, where V_i is the set of nodes at height i.
2. E[X_i | H_{i-1}] = X_{i-1} for all i, where H_{i-1} is the history just after time i − 1. Note that H_{i-1} is a random variable and is the node reached by the particle just after time i − 1.
   Saying E[X_i | H_{i-1}] = X_{i-1} for all i is the same as saying: for all i and for all nodes v at height i − 1, E[X_i | v] = X_{i-1}(v), where E[X_i | v] is the expectation of X_i given that the particle is at node v. Further note that, given that the particle is at node v (at height i − 1), the values X_0, . . . , X_{i-1} are fixed, i.e., they depend only on v, and so X_{i-1}(v) is a real number.
One reason to study martingales is that Hoeffding-type concentration inequalities can be applied to a sequence of dependent random variables if they form a martingale.

Theorem 17 (Azuma's Inequality). If X_0, . . . , X_n is a martingale with respect to some decision tree T and if |X_i − X_{i-1}| ≤ c_i for all i, then
$$\Pr(|X_n - X_0| \ge \epsilon) \le 2 \exp\Big(-\frac{\epsilon^2}{2 \sum_i c_i^2}\Big)$$
One can derive Hoeffding's inequality for binary independent random variables from Azuma's inequality: if Y_1, . . . , Y_n are independent random variables such that Y_i ∈ {−1, 1} and E[Y_i] = µ_i, then
$$\Pr\Big(\Big|\sum_{i=1}^n (Y_i - \mu_i)\Big| \ge \epsilon\Big) \le 2 \exp\Big(-\frac{\epsilon^2}{2n}\Big)$$

To see the above, consider the decision tree T corresponding to the decisions Y_1, . . . , Y_n (T is a complete binary tree; from any node v at height i − 1, the probability of reaching the left child of v is Pr(Y_i = −1) and the probability of reaching the right child of v is Pr(Y_i = 1)).
Now (X_0, X_1, . . . , X_n) is a martingale, where X_0 = 0 and X_i = Y_1 + · · · + Y_i − µ_1 − · · · − µ_i. To see this, consider any node v at height i − 1. We have
$$E[X_i \mid v] = E[Y_1 + \cdots + Y_i - \mu_1 - \cdots - \mu_i \mid v]$$
$$= (Y_1 + \cdots + Y_{i-1} - \mu_1 - \cdots - \mu_{i-1})(v) + E[Y_i - \mu_i \mid v]$$
$$= X_{i-1}(v) + 0 = X_{i-1}(v)$$

Note that applying Azuma's inequality to (X_0, . . . , X_n) gives Hoeffding's inequality for Y_1, . . . , Y_n (up to the constant in the exponent, since here |X_i − X_{i-1}| = |Y_i − µ_i| ≤ 2).
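A quick simulation (our own illustration) of the last bound: for independent ±1 random variables, the empirical probability of a deviation of size ϵ sits well below 2·exp(−ϵ²/(2n)).

import math, random

n, eps, trials = 100, 20.0, 20_000
mu = 0.2                                  # E[Y_i] for Y_i in {-1, +1} with Pr(+1) = 0.6
p_plus = (1 + mu) / 2
exceed = 0
for _ in range(trials):
    # centered sum of n independent +/-1 variables
    s = sum((1 if random.random() < p_plus else -1) - mu for _ in range(n))
    if abs(s) >= eps:
        exceed += 1
print("empirical:", exceed / trials, "  bound:", 2 * math.exp(-eps**2 / (2 * n)))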
