
CSE599i: Online and Adaptive Machine Learning Winter 2018

Lecture 3: Stochastic Multi-Armed Bandits, Regret Minimization


Lecturer: Kevin Jamieson Scribes: Walter Cai, Emisa Nategh, Jennifer Rogers

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

Probability and Statistics Review


We begin with a review of the basics of probability and statistics, including independence, the law of large
numbers, and the central limit theorem. This will lay the foundation for an introduction to tail bounds and
their use in analyzing stochastic bandit problems.
Let X and Y be random variables. We say that X and Y are independent if, ∀A, B,

P (Y ∈ A|X ∈ B) = P (Y ∈ A)

We can use this definition of independence to show that the expectation of a product of functions of X and
Y is the product of the expectations, as long as X and Y are independent:

E[f(X)g(Y)] = E[E[f(X)g(Y) | Y]] = E[g(Y) E[f(X) | Y]] = E[g(Y) E[f(X)]] = E[f(X)] E[g(Y)]
We say Xi ∼ P i.i.d. for i = 1, . . . , n if the Xi are independent and identically distributed according to P.

Lemma 1. Suppose the true distribution P has mean E[Xi] = µ and variance E[(Xi − µ)²] = σ². If we define the estimator of the mean µ̂n = (1/n) Σ_{i=1}^n Xi, then E[(µ̂n − µ)²] = σ²/n.
Proof. We begin by substituting the definition of µ̂n and expanding the square:

    E[(µ̂n − µ)²] = E[ ( (1/n) Σ_i (Xi − µ) )² ]
                 = E[ (1/n²) Σ_i (Xi − µ)² + (1/n²) Σ_{i≠j} (Xi − µ)(Xj − µ) ]

Since Xi and Xj are independent when i ≠ j, the expectation of each cross term is 0. This allows us to simplify the expression:

    E[(µ̂n − µ)²] = (1/n²) Σ_i E[(Xi − µ)²]
                 = (1/n²) Σ_i σ²
                 = σ²/n
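As a quick numerical sanity check of Lemma 1 (a minimal sketch using NumPy; the Gaussian distribution, sample size, and number of repetitions are arbitrary choices, not part of the lecture), we can estimate E[(µ̂n − µ)²] by simulation and compare it to σ²/n:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000            # arbitrary sample size and number of repetitions
mu, sigma = 1.0, 2.0               # true mean and standard deviation

# Draw `trials` independent samples of size n and form the sample mean of each.
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)

empirical_mse = np.mean((mu_hat - mu) ** 2)
print(empirical_mse)               # close to sigma^2 / n = 0.08
print(sigma**2 / n)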

This bound on the squared variation of our estimator from the true mean will allow us to prove the Weak
Law of Large Numbers. Before we begin that proof, we will need another result: Markov’s Inequality.


Figure 1: Since Y is supported on the nonnegative numbers and P(Y > x) is a nonincreasing function of x, the integral of P(Y > x) over [0, ∞) is bounded below by t·P(Y > t).

Lemma 2. (Markov's Inequality): If Y is a nonnegative random variable, then for any t > 0, P(Y > t) ≤ E[Y]/t.

We present two different proofs of Markov’s Inequality.


Proof. We can write the expectation of Y as the integral

    E[Y] = ∫_0^∞ P(Y > x) dx

Note that we can take this integral from 0 since Y is nonnegative, and that P(Y > x) is a nonincreasing function. As Figure 1 illustrates, the nonincreasing nature of P(Y > x) implies the following lower bound:

    E[Y] = ∫_0^∞ P(Y > x) dx ≥ ∫_0^t P(Y > x) dx ≥ ∫_0^t P(Y > t) dx = t·P(Y > t)
We have recovered Markov's Inequality: P(Y > t) ≤ E[Y]/t.

Proof. (Alternate proof of Markov's Inequality) Let Y be a nonnegative random variable. Then

    Y ≥ t·1{Y ≥ t}

To see why this is true, first consider the case Y < t: the indicator function is zero, and by definition we know Y ≥ 0. In the second case, when Y ≥ t, the indicator function is 1, and the inequality reduces to Y ≥ t, which holds by assumption in this case.
Next, we take the expectation of both sides to get
E[Y ] ≥ tE[1{Y ≥ t}]
= tP (Y ≥ t)
We have recovered Markov's Inequality: P(Y ≥ t) ≤ E[Y]/t.
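To see how loose Markov's Inequality can be, a small numerical sketch (the exponential distribution and threshold below are arbitrary choices) compares the empirical tail probability of a nonnegative random variable with the bound E[Y]/t:

import numpy as np

rng = np.random.default_rng(0)
t = 3.0
y = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative, E[Y] = 1

print(np.mean(y > t))   # empirical P(Y > t), about e^{-3}, roughly 0.05
print(y.mean() / t)     # Markov bound E[Y]/t, roughly 0.33, far larger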

Theorem 1. (Weak Law of Large Numbers): For all ε > 0, lim_{n→∞} P(|µ̂n − µ| > ε) = 0.

Proof. Fix ε > 0. Then P(|µ̂n − µ| > ε) = P(|µ̂n − µ|² > ε²). Now, the random variable in question, |µ̂n − µ|², is nonnegative. This means we can apply Markov's Inequality, yielding

    P(|µ̂n − µ| > ε) = P(|µ̂n − µ|² > ε²)
                    ≤ E[|µ̂n − µ|²] / ε²
                    = σ² / (nε²)


where, in the last step, we applied Lemma 1. Next, we take the limit,

    lim_{n→∞} P(|µ̂n − µ| > ε) ≤ lim_{n→∞} σ²/(nε²) = 0

Example: Estimating the bias of a coin with Markov’s Inequality


We have already shown in the proof of Theorem 1 that, given fixed ε > 0, P(|µ̂n − µ| > ε) ≤ σ²/(nε²). Now, suppose we are trying to estimate the bias of a coin. We know the bias lies in [0, 1], and that the variance of a distribution with this support is bounded by σ² ≤ 1/4. This gives us

    P(|µ̂n − µ| > ε) ≤ 1/(4nε²)

If we want the probability of such an event to be bounded by δ, then we set the right hand side equal to δ and solve for ε. This yields

    |µ̂n − µ| ≤ √(1/(4nδ))

with probability at least 1 − δ. Thus, if we desire that |µ̂n − µ| ≤ ε with probability at least 1 − δ, then we must have n ≥ ε⁻²/(4δ). Later, we will see that the Central Limit Theorem suggests this to be very loose. Indeed, the CLT implies it suffices to take just n = ε⁻² log(2/δ)/2, which is substantially smaller.
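To get a feel for how different these two sample-size requirements are, a short calculation (a sketch; the choices ε = 0.05 and δ = 0.05 are arbitrary) evaluates both expressions:

import math

eps, delta = 0.05, 0.05                          # arbitrary accuracy and confidence levels

n_markov = 1 / (4 * delta * eps**2)              # n >= eps^-2 / (4 delta) from Markov's Inequality
n_clt = math.log(2 / delta) / (2 * eps**2)       # n = eps^-2 log(2/delta) / 2 suggested by the CLT

print(n_markov)   # 2000.0
print(n_clt)      # about 737.8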

Theorem 2. (Central Limit Theorem (CLT)) lim_{n→∞} √n(µ̂n − µ) ∼ N(0, σ²).
Proof. Consider the random variable

    Zn = (X1 + · · · + Xn − nµ) / √(nσ²)

We will prove the central limit theorem by calculating the characteristic function of this random variable, and showing that, in the limit as n → ∞, it is the same as the characteristic function of N(0, 1). We begin by rewriting the random variable Zn:

    Zn = (X1 + · · · + Xn − nµ) / √(nσ²)
       = Σ_{j=1}^n (Xj − µ) / √(nσ²)
       = (1/√n) Σ_{j=1}^n Yj

where Yj = (Xj − µ)/σ. Note that these Yj are i.i.d. with mean 0 and variance 1. We want to find a closed form for the characteristic function of Zn, which is given by

    φ_{Zn}(t) = E[exp(it Zn)]

where i = √−1. We substitute in our definition of Zn, yielding

    φ_{Zn}(t) = E[ exp( it (1/√n) Σ_j Yj ) ]

By the properties of exponentials, we can change the sum in the exponent into a product of exponentials:

    φ_{Zn}(t) = E[ Π_j exp( it Yj/√n ) ]

Since the Yj are independent, the expectation commutes with the product, and we can write

    φ_{Zn}(t) = Π_j E[ exp( it Yj/√n ) ]

We know the Yj are identically distributed, so each expectation in the product must have the same value. This enables us to simplify the equation using Y1, which is representative of all Yj:

    φ_{Zn}(t) = ( E[ exp( it Y1/√n ) ] )^n

Using the definition of the characteristic function, we can write this as a power of the characteristic function of Y1:

    φ_{Zn}(t) = ( φ_{Y1}(t/√n) )^n

Now, we can use Taylor's Theorem to approximate the characteristic function. For some (possibly complex) constant c, we have, as t/√n → 0,

    φ_{Y1}(t/√n) = 1 − t²/(2n) + c t³/(6 n^{3/2}) + o( t³/n^{3/2} )

Next, we recognize that, as n → ∞, the characteristic function approaches φ_{Zn}(t) → (1 − t²/(2n))^n. Using the identity e^x = lim_{n→∞} (1 + x/n)^n, we conclude that

    lim_{n→∞} φ_{Zn}(t) = e^{−t²/2}

which is exactly the characteristic function of the standard normal distribution, N(0, 1). Recalling that our definition of Zn was a transformation of the random variables Xi, we see that the sum of the Xi's will converge to a normal distribution N(nµ, nσ²).
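Before moving on, a quick empirical illustration of the theorem (a sketch using NumPy; the Bernoulli distribution and the parameter choices are arbitrary) standardizes many independent sample means and checks that their spread matches N(0, σ²):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 200_000
p = 0.3                                   # Bernoulli(p): mu = p, sigma^2 = p(1 - p)

s = rng.binomial(n, p, size=trials)       # S_n = X_1 + ... + X_n with X_i ~ Bernoulli(p)
z = np.sqrt(n) * (s / n - p)              # sqrt(n) (mu_hat_n - mu)

print(z.var())                            # close to p(1 - p) = 0.21
print(np.mean(z <= 0.5 * np.sqrt(p * (1 - p))))   # close to Phi(0.5), about 0.69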

Example: Estimating the bias of a coin using the Central Limit Theorem
Revisiting our coin flip example, we can use the central limit theorem to improve our asymptotic bound on |µ̂n − µ|. The central limit theorem tells us that our random variable √n(µ̂n − µ) ∼ N(0, σ²) as n → ∞, so we begin with the definition of a normal distribution:

    P(µ̂n − µ > ε) ≤ ∫_ε^∞ (1/√(2πσ²/n)) e^{−nx²/(2σ²)} dx
                  ≤ e^{−nε²/(2σ²)}

We can set this bound equal to δ and solve for ε, as in our previous example. Doing this, we find that, with probability at least 1 − δ, |µ̂n − µ| ≤ √(2σ² log(2/δ)/n) as n → ∞. This suggests that for “sufficiently large” n0, we have with probability at least 1 − δ that |µ̂n − µ| ≤ ε whenever n ≥ max{n0, 2σ² ε⁻² log(2/δ)}. Here, n0 is encoding the fact that the CLT is an asymptotic statement. To make such a statement rigorous for all finite n, without knowledge of some sufficiently large n0, we must appeal to different techniques. Note, however, that Markov's inequality holds for all n but is substantially looser than we would expect.

Chernoff Bounds
Central Limit Theorem guarantees are useful for large sample sizes, but if n is small, we would still like to bound the deviation of µ̂n from the true mean. Chernoff bounds are a technique for bounding the tail probability of a random variable using its moment generating function. We wish to bound the quantity P(µ̂n − µ > ε). For λ > 0, we can use the fact that e^x is monotonically increasing to transform our variable:
    P(µ̂n − µ > ε) = P(λ(µ̂n − µ) > λε)
                  = P( e^{λ(µ̂n − µ)} > e^{λε} )

Now, our random variable is nonnegative, and we can apply Markov's Inequality.

    P(µ̂n − µ > ε) ≤ e^{−λε} E[ e^{λ(µ̂n − µ)} ]
                  = e^{−λε} E[ e^{λ (1/n) Σ_i (Xi − µ)} ]
                  = e^{−λε} E[ Π_i e^{(λ/n)(Xi − µ)} ]
                  = e^{−λε} Π_i E[ e^{(λ/n)(Xi − µ)} ]          (1)
                  = e^{−λε} ( E[ e^{(λ/n)(X1 − µ)} ] )^n        (2)

In Equation (1), we have used the independence of the Xi's, which means that the product commutes with the expectation. In Equation (2), we have leveraged their identical distribution (so the expectations are identical). This sequence of steps, exponentiating and applying Markov's inequality together with independence, is the technique known as the Chernoff bound.
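The right-hand side of Equation (2) still depends on the free parameter λ > 0; in practice one picks the λ that makes the bound smallest. A small numerical sketch (assuming Bernoulli(p) rewards; the values of p, n, ε, and the λ grid are arbitrary choices) evaluates the right-hand side of (2) over a grid of λ and reports the minimum:

import numpy as np

p, n, eps = 0.5, 100, 0.1          # Bernoulli mean, sample size, and deviation (arbitrary)

def chernoff_rhs(lam):
    # E[e^{(lam/n)(X - mu)}] for X ~ Bernoulli(p), raised to the n-th power,
    # multiplied by e^{-lam * eps}: the right-hand side of Equation (2).
    mgf = p * np.exp((lam / n) * (1 - p)) + (1 - p) * np.exp(-(lam / n) * p)
    return np.exp(-lam * eps) * mgf**n

lams = np.linspace(0.1, 100, 2000)
print(min(chernoff_rhs(lam) for lam in lams))   # about 0.13, slightly below e^{-2 n eps^2} = 0.135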

Hoeffding’s Inequality
Moving on from the Chernoff bound technique, we describe a more general bound: Hoeffding's Inequality. We prove the inequality by leveraging the Chernoff bound as well as Hoeffding's Lemma, which we state and prove first.
Lemma 3. (Hoeffding's Lemma) Let X be a random variable taking values in [a, b] almost surely, with E[X] = 0. Then for any real s,

    E[e^{sX}] ≤ e^{s²(b−a)²/8}

Proof. We adapt the proof found in Duchi [7]. First note that e^{sx} is convex with respect to x. Thus we have:

    e^{sX} ≤ ((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb}

By linearity, and using E[X] = 0:

    E[e^{sX}] ≤ ((b − E[X])/(b − a)) e^{sa} + ((E[X] − a)/(b − a)) e^{sb} = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}

For ease of reading, let p = −a/(b − a), also noting that a = −p(b − a). We isolate a factor e^{sa}:

    (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb} = (1 − p) e^{sa} + p e^{sb}
                                            = ((1 − p) + p e^{s(b−a)}) e^{sa}
                                            = (1 − p + p e^{s(b−a)}) e^{−sp(b−a)}

Substitute u = s(b − a):

    (1 − p + p e^{s(b−a)}) e^{−sp(b−a)} = (1 − p + p e^{u}) e^{−pu}

Define the function φ of u as the logarithm of the above expression:

    φ(u) = log( (1 − p + p e^{u}) e^{−pu} ) = −pu + log(1 − p + p e^{u})

We write E[e^{sX}] ≤ e^{φ(u)} and we proceed to bound φ(u). Our route is to apply Taylor's theorem; there must exist some z ∈ [0, u] where

    φ(u) = φ(0) + uφ'(0) + (1/2)u²φ''(z) ≤ φ(0) + uφ'(0) + (1/2)u² sup_z φ''(z)          (3)

We derive the first and second derivatives of φ(u):

    φ'(u) = −p + p e^{u} / (1 − p + p e^{u})
    φ''(u) = p(1 − p) e^{u} / (1 − p + p e^{u})²

We have φ(0) = φ'(0) = 0, so we may rewrite Equation (3):

    φ(u) ≤ φ(0) + uφ'(0) + (1/2)u² sup_z φ''(z) = (1/2)u² sup_z φ''(z)

We therefore need only maximize φ''(z). We substitute y for e^{z}:

    p(1 − p)y / (1 − p + py)²

We note that the expression is linear over quadratic in y and therefore concave for y > 0. It therefore suffices to find the critical point in y:

    d/dy [ p(1 − p)y / (1 − p + py)² ] = p(1 − p)(1 − p − py) / (1 − p + py)³

We have two critical points to consider: y = (1 − p)/p and y = (p − 1)/p. We note that a ≤ E[X] = 0, so p = −a/(b − a) ∈ [0, 1]. Hence (1 − p)/p ≥ 0 and (p − 1)/p ≤ 0. We therefore select the candidate that falls inside the nonnegative window: y = (1 − p)/p. Note that if p = 1, then (1 − p)/p = (p − 1)/p = 0; that is, there is in fact only a single critical point in this situation. Substituting back in, we have:

    sup_z φ''(z) ≤ p(1 − p)·((1 − p)/p) / (1 − p + p·(1 − p)/p)² = 1/4

We may conclude:

    E[e^{sX}] ≤ e^{φ(u)} ≤ e^{u²/8} = e^{s²(b−a)²/8}

We now prove the main result of this subsection, which follows from Lemma 3.

Theorem 3. (Hoeffding's Inequality) Given independent random variables {X1, . . . , Xm} where ai ≤ Xi ≤ bi almost surely (with probability 1), we have:

    P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε ) ≤ exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² )

Proof. We adapt the proof found in Duchi [7]. As mentioned earlier, the result is a straightforward application of Lemma 3. For all 1 ≤ i ≤ m, define a new variable Zi as the difference between Xi and its expectation:

    Zi = Xi − E[Xi]

This implies that E[Zi] = 0. Moreover, we may bound the domain of Zi inside [ai − E[Xi], bi − E[Xi]]. In particular, we note that the interval must still have length bi − ai, independent of the expectation of Xi. Let s be some positive value. We have:
    P( Σ_{i=1}^m Zi ≥ t ) = P( exp(s Σ_{i=1}^m Zi) ≥ e^{st} ) ≤ E[ Π_{i=1}^m e^{sZi} ] / e^{st}          (Chernoff)

By independence of the Zi we may shift the expectation inside the product and continue. Recall that Zi must still live in an interval of length bi − ai.

    E[ Π_{i=1}^m e^{sZi} ] / e^{st} = ( Π_{i=1}^m E[e^{sZi}] ) / e^{st}
                                    ≤ e^{−st} Π_{i=1}^m e^{s²(bi − ai)²/8}          (Hoeffding's Lemma)
                                    = exp( −st + (s²/8) Σ_{i=1}^m (bi − ai)² )

We now substitute a conveniently engineered value of s to conclude our result. Note that s > 0, so the earlier restriction on s is satisfied:

    s = 4t / Σ_{i=1}^m (bi − ai)²

Substituting in s as well as t = mε, we have:

    P( Σ_{i=1}^m Zi ≥ mε ) = P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε )
        ≤ exp( −( 4mε / Σ_{i=1}^m (bi − ai)² )·mε + (1/8)( 4mε / Σ_{i=1}^m (bi − ai)² )² Σ_{i=1}^m (bi − ai)² )
        = exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² )
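As a numerical check of Theorem 3 (a sketch; the Bernoulli rewards, m, ε, and number of repetitions are arbitrary choices), we can compare the empirical deviation probability of a sample mean of [0, 1]-valued variables with the bound exp(−2ε²m):

import numpy as np

rng = np.random.default_rng(0)
m, eps, p, trials = 100, 0.1, 0.5, 200_000

s = rng.binomial(m, p, size=trials)     # sum of m Bernoulli(p) variables, each X_i in [0, 1]
deviation = s / m - p                   # (1/m) sum X_i - (1/m) sum E[X_i]

print(np.mean(deviation >= eps))   # empirical probability, about 0.03
print(np.exp(-2 * eps**2 * m))     # Hoeffding bound, about 0.135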

Stochastic Multi-Armed Bandits


In the stochastic multi-armed bandit problem, the player is presented with a collection of actions, or arms, to
choose from in each round of play. Each arm distributes rewards according to some (unknown) subgaussian
distribution over [0, 1]. Rewards are i.i.d., with E[Xi,t ] = µi for all arms i and times t. The goal of the player
is to minimize the cumulative regret, which is defined as the difference between the best reward achievable by the strategy of playing a single fixed arm for T time steps and the player's rewards over those T time steps. If the player chooses arm It at time t, the regret can be written as
" T #
X
R(T ) = max E (Xj,t − XIt ,t )
j
t=1

For an alternative formulation of the regret, define each arm’s gap from the best arm as ∆i = maxj µj − µi .
We can rewrite the regret by taking the expectation inside the summation:

    R(T) = Σ_{t=1}^T ( max_j E[Xj,t] − E[XIt,t] )
         = Σ_{t=1}^T ( max_j µj − E[µIt] )
         = Σ_{i=1}^n ∆i E[Ti]

where Ti is the total number of times we have played arm i up to time T. We see that the regret is a sum, over the suboptimal arms, of the expected number of times each such arm is played multiplied by that arm's gap from the optimal arm.
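The decomposition R(T) = Σ_i ∆i E[Ti] is also convenient computationally: given the true means and the number of times each arm was pulled, the (pseudo-)regret is a dot product of gaps with pull counts. A minimal sketch (the means and pull counts below are arbitrary illustrative values):

import numpy as np

mu = np.array([0.9, 0.8, 0.5])      # true arm means (arm 0 is optimal)
pulls = np.array([900, 80, 20])     # T_i: times each arm was played, T = 1000

gaps = mu.max() - mu                # gap of each arm from the best arm
print(np.dot(gaps, pulls))          # sum_i gap_i * T_i = 0.1*80 + 0.4*20 = 16.0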

UCB (Upper Confidence Bounds)


Auer et al. (2002) [3] introduced simple and efficient allocation strategies based on upper confidence bounds
for a bandit problem with any reward distribution with known bounded support. Their algorithms demonstrate logarithmic regret performance uniformly over time, not just asymptotically.
To implement the UCB1 algorithm, we need both µ̂i,Ti, the empirical reward estimate of arm i after it has been pulled Ti many times, and an upper confidence bound on that estimate. To calculate our empirical reward estimate, we simply average the observed rewards over all rounds where we pull arm i. In the expression below, we assume |{t : It = i}| = Ti, that is, we have progressed through an appropriate number of rounds so that arm i has been pulled Ti many times:

    µ̂i,Ti = ( Σ_{t: It = i} Xi,t ) / Ti
In addition to our empirical reward estimates, we need an upper confidence bound to describe the largest plausible mean of each arm. Using Hoeffding's Inequality and Chernoff bounds, we can construct such a confidence interval. With probability at least 1 − t^{−α}, the empirical mean µ̂i,Ti will differ from the true mean by at most ε = √(α log(t) / (2Ti)). The UCB1 algorithm chooses the arm with the largest such upper bound:

    UCBi,t = µ̂i,Ti + √(α log(t) / (2Ti)),        It := arg max_{i∈[n]} UCBi,t

We see that our confidence bound, √(α log(t) / (2Ti)), grows slowly as we play for more rounds (as t increases), ensuring that we never stop playing any given arm. The confidence bound for arm i shrinks quickly as we pull the arm (as Ti increases).
The pseudocode may be found in Algorithm 1.
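Alongside the pseudocode, here is a minimal Python sketch of the UCB1 rule described above (assuming Bernoulli rewards and α = 2; the arm means, horizon, and seed are arbitrary choices, and this is an illustration rather than a tuned implementation):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.4])        # true (unknown) Bernoulli means
n, T, alpha = len(mu), 10_000, 2.0

counts = np.zeros(n)                  # T_i: number of pulls of arm i
sums = np.zeros(n)                    # running sum of rewards from arm i

for t in range(1, T + 1):
    if t <= n:
        i = t - 1                     # play each arm once
    else:
        ucb = sums / counts + np.sqrt(alpha * np.log(t) / (2 * counts))
        i = int(np.argmax(ucb))
    reward = rng.binomial(1, mu[i])   # observe the reward of the chosen arm
    counts[i] += 1
    sums[i] += reward

gaps = mu.max() - mu
print(counts)                         # suboptimal arms are pulled only O(log T) times
print(np.dot(gaps, counts))           # realized pseudo-regret, sum_i gap_i * T_i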
Theorem 4. (Regret bound for the UCB algorithm) For T ≥ 1 and α > 1,

    R(T) ≤ Σ_{i: ∆i > 0} ( 4α ∆i^{−1} log(T) + (2α/(α−1)) ∆i )

Proof. Suppose, without loss of generality, that arm 1 is optimal. Then, arm i ≠ 1 will only be played in two cases: either arms 1 and i have been sampled insufficiently to distinguish between their means, or the upper confidence bound given by Hoeffding's inequality fails for arm 1 or for arm i. We begin by bounding the chance that these confidence bounds fail.
Suppose that we have the following two events At, Bt.

Algorithm 1 UCB
1: procedure UCB({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   for 1 ≤ t ≤ n do
3:     It ← t                                   ▷ Play each arm once
4:   end for
5:   for n + 1 ≤ t ≤ T do
6:     It ← arg max_{i∈{1,...,n}} UCBi,t−1
7:     Observe reward XIt,t
8:   end for
9: end procedure

    At: µ̂i,Ti ≤ µi + √(α log(t) / (2Ti))

    Bt: µ̂1,T1 ≥ µ1 − √(α log(t) / (2T1))

We wish to bound the probabilities of the complements of events At and Bt occurring. We will apply Hoeffding's inequality (Theorem 3). At fails when

    µ̂i,Ti − µi > √(α log(t) / (2Ti))
By Theorem 3 we have:

    P(At^c) = P(µ̂i,Ti − µi > ε) ≤ exp( −2ε²t² / Σ_{j=1}^t (bj − aj)² )
                                = exp( −2ε²t² / Σ_{j=1}^t (1 − 0)² )
                                = exp(−2ε²t)

We plug in our bounding value ε = √(α log(t) / (2Ti)):

    P( µ̂i,Ti − µi > √(α log(t) / (2Ti)) ) ≤ exp( −2t·α log(t) / (2Ti) )
                                          = exp( −tα log(t) / Ti )
                                          ≤ exp( −tα log(t) / t )
                                          = exp(−α log(t))
                                          = t^{−α}          (4)

The statement and justification are identical for the complement of event Bt, and we return to the task of bounding the number of suboptimal arm pulls. A suboptimal arm i is only played if its upper confidence bound exceeds that of arm 1, meaning that

    µ̂i,Ti + √(α log(t) / (2Ti)) > µ̂1,T1 + √(α log(t) / (2T1))

Suppose that both At and Bt hold. In this case, if suboptimal arm i is pulled, it is due to insufficient sampling up to this point.
Since At has been assumed to be true, we generate the following bound:

    µi + 2√(α log(t) / Ti) ≥ µ̂i,Ti + √(α log(t) / (2Ti))          (5)

Next, we use our assumption of Bt being true to lower-bound arm 1's upper confidence bound, which is exceeded by the right hand side of Line (5) whenever arm i is selected:

    µ̂1,T1 + √(α log(t) / (2T1)) ≥ µ1          (6)

Chaining the arm-selection inequality with Equations (5) and (6), we have

    µi + 2√(α log(t) / Ti) ≥ µ1

Rearranging, we have:

    √(α log(t) / Ti) ≥ (µ1 − µi) / 2

Now, recall our definition of the optimality gap of an arm, ∆i = maxj µj − µi. Since we know arm 1 is optimal, this becomes ∆i = µ1 − µi. Our inequality becomes

    √(α log(t) / Ti) ≥ ∆i / 2

Solving for the number of times Ti that the arm has been played, we arrive at

    Ti ≤ 4∆i^{−2} α log(t)
       ≤ 4∆i^{−2} α log(T)          (7)

Thus when At and Bt hold, we only play suboptimal arm i at most 4∆i^{−2} α log(T) times.
Recall that It can only be equal to i if either arm i has been sampled insufficiently (fewer than 4∆i^{−2} α log(T) times) or either event At or Bt fails. For any arm i, the expected number of times it is played up to round T under UCB is:

    E[Ti] = Σ_{t=1}^T E[1(It = i)]
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T E[1{At^c ∪ Bt^c}]
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T ( E[1{At^c}] + E[1{Bt^c}] )          (8)
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T ( t^{−α} + t^{−α} )                  (9)
          = 4α∆i^{−2} log(T) + 2 Σ_{t=1}^T t^{−α}

The inequality in Line (8) comes from the union bound on events At^c and Bt^c. The inequality in Line (9) comes from Equation (4). In order to bound the second term on the right hand side, we note that for α > 1,

    Σ_{t=1}^T t^{−α} ≤ 1 + ∫_1^∞ x^{−α} dx = 1 + 1/(α−1) = α/(α−1)

Therefore we have:

    E[Ti] ≤ 4α∆i^{−2} log(T) + 2α/(α−1)          (10)

The desired result follows from summing over all suboptimal arms:

    R(T) = Σ_{i≠1} ∆i E[Ti]
         ≤ Σ_{i≠1} ( 4α∆i^{−1} log(T) + (2α/(α−1)) ∆i )

Theorem 5. (Lai and Robbins (1985) [13]) Lai and Robbins provided an optimal asymptotic lower bound on the expected regret of any bandit algorithm: if R(T) = o(T^β) for all β > 0, then, asymptotically,

    E[Ti] ≥ log(T) / ∆i²

We refer the reader to Kaufmann et al. [12] for a proof of the above theorem. For an overview of the UCB family of algorithms, refer to Bubeck and Cesa-Bianchi (2012, chap. 2) [5].
If the gap between the best and second-best arm is very small, then the ∆i^{−1} penalty in the regret bound becomes very large. However, as the gap becomes small, we would imagine that playing the second-best arm becomes a decent strategy. To this end, we seek a “worst case” regret bound for the UCB algorithm. This bound, which is independent of the ∆i, is shown in the following theorem.

Theorem 6. For all T ≥ n, a gap-agnostic bound achieved by the UCB algorithm in round T is

    E[R(T)] ≤ (1 + 4α)√(nT log(T)) + (2α/(α−1)) n
α−1
Proof. Divide the arms into two groups:

• Group G1 contains “almost optimal” arms with ∆i < √((n/T) log(T)).

• Group G2 contains arms with ∆i ≥ √((n/T) log(T)).

The total regret is the sum of the regret of each group. The maximum total regret incurred due to pulling arms in G1 is given by

    Σ_{i∈G1} Ti ∆i

By definition, the per-pull regret on any arm i ∈ G1 is bounded by ∆i < √((n/T) log(T)). We may therefore bound the total regret on arms in G1 as follows:

    Σ_{i∈G1} Ti ∆i ≤ √((n/T) log(T)) Σ_{i∈G1} Ti
                   ≤ T · √((n/T) log(T))
                   = √(nT log(T))

We may now shift our focus to group G2. Recall that by definition, for all arms i ∈ G2 we have ∆i ≥ √((n/T) log(T)). Rearranging, we have:

    ∆i^{−1} ≤ √( T / (n log(T)) )          (11)

We begin by building on Equation (10) and summing over all arms in G2:

    Σ_{i∈G2} E[Ti] ∆i ≤ Σ_{i∈G2} ( 4α∆i^{−2} log(T) + 2α/(α−1) ) ∆i          (12)
                      = Σ_{i∈G2} ( 4α∆i^{−1} log(T) + (2α/(α−1)) ∆i )
                      ≤ Σ_{i∈G2} ( 4α √(T / (n log(T))) log(T) + (2α/(α−1)) · 1 )          (13)
                      = Σ_{i∈G2} ( 4α √(T log(T) / n) + 2α/(α−1) )
                      ≤ n · ( 4α √(T log(T) / n) + 2α/(α−1) )          (14)
                      ≤ 4α √(nT log(T)) + (2α/(α−1)) n
Line (12) follows from multiplying ∆i (a necessarily nonnegative value) onto both sides of Equation (10) and summing over all arms in G2. Line (13) follows from Equation (11) and from the fact that µi lies in [0, 1], implying ∆i may not exceed 1. For Line (14), note that the summand is independent of which arm i we are iterating over and |G2| ≤ n.
We sum the expected regret over all arms in groups G1 and G2 to arrive at the total expected regret:

    E[R(T)] = Σ_{1≤i≤n} E[Ti] ∆i
            = Σ_{i∈G1} E[Ti] ∆i + Σ_{i∈G2} E[Ti] ∆i
            ≤ √(nT log(T)) + 4α√(nT log(T)) + (2α/(α−1)) n
            = (1 + 4α)√(nT log(T)) + (2α/(α−1)) n

Putting Theorems 4 and 6 together, we see that the UCB algorithm operates under two distinct regimes. During the initial “burn-in” period, the algorithm experiences O(√(nT log T)) regret to learn the arm payouts. As the game continues, and the gaps between the arms become easier to distinguish, the algorithm moves into the second regime, where its regret is O(Σ_i ∆i^{−1} log T).
UCB algorithms are an active research field in machine learning, especially for the contextual bandit problem [5, 8, 15]. For an overview of the UCB family of algorithms, refer to Bubeck and Cesa-Bianchi (2012, chap. 2) [5] and to Auer and Ortner [4].

Thompson Sampling (Posterior Sampling or Probability Matching)


Thompson sampling (TS) is one of the oldest heuristics for multi-armed bandit problems [20]. Thompson
sampling takes a Bayesian approach to finding the optimal arm while balancing the trade-off between exploration
and exploitation. In the setting considered here, the reward of each arm is Bernoulli-distributed and its expected reward is unknown. The objective is to find the optimal arm, the one that gives maximum expected cumulative
reward. The TS algorithm initially assumes arm i to have prior Beta (1, 1) on µi , which is natural because
Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed Si (t) successes (reward=1) and
Fi (t) failures (reward=0) in Ti (t) = Si (t) + Fi (t) plays of arm i, the algorithm updates the distribution on µi
as Beta(Si (t) + 1, Fi (t) + 1). The algorithm then samples from these posterior distributions of the µi ’s, and
plays an arm according to the probability of its mean being the largest. The Thompson sampling algorithm
is given in Algorithm 2.

Algorithm 2 Thompson Sampling

1: procedure Thompson({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   Si, Fi, Ti ← 0 for all i ∈ [n]
3:   for 1 ≤ t ≤ T do
4:     for 1 ≤ i ≤ n do
5:       µ̂i ∼ Beta(Si + 1, Fi + 1)                   ▷ Draw each µ̂i according to its posterior distribution
6:     end for
7:     It ← arg max_{i∈[n]} µ̂i
8:     TIt ← TIt + 1                                  ▷ Increment the total counter for arm It
9:     XIt,t ∼ Bernoulli(µIt)                         ▷ Observe reward XIt,t
10:    SIt ← SIt + XIt,t                              ▷ Update the success counter
11:    FIt ← TIt − SIt                                ▷ Update the failure counter
12:  end for
13: end procedure
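A minimal Python sketch of Thompson sampling for Bernoulli rewards, following Algorithm 2 (the arm means, horizon, and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.4])        # true (unknown) Bernoulli means
n, T = len(mu), 10_000

successes = np.zeros(n)               # S_i
failures = np.zeros(n)                # F_i

for t in range(T):
    # Sample a mean for each arm from its Beta(S_i + 1, F_i + 1) posterior.
    theta = rng.beta(successes + 1, failures + 1)
    i = int(np.argmax(theta))
    reward = rng.binomial(1, mu[i])   # observe the reward of the chosen arm
    successes[i] += reward
    failures[i] += 1 - reward

pulls = successes + failures
print(pulls)                          # pulls concentrate on the best arm
print(np.dot(mu.max() - mu, pulls))   # realized pseudo-regret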

Thompson Sampling has received considerable attention in industry as well (e.g. Scott (2010) [18],
Graepel et al. (2010) [11], and Tang et al. (2013) [19]). For more details please see [1, 2, 6, 17].

KL-UCB
The Kullback-Leibler UCB algorithm (KL-UCB) presents a modern approach to UCB for the standard stochastic bandit problem. KL-UCB improves on the regret bounds of earlier UCB algorithms by using the Kullback-Leibler divergence between an arm's estimated reward distribution and candidate means to build a tighter upper confidence bound. The two algorithms differ only at the arm selection step. Recall that UCB uses the following rule for arm selection:
    arg max_{i∈[n]} ( µ̂i,Ti + √(α log(t) / (2Ti)) )

In contrast, KL-UCB uses:

    It := arg max_{i∈[n]} max{ q ∈ [0, 1] : Ti · d(µ̂i,Ti, q) ≤ log(t) + c log(log(t)) }          (15)

where d(·, ·) is the Bernoulli Kullback-Leibler divergence:

    d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
and c is a tuning parameter. The inner maximum on the right side of Equation (15) is the (tighter) upper confidence bound for arm i. For each arm i ∈ [n], the maximal q in the inner statement may be efficiently approximated using Newton's method. The pseudocode may be found in Algorithm 3.

KL-UCB is optimal for Bernoulli distributions and strictly dominates UCB for any bounded reward
distributions. For more details please see [9, 14].

Algorithm 3 KL-UCB
1: procedure KL-UCB({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   for 1 ≤ t ≤ n do
3:     It ← t                                      ▷ Play each arm once
4:   end for
5:   for n + 1 ≤ t ≤ T do
6:     It ← arg max_{i∈[n]} max{ q ∈ [0, 1] : Ti · d(µ̂i,Ti, q) ≤ log(t) + c log(log(t)) }
7:     Observe reward XIt,t
8:   end for
9: end procedure
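The notes above mention that the inner maximum in Equation (15) can be approximated with Newton's method; the sketch below (an illustration under stated assumptions, not the reference implementation) instead uses simple bisection, which also works because Ti · d(µ̂, q) is nondecreasing in q for q ≥ µ̂. The value c = 3 and the clamping of log(log(t)) for small t are arbitrary choices:

import math

def bernoulli_kl(p, q, eps=1e-12):
    # d(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), clipped for numerical stability
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t, c=3.0, tol=1e-6):
    """Largest q in [mu_hat, 1] with pulls * d(mu_hat, q) <= log(t) + c log(log(t)), by bisection."""
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))  # clamp the inner log for t < e
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mu_hat, mid) <= budget:
            lo = mid          # mid still satisfies the constraint, search higher
        else:
            hi = mid          # constraint violated, search lower
    return lo

print(kl_ucb_index(mu_hat=0.4, pulls=50, t=1000))   # about 0.74, an upper confidence bound on the mean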

Examples/Applications for TS and UCB


One of the early motivations for studying the Multi-Armed Bandit problem was clinical trials. Suppose that
we have N different treatments of unknown efficacy for a certain disease. Patients arrive sequentially, and we
must decide on a treatment to administer for each arriving patient. To make this decision, we could learn from
how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we
may have a reasonable idea of which treatment is most effective, and from then on, we could administer that
treatment for all the patients. In applications like display advertising, product assortment, recommendation systems (e.g. news article recommendation, cascading recommendation, recommending courses to learners), reinforcement learning in Markov decision processes, and active learning with neural networks, Thompson sampling is competitive with or better than popular methods such as UCB. Web advertising, job scheduling (or exercise scheduling), and routing (shortest-path) problems provide further motivating examples for MAB methods. For more details please see [10].
Thompson sampling offers some significant advantages over the UCB approach and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards; see Russo and Van Roy [16].
UCB algorithms have been proposed for a variety of problems, including bandit problems with independent arms, bandit problems with linearly parameterized arms, bandits with continuous action spaces and smooth reward functions, and exploration in reinforcement learning. UCB1 is also the building block for tree search algorithms (e.g. Upper Confidence bounds applied to Trees (UCT)) used, for example, to play games. There are some limitations to using Thompson sampling. For example, it is certainly a poor fit for sequential learning problems that do not require much active exploration. It may also perform poorly in time-sensitive learning problems where it is better to exploit a high-performing suboptimal action than to invest resources exploring arms that might offer slightly improved performance.

References
[1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem.
CoRR, abs/1111.1797, 2011.

[2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. CoRR,
abs/1209.3353, 2012.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Mach. Learn., 47(2-3):235–256, May 2002.
[4] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed
bandit problem.

[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. CoRR, abs/1204.5721, 2012.

[6] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Proceedings of the
24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2249–2257,
USA, 2011. Curran Associates Inc.
[7] John C. Duchi. Probability bounds. 2009.
[8] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 586–594. Curran Associates, Inc.,
2010.
[9] Aurélien Garivier and Olivier Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond.
In COLT, 2011.

[10] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems.
In Proceedings of the 31st International Conference on International Conference on Machine Learning
- Volume 32, ICML’14, pages I–100–I–108. JMLR.org, 2014.
[11] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale bayesian
click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In
Proceedings of the 27th International Conference on International Conference on Machine Learning,
ICML’10, pages 13–20, USA, 2010. Omnipress.
[12] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification
in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, January 2016.

[13] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math.,
6(1):4–22, March 1985.
[14] Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A Finite-Time Analysis of Multi-armed
Bandits Problems with Kullback-Leibler Divergences. In Sham Kakade & Ulrike von Luxburg, editor,
24th Annual Conference on Learning Theory : COLT’11, page 18, Budapest, Hungary, July 2011.

[15] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. CoRR, abs/0812.3465,
2008.
[16] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR,
abs/1301.2609, 2013.

[17] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on thompson sampling.
CoRR, abs/1707.02038, 2017.
[18] Steven L. Scott. A modern bayesian look at the multi-armed bandit. Appl. Stoch. Model. Bus. Ind.,
26(6):639–658, November 2010.

[19] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge
Management, CIKM ’13, pages 1587–1594, New York, NY, USA, 2013. ACM.
[20] William R. Thompson. On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
