
CSE599i: Online and Adaptive Machine Learning Winter 2018

Lecture 3: Stochastic Multi-Armed Bandits, Regret Minimization


Lecturer: Kevin Jamieson Scribes: Walter Cai, Emisa Nategh, Jennifer Rogers

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

Probability and Statistics Review


We begin with a review of the basics of probability and statistics, including independence, the law of large
numbers, and the central limit theorem. This will lay the foundation for an introduction to tail bounds and
their use in analyzing stochastic bandit problems.
Let X and Y be random variables. We say that X and Y are independent if, ∀A, B,

P (Y ∈ A|X ∈ B) = P (Y ∈ A)

We can use this definition of independence to show that the expectation of a product of functions of X and
Y is the product of the expectations, as long as X and Y are independent:

E[f(X)g(Y)] = E[E[f(X)g(Y) | Y]] = E[g(Y) E[f(X) | Y]] = E[g(Y) E[f(X)]] = E[f(X)] E[g(Y)]
We say Xi ∼ P i.i.d. for i = 1, . . . , n if the Xi are independent and identically distributed according to P.

Lemma 1. Suppose the true distribution P has mean E[Xi] = µ and variance E[(Xi − µ)²] = σ². If we define the estimator of the mean µ̂n = (1/n) Σ_{i=1}^n Xi, then E[(µ̂n − µ)²] = σ²/n.
Proof. We begin by substituting the definition of µ̂n and expanding the square:

    E[(µ̂n − µ)²] = E[ ( (1/n) Σ_i (Xi − µ) )² ]
                 = E[ (1/n²) Σ_i (Xi − µ)² + (1/n²) Σ_{i≠j} (Xi − µ)(Xj − µ) ]

Since Xi and Xj are independent when i ≠ j, the expectation of each cross term is 0. This allows us to simplify the expression:

    E[(µ̂n − µ)²] = (1/n²) Σ_i E[(Xi − µ)²]
                 = (1/n²) Σ_i σ²
                 = σ²/n
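As a quick numerical sanity check of Lemma 1 (a minimal sketch using NumPy; the Gaussian distribution, sample size, and number of repetitions are arbitrary choices, not part of the lecture), we can estimate E[(µ̂n − µ)²] by simulation and compare it to σ²/n:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000            # arbitrary sample size and number of repetitions
mu, sigma = 1.0, 2.0               # true mean and standard deviation

# Draw `trials` independent samples of size n and form the sample mean of each.
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)

empirical_mse = np.mean((mu_hat - mu) ** 2)
print(empirical_mse)               # close to sigma^2 / n = 0.08
print(sigma**2 / n)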

This bound on the squared variation of our estimator from the true mean will allow us to prove the Weak
Law of Large Numbers. Before we begin that proof, we will need another result: Markov’s Inequality.


Figure 1: Since Y is supported on the nonnegative numbers and P(Y > x) is a nonincreasing function of x, the integral of P(Y > x) over [0, ∞) is bounded below by t·P(Y > t).

Lemma 2. (Markov's Inequality): If Y is a nonnegative random variable, then for any t > 0, P(Y > t) ≤ E[Y]/t.

We present two different proofs of Markov’s Inequality.


Proof. We can write the expectation of Y as the integral

    E[Y] = ∫_0^∞ P(Y > x) dx

Note that we can take this integral from 0 since Y is nonnegative, and that P(Y > x) is a nonincreasing function. As Figure 1 illustrates, the nonincreasing nature of P(Y > x) implies the following lower bound:

    E[Y] = ∫_0^∞ P(Y > x) dx ≥ ∫_0^t P(Y > x) dx ≥ ∫_0^t P(Y > t) dx = t·P(Y > t)
We have recovered Markov's Inequality: P(Y > t) ≤ E[Y]/t.

Proof. (Alternate proof of Markov's Inequality) Let Y be a nonnegative random variable. Then

    Y ≥ t·1{Y ≥ t}

To see why this is true, first consider the case Y < t: the indicator function is zero, and by definition we know Y ≥ 0. In the second case, when Y ≥ t, the indicator function is 1, and the inequality reduces to Y ≥ t, which holds by assumption in this case.
Next, we take the expectation of both sides to get
E[Y ] ≥ tE[1{Y ≥ t}]
= tP (Y ≥ t)
We have recovered Markov's Inequality: P(Y ≥ t) ≤ E[Y]/t.
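To see how loose Markov's Inequality can be, a small numerical sketch (the exponential distribution and threshold below are arbitrary choices) compares the empirical tail probability of a nonnegative random variable with the bound E[Y]/t:

import numpy as np

rng = np.random.default_rng(0)
t = 3.0
y = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative, E[Y] = 1

print(np.mean(y > t))   # empirical P(Y > t), about e^{-3}, roughly 0.05
print(y.mean() / t)     # Markov bound E[Y]/t, roughly 0.33, far larger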

Theorem 1. (Weak Law of Large Numbers): For all ε > 0, lim_{n→∞} P(|µ̂n − µ| > ε) = 0.

Proof. Fix ε > 0. Then P(|µ̂n − µ| > ε) = P(|µ̂n − µ|² > ε²). Now, the random variable in question, |µ̂n − µ|², is nonnegative. This means we can apply Markov's Inequality, yielding

    P(|µ̂n − µ| > ε) = P(|µ̂n − µ|² > ε²)
                    ≤ E[|µ̂n − µ|²] / ε²
                    = σ² / (nε²)


where, in the last step, we applied Lemma 1. Next, we take the limit,

    lim_{n→∞} P(|µ̂n − µ| > ε) ≤ lim_{n→∞} σ²/(nε²) = 0

Example: Estimating the bias of a coin with Markov’s Inequality


We have already shown in the proof of Theorem 1 that, given fixed ε > 0, P(|µ̂n − µ| > ε) ≤ σ²/(nε²). Now, suppose we are trying to estimate the bias of a coin. We know the bias lies in [0, 1], and that the variance of a distribution with this support is bounded by σ² ≤ 1/4. This gives us

    P(|µ̂n − µ| > ε) ≤ 1/(4nε²)

If we want the probability of such an event to be bounded by δ, then we set the right hand side equal to δ and solve for ε. This yields

    |µ̂n − µ| ≤ √(1/(4nδ))

with probability at least 1 − δ. Thus, if we desire that |µ̂n − µ| ≤ ε with probability at least 1 − δ, then we must have n ≥ ε⁻²/(4δ). Later, we will see that the Central Limit Theorem suggests this to be very loose. Indeed, the CLT implies it suffices to take just n = ε⁻² log(2/δ)/2, which is substantially smaller.
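To get a feel for how different these two sample-size requirements are, a short calculation (a sketch; the choices ε = 0.05 and δ = 0.05 are arbitrary) evaluates both expressions:

import math

eps, delta = 0.05, 0.05                          # arbitrary accuracy and confidence levels

n_markov = 1 / (4 * delta * eps**2)              # n >= eps^-2 / (4 delta) from Markov's Inequality
n_clt = math.log(2 / delta) / (2 * eps**2)       # n = eps^-2 log(2/delta) / 2 suggested by the CLT

print(n_markov)   # 2000.0
print(n_clt)      # about 737.8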

Theorem 2. (Central Limit Theorem (CLT)) lim_{n→∞} √n(µ̂n − µ) ∼ N(0, σ²).
Proof. Consider the random variable

    Zn = (X1 + · · · + Xn − nµ) / √(nσ²)

We will prove the central limit theorem by calculating the characteristic function of this random variable, and showing that, in the limit as n → ∞, it is the same as the characteristic function of N(0, 1). We begin by rewriting the random variable Zn:

    Zn = (X1 + · · · + Xn − nµ) / √(nσ²)
       = Σ_{j=1}^n (Xj − µ) / √(nσ²)
       = (1/√n) Σ_{j=1}^n Yj

where Yj = (Xj − µ)/σ. Note that these Yj are i.i.d. with mean 0 and variance 1. We want to find a closed form for the characteristic function of Zn, which is given by

    φ_{Zn}(t) = E[exp(it Zn)]

where i = √−1. We substitute in our definition of Zn, yielding

    φ_{Zn}(t) = E[ exp( it (1/√n) Σ_j Yj ) ]

By the properties of exponentials, we can change the sum in the exponent into a product of exponentials:

    φ_{Zn}(t) = E[ Π_j exp( it Yj/√n ) ]

Since the Yj are independent, the expectation commutes with the product, and we can write

    φ_{Zn}(t) = Π_j E[ exp( it Yj/√n ) ]

We know the Yj are identically distributed, so each expectation in the product must have the same value. This enables us to simplify the equation using Y1, which is representative of all Yj:

    φ_{Zn}(t) = ( E[ exp( it Y1/√n ) ] )^n

Using the definition of the characteristic function, we can write this as a power of the characteristic function of Y1:

    φ_{Zn}(t) = ( φ_{Y1}(t/√n) )^n

Now, we can use Taylor's Theorem to approximate the characteristic function. For some (possibly complex) constant c, we have, as t/√n → 0,

    φ_{Y1}(t/√n) = 1 − t²/(2n) + c t³/(6 n^{3/2}) + o( t³/n^{3/2} )

Next, we recognize that, as n → ∞, the characteristic function approaches φ_{Zn}(t) → (1 − t²/(2n))^n. Using the identity e^x = lim_{n→∞} (1 + x/n)^n, we conclude that

    lim_{n→∞} φ_{Zn}(t) = e^{−t²/2}

which is exactly the characteristic function of the standard normal distribution, N(0, 1). Recalling that our definition of Zn was a transformation of the random variables Xi, we see that the sum of the Xi's will converge to a normal distribution N(nµ, nσ²).
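Before moving on, a quick empirical illustration of the theorem (a sketch using NumPy; the Bernoulli distribution and the parameter choices are arbitrary) standardizes many independent sample means and checks that their spread matches N(0, σ²):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 200_000
p = 0.3                                   # Bernoulli(p): mu = p, sigma^2 = p(1 - p)

s = rng.binomial(n, p, size=trials)       # S_n = X_1 + ... + X_n with X_i ~ Bernoulli(p)
z = np.sqrt(n) * (s / n - p)              # sqrt(n) (mu_hat_n - mu)

print(z.var())                            # close to p(1 - p) = 0.21
print(np.mean(z <= 0.5 * np.sqrt(p * (1 - p))))   # close to Phi(0.5), about 0.69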

Example: Estimating the bias of a coin using the Central Limit Theorem
Revisiting our coin flip example, we can use the central limit theorem to improve our asymptotic bound on |µ̂n − µ|. The central limit theorem tells us that our random variable √n(µ̂n − µ) ∼ N(0, σ²) as n → ∞, so we begin with the definition of a normal distribution:

    P(µ̂n − µ > ε) ≤ ∫_ε^∞ (1/√(2πσ²/n)) e^{−nx²/(2σ²)} dx
                  ≤ e^{−nε²/(2σ²)}

We can set this bound equal to δ and solve for ε, as in our previous example. Doing this, we find that, with probability at least 1 − δ, |µ̂n − µ| ≤ √(2σ² log(2/δ)/n) as n → ∞. This suggests that for “sufficiently large” n0, we have with probability at least 1 − δ that |µ̂n − µ| ≤ ε whenever n ≥ max{n0, 2σ² ε⁻² log(2/δ)}. Here, n0 is encoding the fact that the CLT is an asymptotic statement. To make such a statement rigorous for all finite n, without knowledge of some sufficiently large n0, we must appeal to different techniques. Note, however, that Markov's inequality holds for all n but is substantially looser than we would expect.

Chernoff Bounds
Central Limit Theorem guarantees are useful for large sample sizes, but if n is small, we would still like to bound the deviation of µ̂n from the true mean. Chernoff bounds are a technique for bounding the tail probability of a random variable using its moment generating function. We wish to bound the quantity P(µ̂n − µ > ε). For λ > 0, we can use the fact that e^x is monotonically increasing to transform our variable:
    P(µ̂n − µ > ε) = P(λ(µ̂n − µ) > λε)
                  = P( e^{λ(µ̂n − µ)} > e^{λε} )

Now, our random variable is nonnegative, and we can apply Markov's Inequality.

    P(µ̂n − µ > ε) ≤ e^{−λε} E[ e^{λ(µ̂n − µ)} ]
                  = e^{−λε} E[ e^{λ (1/n) Σ_i (Xi − µ)} ]
                  = e^{−λε} E[ Π_i e^{(λ/n)(Xi − µ)} ]
                  = e^{−λε} Π_i E[ e^{(λ/n)(Xi − µ)} ]          (1)
                  = e^{−λε} ( E[ e^{(λ/n)(X1 − µ)} ] )^n        (2)

In Equation (1), we have used the independence of the Xi's, which means that the product commutes with the expectation. In Equation (2), we have leveraged their identical distribution (so the expectations are identical). This sequence of steps, exponentiating and applying Markov's inequality together with independence, is the technique known as the Chernoff bound.
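The right-hand side of Equation (2) still depends on the free parameter λ > 0; in practice one picks the λ that makes the bound smallest. A small numerical sketch (assuming Bernoulli(p) rewards; the values of p, n, ε, and the λ grid are arbitrary choices) evaluates the right-hand side of (2) over a grid of λ and reports the minimum:

import numpy as np

p, n, eps = 0.5, 100, 0.1          # Bernoulli mean, sample size, and deviation (arbitrary)

def chernoff_rhs(lam):
    # E[e^{(lam/n)(X - mu)}] for X ~ Bernoulli(p), raised to the n-th power,
    # multiplied by e^{-lam * eps}: the right-hand side of Equation (2).
    mgf = p * np.exp((lam / n) * (1 - p)) + (1 - p) * np.exp(-(lam / n) * p)
    return np.exp(-lam * eps) * mgf**n

lams = np.linspace(0.1, 100, 2000)
print(min(chernoff_rhs(lam) for lam in lams))   # about 0.13, slightly below e^{-2 n eps^2} = 0.135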

Hoeffding’s Inequality
Moving on from the Chernoff bound technique, we describe a more general bound: Hoeffding's Inequality. We prove the inequality by leveraging the Chernoff bound as well as Hoeffding's Lemma, which we state and prove first.
Lemma 3. (Hoeffding's Lemma) Let X be a random variable taking values in [a, b] almost surely, with E[X] = 0. Then for any real s,

    E[e^{sX}] ≤ e^{s²(b−a)²/8}

Proof. We adapt the proof found in Duchi [7]. First note that e^{sx} is convex with respect to x. Thus we have:

    e^{sX} ≤ ((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb}

By linearity, and using E[X] = 0:

    E[e^{sX}] ≤ ((b − E[X])/(b − a)) e^{sa} + ((E[X] − a)/(b − a)) e^{sb} = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}

For ease of reading, let p = −a/(b − a), also noting that a = −p(b − a). We isolate a factor e^{sa}:

    (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb} = (1 − p) e^{sa} + p e^{sb}
                                            = ((1 − p) + p e^{s(b−a)}) e^{sa}
                                            = (1 − p + p e^{s(b−a)}) e^{−sp(b−a)}

Substitute u = s(b − a):

    (1 − p + p e^{s(b−a)}) e^{−sp(b−a)} = (1 − p + p e^{u}) e^{−pu}

Define the function φ of u as the logarithm of the above expression:

    φ(u) = log( (1 − p + p e^{u}) e^{−pu} ) = −pu + log(1 − p + p e^{u})

We write E[e^{sX}] ≤ e^{φ(u)} and we proceed to bound φ(u). Our route is to apply Taylor's theorem; there must exist some z ∈ [0, u] where

    φ(u) = φ(0) + uφ'(0) + (1/2)u²φ''(z) ≤ φ(0) + uφ'(0) + (1/2)u² sup_z φ''(z)          (3)

We derive the first and second derivatives of φ(u):

    φ'(u) = −p + p e^{u} / (1 − p + p e^{u})
    φ''(u) = p(1 − p) e^{u} / (1 − p + p e^{u})²

We have φ(0) = φ'(0) = 0, so we may rewrite Equation (3):

    φ(u) ≤ φ(0) + uφ'(0) + (1/2)u² sup_z φ''(z) = (1/2)u² sup_z φ''(z)

We therefore need only maximize φ''(z). We substitute y for e^{z}:

    p(1 − p)y / (1 − p + py)²

We note that the expression is linear over quadratic in y and therefore concave for y > 0. It therefore suffices to find the critical point in y:

    d/dy [ p(1 − p)y / (1 − p + py)² ] = p(1 − p)(1 − p − py) / (1 − p + py)³

We have two critical points to consider: y = (1 − p)/p and y = (p − 1)/p. We note that a ≤ E[X] = 0, so p = −a/(b − a) ∈ [0, 1]. Hence (1 − p)/p ≥ 0 and (p − 1)/p ≤ 0. We therefore select the candidate that falls inside the nonnegative window: y = (1 − p)/p. Note that if p = 1, then (1 − p)/p = (p − 1)/p = 0; that is, there is in fact only a single critical point in this situation. Substituting back in, we have:

    sup_z φ''(z) ≤ p(1 − p)·((1 − p)/p) / (1 − p + p·(1 − p)/p)² = 1/4

We may conclude:

    E[e^{sX}] ≤ e^{φ(u)} ≤ e^{u²/8} = e^{s²(b−a)²/8}

We now prove the main result of this subsection, which follows from Lemma 3.

Theorem 3. (Hoeffding's Inequality) Given independent random variables {X1, . . . , Xm} where ai ≤ Xi ≤ bi almost surely (with probability 1), we have:

    P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε ) ≤ exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² )

Proof. We adapt the proof found in Duchi [7]. As mentioned earlier, the result is a straightforward application of Lemma 3. For all 1 ≤ i ≤ m, define a new variable Zi as the difference between Xi and its expectation:

    Zi = Xi − E[Xi]

This implies that E[Zi] = 0. Moreover, we may bound the domain of Zi inside [ai − E[Xi], bi − E[Xi]]. In particular, we note that the interval must still have length bi − ai, independent of the expectation of Xi. Let s be some positive value. We have:
    P( Σ_{i=1}^m Zi ≥ t ) = P( exp(s Σ_{i=1}^m Zi) ≥ e^{st} ) ≤ E[ Π_{i=1}^m e^{sZi} ] / e^{st}          (Chernoff)

By independence of the Zi we may shift the expectation inside the product and continue. Recall that Zi must still live in an interval of length bi − ai.

    E[ Π_{i=1}^m e^{sZi} ] / e^{st} = ( Π_{i=1}^m E[e^{sZi}] ) / e^{st}
                                    ≤ e^{−st} Π_{i=1}^m e^{s²(bi − ai)²/8}          (Hoeffding's Lemma)
                                    = exp( −st + (s²/8) Σ_{i=1}^m (bi − ai)² )

We now substitute a conveniently engineered value of s to conclude our result. Note that s > 0, so the earlier restriction on s is satisfied:

    s = 4t / Σ_{i=1}^m (bi − ai)²

Substituting in s as well as t = mε, we have:

    P( Σ_{i=1}^m Zi ≥ mε ) = P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε )
        ≤ exp( −( 4mε / Σ_{i=1}^m (bi − ai)² )·mε + (1/8)( 4mε / Σ_{i=1}^m (bi − ai)² )² Σ_{i=1}^m (bi − ai)² )
        = exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² )
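As a numerical check of Theorem 3 (a sketch; the Bernoulli rewards, m, ε, and number of repetitions are arbitrary choices), we can compare the empirical deviation probability of a sample mean of [0, 1]-valued variables with the bound exp(−2ε²m):

import numpy as np

rng = np.random.default_rng(0)
m, eps, p, trials = 100, 0.1, 0.5, 200_000

s = rng.binomial(m, p, size=trials)     # sum of m Bernoulli(p) variables, each X_i in [0, 1]
deviation = s / m - p                   # (1/m) sum X_i - (1/m) sum E[X_i]

print(np.mean(deviation >= eps))   # empirical probability, about 0.03
print(np.exp(-2 * eps**2 * m))     # Hoeffding bound, about 0.135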

Stochastic Multi-Armed Bandits


In the stochastic multi-armed bandit problem, the player is presented with a collection of actions, or arms, to
choose from in each round of play. Each arm distributes rewards according to some (unknown) subgaussian
distribution over [0, 1]. Rewards are i.i.d., with E[Xi,t ] = µi for all arms i and times t. The goal of the player
is to minimize the cumulative regret, which is defined as the difference between the best reward achievable by the strategy of playing a single fixed arm for T time steps and the player's rewards over those T time steps. If the player chooses arm It at time t, the regret can be written as
" T #
X
R(T ) = max E (Xj,t − XIt ,t )
j
t=1

For an alternative formulation of the regret, define each arm’s gap from the best arm as ∆i = maxj µj − µi .
We can rewrite the regret by taking the expectation inside the summation:

    R(T) = Σ_{t=1}^T ( max_j E[Xj,t] − E[XIt,t] )
         = Σ_{t=1}^T ( max_j µj − E[µIt] )
         = Σ_{i=1}^n ∆i E[Ti]

where Ti is the total number of times we have played arm i up to time T. We see that the regret is a sum, over the suboptimal arms, of the expected number of times each such arm is played multiplied by that arm's gap from the optimal arm.
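The decomposition R(T) = Σ_i ∆i E[Ti] is also convenient computationally: given the true means and the number of times each arm was pulled, the (pseudo-)regret is a dot product of gaps with pull counts. A minimal sketch (the means and pull counts below are arbitrary illustrative values):

import numpy as np

mu = np.array([0.9, 0.8, 0.5])      # true arm means (arm 0 is optimal)
pulls = np.array([900, 80, 20])     # T_i: times each arm was played, T = 1000

gaps = mu.max() - mu                # gap of each arm from the best arm
print(np.dot(gaps, pulls))          # sum_i gap_i * T_i = 0.1*80 + 0.4*20 = 16.0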

UCB (Upper Confidence Bounds)


Auer et al. (2002) [3] introduced simple and efficient allocation strategies based on upper confidence bounds
for a bandit problem with any reward distribution with known bounded support. Their algorithms demonstrate logarithmic regret performance uniformly over time, not just asymptotically.
To implement the UCB1 algorithm, we need both µ̂i,Ti, the empirical reward estimate of arm i after it has been pulled Ti many times, and an upper confidence bound on that estimate. To calculate our empirical reward estimate, we simply average the observed rewards over all rounds where we pull arm i. In the expression below, we assume |{t : It = i}| = Ti, that is, we have progressed through an appropriate number of rounds so that arm i has been pulled Ti many times:

    µ̂i,Ti = ( Σ_{t: It = i} Xi,t ) / Ti
In addition to our empirical reward estimates, we need an upper confidence bound to describe the largest plausible mean of each arm. Using Hoeffding's Inequality and Chernoff bounds, we can construct such a confidence interval. With probability at least 1 − t^{−α}, the empirical mean µ̂i,Ti will differ from the true mean by at most ε = √(α log(t) / (2Ti)). The UCB1 algorithm chooses the arm with the largest such upper bound:

    UCBi,t = µ̂i,Ti + √(α log(t) / (2Ti)),        It := arg max_{i∈[n]} UCBi,t

We see that our confidence bound, √(α log(t) / (2Ti)), grows slowly as we play for more rounds (as t increases), ensuring that we never stop playing any given arm. The confidence bound for arm i shrinks quickly as we pull the arm (as Ti increases).
The pseudocode may be found in Algorithm 1.
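Alongside the pseudocode, here is a minimal Python sketch of the UCB1 rule described above (assuming Bernoulli rewards and α = 2; the arm means, horizon, and seed are arbitrary choices, and this is an illustration rather than a tuned implementation):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.4])        # true (unknown) Bernoulli means
n, T, alpha = len(mu), 10_000, 2.0

counts = np.zeros(n)                  # T_i: number of pulls of arm i
sums = np.zeros(n)                    # running sum of rewards from arm i

for t in range(1, T + 1):
    if t <= n:
        i = t - 1                     # play each arm once
    else:
        ucb = sums / counts + np.sqrt(alpha * np.log(t) / (2 * counts))
        i = int(np.argmax(ucb))
    reward = rng.binomial(1, mu[i])   # observe the reward of the chosen arm
    counts[i] += 1
    sums[i] += reward

gaps = mu.max() - mu
print(counts)                         # suboptimal arms are pulled only O(log T) times
print(np.dot(gaps, counts))           # realized pseudo-regret, sum_i gap_i * T_i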
Theorem 4. (Regret bound for the UCB algorithm) For T ≥ 1 and α > 1,

    R(T) ≤ Σ_{i: ∆i > 0} ( 4α ∆i^{−1} log(T) + (2α/(α−1)) ∆i )

Proof. Suppose, without loss of generality, that arm 1 is optimal. Then, arm i ≠ 1 will only be played in two cases: either arms 1 and i have been sampled insufficiently to distinguish between their means, or the upper confidence bound given by Hoeffding's inequality fails for arm 1 or for arm i. We begin by bounding the chance that these confidence bounds fail.
Suppose that we have the following two events At, Bt.

Algorithm 1 UCB
1: procedure UCB({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   for 1 ≤ t ≤ n do
3:     It ← t                                   ▷ Play each arm once
4:   end for
5:   for n + 1 ≤ t ≤ T do
6:     It ← arg max_{i∈{1,...,n}} UCBi,t−1
7:     Observe reward XIt,t
8:   end for
9: end procedure

    At: µ̂i,Ti ≤ µi + √(α log(t) / (2Ti))

    Bt: µ̂1,T1 ≥ µ1 − √(α log(t) / (2T1))

We wish to bound the probabilities of the complements of events At and Bt occurring. We will apply Hoeffding's inequality (Theorem 3). At fails when

    µ̂i,Ti − µi > √(α log(t) / (2Ti))
By Theorem 3 we have:

    P(At^c) = P(µ̂i,Ti − µi > ε) ≤ exp( −2ε²t² / Σ_{j=1}^t (bj − aj)² )
                                = exp( −2ε²t² / Σ_{j=1}^t (1 − 0)² )
                                = exp(−2ε²t)

We plug in our bounding value ε = √(α log(t) / (2Ti)):

    P( µ̂i,Ti − µi > √(α log(t) / (2Ti)) ) ≤ exp( −2t·α log(t) / (2Ti) )
                                          = exp( −tα log(t) / Ti )
                                          ≤ exp( −tα log(t) / t )
                                          = exp(−α log(t))
                                          = t^{−α}          (4)

The statement and justification are identical for the complement of event Bt, and we return to the task of bounding the number of suboptimal arm pulls. A suboptimal arm i is only played if its upper confidence bound exceeds that of arm 1, meaning that

    µ̂i,Ti + √(α log(t) / (2Ti)) > µ̂1,T1 + √(α log(t) / (2T1))

Suppose that both At and Bt hold. In this case, if suboptimal arm i is pulled, it is due to insufficient sampling up to this point.
Since At has been assumed to be true, we generate the following bound:

    µi + 2√(α log(t) / Ti) ≥ µ̂i,Ti + √(α log(t) / (2Ti))          (5)

Next, we use our assumption of Bt being true to lower-bound arm 1's upper confidence bound, which is exceeded by the right hand side of Line (5) whenever arm i is selected:

    µ̂1,T1 + √(α log(t) / (2T1)) ≥ µ1          (6)

Chaining the arm-selection inequality with Equations (5) and (6), we have

    µi + 2√(α log(t) / Ti) ≥ µ1

Rearranging, we have:

    √(α log(t) / Ti) ≥ (µ1 − µi) / 2

Now, recall our definition of the optimality gap of an arm, ∆i = maxj µj − µi. Since we know arm 1 is optimal, this becomes ∆i = µ1 − µi. Our inequality becomes

    √(α log(t) / Ti) ≥ ∆i / 2

Solving for the number of times Ti that the arm has been played, we arrive at

    Ti ≤ 4∆i^{−2} α log(t)
       ≤ 4∆i^{−2} α log(T)          (7)

Thus when At and Bt hold, we only play suboptimal arm i at most 4∆i^{−2} α log(T) times.
Recall that It can only be equal to i if either arm i has been sampled insufficiently (fewer than 4∆i^{−2} α log(T) times) or either event At or Bt fails. For any arm i, the expected number of times it is played up to round T under UCB is:

    E[Ti] = Σ_{t=1}^T E[1(It = i)]
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T E[1{At^c ∪ Bt^c}]
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T ( E[1{At^c}] + E[1{Bt^c}] )          (8)
          ≤ 4α∆i^{−2} log(T) + Σ_{t=1}^T ( t^{−α} + t^{−α} )                  (9)
          = 4α∆i^{−2} log(T) + 2 Σ_{t=1}^T t^{−α}

The inequality in Line (8) comes from the union bound on events At^c and Bt^c. The inequality in Line (9) comes from Equation (4). In order to bound the second term on the right hand side, we note that for α > 1,

    Σ_{t=1}^T t^{−α} ≤ 1 + ∫_1^∞ x^{−α} dx = 1 + 1/(α−1) = α/(α−1)

Therefore we have:

    E[Ti] ≤ 4α∆i^{−2} log(T) + 2α/(α−1)          (10)

The desired result follows from summing over all suboptimal arms:

    R(T) = Σ_{i≠1} ∆i E[Ti]
         ≤ Σ_{i≠1} ( 4α∆i^{−1} log(T) + (2α/(α−1)) ∆i )

Theorem 5. (Lai and Robbins (1985) [13]) Lai and Robbins provided an optimal asymptotic lower bound on the expected regret of any bandit algorithm: if R(T) = o(T^β) for all β > 0, then, asymptotically,

    E[Ti] ≥ log(T) / ∆i²

We refer the reader to Kaufmann et al. [12] for a proof of the above theorem. For an overview of the UCB family of algorithms, refer to Bubeck and Cesa-Bianchi (2012, chap. 2) [5].
If the gap between the best and second-best arm is very small, then the ∆i^{−1} penalty in the regret bound becomes very large. However, as the gap becomes small, we would imagine that playing the second-best arm becomes a decent strategy. To this end, we seek a “worst case” regret bound for the UCB algorithm. This bound, which is independent of the ∆i, is shown in the following theorem.

Theorem 6. For all T ≥ n, a gap-agnostic bound achieved by the UCB algorithm in round T is

    E[R(T)] ≤ (1 + 4α)√(nT log(T)) + (2α/(α−1)) n
α−1
Proof. Divide the arms into two groups:

• Group G1 contains “almost optimal” arms with ∆i < √((n/T) log(T)).

• Group G2 contains arms with ∆i ≥ √((n/T) log(T)).

The total regret is the sum of the regret of each group. The maximum total regret incurred due to pulling arms in G1 is given by

    Σ_{i∈G1} Ti ∆i

By definition, the per-pull regret on any arm i ∈ G1 is bounded by ∆i < √((n/T) log(T)). We may therefore bound the total regret on arms in G1 as follows:

    Σ_{i∈G1} Ti ∆i ≤ √((n/T) log(T)) Σ_{i∈G1} Ti
                   ≤ T · √((n/T) log(T))
                   = √(nT log(T))

We may now shift our focus to group G2. Recall that by definition, for all arms i ∈ G2 we have ∆i ≥ √((n/T) log(T)). Rearranging, we have:

    ∆i^{−1} ≤ √( T / (n log(T)) )          (11)

We begin by building on Equation (10) and summing over all arms in G2:

    Σ_{i∈G2} E[Ti] ∆i ≤ Σ_{i∈G2} ( 4α∆i^{−2} log(T) + 2α/(α−1) ) ∆i          (12)
                      = Σ_{i∈G2} ( 4α∆i^{−1} log(T) + (2α/(α−1)) ∆i )
                      ≤ Σ_{i∈G2} ( 4α √(T / (n log(T))) log(T) + (2α/(α−1)) · 1 )          (13)
                      = Σ_{i∈G2} ( 4α √(T log(T) / n) + 2α/(α−1) )
                      ≤ n · ( 4α √(T log(T) / n) + 2α/(α−1) )          (14)
                      ≤ 4α √(nT log(T)) + (2α/(α−1)) n
Line (12) follows from multiplying ∆i (a necessarily nonnegative value) onto both sides of Equation (10) and summing over all arms in G2. Line (13) follows from Equation (11) and from the fact that µi lies in [0, 1], implying ∆i may not exceed 1. For Line (14), note that the summand is independent of which arm i we are iterating over and |G2| ≤ n.
We sum the expected regret over all arms in groups G1 and G2 to arrive at the total expected regret:

    E[R(T)] = Σ_{1≤i≤n} E[Ti] ∆i
            = Σ_{i∈G1} E[Ti] ∆i + Σ_{i∈G2} E[Ti] ∆i
            ≤ √(nT log(T)) + 4α√(nT log(T)) + (2α/(α−1)) n
            = (1 + 4α)√(nT log(T)) + (2α/(α−1)) n

Putting Theorems 4 and 6 together, we see that the UCB algorithm operates under two distinct regimes. During the initial “burn-in” period, the algorithm experiences O(√(nT log T)) regret to learn the arm payouts. As the game continues, and the gaps between the arms become easier to distinguish, the algorithm moves into the second regime, where its regret is O(Σ_i ∆i^{−1} log T).
UCB algorithms are an active research field in machine learning, especially for the contextual bandit problem [5, 8, 15]. For an overview of the UCB family of algorithms, refer to Bubeck and Cesa-Bianchi (2012, chap. 2) [5] and to Auer and Ortner [4].

Thompson Sampling (Posterior Sampling or Probability Matching)


Thompson sampling (TS) is one of the oldest heuristics for multi-armed bandit problems [20]. Thompson
sampling takes a Bayesian approach to finding the optimal arm while balancing the trade-off between exploration
and exploitation. In the setting considered here, the reward of each arm is Bernoulli-distributed and its expected reward is unknown. The objective is to find the optimal arm, the one that gives maximum expected cumulative
reward. The TS algorithm initially assumes arm i to have prior Beta (1, 1) on µi , which is natural because
Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed Si (t) successes (reward=1) and
Fi (t) failures (reward=0) in Ti (t) = Si (t) + Fi (t) plays of arm i, the algorithm updates the distribution on µi
as Beta(Si (t) + 1, Fi (t) + 1). The algorithm then samples from these posterior distributions of the µi ’s, and
plays an arm according to the probability of its mean being the largest. The Thompson sampling algorithm
is given in Algorithm 2.

Algorithm 2 Thompson Sampling

1: procedure Thompson({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   Si, Fi, Ti ← 0 for all i ∈ [n]
3:   for 1 ≤ t ≤ T do
4:     for 1 ≤ i ≤ n do
5:       µ̂i ∼ Beta(Si + 1, Fi + 1)                   ▷ Draw each µ̂i according to its posterior distribution
6:     end for
7:     It ← arg max_{i∈[n]} µ̂i
8:     TIt ← TIt + 1                                  ▷ Increment the total counter for arm It
9:     XIt,t ∼ Bernoulli(µIt)                         ▷ Observe reward XIt,t
10:    SIt ← SIt + XIt,t                              ▷ Update the success counter
11:    FIt ← TIt − SIt                                ▷ Update the failure counter
12:  end for
13: end procedure
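A minimal Python sketch of Thompson sampling for Bernoulli rewards, following Algorithm 2 (the arm means, horizon, and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.4])        # true (unknown) Bernoulli means
n, T = len(mu), 10_000

successes = np.zeros(n)               # S_i
failures = np.zeros(n)                # F_i

for t in range(T):
    # Sample a mean for each arm from its Beta(S_i + 1, F_i + 1) posterior.
    theta = rng.beta(successes + 1, failures + 1)
    i = int(np.argmax(theta))
    reward = rng.binomial(1, mu[i])   # observe the reward of the chosen arm
    successes[i] += reward
    failures[i] += 1 - reward

pulls = successes + failures
print(pulls)                          # pulls concentrate on the best arm
print(np.dot(mu.max() - mu, pulls))   # realized pseudo-regret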

Thompson Sampling has received considerable attention in industry as well (e.g. Scott (2010) [18],
Graepel et al. (2010) [11], and Tang et al. (2013) [19]). For more details please see [1, 2, 6, 17].

KL-UCB
The Kullback-Leibler UCB algorithm (KL-UCB) presents a modern approach to UCB for the standard stochastic bandit problem. KL-UCB improves on the regret bounds of earlier UCB algorithms by using the Kullback-Leibler divergence between an arm's estimated reward distribution and candidate means to build a tighter upper confidence bound. The two algorithms differ only at the arm selection step. Recall that UCB uses the following rule for arm selection:
    arg max_{i∈[n]} ( µ̂i,Ti + √(α log(t) / (2Ti)) )

In contrast, KL-UCB uses:

    It := arg max_{i∈[n]} max{ q ∈ [0, 1] : Ti · d(µ̂i,Ti, q) ≤ log(t) + c log(log(t)) }          (15)

where d(·, ·) is the Bernoulli Kullback-Leibler divergence:

    d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
and c is a tuning parameter. The inner maximum on the right side of Equation (15) is the (tighter) upper confidence bound for arm i. For each arm i ∈ [n], the maximal q in the inner statement may be efficiently approximated using Newton's method. The pseudocode may be found in Algorithm 3.

KL-UCB is optimal for Bernoulli distributions and strictly dominates UCB for any bounded reward
distributions. For more details please see [9, 14].

Algorithm 3 KL-UCB
1: procedure KL-UCB({1, 2, . . . , n}, T)          ▷ Arms 1 through n, max steps T
2:   for 1 ≤ t ≤ n do
3:     It ← t                                      ▷ Play each arm once
4:   end for
5:   for n + 1 ≤ t ≤ T do
6:     It ← arg max_{i∈[n]} max{ q ∈ [0, 1] : Ti · d(µ̂i,Ti, q) ≤ log(t) + c log(log(t)) }
7:     Observe reward XIt,t
8:   end for
9: end procedure
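The notes above mention that the inner maximum in Equation (15) can be approximated with Newton's method; the sketch below (an illustration under stated assumptions, not the reference implementation) instead uses simple bisection, which also works because Ti · d(µ̂, q) is nondecreasing in q for q ≥ µ̂. The value c = 3 and the clamping of log(log(t)) for small t are arbitrary choices:

import math

def bernoulli_kl(p, q, eps=1e-12):
    # d(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), clipped for numerical stability
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t, c=3.0, tol=1e-6):
    """Largest q in [mu_hat, 1] with pulls * d(mu_hat, q) <= log(t) + c log(log(t)), by bisection."""
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))  # clamp the inner log for t < e
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mu_hat, mid) <= budget:
            lo = mid          # mid still satisfies the constraint, search higher
        else:
            hi = mid          # constraint violated, search lower
    return lo

print(kl_ucb_index(mu_hat=0.4, pulls=50, t=1000))   # about 0.74, an upper confidence bound on the mean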

Examples/Applications for TS and UCB


One of the early motivations for studying the Multi-Armed Bandit problem was clinical trials. Suppose that
we have N different treatments of unknown efficacy for a certain disease. Patients arrive sequentially, and we
must decide on a treatment to administer for each arriving patient. To make this decision, we could learn from
how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we
may have a reasonable idea of which treatment is most effective, and from then on, we could administer that
treatment for all the patients. In applications like display advertising, product assortment, recommendation systems (e.g. news article recommendation, cascading recommendation, recommending courses to learners), reinforcement learning in Markov decision processes, and active learning with neural networks, Thompson sampling is competitive with or better than popular methods such as UCB. Web advertising, job scheduling (or exercise scheduling), and routing (shortest-path) problems provide further motivating examples for MAB methods. For more details please see [10].
Thompson sampling offers some significant advantages over the UCB approach and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards; see Russo and Van Roy [16].
UCB algorithms have been proposed for a variety of problems, including bandit problems with independent arms, bandit problems with linearly parameterized arms, bandits with continuous action spaces and smooth reward functions, and exploration in reinforcement learning. UCB1 is also the building block for tree search algorithms (e.g. Upper Confidence bounds applied to Trees (UCT)) used, for example, to play games. There are some limitations to using Thompson sampling. For example, it is certainly a poor fit for sequential learning problems that do not require much active exploration. It may also perform poorly in time-sensitive learning problems where it is better to exploit a high-performing suboptimal action than to invest resources exploring arms that might offer slightly improved performance.

References
[1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem.
CoRR, abs/1111.1797, 2011.

[2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. CoRR,
abs/1209.3353, 2012.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Mach. Learn., 47(2-3):235–256, May 2002.
[4] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed
bandit problem.

[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. CoRR, abs/1204.5721, 2012.

[6] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Proceedings of the
24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2249–2257,
USA, 2011. Curran Associates Inc.
[7] John C. Duchi. Probability bounds. 2009.
[8] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 586–594. Curran Associates, Inc.,
2010.
[9] Aurélien Garivier and Olivier Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond.
In COLT, 2011.

[10] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems.
In Proceedings of the 31st International Conference on International Conference on Machine Learning
- Volume 32, ICML’14, pages I–100–I–108. JMLR.org, 2014.
[11] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale bayesian
click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In
Proceedings of the 27th International Conference on International Conference on Machine Learning,
ICML’10, pages 13–20, USA, 2010. Omnipress.
[12] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification
in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, January 2016.

[13] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math.,
6(1):4–22, March 1985.
[14] Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A Finite-Time Analysis of Multi-armed
Bandits Problems with Kullback-Leibler Divergences. In Sham Kakade & Ulrike von Luxburg, editor,
24th Annual Conference on Learning Theory : COLT’11, page 18, Budapest, Hungary, July 2011.

[15] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. CoRR, abs/0812.3465,
2008.
[16] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR,
abs/1301.2609, 2013.

[17] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on thompson sampling.
CoRR, abs/1707.02038, 2017.
[18] Steven L. Scott. A modern bayesian look at the multi-armed bandit. Appl. Stoch. Model. Bus. Ind.,
26(6):639–658, November 2010.

[19] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge
Management, CIKM ’13, pages 1587–1594, New York, NY, USA, 2013. ACM.
[20] William R. Thompson. On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
