Lecture 3: Stochastic Multi-Armed Bandits, Regret Minimization
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
Recall that random variables $X$ and $Y$ are independent if, for all events $A$ and $B$ with $P(X \in B) > 0$,
$$P(Y \in A \mid X \in B) = P(Y \in A)$$
We can use this definition of independence to show that the expectation of a product of functions of X and
Y is the product of the expectations, as long as X and Y are independent:
$$E[f(X)g(Y)] = E\big[E[f(X)g(Y) \mid Y]\big] = E\big[E[f(X)]\,g(Y)\big] = E[f(X)]\,E[g(Y)]$$
We say $X_i \overset{\text{iid}}{\sim} P$ for $i = 1, \ldots, n$ if the $X_i$ are independent and identically distributed.
Lemma 1. Suppose the true distribution $P$ has mean $E[X_i] = \mu$ and variance $E[(X_i - \mu)^2] = \sigma^2$. If we define an estimator of the mean $\hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i$, then $E[(\hat\mu_n - \mu)^2] = \frac{\sigma^2}{n}$.
Proof. We begin by substituting the definition of $\hat\mu_n$ and expanding the square:
$$E[(\hat\mu_n - \mu)^2] = E\left[\left(\frac{1}{n}\sum_i (X_i - \mu)\right)^2\right] = E\left[\frac{1}{n^2}\sum_i (X_i - \mu)^2 + \frac{1}{n^2}\sum_{i \neq j} (X_i - \mu)(X_j - \mu)\right]$$
Since $X_i$ and $X_j$ are independent when $i \neq j$, the expectation of the second term is 0. This allows us to simplify the expression:
$$E[(\hat\mu_n - \mu)^2] = \frac{1}{n^2}\sum_i E\left[(X_i - \mu)^2\right] = \frac{1}{n^2}\sum_i \sigma^2 = \frac{\sigma^2}{n}$$
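As a quick sanity check of Lemma 1, the following sketch (with arbitrary parameter choices, not part of the lecture) estimates $E[(\hat\mu_n - \mu)^2]$ by simulation and compares it to $\sigma^2/n$.

```python
# Monte Carlo check of Lemma 1: the mean squared error of the sample mean is sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 3.0, 50, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))   # `trials` independent samples of size n
mu_hat = samples.mean(axis=1)                       # one estimate of mu per trial

print(np.mean((mu_hat - mu) ** 2))   # empirical E[(mu_hat - mu)^2], roughly 0.18
print(sigma ** 2 / n)                # predicted value sigma^2 / n = 0.18
```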
This bound on the squared variation of our estimator from the true mean will allow us to prove the Weak
Law of Large Numbers. Before we begin that proof, we will need another result: Markov’s Inequality.
Figure 1: Since $Y$ is nonnegative and $P(Y > x)$ is a decreasing function of $x$, the integral of $P(Y > x)$ is bounded below by $t\,P(Y > t)$.
Lemma 2. (Markov's Inequality) If $Y$ is a nonnegative random variable, then $P(Y > t) \leq \frac{E[Y]}{t}$ for all $t > 0$.
Proof. Recall the tail-integral identity for a nonnegative random variable, $E[Y] = \int_0^\infty P(Y > x)\,dx$. We can take this integral from 0 since $Y$ is nonnegative, and $P(Y > x)$ is a nonincreasing function of $x$. As Figure 1 illustrates, the nonincreasing nature of $P(Y > x)$ implies the following lower bound:
$$E[Y] \geq t\,P(Y > t)$$
Dividing both sides by $t$, we have recovered Markov's Inequality, that $P(Y > t) \leq \frac{E[Y]}{t}$.
Proof. (Alternate proof of Markov's Inequality) Let $Y$ be a nonnegative random variable. Then
$$Y \geq t\,\mathbf{1}\{Y \geq t\}$$
To see why this is true, first consider the case $Y < t$: the indicator is zero, and by assumption $Y \geq 0$. In the second case, $Y \geq t$, the indicator is 1, and the inequality reduces to $Y \geq t$, which is exactly the case we are in.
Next, we take the expectation of both sides to get
$$E[Y] \geq t\,E\left[\mathbf{1}\{Y \geq t\}\right] = t\,P(Y \geq t)$$
We have recovered Markov's Inequality, that $P(Y \geq t) \leq \frac{E[Y]}{t}$.
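As a small numerical illustration (the exponential distribution and the values of $t$ below are arbitrary choices), one can compare the empirical tail probability of a nonnegative random variable with the Markov bound $E[Y]/t$:

```python
# Empirical check of Markov's Inequality for a nonnegative random variable.
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=1_000_000)   # nonnegative samples with E[Y] = 2

for t in [1.0, 4.0, 10.0]:
    empirical = np.mean(y > t)       # P(Y > t), estimated from the samples
    markov = y.mean() / t            # Markov's bound E[Y] / t
    print(f"t={t}: P(Y > t) ~ {empirical:.4f} <= {markov:.4f}")
```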
Theorem 1. (Weak Law of Large Numbers) For all $\epsilon > 0$, $\lim_{n\to\infty} P(|\hat\mu_n - \mu| > \epsilon) = 0$.
Proof. Fix $\epsilon > 0$. Then $P(|\hat\mu_n - \mu| > \epsilon) = P(|\hat\mu_n - \mu|^2 > \epsilon^2)$. Now, the random variable in question, $|\hat\mu_n - \mu|^2$, is nonnegative. This means we can apply Markov's Inequality, yielding
$$P(|\hat\mu_n - \mu| > \epsilon) = P\left(|\hat\mu_n - \mu|^2 > \epsilon^2\right) \leq \frac{E\left[|\hat\mu_n - \mu|^2\right]}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$
where, in the last step, we applied Lemma 1. Next, we take the limit,
$$\lim_{n\to\infty} P(|\hat\mu_n - \mu| > \epsilon) \leq \lim_{n\to\infty} \frac{\sigma^2}{n\epsilon^2} = 0$$
Example: Estimating the bias of a coin
Suppose we are trying to estimate the bias of a coin. We know the bias lies in $[0, 1]$, and the variance of a distribution with this support is bounded by $\sigma^2 \leq \frac{1}{4}$. This gives us
$$P(|\hat\mu_n - \mu| > \epsilon) \leq \frac{1}{4n\epsilon^2}$$
If we want the probability of such an event to be bounded by $\delta$, then we set the right hand side equal to $\delta$ and solve for $\epsilon$. This yields
$$|\hat\mu_n - \mu| \leq \sqrt{\frac{1}{4n\delta}}$$
with probability at least $1 - \delta$. Thus, if we desire that $|\hat\mu_n - \mu| \leq \epsilon$ with probability at least $1 - \delta$, then we must have $n \geq \frac{\epsilon^{-2}}{4\delta}$. Later, we will see that the Central Limit Theorem suggests this is very loose. Indeed, the CLT implies it suffices to take just $n = \epsilon^{-2}\log(2/\delta)/2$, which is substantially smaller.
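For concreteness, the sketch below evaluates the two sample-size requirements for one illustrative choice of $\epsilon$ and $\delta$ (the specific values are not from the lecture):

```python
# Compare the Markov/variance-based sample size with the CLT-suggested one for the coin example.
import math

eps, delta = 0.05, 0.05
n_markov = 1 / (4 * delta * eps ** 2)          # n >= eps^-2 / (4 delta)
n_clt = math.log(2 / delta) / (2 * eps ** 2)   # n ~ eps^-2 log(2/delta) / 2

print(f"Markov/variance requirement: n >= {n_markov:.0f}")   # 2000
print(f"CLT-suggested requirement:   n ~  {n_clt:.0f}")      # about 738
```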
Theorem 2. (Central Limit Theorem (CLT)) As $n \to \infty$, $\sqrt{n}(\hat\mu_n - \mu)$ converges in distribution to $N(0, \sigma^2)$.
Proof. Consider the random variable
$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n\sigma^2}}$$
We will prove the central limit theorem by calculating the characteristic function of this random variable, and showing that, in the limit as $n \to \infty$, it is the same as the characteristic function of $N(0, 1)$. We begin by rewriting the random variable $Z_n$,
$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n\sigma^2}} = \sum_{j=1}^n \frac{X_j - \mu}{\sqrt{n\sigma^2}} = \sum_{j=1}^n \frac{1}{\sqrt{n}}\, Y_j$$
where $Y_j = \frac{X_j - \mu}{\sigma}$. Note that these $Y_j$ are i.i.d. with mean 0 and variance 1. We want to find a closed form for the characteristic function of $Z_n$, which is given by
$$\varphi_{Z_n}(t) = E\left[\exp\left(it \sum_{j=1}^n \frac{1}{\sqrt{n}}\, Y_j\right)\right]$$
By the properties of exponentials, we can change the sum in the exponent into a product of exponentials:
$$\varphi_{Z_n}(t) = E\left[\prod_j \exp\left(it \frac{1}{\sqrt{n}}\, Y_j\right)\right]$$
Since the Yj are independent, the expectation commutes with the product, and we can write
$$\varphi_{Z_n}(t) = \prod_j E\left[\exp\left(it \frac{1}{\sqrt{n}}\, Y_j\right)\right]$$
We know the Yj are identically distributed, so each expectation in the product must have the same value.
This enables us to simplify the equation using Y1 , which is representative of all Yj
$$\varphi_{Z_n}(t) = E\left[\exp\left(it \frac{1}{\sqrt{n}}\, Y_1\right)\right]^n$$
Using the definition of the characteristic function, we can write this as a power of the characteristic function
of Y1
$$\varphi_{Z_n}(t) = \left[\varphi_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right]^n$$
Now, we can use Taylor's Theorem to approximate the characteristic function. For some (possibly complex) constant $c$, we have, as $\frac{t}{\sqrt{n}} \to 0$,
$$\varphi_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + c\,\frac{t^3}{6 n^{3/2}} + o\!\left(\frac{t^3}{n^{3/2}}\right)$$
Next, we recognize that, as $n \to \infty$, the higher-order terms become negligible and the characteristic function approaches $\varphi_{Z_n}(t) \to \left(1 - \frac{t^2}{2n}\right)^n$. Using the identity $e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n$, we conclude that
$$\lim_{n\to\infty} \varphi_{Z_n}(t) = e^{-\frac{1}{2}t^2}$$
which is exactly the characteristic function of the standard normal distribution, $N(0, 1)$. Recalling that our definition of $Z_n$ was a transformation of the random variables $X_i$, we see that, for large $n$, the sum of the $X_i$'s is approximately distributed as $N(n\mu, n\sigma^2)$.
Example: Estimating the bias of a coin using the Central Limit Theorem
Revisiting our coin flip example, we can use the central limit theorem to improve our asymptotic bound on $|\hat\mu_n - \mu|$. The central limit theorem tells us that $\sqrt{n}(\hat\mu_n - \mu) \sim N(0, \sigma^2)$ as $n \to \infty$, so we begin with the density of a normal distribution:
$$P(\hat\mu_n - \mu > \epsilon) \leq \int_\epsilon^\infty \frac{1}{\sqrt{2\pi\sigma^2/n}}\, e^{-\frac{n x^2}{2\sigma^2}}\, dx \leq e^{-\frac{n\epsilon^2}{2\sigma^2}}$$
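A quick simulation (with arbitrary parameter choices, included only as a sanity check of this asymptotic statement) can compare the empirical tail probability for coin flips with the Gaussian tail bound $e^{-n\epsilon^2/(2\sigma^2)}$ above:

```python
# Compare the simulated tail P(mu_hat - mu > eps) for fair coin flips with the CLT-based bound.
import numpy as np

rng = np.random.default_rng(2)
p, n, eps, trials = 0.5, 200, 0.05, 200_000
sigma2 = p * (1 - p)                              # variance of a single Bernoulli(p) flip

flips = rng.random((trials, n)) < p               # trials x n matrix of coin flips
mu_hat = flips.mean(axis=1)

print(np.mean(mu_hat - p > eps))                  # empirical tail probability (about 0.08)
print(np.exp(-n * eps ** 2 / (2 * sigma2)))       # Gaussian tail bound exp(-2 n eps^2) ~ 0.37
```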
Chernoff Bounds
Central Limit Theorem guarantees are useful for large sample sizes, but if n is small, we would still like to
bound the deviation of µ̂n from the true mean. Chernoff Bounds are a technique for bounding a random
variable using its moment generating function. We wish to bound the quantity P (µ̂n − µ > ε). For λ > 0,
we can use the fact that ex is monotonically increasing to transform our variable:
$$P(\hat\mu_n - \mu > \epsilon) = P\left(\lambda(\hat\mu_n - \mu) > \lambda\epsilon\right) = P\left(e^{\lambda(\hat\mu_n - \mu)} > e^{\lambda\epsilon}\right)$$
Now, our random variable is nonnegative, and we can apply Markov’s Inequality.
$$P(\hat\mu_n - \mu > \epsilon) \leq e^{-\lambda\epsilon}\, E\left[e^{\lambda(\hat\mu_n - \mu)}\right] = e^{-\lambda\epsilon}\, E\left[e^{\lambda\left(\frac{1}{n}\sum_i (X_i - \mu)\right)}\right] = e^{-\lambda\epsilon}\, E\left[\prod_i e^{\frac{\lambda}{n}(X_i - \mu)}\right]$$
$$= e^{-\lambda\epsilon} \prod_i E\left[e^{\frac{\lambda}{n}(X_i - \mu)}\right] \qquad (1)$$
$$= e^{-\lambda\epsilon}\, E\left[e^{\frac{\lambda}{n}(X_i - \mu)}\right]^n \qquad (2)$$
In equation 1, we have used the independence of Xi ’s, which means that the product commutes with the
expectation. In equation 2, we have leveraged their identical distribution (so the expectations are identical).
This sequence of steps, exponentiating and applying Markov’s inequality with independence, is the technique
known as the Chernoff bound.
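As an illustration of this recipe, the sketch below applies the Chernoff bound to Bernoulli($p$) rewards and optimizes over $\lambda > 0$ numerically; the closed-form MGF of a centered Bernoulli is standard, but the parameter values and the grid search are arbitrary choices, not part of the lecture.

```python
# Chernoff bound for Bernoulli rewards: exponentiate, apply Markov, use independence,
# then minimize the resulting bound over lambda > 0 on a grid.
import numpy as np

def chernoff_bound(p, n, eps, lambdas=np.linspace(1e-3, 500.0, 5000)):
    s = lambdas / n
    # MGF of one centered Bernoulli: E[exp(s (X_i - p))] = exp(-s p) (1 - p + p e^s)
    mgf = np.exp(-s * p) * (1 - p + p * np.exp(s))
    bounds = np.exp(-lambdas * eps) * mgf ** n    # e^{-lambda eps} E[e^{(lambda/n)(X_i - p)}]^n
    return bounds.min()

print(chernoff_bound(p=0.5, n=100, eps=0.1))      # optimized bound on P(mu_hat - mu > 0.1)
print(np.exp(-2 * 100 * 0.1 ** 2))                # Hoeffding's bound exp(-2 n eps^2), for comparison
```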
Hoeffding’s Inequality
Moving on from the Chernoff bound technique, we describe a more general-purpose bound: Hoeffding's Inequality. We prove the inequality by leveraging the Chernoff bound as well as Hoeffding's Lemma, which we state and prove first.
Lemma 3. (Hoeffding's Lemma) Let $X$ be a random variable taking values in $[a, b]$ almost surely, with $E[X] = 0$. Then for any real $s$,
$$E\left[e^{sX}\right] \leq e^{\frac{s^2(b-a)^2}{8}}$$
Proof. We adapt the proof found in Duchi [7]. First note that $e^{sx}$ is convex w.r.t. $x$. Thus we have:
$$e^{sX} \leq \frac{b - X}{b - a}\, e^{sa} + \frac{X - a}{b - a}\, e^{sb}$$
By linearity:
$$E\left[e^{sX}\right] \leq \frac{b - E[X]}{b - a}\, e^{sa} + \frac{E[X] - a}{b - a}\, e^{sb} = \frac{b}{b-a}\, e^{sa} - \frac{a}{b-a}\, e^{sb}$$
For ease of reading, let $p = \frac{-a}{b-a}$, also noting that $a = -p(b-a)$. We isolate a factor $e^{sa}$:
$$\frac{b}{b-a}\, e^{sa} - \frac{a}{b-a}\, e^{sb} = (1 - p)e^{sa} + p\,e^{sb} = \left(1 - p + p\,e^{s(b-a)}\right) e^{sa} = \left(1 - p + p\,e^{s(b-a)}\right) e^{-sp(b-a)}$$
Define $\varphi(u) = -pu + \log\left(1 - p + p\,e^{u}\right)$ with $u = s(b-a)$, so that the expression above equals $e^{\varphi(u)}$. We will bound $\varphi(u)$ via Taylor's Theorem; its first two derivatives are
$$\varphi'(u) = -p + \frac{p\,e^u}{1 - p + p\,e^u}, \qquad \varphi''(u) = \frac{p(1-p)e^u}{\left(1 - p + p\,e^u\right)^2}$$
Substituting $y = e^u$, we must bound
$$\frac{p(1-p)y}{(1 - p + py)^2}$$
over $y > 0$. This expression has a unique critical point on $y > 0$, which is its maximum, so it suffices to evaluate it at that critical point, $y = \frac{1-p}{p}$:
$$\varphi''(u) \leq \frac{p(1-p)\,\frac{1-p}{p}}{\left(1 - p + p\,\frac{1-p}{p}\right)^2} = \frac{(1-p)^2}{\left(2(1-p)\right)^2} = \frac{1}{4}$$
Since $\varphi(0) = 0$ and $\varphi'(0) = 0$, Taylor's Theorem gives $\varphi(u) \leq \frac{u^2}{8}$. We may conclude:
$$E\left[e^{sX}\right] \leq e^{\varphi(u)} \leq e^{\frac{u^2}{8}} = e^{\frac{s^2(b-a)^2}{8}}$$
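A quick numerical check of Lemma 3 for one zero-mean bounded variable (a uniform distribution on $[-2, 2]$; an arbitrary choice) is sketched below.

```python
# Check Hoeffding's Lemma: E[e^{sX}] <= exp(s^2 (b - a)^2 / 8) for a zero-mean X in [a, b].
import numpy as np

rng = np.random.default_rng(3)
a, b = -2.0, 2.0
x = rng.uniform(a, b, size=1_000_000)     # mean zero, supported on [a, b]

for s in [0.5, 1.0, 2.0]:
    lhs = np.mean(np.exp(s * x))                   # Monte Carlo estimate of E[e^{sX}]
    rhs = np.exp(s ** 2 * (b - a) ** 2 / 8)        # Hoeffding's Lemma bound
    print(f"s={s}: {lhs:.3f} <= {rhs:.3f}")
```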
We now prove the primary implication of Lemma 3 and the main result of this subsection.

Theorem 3. (Hoeffding's Inequality) Let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$ almost surely. Then for any $\epsilon > 0$,
$$P\left(\frac{1}{m}\sum_{i=1}^m X_i - \frac{1}{m}\sum_{i=1}^m E[X_i] \geq \epsilon\right) \leq \exp\left(\frac{-2\epsilon^2 m^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
Proof. We adapt the proof found in Duchi [7]. As mentioned earlier, the result is a straightforward application of Lemma 3.
For all 1 ≤ i ≤ m define a new variable Zi as the difference between Xi and its expectation.
Zi = Xi − E[Xi ]
This implies that E[Zi ] = 0. Moreover, we may bound the domain of Zi inside [ai − E[Xi ], bi − E[Xi ]]. In
particular, we note that the interval must still have length bi − ai independent of the expectation of Xi .
Let s be some positive value. We have:
$$P\left(\sum_{i=1}^m Z_i \geq t\right) = P\left(\exp\left(s\sum_{i=1}^m Z_i\right) \geq e^{st}\right) \overset{\text{Chernoff}}{\leq} \frac{E\left[\prod_{i=1}^m e^{sZ_i}\right]}{e^{st}}$$
By independence of the Zi we may shift the expectation inside the product and continue. Recall that Zi
must still live in an interval of length bi − ai .
$$\frac{E\left[\prod_{i=1}^m e^{sZ_i}\right]}{e^{st}} = \frac{\prod_{i=1}^m E\left[e^{sZ_i}\right]}{e^{st}} \overset{\text{Hoeffding Lemma}}{\leq} e^{-st}\prod_{i=1}^m e^{\frac{s^2(b_i - a_i)^2}{8}} = \exp\left(-st + \frac{s^2}{8}\sum_{i=1}^m (b_i - a_i)^2\right)$$
We now substitute a conveniently engineered value of s to conclude our result. Note that s > 0 so the earlier
restriction on s is satisfied.
$$s = \frac{4t}{\sum_{i=1}^m (b_i - a_i)^2}$$
Substituting in $s$ as well as $t = \epsilon m$, we have:
$$P\left(\sum_{i=1}^m Z_i \geq \epsilon m\right) = P\left(\frac{1}{m}\sum_{i=1}^m X_i - \frac{1}{m}\sum_{i=1}^m E[X_i] \geq \epsilon\right)$$
$$\leq \exp\left(-\frac{4\epsilon m}{\sum_{i=1}^m (b_i - a_i)^2}\,\epsilon m + \frac{1}{8}\left(\frac{4\epsilon m}{\sum_{i=1}^m (b_i - a_i)^2}\right)^2 \sum_{i=1}^m (b_i - a_i)^2\right) = \exp\left(\frac{-2\epsilon^2 m^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
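The bound just derived is easy to evaluate; the helper below (a sketch, with illustrative ranges) computes $\exp\!\left(\frac{-2\epsilon^2 m^2}{\sum_i (b_i - a_i)^2}\right)$ for arbitrary per-variable intervals $[a_i, b_i]$.

```python
# Evaluate the Hoeffding bound for variables with (possibly different) ranges [a_i, b_i].
import math

def hoeffding_bound(eps, ranges):
    m = len(ranges)
    span_sq = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2 * (eps ** 2) * (m ** 2) / span_sq)

# For m coin flips in [0, 1], the bound reduces to exp(-2 m eps^2).
print(hoeffding_bound(0.1, [(0.0, 1.0)] * 100))   # about exp(-2) ~ 0.135
print(hoeffding_bound(0.1, [(0.0, 2.0)] * 100))   # wider ranges give a weaker bound
```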
For an alternative formulation of the regret, define each arm’s gap from the best arm as ∆i = maxj µj − µi .
We can rewrite the regret by taking the expectation inside the summation,
$$R(T) = \sum_t \left(\max_j E[X_{j,t}] - E[X_{I_t,t}]\right) = \sum_t \left(\max_j \mu_j - E[\mu_{I_t}]\right) = \sum_{i=1}^n \Delta_i\, E[T_i]$$
where $T_i$ is the total number of times we have played arm $i$. We see that the regret is the sum, over the suboptimal arms, of the number of times each arm is played multiplied by that arm's gap from the optimal arm.
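The decomposition $R(T) = \sum_i \Delta_i E[T_i]$ can be checked numerically; the sketch below uses a toy policy that pulls arms uniformly at random (the policy and the means are arbitrary choices, used only to illustrate the identity).

```python
# Check the regret decomposition R(T) = sum_i Delta_i E[T_i] on a toy uniform-random policy.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.9, 0.6, 0.5])           # true arm means; arm 0 is optimal
deltas = mu.max() - mu                   # per-arm gaps Delta_i
T, trials = 1000, 2000

regrets, pull_counts = [], []
for _ in range(trials):
    pulls = rng.integers(0, len(mu), size=T)              # I_t for t = 1, ..., T
    regrets.append(np.sum(mu.max() - mu[pulls]))          # sum_t (max_j mu_j - mu_{I_t})
    pull_counts.append(np.bincount(pulls, minlength=len(mu)))

print(np.mean(regrets))                                   # regret estimated directly
print(np.dot(deltas, np.mean(pull_counts, axis=0)))       # sum_i Delta_i E[T_i]
```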
The Upper Confidence Bound (UCB) algorithm chooses, at each round $t$, the arm with the largest optimistic estimate:
$$\mathrm{UCB}_{i,t} = \hat\mu_{i,T_i} + \sqrt{\frac{\alpha\log(t)}{2 T_i}}, \qquad I_t := \arg\max_{i\in[n]} \mathrm{UCB}_{i,t}$$
We see that our confidence bound, $\sqrt{\frac{\alpha\log(t)}{2T_i}}$, grows slowly as we play for more rounds (as $t$ increases), ensuring that we never stop playing any given arm. The confidence bound for arm $i$ shrinks quickly as we pull the arm (as $T_i$ increases).
The pseudocode may be found in Algorithm 1.
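A possible Python rendering of Algorithm 1 is sketched below; the Bernoulli reward model, the value of $\alpha$, and the arm means are illustrative assumptions rather than part of the lecture.

```python
# A sketch of the UCB procedure of Algorithm 1 with Bernoulli rewards.
import numpy as np

def ucb(means, T, alpha=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(means)
    counts = np.zeros(n)     # T_i: number of times arm i has been pulled
    sums = np.zeros(n)       # running sum of rewards from arm i
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1                                        # play each arm once
        else:
            mu_hat = sums / counts
            bonus = np.sqrt(alpha * np.log(t - 1) / (2 * counts))
            i = int(np.argmax(mu_hat + bonus))               # arg max_i UCB_{i, t-1}
        x = float(rng.random() < means[i])                   # observe Bernoulli reward
        counts[i] += 1
        sums[i] += x
        total_reward += x
    return counts, total_reward

means = [0.9, 0.8, 0.5]
counts, total_reward = ucb(means, T=5000)
print(counts)                                  # suboptimal arms are pulled only rarely
print(max(means) * 5000 - total_reward)        # noisy estimate of the regret R(T)
```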
Theorem 4. (Regret bound for the UCB algorithm) For $T \geq 1$ and $\alpha > 1$,
$$R(T) \leq \sum_{i:\Delta_i > 0}\left(4\alpha\Delta_i^{-1}\log(T) + \frac{2\alpha}{\alpha - 1}\Delta_i\right)$$
Proof. Suppose, without loss of generality, that arm 1 is optimal. Then, arm $i \neq 1$ will only be played in two
cases: either arms 1 and i have been sampled insufficiently to distinguish between their means, or the upper
confidence bound given by Hoeffding’s inequality fails for either arm 1, or arm i. We begin by bounding the
chance that we pull a suboptimal arm due to insufficient sampling.
Consider the following two events, $A_t$ and $B_t$, defined below.
Algorithm 1 UCB
1: procedure UCB({1, 2, . . . , n}, T)     ▷ Arms 1 through n, max steps T
2:     for 1 ≤ t ≤ n do
3:         I_t ← t                         ▷ Play each arm once
4:     end for
5:     for n + 1 ≤ t ≤ T do
6:         I_t ← arg max_{i ∈ {1,...,n}} UCB_{i,t−1}
7:         Observe reward X_{I_t,t}
8:     end for
9: end procedure
$A_t$) $\hat\mu_{i,T_i} \leq \mu_i + \sqrt{\frac{\alpha\log t}{2T_i}}$

$B_t$) $\hat\mu_{1,T_1} \geq \mu_1 - \sqrt{\frac{\alpha\log t}{2T_1}}$
We wish to bound the probabilities of the complements of events $A_t$ and $B_t$ occurring. We will apply Hoeffding's inequality (Theorem 3). $A_t$ fails when
$$\hat\mu_{i,T_i} - \mu_i > \sqrt{\frac{\alpha\log t}{2 T_i}}$$
By Theorem 3 we have:
$$P(A_t^c) = P\left(\hat\mu_{i,T_i} - \mu_i > \epsilon\right) \leq \exp\left(\frac{-2\epsilon^2 t^2}{\sum_{i=1}^t (b_i - a_i)^2}\right) = \exp\left(\frac{-2\epsilon^2 t^2}{\sum_{i=1}^t (1 - 0)^2}\right) = \exp\left(-2\epsilon^2 t\right)$$
We plug in our bounding value $\epsilon = \sqrt{\frac{\alpha\log t}{2T_i}}$:
$$P\left(\hat\mu_{i,T_i} - \mu_i > \sqrt{\frac{\alpha\log t}{2T_i}}\right) \leq \exp\left(\frac{-2t\,\alpha\log t}{2T_i}\right) = \exp\left(\frac{-t\,\alpha\log t}{T_i}\right) \leq \exp\left(\frac{-t\,\alpha\log t}{t}\right) = e^{-\alpha\log t} = t^{-\alpha} \qquad (4)$$
The statement and justification are identical for the complement of event $B_t$, and we return to the task of bounding the number of suboptimal arm pulls.
A suboptimal arm i is only played if its upper confidence bound exceeds that of arm 1, meaning that
$$\hat\mu_{i,T_i} + \sqrt{\frac{\alpha\log t}{2 T_i}} > \hat\mu_{1,T_1} + \sqrt{\frac{\alpha\log t}{2 T_1}}$$
Suppose that both $A_t$ and $B_t$ hold. In this case, suboptimal arm $i$ is pulled because it has been insufficiently sampled up to this point.
Since At has been assumed to be true, we generate the following bound:
$$\mu_i + 2\sqrt{\frac{\alpha\log(t)}{T_i}} \geq \hat\mu_{i,T_i} + \sqrt{\frac{\alpha\log(t)}{2 T_i}} \qquad (5)$$
Next, we use our assumption that $B_t$ holds to lower-bound the upper confidence bound of arm 1:
$$\hat\mu_{1,T_1} + \sqrt{\frac{\alpha\log t}{2 T_1}} \geq \mu_1 \qquad (6)$$
Chaining Equation (5), the arm-selection condition, and Equation (6), we have
$$\mu_i + 2\sqrt{\frac{\alpha\log(t)}{T_i}} \geq \mu_1$$
Rearranging we have:
$$\sqrt{\frac{\alpha\log(t)}{T_i}} \geq \frac{\mu_1 - \mu_i}{2}$$
Now, recall our definition of the optimality gap of an arm, ∆i = maxj µj − µi . Since we know arm 1 is
optimal, this becomes ∆i = µ1 − µi . Our inequality becomes
$$\sqrt{\frac{\alpha\log(t)}{T_i}} \geq \frac{\Delta_i}{2}$$
Solving for the number of times Ti that an arm has been played, we arrive at
$$T_i \leq 4\Delta_i^{-2}\,\alpha\log(t) \leq 4\Delta_i^{-2}\,\alpha\log(T) \qquad (7)$$
Thus, when $A_t$ and $B_t$ hold, we only play suboptimal arm $i$ at most $4\Delta_i^{-2}\alpha\log(T)$ times.
Recall that $I_t$ can only be equal to $i$ if either arm $i$ has been sampled insufficiently (fewer than $4\Delta_i^{-2}\alpha\log(T)$ times) or one of the events $A_t$, $B_t$ fails. For any arm $i$, the expected number of times it is played up to round $T$ under UCB is:
$$E[T_i] = \sum_{t=1}^T E[\mathbf{1}(I_t = i)] \leq 4\alpha\Delta_i^{-2}\log(T) + \sum_{t=1}^T E\left[\mathbf{1}\{A_t^c \cup B_t^c\}\right]$$
$$\leq 4\alpha\Delta_i^{-2}\log(T) + \sum_{t=1}^T \left(E\left[\mathbf{1}\{A_t^c\}\right] + E\left[\mathbf{1}\{B_t^c\}\right]\right) \qquad (8)$$
$$\leq 4\alpha\Delta_i^{-2}\log(T) + \sum_{t=1}^T \left(t^{-\alpha} + t^{-\alpha}\right) \qquad (9)$$
$$= 4\alpha\Delta_i^{-2}\log(T) + 2\sum_{t=1}^T t^{-\alpha}$$
The inequality in line (8) comes from the union bound on events $A_t^c$ and $B_t^c$. The inequality in line (9) comes from Equation (4). In order to bound the second term on the right hand side, we note:
$$\sum_{t=1}^T t^{-\alpha} \leq 1 + \int_1^\infty x^{-\alpha}\,dx = 1 + \frac{-1}{1-\alpha} = \frac{\alpha}{\alpha - 1}$$
Therefore we have:
$$E[T_i] \leq 4\alpha\Delta_i^{-2}\log(T) + \frac{2\alpha}{\alpha - 1} \qquad (10)$$
The desired result follows from summing over all suboptimal arms:
$$R(T) = \sum_{i\neq 1}\Delta_i\, E[T_i] \leq \sum_{i\neq 1}\left(4\alpha\Delta_i^{-1}\log(T) + \frac{2\alpha}{\alpha - 1}\Delta_i\right)$$
Theorem 5. (Lai and Robbins (1985) [13]) Lai and Robbins provided an asymptotically optimal lower bound on the expected regret of any bandit algorithm.
The total regret is the sum of the regret of each group. The maximum total regret incurred due to pulling
arms in G1 is given by
$$\sum_{i\in G_1} T_i\,\Delta_i$$
By definition, the regret on any arm $i \in G_1$ is bounded by $\Delta_i < \sqrt{\frac{n}{T}\log(T)}$. We may therefore bound the total regret from this group by $\sum_{i\in G_1} T_i\,\Delta_i \leq \sqrt{\frac{n}{T}\log(T)}\sum_{i\in G_1} T_i \leq \sqrt{nT\log(T)}$, since $\sum_i T_i \leq T$.
We may now shift our focus to group $G_2$. Recall that, by definition, for all arms $i \in G_2$ we have $\Delta_i \geq \sqrt{\frac{n}{T}\log(T)}$. Rearranging, we have:
$$\Delta_i^{-1} \leq \sqrt{\frac{T}{n\log(T)}} \qquad (11)$$
Thompson Sampling
Thompson Sampling (TS) is a Bayesian algorithm that balances the exploration and exploitation of non-optimal arms. In TS, the reward of each arm is distributed Bernoulli and the expected
reward is unknown. The objective is to find the optimal arm that gives maximum expected cumulative
reward. The TS algorithm initially assumes arm i to have prior Beta (1, 1) on µi , which is natural because
Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed Si (t) successes (reward=1) and
Fi (t) failures (reward=0) in Ti (t) = Si (t) + Fi (t) plays of arm i, the algorithm updates the distribution on µi
as Beta(Si (t) + 1, Fi (t) + 1). The algorithm then samples from these posterior distributions of the µi ’s, and
plays an arm according to the probability of its mean being the largest. The Thompson sampling algorithm
is given in Algorithm 2.
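A minimal sketch of the Beta-Bernoulli Thompson Sampling procedure described above is given below; the arm means and horizon are arbitrary illustrative choices, and this is not a transcription of Algorithm 2.

```python
# Beta-Bernoulli Thompson Sampling: sample from each arm's posterior and play the arg max.
import numpy as np

def thompson_sampling(means, T, seed=0):
    rng = np.random.default_rng(seed)
    n = len(means)
    successes = np.zeros(n)      # S_i(t)
    failures = np.zeros(n)       # F_i(t)
    counts = np.zeros(n)
    for _ in range(T):
        theta = rng.beta(successes + 1, failures + 1)    # one sample from each Beta(S_i + 1, F_i + 1)
        i = int(np.argmax(theta))                        # play the arm with the largest sample
        reward = float(rng.random() < means[i])
        successes[i] += reward
        failures[i] += 1 - reward
        counts[i] += 1
    return counts

print(thompson_sampling([0.9, 0.8, 0.5], T=5000))        # pulls concentrate on the best arm
```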
Thompson Sampling has received considerable attention in industry as well (e.g. Scott (2010) [18],
Graepel et al. (2010) [11], and Tang et al. (2013) [19]). For more details please see [1, 2, 6, 17].
KL-UCB
The Kullback-Leibler UCB algorithm (KL-UCB) presents a modern approach to UCB for the standard
stochastic bandits problem. KL-UCB improves the regret bounds from earlier UCB algorithms by considering
the distance between the estimated distributions of each arm. The algorithms differ only at the arm selection
step. Recall that UCB uses the following rule for arm selection:
$$\arg\max_{i\in[n]}\left(\hat\mu_{i,T_i} + \sqrt{\frac{\alpha\log(t)}{2 T_i}}\right)$$
In contrast, KL-UCB uses:
$$I_t := \arg\max_{i\in[n]}\ \max\left\{q \in [0, 1] : T_i \cdot d\!\left(\hat\mu_{i,T_i},\, q\right) \leq \log(t) + c\log(\log(t))\right\} \qquad (15)$$
where $d(p, q)$ denotes the Kullback-Leibler divergence between Bernoulli distributions with parameters $p$ and $q$, and $c$ is a constant.
KL-UCB is optimal for Bernoulli distributions and strictly dominates UCB for any bounded reward
distributions. For more details please see [9, 14].
Algorithm 3 KL-UCB
1: procedure KL-UCB({1, 2, . . . , n}, T)     ▷ Arms 1 through n, max steps T
2:     for 1 ≤ t ≤ n do
3:         I_t ← t                            ▷ Play each arm once
4:     end for
5:     for n + 1 ≤ t ≤ T do
6:         I_t ← arg max_{i ∈ [n]} max{ q ∈ [0, 1] : T_i · d(µ̂_{i,T_i}, q) ≤ log(t) + c log(log(t)) }
7:         Observe reward X_{I_t,t}
8:     end for
9: end procedure
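For Bernoulli rewards, the KL-UCB index in Equation (15) can be computed by bisection, since $d(\hat\mu, q)$ is increasing in $q$ on $[\hat\mu, 1]$. The sketch below is one possible implementation; the helper names, the clipping constant, and the choice $c = 3$ are implementation choices, not prescribed by the lecture.

```python
# Compute the Bernoulli KL-UCB index max{ q in [0,1] : T_i * d(mu_hat, q) <= log(t) + c log(log(t)) }.
import math

def bernoulli_kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t, c=3.0, iters=50):
    budget = (math.log(t) + c * math.log(max(math.log(t), 1e-12))) / pulls
    lo, hi = mu_hat, 1.0
    for _ in range(iters):                 # bisection: d(mu_hat, q) is increasing on [mu_hat, 1]
        mid = (lo + hi) / 2
        if bernoulli_kl(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mu_hat=0.6, pulls=20, t=100))   # upper confidence value for this arm (~0.93)
```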
References
[1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem.
CoRR, abs/1111.1797, 2011.
[2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. CoRR,
abs/1209.3353, 2012.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Mach. Learn., 47(2-3):235–256, May 2002.
[4] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed
bandit problem.
Lecture 3: Stochastic Multi-Armed Bandits, Regret Minimization 15
[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. CoRR, abs/1204.5721, 2012.
[6] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Proceedings of the
24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2249–2257,
USA, 2011. Curran Associates Inc.
[7] John C. Duchi. Probability bounds. 2009.
[8] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The gen-
eralized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 586–594. Curran Associates, Inc.,
2010.
[9] Aurélien Garivier and Olivier Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond.
In COLT, 2011.
[10] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems.
In Proceedings of the 31st International Conference on International Conference on Machine Learning
- Volume 32, ICML’14, pages I–100–I–108. JMLR.org, 2014.
[11] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale bayesian
click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In
Proceedings of the 27th International Conference on International Conference on Machine Learning,
ICML’10, pages 13–20, USA, 2010. Omnipress.
[12] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification
in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, January 2016.
[13] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, March 1985.
[14] Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A Finite-Time Analysis of Multi-armed
Bandits Problems with Kullback-Leibler Divergences. In Sham Kakade & Ulrike von Luxburg, editor,
24th Annual Conference on Learning Theory : COLT’11, page 18, Budapest, Hungary, July 2011.
[15] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. CoRR, abs/0812.3465,
2008.
[16] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR,
abs/1301.2609, 2013.
[17] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on thompson sampling.
CoRR, abs/1707.02038, 2017.
[18] Steven L. Scott. A modern bayesian look at the multi-armed bandit. Appl. Stoch. Model. Bus. Ind.,
26(6):639–658, November 2010.
[19] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contex-
tual bandits. In Proceedings of the 22Nd ACM International Conference on Information & Knowledge
Management, CIKM ’13, pages 1587–1594, New York, NY, USA, 2013. ACM.
[20] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.