
EE675A - IITK, 2022-23-II

Lecture 4: UCB and Thompson Sampling Algorithm


18th January 2023
Lecturer: Subrahmanya Swamy Peruru
Scribe: Akansh Agrawal, Aditya Ranjan

We have so far discussed the Successive Elimination algorithm and performed its regret analysis
for two cases: instance-dependent and instance-independent. In this lecture we discuss another
algorithm, the UCB algorithm [2], and derive its regret bounds as well. Next, we cover the
Thompson Sampling algorithm [2], which is based on maintaining probability distributions over
the arm means rather than point estimates.

1 UCB Algorithm
Let us first go through the following conversation to get the gist of the algorithm.
Teacher: Consider a bandit in an imaginary world who attacks a diamond trader carrying a
treasure box with k arms. The trader, being from IIT, is smart and has designed the mechanism of
his box so that each arm, when played, yields a reward from some unknown distribution. Moreover,
the total number of times the box can be played is bounded by some constant. So can you guess
what strategy the bandit would apply, apart from using the Successive Elimination algorithm?
Student 1: Yeah... he will definitely ask the trader to tell him the optimal arm to maximize the return.
Student 2: But does the trader himself know which arm is optimal?
Teacher: The trader indeed knows, but he may lie. IITians can be liars, can't they?
Everyone: Laughs...
Student 1: So maybe some exploration is required in the initial stages, since there is uncertainty
about the action-value estimates here.
Teacher: Right, but can we do something more rigorous?
Student 2: Maybe ϵ-greedy would work.
Teacher: The drawback of ϵ-greedy is the random selection of an arm in between. Can we do
better than this approach by giving weight to the arms that have the potential to be the optimal
arm, using what we learnt from the Successive Elimination algorithm?
Students: Hmmm...
Teacher: Alright, you guys can only think when it is about marks. Consider yourself the dacoit
and the treasure to be the marks earned.
Student 1: Maybe we can use an upper confidence bound to choose the arm after doing the
initial exploration.
Teacher: Bingo! See, it works. Marks drive you all crazy... So, keeping the jokes aside, let us
try to understand the algorithm our friend just proposed.

The Upper-Confidence-Bound (UCB) algorithm adaptively selects an arm based on its potential of
being the optimal arm, combining each arm's current estimate with an upper confidence bound.
The algorithm is formally stated as follows:

Algorithm 1 UCB
1. Play each arm a ∈ A once.
2. For each subsequent round t, play the arm a(t) ∈ A such that a(t) = \arg\max_a UCB_t(a),
   where UCB_t(a) = \bar{\mu}_t(a) + \epsilon_t(a) and \epsilon_t(a) = \sqrt{\frac{2 \log T}{n_t(a)}}.
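As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the UCB rule above on simulated Bernoulli arms; the function names and the simulation setup are our own illustrative choices.

```python
import math
import random

def run_ucb(true_means, T):
    """Minimal UCB sketch for simulated Bernoulli arms with (unknown) means `true_means`."""
    K = len(true_means)
    counts = [0] * K      # n_t(a): number of times arm a has been played so far
    sums = [0.0] * K      # running sum of rewards obtained from arm a

    def pull(a):
        # Bernoulli(true_means[a]) reward
        return 1.0 if random.random() < true_means[a] else 0.0

    # Step 1: play each arm once
    for a in range(K):
        sums[a] += pull(a)
        counts[a] += 1

    # Step 2: in every remaining round, play the arm with the largest UCB index
    for t in range(K, T):
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(T) / counts[a]) for a in range(K)]
        a_t = max(range(K), key=lambda a: ucb[a])
        sums[a_t] += pull(a_t)
        counts[a_t] += 1

    return counts  # the best arm should dominate these play counts

# Example: three arms with means 0.3, 0.5, 0.7
print(run_ucb([0.3, 0.5, 0.7], T=10000))
```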

1.1 UCB Regret Analysis


Essentially, after the initial exploration, an arm a will be played in round t only if its UCB is
greater than or equal to the UCB of the optimal arm a* estimated so far:

UCB_t(a) \ge UCB_t(a^*) \qquad (1)


\bar{\mu}_t(a) + \epsilon_t(a) \ge \bar{\mu}_t(a^*) + \epsilon_t(a^*) \qquad (2)

\bar{\mu}_t(a^*) - \bar{\mu}_t(a) \le \epsilon_t(a) - \epsilon_t(a^*) \qquad (3)

Also, recall the definition of the gap,

\Delta(a) = \mu(a^*) - \mu(a) \qquad (4)

Since (on the clean event) \mu(a^*) \le \bar{\mu}_t(a^*) + \epsilon_t(a^*) and \mu(a) \ge \bar{\mu}_t(a) - \epsilon_t(a), we get

\Delta(a) \le \bar{\mu}_t(a^*) + \epsilon_t(a^*) - (\bar{\mu}_t(a) - \epsilon_t(a)) \qquad (5)

and combining this with (3),

\Delta(a) \le 2\epsilon_t(a) \qquad (6)

\Delta(a) \le 2\sqrt{\frac{2 \log T}{n_t(a)}} \qquad (7)

Note that the bound obtained in equation (7) is the same as the one we had for the Successive
Elimination algorithm, and therefore we will get similar regret bounds in this case too. Let us
quickly derive them.

1.2 Derivation for the Instance-dependent bound for UCB


\Delta(a) = \mu(a^*) - \mu(a) \le 2\epsilon_t(a) \qquad (8)

\Delta(a) \le O(\epsilon_T(a)) = O\big(\sqrt{\log T / n_T(a)}\big) \qquad (9)

n_T(a) \le O\big(\log T / \Delta(a)^2\big) \qquad (10)

Therefore the expected total regret contributed by an arm a is:

R(T; a) = \Delta(a) \cdot n_T(a) \qquad (11)

R(T; a) \le \Delta(a) \cdot O\big(\log T / \Delta(a)^2\big) \qquad (12)

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \qquad (13)
The total regret, R(T), is given as:

R(T) = \sum_{a \in A} R(T; a) \qquad (14)

R(T) = \sum_{a:\, \Delta(a) > 0} O\big(\log T / \Delta(a)\big) \qquad (15)

R(T) \le O(\log T) \sum_{a:\, \Delta(a) > 0} \frac{1}{\Delta(a)} \qquad (16)

R(T) \le O\!\left(\frac{K \log T}{\Delta}\right) \qquad (17)

where \Delta = \min_{a:\, \Delta(a) > 0} \Delta(a) is the smallest non-zero gap (the optimal arm contributes no
regret, so the sums run over the suboptimal arms only).
As we saw for Successive Elimination, this bound is tight only when ∆ is large enough; otherwise
it becomes loose, and an instance-independent bound is required.

1.3 Derivation for the Instance-Independent bound for UCB


We saw above that

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \qquad (18)

R(T) = \sum_{a \in A} R(T; a) \qquad (19)

We can split the arms according to a threshold ϵ on the gap ∆(a) and analyze the regret in two cases:

If \Delta(a) < \epsilon,

R(T; a) = \Delta(a) \cdot n_T(a) \le \epsilon\, n_T(a)

If \Delta(a) \ge \epsilon,

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \le O\!\left(\frac{\log T}{\epsilon}\right)

Therefore, summing the first case over all arms (using \sum_a n_T(a) = T) and the second case over
at most K arms, we have

R(T) = \sum_{a \in A} R(T; a) \qquad (20)

R(T) \le O\!\left(\epsilon T + \frac{K}{\epsilon} \log T\right) \qquad (21)

As in the Successive Elimination analysis, minimizing the right-hand side over \epsilon gives
\epsilon^* = \sqrt{(K/T) \log T}, which on substituting yields

R(T) \le O\big(\sqrt{K T \log T}\big). \qquad (22)
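For completeness, here is the minimization step spelled out (a standard calculus computation that the notes leave implicit):

\frac{d}{d\epsilon}\left(\epsilon T + \frac{K}{\epsilon}\log T\right) = T - \frac{K \log T}{\epsilon^2} = 0
\;\Longrightarrow\;
\epsilon^* = \sqrt{\frac{K \log T}{T}},
\qquad
\epsilon^* T = \frac{K}{\epsilon^*}\log T = \sqrt{K T \log T}.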
Therefore UCB achieves the same regret guarantees as the Successive Elimination algorithm. For
more details, please refer to the notes of the previous lecture.

2 Maintain Probability Distributions


In all our previous discussions of the bandit problem, we tried to obtain some estimate of the
true means µ(a) of the arms. In the Explore-Then-Commit and ϵ-greedy algorithms, we estimated
the means using a point estimate, the sample mean µ̄(a). To capture more information, in the
Adaptive and UCB algorithms, we maintained confidence intervals (LCB(a), UCB(a)) for µ(a).
We can go even further: keep track of the entire probability distribution of each µ(a)! We start
with some prior distributions, and every time we play an arm a we use Bayes' rule to update the
distribution of µ(a) to its posterior distribution, which is simply our belief about the distribution
of µ(a) after we get a new observation (the reward obtained).
Let us illustrate how this works using a coin-toss game.

2.1 Coin Toss Game using Bayesian Inference [1]


Suppose we are given a biased coin with P(Head) = p, where the probability p is unknown to us,
and we wish to estimate it. A frequentist would approach this problem as follows: toss the coin n
times; if Heads appears H_n times, then p ≈ H_n / n. In fact, in the limit n → ∞, the ratio H_n / n
converges to p.
Here is the Bayesian approach to the problem. First, we have some prior idea about the parameter
p; that is, we have a prior distribution over the value of p. For example, if initially we have no idea
about p's value, we can choose it to be Uniform(0, 1), i.e. all values in [0, 1] are equally probable.
The prior could, in general, be any distribution on [0, 1] (since we know p cannot lie outside that
range), for example the Beta distribution B(α, β) with density c(α, β) θ^{α−1} (1 − θ)^{β−1}, 0 ≤ θ ≤ 1.
Here c(α, β) is a constant depending on α and β, needed to make the density integrate to 1.
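For reference, this normalizing constant has a standard closed form in terms of the Gamma function (not stated in the notes, but useful to keep in mind):

c(\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)},
\qquad
\int_0^1 c(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}\, d\theta = 1.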
Given our prior distribution, we now toss the coin once. Let the result be some r ∈ {0, 1} (1
for Heads, 0 for Tails). Bayes’ rule states that for events A, B:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{\sum_{A'} P(B \mid A')\, P(A')}

If we take event A as the distribution of p (i.e., event p = θ) and event B as our observation (result
of the coin toss), we can write this as:

P(p = \theta \mid \text{Observation } r)
= \frac{P(\text{Observation } r \mid p = \theta)\, P(p = \theta)}{P(\text{Observation } r)}
= \frac{P(\text{Observation } r \mid p = \theta)\, P(p = \theta)}{\int_0^1 P(\text{Observation } r \mid p = \theta')\, P(p = \theta')\, d\theta'}

(Here we abuse notation slightly: P(p = θ) denotes the continuous probability density function of
p at θ. Also, the summation is replaced by an integral due to the continuous nature of p's
distribution.)
We can see that P(Observation r | p = θ) is easy to find: it is θ if r = 1 (Heads) and (1 − θ) if
r = 0 (Tails), i.e. P(Observation r | p = θ) = θ^r (1 − θ)^{1−r}. Thus, if we have our prior distribution
P(p = θ), we can compute the posterior distribution given the observation, P(p = θ | Observation r).
If we have a Beta prior B(α, β) (even Uniform(0, 1) = B(1, 1)), then our posterior is:

P(p = \theta \mid \text{Observation } r)
= \frac{\theta^r (1 - \theta)^{1 - r} \cdot c(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}
       {\int_0^1 \theta'^r (1 - \theta')^{1 - r} \cdot c(\alpha, \beta)\, \theta'^{\alpha - 1} (1 - \theta')^{\beta - 1}\, d\theta'}

Note that the denominator is just a constant (this is true in Bayesian inference in general, and not
just this coin toss example), so we can write our posterior in a proportional manner (neglecting the
constants).
P(p = \theta \mid \text{Observation } r) \propto \theta^{\alpha + r - 1} (1 - \theta)^{\beta + (1 - r) - 1}
i.e., our posterior distribution is B(α + r, β + (1 − r)) (B(α + 1, β) in case of Heads, B(α, β + 1)
in case of Tails). In the case of Beta priors and Bernoulli coin tosses, each time we toss, our
distribution belief for p changes from the previous one in the above manner.
If our initial prior is B(α, β) and we toss n times, out of which there are H_n heads and T_n tails,
then our final posterior distribution of p is B(α + H_n, β + T_n). Given this final posterior distribution
of p based on our observations, we can extract whatever estimate of p suits our needs, e.g. the
posterior mean or the posterior median.
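To make the update concrete, here is a tiny Python sketch of this conjugate Beta-Bernoulli update on simulated coin tosses (the function name and simulation are our own illustrative additions, not from the notes).

```python
import random

def beta_bernoulli_posterior(p_true, n, alpha=1.0, beta=1.0):
    """Start from a Beta(alpha, beta) prior on p and update it after n simulated coin tosses."""
    for _ in range(n):
        r = 1 if random.random() < p_true else 0   # one Bernoulli(p_true) toss
        alpha += r                                 # Heads: alpha <- alpha + 1
        beta += 1 - r                              # Tails: beta  <- beta  + 1
    return alpha, beta

# Example: with a uniform prior, the posterior mean alpha / (alpha + beta) approaches p_true
a, b = beta_bernoulli_posterior(p_true=0.7, n=1000)
print(a, b, a / (a + b))
```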

2.2 Thompson Sampling Algorithm
Now, let us consider the bandit problem with Bernoulli rewards: each arm gives a reward of 1 with
some probability (fixed for that arm) and a reward of 0 with the remaining probability.

P(R(a) = 1) = \mu(a), \qquad P(R(a) = 0) = 1 - \mu(a) \qquad \text{for each arm } a \in A \ (0 \le \mu(a) \le 1)

Note: It can be seen that E[R(a)] = µ(a) · 1 + (1 − µ(a)) · 0 = µ(a), which is why we denote the
Bernoulli success probabilities by µ(a), consistent with our earlier notation for the arm means.
To begin with, we may have some initial idea about the behaviour of each arm, i.e. some prior
distribution for each parameter µ(a), which we take to be a Beta distribution B(α_0(a), β_0(a)).
Even if we have no prior information on µ(a), we can take it to be the Uniform(0, 1) distribution,
i.e. B(1, 1) (α_0(a) = 1, β_0(a) = 1).
Every time (say at round t) we choose an arm a (we will see the selection policy soon) and play it,
we get a reward R_t(a) = r. Based on the reward, we update the distribution of µ(a) to its posterior
distribution, which can be found using Bayes' rule.

P(\mu(a) = \theta \mid R_t(a) = r) \propto P(R_t(a) = r \mid \mu(a) = \theta) \cdot P(\mu(a) = \theta)

Note that P(R_t(a) = r \mid \mu(a) = \theta) = \theta^r (1 - \theta)^{1 - r} for r ∈ {0, 1} (Bernoulli distribution).
If our prior was B(α, β), then, as we saw in the coin-toss example, the posterior is

P(\mu(a) = \theta \mid R_t(a) = r) \propto \theta^r (1 - \theta)^{1 - r} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{\alpha + r - 1} (1 - \theta)^{\beta + (1 - r) - 1},

i.e. B(α + r, β + (1 − r)).
So based on the reward r, the posterior of µ(a) for the chosen arm a is updated as above.

How do we choose the arm a? The heuristic is: for each a ∈ A, play arm a with the probability that
it has the highest mean, i.e. with probability P(µ(a) = max_{a' ∈ A} µ(a')). This probability is computed
with respect to our own beliefs about the distributions of the µ(a)'s.
For example, if there were only two arms a and b, we would choose arm a with probability
P(µ(a) > µ(b)) and arm b with probability P(µ(b) > µ(a)). Here the distributions of µ(a) and µ(b)
are our most up-to-date beliefs about the two arms.
Now, these probabilities are hard to calculate analytically. Hence, we use a simple sampling
procedure that does the job. The idea is as follows: for each arm a, sample a point θ̃(a) from the
current distribution of µ(a), namely B(α(a), β(a)). Then select the arm with the highest θ̃(a). Since
each θ̃(a) is drawn from our current belief distribution for µ(a), this procedure indeed selects each
arm a with the required probability. This is known as the Thompson Sampling algorithm.
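As a small sanity check of this claim (our own illustration, not part of the notes), the two-arm probability P(µ(a) > µ(b)) can be estimated by Monte Carlo, and it equals the fraction of rounds in which the sampling rule would pick arm a under fixed beliefs:

```python
import random

def prob_a_beats_b(alpha_a, beta_a, alpha_b, beta_b, n_samples=100000):
    """Monte Carlo estimate of P(mu(a) > mu(b)) under independent Beta beliefs."""
    wins = 0
    for _ in range(n_samples):
        theta_a = random.betavariate(alpha_a, beta_a)  # sample from the belief for arm a
        theta_b = random.betavariate(alpha_b, beta_b)  # sample from the belief for arm b
        if theta_a > theta_b:
            wins += 1                                  # the sampling rule would pick arm a
    return wins / n_samples

# Example (hypothetical beliefs): with Beta(5, 3) for arm a and Beta(4, 6) for arm b,
# arm a would be chosen in roughly this fraction of rounds.
print(prob_a_beats_b(5, 3, 4, 6))
```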
Thus, the final algorithm is presented as follows:

Algorithm 2 Thompson Sampling
1: At t = 0, start with a prior Beta distribution B(α_0(a), β_0(a)) for µ(a), for each arm a ∈ A.
2: for t = 1 to T do
3:    For each arm a ∈ A, sample θ̃_t(a) ∼ B(α_{t−1}(a), β_{t−1}(a)).
4:    Play arm a(t) = arg max_a θ̃_t(a).
5:    Update the posterior of arm a(t) based on the reward r_t ∈ {0, 1} received in round t:
      α_t(a(t)) = α_{t−1}(a(t)) + r_t,  β_t(a(t)) = β_{t−1}(a(t)) + (1 − r_t)
      (the parameters of all other arms are left unchanged).
6: end for
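For concreteness, here is a minimal Python sketch of Algorithm 2 on simulated Bernoulli arms (the simulation setup and names are our own illustrative choices, not part of the notes).

```python
import random

def thompson_sampling(true_means, T):
    """Thompson Sampling for simulated Bernoulli arms, with a Beta(1, 1) prior on each mean."""
    K = len(true_means)
    alpha = [1.0] * K   # alpha_0(a) = 1
    beta = [1.0] * K    # beta_0(a) = 1
    counts = [0] * K

    for t in range(T):
        # Sample theta_t(a) ~ Beta(alpha(a), beta(a)) for every arm
        theta = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        a_t = max(range(K), key=lambda a: theta[a])

        # Play the chosen arm and observe a Bernoulli reward
        r = 1 if random.random() < true_means[a_t] else 0

        # Conjugate posterior update for the played arm only
        alpha[a_t] += r
        beta[a_t] += 1 - r
        counts[a_t] += 1

    return counts

# Example: the arm with mean 0.7 should receive most of the plays
print(thompson_sampling([0.3, 0.5, 0.7], T=10000))
```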

References
[1] S. Agrawal. Lecture 4: Introduction to Thompson Sampling. Multi-Armed Bandits and Reinforcement Learning.
[2] A. Slivkins. Introduction to Multi-Armed Bandits, volume 7. 2022.
