
EE675A - IITK, 2022-23-II

Lecture 4: UCB and Thompson Sampling Algorithm


18th January 2023
Lecturer: Subrahmanya Swamy Peruru
Scribe: Akansh Agrawal, Aditya Ranjan

We have so far discussed the Successive Elimination algorithm and performed its regret analysis
for two cases: instance-dependent and instance-independent. In this lecture we discuss another
algorithm, the UCB algorithm [2], and derive its regret bounds as well. Next, we cover the
Thompson Sampling algorithm [2], which is based on maintaining probability distributions over
the arm means rather than point estimates.

1 UCB Algorithm
Let us first go through the following conversation to get the gist of the algorithm.
Teacher: Consider a bandit in an imaginary world who attacks a diamond trader carrying a
treasure box with k arms. The trader, being from IIT, is smart and has designed the mechanism of
his box so that each arm, when played, yields a reward from some unknown distribution. Moreover,
the total number of times the box can be played is bounded by some constant. So can you guess
what strategy the bandit would apply, apart from using the Successive Elimination algorithm?
Student 1: Yeah... he will definitely ask the trader to tell him the optimal arm to maximize the return.
Student 2: But does the trader himself know which arm is optimal?
Teacher: The trader indeed knows, but he may lie. IITians can be liars, can't they?
Everyone: Laughs...
Student 1: So maybe some exploration is required in the initial stages, since there is uncertainty
about the action-value estimates here.
Teacher: Right, but can we do something more rigorous?
Student 2: Maybe ϵ-greedy would work.
Teacher: The drawback of ϵ-greedy is the random selection of an arm in between. Can we do
better than this approach by giving weight to the arms that have the potential to be the optimal
arm, using what we learnt from the Successive Elimination algorithm?
Students: Hmmm...
Teacher: Alright, you guys can only think when it is about marks. Consider yourself the dacoit
and the treasure to be the marks earned.
Student 1: Maybe we can use an upper confidence bound to choose the arm after doing the
initial exploration.
Teacher: Bingo! See, it works. Marks drive you all crazy... So, keeping the jokes aside, let us
try to understand the algorithm our friend just proposed.

The Upper-Confidence-Bound (UCB) algorithm adaptively selects an arm based on its potential of
being the optimal arm, combining each arm's current estimate with an upper confidence bound.
The algorithm is formally stated as follows:

Algorithm 1 UCB
1. Play each arm a ∈ A once.
2. For each subsequent round t, play the arm a(t) ∈ A such that a(t) = \arg\max_a UCB_t(a),
   where UCB_t(a) = \bar{\mu}_t(a) + \epsilon_t(a) and \epsilon_t(a) = \sqrt{\frac{2 \log T}{n_t(a)}}.
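As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the UCB rule above on simulated Bernoulli arms; the function names and the simulation setup are our own illustrative choices.

```python
import math
import random

def run_ucb(true_means, T):
    """Minimal UCB sketch for simulated Bernoulli arms with (unknown) means `true_means`."""
    K = len(true_means)
    counts = [0] * K      # n_t(a): number of times arm a has been played so far
    sums = [0.0] * K      # running sum of rewards obtained from arm a

    def pull(a):
        # Bernoulli(true_means[a]) reward
        return 1.0 if random.random() < true_means[a] else 0.0

    # Step 1: play each arm once
    for a in range(K):
        sums[a] += pull(a)
        counts[a] += 1

    # Step 2: in every remaining round, play the arm with the largest UCB index
    for t in range(K, T):
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(T) / counts[a]) for a in range(K)]
        a_t = max(range(K), key=lambda a: ucb[a])
        sums[a_t] += pull(a_t)
        counts[a_t] += 1

    return counts  # the best arm should dominate these play counts

# Example: three arms with means 0.3, 0.5, 0.7
print(run_ucb([0.3, 0.5, 0.7], T=10000))
```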

1.1 UCB Regret Analysis


Essentially, after the initial exploration, an arm a will be played in round t only if its UCB is
greater than or equal to the UCB of the optimal arm a* estimated so far:

UCB_t(a) \ge UCB_t(a^*) \qquad (1)


\bar{\mu}_t(a) + \epsilon_t(a) \ge \bar{\mu}_t(a^*) + \epsilon_t(a^*) \qquad (2)

\bar{\mu}_t(a^*) - \bar{\mu}_t(a) \le \epsilon_t(a) - \epsilon_t(a^*) \qquad (3)

Also, recall the definition of the gap,

\Delta(a) = \mu(a^*) - \mu(a) \qquad (4)

Since (on the clean event) \mu(a^*) \le \bar{\mu}_t(a^*) + \epsilon_t(a^*) and \mu(a) \ge \bar{\mu}_t(a) - \epsilon_t(a), we get

\Delta(a) \le \bar{\mu}_t(a^*) + \epsilon_t(a^*) - (\bar{\mu}_t(a) - \epsilon_t(a)) \qquad (5)

and combining this with (3),

\Delta(a) \le 2\epsilon_t(a) \qquad (6)

\Delta(a) \le 2\sqrt{\frac{2 \log T}{n_t(a)}} \qquad (7)

Note that the bound obtained in equation (7) is the same as the one we had for the Successive
Elimination algorithm, and therefore we will get similar regret bounds in this case too. Let us
quickly derive them.

1.2 Derivation for the Instance-dependent bound for UCB


\Delta(a) = \mu(a^*) - \mu(a) \le 2\epsilon_t(a) \qquad (8)

\Delta(a) \le O(\epsilon_T(a)) = O\big(\sqrt{\log T / n_T(a)}\big) \qquad (9)

n_T(a) \le O\big(\log T / \Delta(a)^2\big) \qquad (10)

Therefore the expected total regret contributed by an arm a is:

R(T; a) = \Delta(a) \cdot n_T(a) \qquad (11)

R(T; a) \le \Delta(a) \cdot O\big(\log T / \Delta(a)^2\big) \qquad (12)

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \qquad (13)
The total regret, R(T), is given as:

R(T) = \sum_{a \in A} R(T; a) \qquad (14)

R(T) = \sum_{a:\, \Delta(a) > 0} O\big(\log T / \Delta(a)\big) \qquad (15)

R(T) \le O(\log T) \sum_{a:\, \Delta(a) > 0} \frac{1}{\Delta(a)} \qquad (16)

R(T) \le O\!\left(\frac{K \log T}{\Delta}\right) \qquad (17)

where \Delta = \min_{a:\, \Delta(a) > 0} \Delta(a) is the smallest non-zero gap (the optimal arm contributes no
regret, so the sums run over the suboptimal arms only).
As we saw for Successive Elimination, this bound is tight only when ∆ is large enough; otherwise
it becomes loose, and an instance-independent bound is required.

1.3 Derivation for the Instance-Independent bound for UCB


We saw above that

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \qquad (18)

R(T) = \sum_{a \in A} R(T; a) \qquad (19)

We can split the arms according to a threshold ϵ on the gap ∆(a) and analyze the regret in two cases:

If \Delta(a) < \epsilon,

R(T; a) = \Delta(a) \cdot n_T(a) \le \epsilon\, n_T(a)

If \Delta(a) \ge \epsilon,

R(T; a) \le O\!\left(\frac{\log T}{\Delta(a)}\right) \le O\!\left(\frac{\log T}{\epsilon}\right)

Therefore, summing the first case over all arms (using \sum_a n_T(a) = T) and the second case over
at most K arms, we have

R(T) = \sum_{a \in A} R(T; a) \qquad (20)

R(T) \le O\!\left(\epsilon T + \frac{K}{\epsilon} \log T\right) \qquad (21)

As in the Successive Elimination analysis, minimizing the right-hand side over \epsilon gives
\epsilon^* = \sqrt{(K/T) \log T}, which on substituting yields

R(T) \le O\big(\sqrt{K T \log T}\big). \qquad (22)
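For completeness, here is the minimization step spelled out (a standard calculus computation that the notes leave implicit):

\frac{d}{d\epsilon}\left(\epsilon T + \frac{K}{\epsilon}\log T\right) = T - \frac{K \log T}{\epsilon^2} = 0
\;\Longrightarrow\;
\epsilon^* = \sqrt{\frac{K \log T}{T}},
\qquad
\epsilon^* T = \frac{K}{\epsilon^*}\log T = \sqrt{K T \log T}.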
Therefore UCB achieves the same regret guarantees as the Successive Elimination algorithm. For
more details, please refer to the notes of the previous lecture.

2 Maintain Probability Distributions


In all our previous discussions of the bandit problem, we tried to obtain some estimate of the
true means µ(a) of the arms. In the Explore-Then-Commit and ϵ-greedy algorithms, we estimated
the means using a point estimate, the sample mean µ̄(a). To capture more information, in the
Adaptive and UCB algorithms, we maintained confidence intervals (LCB(a), UCB(a)) for µ(a).
We can go even further: keep track of the entire probability distribution of each µ(a)! We start
with some prior distributions, and every time we play an arm a we use Bayes' rule to update the
distribution of µ(a) to its posterior distribution, which is simply our belief about the distribution
of µ(a) after we get a new observation (the reward obtained).
Let us illustrate how this works using a coin-toss game.

2.1 Coin Toss Game using Bayesian Inference [1]


Suppose we are given a biased coin with P(Head) = p, where the probability p is unknown to us,
and we wish to estimate it. A frequentist would approach this problem as follows: toss the coin n
times; if Heads appears H_n times, then p ≈ H_n / n. In fact, in the limit n → ∞, the ratio H_n / n
converges to p.
Here is the Bayesian approach to the problem. First, we have some prior idea about the parameter
p; that is, we have a prior distribution over the value of p. For example, if initially we have no idea
about p's value, we can choose it to be Uniform(0, 1), i.e. all values in [0, 1] are equally probable.
The prior could, in general, be any distribution on [0, 1] (since we know p cannot lie outside that
range), for example the Beta distribution B(α, β) with density c(α, β) θ^{α−1} (1 − θ)^{β−1}, 0 ≤ θ ≤ 1.
Here c(α, β) is a constant depending on α and β, needed to make the density integrate to 1.
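For reference, this normalizing constant has a standard closed form in terms of the Gamma function (not stated in the notes, but useful to keep in mind):

c(\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)},
\qquad
\int_0^1 c(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}\, d\theta = 1.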
Given our prior distribution, we now toss the coin once. Let the result be some r ∈ {0, 1} (1
for Heads, 0 for Tails). Bayes’ rule states that for events A, B:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{\sum_{A'} P(B \mid A')\, P(A')}

If we take event A as the distribution of p (i.e., event p = θ) and event B as our observation (result
of the coin toss), we can write this as:

P(p = \theta \mid \text{Observation } r)
= \frac{P(\text{Observation } r \mid p = \theta)\, P(p = \theta)}{P(\text{Observation } r)}
= \frac{P(\text{Observation } r \mid p = \theta)\, P(p = \theta)}{\int_0^1 P(\text{Observation } r \mid p = \theta')\, P(p = \theta')\, d\theta'}

(Here we abuse notation slightly: P(p = θ) denotes the continuous probability density function of
p at θ. Also, the summation is replaced by an integral due to the continuous nature of p's
distribution.)
We can see that P(Observation r | p = θ) is easy to find: it is θ if r = 1 (Heads) and (1 − θ) if
r = 0 (Tails), i.e. P(Observation r | p = θ) = θ^r (1 − θ)^{1−r}. Thus, if we have our prior distribution
P(p = θ), we can compute the posterior distribution given the observation, P(p = θ | Observation r).
If we have a Beta prior B(α, β) (even Uniform(0, 1) = B(1, 1)), then our posterior is:

P(p = \theta \mid \text{Observation } r)
= \frac{\theta^r (1 - \theta)^{1 - r} \cdot c(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}
       {\int_0^1 \theta'^r (1 - \theta')^{1 - r} \cdot c(\alpha, \beta)\, \theta'^{\alpha - 1} (1 - \theta')^{\beta - 1}\, d\theta'}

Note that the denominator is just a constant (this is true in Bayesian inference in general, and not
just this coin toss example), so we can write our posterior in a proportional manner (neglecting the
constants).
P(p = \theta \mid \text{Observation } r) \propto \theta^{\alpha + r - 1} (1 - \theta)^{\beta + (1 - r) - 1}
i.e., our posterior distribution is B(α + r, β + (1 − r)) (B(α + 1, β) in case of Heads, B(α, β + 1)
in case of Tails). In the case of Beta priors and Bernoulli coin tosses, each time we toss, our
distribution belief for p changes from the previous one in the above manner.
If our initial prior is B(α, β) and we toss n times, out of which there are H_n heads and T_n tails,
then our final posterior distribution of p is B(α + H_n, β + T_n). Given this final posterior distribution
of p based on our observations, we can extract whatever estimate of p suits our needs, e.g. the
posterior mean or the posterior median.
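To make the update concrete, here is a tiny Python sketch of this conjugate Beta-Bernoulli update on simulated coin tosses (the function name and simulation are our own illustrative additions, not from the notes).

```python
import random

def beta_bernoulli_posterior(p_true, n, alpha=1.0, beta=1.0):
    """Start from a Beta(alpha, beta) prior on p and update it after n simulated coin tosses."""
    for _ in range(n):
        r = 1 if random.random() < p_true else 0   # one Bernoulli(p_true) toss
        alpha += r                                 # Heads: alpha <- alpha + 1
        beta += 1 - r                              # Tails: beta  <- beta  + 1
    return alpha, beta

# Example: with a uniform prior, the posterior mean alpha / (alpha + beta) approaches p_true
a, b = beta_bernoulli_posterior(p_true=0.7, n=1000)
print(a, b, a / (a + b))
```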

2.2 Thompson Sampling Algorithm
Now, let us consider the bandit problem with Bernoulli rewards: each arm gives a reward of 1 with
some probability (fixed for that arm) and a reward of 0 with the remaining probability.

P(R(a) = 1) = \mu(a), \qquad P(R(a) = 0) = 1 - \mu(a) \qquad \text{for each arm } a \in A \ (0 \le \mu(a) \le 1)

Note: It can be seen that E[R(a)] = µ(a) · 1 + (1 − µ(a)) · 0 = µ(a), which is why we denote the
Bernoulli success probabilities by µ(a), consistent with our earlier notation for the arm means.
To begin with, we may have some initial idea about the behaviour of each arm, i.e. some prior
distribution for each parameter µ(a), which we take to be a Beta distribution B(α_0(a), β_0(a)).
Even if we have no prior information on µ(a), we can take it to be the Uniform(0, 1) distribution,
i.e. B(1, 1) (α_0(a) = 1, β_0(a) = 1).
Every time (say at round t) we choose an arm a (we will see the selection policy soon) and play it,
we get a reward R_t(a) = r. Based on the reward, we update the distribution of µ(a) to its posterior
distribution, which can be found using Bayes' rule.

P(\mu(a) = \theta \mid R_t(a) = r) \propto P(R_t(a) = r \mid \mu(a) = \theta) \cdot P(\mu(a) = \theta)

Note that P(R_t(a) = r \mid \mu(a) = \theta) = \theta^r (1 - \theta)^{1 - r} for r ∈ {0, 1} (Bernoulli distribution).
If our prior was B(α, β), then, as we saw in the coin-toss example, the posterior is

P(\mu(a) = \theta \mid R_t(a) = r) \propto \theta^r (1 - \theta)^{1 - r} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{\alpha + r - 1} (1 - \theta)^{\beta + (1 - r) - 1},

i.e. B(α + r, β + (1 − r)).
So based on the reward r, the posterior of µ(a) for the chosen arm a is updated as above.

How do we choose the arm a? The heuristic is: for each a ∈ A, play arm a with the probability that
it has the highest mean, i.e. with probability P(µ(a) = max_{a' ∈ A} µ(a')). This probability is computed
with respect to our own beliefs about the distributions of the µ(a)'s.
For example, if there were only two arms a and b, we would choose arm a with probability
P(µ(a) > µ(b)) and arm b with probability P(µ(b) > µ(a)). Here the distributions of µ(a) and µ(b)
are our most up-to-date beliefs about the two arms.
Now, these probabilities are hard to calculate analytically. Hence, we use a simple sampling
procedure that does the job. The idea is as follows: for each arm a, sample a point θ̃(a) from the
current distribution of µ(a), namely B(α(a), β(a)). Then select the arm with the highest θ̃(a). Since
each θ̃(a) is drawn from our current belief distribution for µ(a), this procedure indeed selects each
arm a with the required probability. This is known as the Thompson Sampling algorithm.
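As a small sanity check of this claim (our own illustration, not part of the notes), the two-arm probability P(µ(a) > µ(b)) can be estimated by Monte Carlo, and it equals the fraction of rounds in which the sampling rule would pick arm a under fixed beliefs:

```python
import random

def prob_a_beats_b(alpha_a, beta_a, alpha_b, beta_b, n_samples=100000):
    """Monte Carlo estimate of P(mu(a) > mu(b)) under independent Beta beliefs."""
    wins = 0
    for _ in range(n_samples):
        theta_a = random.betavariate(alpha_a, beta_a)  # sample from the belief for arm a
        theta_b = random.betavariate(alpha_b, beta_b)  # sample from the belief for arm b
        if theta_a > theta_b:
            wins += 1                                  # the sampling rule would pick arm a
    return wins / n_samples

# Example (hypothetical beliefs): with Beta(5, 3) for arm a and Beta(4, 6) for arm b,
# arm a would be chosen in roughly this fraction of rounds.
print(prob_a_beats_b(5, 3, 4, 6))
```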
Thus, the final algorithm is presented as follows:

Algorithm 2 Thompson Sampling
1: At t = 0, start with a prior Beta distribution B(α_0(a), β_0(a)) for µ(a), for each arm a ∈ A.
2: for t = 1 to T do
3:    For each arm a ∈ A, sample θ̃_t(a) ∼ B(α_{t−1}(a), β_{t−1}(a)).
4:    Play arm a(t) = arg max_a θ̃_t(a).
5:    Update the posterior of arm a(t) based on the reward r_t ∈ {0, 1} received in round t:
      α_t(a(t)) = α_{t−1}(a(t)) + r_t,  β_t(a(t)) = β_{t−1}(a(t)) + (1 − r_t)
      (the parameters of all other arms are left unchanged).
6: end for
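For concreteness, here is a minimal Python sketch of Algorithm 2 on simulated Bernoulli arms (the simulation setup and names are our own illustrative choices, not part of the notes).

```python
import random

def thompson_sampling(true_means, T):
    """Thompson Sampling for simulated Bernoulli arms, with a Beta(1, 1) prior on each mean."""
    K = len(true_means)
    alpha = [1.0] * K   # alpha_0(a) = 1
    beta = [1.0] * K    # beta_0(a) = 1
    counts = [0] * K

    for t in range(T):
        # Sample theta_t(a) ~ Beta(alpha(a), beta(a)) for every arm
        theta = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        a_t = max(range(K), key=lambda a: theta[a])

        # Play the chosen arm and observe a Bernoulli reward
        r = 1 if random.random() < true_means[a_t] else 0

        # Conjugate posterior update for the played arm only
        alpha[a_t] += r
        beta[a_t] += 1 - r
        counts[a_t] += 1

    return counts

# Example: the arm with mean 0.7 should receive most of the plays
print(thompson_sampling([0.3, 0.5, 0.7], T=10000))
```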

References
[1] S. Agrawal. Lecture 4: Introduction to Thompson Sampling. Multi-Armed Bandits and Reinforcement Learning.
[2] A. Slivkins. Introduction to Multi-Armed Bandits, volume 7. 2022.
