Lecture 2 EE675
1 Recap
1.1 Paradigms of Machine Learning and Basic Terminology
There are three paradigms of Machine Learning:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
We primarily focus on Reinforcement Learning (RL) in this course. Some common terms used in
Reinforcement Learning are described below:
• Environment : Produces the new state and the reward in response to the previous action
• Reward : Feedback given to the user corresponding to the previous action and the new state. Commonly denoted as $R_t$
• Return : Cumulative reward. It is equivalent to $\sum_{t=1}^{T} R_t$
• Policy : Methodology which evaluates the history and predicts the best possible action. A policy can be segregated into the following two categories:
– Deterministic : The output is a single action which is predicted to be the best possible choice
– Stochastic : The output is a probability distribution over the actions; under the policy, the higher the probability of an action, the better it is considered to be (see the sketch below)
The main objective of an RL algorithm is to maximize the cumulative reward over several rounds.
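To make the deterministic/stochastic distinction concrete, here is a minimal Python sketch. The three action names, the hard-coded probabilities, and the function names are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action set used only for illustration.
actions = ["left", "stay", "right"]

def deterministic_policy(history):
    """A deterministic policy outputs the single action judged best given the history."""
    return "right"

def stochastic_policy(history):
    """A stochastic policy outputs a probability for every action;
    a higher probability means the action is considered better."""
    return np.array([0.1, 0.2, 0.7])

print("deterministic choice:", deterministic_policy(history=None))

probs = stochastic_policy(history=None)
chosen = actions[rng.choice(len(actions), p=probs)]
print("stochastic sample   :", chosen)
```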
2 Basics of n-Armed Bandits
The n-Armed Bandit problem is a special case of Reinforcement Learning where there is no time-sequence dependence. In simpler terms, n-Armed Bandits are single-state scenarios: a single action leads to a reward with no intermediate states, after which the environment resets and the process repeats.
Some Common Notations :
• Arms : Refer to actions taken by the user. The $i$-th arm is denoted by $a_i$, and the total number of arms is taken to be $K$. The mean reward of arm $a_i$ is
$$\mathbb{E}[R_t \mid A_t = a_i] = \mu_i,$$
and the optimal mean reward is
$$\mu^* = \max_i \mu_i.$$
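As a concrete (hypothetical) instance of this setting, the sketch below simulates $K = 3$ Bernoulli arms; the vector `mus` plays the role of the means $\mu_i$, and `mu_star` is $\mu^*$. The specific values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical means mu_i of K = 3 Bernoulli arms.
mus = np.array([0.3, 0.5, 0.7])
K = len(mus)
mu_star = mus.max()  # mu* = max_i mu_i

def pull(i):
    """Pulling arm a_i returns a Bernoulli(mu_i) reward; there is no state to track."""
    return float(rng.random() < mus[i])

print("reward from arm a_3:", pull(2), "  optimal mean mu*:", mu_star)
```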
2.1 Algorithms
There are two main algorithms for a Multi-Armed Bandit problem.
2.1.1 Explore Then Commit Algorithm
In this methodology, we explore all arms uniformly and select the arm with the highest sample
average reward.
Algorithm:
1. Exploration phase : pull each of the $K$ arms $N$ times.
2. Commit phase : select the arm with the highest sample-average reward and play it for all remaining rounds.
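A minimal Python sketch of this procedure is given below, assuming a reward function `pull(i)` like the one in the earlier bandit sketch; the function and variable names are mine, not from the lecture:

```python
import numpy as np

def explore_then_commit(pull, K, N, T):
    """Explore-Then-Commit sketch: sample each of the K arms N times,
    then commit to the arm with the highest sample-average reward
    for the remaining T - K*N rounds."""
    sums = np.zeros(K)

    # Exploration phase: uniform sampling of every arm.
    for i in range(K):
        for _ in range(N):
            sums[i] += pull(i)

    # Commit phase: play the empirically best arm for the rest of the horizon.
    best = int(np.argmax(sums / N))
    total_reward = sums.sum()
    for _ in range(T - K * N):
        total_reward += pull(best)

    return best, total_reward
```

For example, `explore_then_commit(pull, K=3, N=100, T=10_000)` would explore for 300 rounds and commit for the remaining 9,700.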
If we take a large number of samples, the sample mean will be close to the expected mean. In our setting, the sample-average reward $\bar{\mu}(a)$ of any action $a$ should be close to its expected reward $\mu(a)$. We make use of the following inequality (Hoeffding's Inequality, assuming rewards lie in $[0, 1]$) to set a bound: if an arm $a$ has been sampled $N$ times, then
$$P\Big(|\bar{\mu}(a) - \mu(a)| \le \epsilon\Big) \ge 1 - \frac{2}{T^4}, \qquad \text{where } \epsilon = \sqrt{\frac{2\log T}{N}}.$$
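The bound can be sanity-checked numerically. The sketch below estimates the two-sided deviation probability for a single Bernoulli arm and compares it with the generic Hoeffding tail $2e^{-2N\epsilon^2}$ (which equals $2/T^4$ when $\epsilon = \sqrt{2\log T/N}$); the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary illustration parameters.
mu, N, eps, trials = 0.5, 200, 0.1, 20_000

samples = rng.random((trials, N)) < mu            # Bernoulli(mu) rewards in {0, 1}
mu_bar = samples.mean(axis=1)                     # sample mean over N pulls, per trial
empirical_tail = np.mean(np.abs(mu_bar - mu) > eps)
hoeffding_bound = 2 * np.exp(-2 * N * eps**2)

print(f"empirical P(|mu_bar - mu| > eps) = {empirical_tail:.4f}")
print(f"Hoeffding bound                  = {hoeffding_bound:.4f}")
```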
Let us take an example of K = 2 arms. This indicates one of the arms will be the optimal arm,
and the other one will be a sub-optimal arm.
Let $a^*$ denote the best arm. If we choose the sub-optimal arm $a$, it must have been because the sub-optimal arm had the higher sample mean, $\bar{\mu}(a) > \bar{\mu}(a^*)$. We also know, by using Hoeffding's Inequality for both arms:
$$\mu(a) + \epsilon \ge \bar{\mu}(a) > \bar{\mu}(a^*) \ge \mu(a^*) - \epsilon$$
Thus,
$$\mu(a^*) - \mu(a) \le 2\epsilon$$
In order to analyse the regret $R(T)$, we divide it into two parts:
• Exploration Regret : If we sample each arm $N$ times, the $N$ rounds spent on the sub-optimal arm contribute at most $N$ to the regret, since the per-round regret is at most 1 for rewards in $[0, 1]$.
• Exploitation Regret : In each of the remaining $T - 2N$ rounds, the committed arm's mean reward is within $2\epsilon$ of the optimal one, so these rounds contribute at most $2\epsilon(T - 2N)$.
Hence,
$$R(T) \le N + 2\epsilon(T - 2N) < N + 2\epsilon T$$
Substituting $\epsilon = \sqrt{2\log T/N}$,
$$R(T) < N + 2T\sqrt{\frac{2\log T}{N}}$$
For $N = T^{2/3}(\log T)^{1/3}$, this upper bound on the regret is minimized: setting the derivative of $N + 2T\sqrt{2\log T/N}$ with respect to $N$ to zero gives $N$ proportional to $T^{2/3}(\log T)^{1/3}$. Thus, we get
$$R(T) \le O\big(T^{2/3}(\log T)^{1/3}\big).$$
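As a quick numerical check (horizon value chosen only for illustration), the sketch below minimizes the bound $N + 2T\sqrt{2\log T/N}$ over $N$ by brute force and compares the minimizer with $T^{2/3}(\log T)^{1/3}$; the two agree up to a constant factor, since the exact minimizer is $T^{2/3}(2\log T)^{1/3}$:

```python
import numpy as np

T = 100_000                                      # arbitrary horizon for illustration
N = np.arange(1, T // 2, dtype=float)
bound = N + 2 * T * np.sqrt(2 * np.log(T) / N)   # regret upper bound as a function of N

print("brute-force minimizer of the bound:", N[np.argmin(bound)])
print("T^(2/3) (log T)^(1/3)             :", T ** (2 / 3) * np.log(T) ** (1 / 3))
```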
Let $A$ be the event that Hoeffding's Inequality holds for every arm and $A'$ be its complementary event. Then, the expected regret can be expressed as
$$\mathbb{E}[R(T)] = \mathbb{E}[R(T) \mid A]\, P(A) + \mathbb{E}[R(T) \mid A']\, P(A').$$
If Hoeffding's Inequality does not hold, the regret in a round is at most 1; since $P(A') = O(1/T^4)$, this contributes a term of order $1/T^4$ per round. As $P(A) \le 1$,
$$\mathbb{E}[R(T)] \le R(T) + T \cdot O(1/T^4)$$
$$\mathbb{E}[R(T)] \le O\big(T^{2/3}(\log T)^{1/3}\big)$$
2.1.2 ϵ-Greedy Algorithm
1. In every round $t$, toss a coin that lands heads with probability $\epsilon_t$.
2. If the coin lands heads, pick any arm at random; else pick the arm with the best sample average.
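A minimal Python sketch of this rule is given below, assuming a fixed exploration probability `epsilon` (a round-dependent $\epsilon_t$ works the same way) and a reward function `pull(i)` as in the earlier sketches; the names are mine, not from the lecture:

```python
import numpy as np

def epsilon_greedy(pull, K, T, epsilon=0.1, seed=0):
    """Epsilon-greedy sketch: each round, with probability epsilon ("heads")
    pick an arm uniformly at random; otherwise pick the arm with the
    best sample-average reward seen so far."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(K)
    counts = np.zeros(K)

    for t in range(T):
        if t < K:
            arm = t                                # pull each arm once to initialise averages
        elif rng.random() < epsilon:
            arm = int(rng.integers(K))             # explore: any arm at random
        else:
            arm = int(np.argmax(sums / counts))    # exploit: best sample average
        r = pull(arm)
        sums[arm] += r
        counts[arm] += 1

    return sums / counts                           # final sample-average estimates
```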
References
[1] A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning, 2022.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2020.