
EE675 - IIT Kanpur, 2022-23-II

Lecture 2: n-Armed Bandits


11.01.23
Lecturer: Prof. Subrahmanyu Swamy Peruru Scribe: Ananya Mehrotra and Shamayeeta Dass

1 Recap
1.1 Paradigms of Machine Learning and Basic Terminology
There are three paradigms of Machine Learning:

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

We primarily focus on Reinforcement Learning (RL) in this course. Some common terms used in
Reinforcement Learning are described below:

• State: Representation of the environment of the task. Commonly denoted as S_t

• Action: The agent's response to the current state. Commonly denoted as A_t

• Environment: Produces the new state and the reward in response to the previous action

• Reward: Feedback to the agent corresponding to the previous action and the new
  state. Commonly denoted as R_t

• Return: Cumulative reward, equal to Σ_{t=1}^{T} R_t

• Policy: A rule that maps the history of interactions to the next action. A policy
  can be one of two types:

  – Deterministic: The output is a single action, which is predicted to be the best
    possible choice
  – Stochastic: The output is a probability distribution over actions; under the
    policy, the higher the probability of an action, the better it is considered to be

The main objective of an RL algorithm is to maximize the cumulative reward over several rounds.
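
To make these terms concrete, here is a minimal Python sketch (illustrative values and names, not part of the lecture) of a deterministic policy, a stochastic policy, and the return computed as the cumulative reward.

import random

# Deterministic policy: maps the history to a single action.
def deterministic_policy(history):
    return "a1"  # always outputs one concrete action

# Stochastic policy: maps the history to a probability distribution over actions,
# then samples from it; higher-probability actions are considered better.
def stochastic_policy(history):
    probs = {"a1": 0.7, "a2": 0.2, "a3": 0.1}
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

# Return: cumulative reward over T rounds (illustrative reward values R_1, ..., R_T).
rewards = [1.0, 0.0, 0.5, 1.0]
print(sum(rewards))  # return = 2.5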

2 Basics of n-Armed Bandits
The n-Armed Bandit problem is a special case of Reinforcement Learning where there is no
time-sequence dependence. In simpler terms, n-Armed Bandits are single-state scenarios: a single
action leads to a reward with no intermediate states, after which the environment resets and the
process repeats.
Some Common Notations:
• Arms: Refer to actions taken by the user. The i-th arm is denoted by a_i, and the total
  number of arms is taken to be K

• Expectation of a Random Variable X: Denoted by E[X],

  E[X] = ∫ x f_X(x) dx

  where f_X(x) is the Probability Density Function of the continuous random variable X

• True means of arms: Denoted by µ_i for the i-th arm. We know

  E[R_t | A_t = a_i] = µ_i

  The best true mean is µ*, given by

  µ* = max_i µ_i

• Estimated Sample Averages: Denoted by µ̄_i for the i-th arm
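
As a concrete illustration of this notation, the sketch below (hypothetical Python, not from the lecture) sets up a K-armed Bernoulli bandit with true means µ_i and estimates the sample averages µ̄_i from round-robin pulls.

import random

class BernoulliBandit:
    """K-armed bandit: pulling arm i gives reward 1 with probability mu[i], else 0."""
    def __init__(self, mu):
        self.mu = mu              # true means mu_i (unknown to the learner)
        self.K = len(mu)

    def pull(self, i):
        return 1.0 if random.random() < self.mu[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])        # here mu* = 0.7
counts = [0] * bandit.K
sums = [0.0] * bandit.K
for t in range(999):                             # pull the arms in round-robin order
    i = t % bandit.K
    counts[i] += 1
    sums[i] += bandit.pull(i)
mu_bar = [s / c for s, c in zip(sums, counts)]   # estimated sample averages mu_bar_i
print(mu_bar)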


There can be two main objectives in a Multi-Armed Bandit problem:
• Best Arm Identification: This objective focuses on exploration. Over T rounds, the
  probability P(identified arm is the optimal arm) is maximised so as to identify the optimal arm

• Regret Minimisation: This objective minimises the Expected Regret.


The Expected Regret is

  µ* T − Σ_{t=1}^{T} µ(a_t)

and the Actual Regret is

  µ* T − Σ_{t=1}^{T} R_t

where µ(a_t) = E[R_t | a_t] and µ_i = E[R_t | a_t = a_i]
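
As a quick illustration of these two definitions, the snippet below (with hypothetical true means, chosen arms, and observed rewards) computes both quantities for a short run of T = 5 rounds.

# Hypothetical true means, known here only so that regret can be computed.
mu = {"a1": 0.3, "a2": 0.5, "a3": 0.7}
mu_star = max(mu.values())                  # best true mean, mu* = 0.7

chosen = ["a1", "a3", "a3", "a2", "a3"]     # arms a_t chosen in each round
observed = [0.0, 1.0, 0.0, 1.0, 1.0]        # rewards R_t observed in each round
T = len(chosen)

expected_regret = mu_star * T - sum(mu[a] for a in chosen)   # mu*T - sum_t mu(a_t)
actual_regret = mu_star * T - sum(observed)                  # mu*T - sum_t R_t
print(expected_regret, actual_regret)       # ~0.6 and 0.5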

2.1 Algorithms
Two basic algorithms for the Multi-Armed Bandit problem are described below.

2.1.1 Explore Then Commit Algorithm
In this algorithm, we first explore all arms uniformly and then commit to the arm with the
highest sample-average reward; a minimal code sketch follows the steps below.

Algorithm:

1. Explore each arm N times.

2. Pick the arm â with the highest sample average.

3. Play arm â in all remaining rounds.
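
A minimal Python sketch of these steps is given below. It assumes the environment exposes a pull(i) function returning a reward in [0, 1]; this is an illustration of the idea rather than a prescribed implementation.

import random

def explore_then_commit(pull, K, N, T):
    """Explore each of the K arms N times, then commit to the best sample average."""
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0

    # Phase 1: explore each arm N times (K * N rounds in total).
    for i in range(K):
        for _ in range(N):
            r = pull(i)
            counts[i] += 1
            sums[i] += r
            total += r

    # Phase 2: commit to the arm with the highest sample average for the remaining rounds.
    best = max(range(K), key=lambda i: sums[i] / counts[i])
    for _ in range(T - K * N):
        total += pull(best)
    return best, total

# Usage with a hypothetical 2-armed Bernoulli bandit (true means 0.4 and 0.6).
mu = [0.4, 0.6]
pull = lambda i: 1.0 if random.random() < mu[i] else 0.0
best, total = explore_then_commit(pull, K=2, N=100, T=10_000)
print(best, total)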

If we take a large number of samples, the sample mean will be close to the true mean. In
our setting, the sample average µ̄(a) of any action a should therefore be close to its expected
reward µ(a). We use the following inequality to make this precise.

Hoeffding's Inequality: For the sample average µ̄(a) computed from N i.i.d. bounded samples
and for all ε > 0,

  P[ |µ̄(a) − µ(a)| ≥ ε ] ≤ 2e^{−2ε^2 N}
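
As an illustrative check of this bound (not part of the lecture), the simulation below estimates the deviation probability for a hypothetical Bernoulli arm and compares it with 2e^{−2ε^2 N}.

import math
import random

mu, N, eps, trials = 0.5, 100, 0.1, 20_000   # hypothetical arm and parameters
violations = 0
for _ in range(trials):
    sample_mean = sum(1.0 if random.random() < mu else 0.0 for _ in range(N)) / N
    if abs(sample_mean - mu) >= eps:
        violations += 1

empirical = violations / trials
bound = 2 * math.exp(-2 * eps ** 2 * N)      # Hoeffding bound 2 e^{-2 eps^2 N}
print(empirical, bound)                      # the empirical probability stays below the bound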

2.1.2 Regret Minimisation in the ETC Algorithm

In Hoeffding's Inequality we assume that R_t ∈ [0, 1]. Since we want the failure probability
to be small, we take

  ε ∼ √(2 log(T) / N)

With this choice, the upper bound in the expression above is of the order O(1/T^4).

Let us take an example with K = 2 arms, so one arm is optimal and the other is sub-optimal.

Let a* denote the best arm. If we end up choosing the sub-optimal arm a, it must be because
a had the higher sample mean, i.e. µ̄(a) > µ̄(a*). Assuming Hoeffding's Inequality holds for
both arms,

  µ(a) + ε ≥ µ̄(a) > µ̄(a*) ≥ µ(a*) − ε

Thus,

  µ(a*) − µ(a) ≤ 2ε
To analyse the regret R(T), we divide it into two parts:

• Exploration Regret: Each arm is sampled N times, and since the per-round regret of the
  sub-optimal arm is at most 1, the exploration phase contributes a regret term of order N.

• Exploitation Regret: In the remaining T − 2N rounds, picking the sub-optimal arm incurs a
  regret of at most 2ε per round.

Hence,

  R(T) ≤ N + 2ε(T − 2N) < N + 2εT

  R(T) < N + 2T √(2 log T / N)

Choosing N = T^{2/3} (log T)^{1/3} balances the two terms and minimises this upper bound on
the regret. Thus, we get

  R(T) ≤ O(T^{2/3} (log T)^{1/3})
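
As a numerical sanity check of this choice of N (an illustration only), the snippet below evaluates both regret terms for a hypothetical horizon T and shows that they are of the same order, differing only by a constant factor.

import math

T = 100_000                                        # hypothetical horizon
N = int(T ** (2 / 3) * math.log(T) ** (1 / 3))     # N = T^{2/3} (log T)^{1/3}

exploration_term = N                               # regret from the exploration phase
exploitation_term = 2 * T * math.sqrt(2 * math.log(T) / N)   # 2*eps*T with eps = sqrt(2 log T / N)
print(N, exploration_term, exploitation_term)      # both terms are ~ T^{2/3} (log T)^{1/3}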

Let A be the event that Hoeffding's Inequality holds for every arm, and let A′ be its
complementary event. Then the Expected Regret can be expressed as

  E[R(T)] = E[R(T) | A] P(A) + E[R(T) | A′] P(A′)

If Hoeffding's Inequality does not hold, the per-round regret is still at most 1, and this
happens with probability of order 1/T^4, so the event A′ contributes at most T · O(1/T^4).
Since P(A) ≤ 1,

  E[R(T)] ≤ R(T) + T · O(1/T^4)

  E[R(T)] ≤ O(T^{2/3} (log T)^{1/3})

2.1.3 Epsilon-greedy Algorithm


Algorithm

1. Toss a coin that lands heads with probability ε.

2. If the coin lands heads, pick an arm uniformly at random; otherwise, pick the arm with the
   best sample average.

If the exploration probability is ε_t ≈ t^{−1/3}, then

  E[R(t)] ≤ t^{2/3} · O((K log t)^{1/3})
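
A minimal Python sketch of this rule, using the decaying schedule ε_t ≈ t^{−1/3} stated above and the same hypothetical pull(i) interface used earlier, is given below as an illustration, not a prescribed implementation.

import random

def epsilon_greedy(pull, K, T):
    """Epsilon-greedy with a decaying exploration probability eps_t ~ t^(-1/3)."""
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        eps_t = t ** (-1 / 3)
        if random.random() < eps_t or 0 in counts:   # explore (or some arm is still untried)
            i = random.randrange(K)
        else:                                        # exploit the best sample average so far
            i = max(range(K), key=lambda j: sums[j] / counts[j])
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        total += r
    return total

# Usage with a hypothetical 3-armed Bernoulli bandit (true means 0.3, 0.5, 0.7).
mu = [0.3, 0.5, 0.7]
pull = lambda i: 1.0 if random.random() < mu[i] else 0.0
print(epsilon_greedy(pull, K=3, T=10_000))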

References
[1] A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning, 2022.

[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2020.

