A Brief Introduction To Reinforcement Learning
See also Rich Sutton's FAQ on RL.

Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. For example, consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs.
We can formalise the RL problem as follows. The environment is modelled as a stochastic finite state machine with inputs (actions sent from the agent) and outputs (observations and rewards sent to the agent). (Notice that what the agent sees depends on what it does, which reflects the fact that perception is an active process.) The agent is also modelled as a stochastic FSM with inputs (observations/rewards sent from the environment) and outputs (actions sent to the environment).
The agent's goal is to find a policy and state-update function so as to maximize the expected sum of discounted rewards

E[ R_0 + gamma R_1 + gamma^2 R_2 + ... ] = E[ sum_{t=0}^infty gamma^t R_t ]

where 0 <= gamma <= 1 is a discount factor, which models the fact that future rewards are worth less than immediate rewards.
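To make the interaction loop and the discounted return concrete, here is a minimal Python sketch; the Environment/Agent interface (reset, step, act) is an illustrative assumption, not a reference to any particular library.

import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute R_0 + gamma*R_1 + gamma^2*R_2 + ... for one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def run_episode(env, agent, horizon=100):
    """Generic agent-environment loop: observations and rewards flow in,
    actions flow out (method names here are assumptions)."""
    obs, reward, rewards = env.reset(), 0.0, []
    for _ in range(horizon):
        action = agent.act(obs, reward)        # the agent's (possibly stochastic) policy
        obs, reward, done = env.step(action)   # the environment's stochastic transition
        rewards.append(reward)
        if done:
            break
    return discounted_return(rewards)

For example, discounted_return([1, 1, 1], gamma=0.5) evaluates to 1 + 0.5 + 0.25 = 1.75.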
In the special case that the agent's observation equals the world state, Y(t)=X(t), we say the world is fully observable, and the model becomes a Markov Decision Process (MDP). In this case, the agent does not need any internal state (memory) to act optimally. In the more realistic case, where the agent only gets to see part of the world state, the model is called a Partially Observable MDP (POMDP), pronounced "pom-dp". We give a brief introduction to these topics below.
MDPs
Reinforcement Learning
POMDPs
First-order models
Recommended reading
MDPs
A Markov Decision Process (MDP) is just like a Markov Chain, except the transition matrix depends on the action taken by
the decision maker (agent) at each time step. The agent receives a reward, which depends on the action and the state. The
goal is to find a function, called a policy, which specifies which action to take in each state, so as to maximize some function
(e.g., the mean or expected discounted sum) of the sequence of rewards. One can formalize this in terms of Bellman's equation, which can be solved iteratively using policy iteration. The unique fixed point of this equation is the optimal value function, from which the optimal policy can be read off.
More precisely, let us define the transition matrix and reward functions as follows.
T(s,a,s') = Pr[S(t+1)=s' | S(t)=s, A(t)=a]
R(s,a,s') = E[R(t+1) | S(t)=s, A(t)=a, S(t+1)=s']
(We are assuming states, actions and time are discrete. Continuous MDPs can also be defined, but are usually solved by
discretization.)
For a fixed policy p, the value function V(s) (the expected discounted return starting from state s and following p) satisfies Bellman's equation

V(s) = R(s, p(s)) + gamma sum_{s'} T(s, p(s), s') V(s')

where R(s,a) = sum_{s'} T(s,a,s') R(s,a,s') is the expected immediate reward. For a tabular (non-parametric) representation of the V/Q/T/R functions, this can be rewritten in matrix-vector form as V = R + gamma T V. Solving these n simultaneous linear equations is called value determination (n is the number of states).
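As a concrete illustration, value determination amounts to solving the linear system (I - gamma T_pi) V = R_pi. A minimal numpy sketch, where the array layouts (and the use of the expected reward R(s,a)) are assumptions made for illustration:

import numpy as np

def value_determination(T, R, policy, gamma=0.9):
    """Solve V = R_pi + gamma T_pi V for a fixed deterministic policy.
    T: (n, n_actions, n) array of transition probabilities T(s,a,s')
    R: (n, n_actions) array of expected immediate rewards R(s,a)
    policy: length-n integer array, policy[s] = action taken in state s
    """
    n = T.shape[0]
    T_pi = T[np.arange(n), policy]     # (n, n): row s is T(s, policy[s], .)
    R_pi = R[np.arange(n), policy]     # (n,):   R(s, policy[s])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)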
Given V, we can define the value of performing action a in state s as

Q(s,a) = R(s,a) + gamma sum_{s'} T(s,a,s') V(s')

If p(s) = argmax_a Q(s,a) for every state s, the current policy is optimal. If not, we can set p(s) to argmax_a Q(s,a), re-evaluate V (and hence Q), and repeat. This is called policy iteration, and it is guaranteed to converge to the unique optimal policy. (Here is some Matlab software for solving MDPs using policy iteration.) The best known theoretical upper bound on the number of iterations needed by policy iteration is exponential in n (Mansour and Singh, UAI 99), but in practice the number of steps is O(n). By formulating the problem as a linear program, one can show that the optimal policy can be found in polynomial time.
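Putting value determination and greedy policy improvement together gives a minimal sketch of policy iteration (this is only an illustrative tabular implementation, not the Matlab toolbox mentioned above); it reuses value_determination from the previous snippet:

import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Alternate value determination and greedy improvement until the policy is stable."""
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = value_determination(T, R, policy, gamma)
        Q = R + gamma * T @ V              # Q(s,a) = R(s,a) + gamma sum_s' T(s,a,s') V(s')
        new_policy = Q.argmax(axis=1)      # greedy policy improvement
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy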
For AI applications, the state is usually defined in terms of state variables. If there are k binary variables, there are n = 2^k
states. Typically, there are some independencies between these variables, so that the T/R functions (and hopefully the V/Q
functions, too!) are structured; this can be represented using a Dynamic Bayesian Network (DBN), which is like a
probabilistic version of a STRIPS rule used in classical AI planning. For details, see
"Decision Theoretic Planning: Structural Assumptions and Computational Leverage". Craig Boutilier, Thomas Dean
and Steve Hanks
Reinforcement Learning
If we know the model (i.e., the transition and reward functions), we can solve for the optimal policy in about n^2 time using
policy iteration. Unfortunately, if the state is composed of k binary state variables, then n = 2^k, so this is way too slow. In
addition, what do we do if we don't know the model?
Reinforcement Learning (RL) solves both problems: we can approximately solve an MDP by replacing the sum over all
states with a Monte Carlo approximation. In other words, we only update the V/Q functions (using temporal difference (TD)
methods) for states that are actually visited while acting in the world. If we keep track of the transitions made and the
rewards received, we can also estimate the model as we go, and then "simulate" the effects of actions without having to
actually perform them.
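For instance, the model can be estimated online from counts of observed transitions and rewards, which can then be used to simulate extra experience (roughly the idea behind Sutton's Dyna architecture); a minimal sketch with illustrative names:

from collections import defaultdict

class EmpiricalModel:
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from experience."""
    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s,a) -> summed reward
        self.visits = defaultdict(int)                            # (s,a) -> visit count

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_prob(self, s, a, s_next):
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def expected_reward(self, s, a):
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0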
There are three fundamental problems that RL must tackle: the exploration-exploitation tradeoff, the problem of delayed
reward (credit assignment), and the need to generalize. We will discuss each in turn.
We mentioned that in RL, the agent must make trajectories through the state space to gather statistics. The exploration-
exploitation tradeoff is the following: should we explore new (and potentially more rewarding) states, or stick with what we
know to be good (exploit existing knowledge)? This problem has been extensively studied in the case of k-armed bandits,
which are MDPs with a single state and k actions. The goal is to choose the optimal action to perform in that state, which is
analogous to deciding which of the k levers to pull in a k-armed bandit (slot machine). There are some theoretical results
(e.g., Gittins' indices), but they do not generalise to the multi-state case.
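A simple heuristic that works reasonably well in practice (and is much simpler than computing Gittins indices) is epsilon-greedy action selection: with small probability pick a random arm to explore, otherwise exploit the arm with the highest estimated value. A minimal sketch, where pull_arm is an assumed callback returning a stochastic reward:

import numpy as np

def epsilon_greedy_bandit(pull_arm, k, n_steps=1000, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng()
    values = np.zeros(k)    # running estimate of each arm's mean reward
    counts = np.zeros(k)
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))      # explore a random arm
        else:
            a = int(values.argmax())      # exploit the current best arm
        r = pull_arm(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental mean update
    return values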
The problem of delayed reward is well-illustrated by games such as chess or backgammon. The player (agent) makes many
moves, and only gets rewarded or punished at the end of the game. Which move in that long sequence was responsible for
the win or loss? This is called the credit assignment problem. We can solve it by essentially doing stochastic gradient
descent on Bellman's equation, backpropagating the reward signal through the trajectory, and averaging over many trials.
This is called temporal difference learning.
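Here is a minimal sketch of tabular Q-learning, one standard temporal difference method, which propagates the (possibly delayed) reward signal back through the states visited on each trajectory (the env.reset/env.step interface is an assumption):

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q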
It is fundamentally impossible to learn the value of a state before a reward signal has been received. In large state spaces,
random exploration might take a long time to reach a rewarding state. The only solution is to define higher-level actions,
which can reach the goal more quickly. A canonical example is travel: to get from Berkeley to San Francisco, I first plan at a
high level (I decide to drive, say), then at a lower level (I walk to my car), then at a still lower level (how to move my feet),
etc. Automatically learning action hierarchies (temporal abstraction) is currently a very active research area.
The last problem we will discuss is generalization: given that we can only visit a subset of the (exponentially many) states, how can we know the value of all the states? The most common approach is to approximate the Q/V functions using, say, a neural net. A more promising approach (in my opinion) uses the factored structure of the model to allow safe state abstraction (Dietterich, NIPS'99).
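For instance, instead of a table, Q can be represented by a parametric function of state features and trained with a TD-style semi-gradient update; here is a minimal sketch of a linear approximator (the feature map phi is an assumed input, and a linear model is only one simple choice of approximator):

import numpy as np

class LinearQ:
    """Q(s,a) ~= w[a] . phi(s): one weight vector per action over state features."""
    def __init__(self, n_features, n_actions):
        self.w = np.zeros((n_actions, n_features))

    def value(self, phi_s, a):
        return self.w[a] @ phi_s

    def td_update(self, phi_s, a, r, phi_next, done, alpha=0.01, gamma=0.9):
        # semi-gradient Q-learning step on the weights for action a
        target = r + (0.0 if done else gamma * (self.w @ phi_next).max())
        self.w[a] += alpha * (target - self.value(phi_s, a)) * phi_s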
RL is a huge and active subject, and you are recommended to read the references below for more information. There have
been a few successful applications of RL. The most famous is probably Tesauro's TD-gammon, which learned to play
backgammon extremely well, using a neural network function approximator and TD(lambda). Other applications have
included controlling robot arms and various scheduling problems. However, these are still very simple problems by the
standards of AI, and required a lot of human engineering; we are a far cry from the dream of fully autonomous learning
agents.
POMDPs
MDPs assume that the complete state of the world is visible to the agent. This is clearly highly unrealistic (think of a robot
in a room with enclosing walls: it cannot see the state of the world outside of the room). POMDPs model the information
available to the agent by specifying a function from the hidden state to the observables, just as in an HMM. The goal now is
to find a mapping from observations (not states) to actions. Unfortunately, the observations are not Markov (because two
different states might look the same), which invalidates all of the MDP solution techniques. The optimal solution to this
problem is to construct a belief state MDP, where a belief state is a probability distribution over states. For details on this
approach, see
"Planning and Acting in Partially Observable Stochastic Domains". Leslie Pack Kaelbling, Michael L. Littman and
Anthony R. Cassandra
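Concretely, the belief state is updated by Bayes' rule after each action a and observation o: b'(s') is proportional to Pr(o | s', a) * sum_s T(s,a,s') b(s). A minimal numpy sketch (the array layouts are assumptions):

import numpy as np

def belief_update(b, a, o, T, O):
    """b: current belief over the n hidden states (length-n array summing to 1)
    T: T[s, a, s'] transition probabilities
    O: O[s', a, o] observation probabilities Pr(o | s', a)
    Returns the posterior belief after taking action a and observing o."""
    predicted = b @ T[:, a, :]        # prediction step: sum_s b(s) T(s,a,s')
    b_new = O[:, a, o] * predicted    # correction step: weight by observation likelihood
    return b_new / b_new.sum()        # normalize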
Control theory is concerned with solving POMDPs, but in practice, control theorists make strong assumptions about the
nature of the model (typically linear-Gaussian) and reward function (typically negative quadratic loss) in order to be able to
make theoretical guarantees of optimality, etc. By contrast, optimally solving a generic discrete POMDP is wildly
intractable. Finding tractable special cases (e.g., structured models) is a hot research topic.
First-order models
A major limitation of (PO)MDPs is that they model the world in terms of a fixed-size set of state variables, each of which
can take on specific values, say true and false, or -3.5. These are called propositional models. It would seem more natural to
use a first-order model, which allows for (a variable number of) objects and relations. However, this is a completely open
research problem. Leslie Kaelbling is doing interesting work in this area (see her "reifying robots" page).
"Reinforcement Learning: A Survey". Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore
JAIR (Journal of AI Research), Volume 4, 1996. Postscript (40 pages) or HTML version
"Planning and Acting in Partially Observable Stochastic Domains". Leslie Pack Kaelbling, Michael L. Littman and
Anthony R. Cassandra