
Reinforcement Learning – Part 2
Markov Decision Process (MDP)
• a way to formalize sequential decision making - the basis for structuring problems that are solved with reinforcement learning
• in RL, a policy maps the current state to an action; the agent continuously interacts with the environment, takes actions, and receives rewards
• used to formalize RL problems when the environment is fully observable
• the Markov property states that the future is independent of the past, given the present
• given the present state, the next state can be predicted without needing the previous states
[Figure: agent-environment loop - the Agent sends an Action to the Environment, which returns a Reward and the next State]
Markov Decision Process (MDP)
• the agent aims to maximize not just the immediate reward at each state, but the cumulative reward it receives over time
• the agent interacts with the environment and takes an action while in one state to reach the next state - the maximum reward returned depends on the action
• the process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, which creates a trajectory showing the sequence of states, actions, and rewards
• Parameters
  • Set of models
  • Set of all possible actions - A
  • Set of states - S
  • Reward - R
  • Policy - π
  • Value - V
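The parameters above can be written out directly in code. The following is a minimal sketch of a hand-made MDP; the state names, actions, transition model, and reward values are invented for illustration and are not taken from the slides.

# A tiny hand-written MDP: states S, actions A, rewards R, a transition model P,
# and a deterministic policy. All names and numbers are illustrative.
S = ["s0", "s1", "s2"]                 # set of states S
A = ["left", "right"]                  # set of all possible actions A

P = {                                  # transition model: (state, action) -> next state
    ("s0", "left"): "s0", ("s0", "right"): "s1",
    ("s1", "left"): "s0", ("s1", "right"): "s2",
}

R = {                                  # reward R for taking an action in a state
    ("s0", "left"): 0.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 1.0,   # reaching s2 is rewarded
}

policy = {"s0": "right", "s1": "right"}   # policy π: state -> action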
Markov Decision Process (MDP)
• example: find the shortest path between node A and node D
• the number on each edge denotes the reward associated with that path
• A, B, C and D denote the nodes
• travelling from node A to node B is an action, the reward is the value on each path, and the policy is the path taken
[Figure: graph with nodes A, B, C and D and edge rewards 15, 5, -20, 10, 0 and 25]
Markov Decision Process (MDP)
• the process maximizes the output based on the reward at each step and traverses the path with the highest reward
[Figure: path taken by the MDP from node A to node D along the highest-reward edges]
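The exact edge layout of the figure is not fully recoverable here, so the graph below is a hypothetical stand-in with made-up rewards. The sketch only illustrates the idea: enumerate the paths from A to D and keep the one with the highest cumulative reward.

# Hypothetical graph: the edge rewards are illustrative, not read off the slide's figure.
edges = {
    "A": {"B": 15, "C": 5},
    "B": {"C": 10, "D": -20},
    "C": {"D": 25},
    "D": {},
}

def best_path(node, goal):
    """Return (best cumulative reward, path) from node to goal by exhaustive search."""
    if node == goal:
        return 0, [goal]
    best_reward, best_route = float("-inf"), None
    for nxt, r in edges[node].items():
        reward, route = best_path(nxt, goal)
        if route is not None and r + reward > best_reward:
            best_reward, best_route = r + reward, [node] + route
    return best_reward, best_route

print(best_path("A", "D"))   # (50, ['A', 'B', 'C', 'D']) with the rewards above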


MDP Notation
• an MDP has a set of states S, a set of actions A, and a set of rewards R
• assume that each of these sets has a finite number of elements
• at each time step t, the agent receives some representation of the environment's state, S_t ∈ S
• based on this state, the agent selects an action A_t ∈ A, which gives the state-action pair (S_t, A_t)
• time is then incremented to the next time step t+1, the environment transitions to a new state S_{t+1} ∈ S, and the agent receives a numerical reward R_{t+1} ∈ R for the action taken from state S_t
• the process of receiving a reward can be seen as an arbitrary function f that maps state-action pairs to rewards, so that at each time t, f(S_t, A_t) = R_{t+1}
• the trajectory representing the sequential process of selecting an action from a state, transitioning to a new state, and receiving a reward can be represented as
  S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, …
• Goal: learn to choose actions that maximize the cumulative reward
[Figure: agent-environment loop - the Agent in State S_t takes Action A_t; the Environment returns Reward R_{t+1} and next State S_{t+1}]
MDP Notation
• From the diagram
  • At time t, the environment is in state S_t
  • The agent observes the current state S_t and selects action A_t
  • The environment transitions to state S_{t+1} and grants the agent reward R_{t+1}
  • This process then starts over for the next time step, t+1
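A minimal sketch of this loop in code, reusing the toy MDP defined earlier (the episode cap and the choice of s2 as a terminal state are arbitrary), records the trajectory S_0, A_0, R_1, S_1, A_1, R_2, …

# One episode of the agent-environment loop, using the toy S, P, R and policy above.
state = "s0"
trajectory = []

for t in range(10):                      # cap the episode length
    if state == "s2":                    # treat s2 as a terminal state
        break
    action = policy[state]               # agent observes S_t and selects A_t
    reward = R[(state, action)]          # environment grants reward R_{t+1}
    next_state = P[(state, action)]      # environment transitions to S_{t+1}
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)   # [('s0', 'right', 0.0), ('s1', 'right', 1.0)]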
Learning in RL
• agents take random decisions in their environment and learn to select the right one out of many to achieve their goal and play at a super-human level
• Policy network
  • a network which learns to give a definite output for a particular input of the game
• Value network
  • assigns a value/score to the state of the game by calculating an expected cumulative score for the current state s
  • every state goes through the value network
  • the states which get more reward obviously get more value in the network
Algorithms for control learning
• Criterion of optimality:
  • the agent's action selection is modeled as a map called the policy
• Brute force:
  • choose the policy with the largest expected return
• Value function:
  • attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy
• Monte Carlo methods:
  • mimic policy iteration
• Temporal difference methods:
  • address the inefficiency of Monte Carlo estimates by allowing the procedure to change the policy (at some or all states) before the values settle
• Direct policy search:
  • search directly in (some subset of) the policy space
Q-Learning
• a value-based method of supplying information to inform which action an agent should take
• finds the optimal policy by learning the optimal Q-values for each state-action pair
• Steps:
  1. start with the initialization of the Q-table
  2. the agent selects an action and performs it
  3. the reward for the action is measured
  4. the Q-table is updated
• the Q-table is a table or matrix created during Q-learning
• the agent's goal is to maximize the value of Q by finding the best action to take at a particular state
• the Q stands for quality, which indicates the quality of the action taken by the agent
Q-Learning
Now we'll add a similar matrix, "Q". The rows of matrix Q represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).

The agent starts out knowing nothing, so the matrix Q is initialized to zero.

The transition rule of Q-learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

A value assigned to a specific element of matrix Q is equal to the sum of the corresponding value in matrix R and the learning parameter Gamma multiplied by the maximum value of Q for all possible actions in the next state.
The Q-Learning algorithm goes as follows:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
     Select a random initial state.
     Do While the goal state hasn't been reached.
       Select one among all possible actions for the current state.
       Using this possible action, consider going to the next state.
       Get maximum Q value for this next state based on all possible actions.
       Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
       Set the next state as the current state.
     End Do
   End For
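The loop above can be written compactly in Python. The reward matrix R below is a made-up three-state example rather than the graph from the earlier slides, and gamma and the episode count are arbitrary choices.

import numpy as np

# Hypothetical reward matrix R: rows = current state, columns = action (next state).
# -1 marks an impossible transition; 100 is the reward for reaching the goal state 2.
R = np.array([[ -1,   0,  -1],
              [  0,  -1, 100],
              [ -1,   0, 100]], dtype=float)

gamma = 0.8
goal = 2
Q = np.zeros_like(R)                              # 2. initialize matrix Q to zero

for episode in range(1000):                       # 3. for each episode
    state = np.random.randint(R.shape[0])         # select a random initial state
    while state != goal:                          # until the goal state is reached
        actions = np.where(R[state] >= 0)[0]      # possible actions for the current state
        action = np.random.choice(actions)
        next_state = action                       # here the action index is the next state
        # Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()
        state = next_state

print((Q / Q.max() * 100).round())                # normalized Q-table after training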
Temporal difference learning
• refers to a class of model-free reinforcement learning methods which
learn by bootstrapping from the current estimate of the value function
• sample from the environment, like Monte Carlo methods, and perform
updates based on current estimates, like dynamic programming methods
• adjust predictions to match later, more accurate, predictions about the
future before the final outcome is known
• Example:
• Suppose you wish to predict the weather for Saturday, and you have some model
that predicts Saturday's weather, given the weather of each day in the week. In the
standard case, you would wait until Saturday and then adjust all your models.
However, when it is, for example, Friday, you should have a pretty good idea of what
the weather would be on Saturday – and thus be able to change, say, Saturday's
model before Saturday arrives
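As a rough illustration (not from the slides), a single TD(0) update nudges an earlier value estimate toward the one-step bootstrapped target built from the later, more accurate estimate. The states, reward, step size and discount below are arbitrary.

# TD(0) update for state values: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
V = {"friday": 0.0, "saturday": 0.5}       # current value estimates (made-up numbers)
alpha, gamma = 0.1, 1.0

# one observed transition: from "friday" we move to "saturday" and receive reward 0.0
s, r, s_next = "friday", 0.0, "saturday"
td_target = r + gamma * V[s_next]          # the later, more accurate prediction
V[s] += alpha * (td_target - V[s])         # adjust the earlier prediction toward it

print(V["friday"])   # 0.05 after this single update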
Partially Observable States
• a complete state of the world being visible to the agent is highly unrealistic
• often, the dynamics of an agent's environment and observations are unknown
• a Partially Observable MDP (POMDP) models the information available to the agent by specifying a function from the hidden state to the observables
• the goal now is to find a mapping from observations (not states) to actions
• the agent must maintain a sensor model (the probability distribution of different observations given the underlying state) and the underlying MDP
• a POMDP's policy is a mapping from the observations (or belief states) to the actions
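A small sketch of the belief-state idea: given a sensor model P(observation | hidden state), the agent keeps a probability distribution over hidden states and updates it after each observation with Bayes' rule. The two hidden states, the observation name, and all the probabilities below are invented for illustration (the transition step is omitted for brevity).

# Belief update for a two-state POMDP: b'(s) is proportional to P(obs | s) * b(s)
belief = {"door_open": 0.5, "door_closed": 0.5}     # prior over hidden states

sensor = {                                          # hypothetical sensor model P(obs | state)
    ("sees_open", "door_open"):   0.8,
    ("sees_open", "door_closed"): 0.2,
}

obs = "sees_open"
unnormalized = {s: sensor[(obs, s)] * b for s, b in belief.items()}
total = sum(unnormalized.values())
belief = {s: p / total for s, p in unnormalized.items()}

print(belief)   # {'door_open': 0.8, 'door_closed': 0.2}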
Policy Search
• Reinforcement Learning using a neural network policy
• The policy can be any algorithm you can think of, and it does not even have to be deterministic.
[Figure: four points in policy space and the agent's corresponding behavior]
• Genetic algorithms
  • randomly create a first generation of 100 policies and try them out, then "kill" the 80 worst policies
• Optimization techniques
  • evaluate the gradients of the rewards with regard to the policy parameters
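A toy sketch of the genetic-algorithm idea above. The policy representation (a single number), the scoring function, and the mutation noise are all placeholders; evaluate() stands in for actually running each policy in the environment.

import random

def evaluate(policy):
    """Stand-in for the total reward obtained by running this policy."""
    return -abs(policy - 0.5) + random.gauss(0, 0.05)

population = [random.random() for _ in range(100)]      # first generation of 100 policies
ranked = sorted(population, key=evaluate, reverse=True)
survivors = ranked[:20]                                 # "kill" the 80 worst policies

# next generation: each survivor produces 5 slightly mutated offspring
next_gen = [min(1.0, max(0.0, p + random.gauss(0, 0.1)))
            for p in survivors for _ in range(5)]
print(len(next_gen))   # 100 policies again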
CartPole – OpenAI Gym
• A pole is attached by an un-
actuated joint to a cart, which
moves along a frictionless track
• The system is controlled by applying
a force of +1 or -1 to the cart
• The pendulum starts upright, and
the goal is to prevent it from falling
over
• A reward of +1 is provided for every
timestep that the pole remains
upright
• The episode ends when the pole is
more than 15 degrees from vertical,
or the cart moves more than 2.4
units from the center
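A minimal interaction loop for this environment with the OpenAI Gym API might look like the sketch below; the exact return values of reset() and step() depend on the Gym version (this follows the classic pre-0.26 interface), and the random action choice is just a placeholder policy.

import gym

env = gym.make("CartPole-v0")
obs = env.reset()                        # cart position, cart velocity, pole angle, pole angular velocity
total_reward = 0.0

done = False
while not done:
    action = env.action_space.sample()   # random policy: push left (0) or right (1)
    obs, reward, done, info = env.step(action)
    total_reward += reward               # +1 for every timestep the pole stays up

print(total_reward)
env.close()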
Neural Network Policy
• This neural network will take an observation as input, and it will output the action to be executed.
• More precisely, it will estimate a probability for each action, and then we will select an action randomly according to the estimated probabilities
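One possible way to build such a policy network for CartPole is sketched below with Keras; the layer sizes and activations are one reasonable configuration, not the only one, and the single sigmoid output is interpreted as the probability of pushing left.

import numpy as np
import tensorflow as tf

# Policy network: 4 observation inputs -> probability of action 0 (push left)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="elu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

obs = np.array([[0.01, 0.02, 0.03, 0.04]])     # example observation
p_left = model.predict(obs, verbose=0)[0, 0]

# select an action randomly according to the estimated probability
action = 0 if np.random.rand() < p_left else 1
print(p_left, action)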
Evaluating Actions: The Credit Assignment Problem
• If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability and the target probability
• However, in Reinforcement Learning the only guidance the agent gets is through rewards.
• For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad?
• All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible.
• This is called the credit assignment problem:
  • when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for?
Discounted Rewards
• to tackle the credit assignment problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, applying a discount factor at each step
• the discount factor (gamma, between 0 and 1) determines how much future rewards count compared to immediate ones
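A short sketch of computing discounted returns for a sequence of rewards; the reward values and discount factor below are arbitrary examples.

# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# Computed right-to-left so each step reuses the return of the step after it.
def discounted_returns(rewards, gamma=0.95):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([10, 0, -50], gamma=0.8))   # [-22.0, -40.0, -50.0]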
Supervised vs
Unsupervised vs
Reinforcement Learning

• Supervised Learning
• Set of labeled examples provided by ‘external supervisor’
• Not applicable to learning from interaction
• Generally complicated to obtain examples of all situations
• Unsupervised Learning
• Usually tries to learn structure / data representation
• Does not exactly match RL: RL wants to maximize a reward
• Reinforcement Learning
  • works with unlabeled data and possible hidden structure, but does not rely on that structure - the aim is to maximize a reward
Supervised Learning vs Reinforcement Learning
• Supervised Learning
  • Step 1:
    • Teacher: Does picture 1 show a car or a flower?
    • Learner: A flower
    • Teacher: No, it's a car
  • Step 2:
    • Teacher: Does picture 2 show a car or a flower?
    • Learner: A car
    • Teacher: Yes, it's a car
  • Step 3: …
• Reinforcement Learning
  • Step 1:
    • World: You are in state 9. Choose action A or C
    • Learner: Action A
    • World: Your reward is 100
  • Step 2:
    • World: You are in state 32. Choose action B or E
    • Learner: Action B
    • World: Your reward is 50
  • Step 3: …
Types of Reinforcement Learning
• Search-based: evolution directly on a policy
  • E.g., genetic algorithms
• Model-based: build a model of the environment
  • Then you can use dynamic programming
  • Memory-intensive learning method
• Model-free: learn a policy without any model
  • Temporal difference methods (TD)
  • Requires limited episodic memory (though more helps)
Types of Model-free Reinforcement Learning
• Actor-critic learning
  • The TD version of Policy Iteration
• Q-learning
  • The TD version of Value Iteration
  • This is the most widely used RL algorithm
Challenges in reinforcement learning
• Feature / reward design can be very involved
  • Online learning (no time for tuning)
  • Continuous features (handled by tiling)
  • Delayed rewards (handled by shaping)
• Parameters can have large effects on learning speed
  • Tuning has just one effect: slowing it down
• Realistic environments can have partial observability
• Realistic environments can be non-stationary
• There may be multiple agents

