
Reinforcement Learning – Part 2
Markov Decision Process (MDP)
• a way to formalize sequential decision making - the basis for structuring problems that are solved with reinforcement learning
• in RL, a policy maps the current state to an action; the agent continuously interacts with the environment, takes actions, and receives rewards
• used to formalize RL problems when the environment is fully observable
• the Markov property states that the future is independent of the past, given the present
• given the present state, the next state can be predicted without needing the previous states
[Figure: agent-environment loop - the Agent sends an Action to the Environment, which returns a Reward and the next State]
Markov Decision Process (MDP)
• the agent aims to maximize not just the immediate reward at each state, but the cumulative reward it receives over time
• the agent interacts with the environment and takes an action while in one state to reach the next state - the maximum reward returned depends on the action
• the process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, which creates a trajectory showing the sequence of states, actions, and rewards
• Parameters
  • Set of models
  • Set of all possible actions - A
  • Set of states - S
  • Reward - R
  • Policy - π
  • Value - V
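The parameters above can be written out directly in code. The following is a minimal sketch of a hand-made MDP; the state names, actions, transition model, and reward values are invented for illustration and are not taken from the slides.

# A tiny hand-written MDP: states S, actions A, rewards R, a transition model P,
# and a deterministic policy. All names and numbers are illustrative.
S = ["s0", "s1", "s2"]                 # set of states S
A = ["left", "right"]                  # set of all possible actions A

P = {                                  # transition model: (state, action) -> next state
    ("s0", "left"): "s0", ("s0", "right"): "s1",
    ("s1", "left"): "s0", ("s1", "right"): "s2",
}

R = {                                  # reward R for taking an action in a state
    ("s0", "left"): 0.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 1.0,   # reaching s2 is rewarded
}

policy = {"s0": "right", "s1": "right"}   # policy π: state -> action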
Markov Decision Process (MDP)
• example: find the shortest path between node A and node D
• the number on each edge denotes the reward associated with that path
• A, B, C and D denote the nodes
• travelling from node A to node B is an action, the reward is the value on each path, and the policy is the path taken
[Figure: graph with nodes A, B, C and D and edge rewards 15, 5, -20, 10, 0 and 25]
Markov Decision Process (MDP)
• the process maximizes the output based on the reward at each step and traverses the path with the highest reward
[Figure: path taken by the MDP from node A to node D along the highest-reward edges]
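The exact edge layout of the figure is not fully recoverable here, so the graph below is a hypothetical stand-in with made-up rewards. The sketch only illustrates the idea: enumerate the paths from A to D and keep the one with the highest cumulative reward.

# Hypothetical graph: the edge rewards are illustrative, not read off the slide's figure.
edges = {
    "A": {"B": 15, "C": 5},
    "B": {"C": 10, "D": -20},
    "C": {"D": 25},
    "D": {},
}

def best_path(node, goal):
    """Return (best cumulative reward, path) from node to goal by exhaustive search."""
    if node == goal:
        return 0, [goal]
    best_reward, best_route = float("-inf"), None
    for nxt, r in edges[node].items():
        reward, route = best_path(nxt, goal)
        if route is not None and r + reward > best_reward:
            best_reward, best_route = r + reward, [node] + route
    return best_reward, best_route

print(best_path("A", "D"))   # (50, ['A', 'B', 'C', 'D']) with the rewards above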


MDP Notation
• an MDP has a set of states S, a set of actions A, and a set of rewards R
• assume that each of these sets has a finite number of elements
• at each time step t, the agent receives some representation of the environment's state, S_t ∈ S
• based on this state, the agent selects an action A_t ∈ A, which gives the state-action pair (S_t, A_t)
• time is then incremented to the next time step t+1, the environment transitions to a new state S_{t+1} ∈ S, and the agent receives a numerical reward R_{t+1} ∈ R for the action taken from state S_t
• the process of receiving a reward can be seen as an arbitrary function f that maps state-action pairs to rewards, so that at each time t, f(S_t, A_t) = R_{t+1}
• the trajectory representing the sequential process of selecting an action from a state, transitioning to a new state, and receiving a reward can be represented as
  S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, …
• Goal: learn to choose actions that maximize the cumulative reward
[Figure: agent-environment loop - the Agent in State S_t takes Action A_t; the Environment returns Reward R_{t+1} and next State S_{t+1}]
MDP Notation
• From the diagram
  • At time t, the environment is in state S_t
  • The agent observes the current state S_t and selects action A_t
  • The environment transitions to state S_{t+1} and grants the agent reward R_{t+1}
  • This process then starts over for the next time step, t+1
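A minimal sketch of this loop in code, reusing the toy MDP defined earlier (the episode cap and the choice of s2 as a terminal state are arbitrary), records the trajectory S_0, A_0, R_1, S_1, A_1, R_2, …

# One episode of the agent-environment loop, using the toy S, P, R and policy above.
state = "s0"
trajectory = []

for t in range(10):                      # cap the episode length
    if state == "s2":                    # treat s2 as a terminal state
        break
    action = policy[state]               # agent observes S_t and selects A_t
    reward = R[(state, action)]          # environment grants reward R_{t+1}
    next_state = P[(state, action)]      # environment transitions to S_{t+1}
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)   # [('s0', 'right', 0.0), ('s1', 'right', 1.0)]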
Learning in RL
• agents take random decisions in their environment and learn to select the right one out of many to achieve their goal and play at a super-human level
• Policy network
  • a network which learns to give a definite output for a particular input of the game
• Value network
  • assigns a value/score to the state of the game by calculating an expected cumulative score for the current state s
  • every state goes through the value network
  • the states which get more reward obviously get more value in the network
Algorithms for control learning
• Criterion of optimality:
  • the agent's action selection is modeled as a map called the policy
• Brute force:
  • choose the policy with the largest expected return
• Value function:
  • attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy
• Monte Carlo methods:
  • mimic policy iteration
• Temporal difference methods:
  • address the inefficiency of Monte Carlo estimates by allowing the procedure to change the policy (at some or all states) before the values settle
• Direct policy search:
  • search directly in (some subset of) the policy space
Q-Learning
• a value-based method of supplying information to inform which action an agent should take
• finds the optimal policy by learning the optimal Q-values for each state-action pair
• Steps:
  1. start with the initialization of the Q-table
  2. the agent selects an action and performs it
  3. the reward for the action is measured
  4. the Q-table is updated
• the Q-table is a table or matrix created during Q-learning
• the agent's goal is to maximize the value of Q by finding the best action to take at a particular state
• the Q stands for quality, which indicates the quality of the action taken by the agent
Q-Learning
Now we'll add a similar matrix, "Q". The rows of matrix Q represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).

The agent starts out knowing nothing, so the matrix Q is initialized to zero.

The transition rule of Q-learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

A value assigned to a specific element of matrix Q is equal to the sum of the corresponding value in matrix R and the learning parameter Gamma multiplied by the maximum value of Q for all possible actions in the next state.
The Q-Learning algorithm goes as follows:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
     Select a random initial state.
     Do While the goal state hasn't been reached.
       Select one among all possible actions for the current state.
       Using this possible action, consider going to the next state.
       Get maximum Q value for this next state based on all possible actions.
       Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
       Set the next state as the current state.
     End Do
   End For
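The loop above can be written compactly in Python. The reward matrix R below is a made-up three-state example rather than the graph from the earlier slides, and gamma and the episode count are arbitrary choices.

import numpy as np

# Hypothetical reward matrix R: rows = current state, columns = action (next state).
# -1 marks an impossible transition; 100 is the reward for reaching the goal state 2.
R = np.array([[ -1,   0,  -1],
              [  0,  -1, 100],
              [ -1,   0, 100]], dtype=float)

gamma = 0.8
goal = 2
Q = np.zeros_like(R)                              # 2. initialize matrix Q to zero

for episode in range(1000):                       # 3. for each episode
    state = np.random.randint(R.shape[0])         # select a random initial state
    while state != goal:                          # until the goal state is reached
        actions = np.where(R[state] >= 0)[0]      # possible actions for the current state
        action = np.random.choice(actions)
        next_state = action                       # here the action index is the next state
        # Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()
        state = next_state

print((Q / Q.max() * 100).round())                # normalized Q-table after training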
Temporal difference learning
• refers to a class of model-free reinforcement learning methods which
learn by bootstrapping from the current estimate of the value function
• sample from the environment, like Monte Carlo methods, and perform
updates based on current estimates, like dynamic programming methods
• adjust predictions to match later, more accurate, predictions about the
future before the final outcome is known
• Example:
• Suppose you wish to predict the weather for Saturday, and you have some model
that predicts Saturday's weather, given the weather of each day in the week. In the
standard case, you would wait until Saturday and then adjust all your models.
However, when it is, for example, Friday, you should have a pretty good idea of what
the weather would be on Saturday – and thus be able to change, say, Saturday's
model before Saturday arrives
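As a rough illustration (not from the slides), a single TD(0) update nudges an earlier value estimate toward the one-step bootstrapped target built from the later, more accurate estimate. The states, reward, step size and discount below are arbitrary.

# TD(0) update for state values: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
V = {"friday": 0.0, "saturday": 0.5}       # current value estimates (made-up numbers)
alpha, gamma = 0.1, 1.0

# one observed transition: from "friday" we move to "saturday" and receive reward 0.0
s, r, s_next = "friday", 0.0, "saturday"
td_target = r + gamma * V[s_next]          # the later, more accurate prediction
V[s] += alpha * (td_target - V[s])         # adjust the earlier prediction toward it

print(V["friday"])   # 0.05 after this single update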
Partially Observable States
• a complete state of the world being visible to the agent is highly unrealistic
• often, the dynamics of an agent's environment and observations are unknown
• a Partially Observable MDP (POMDP) models the information available to the agent by specifying a function from the hidden state to the observables
• the goal now is to find a mapping from observations (not states) to actions
• the agent must maintain a sensor model (the probability distribution of different observations given the underlying state) and the underlying MDP
• a POMDP's policy is a mapping from the observations (or belief states) to the actions
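A small sketch of the belief-state idea: given a sensor model P(observation | hidden state), the agent keeps a probability distribution over hidden states and updates it after each observation with Bayes' rule. The two hidden states, the observation name, and all the probabilities below are invented for illustration (the transition step is omitted for brevity).

# Belief update for a two-state POMDP: b'(s) is proportional to P(obs | s) * b(s)
belief = {"door_open": 0.5, "door_closed": 0.5}     # prior over hidden states

sensor = {                                          # hypothetical sensor model P(obs | state)
    ("sees_open", "door_open"):   0.8,
    ("sees_open", "door_closed"): 0.2,
}

obs = "sees_open"
unnormalized = {s: sensor[(obs, s)] * b for s, b in belief.items()}
total = sum(unnormalized.values())
belief = {s: p / total for s, p in unnormalized.items()}

print(belief)   # {'door_open': 0.8, 'door_closed': 0.2}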
Policy Search
• Reinforcement Learning using a neural network policy
• The policy can be any algorithm you can think of, and it does not even have to be deterministic.
[Figure: four points in policy space and the agent's corresponding behavior]
• Genetic algorithms
  • randomly create a first generation of 100 policies and try them out, then "kill" the 80 worst policies
• Optimization techniques
  • evaluate the gradients of the rewards with regard to the policy parameters
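A toy sketch of the genetic-algorithm idea above. The policy representation (a single number), the scoring function, and the mutation noise are all placeholders; evaluate() stands in for actually running each policy in the environment.

import random

def evaluate(policy):
    """Stand-in for the total reward obtained by running this policy."""
    return -abs(policy - 0.5) + random.gauss(0, 0.05)

population = [random.random() for _ in range(100)]      # first generation of 100 policies
ranked = sorted(population, key=evaluate, reverse=True)
survivors = ranked[:20]                                 # "kill" the 80 worst policies

# next generation: each survivor produces 5 slightly mutated offspring
next_gen = [min(1.0, max(0.0, p + random.gauss(0, 0.1)))
            for p in survivors for _ in range(5)]
print(len(next_gen))   # 100 policies again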
CartPole – OpenAI Gym
• A pole is attached by an un-
actuated joint to a cart, which
moves along a frictionless track
• The system is controlled by applying
a force of +1 or -1 to the cart
• The pendulum starts upright, and
the goal is to prevent it from falling
over
• A reward of +1 is provided for every
timestep that the pole remains
upright
• The episode ends when the pole is
more than 15 degrees from vertical,
or the cart moves more than 2.4
units from the center
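A minimal interaction loop for this environment with the OpenAI Gym API might look like the sketch below; the exact return values of reset() and step() depend on the Gym version (this follows the classic pre-0.26 interface), and the random action choice is just a placeholder policy.

import gym

env = gym.make("CartPole-v0")
obs = env.reset()                        # cart position, cart velocity, pole angle, pole angular velocity
total_reward = 0.0

done = False
while not done:
    action = env.action_space.sample()   # random policy: push left (0) or right (1)
    obs, reward, done, info = env.step(action)
    total_reward += reward               # +1 for every timestep the pole stays up

print(total_reward)
env.close()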
Neural Network Policy
• This neural network will take an observation as input, and it will output the action to be executed.
• More precisely, it will estimate a probability for each action, and then we will select an action randomly according to the estimated probabilities
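One possible way to build such a policy network for CartPole is sketched below with Keras; the layer sizes and activations are one reasonable configuration, not the only one, and the single sigmoid output is interpreted as the probability of pushing left.

import numpy as np
import tensorflow as tf

# Policy network: 4 observation inputs -> probability of action 0 (push left)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="elu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

obs = np.array([[0.01, 0.02, 0.03, 0.04]])     # example observation
p_left = model.predict(obs, verbose=0)[0, 0]

# select an action randomly according to the estimated probability
action = 0 if np.random.rand() < p_left else 1
print(p_left, action)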
Evaluating Actions: The Credit Assignment Problem
• If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability and the target probability
• However, in Reinforcement Learning the only guidance the agent gets is through rewards.
• For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad?
• All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible.
• This is called the credit assignment problem:
  • when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for?
Discounted Rewards
• to tackle the credit assignment problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, applying a discount factor at each step
• the discount factor (gamma, between 0 and 1) determines how much future rewards count compared to immediate ones
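A short sketch of computing discounted returns for a sequence of rewards; the reward values and discount factor below are arbitrary examples.

# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# Computed right-to-left so each step reuses the return of the step after it.
def discounted_returns(rewards, gamma=0.95):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([10, 0, -50], gamma=0.8))   # [-22.0, -40.0, -50.0]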
Supervised vs
Unsupervised vs
Reinforcement Learning

• Supervised Learning
• Set of labeled examples provided by ‘external supervisor’
• Not applicable to learning from interaction
• Generally complicated to obtain examples of all situations
• Unsupervised Learning
• Usually tries to learn structure / data representation
• Does not exactly match RL: RL wants to maximize a reward
• Reinforcement Learning
  • works with unlabeled data and possible hidden structure, but does not rely on that structure - the aim is to maximize a reward
Supervised Learning vs Reinforcement Learning
• Supervised Learning
  • Step 1:
    • Teacher: Does picture 1 show a car or a flower?
    • Learner: A flower
    • Teacher: No, it's a car
  • Step 2:
    • Teacher: Does picture 2 show a car or a flower?
    • Learner: A car
    • Teacher: Yes, it's a car
  • Step 3: …
• Reinforcement Learning
  • Step 1:
    • World: You are in state 9. Choose action A or C
    • Learner: Action A
    • World: Your reward is 100
  • Step 2:
    • World: You are in state 32. Choose action B or E
    • Learner: Action B
    • World: Your reward is 50
  • Step 3: …
Types of Reinforcement Learning
• Search-based: evolution directly on a policy
  • E.g., genetic algorithms
• Model-based: build a model of the environment
  • Then you can use dynamic programming
  • Memory-intensive learning method
• Model-free: learn a policy without any model
  • Temporal difference methods (TD)
  • Requires limited episodic memory (though more helps)
Types of Model-free Reinforcement Learning
• Actor-critic learning
  • The TD version of Policy Iteration
• Q-learning
  • The TD version of Value Iteration
  • This is the most widely used RL algorithm
Challenges in reinforcement learning
• Feature / reward design can be very involved
  • Online learning (no time for tuning)
  • Continuous features (handled by tiling)
  • Delayed rewards (handled by shaping)
• Parameters can have large effects on learning speed
  • Tuning has just one effect: slowing it down
• Realistic environments can have partial observability
• Realistic environments can be non-stationary
• There may be multiple agents

