L12 Reinforcement Learning – Part 2
Markov Decision Process (MDP)
• a way to formalize sequential decision making - the basis for structuring problems that are solved with reinforcement learning
• a Reinforcement Learning policy maps a current state to an action; the agent continuously interacts with the environment to produce new solutions and receive rewards
• used to formalize RL problems when the environment is fully observable
• the Markov property states that the future is independent of the past, given the present
• given the present state, the next state can be predicted without needing the previous states
[Diagram: agent–environment loop showing the Agent taking an Action and the Environment returning a new State and a Reward]
Markov Decision Process (MDP)
• the agent aims to maximize not just the immediate reward at each state, but the cumulative reward it receives over time
• the agent interacts with the environment and takes an action in its current state to reach the next state; the maximum reward returned depends on the action taken
• the process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, creating a trajectory that shows the sequence of states, actions, and rewards (a minimal sketch follows this list)
• Parameters
  • Set of models
  • Set of all possible actions - A
  • Set of states - S
  • Reward - R
  • Policy - π
  • Value - V
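A minimal sketch of these pieces in plain Python, using a made-up two-state MDP (the state names, actions, transition table, and reward values below are illustrative assumptions, not taken from the slides):

```python
import random

# Hypothetical MDP: states S, actions A, and a deterministic transition/reward table.
S = ["s0", "s1", "terminal"]
A = ["left", "right"]

# transition[(state, action)] = (next_state, reward)   (illustrative values)
transition = {
    ("s0", "left"):  ("s1", 1.0),
    ("s0", "right"): ("terminal", -1.0),
    ("s1", "left"):  ("s0", 0.0),
    ("s1", "right"): ("terminal", 5.0),
}

def random_policy(state):
    """Policy pi: maps a state to an action (here, uniformly at random)."""
    return random.choice(A)

def rollout(start="s0", max_steps=10):
    """Generate a trajectory: the sequence of (state, action, reward) tuples."""
    trajectory, state = [], start
    for _ in range(max_steps):
        if state == "terminal":
            break
        action = random_policy(state)
        next_state, reward = transition[(state, action)]
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

print(rollout())  # e.g. [('s0', 'left', 1.0), ('s1', 'right', 5.0)]
```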
Markov Decision Process (MDP)
• aims to find the shortest (best-reward) path between node A and node D
• A, B, C and D denote the nodes
• the number on each edge denotes the reward associated with that path
• travelling from node A to node B is an action (a brute-force sketch follows below)
[Figure: graph of nodes A, B, C, D; edge rewards in the figure include 15, 5, -20, 10, 0, 25]
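A brute-force sketch of this idea: enumerate every path from A to D and keep the one with the largest cumulative reward. The edge rewards below are illustrative placeholders, since the exact assignment of numbers to edges in the slide's figure cannot be recovered reliably:

```python
# Hypothetical edge rewards for the A-B-C-D example (illustrative values).
edges = {
    "A": {"B": 15, "C": 5},
    "B": {"C": -20, "D": 10},
    "C": {"D": 25},
    "D": {},
}

def all_paths(node, goal, path=None):
    """Enumerate every acyclic path from node to goal."""
    path = (path or []) + [node]
    if node == goal:
        yield path
        return
    for nxt in edges[node]:
        if nxt not in path:          # avoid revisiting nodes
            yield from all_paths(nxt, goal, path)

def path_reward(path):
    """Cumulative reward collected along a path."""
    return sum(edges[a][b] for a, b in zip(path, path[1:]))

best = max(all_paths("A", "D"), key=path_reward)
print(best, path_reward(best))       # with these numbers: ['A', 'C', 'D'] 30
```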
Learning in RL
• agents take random decisions in their environment and learn to select the right one out of many to achieve their goal, eventually playing at a super-human level
• Policy network
  • a network that learns to give a definite output (action) for a particular input to the game (see the sketch after this list)
• Value network
  • assigns a value/score to the state of the game by calculating an expected cumulative score for the current state s
  • every state goes through the value network
  • the states which receive more reward naturally get more value in the network
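A minimal sketch of the two networks, assuming PyTorch and a 4-dimensional state such as CartPole's observation; the layer sizes are arbitrary choices, not from the slides:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # CartPole-like observation and action sizes (assumed)

# Policy network: maps a state to a probability distribution over actions.
policy_net = nn.Sequential(
    nn.Linear(state_dim, 32),
    nn.ReLU(),
    nn.Linear(32, n_actions),
    nn.Softmax(dim=-1),
)

# Value network: assigns an expected cumulative score V(s) to a state.
value_net = nn.Sequential(
    nn.Linear(state_dim, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

state = torch.randn(1, state_dim)      # a dummy state, for illustration only
action_probs = policy_net(state)       # e.g. tensor([[0.47, 0.53]])
state_value = value_net(state)         # e.g. tensor([[0.12]])
print(action_probs, state_value)
```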
Algorithms for control learning
• Criterion of optimality:
  • the agent's action selection is modeled as a map called a policy
• Brute force:
  • choose the policy with the largest expected return
• Value function:
  • attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy
• Monte Carlo methods:
  • mimic policy iteration
• Genetic algorithms
  • randomly create a first generation of 100 policies and try them out, then “kill” the 80 worst policies (a toy sketch follows this list)
• Optimization Techniques
  • evaluate the gradients of the rewards with respect to the policy parameters
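A toy sketch of the genetic-algorithm idea on a made-up two-state environment; the environment, population size, and mutation rule are assumptions chosen only to keep the example self-contained:

```python
import random

# A tiny deterministic environment (illustrative, not from the slides):
# two states, two actions, arbitrary rewards.
REWARDS = {("s0", 0): 1.0, ("s0", 1): -1.0, ("s1", 0): 0.0, ("s1", 1): 5.0}
NEXT    = {("s0", 0): "s1", ("s0", 1): "s1", ("s1", 0): "s0", ("s1", 1): "s0"}

def evaluate(policy, steps=20):
    """Total reward of a deterministic policy {state: action} over one episode."""
    state, total = "s0", 0.0
    for _ in range(steps):
        action = policy[state]
        total += REWARDS[(state, action)]
        state = NEXT[(state, action)]
    return total

def random_policy():
    return {"s0": random.randint(0, 1), "s1": random.randint(0, 1)}

def mutate(policy):
    """Copy a policy and flip one state's action at random."""
    child = dict(policy)
    s = random.choice(list(child))
    child[s] = 1 - child[s]
    return child

# Generation 0: 100 random policies; each generation keeps the best 20
# ("kill" the 80 worst) and refills the population with mutated survivors.
population = [random_policy() for _ in range(100)]
for generation in range(10):
    population.sort(key=evaluate, reverse=True)
    survivors = population[:20]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(80)]

print(population[0], evaluate(population[0]))
```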
CartPole – OpenAI Gym
• A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track
• The system is controlled by applying a force of +1 or -1 to the cart
• The pendulum starts upright, and the goal is to prevent it from falling over
• A reward of +1 is provided for every timestep that the pole remains upright
• The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center (an interaction sketch follows below)
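A short sketch of interacting with this environment under a random policy. This assumes the Gymnasium package (the maintained fork of OpenAI Gym); the older `gym` package uses slightly different `reset`/`step` return values:

```python
import gymnasium as gym   # assumption: Gymnasium API; classic gym differs slightly

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random force: 0 = push left, 1 = push right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                        # +1 for every step the pole stays up
    done = terminated or truncated                # pole fell, cart left the track, or time limit

print("episode return:", total_reward)
env.close()
```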
Neural Network Policy
• If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability and the target probability (a sketch of this supervised-style update follows below)
• However, in Reinforcement Learning the only guidance the agent gets is through rewards
• For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad?
• All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible
• This is called the credit assignment problem:
  • when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for?
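To make the first bullet concrete, here is a hedged sketch of that supervised-style update, assuming PyTorch and a logits-only variant of the policy network sketched earlier; the batch of states and the "known best" target actions are fabricated purely for illustration:

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(            # logits-only variant of the earlier policy-network sketch
    nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()        # cross entropy between predicted and target action

states = torch.randn(8, 4)                   # made-up batch of states
best_actions = torch.randint(0, 2, (8,))     # pretend we somehow knew the best action at each step

logits = policy_net(states)
loss = loss_fn(logits, best_actions)         # only possible when such targets exist...
optimizer.zero_grad()
loss.backward()
optimizer.step()
# ...but in RL there are no such targets, only delayed rewards (the credit
# assignment problem), which motivates discounted rewards on the next slide.
```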
Discounted Rewards
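The slide title names the standard fix for the credit assignment problem: credit each action with the sum of the rewards that come after it, discounted by a factor γ per step. A minimal sketch, where the reward sequence and discount factor are illustrative:

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for each step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
# [2.71, 1.9, 1.0] -- earlier actions get credit for later rewards, discounted
```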
Supervised vs Unsupervised vs Reinforcement Learning
• Supervised Learning
• Set of labeled examples provided by ‘external supervisor’
• Not applicable to learning from interaction
• Generally complicated to obtain examples of all situations
• Unsupervised Learning
• Usually tries to learn structure / data representation
• Does not exactly match RL: RL wants to maximize a reward
• Reinforcement Learning
  • Unlabeled data, possibly hidden structure, but no reliance on that structure: the goal is to maximize a reward
Supervised Learning vs Reinforcement Learning
• Supervised Learning
  • Step 1:
    • Teacher: Does picture 1 show a car or a flower?
    • Learner: A flower
    • Teacher: No, it’s a car
  • Step 2:
    • Teacher: Does picture 2 show a car or a flower?
    • Learner: A car
    • Teacher: Yes, it’s a car
  • Step 3:
    • …
• Reinforcement Learning
  • Step 1:
    • World: You are in state 9. Choose action A or C
    • Learner: Action A
    • World: Your reward is 100
  • Step 2:
    • World: You are in state 32. Choose action B or E
    • Learner: Action B
    • World: Your reward is 50
  • Step 3:
    • …
Types of Reinforcement Learning
• Search-based: evolution directly on a policy
• E.g., Genetic algorithm
• Feature / reward design can be very involved