Intro To Reinforcement Learning
Reinforcement Learning
Prof. Carlo Lucibello
Model: deterministic prediction / probabilistic prediction
Reinforcement Learning
Games
Trading
Robotics
Reinforcement Learning in nature
Reinforcement Learning basics
● “Trial-and-error” method of learning.
Delayed feedback
● The effect of an action may not be entirely visible instantaneously, but it may affect the reward signal many steps later.
Sequential decisions
● The sequence in which you make your moves will decide the path you take and hence the final outcome.
[Figure: in-game reward events, “Get a coin” and “Game over”]
What do we want to learn?
Optimal POLICY!
LET'S FORMALIZE ALL THIS MATHEMATICALLY
State -> Action -> Reward -> State (SARS)
State (a vector of numbers)
Action (discrete choices)
Rewards
Finish the level: +100
Still in game: +0.1 per sec
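
Putting the pieces together (an illustrative sketch, not part of the slides; it assumes the gymnasium package and uses the FrozenLake environment linked at the end of the deck), the State -> Action -> Reward -> State loop looks like this:

    import gymnasium as gym

    # One episode of the agent-environment loop: state -> action -> reward -> next state
    env = gym.make("FrozenLake-v1")
    state, info = env.reset(seed=0)
    done = False
    while not done:
        action = env.action_space.sample()   # a random policy, just to show the loop
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        state = next_state
    env.close()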
“Future” reward…?
Discount factor
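
One common way to write this (the slide implies it through the discount factor; the indexing convention here is an assumption): the discounted return adds up future rewards, each weighted by a power of the discount factor \gamma \in [0, 1):

    G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

A \gamma close to 0 makes the agent short-sighted, while a \gamma close to 1 makes far-away rewards almost as important as immediate ones.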
HOW CAN WE LEARN?
What do we want to learn?
Deterministic policy:
Stochastic policy:
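
The defining formulas appear on the slide as images; in standard textbook notation (not slide-specific) they are:

    Deterministic policy:  a = \pi(s)                                  (one fixed action per state)
    Stochastic policy:     \pi(a \mid s) = \Pr(A_t = a \mid S_t = s)   (a probability distribution over actions in each state)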
Second Approach: Q-Learning
Given a policy, one can compute the average return (total reward) obtained by playing a certain action in a certain state and then continuing to play according to the policy. This is called the Q-function:
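
In symbols (the slide shows the formula as an image; this is the standard definition, consistent with the discounted return above):

    Q^\pi(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, S_t = s,\ A_t = a,\ \text{subsequent actions drawn from } \pi \right]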
Idea: what if we consider the expected value of reward for each action in the different states?
      A1    A2     A3
S1    12     0    -10
S2     4    10   1000
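
As a side illustration (not on the slide), the greedy policy for this table just picks, in each state, the action with the largest Q-value:

    import numpy as np

    # Q-table from the slide: rows = states (S1, S2), columns = actions (A1, A2, A3)
    Q = np.array([[12.0,  0.0,  -10.0],
                  [ 4.0, 10.0, 1000.0]])

    greedy_actions = Q.argmax(axis=1)   # best action index per state
    print(greedy_actions)               # [0 2], i.e. A1 in S1 and A3 in S2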
● It learns from experience which is the best action for each state
● You build your policy, then play the game many times and update the Q-function
Q-Table and Q-value
Learning the Q-table (pseudo-code)
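
The pseudo-code itself is an image on the slide; below is a minimal tabular Q-learning sketch in Python under common assumptions (the FrozenLake environment from the links at the end; the values of alpha, gamma and epsilon are illustrative, not from the slides):

    import numpy as np
    import gymnasium as gym

    env = gym.make("FrozenLake-v1")
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))        # Q-table, initialized to zero

    alpha, gamma, epsilon = 0.1, 0.99, 0.1     # learning rate, discount, exploration rate

    for episode in range(5000):
        state, info = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(Q[state].argmax())
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            # Q-value update (the formula on the next slide)
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state
    env.close()

The inner update line is exactly the Q-value update spelled out on the next slide.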
Q-value update

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where r is the reward obtained after the step, \max_{a'} Q(s', a') is the maximum value of the Q function over all possible actions in the new state s', and \alpha is the learning rate.
DOES IT REALLY WORK?
Real world case
Chess game:
● State: 10^50
● Action: 30-40 (legal ones)
or 2^18 (possible)
● Transitions to a new state are
deterministic, but depend on
the adversary
● Rewards: 0 for each
intermediate step, {-1,0,1} at
the end
Real world case
GO game:
● State: 3^361 (possible) or 10^170 (legal)
● Action: 200 (average) or 361 (beginning)
● Transitions to a new state are deterministic, but depend on the adversary
● Rewards: 0 for each intermediate step, {-1, 0, 1} at the end
Real world case
Tram in Milano:
● State: 18×30×2 (lines × stops × (at or going to))
● Action: 100^3 (#tram^#state)
HOW DO WE DEAL WITH COMPLEX SETTINGS?
Deep Q-Learning
Constructing this function, even using Q-learning, could be an impossible task due to the large state space.
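
Deep Q-learning replaces the table with a neural network that approximates Q(s, ·). A minimal sketch (the architecture, layer sizes and PyTorch choice are assumptions for illustration, not from the slides):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per action."""
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    # Example: a 4-dimensional state and 3 possible actions
    q_net = QNetwork(state_dim=4, n_actions=3)
    q_values = q_net(torch.randn(1, 4))        # shape (1, 3): one Q-value per action
    action = int(q_values.argmax(dim=1))       # greedy action

Training such a network properly (experience replay, target networks, ...) is beyond this sketch; the Hugging Face deep RL course linked below covers it.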
Global (or Delayed) Rewards: do not look at the reward at every instant; look at all rewards at the same time.
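
One way to make this concrete (an illustration, not from the slides): record the whole episode first, then compute the discounted return of every step by working backwards through the rewards:

    def discounted_returns(rewards, gamma=0.99):
        # G_t for every step t, computed from the complete episode
        returns = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Example: three steps with rewards 0, 0, 1
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))   # approximately [0.81, 0.9, 1.0]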
https://tinyurl.com/BocconiRL
Other Links
https://tinyurl.com/BocconiFrozenLake
https://tinyurl.com/FrozenLakeTEO
https://huggingface.co/learn/deep-rl-course/unit0/introduction