
Intro to Reinforcement Learning

prof. Carlo Lucibello
Department of Computing Sciences
Bocconi University
Machine learning paradigms

● Supervised learning: “learn to predict”

● Unsupervised learning: “learn the representation”

● Reinforcement learning: “learn to control”


Supervised Learning Example: diagnosing heart abnormalities

● ML approach: teach the computer program through examples

● Ask cardiologists to label a large number of recorded ECG signals

● The learning algorithm adjusts the program/model so that its predictions agree with the cardiologists’ labels
Supervised Learning

Training set of input-output pairs: D = {(x_1, y_1), …, (x_n, y_n)}

Model:
● deterministic prediction: y = f_θ(x)
● probabilistic prediction: p_θ(y | x)
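To make the supervised setting concrete, here is a minimal sketch in PyTorch (the library used in the Colab notebook linked at the end). The data, feature size, and architecture below are placeholders, not the actual ECG model.

import torch
import torch.nn as nn

# Toy training set of input-output pairs (x_i, y_i): random placeholders
# standing in for ECG features labelled by cardiologists.
X = torch.randn(256, 16)            # 256 examples, 16 features each
y = torch.randint(0, 2, (256,))     # binary label: normal / abnormal

# Model f_theta: outputs class logits, i.e. a probabilistic prediction p_theta(y | x)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)     # disagreement between predictions and labels
    loss.backward()                 # adjust theta to reduce the disagreement
    opt.step()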
Reinforcement Learning
Games
Trading
Robotics
Reinforcement Learning in nature
Reinforcement Learning basics

Two key elements:


• Learning via trial and error
• No supervision
MAIN INGREDIENTS

● Agent: It’s me, Mario!

● State: observations about the world

● Actions: decisions on what to do next

● Rewards: positive or negative consequences of the actions

● Learning: learn to choose the action that gets the largest reward!
Basic set-up

● An agent repeatedly interacts with and learns from a stochastic environment, with the goal of maximizing cumulative rewards.

● “Trial-and-error” method of learning.

● Algorithmic principles motivated by psychology and neuroscience: rewards provide a positive reinforcement for an action.
Key features

Lack of a “supervisor”
● No labels telling us the best action to take
● We only have a reward signal

Delayed feedback
● The effect of an action may not be entirely visible instantaneously; it may affect the reward signal many steps later

Sequential decisions
● The sequence in which you make your moves determines the path you take and hence the final outcome

Actions affect observations
● Observations are a function of the agent’s own actions, which the agent may decide based on its past observations
Agent & Environment
State (Level Info)


Action (Up, Down, Left, Right, …)

Rewards:
● Finish the level
● Still in game
● Game over
● Get a coin
What do we want to learn?

Given the current state and the possible rewards, what is the best action we can choose?

The optimal POLICY!
LET'S FORMALIZE ALL THIS MATHEMATICALLY
State -> Action -> Reward -> State (SARS)

State: a vector of numbers
Action: a discrete choice
Rewards:
● Finish the level: +100
● Still in game: +0.1 per second
● Game over: -10
● Get a coin: +1
Return: cumulative reward

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …

Future rewards are weighted by the discount factor γ ∈ [0, 1]: the further away a reward is, the less it contributes to the return today.
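As a small illustration, a Python sketch of how the discounted return could be computed from a sequence of rewards; the reward values are made up.

def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... (cumulative discounted reward)."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example episode: a bit of time alive, two coins, then game over (illustrative values)
print(discounted_return([0.1, 1.0, 0.1, 1.0, -10.0]))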
HOW CAN WE LEARN?
What do we want to learn?

Given the current state and the possible rewards, what is the best action we can choose?
First approach: Policy Learning

How can we find the best action given a specific state?

We can “build” a function, the so-called POLICY, and try to optimize it.

Deterministic policy: a = π(s)

Stochastic policy: a ~ π(a | s)
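A small illustration of the difference, with made-up states and probabilities: a deterministic policy maps a state to a single action, a stochastic policy maps a state to a probability distribution over actions and samples from it.

import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def deterministic_policy(state):
    # a = pi(s): the same state always gives the same action (toy rule)
    return "right" if state["coin_ahead"] else "up"

def stochastic_policy(state, rng=np.random.default_rng()):
    # a ~ pi(a | s): sample an action from a state-dependent distribution (toy probabilities)
    probs = [0.1, 0.1, 0.1, 0.7] if state["coin_ahead"] else [0.4, 0.2, 0.2, 0.2]
    return rng.choice(ACTIONS, p=probs)

state = {"coin_ahead": True}
print(deterministic_policy(state), stochastic_policy(state))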
Second Approach: Q-Learning

Given a policy, one can compute the average return (total reward) obtained by playing a certain action in a certain state and then continuing to play according to the policy. This is called the Q-function:

Q^π(s, a) = expected return when taking action a in state s and then following π

One can then use the Q-function to improve the policy, then compute a new Q-function, and so on.

E.g. the greedy policy derived from the Q-function: π(s) = argmax_a Q(s, a)
HOW TO LEARN THE
Q-FUNCTION?
Q-Learning (Classic Algorithm)

Idea: what if we consider the expected return of each action in each of the different states?

        A1    A2     A3
S1      12     0    -10
S2       4    10   1000

● Q-Learning helps the agent make decisions by estimating the value of different actions in different states

● It learns from experience which is the best action for each state

● You build your policy, then play the game many times and update the Q-function
Q-Table and Q-value
Learning the Q-table (pseudo-code)
Q-value update

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]

● α: learning rate
● γ: discount factor
● r: reward obtained after the step
● max_{a'} Q(s', a'): the maximum value of the Q-function over all possible actions in the new state s'
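A minimal tabular Q-learning sketch implementing the update rule above, written for a Gymnasium-style environment (FrozenLake, as in the notebooks linked at the end); the hyperparameters are illustrative, and this is a sketch rather than the exact pseudocode from the slide.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))  # the Q-table
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection (see the exploration/exploitation slide)
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + gamma * np.max(Q[s_next]) * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next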
DOES IT REALLY WORK?
Real world case

Chess game:
● State: 10^50
● Action: 30-40 (legal ones) or 2^18 (possible)
● Transitions to a new state are deterministic, but depend on the adversary
● Rewards: 0 for each intermediate step, {-1, 0, 1} at the end
Real world case

GO game:
● State: 3^361 (possible) or 10^170 (legal)
● Action: 200 (average) or 361 (beginning)
● Transitions to a new state are deterministic, but depend on the adversary
● Rewards: 0 for each intermediate step, {-1, 0, 1} at the end
Real world case

Texas Hold ‘em:
● State: 10^31 × 4^8 (9 players)
● Action: 4 (fold, raise, call, check)
● Transitions to a new state are stochastic and depend on the adversary
● Rewards: 0 for each step, {−1, 0, 1} at the end
Real world case

Tram in Milano:
● State: 18 × 30 × 2 (lines × stops × (at or going to))
● Action: 100^3 (#tram^#state)
HOW DO WE DEAL WITH COMPLEX SETTINGS?
Deep Q-Learning

Constructing this function exactly, even using Q-learning, can be an impossible task due to the large state space.

However, we can approximate it.

And the best way to do this is to use a neural network…
Deep Q-Learning
Deep Q-Learning (pseudocode)
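A compressed sketch of the deep Q-learning idea: a network Q_θ(s) outputs one value per action and is trained to match the bootstrapped target r + γ·max_{a'} Q(s', a'). Real implementations (including the one in the Colab notebook) add an experience replay buffer and a target network, which are omitted here; the sizes and the batch of transitions below are fake placeholders.

import torch
import torch.nn as nn

n_state_features, n_actions, gamma = 8, 4, 0.99   # placeholder sizes

# Q-network: maps a state vector to one Q-value per action
q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(s, a, r, s_next, done):
    # Regression towards the target r + gamma * max_a' Q(s', a')
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a) for the chosen actions
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)

# One illustrative update on a fake batch of 32 transitions (s, a, r, s', done)
s      = torch.randn(32, n_state_features)
a      = torch.randint(0, n_actions, (32,))
r      = torch.randn(32)
s_next = torch.randn(32, n_state_features)
done   = torch.zeros(32)

loss = td_loss(s, a, r, s_next, done)
opt.zero_grad(); loss.backward(); opt.step()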
A trained Q-network playing Super Mario!
AI Olympics
Tag Game - Evolving Strategies
It works!!! (Funny examples)
It works!!! (Funny examples)
Tricks and Problems
Exploration vs Exploitation

● Exploration: try new actions to gather information about the environment

● Exploitation: choose actions based on the current knowledge to maximize rewards

A balance between exploration and exploitation is essential!


Exploration vs Exploitation: the epsilon-greedy policy

With probability ε take a random action (explore); otherwise take the action with the highest estimated Q-value (exploit).
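A minimal sketch of the rule, assuming a Q-table with one row per state as in the tabular algorithm above.

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=np.random.default_rng()):
    # With probability epsilon explore (random action), otherwise exploit (best known action)
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploration
    return int(np.argmax(Q[state]))           # exploitation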
Getting stuck against walls?

Global (or Delayed) Rewards: instead of looking at the reward at every instant, look at all the rewards of the episode together.

Delayed Rewards vs. Delayed Punishment


Let's train our Neural Networks using PyTorch and Google Colab!

https://tinyurl.com/BocconiRL
Other Links

https://tinyurl.com/BocconiFrozenLake

https://tinyurl.com/FrozenLakeTEO

https://huggingface.co/learn/deep-rl-course/unit0/introduction
