Q Learning
1. The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
2. If the robot steps on a mine, the point loss is 100 and the game
ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, it gets 100 points (these rules are sketched as a small reward function after this list).
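Concretely, the reward scheme can be expressed as a small lookup. This is only an illustrative sketch; the cell labels and the function name are assumptions, not from the original article.

```python
def reward_for(cell: str) -> int:
    """Reward received when the robot lands on a cell of the given type."""
    rewards = {
        "empty": -1,    # every ordinary step costs 1 point
        "mine": -100,   # stepping on a mine: lose 100 points, game ends
        "power": 1,     # picking up power: gain 1 point
        "goal": 100,    # reaching the end goal: gain 100 points
    }
    return rewards[cell]
```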
Now, the obvious question is: How do we train a robot to
reach the end goal with the shortest path without stepping on a
mine?
In the Q-table, the columns are the actions and the rows are
the states.
Each Q-table value is the maximum expected future
reward that the robot will get if it takes that action in that
state. This is an iterative process, as we need to improve the
Q-table at each iteration.
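The "function" referred to next was presumably shown as an image in the original; as a reference, a standard form of the Q-learning update that produces these values is (with $\alpha$ the learning rate and $\gamma$ the discount factor):

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big]
$$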
Using the above function, we get the values of Q for the cells
in the table.
When we start, all the values in the Q-table are zeros.
For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now —
our robot knows nothing about the environment. So the robot
chooses a random action, say right.
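As a minimal sketch of this starting point, the Q-table can be initialized to zeros and an action picked either at random (exploration) or greedily (exploitation). The grid size, state encoding, and epsilon value here are assumptions, not from the original.

```python
import numpy as np

n_states = 25      # assumed 5x5 grid, one state per cell
n_actions = 4      # up, down, left, right
epsilon = 0.1      # exploration rate (assumed value)

# The Q-table starts as all zeros: rows are states, columns are actions.
q_table = np.zeros((n_states, n_actions))

def choose_action(state: int) -> int:
    """Epsilon-greedy: usually exploit the best known action,
    occasionally explore a random one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(q_table[state]))     # exploit: best Q-value so far
```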
We can now update the Q-values for being at the start and
moving right using the Bellman equation.
Reward values: power = +1, mine = -100, end goal = +100.
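A minimal sketch of that single update, continuing the Q-table above. The learning rate, discount factor, and state/action indices are assumed values for illustration.

```python
import numpy as np

alpha = 0.1    # learning rate (assumed value)
gamma = 0.9    # discount factor (assumed value)

q_table = np.zeros((25, 4))   # as in the sketch above: 25 states, 4 actions

state, action = 0, 3          # start cell and the action "right" (assumed indices)
next_state, reward = 1, -1    # moving one step to an empty cell costs 1 point

# Bellman update: move Q(start, right) toward the reward plus the
# discounted value of the best action available in the next state.
q_table[state, action] += alpha * (
    reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
)
```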
SARSA vs Q-learning
The difference between the two algorithms is that SARSA updates its
Q-values using the action actually selected by the current policy,
whereas Q-learning updates toward the greedy action, that is, the
action with the maximum Q-value in the next state, and therefore
learns the value of the optimal policy.
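Side by side, the two update rules (standard textbook forms) make the difference explicit: SARSA uses the action $a_{t+1}$ that the current policy actually takes next, while Q-learning uses the maximum over all actions:

$$
\text{SARSA: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]
$$
$$
\text{Q-learning: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]
$$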