REINFORCE Algorithm
Objective of REINFORCE
The objective function for the policy is the expected return:

J(θ) = E_{π_θ}[Gt]

where the return Gt is the discounted sum of rewards from timestep t onward:

Gt = ∑_{k=0}^{∞} γ^k r_{t+k}

Let's define π_θ(a|s) as the probability of taking action a in state s under the policy with parameters θ, and γ ∈ [0, 1) as the discount factor.
Episode Generation
The agent follows the policy, which samples actions based on the probabilities
output by the policy network. For simplicity, assume it receives rewards as
follows:
Step    State    Action          Reward
1       0        Right           -1
2       1        Right           -1
3       5        Goal (Right)    10
Calculate Returns Gt
For each step, working in reverse order from the last timestep, we calculate the return Gt = rt + γGt+1, using a discount factor γ = 0.99.
Step 3 (terminal step): G3 = r3 = 10
Step 2: G2 = r2 + γG3 = −1 + 0.99 × 10 = 8.9
Step 1: G1 = r1 + γG2 = −1 + 0.99 × 8.9 = 7.811
So, the returns for each timestep are G1 = 7.811, G2 = 8.9, and G3 = 10.
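As a quick check, the same reward-to-go values can be reproduced in a few lines of Python by accumulating the rewards in reverse, exactly as in the hand calculation (the reward list and γ are taken from the example above):

rewards = [-1, -1, 10]   # rewards from the example episode
gamma = 0.99             # discount factor used above

returns = []
G = 0.0
for r in reversed(rewards):      # work backwards: G_t = r_t + gamma * G_{t+1}
    G = r + gamma * G
    returns.insert(0, G)

print(returns)   # ≈ [7.811, 8.9, 10.0], i.e. G1, G2, G3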
Gradient Update
The loss for the episode is the negative log-probability of each chosen action weighted by its reward-to-go, summed over all timesteps: L(θ) = −∑t log(π_θ(at|st)) · Gt. Minimizing this loss follows the REINFORCE policy gradient ∇θ J(θ) ≈ ∑t ∇θ log(π_θ(at|st)) · Gt. We backpropagate this total loss and update the policy network parameters.
Code
In [4]: import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

# 1-D line-world environment: the agent starts at start_position and tries
# to reach goal_position by moving left (action 0) or right (action 1).
# Note: the class name, the __init__ defaults, and the position update are
# assumptions; only reset() and the step() reward logic appear in the original cell.
class LineWorld:
    def __init__(self, start_position=0, goal_position=5, max_steps=20):
        self.start_position = start_position
        self.goal_position = goal_position
        self.max_steps = max_steps

    def reset(self):
        self.current_position = self.start_position
        return self.current_position

    def step(self, action):
        self.current_position += 1 if action == 1 else -1
        if self.current_position == self.goal_position:
            return self.current_position, 10, True   # reward of 10 and episode ends
        else:
            return self.current_position, -1, False  # reward of -1 for each step

# Generate an episode
def generate_episode(env, policy):
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(env.max_steps):
        state_tensor = torch.tensor([state], dtype=torch.float32)  # Convert state to a tensor
        dist = policy(state_tensor)       # Get action distribution from policy
        action = dist.sample()            # Sample an action based on the distribution
        next_state, reward, done = env.step(action.item())  # Take the action in the environment
        states.append(state_tensor)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards

# Update policy
def update_policy(states, actions, returns):
    loss = 0
    for state, action, G in zip(states, actions, returns):
        dist = policy(state)              # Get action distribution for the state
        log_prob = dist.log_prob(action)  # Log probability of the chosen action
        loss += -log_prob * G             # Multiply by return and accumulate the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # Update policy parameters
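The exported cell does not show the policy-network definition or the training loop. A minimal sketch of those missing pieces follows; the PolicyNetwork architecture, the Adam learning rate, and the number of training episodes are assumptions rather than values from the original notebook.

# Policy network (assumed architecture): maps the scalar state to a
# categorical distribution over the two actions (0 = left, 1 = right).
class PolicyNetwork(nn.Module):
    def __init__(self, hidden_size=32, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, state):
        return Categorical(logits=self.net(state))

env = LineWorld()
policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=0.01)    # learning rate assumed
gamma = 0.99

# Training loop: sample an episode, compute reward-to-go, update the policy.
for episode in range(500):                              # episode count assumed
    states, actions, rewards = generate_episode(env, policy)
    returns = []
    G = 0.0
    for r in reversed(rewards):                         # same backward pass as the hand calculation
        G = r + gamma * G
        returns.insert(0, G)
    update_policy(states, actions, returns)

The per-step traces that follow were printed during training and show, for several episodes, the state, the sampled action, the reward-to-go G, the log probability of the chosen action, and the resulting loss term.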
Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.407
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.407) * 5.666 = 2.308
Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.229
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.229) * 6.733 = 1.539
Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.123
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.123) * 7.811 = 0.962
Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.065
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.065) * 8.900 = 0.576
Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.034
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.034) * 10.000 = 0.336
Step 1:
State = 0.0, Action = 0
Reward-to-go (G) = 1.502
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|0.0)) = -1.108
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-1.108) * 1.502 = 1.664
Step 2:
State = -1.0, Action = 1
Reward-to-go (G) = 2.527
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|-1.0)) = -0.686
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.686) * 2.527 = 1.733
Step 3:
State = 0.0, Action = 0
Reward-to-go (G) = 3.563
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|0.0)) = -1.108
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-1.108) * 3.563 = 3.948
Step 4:
State = -1.0, Action = 1
Reward-to-go (G) = 4.609
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|-1.0)) = -0.686
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.686) * 4.609 = 3.161
Step 5:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.401
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.401) * 5.666 = 2.271
Step 6:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.220
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.220) * 6.733 = 1.485
Step 7:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.116
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.116) * 7.811 = 0.909
Step 8:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.060
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.060) * 8.900 = 0.533
Step 9:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.030
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.030) * 10.000 = 0.304
Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.213
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.213) * 6.733 = 1.435
Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.110
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.110) * 7.811 = 0.863
Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.056
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.056) * 8.900 = 0.497
Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.028
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.028) * 10.000 = 0.278
Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.206
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.206) * 6.733 = 1.386
Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.105
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.105) * 7.811 = 0.817
Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.052
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.052) * 8.900 = 0.461
Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.025
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.025) * 10.000 = 0.253
Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 3.563
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.381
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.381) * 3.563 = 1.359
Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 4.609
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.199
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.199) * 4.609 = 0.916
Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.099
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.099) * 5.666 = 0.561
Step 4:
State = 3.0, Action = 0
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|3.0)) = -3.059
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-3.059) * 6.733 = 20.595
Step 5:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.099
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.099) * 7.811 = 0.773
Step 6:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.048
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.048) * 8.900 = 0.428
Step 7:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.023
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.023) * 10.000 = 0.230
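Each trace above pairs a state and sampled action with its reward-to-go, log probability, and loss contribution −log(π_θ(a|s)) · G. The original printing code is not shown; a loop along these lines (hypothetical) reproduces the format for a single episode, reusing env, policy, and gamma from the code above:

states, actions, rewards = generate_episode(env, policy)
returns = []
G = 0.0
for r in reversed(rewards):                  # reward-to-go for each step
    G = r + gamma * G
    returns.insert(0, G)

for i, (state, action, G) in enumerate(zip(states, actions, returns), start=1):
    with torch.no_grad():                    # no gradients needed for logging
        log_prob = policy(state).log_prob(action).item()
    print(f"Step {i}:")
    print(f"State = {state.item()}, Action = {action.item()}")
    print(f"Reward-to-go (G) = {G:.3f}")
    print(f"Log Probability = log(π_θ({action.item()}|{state.item()})) = {log_prob:.3f}")
    print(f"Loss = -({log_prob:.3f}) * {G:.3f} = {-log_prob * G:.3f}")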