
REINFORCE Algorithm: Mathematical Background and Step-by-Step Calculation
Here's a Python implementation of the REINFORCE algorithm with explanations
for each step. For simplicity, I'll use a mock environment in which the agent
learns to reach a goal in a 1-dimensional space. The agent receives a positive
reward for reaching the goal and a small negative reward for every other step.

Mathematical Background of REINFORCE


The REINFORCE algorithm is a policy gradient method: it optimizes a parameterized
policy π_θ(a|s) with parameters θ by performing gradient ascent directly on the
expected return J(θ).

Objective of REINFORCE
The objective function for the policy is the expected return:

J(θ) = E_{π_θ}[G_t]

where

G_t = Σ_{k=0}^{∞} γ^k r_{t+k}

is the discounted return at timestep t, and γ is the discount factor.

Gradient of the Objective

The gradient of J(θ) with respect to the policy parameters θ can be derived as:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t | s_t) · G_t]

This equation tells us to:

Sample actions according to the current policy.
Compute the return G_t following each action.
Update θ using the product of the log-probability of each chosen action and its return G_t (see the sketch below).
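
In code, a single such update for one episode can be sketched roughly as follows (a minimal sketch assuming a PyTorch policy module that returns a torch.distributions.Categorical, an optimizer over its parameters, and per-step lists states, actions, and returns, as in the full implementation later in this document):

loss = 0.0
for state, action, G in zip(states, actions, returns):
    log_prob = policy(state).log_prob(action)  # log π_θ(a_t | s_t)
    loss = loss - log_prob * G                 # minimizing -log π · G is ascent on J(θ)

optimizer.zero_grad()
loss.backward()   # gradient of Σ_t [-log π_θ(a_t|s_t) · G_t]
optimizer.step()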

Step-by-Step Calculations for One Episode

To show how the code translates this math into steps, we'll walk through a single
episode of 3 steps, using the discount factor and sample reward values defined below.

Let's define:

Discount factor (γ): 0.99


Rewards in the episode: [−1, −1, 10] (agent gets a final large reward on
reaching the goal)

Episode Generation
The agent follows the policy, which samples actions based on the probabilities
output by the policy network. For simplicity, assume it receives rewards as
follows:

Step   State (s)   Action (a)     Reward (r)
1      0           Right          -1
2      1           Right          -1
3      5           Goal (Right)   10

Calculate Returns G_t
For each step, working in reverse order, we calculate the return G_t = r_t + γ G_{t+1}:

Step 3: G_3 = r_3 = 10
Step 2: G_2 = r_2 + γ G_3 = −1 + 0.99 × 10 = 8.9
Step 1: G_1 = r_1 + γ G_2 = −1 + 0.99 × 8.9 = 7.811

So, the returns for each timestep are G_1 = 7.811, G_2 = 8.9, and G_3 = 10.
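
These three numbers can be reproduced with a few lines of Python (a standalone sketch using only the example's γ and rewards):

gamma = 0.99
rewards = [-1, -1, 10]        # rewards from the worked example
G = 0.0
returns = []
for r in reversed(rewards):   # work backwards from the final step
    G = r + gamma * G
    returns.insert(0, G)
print(returns)                # ≈ [7.811, 8.9, 10.0]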

Compute Log Probabilities of Actions

Suppose the policy network outputs probabilities for actions Left and Right at
each state. For instance, the network might give a 30% probability to Left and
70% to Right. The log probabilities of the chosen actions would then look like
this (reproduced in the snippet after the list):

Step 1: log π_θ(Right | s = 0) = log(0.7) ≈ −0.357
Step 2: log π_θ(Right | s = 1) = log(0.7) ≈ −0.357
Step 3: log π_θ(Right | s = 5) = log(0.7) ≈ −0.357
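
The −0.357 figure is simply log(0.7); as a quick check, it can be reproduced with PyTorch's Categorical distribution (the same class used in the implementation below), assuming the example probabilities of 30% Left and 70% Right:

import torch
from torch.distributions import Categorical

dist = Categorical(probs=torch.tensor([0.3, 0.7]))  # [Left, Right]
print(dist.log_prob(torch.tensor(1)).item())        # action 1 = Right, ≈ -0.357 (= log 0.7)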

Calculate the Policy Gradient and Loss

Now we combine these values to compute the loss for gradient ascent. The loss
for each timestep is −log π_θ(a_t | s_t) · G_t, and the total loss is their sum
(checked in the snippet after this list):

Step 1: −log π_θ(Right | s = 0) · G_1 ≈ −(−0.357) × 7.811 = 2.788
Step 2: −log π_θ(Right | s = 1) · G_2 ≈ −(−0.357) × 8.9 = 3.177
Step 3: −log π_θ(Right | s = 5) · G_3 ≈ −(−0.357) × 10 = 3.57

Total Loss for the episode: 2.788 + 3.177 + 3.57 = 9.535
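
The same arithmetic can be checked in a short sketch, using the rounded log-probability −0.357 from above:

returns = [7.811, 8.9, 10.0]
log_prob = -0.357                          # rounded log π_θ(Right|s) from above
losses = [-log_prob * G for G in returns]
print(losses)                              # ≈ [2.789, 3.177, 3.570]
print(sum(losses))                         # ≈ 9.536 (9.535 above, due to per-step rounding)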

Gradient Update
We backpropagate this total loss and update the policy network parameters; minimizing this negated quantity is equivalent to gradient ascent on J(θ).

Code
In [4]: import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

# Define the environment


class SimpleEnvironment:
def __init__(self):
self.goal_position = 5
self.start_position = 0
self.current_position = self.start_position
self.max_steps = 10 # Limit the number of steps per episode

def reset(self):
self.current_position = self.start_position
return self.current_position

def step(self, action):


# Update the position based on the action (0 = left, 1 = right)
if action == 1: # move right
self.current_position += 1
elif action == 0: # move left
self.current_position -= 1
# Check if the goal is reached
if self.current_position == self.goal_position:
            return self.current_position, 10, True  # reward of 10 and episode ends
else:
            return self.current_position, -1, False  # reward of -1 for each other step

# Define the policy network


class PolicyNetwork(nn.Module):
def __init__(self):
super(PolicyNetwork, self).__init__()
        self.fc = nn.Linear(1, 2)  # Input: state (position); Output: logits for 2 actions (left, right)

def forward(self, x):


x = self.fc(x)
        return Categorical(logits=x)  # Categorical distribution over actions

# Initialize the policy and optimizer


policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Generate an episode
def generate_episode(env, policy):
states, actions, rewards = [], [], []
state = env.reset()

for _ in range(env.max_steps):
        state_tensor = torch.tensor([state], dtype=torch.float32)  # Convert state to a tensor
dist = policy(state_tensor) # Get action distribution from policy
action = dist.sample() # Sample an action based on the distribution

# Store state, action, and reward for this step


states.append(state_tensor)
actions.append(action)

# Take action in environment


next_state, reward, done = env.step(action.item())
rewards.append(reward)

state = next_state # Move to the next state


if done:
break

return states, actions, rewards

# Compute cumulative returns


def compute_returns(rewards, gamma=0.99):
returns = []
G = 0
for reward in reversed(rewards):
G = reward + gamma * G
returns.insert(0, G) # Insert at the beginning to reverse the order
return returns

# Update policy
def update_policy(states, actions, returns):
loss = 0
for state, action, G in zip(states, actions, returns):
dist = policy(state) # Get action distribution for the state
        log_prob = dist.log_prob(action)  # Log probability of the chosen action
loss += -log_prob * G # Multiply by return and accumulate the loss

optimizer.zero_grad()
loss.backward()
optimizer.step() # Update policy parameters

# Training the model and tracking rewards for each episode


env = SimpleEnvironment()
num_episodes = 500 # Number of episodes for training
rewards_per_episode = []


for episode in range(num_episodes):


states, actions, rewards = generate_episode(env, policy)
returns = compute_returns(rewards)
update_policy(states, actions, returns)
total_reward = sum(rewards)
rewards_per_episode.append(total_reward)

# Print progress every 100 episodes


if episode % 100 == 0:
print(f"Episode {episode}, Total Reward: {total_reward}")

# Plotting the learning curve


plt.figure(figsize=(20, 6))
plt.plot(rewards_per_episode)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Learning Curve of REINFORCE Algorithm")
plt.show()

Episode 0, Total Reward: -10


Episode 100, Total Reward: -10
Episode 200, Total Reward: 4
Episode 300, Total Reward: 6
Episode 400, Total Reward: 6
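
Because single-episode returns are noisy, it can also help to overlay a moving average on the learning curve. A small optional sketch, reusing numpy, matplotlib, and the rewards_per_episode list from the cell above (the window size of 20 is an arbitrary choice):

window = 20
smoothed = np.convolve(rewards_per_episode, np.ones(window) / window, mode="valid")

plt.figure(figsize=(20, 6))
plt.plot(rewards_per_episode, alpha=0.3, label="Per-episode reward")
plt.plot(range(window - 1, len(rewards_per_episode)), smoothed, label="20-episode moving average")
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Learning Curve of REINFORCE Algorithm (smoothed)")
plt.legend()
plt.show()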
Python Code to Track Step-by-Step Values
In [5]: import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

# Define the environment


class SimpleEnvironment:
def __init__(self):
self.goal_position = 5
self.start_position = 0
self.current_position = self.start_position
self.max_steps = 10 # Limit the number of steps per episode

def reset(self):
self.current_position = self.start_position
return self.current_position

def step(self, action):


if action == 1: # move right
self.current_position += 1
elif action == 0: # move left
self.current_position -= 1

if self.current_position == self.goal_position:
            return self.current_position, 10, True  # reward of 10 and episode ends
else:
            return self.current_position, -1, False  # reward of -1 for each other step

# Define the policy network


class PolicyNetwork(nn.Module):
def __init__(self):
super(PolicyNetwork, self).__init__()
self.fc = nn.Linear(1, 2)

def forward(self, x):


x = self.fc(x)
return Categorical(logits=x)

# Initialize the policy and optimizer


policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Generate an episode
def generate_episode(env, policy):
states, actions, rewards = [], [], []
state = env.reset()

for _ in range(env.max_steps):
state_tensor = torch.tensor([state], dtype=torch.float32)
dist = policy(state_tensor)
action = dist.sample()

states.append(state_tensor)
actions.append(action)

next_state, reward, done = env.step(action.item())


rewards.append(reward)

state = next_state
if done:
break

return states, actions, rewards

# Compute cumulative returns with formula printing


def compute_returns(rewards, gamma=0.99):
    returns = []
    G = 0
    print("\nCalculating Returns (Reward-to-go) for each step in reverse order:")
    for t in reversed(range(len(rewards))):
        G_next = G                        # G_(t+1), the return of the following step
        G = rewards[t] + gamma * G_next
        returns.insert(0, G)
        print(f"  Step {t + 1}: Reward r_t = {rewards[t]}, G_t = r_t + γ * G_(t+1)")
        print(f"    G_{t} = {rewards[t]:.3f} + {gamma} * {G_next:.3f} = {G:.3f}")
    return returns

# Update policy with logging for each step, including formulas


def update_policy_with_logging(states, actions, returns):
total_loss = 0
    for step, (state, action, G) in enumerate(zip(states, actions, returns)):
dist = policy(state)
log_prob = dist.log_prob(action)
loss = -log_prob * G
total_loss += loss

# Print details for each step, including formulas


print(f"\nStep {step + 1}:")
print(f" State = {state.item()}, Action = {action.item()}")
print(f" Reward-to-go (G) = {G:.3f}")
print(f" Formula for Log Probability: log(π_θ(a|s))")
print(f" Log Probability = log(π_θ({action.item()}|{state.item()}
print(f" Formula for Loss at this step: -log(π_θ(a|s)) * G")
print(f" Loss = -({log_prob.item():.3f}) * {G:.3f} = {loss.item()

optimizer.zero_grad()
total_loss.backward()
optimizer.step()

# Print total loss for the episode, including formula


print(f"\nTotal Loss for the episode (sum of step losses): {total_loss.i
print("Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in th

# Main training loop


env = SimpleEnvironment()
num_episodes = 5 # Set to a small number to see detailed output for each episode

for episode in range(num_episodes):


print(f"--- Episode {episode + 1} ---")
states, actions, rewards = generate_episode(env, policy)
returns = compute_returns(rewards)
update_policy_with_logging(states, actions, returns)
--- Episode 1 ---

Calculating Returns (Reward-to-go) for each step in reverse order:


Step 5: Reward r_t = 10, G_t = r_t + γ * G_(t+1)
G_4 = 10.000 + 0.99 * 0.000 = 10.000
Step 4: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_3 = -1.000 + 0.99 * 10.000 = 8.900
Step 3: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_2 = -1.000 + 0.99 * 8.900 = 7.811
Step 2: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_1 = -1.000 + 0.99 * 7.811 = 6.733
Step 1: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_0 = -1.000 + 0.99 * 6.733 = 5.666

Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.407
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.407) * 5.666 = 2.308

Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.229
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.229) * 6.733 = 1.539

Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.123
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.123) * 7.811 = 0.962

Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.065
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.065) * 8.900 = 0.576

Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.034
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.034) * 10.000 = 0.336

Total Loss for the episode (sum of step losses): 5.721


Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in the episode
--- Episode 2 ---

Calculating Returns (Reward-to-go) for each step in reverse order:


Step 9: Reward r_t = 10, G_t = r_t + γ * G_(t+1)
G_8 = 10.000 + 0.99 * 0.000 = 10.000
Step 8: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_7 = -1.000 + 0.99 * 10.000 = 8.900
Step 7: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_6 = -1.000 + 0.99 * 8.900 = 7.811
Step 6: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_5 = -1.000 + 0.99 * 7.811 = 6.733
Step 5: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_4 = -1.000 + 0.99 * 6.733 = 5.666
Step 4: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_3 = -1.000 + 0.99 * 5.666 = 4.609
Step 3: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_2 = -1.000 + 0.99 * 4.609 = 3.563
Step 2: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_1 = -1.000 + 0.99 * 3.563 = 2.527
Step 1: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_0 = -1.000 + 0.99 * 2.527 = 1.502

Step 1:
State = 0.0, Action = 0
Reward-to-go (G) = 1.502
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|0.0)) = -1.108
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-1.108) * 1.502 = 1.664

Step 2:
State = -1.0, Action = 1
Reward-to-go (G) = 2.527
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|-1.0)) = -0.686
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.686) * 2.527 = 1.733

Step 3:
State = 0.0, Action = 0
Reward-to-go (G) = 3.563
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|0.0)) = -1.108
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-1.108) * 3.563 = 3.948

Step 4:
State = -1.0, Action = 1
Reward-to-go (G) = 4.609
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|-1.0)) = -0.686
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.686) * 4.609 = 3.161

Step 5:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.401
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.401) * 5.666 = 2.271

Step 6:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.220
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.220) * 6.733 = 1.485

Step 7:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.116
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.116) * 7.811 = 0.909

Step 8:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.060
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.060) * 8.900 = 0.533

Step 9:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.030
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.030) * 10.000 = 0.304

Total Loss for the episode (sum of step losses): 16.008


Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in the episode

--- Episode 3 ---

Calculating Returns (Reward-to-go) for each step in reverse order:


Step 5: Reward r_t = 10, G_t = r_t + γ * G_(t+1)
G_4 = 10.000 + 0.99 * 0.000 = 10.000
Step 4: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_3 = -1.000 + 0.99 * 10.000 = 8.900
Step 3: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_2 = -1.000 + 0.99 * 8.900 = 7.811
Step 2: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_1 = -1.000 + 0.99 * 7.811 = 6.733
Step 1: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_0 = -1.000 + 0.99 * 6.733 = 5.666
Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.394
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.394) * 5.666 = 2.234

Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.213
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.213) * 6.733 = 1.435

Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.110
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.110) * 7.811 = 0.863

Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.056
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.056) * 8.900 = 0.497

Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.028
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.028) * 10.000 = 0.278

Total Loss for the episode (sum of step losses): 5.307


Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in the episode

--- Episode 4 ---

Calculating Returns (Reward-to-go) for each step in reverse order:


Step 5: Reward r_t = 10, G_t = r_t + γ * G_(t+1)
G_4 = 10.000 + 0.99 * 0.000 = 10.000
Step 4: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_3 = -1.000 + 0.99 * 10.000 = 8.900
Step 3: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_2 = -1.000 + 0.99 * 8.900 = 7.811
Step 2: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_1 = -1.000 + 0.99 * 7.811 = 6.733
Step 1: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_0 = -1.000 + 0.99 * 6.733 = 5.666
Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.388
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.388) * 5.666 = 2.197

Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.206
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.206) * 6.733 = 1.386

Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.105
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.105) * 7.811 = 0.817

Step 4:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.052
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.052) * 8.900 = 0.461

Step 5:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.025
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.025) * 10.000 = 0.253

Total Loss for the episode (sum of step losses): 5.115


Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in the episode

--- Episode 5 ---

Calculating Returns (Reward-to-go) for each step in reverse order:


Step 7: Reward r_t = 10, G_t = r_t + γ * G_(t+1)
G_6 = 10.000 + 0.99 * 0.000 = 10.000
Step 6: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_5 = -1.000 + 0.99 * 10.000 = 8.900
Step 5: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_4 = -1.000 + 0.99 * 8.900 = 7.811
Step 4: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_3 = -1.000 + 0.99 * 7.811 = 6.733
Step 3: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_2 = -1.000 + 0.99 * 6.733 = 5.666
Step 2: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_1 = -1.000 + 0.99 * 5.666 = 4.609
Step 1: Reward r_t = -1, G_t = r_t + γ * G_(t+1)
G_0 = -1.000 + 0.99 * 4.609 = 3.563

Step 1:
State = 0.0, Action = 1
Reward-to-go (G) = 3.563
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|0.0)) = -0.381
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.381) * 3.563 = 1.359

Step 2:
State = 1.0, Action = 1
Reward-to-go (G) = 4.609
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|1.0)) = -0.199
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.199) * 4.609 = 0.916

Step 3:
State = 2.0, Action = 1
Reward-to-go (G) = 5.666
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.099
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.099) * 5.666 = 0.561

Step 4:
State = 3.0, Action = 0
Reward-to-go (G) = 6.733
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(0|3.0)) = -3.059
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-3.059) * 6.733 = 20.595

Step 5:
State = 2.0, Action = 1
Reward-to-go (G) = 7.811
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|2.0)) = -0.099
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.099) * 7.811 = 0.773

Step 6:
State = 3.0, Action = 1
Reward-to-go (G) = 8.900
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|3.0)) = -0.048
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.048) * 8.900 = 0.428

Step 7:
State = 4.0, Action = 1
Reward-to-go (G) = 10.000
Formula for Log Probability: log(π_θ(a|s))
Log Probability = log(π_θ(1|4.0)) = -0.023
Formula for Loss at this step: -log(π_θ(a|s)) * G
Loss = -(-0.023) * 10.000 = 0.230

Total Loss for the episode (sum of step losses): 24.862


Formula for Total Loss: Σ[-log(π_θ(a|s)) * G] for each step in the episode
