
Reinforcement Learning Toolbox™ Release Notes

How to Contact MathWorks

Latest news: www.mathworks.com

Sales and services: www.mathworks.com/sales_and_services

User community: www.mathworks.com/matlabcentral

Technical support: www.mathworks.com/support/contact_us

Phone: 508-647-7000

The MathWorks, Inc.


1 Apple Hill Drive
Natick, MA 01760-2098
Reinforcement Learning Toolbox™ Release Notes
© COPYRIGHT 2019–2024 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used or copied
only under the terms of the license agreement. No part of this manual may be photocopied or reproduced in any form
without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation by, for, or through
the federal government of the United States. By accepting delivery of the Program or Documentation, the government
hereby agrees that this software or documentation qualifies as commercial computer software or commercial computer
software documentation as such terms are used or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014.
Accordingly, the terms and conditions of this Agreement and only those rights specified in this Agreement, shall pertain
to and govern the use, modification, reproduction, release, performance, display, and disclosure of the Program and
Documentation by the federal government (or other entity acquiring for or through the federal government) and shall
supersede any conflicting contractual terms or conditions. If this License fails to meet the government's needs or is
inconsistent in any respect with federal procurement law, the government agrees to return the Program and
Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be
trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see www.mathworks.com/patents for
more information.
Contents

R2024b

Soft Actor-Critic Agent: Support for discrete action space environments . . . 1-2
Soft Actor-Critic Agent: Support for hybrid action space environments . . . 1-2
Reinforcement Learning Designer App: Support for evaluating agents during training . . . 1-3
Reproducibility: General guidelines to reproduce example results . . . 1-3
Functionality being removed or changed . . . 1-3
    Training Results: inspectTrainingResult is no longer recommended . . . 1-3

R2024a

New Trainers: Improve training results with more efficient training algorithms . . . 2-2
Custom Training Loops: Use automatic differentiation with dlarray objects . . . 2-3
    New State and Learnables properties . . . 2-3
    Approximator evaluation functions . . . 2-3
    New name-value arguments to sample and allExperiences . . . 2-3
Evolutionary Reinforcement Learning: Train agent using parallel computing . . . 2-4
Approximator Objects: Use normalizers to normalize inputs of actors and critics . . . 2-4
Training and Simulation: New memory management options . . . 2-5
Training Results: New objects for evolutionary strategy and multiagent training results . . . 2-5
Reinforcement Learning Designer App and Visualization Tools: New features . . . 2-6
Data Logging: Additional learning data now available for logging . . . 2-6
Functionality being removed or changed . . . 2-6
    Train with Evolution Strategy: Episode replaced with Generation in property names and values . . . 2-6
    Training: New training algorithms lead to improved training results . . . 2-7
    Custom Training Loops: accelerate and gradient are no longer recommended . . . 2-7
    rlSACAgentOptions: the CriticUpdateFrequency and NumGradientStepsPerUpdate properties are no longer effective . . . 2-9
    rlTD3AgentOptions and rlSACAgentOptions: the PolicyUpdateFrequency property has been redefined . . . 2-9
    rlPPOAgentOptions and rlTRPOAgentOptions: the NumEpoch property changed behavior . . . 2-9
    Training and Simulation Results: SimulationInfo is now a SimulationStorage object . . . 2-9
    Evolution Strategy Training Results: SimulationInfo is now an EvolutionStrategySimulationStorage object . . . 2-10
    Reinforcement Learning Episode Manager renamed . . . 2-11
    Reinforcement Learning Designer now always displays visualization . . . 2-11
    Approximators: Object functions now enforce correct output dimensions . . . 2-11

R2023b

Multi-Agent Reinforcement Learning: Train multiple agents in a MATLAB environment . . . 3-2
Evolutionary Reinforcement Learning: Increase computational efficiency of training by using evolutionary strategies . . . 3-2
Agent Evaluation: Evaluate agents during training . . . 3-2
Training: Use custom criteria to save agent or stop training . . . 3-3
Functionality being removed or changed . . . 3-3
    rlFunctionEnv Object: The LoggedSignals property is replaced by the Info property . . . 3-3

R2023a

Learning from Data: Train agents offline using previously recorded data . . . 4-2
Data Logging: Visualize logged data in Reinforcement Learning Data Viewer . . . 4-2
Training: Stop and resume training in the Reinforcement Learning Designer App . . . 4-2
Hindsight Replay Memory: Improve sample efficiency for goal-conditioned tasks with sparse rewards . . . 4-2
Replay Memory: Validate experiences before adding to replay memory buffer . . . 4-3

R2022b

Deployment: Policy Block . . . 5-2
Training: Enhanced logging capability for built-in agents . . . 5-2
Training: Prioritized experience replay for off-policy agents . . . 5-2

R2022a

New Agent Architecture: Create memory-efficient agents with more modular and scalable actors and critics . . . 6-2
    New Approximator Objects . . . 6-2
    New Policy Objects for Custom Agents and Custom Training Loops . . . 6-2
Neural Network Environment: Use an approximated environment model based on a deep neural network . . . 6-3
Model Based Policy Optimization Agent: Use a model of the environment to improve sample efficiency and exploration . . . 6-4
Multi-Agent Reinforcement Learning: Train multiple agents in a centralized manner for more efficient exploration and learning . . . 6-4
Training: Stop and resume agent training . . . 6-4
Event-Based Simulation: Use RL Agent block inside conditionally executed subsystems . . . 6-4
RL Agent Block: Learn from last action applied to the environment . . . 6-4
Reinforcement Learning Designer App: Support for SAC and TRPO agents . . . 6-5
New Examples: Train agents for robotics and automated parking applications . . . 6-5
Functionality being removed or changed . . . 6-5
    Representation objects are not recommended . . . 6-5
    train now returns an object instead of a structure . . . 6-9
    Training Parallelization Options: DataToSendFromWorkers and StepsUntilDataIsSent properties are no longer active . . . 6-10
    Code generated by generatePolicyFunction now uses policy objects . . . 6-10

R2021b

Rewards Generation: Automatically generate reward functions from controller specifications . . . 7-2
Episode Manager: Improved layout management for single and multiple agent training . . . 7-2
Neural Network Representations: Improved internal handling of dlnetwork objects . . . 7-2
Trust Region Policy Optimization Agent: Prevent significant performance drops by restricting updated policy to trust region near current policy . . . 7-2
PPO Agents: Improve agent performance by normalizing advantage function . . . 7-3
New Example: Create and train custom agent using model-based reinforcement learning . . . 7-3
Functionality being removed or changed . . . 7-3
    Built-in agents now use dlnetwork objects . . . 7-3

R2021a

Reinforcement Learning Designer App: Design, train, and simulate agents using a visual interactive workflow . . . 8-2
Recurrent Neural Networks: Train agents with recurrent deep neural network policies and value functions . . . 8-2
Guided Policy Learning: Perform imitation learning in Simulink by learning policies based on external actions . . . 8-2
inspectTrainingResult Function: Plot training information from a previous training session . . . 8-3
Deterministic Exploitation: Create PG, AC, PPO, and SAC agents that use deterministic actions during simulation and in generated policy functions . . . 8-3
New Examples: Train agent with constrained actions and use DQN agent for optimal scheduling . . . 8-3
Functionality being removed or changed . . . 8-3
    Properties defining noise probability distribution in the GaussianActionNoise object have changed . . . 8-3
    Property names defining noise probability distribution in the OrnsteinUhlenbeckActionNoise object have changed . . . 8-4

R2020b

Multi-Agent Reinforcement Learning: Train multiple agents in a Simulink environment . . . 9-2
Soft Actor-Critic Agent: Train sample-efficient policies for environments with continuous-action spaces using increased exploration . . . 9-2
Default Agents: Avoid manually formulating policies by creating agents with default neural network structures . . . 9-2
getModel and setModel Functions: Access computational model used by actor and critic representations . . . 9-3
New Examples: Create a custom agent, use TD3 to tune a PI controller, and train agents for automatic parking and motor control . . . 9-3
Functionality being removed or changed . . . 9-3
    Default value of NumStepsToLookAhead option for AC agents is now 32 . . . 9-3

R2020a

New Representation Objects: Create actors and critics with improved ease of use and flexibility . . . 10-2
Continuous Action Spaces: Train AC, PG, and PPO agents in environments with continuous action spaces . . . 10-2
Recurrent Neural Networks: Train DQN and PPO agents with recurrent deep neural network policies and value functions . . . 10-2
TD3 Agent: Create twin-delayed deep deterministic policy gradient agents . . . 10-2
Softplus Layer: Create deep neural network layer using the softplus activation function . . . 10-3
Parallel Processing: Improved memory usage and performance . . . 10-3
Deep Network Designer: Scaling, quadratic, and softplus layers now supported . . . 10-3
New Examples: Train reinforcement learning agents for robotics and imitation learning applications . . . 10-3
Functionality being removed or changed . . . 10-3
    rlRepresentation is not recommended . . . 10-3
    Target update method settings for DQN agents have changed . . . 10-5
    Target update method settings for DDPG agents have changed . . . 10-6
    getLearnableParameterValues is now getLearnableParameters . . . 10-7
    setLearnableParameterValues is now setLearnableParameters . . . 10-7

R2019b

Parallel Agent Simulation: Verify trained policies by running multiple agent simulations in parallel . . . 11-2
PPO Agent: Train policies using proximal policy optimization algorithm for improved training stability . . . 11-2
New Examples: Train reinforcement learning policies for applications such as robotics, automated driving, and control design . . . 11-2

R2024b

Version: 24.2

New Features

Bug Fixes

Compatibility Considerations

Soft Actor-Critic Agent: Support for discrete action space environments

Previously, you could only use a soft actor-critic (SAC) agent within a continuous action space environment. Now, you can also use a SAC agent with an environment that has a discrete action space.

To create a discrete action space SAC agent, use rlSACAgent.

To create a default discrete action space SAC agent (that is an agent with default neural networks for
actor and critics), pass to rlSACAgent the observation and action specifications from the
environment. Here the action specification is an rlFiniteSetSpec object. You can also pass an
rlAgentInitializationOptions object or an rlSACAgentOptions object as additional
arguments. For example, agent = rlSACAgent(obsInfo,actInfo).

Alternatively, you can separately create an rlDiscreteCategoricalActor object and an array of one or two rlVectorQValueFunction objects and pass them to rlSACAgent, optionally with an rlSACAgentOptions object as a third argument. For example, agent = rlSACAgent(actor,critic).
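
As a minimal sketch of the default-agent workflow (the observation and action dimensions are hypothetical):

% Hypothetical specifications: a 4-element continuous observation and
% a discrete action channel with three possible actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
agent = rlSACAgent(obsInfo,actInfo,initOpts);   % default discrete SAC agent
act = getAction(agent,{rand(4,1)});             % sample an action from the policy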

For an example on how to use a discrete SAC agent, see “Train Discrete Soft Actor Critic Agent for
Lander Vehicle”.

Soft Actor-Critic Agent: Support for hybrid action space environments


You can now create environments with a hybrid (that is partly continuous and partly discrete) action
space, and use a hybrid SAC agent within these environments.

The action specification of a hybrid action space environment is a vector composed of one rlFiniteSetSpec object (defining the discrete part of the action space) followed by one rlNumericSpec object (defining the continuous part of the action space). Therefore, a hybrid environment must have one discrete action channel and one continuous action channel.

To create a default hybrid action space SAC agent, pass to rlSACAgent the observation and action
specifications from the environment. Here, the action specification must be a vector containing one
rlFiniteSetSpec object followed by one rlNumericSpec object. You can also pass an
rlAgentInitializationOptions or an rlSACAgentOptions object as additional arguments. For
example, agent = rlSACAgent(obsInfo,actInfo).
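
A minimal sketch of the hybrid action specification and default-agent creation described above (the channel sizes are hypothetical):

% Hypothetical specifications: a 6-element continuous observation, one
% discrete action channel, and one 2-element continuous action channel.
obsInfo = rlNumericSpec([6 1]);
actInfo = [rlFiniteSetSpec([1 2 3]), ...
           rlNumericSpec([2 1],LowerLimit=-1,UpperLimit=1)];
agent = rlSACAgent(obsInfo,actInfo);            % default hybrid SAC agent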

Alternatively, you can separately create an rlHybridStochasticActor and an array of one or two
rlVectorQValueFunction objects and pass them to rlSACAgent, optionally with an
rlSACAgentOptions object as a third argument. For example, agent = rlSACAgent(actor,
critic).

For use within custom training loops, you can also create an rlHybridStochasticActorPolicy
object from an rlHybridStochasticActor.

For an example on how to use a hybrid SAC agent, see “Train Hybrid SAC Agent for Path Following
Control”.

Reinforcement Learning Designer App: Support for evaluating agents
during training
You can now evaluate agents during training from Reinforcement Learning Designer. To do so, from
the Train tab, click on the Evaluate Agent button to enable agent evaluation and to open the Agent
Evaluation Options dialog box. You can then select the appropriate agent evaluation options from
the Agent Evaluation Options dialog box.

For more information on how to set agent evaluation in Reinforcement Learning Designer, see
“Specify Training Options in Reinforcement Learning Designer”.

Reproducibility: General guidelines to reproduce example results


Reproducing example results is sometimes challenging due to both the high sensitivity to initial conditions and the existence of several sources of randomness in reinforcement learning environments and agents. To alleviate this problem, the examples involving agent training or simulation now systematically seed the random number generator at the beginning as well as in multiple key locations, briefly explaining the rationale behind the choice.

Additionally, generic reproducibility guidelines are also summarized in the last section of the “Train
Reinforcement Learning Agents” topic.
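
For instance, a typical pattern at the top of a training script is the following minimal sketch (the option values are illustrative):

rng(0,"twister");                              % fix the random seed before creating
                                               % the environment and the agent
% ... create environment and agent here ...
trainOpts = rlTrainingOptions(MaxEpisodes=500);
% trainResults = train(agent,env,trainOpts);   % training starts from a known seed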

Functionality being removed or changed


Training Results: inspectTrainingResult is no longer recommended
Behavior change

inspectTrainingResult is no longer recommended. Use show instead.


R2024a

Version: 24.1

New Features

Bug Fixes

Compatibility Considerations

New Trainers: Improve training results with more efficient training algorithms
Reinforcement Learning Toolbox™ software now uses new training algorithms for DQN, DDPG, TD3,
SAC, PPO, and TRPO. These new algorithms tend to improve training outcomes by sampling from a
more diverse data set, resulting in better generalization.

The new training algorithms come with several new options.

• LearningFrequency — This is the minimum number of environment interactions between learning iterations. In other words, it defines how many new data samples need to be generated before learning. For off-policy agents, LearningFrequency defaults to -1, which means that learning occurs after at least one episode is finished. For on-policy agents (PPO and TRPO), LearningFrequency should be set as an integer multiple of MiniBatchSize. The default of -1 indicates that 10*MiniBatchSize samples are collected before learning. Set this number lower if you want to learn more frequently (for example, 2*MiniBatchSize).
• MaxMiniBatchPerEpoch — This is the maximum number of mini-batches used for learning during a single epoch. For on-policy agents, the actual number of mini-batches used depends on the length of the aggregated trajectories used for learning; it is lower-bounded by LearningFrequency/MiniBatchSize and upper-bounded by MaxMiniBatchPerEpoch. For off-policy agents, the actual number of mini-batches used depends on the length of the replay buffer, and it is upper-bounded by MaxMiniBatchPerEpoch. The MaxMiniBatchPerEpoch option upper-bounds the number of gradient steps when learning (max gradient steps = MaxMiniBatchPerEpoch*NumEpoch). For on-policy agents, you typically want to set this value to an arbitrarily high number to ensure that all data is used for training. For off-policy agents, higher values mean more time spent learning versus collecting new data, so this option becomes a sample-efficiency trade-off parameter.
• NumEpoch — This is the number of times an agent learns over a data set. For on-policy agents, this value defines the number of passes over a data set with minimum length of LearningFrequency, and the default value is 3. For off-policy agents, this value defines the number of passes over the data in the replay buffer, and the default value is 1. Note that for SAC agents, this parameter now replaces NumGradientStepsPerUpdate.
• NumWarmStartSteps — This is the minimum number of samples that need to be generated before learning starts for off-policy agents. Use this option to ensure learning over a more diverse data set at the beginning of training. The default, and minimum, value is equal to MiniBatchSize.
• PolicyUpdateFrequency — This option defines how often the actor in DDPG, TD3, and SAC agents is updated with respect to each critic update. The default is 1 for DDPG and SAC and 2 for TD3. Updating the actor less frequently than the critic can improve convergence at the cost of longer training times.

For more information, see the option objects of the affected agents. For example,
rlSACAgentOptions.
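
As a minimal sketch (the values are illustrative, not recommendations), the new options can be set on an agent options object as follows:

opt = rlSACAgentOptions( ...
    MiniBatchSize=256, ...
    LearningFrequency=-1, ...        % learn after at least one finished episode
    MaxMiniBatchPerEpoch=100, ...    % cap on mini-batches per epoch
    NumEpoch=1, ...                  % passes over the replay buffer
    NumWarmStartSteps=256, ...       % samples collected before learning starts
    PolicyUpdateFrequency=1);        % actor updates once per critic update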

Compatibility Considerations
Since agents learn differently with the new training algorithms, training results are generally
different than the ones obtained using the training algorithms of previous releases. For the same
reason, data associated with learning (such as actor or critic losses) may be logged at different

iterations than before. For more information, see “Training: New training algorithms lead to
improved training results” on page 2-7.

Custom Training Loops: Use automatic differentiation with dlarray objects
Function approximator objects such as actors and critics now support automatic differentiation with
dlarray objects, resulting in improved performance and customizability.

New State and Learnables properties

You can now easily access the state and learnable parameters of function approximator objects using
the new State and Learnables properties. For example, to display the state of the critic myQVF, at
the MATLAB® command line, type myQVF.State. Both properties store values as cell arrays of
dlarray objects. For dlnetwork-based approximators, the State and Learnables properties
correspond to the value column of the State and Learnables tables of the dlnetwork object.
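
For instance (a minimal sketch, assuming myQVF is an existing critic object):

params = myQVF.Learnables;   % cell array of dlarray objects holding the learnable parameters
st = myQVF.State;            % cell array of dlarray objects (typically empty for non-recurrent approximators)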

Approximator evaluation functions

The functions evaluate, getAction, getValue, getMaxQValue and predict now return (cell
arrays of) dlarray objects when their inputs are (cell arrays of) dlarray objects. This enables these
functions to be used with automatic differentiation.

Specifically, all these functions can now be used within custom loss functions directly, together with
dlfeval, dlgradient, and dlaccelerate. This makes writing custom loss functions easier.

For these functions, you can now use the UseForward name-value argument, which allows you to explicitly call a forward pass when computing gradients. Specifying UseForward=true enables layers such as batch normalization and dropout to change their behavior for training.

New name-value arguments to sample and allExperiences

You can now use the new name-value arguments ReturnDlarray and ReturnGpuArray with the
functions sample and allExperiences to return data as dlarray or gpuArray (Parallel
Computing Toolbox) directly from a replay object. For example:

replay = rlReplayMemory(obsInfo,actInfo);
% After appending experiences to the buffer, sample a mini-batch of 32 experiences:
[mb,mask,idx,w] = sample(replay,32,ReturnDlarray=true);
exp = allExperiences(replay,ReturnGpuArray=true);

Compatibility Considerations
accelerate and gradient are no longer recommended.

Instead of using gradient on a function approximator object, write an appropriate loss function that
takes as arguments both the approximation object and its input. In the loss function you typically use
evaluate to calculate the output and dlgradient to calculate the gradient. Then call dlfeval on
your loss function, supplying both the approximator object and its inputs as arguments.

Similarly, instead of using accelerate to accelerate the gradient computation of a function approximator object, use dlaccelerate on your loss function. Then use dlfeval on the AcceleratedFunction object returned by dlaccelerate.


For more information on how to update your code, see “Custom Training Loops: accelerate and
gradient are no longer recommended” on page 2-7. For an example, see Train Reinforcement
Learning Policy Using Custom Training Loop and Custom Training Loop with Simulink Action Noise.

The functions that handle approximator objects now enforce all the output dimensions, including the
batch dimension. When indexing the output of these functions, include the batch and sequence
dimension after the specification dimension. For more information, see “Approximators: Object
functions now enforce correct output dimensions” on page 2-11.

Evolutionary Reinforcement Learning: Train agent using parallel computing
You can now leverage parallel computation when training an agent using evolutionary strategies. To
do so, set the new UseParallel property of the rlEvolutionStrategyTrainingOptions object
that you use for training to true.
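
A minimal sketch (illustrative values):

esOpts = rlEvolutionStrategyTrainingOptions( ...
    MaxGenerations=100, ...
    UseParallel=true);               % requires Parallel Computing Toolbox
% results = trainWithEvolutionStrategy(agent,env,esOpts);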

Approximator Objects: Use normalizers to normalize inputs of actors and critics
You can now normalize the inputs of function approximator objects such as actors and critics.

If upper and lower limits are defined for some channels in your observation (and, if needed, action)
specification object, then you can apply normalization, for those channels, at agent creation by
specifying the Normalization initialization option. For example:

opt = rlAgentInitializationOptions(Normalization="rescale-symmetric")
agent = rlDDPGAgent(obsInfo, actInfo, opt);

If no upper and lower limits are defined in the specification object for the channels you want to
normalize, then you can apply normalization by creating a normalization object for each channel that
you want to normalize. For example:

obsNrm1 = rlNormalizer(obsInfo(1), ...
    Normalization="zscore", ...
    Mean=1, ...
    StandardDeviation=2);

obsNrm2 = rlNormalizer(obsInfo(2), ...
    Normalization="zscore", ...
    Mean=2, ...
    StandardDeviation=3);

You then use setNormalizer to apply the normalization objects to the input channels of your
function approximator object. For example:

actor = getActor(agent);
actor = setNormalizer(actor, [obsNrm1, obsNrm2]);
setActor(agent, actor);

You can also set normalizers only for specific input channels. For more information, see
rlNormalizer, getNormalizer, setNormalizer and rlAgentInitializationOptions.

Training and Simulation: New memory management options
You can now improve memory management when training or simulating an agent. Three new options
allow you to select the storage type (for example disk instead of memory) for data generated by a
Simulink® environment. You can use these options to prevent out-of-memory issues during training or
simulation. The new options are the following.

• SimulationStorageType — This option is the type of storage used for environment data
generated during training or simulation. The default value is "memory", indicating that data is
stored in memory. To store environment data to disk instead, set this option to "file". When this
option is set to "none" environment data is not stored.
• SaveSimulationDirectory — This option specifies the directory to save environment data
when SimulationStorageType is set to "file". The default value is "savedSims".
• SaveFileVersion — This option specifies the MAT-file version for environment data files. The
default is "-v7". The other possible options are "-v7.3" and "-v6".

For more information, see rlTrainingOptions, rlMultiAgentTrainingOptions, rlEvolutionStrategyTrainingOptions, and rlSimulationOptions.
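
A minimal sketch (illustrative values) showing how these options can be set:

trainOpts = rlTrainingOptions( ...
    MaxEpisodes=1000, ...
    SimulationStorageType="file", ...         % write episode data to disk
    SaveSimulationDirectory="savedSims", ...  % target folder for the data files
    SaveFileVersion="-v7.3");                 % MAT-file version for large data sets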

Compatibility Considerations
The SimulationInfo property of an rlTrainingResult or rlMultiAgentTrainingResult
object (both returned by train) is now a SimulationStorage object (unless
SimulationStorageType is set to "none").

The SimulationInfo field of the experience structure (returned by sim) is now also a
SimulationStorage object (unless SimulationStorageType is set to "none").

For more information, see “Training and Simulation Results: SimulationInfo is now a
SimulationStorage object” on page 2-9.

Training Results: New objects for evolutionary strategy and multiagent training results
The evolutionary strategy training function trainWithEvolutionStrategy now returns a new
training result object, rlEvolutionStrategyTrainingResult.

When you use it for multiagent training, train now returns a new result object,
rlMultiAgentTrainingResult.

You can use show to visualize an rlTrainingResult, rlMultiAgentTrainingResult, or rlEvolutionStrategyTrainingResult object in a new Reinforcement Learning Training Monitor window.

Compatibility Considerations
The properties of the training result object that have Episode as a part of their name have been
replaced by respective properties having Generation as part of their name instead of Episode.

For more information, see “Train with Evolution Strategy: Episode replaced with Generation in
property names and values” on page 2-6.


The SimulationInfo property of an rlEvolutionStrategyTrainingResult object (returned by trainWithEvolutionStrategy) is now an EvolutionStrategySimulationStorage object (unless SimulationStorageType is set to "none").

The EvolutionStrategySimulationStorage object has the read-only StorageType and NumSimulations properties, indicating the storage type and the total number of simulations, respectively. This object also contains environment information collected during simulation, which you can access by indexing into the object using the specific number of generation, citizen, and evaluation per individual.

For more information, see “Evolution Strategy Training Results: SimulationInfo is now an
EvolutionStrategySimulationStorage object” on page 2-10. For more information on the
SimulationInfo returned by train for multiagent environments, see “Training and Simulation
Results: SimulationInfo is now a SimulationStorage object” on page 2-9.

Reinforcement Learning Designer App and Visualization Tools: New features
New features are now available with the Reinforcement Learning Designer app and other
visualization tools. Specifically:

• A new plot type, Multiple Bars, is available in the Reinforcement Learning Data Viewer.
• You can now set the StopTrainingCriteria training option to "None" from the Reinforcement
Learning Designer.
• You can now export the training plot to a MATLAB figure using the Export Plot button in the
Reinforcement Learning Training Monitor.

Data Logging: Additional learning data now available for logging


You can now log additional learning data using the AgentLearnFinishedFcn callback in a
rlDataLogger object. Specifically, the structure returned by the function specified in
AgentLearnFinishedFcn now contains the following additional fields.

For DQN, DDPG, TD3, and SAC agents, the additional fields are: TDTarget, TDError, SampleIndex,
MaskIndex, ActorGradientStepCount, and CriticGradientStepCount.

For PPO agents, the additional fields are: TDTarget, TDError, Advantage, PolicyRatio,
AdvantageLoss, EntropyLoss, ActorGradientStepCount, and CriticGradientStepCount.

For TRPO agents, the additional fields are: TDTarget, TDError, Advantage,
ActorGradientStepCount, CriticGradientStepCount.

For Q, SARSA, AC, and PG agents, the additional fields are: ActorGradientStepCount,
CriticGradientStepCount.
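
As a minimal sketch (the anonymous callback shown is hypothetical and logs only one of the new fields):

logger = rlDataLogger();
logger.AgentLearnFinishedFcn = @(data) struct("TDError",data.TDError);
% trainResults = train(agent,env,trainOpts,Logger=logger);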

Functionality being removed or changed


Train with Evolution Strategy: Episode replaced with Generation in property names and
values
Behavior change

trainWithEvolutionStrategy now returns a training result object with the following properties.

GenerationIndex: [500×1 double]
GenerationReward: [500×1 double]
AverageReward: [500×1 double]
Q0: [500×1 double]
SimulationInfo: [1×1 rl.storage.EvolutionStrategySimulationStorage]
TrainingOptions: [1×1 rl.option.rlEvolutionStrategyTrainingOptions]

The properties that had Episode as a part of their name have been replaced by respective properties
having Generation as part of their name instead of Episode. For example, EpisodeReward is
replaced by GenerationReward. Using older names is not recommended.

rlEvolutionStrategyTrainingOptions now returns an options object with the following properties.
PopulationSize: 25
PercentageEliteSize: 50
EvaluationsPerIndividual: 1
TrainEpochs: 10
PopulationUpdateOptions: [1×1 rl.option.GaussianUpdateOptions]
ReturnedPolicy: "AveragedPolicy"
MaxGenerations: 500
MaxStepsPerEpisode: 500
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 500
SaveAgentCriteria: "none"
SaveAgentValue: 500
SaveAgentDirectory: "savedAgents"
Verbose: 0
Plots: "training-progress"
UseParallel: 0
ParallelizationOptions: [1×1 rl.option.ParallelSimulation]
StopOnError: "on"
SimulationStorageType: "memory"
SaveSimulationDirectory: "savedSims"
SaveFileVersion: "-v7"

Similarly, the values of the properties of the rlEvolutionStrategyTrainingOptions object that had Episode as a part of their name have been replaced by respective values having Generation as part of their name instead of Episode (except MaxStepsPerEpisode). For example, the value "EpisodeReward" of the StopTrainingCriteria property is replaced by "GenerationReward". Using older values is not recommended.

Training: New training algorithms lead to improved training results
Behavior change

Since agents learn differently with the new training algorithms, training results are generally
different than the ones obtained using the training algorithms of previous releases. For the same
reason, data associated with learning (such as actor or critic losses) may be logged at different
iterations than before.

Custom Training Loops: accelerate and gradient are no longer recommended

accelerate and gradient are no longer recommended.

Instead of using gradient on a function approximator object, write an appropriate loss function that
takes as arguments both the approximation object and its input. In the loss function you typically use
evaluate to calculate the output and dlgradient to calculate the gradient. Then call dlfeval,
supplying both the approximator object and its inputs as arguments.

This workflow is shown in the following examples.

Gradient of the output with respect to the input.

gradient (not recommended):

g = gradient(actor,"output-input",u);
g{1}

dlfeval and dlgradient (recommended):

g = dlfeval(@myOIGFcn,actor,dlarray(u));
g{1}

where:

function g = myOIGFcn(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1});
    g = dlgradient(loss,u);
end

Gradient of the output with respect to the learnable parameters.

gradient (not recommended):

g = gradient(actor,"output-parameters",u);
g{1}

dlfeval and dlgradient (recommended):

g = dlfeval(@myOPGFcn,actor,dlarray(u));
g{1}

where:

function g = myOPGFcn(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1});
    g = dlgradient(loss,actor.Learnables);
end

Gradient of a custom loss function.

gradient (not recommended):

g = gradient(actor,@customLoss23b,u);
g{1}

where:

function loss = customLoss23b(y,varargin)
    loss = sum(y{1}.^2);
end

dlfeval and dlgradient (recommended):

g = dlfeval(@customLoss24a,actor,dlarray(u));
g{1}

where:

function g = customLoss24a(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1}.^2);
    g = dlgradient(loss,actor.Learnables);
end

Similarly, instead of using accelerate to accelerate the gradient computation of a function approximator object, use dlaccelerate on your loss function. Then use dlfeval on the AcceleratedFunction object returned by dlaccelerate.

This workflow is shown in the following example.

accelerate (not recommended):

actor = accelerate(actor,true);
g = gradient(actor,@customLoss,u);
g{1}

where:

function loss = customLoss(y,varargin)
    loss = sum(y{1}.^2);
end

dlaccelerate (recommended):

f = dlaccelerate(@customLoss);
g = dlfeval(f,actor,dlarray(u));
g{1}

where:

function g = customLoss(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1}.^2);
    g = dlgradient(loss,actor.Learnables);
end

For more information on using dlarray objects for custom deep learning training loops, see
dlfeval, AcceleratedFunction, dlaccelerate.

rlSACAgentOptions: the CriticUpdateFrequency and NumGradientStepsPerUpdate
properties are no longer effective
Behavior change

The CriticUpdateFrequency and NumGradientStepsPerUpdate properties of the rlSACAgentOptions object are no longer effective. To specify the minimum number of environment interactions after which the critic is updated, use the LearningFrequency property. To specify the number of passes over the data in the replay buffer, use the NumEpoch property.

rlTD3AgentOptions and rlSACAgentOptions: the PolicyUpdateFrequency property has been redefined
Behavior change

The PolicyUpdateFrequency property of the rlTD3AgentOptions and rlSACAgentOptions objects has been redefined. Previously, it was defined as the number of steps between policy updates. Now, it is defined as the period of the policy (actor) update with respect to the critic update. For example, while a PolicyUpdateFrequency of 3 previously meant that the actor was updated every three steps, it now means that it is updated every three critic updates.

rlPPOAgentOptions and rlTRPOAgentOptions: the NumEpoch property changed behavior
Behavior change

The NumEpoch property of the rlPPOAgentOptions and rlTRPOAgentOptions objects changed behavior. Previously, this property defined the number of learning passes over a data set with minimum length of ExperienceHorizon. Now, it defines the number of passes used for learning over a data set with minimum length of LearningFrequency.

Training and Simulation Results: SimulationInfo is now a SimulationStorage object
Behavior change

The SimulationInfo property of an rlTrainingResult or rlMultiAgentTrainingResult object (both returned by train) is now a SimulationStorage object (unless SimulationStorageType is set to "none").

The SimulationInfo field of the experience structure (returned by sim) is now also a
SimulationStorage object (unless SimulationStorageType is set to "none").

Note that if SimulationStorageType is set to "none" SimulationInfo is empty.

The SimulationStorage object has the read-only StorageType and NumSimulations properties,
indicating the storage type and the total number of episodes, respectively. This object also contains
environment information collected during simulation, which you can access by indexing into the
object using the episode number.

For example, if res is an rlTrainingResult object returned by train, you can access the
simulation information related to the second episode as:
mySimInfo2 = res.SimulationInfo(2);

• For MATLAB environments, mySimInfo2 is a structure containing the field SimulationError. This structure contains any errors that occurred during simulation for the second episode.
• For Simulink environments, mySimInfo2 is a Simulink.SimulationOutput object containing
logged data from the Simulink model. Properties of this object include any signals and states that
the model is configured to log, simulation metadata, and any errors that occurred during the
second episode.


Consider a Simulink environment that logs its states as xout over 10 episodes.

Previously, if res was the rlTrainingResult object returned by train, you could pack the
environment simulation information for all episodes in a single array as:
[res.SimulationInfo.xout]

Now, you must explicitly address all the elements:


[res.SimulationInfo(1:10).xout]

Similarly, previously you could access the simulation information related to the first episode as:
res.SimulationInfo.xout

Now, you must explicitly address the first element:


res.SimulationInfo(1).xout

For more information, see train and sim.

Evolution Strategy Training Results: SimulationInfo is now an EvolutionStrategySimulationStorage object
Behavior change

The SimulationInfo property of an rlEvolutionStrategyTrainingResult object (returned by trainWithEvolutionStrategy) is now an EvolutionStrategySimulationStorage object (unless SimulationStorageType is set to "none").

The EvolutionStrategySimulationStorage object has the read-only StorageType and NumSimulations properties, indicating the storage type and the total number of simulations, respectively. This object also contains environment information collected during simulation, which you can access by indexing into the object using the specific number of generation, citizen, and evaluation per individual.

For example, if res is an rlEvolutionStrategyTrainingResult object returned by trainWithEvolutionStrategy, you can access the simulation information related to the fourth run of the third citizen in the second generation as:

mySimInfo234 = res.SimulationInfo(2,3,4)

• For MATLAB environments, mySimInfo234 is a structure containing the field SimulationError. This structure contains any errors that occurred during simulation for the fourth run of the third citizen in the second generation.
• For Simulink environments, mySimInfo234 is a Simulink.SimulationOutput object
containing logged data from the Simulink model. Properties of this object include any signals and
states that the model is configured to log, simulation metadata, and any errors that occurred for
the second generation, third citizen, and fourth run.

Consider a Simulink environment that logs its states as xout over 10 episodes.

Previously, if res was an rlTrainingResult object returned by trainWithEvolutionStrategy, you could pack the environment simulation information for all generations in a single array as:

[res.SimulationInfo.xout]

Now, you can only address one specific element:

res.SimulationInfo(5).xout

Similarly, previously you could access the simulation information related to the first generation as:

res.SimulationInfo.xout

Now, you must explicitly address the first element:

res.SimulationInfo(1).xout

For more information, see trainWithEvolutionStrategy.

Reinforcement Learning Episode Manager renamed
Behavior change

The Reinforcement Learning Episode Manager has been renamed to Reinforcement Learning
Training Monitor.

Reinforcement Learning Designer now always displays visualization
Behavior change

You can no longer set the Plots training option in the Reinforcement Learning Designer app. The training visualization is now always displayed.

Approximators: Object functions now enforce correct output dimensions
Behavior change

The evaluation functions for approximator objects (for example, getAction and evaluate) now enforce consistency with the output dimensions, including the trailing singleton dimension for cases like column vector specifications. The first n dimensions are consistent with the specification dimensions. Dimension n+1 corresponds to the batch dimension, and dimension n+2 corresponds to the sequence dimension.

For example, suppose the output of an approximator method, gro_batch{1}, has dimensions 4, 1, 5, and 9. This means that it has channel dimensions of 4 and 1, as well as five independent sequences, each one made of nine sequential observations.

Previously, you accessed the third observation element of the first channel, at the seventh sequential observation in the fourth independent sequence, as:

gro_batch{1}(3,4,7)

Now, you must address the same element as:

gro_batch{1}(3,1,4,7)


R2023b

Version: 23.2

New Features

Bug Fixes

Compatibility Considerations

Multi-Agent Reinforcement Learning: Train multiple agents in a MATLAB environment
Previously, you could train or simulate multiple agents together only using a custom Simulink environment. Now, you can also do so within a custom multi-agent MATLAB environment.

You can create two different kinds of custom multi-agent MATLAB environments:

• Multi-agent environments with universal sample time, in which all agents execute in the same
step. For more information on custom multi-agent function environments with universal sample
time, see rlMultiAgentFunctionEnv.
• Turn-based environments, in which agents execute in turns. Specifically, the environment assigns
execution to only one group of agents at a time, and the group executes when it is its turn to do
so. For more information on custom turn-based multi-agent function environments, see
rlTurnBasedFunctionEnv. For an example, see Train Agent to Play Turn-Based Game.

Once you create your custom multi-agent MATLAB environment, you can train and simulate your agents with it using train and sim, respectively.
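
A minimal sketch of a two-agent function environment with universal sample time (the specifications and the myStepFcn and myResetFcn handler names are hypothetical):

obsInfo = {rlNumericSpec([3 1]), rlNumericSpec([3 1])};       % one observation spec per agent
actInfo = {rlFiniteSetSpec([-1 1]), rlFiniteSetSpec([-1 1])}; % one action spec per agent
env = rlMultiAgentFunctionEnv(obsInfo,actInfo,@myStepFcn,@myResetFcn);
% trainResults = train([agentA agentB],env);                  % train both agents together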

Evolutionary Reinforcement Learning: Increase computational efficiency of training by using evolutionary strategies
You can now train DDPG, TD3 and SAC agents using an evolutionary algorithm. Evolutionary
reinforcement learning adaptation strategies update the weights of your agents using a selection
process inspired by biological evolution. Compared to gradient-based approaches, evolutionary
algorithms are less reliant on backpropagation, are easily parallelizable, and have a reduced
sensitivity to local minima. They also generally display good (nonlocal) exploration and robustness,
especially in complex scenarios where data is incomplete or noisy and rewards are sparse or
conflicting.

To train your agent using an evolutionary algorithm, first use rlEvolutionStrategyTrainingOptions to create a training option object. Then pass this option object (along with the environment and agent) to trainWithEvolutionStrategy to train your agent.

For more information, see rlEvolutionStrategyTrainingOptions and trainWithEvolutionStrategy. For an example, see Train Biped Robot to Walk Using Evolution Strategy.
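
A minimal sketch of this workflow (option values are illustrative):

esOpts = rlEvolutionStrategyTrainingOptions( ...
    PopulationSize=25, ...
    MaxGenerations=200, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=500);
% results = trainWithEvolutionStrategy(agent,env,esOpts);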

Agent Evaluation: Evaluate agents during training


You can now automatically evaluate your agent at regular intervals during training. Doing so allows
you to observe the actual training progress and automatically stop the training or save the agent
when some pre-specified conditions are met.

To configure evaluation options for your agents, first create an evaluator object using rlEvaluator.
You can specify properties such as the type of evaluation statistic, the frequency at which evaluation
episodes occur, or whether exploration is allowed during an evaluation episode.

To train the agents and evaluate them during training, pass this object to train.

You can also create a custom evaluator object, which uses a custom evaluation function that you
supply. To do so, use rlCustomEvaluator.

For more information, see rlEvaluator, rlCustomEvaluator, and the EvaluationStatistic of train.
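
A minimal sketch (option values are illustrative):

evl = rlEvaluator( ...
    NumEpisodes=5, ...               % evaluation episodes per evaluation round
    EvaluationFrequency=25, ...      % evaluate every 25 training episodes
    RandomSeeds=1:5);                % fixed seeds for comparable statistics
% trainResults = train(agent,env,trainOpts,Evaluator=evl);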

Training: Use custom criteria to save agent or stop training


You now have two additional options to save agents during training or stop the training altogether.

The EvaluationStatistic option is available when an evaluator object is used with training. When
using this option, the agent is saved (or the training is stopped) when the evaluation statistic supplied
by the evaluator object equals or exceeds the value specified in SaveAgentValue (or
StopTrainingValue).

The Custom option allows you to save the agent (or stop the training) using a custom function that
you supply with the SaveAgentValue (or StopTrainingValue) argument. When using this option,
the agent is saved (or the training is stopped) when the custom function returns true.

For more information, see rlTrainingOptions and the EvaluationStatistic property of train.
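
A minimal sketch (myStopFcn is a hypothetical function handle; see rlTrainingOptions for the exact signature it must have):

trainOpts = rlTrainingOptions( ...
    StopTrainingCriteria="Custom", ...
    StopTrainingValue=@myStopFcn);   % training stops when myStopFcn returns true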

Functionality being removed or changed


rlFunctionEnv Object: The LoggedSignals property is replaced by the Info property
Warns

The LoggedSignals property of the rlFunctionEnv object is no longer active and will be removed
in a future release. To pass information from one step to the next, use the Info property instead.


R2023a

Version: 2.4

New Features

Bug Fixes

Learning from Data: Train agents offline using previously recorded data
You can now train off-policy agents (DQN, SAC, DDPG, and TD3) offline, using an existing dataset,
instead of an environment. For more information, see trainFromData and
rlTrainingFromDataOptions.

To deal with possible differences between the probability distribution of the dataset and the one
generated by the environment, use the batch data regularization options provided for off-policy
agents. For more information, see the new BatchDataRegularizerOptions property of the off-
policy agents options objects, as well as the new rlBehaviorCloningRegularizerOptions and
rlConservativeQLearningOptions options objects.
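
A minimal sketch of the offline workflow (myExperienceData stands for a previously recorded data source; see trainFromData for the supported data formats and exact calling syntax):

tfdOpts = rlTrainingFromDataOptions;                          % default offline training options
% tfdResults = trainFromData(agent,myExperienceData,tfdOpts);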

Data Logging: Visualize logged data in Reinforcement Learning Data Viewer
You can now visualize logged data using the Reinforcement Learning Data Viewer, a new interactive
tool shipped with Reinforcement Learning Toolbox. For more information, see rlDataViewer.

Training: Stop and resume training in the Reinforcement Learning Designer App
You can now resume agent training from the Reinforcement Learning Designer. When training
terminates, either because a termination condition is reached or because you click Stop Training in
the Reinforcement Learning Episode Manager, the stored training statistics and results can be now
used to resume training from the exact point at which it was stopped.

Hindsight Replay Memory: Improve sample efficiency for goal-conditioned tasks with sparse rewards
You can now improve sample efficiency of off-policy agents (DQN, TD3, SAC, DDPG) using hindsight
replay memory, which is a data augmentation method for goal-conditioned tasks. When the reward
from the environment is sparse, hindsight replay memory can improve sample efficiency.

By default, built-in off-policy agents use an rlReplayMemory object as their experience buffer. Agents uniformly sample data from this buffer. To perform uniform or nonuniform hindsight replay, replace the default experience buffer with one of the following objects.

• rlHindsightReplayMemory — Replay memory with uniform sampling


• rlHindsightPrioritizedReplayMemory — Replay memory with prioritized sampling, which
can further improve sample efficiency

Hindsight experience replay does not support agents that use recurrent neural networks.

Replay Memory: Validate experiences before adding to replay memory
buffer
You can now validate experiences before adding them to a replay memory buffer using the
validateExperience function. If the experiences are not compatible with the replay memory,
validateExperience generates an error message in the MATLAB command window.


R2022b

Version: 2.3

New Features

Bug Fixes

Deployment: Policy Block


The new Policy Simulink block simplifies the process of inserting a reinforcement learning policy into
a Simulink model. Use it for simulation, code generation and deployment purposes.

You can automatically generate and configure a Policy block using either the new function
generatePolicyBlock or the new button on the RL Agent block mask.

For more information, see generatePolicyBlock and the Policy block. For an example, see
Generate Policy Block for Deployment.

Training: Enhanced logging capability for built-in agents


You can now log data such as experience, losses and learnable parameters to disk. Also, for Simulink
environments, you can log any signal value. Once collected, this data can then be loaded into memory
and analyzed.

For more information, see rlDataLogger. For an example, see Log Training Data To Disk.

Training: Prioritized experience replay for off-policy agents


You can now improve sample efficiency of off-policy agents (DQN, TD3, SAC, DDPG) using prioritized
experience replay. To do so, replace the agent experience buffer with an
rlPrioritizedReplayMemory object.

Prioritized experience replay does not support agents that use recurrent neural networks.
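
A minimal sketch (the buffer length is illustrative):

agent = rlDQNAgent(obsInfo,actInfo);                                     % any off-policy agent
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo,1e6); % prioritized sampling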


R2022a

Version: 2.2

New Features

Bug Fixes

Compatibility Considerations

New Agent Architecture: Create memory-efficient agents with more modular and scalable actors and critics
You can now create agents using a redesigned architecture which is more modular, scalable, and
memory efficient, while also facilitating parallel training. The new architecture introduces new
functions and objects and changes some previously existing functionalities.

New Approximator Objects

You can represent actor and critic functions using six new approximator objects. These objects
replace the previous representation objects and improve efficiency, readability, scalability, and
flexibility.

• rlValueFunction — State value critic, computed based on an observation from the environment.
• rlQValueFunction — State-action value critic with a scalar output, which is the value of an
action given an observation from the environment. The two inputs are a possible action and an
observation from the environment.
• rlVectorQValueFunction — State-action value critic with a vector output, where each element
of the vector is the value of one of the possible actions. The input is an observation from the
environment.
• rlContinuousDeterministicActor — Actor with a continuous action space. The output is a
deterministic action, the input is an observation from the environment.
• rlDiscreteCategoricalActor — Actor with a discrete action space. The output is a stochastic
action from a categorical distribution, the input is an observation from the environment.
• rlContinuousGaussianActor — Actor with a continuous action space. The output is a
stochastic action sampled from a Gaussian distribution, the input is an observation from the
environment.

When creating a critic or an actor, you can now select and update optimization options using the new
rlOptimizerOptions object, instead of using the older rlRepresentationOptions object.

Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent.

Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;.

New Policy Objects for Custom Agents and Custom Training Loops

To implement a customized agent, you can instantiate a policy using the following new policy objects.

• rlMaxQPolicy — This object implements a policy that selects the action that maximizes a
discrete state-action value function.
• rlEpsilonGreedyPolicy — This object implements a policy that selects the action that
maximizes a discrete state-action value function with probability 1-Epsilon, otherwise selects a
random action.
• rlDeterministicActorPolicy — This object implements a policy that you can use to
implement custom agents with a continuous action space.

• rlAdditiveNoisePolicy — This object is similar to rlDeterministicActorPolicy but noise
is added to the output according to an internal noise model.
• rlStochasticActorPolicy — This object implements a policy that you can use to implement
custom agents with a continuous action space.

For more information on these policy objects, at the MATLAB command line type help followed by the policy object name.
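For instance, the following sketch (network layout and specification sizes are illustrative assumptions) wraps a vector Q-value critic in an rlMaxQPolicy object and queries a greedy action.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

net = [featureInputLayer(4)
       fullyConnectedLayer(16)
       reluLayer
       fullyConnectedLayer(3)];       % one output per possible action
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

policy = rlMaxQPolicy(critic);
action = getAction(policy,{rand(4,1)})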

You can use the new rlReplayMemory object to append, store, save, sample and replay experience
data. Doing so makes it easier to implement custom training loops and your own reinforcement
learning algorithms.

When creating a customized training loop or agent you can also access optimization features using
the objects created by the new rlOptimizer function. Specifically, create an optimizer algorithm
object using rlOptimizer, and optionally use dot notation to modify its properties. Then, create a
structure and set its CriticOptimizer or ActorOptimizer field to the optimizer object. When you
call runEpisode, pass the structure as an input parameter. The runEpisode function can then use
the update method of the optimizer object to update the learnable parameters of your actor or critic.

For more information, see Custom Training Loop with Simulink Action Noise and Train
Reinforcement Learning Policy Using Custom Training Loop.

Compatibility Considerations
The following representation objects are no longer recommended:

• rlValueRepresentation
• rlQValueRepresentation
• rlDeterministicActorRepresentation
• rlStochasticActorRepresentation

Also, the corresponding representation options object, rlRepresentationOptions, is no longer recommended; use an rlOptimizerOptions object instead.

For more information on how to update your code to use the new objects, see “Representation objects
are not recommended” on page 6-5.

Neural Network Environment: Use an approximated environment model based on a deep neural network
You can now create an environment object that uses a deep neural network to calculate state
transitions and rewards. Using such an environment, you can:

• Create an internal environment model for a model-based policy optimization (MBPO) agent. For
more information on MBPO agents, see Model-Based Policy Optimization Agents.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data. Depending on your application,
using a neural network environment as an approximation of a more complex first-principle
environment can speed up your simulation and training.

To create a neural network environment, use an rlNeuralNetworkEnvironment object.


Model Based Policy Optimization Agent: Use a model of the environment to improve sample efficiency and exploration
You can now create and train model-based policy optimization (MBPO) agents. An MBPO agent uses a
neural network to internally approximate a transition model of the environment. This reusable
internal model allows for a greater sample efficiency and a more effective exploration, compared to a
typical model-free agent.

For more information on creating MBPO agents, see Model-Based Policy Optimization Agents.

Multi-Agent Reinforcement Learning: Train multiple agents in a centralized manner for more efficient exploration and learning
You can now group agents according to a common learning strategy and specify whether they learn in a centralized manner (that is, all agents in a group share experiences) or in a decentralized manner (agents do not share experiences).

Centralized learning boosts exploration and facilitates learning in applications where the agents
perform a collaborative (or the same) task.

For more information on creating training options set for multiple agents, see
rlMultiAgentTrainingOptions.

Training: Stop and resume agent training


You can now resume agent training from a training result object returned by the train function.
When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, the stored training statistics and
results can be now used to resume training from the exact point at which it was stopped.

For more information and examples, see the train reference page.

Event-Based Simulation: Use RL Agent block inside conditionally executed subsystems
You can now use the RL Agent Simulink block inside a conditionally executed subsystem, such as a
Triggered Subsystem (Simulink) or a Function-Call Subsystem (Simulink). To do so, you must specify
the sample time of the reinforcement learning agent object specified in the RL Agent block as -1 so
that the block can inherit the sample time of its parent subsystem.

For more information, see the SampleTime property of any agent options object. For more
information on conditionally executed subsystems, see Conditionally Executed Subsystems Overview
(Simulink).

RL Agent Block: Learn from last action applied to the environment


For some applications, when training an agent in a Simulink environment, the action applied to the
environment can differ from the action output by the RL Agent block. For example, the Simulink
model can contain a saturation block on the action output signal.

In such cases, to improve learning results, you can now enable an input port to connect the last
action signal applied to the environment.

Reinforcement Learning Designer App: Support for SAC and TRPO agents
You can now create Soft Actor-Critic Agents and Trust Region Policy Optimization Agents from
Reinforcement Learning Designer.

For more information on creating agents using Reinforcement Learning Designer, see Create Agents
Using Reinforcement Learning Designer.

New Examples: Train agents for robotics and automated parking applications
This release includes the following new reference examples.

• Train Reinforcement Learning Agents To Control Quanser QUBE™ Pendulum — Train a SAC agent to generate a swing-up reference trajectory for an inverted pendulum and a PPO agent as a mode-selection controller.
• Run SIL and PIL Verification for Reinforcement Learning — Perform software-in-the-loop and
processor-in-the-loop verification of trained reinforcement learning agents.
• Train SAC Agent for Ball Balance Control — Control a Kinova robot arm to balance a ball on a
plate using a SAC agent.
• Automatic Parking Valet with Unreal Engine Simulation — Implement a hybrid reinforcement
learning and model predictive control system that searches a parking lot and parks in an open
space.

Functionality being removed or changed


Representation objects are not recommended
Still runs

Functions to create representation objects are no longer recommended. Depending on the type of
actor or critic being created, use one of the following objects instead.

• rlValueRepresentation, replaced by rlValueFunction: State value critic, computed based on an observation from the environment. This type of critic is used in rlACAgent, rlPGAgent, rlPPOAgent, and rlTRPOAgent agents.

• rlQValueRepresentation, replaced by rlQValueFunction: State-action value critic with a scalar output, which is the value of an action given an observation from the environment. The two inputs are a possible action and an observation from the environment. It is used in rlDQNAgent and rlDDPGAgent agents.

• rlQValueRepresentation, replaced by rlVectorQValueFunction: State-action value critic with a vector output, where each element of the vector is the value of one of the possible actions. The input is an observation from the environment. It is used in rlDQNAgent agents (and preferred over single-output critics).

• rlDeterministicActorRepresentation, replaced by rlContinuousDeterministicActor: Actor with a continuous action space. The output is a deterministic action, the input is an observation from the environment. This kind of actor is used in rlDDPGAgent and rlTD3Agent agents.

• rlStochasticActorRepresentation, replaced by rlDiscreteCategoricalActor: Actor with a discrete action space. The output is a stochastic action sampled from a categorical (also known as Multinoulli) distribution, the input is an observation from the environment. It is used in rlPGAgent and rlACAgent agents, as well as in rlPPOAgent and rlTRPOAgent agents with a discrete action space.

• rlStochasticActorRepresentation, replaced by rlContinuousGaussianActor: Actor with a continuous action space. The output is a stochastic action sampled from a Gaussian distribution, the input is an observation from the environment. It is used in rlSACAgent agents, as well as in rlPGAgent, rlACAgent, rlTRPOAgent, and rlPPOAgent agents with a continuous action space.

rlRepresentationOptions objects are no longer recommended. To specify optimization options for actors and critics, use rlOptimizerOptions objects instead.

Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.

rlRepresentationOptions: Not Recommended

crtOpts = rlRepresentationOptions(...
    'GradientThreshold',1);

critic = rlValueRepresentation(...
    net,obsInfo,'Observation',{'obs'},crtOpts);

rlOptimizerOptions: Recommended

criticOpts = rlOptimizerOptions(...
    'GradientThreshold',1);

agentOpts = rlACAgentOptions(...
    'CriticOptimizerOptions',criticOpts);

agent = rlACAgent(actor,critic,agentOpts)

Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.

The following examples show some typical uses of the representation objects to create neural network-based critics and actors, and how to update your code with one of the new function approximator objects instead.

• Not recommended: myCritic = rlValueRepresentation(net,obsInfo,'Observation',obsNames), with net having observations as inputs and a single scalar output.
  Recommended: myCritic = rlValueFunction(net,obsInfo,'ObservationInputNames',obsNames). Use this syntax to create a state value function object for a critic that does not require action inputs.

• Not recommended: myCritic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with net having both observations and actions as inputs and a single scalar output.
  Recommended: myCritic = rlQValueFunction(net,obsInfo,actInfo,'ObservationInputNames',obsNames,'ActionInputNames',actNames). Use this syntax to create a single-output state-action value function object for a critic that takes both observation and action as inputs.

• Not recommended: myCritic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsNames), with net having only the observations as inputs and a single output layer having as many elements as the number of possible discrete actions.
  Recommended: myCritic = rlVectorQValueFunction(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs and outputs a vector in which each element is the value of one of the possible actions.

• Not recommended: myActor = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with actInfo defining a continuous action space and net having observations as inputs and a single output layer with as many elements as the number of dimensions of the continuous action space.
  Recommended: myActor = rlContinuousDeterministicActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a deterministic actor object with a continuous action space.

• Not recommended: myActor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames), with actInfo defining a discrete action space and net having observations as inputs and a single output layer with as many elements as the number of possible discrete actions.
  Recommended: myActor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a stochastic actor object with a discrete action space. This actor samples its action from a categorical (also known as Multinoulli) distribution.

• Not recommended: myActor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames), with actInfo defining a continuous action space and net having observations as inputs and a single output layer with twice as many elements as the number of dimensions of the continuous action space (representing, in sequence, all the means and all the standard deviations of every action dimension).
  Recommended: myActor = rlContinuousGaussianActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames,'ActionMeanOutputNames',actMeanNames,'ActionStandardDeviationOutputNames',actStdNames). Use this syntax to create a stochastic actor object with a continuous action space. This actor samples its action from a Gaussian distribution, and you must provide the names of the network outputs representing the mean and standard deviations of the action.

The following examples show some typical uses of the representation objects to express table-based critics with discrete observation and action spaces, and how to update your code with one of the new objects instead.

• Not recommended: rep = rlValueRepresentation(tab,obsInfo), where the table tab contains a column vector with as many elements as the number of possible observations.
  Recommended: rep = rlValueFunction(tab,obsInfo). Use this syntax to create a value function object for a critic that does not require action inputs.

• Not recommended: rep = rlQValueRepresentation(tab,obsInfo,actInfo), where the table tab contains a vector with as many elements as the number of possible observations plus the number of possible actions.
  Recommended: rep = rlQValueFunction(tab,obsInfo,actInfo). Use this syntax to create a single-output state-action value function object for a critic that takes both observations and actions as input.

• Not recommended: rep = rlQValueRepresentation(tab,obsInfo,actInfo), where the table tab contains a Q-value table with as many rows as the number of possible observations and as many columns as the number of possible actions.
  Recommended: rep = rlVectorQValueFunction(tab,obsInfo,actInfo). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs and outputs a vector in which each element is the value of one of the possible actions. It is good practice to use critics with vector outputs when possible.

The following examples show some typical uses of the representation objects to create critics and actors which use a (linear in the learnable parameters) custom basis function, and how to update your code with one of the new objects instead. In these function calls, the first input argument is a two-element cell array containing both the handle to the custom basis function and the initial weight vector or matrix.

• Not recommended: rep = rlValueRepresentation({basisFcn,W0},obsInfo), where the basis function has only observations as inputs and W0 is a column vector.
  Recommended: rep = rlValueFunction({basisFcn,W0},obsInfo). Use this syntax to create a value function object for a critic that does not require action input.

• Not recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has both observations and action as inputs and W0 is a column vector.
  Recommended: rep = rlQValueFunction({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a single-output state-action value function object for a critic that takes both observation and action as inputs.

• Not recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has both observations and action as inputs and W0 is a matrix with as many columns as the number of possible actions.
  Recommended: rep = rlVectorQValueFunction({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs and outputs a vector in which each element is the value of one of the possible actions. It is good practice to use critics with vector outputs when possible.

• Not recommended: rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix with as many columns as the number of possible actions, and actInfo defines a continuous action space.
  Recommended: rep = rlContinuousDeterministicActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a deterministic actor object with a continuous action space.

• Not recommended: rep = rlStochasticActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix with as many columns as the number of possible actions, and actInfo defines a discrete action space.
  Recommended: rep = rlDiscreteCategoricalActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a stochastic actor object with a discrete action space, which returns an action sampled from a categorical (also known as Multinoulli) distribution.

For more information on the new approximator objects, see rlTable, rlValueFunction, rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor, rlDiscreteCategoricalActor, and rlContinuousGaussianActor.

train now returns an object instead of a structure


Behavior change in future release

The train function now returns an object or an array of objects as the output. The properties of the
object match the fields of the structure returned in previous versions. Therefore, the code based on
dot notation works in the same way.


For example, if you train an agent using the following command:

trainStats = train(agent,env,trainOptions);

When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.

The rlTrainingResult object contains the same training statistics previously returned in a
structure along with data to correctly recreate the training scenario and update the episode manager.

You can use trainStats as the third argument for another train call, which (when executed with the same agents and environment) causes training to resume from the exact point at which it stopped.

For more information and examples, see train and “Training: Stop and resume agent training” on
page 6-4. For more information on training agents, see Train Reinforcement Learning Agents.

Training Parallelization Options: DataToSendFromWorkers and StepsUntilDataIsSent properties are no longer active
Warns

The property DataToSendFromWorkers of the ParallelizationOptions object is no longer active and will be removed in a future release. The data sent from the workers to the learner is now automatically determined based on agent type.

The property StepsUntilDataIsSent of the ParallelizationOptions object is no longer active and will be removed in a future release. Data is now sent from the workers to the learner at the end of each episode.

Attempting to set either of these properties causes a warning. For more information, see rlTrainingOptions.

Code generated by generatePolicyFunction now uses policy objects


Behavior change in future release

The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.


R2021b

Version: 2.1

New Features

Bug Fixes

Compatibility Considerations

Rewards Generation: Automatically generate reward functions from controller specifications
You can now generate reinforcement learning reward functions, coded in MATLAB, from:

• Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc
(Model Predictive Control Toolbox) controller object. This feature requires Model Predictive
Control Toolbox™ software.
• Performance constraints defined in Simulink Design Optimization™ model verification blocks.

For more information, see generateRewardFunction, exteriorPenalty, hyperbolicPenalty, and barrierPenalty, as well as the examples Generate Reward Function from a Model Predictive Controller for a Servomotor and Generate Reward Function from a Model Verification Block for a Water Tank System.
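As a minimal sketch (the input value, bounds, and penalty method are illustrative assumptions), the generated reward functions build on penalty functions such as exteriorPenalty, which penalizes a value for leaving a given interval.

% Penalize x for lying outside the interval [-1, 1], using a quadratic penalty.
x = 1.5;
p = exteriorPenalty(x,-1,1,'quadratic')   % returns 0 whenever -1 <= x <= 1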

Episode Manager: Improved layout management for single and multiple agent training
You can now display the training progress of single and multiple agents with Episode Manager. The
layout management is integrated with the Reinforcement Learning Designer app, and exhibits a more
consistent plot resizing and moving behavior.

Neural Network Representations: Improved internal handling of dlnetwork objects
The built-in agents now use dlnetwork objects as actor and critic representations. In most cases this
allows for a speedup of about 30%.

Compatibility Considerations
• getModel now returns a dlnetwork object.
• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).

Trust Region Policy Optimization Agent: Prevent significant performance drops by restricting updated policy to trust region near current policy
You can now create and train trust region policy optimization (TRPO) agents. TRPO is a policy gradient reinforcement learning algorithm. It prevents significant performance drops compared to standard policy gradient methods by keeping the updated policy within a trust region close to the current policy.

For more information on creating TRPO agents, see rlTRPOAgent and rlTRPOAgentOptions.

PPO Agents: Improve agent performance by normalizing advantage
function
In some environments, you can improve PPO agent performance by normalizing the advantage
function during training. The agent normalizes the advantage function by subtracting the mean
advantage value and scaling by the standard deviation.

To enable advantage function normalization, first create an rlPPOAgentOptions object. Then, specify the NormalizedAdvantageMethod option as one of the following values.

• "current" — Normalize the advantage function using the advantage function mean and standard
deviation for the current mini-batch of experiences.
• "moving" — Normalize the advantage function using the advantage function mean and standard
deviation for a moving window of recent experiences. To specify the window size, set the
AdvantageNormalizingWindow option.

For example, configure the agent options to normalize the advantage function using the mean and
standard deviation from the last 500 experiences.

opt = rlPPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 500;

For more information on PPO agents, see Proximal Policy Optimization Agents.

New Example: Create and train custom agent using model-based reinforcement learning
A model-based reinforcement learning agent learns a model of its environment that it can use to
generate additional experiences for training. For an example that shows how to create and train such
an agent, see Model-Based Reinforcement Learning Using Custom Training Loop.

Functionality being removed or changed


Built-in agents now use dlnetwork objects
Behavior change

The built-in agents now use dlnetwork objects as actor and critic representations. In most cases this
allows for a speedup of about 30%.

• getModel now returns a dlnetwork object.


• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).


R2021a

Version: 2.0

New Features

Bug Fixes

Compatibility Considerations

Reinforcement Learning Designer App: Design, train, and simulate agents using a visual interactive workflow
The new Reinforcement Learning Designer app streamlines the workflow for designing, training, and
simulating agents. You can now:

• Import an environment from the MATLAB workspace.


• Automatically create or import an agent for your environment (DQN, DDPG, PPO, and TD3 agents
are supported).
• Train and simulate the agent against the environment.
• Analyze simulation results and refine your agent parameters.
• Export the final agent to the MATLAB workspace for further use and deployment.

To open the Reinforcement Learning Designer app, at the MATLAB command line, enter the
following:

reinforcementLearningDesigner

For more information, see Reinforcement Learning Designer.

Recurrent Neural Networks: Train agents with recurrent deep neural network policies and value functions
You can now use recurrent neural networks (RNN) when creating actors and critics for use with PG,
DDPG, AC, SAC, and TD3 agents. Previously, only PPO and DQN agents were supported.

RNNs are deep neural networks with a sequenceInputLayer input layer and at least one layer that
has hidden state information, such as an lstmLayer. These networks can be especially useful when
the environment has states that are not included in the observation vector.
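As a minimal sketch (layer sizes are illustrative assumptions), a recurrent critic network for a DQN agent with three possible actions could look like this.

rnnNet = [sequenceInputLayer(4,'Name','obs')
          lstmLayer(8,'OutputMode','sequence','Name','lstm')
          fullyConnectedLayer(3,'Name','q')];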

For more information on creating agents with RNNs, see rlDQNAgent, rlPPOAgent, and the Recurrent Neural Networks section in Create Policy and Value Function Representations.

For more information on creating policies and value functions, see rlValueRepresentation,
rlQValueRepresentation, rlDeterministicActorRepresentation, and
rlStochasticActorRepresentation.

Guided Policy Learning: Perform imitation learning in Simulink by learning policies based on external actions
You can now perform imitation learning in Simulink using the RL Agent block. To do so, you pass an
external signal to the RL Agent block, such as a control signal from a human expert. The block can
pass this action signal to the environment and update its policy based on the resulting observations
and rewards.

You can also use this input port to override the agent action for safe learning applications.

inspectTrainingResult Function: Plot training information from a
previous training session
You can now plot the saved training information from a previous reinforcement learning training
session using the inspectTrainingResult function.

By default, the train function shows the training progress and results in the Episode Manager. If you
configure training to not show the Episode Manager or you close the Episode Manager after training,
you can view the training results using the inspectTrainingResult function, which opens the
Episode Manager.

Deterministic Exploitation: Create PG, AC, PPO, and SAC agents that
use deterministic actions during simulation and in generated policy
functions
PG, AC, PPO, and SAC agents generate stochastic actions during training. By default, these agents also use stochastic actions during simulation and deployment. You can now configure these agents to use deterministic actions during simulations and in generated policy function code.

To enable deterministic exploitation, in the corresponding agent options object, set the
UseDeterministicExploitation property to true. For more information, see
rlPGAgentOptions, rlACAgentOptions, rlPPOAgentOptions, or rlSACAgentOptions.
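For example, the following sketch (shown here for a SAC agent options object) enables deterministic exploitation; applying the options to an existing agent is shown as a comment.

agentOpts = rlSACAgentOptions;
agentOpts.UseDeterministicExploitation = true;
% agent.AgentOptions = agentOpts;   % apply to an existing SAC agent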

For more information on simulating agents and generating policy functions, see sim and
generatePolicyFunction, respectively.

New Examples: Train agent with constrained actions and use DQN
agent for optimal scheduling
This release includes the following new reference examples.

• Water Distribution System Scheduling Using Reinforcement Learning — Train a DQN agent to
learn an optimal pump scheduling policy for a water distribution system.
• Train Reinforcement Learning Agent with Constraint Enforcement — Train an agent with critical
constraints enforced on its actions.

Functionality being removed or changed


Properties defining noise probability distribution in the GaussianActionNoise object have
changed
Still runs

The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.

• The Variance property has been replaced by the StandardDeviation property.


• The VarianceDecayRate property has been replaced by the StandardDeviationDecayRate
property.
• The VarianceMin property has been replaced by the StandardDeviationMin property.


When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the
value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square root
of the values of Variance and VarianceMin are copied to StandardDeviation and
StandardDeviationMin, respectively.

The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.

Update Code

The following examples show how to update your code to use the new property names for an rlTD3AgentOptions object td3opt.

Not recommended:

td3opt.ExplorationModel.Variance = 0.5;
td3opt.ExplorationModel.VarianceDecayRate = 0.1;
td3opt.ExplorationModel.VarianceMin = 0.1;
td3opt.TargetPolicySmoothModel.Variance = 0.5;
td3opt.TargetPolicySmoothModel.VarianceDecayRate = 0.1;
td3opt.TargetPolicySmoothModel.VarianceMin = 0.1;

Recommended:

td3opt.ExplorationModel.StandardDeviation = sqrt(0.5);
td3opt.ExplorationModel.StandardDeviationDecayRate = 0.1;
td3opt.ExplorationModel.StandardDeviationMin = sqrt(0.1);
td3opt.TargetPolicySmoothModel.StandardDeviation = sqrt(0.5);
td3opt.TargetPolicySmoothModel.StandardDeviationDecayRate = 0.1;
td3opt.TargetPolicySmoothModel.StandardDeviationMin = sqrt(0.1);

Property names defining noise probability distribution in the OrnsteinUhlenbeckActionNoise object have changed
Still runs

The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG and TD3 agents use OU noise for exploration.

• The Variance property has been renamed StandardDeviation.


• The VarianceDecayRate property has been renamed StandardDeviationDecayRate.
• The VarianceMin property has been renamed StandardDeviationMin.

The default values of these properties remain the same. When an OrnsteinUhlenbeckActionNoise noise object saved from a previous MATLAB release is loaded, the values of Variance, VarianceDecayRate, and VarianceMin are copied to the StandardDeviation, StandardDeviationDecayRate, and StandardDeviationMin properties, respectively.

The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.

Update Code

The following examples show how to update your code to use the new property names for an rlDDPGAgentOptions object ddpgopt and an rlTD3AgentOptions object td3opt.

Not recommended:

ddpgopt.NoiseOptions.Variance = 0.5;
ddpgopt.NoiseOptions.VarianceDecayRate = 0.1;
ddpgopt.NoiseOptions.VarianceMin = 0;
td3opt.ExplorationModel.Variance = 0.5;
td3opt.ExplorationModel.VarianceDecayRate = 0.1;
td3opt.ExplorationModel.VarianceMin = 0;

Recommended:

ddpgopt.NoiseOptions.StandardDeviation = 0.5;
ddpgopt.NoiseOptions.StandardDeviationDecayRate = 0.1;
ddpgopt.NoiseOptions.StandardDeviationMin = 0;
td3opt.ExplorationModel.StandardDeviation = 0.5;
td3opt.ExplorationModel.StandardDeviationDecayRate = 0.1;
td3opt.ExplorationModel.StandardDeviationMin = 0;


R2020b

Version: 1.3

New Features

Bug Fixes

Compatibility Considerations

Multi-Agent Reinforcement Learning: Train multiple agents in a Simulink environment
You can now train and deploy multiple agents that work in the same Simulink environment. You can
visualize the training progress of all the agents using the Episode Manager.

Create a multi-agent environment by supplying to rlSimulinkEnv an array of strings containing the paths of the agent blocks, and cell arrays defining the observation and action specifications of the agent blocks.
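For example, the following sketch (the model name, block paths, and specification sizes are illustrative assumptions, and the Simulink model itself is assumed to exist) creates a two-agent environment.

mdl = "myMultiAgentModel";
agentBlocks = [mdl + "/Agent A", mdl + "/Agent B"];

obsInfoA = rlNumericSpec([3 1]);   actInfoA = rlNumericSpec([1 1]);
obsInfoB = rlNumericSpec([2 1]);   actInfoB = rlFiniteSetSpec([-1 1]);

env = rlSimulinkEnv(mdl,agentBlocks, ...
    {obsInfoA,obsInfoB},{actInfoA,actInfoB});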

For examples on training multiple agents, see Train Multiple Agents to Perform Collaborative Task,
Train Multiple Agents for Area Coverage, and Train Multiple Agents for Path Following Control.

Soft Actor-Critic Agent: Train sample-efficient policies for environments with continuous-action spaces using increased exploration
You can now create soft actor-critic (SAC) agents. SAC is an improved version of DDPG that
generates stochastic policies for environments with a continuous action space. It tries to maximize
the entropy of the policy in addition to the cumulative long-term reward, thereby encouraging
exploration.

You can create a SAC agent using the rlSACAgent function. You can also create a SAC-specific
options object with the rlSACAgentOptions function.

Default Agents: Avoid manually formulating policies by creating agents with default neural network structures
You can now create a default agent based only on the observation and action specifications of a given
environment. Previously, creating an agent required creating approximators for the agent actor and
critic, using these approximators to create actor and critic representations, and then using these
representations to create the agent.

Default agents are available for DQN, DDPG, TD3, PPO, PG, AC, and SAC agents. For each agent, you
can call the agent creation function, passing in the observation and action specifications from the
environment. The function creates the required actor and critic representations using deep neural
network approximators.

For example, agent = rlTD3Agent(obsInfo,actInfo) creates a default TD3 agent using a deterministic actor network and two Q-value critic networks.

You can specify initialization options (such as the number of hidden units for each layer, or whether to
use a recurrent neural network) for the default representations using an
rlAgentInitializationOptions object.
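As a minimal sketch (the specification sizes, the number of hidden units, and the property name NumHiddenUnit are illustrative assumptions), you might create a default SAC agent whose networks use 128 hidden units per layer.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],'LowerLimit',-1,'UpperLimit',1);

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
agent = rlSACAgent(obsInfo,actInfo,initOpts);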

After creating a default agent, you can then access its properties and change its actor and critic
representations.

For more information on creating agents, see Reinforcement Learning Agents.

getModel and setModel Functions: Access computational model used
by actor and critic representations
You can now access the computational model used by the actor and critic representations in a
reinforcement learning agent using the following new functions.

• getModel — Obtain the computational model from an actor or critic representation.


• setModel — Set the computational model in an actor or critic representation.

Using these functions, you can modify the computational model in a representation object without recreating the representation.
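For example, the following sketch (network layout and specification size are illustrative assumptions) extracts the model from a value-function critic and sets it back.

obsInfo = rlNumericSpec([4 1]);
net = [imageInputLayer([4 1 1],'Normalization','none','Name','obs')
       fullyConnectedLayer(1,'Name','value')];
critic = rlValueRepresentation(net,obsInfo,'Observation',{'obs'});

criticNet = getModel(critic);            % obtain the computational model
critic = setModel(critic,criticNet);     % set a (possibly modified) model back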

New Examples: Create a custom agent, use TD3 to tune a PI controller, and train agents for automatic parking and motor control
This release includes the following new reference examples.

• Create Agent for Custom Reinforcement Learning Algorithm — Create a custom agent for your
own custom reinforcement learning algorithm.
• Tune PI Controller using Reinforcement Learning — Tune a PI controller using the twin-delayed
deep deterministic policy gradient (TD3) reinforcement learning algorithm.
• Train PPO Agent for Automatic Parking Valet — Train a PPO agent to automatically search for a
parking space and park.
• Train DDPG Agent for PMSM Control — Train a DDPG agent to control the speed of a permanent
magnet synchronous motor.

Functionality being removed or changed


Default value of NumStepsToLookAhead option for AC agents is now 32
Behavior change

For AC agents, the default value of the NumStepsToLookAhead option is now 32.

To use the previous default value instead, create an rlACAgentOptions object and set the option
value to 1.

opt = rlACAgentOptions;
opt.NumStepsToLookAhead = 1;


R2020a

Version: 1.2

New Features

Bug Fixes

Compatibility Considerations

New Representation Objects: Create actors and critics with improved ease of use and flexibility
You can represent actor and critic functions using four new representation objects. These objects
improve ease of use, readability, and flexibility.

• rlValueRepresentation — State value critic, computed based on observations from the environment.
• rlQValueRepresentation — State-action value critic, computed based on both actions and
observations from the environment.
• rlDeterministicActorRepresentation — Actor with deterministic actions, based on
observations from the environment.
• rlStochasticActorRepresentation — Actor with stochastic actions, based on observations
from the environment.

These objects allow you to easily implement custom training loops for your own reinforcement
learning algorithms. For more information, see Train Reinforcement Learning Policy Using Custom
Training Loop.

Compatibility Considerations
The rlRepresentation function is no longer recommended. Use one of the four new objects
instead. For more information, see “rlRepresentation is not recommended” on page 10-3.

Continuous Action Spaces: Train AC, PG, and PPO agents in environments with continuous action spaces
Previously, you could train AC, PG, and PPO agents only in environments with discrete action spaces.
Now, you can also train these agents in environments with continuous action spaces. For more
information see rlACAgent, rlPGAgent, rlPPOAgent, and Create Policy and Value Function
Representations.

Recurrent Neural Networks: Train DQN and PPO agents with recurrent
deep neural network policies and value functions
You can now train DQN and PPO agents using recurrent neural network policy and value function
representations. For more information, see rlDQNAgent, rlPPOAgent, and Create Policy and Value
Function Representations.

TD3 Agent: Create twin-delayed deep deterministic policy gradient agents
The twin-delayed deep deterministic (TD3) algorithm is a state-of-the-art reinforcement learning
algorithm for continuous action spaces. It often exhibits better learning speed and performance
compared to deep deterministic policy gradient (DDPG) algorithms. For more information on TD3
agents, see Twin-Delayed Deep Deterministic Policy Gradient Agents. For more information on
creating TD3 agents, see rlTD3Agent and rlTD3AgentOptions.

Softplus Layer: Create deep neural network layer using the softplus
activation function
You can now use the new softplusLayer layer when creating deep neural networks. This layer implements the softplus activation function Y = log(1 + e^X), which ensures that the output is always positive. This activation function is a smooth continuous version of reluLayer.
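As a minimal sketch (layer sizes and names are illustrative assumptions), softplusLayer can serve as the final activation of a network path whose output must stay positive, such as a standard deviation output.

stdPath = [fullyConnectedLayer(8,'Name','fc')
           reluLayer('Name','relu')
           fullyConnectedLayer(1,'Name','stdFc')
           softplusLayer('Name','std')];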

Parallel Processing: Improved memory usage and performance


For experience-based parallelization, off-policy agents now flush their experience buffer before
distributing them to the workers. Doing so mitigates memory issues when agents with large
observation spaces are trained using many workers. Additionally, the synchronous gradient algorithm
has been numerically improved, and the overhead for parallel training has been reduced.

Deep Network Designer: Scaling, quadratic, and softplus layers now supported
Reinforcement Learning Toolbox custom layers, including the scalingLayer, quadraticLayer,
and softplusLayer, are now supported in the Deep Network Designer app.

New Examples: Train reinforcement learning agents for robotics and imitation learning applications
This release includes the following new reference examples.

• Train PPO Agent to Land Rocket — Train a PPO agent to land a rocket in an environment with a
discrete action space.
• Train DDPG Agent with Pretrained Actor Network — Train a DDPG agent using an actor network
that has been previously trained using supervised learning.
• Imitate Nonlinear MPC Controller for Flying Robot — Train a deep neural network to imitate a
nonlinear MPC controller.

Functionality being removed or changed


rlRepresentation is not recommended
Still runs

rlRepresentation is not recommended. Depending on the type of representation being created, use one of the following objects instead:

• rlValueRepresentation — State value critic, computed based on observations from the environment.
• rlQValueRepresentation — State-action value critic, computed based on both actions and
observations from the environment.
• rlDeterministicActorRepresentation — Actor with deterministic actions, for continuous
action spaces, based on observations from the environment.
• rlStochasticActorRepresentation — Actor with stochastic actions, based on observations
from the environment.


The following examples show some typical uses of the rlRepresentation function to create neural network-based critics and actors, and how to update your code with one of the new objects instead.

• Not recommended: rep = rlRepresentation(net,obsInfo,'Observation',obsNames), with net having only observations as inputs, and a single scalar output.
  Recommended: rep = rlValueRepresentation(net,obsInfo,'Observation',obsNames). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

• Not recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with net having both observations and action as inputs, and a single scalar output.
  Recommended: rep = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

• Not recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with net having observations as inputs and actions as outputs, and actInfo defining a continuous action space.
  Recommended: rep = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames). Use this syntax to create a deterministic actor representation for a continuous action space.

• Not recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with net having observations as inputs and actions as outputs, and actInfo defining a discrete action space.
  Recommended: rep = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames). Use this syntax to create a stochastic actor representation for a discrete action space.

The following examples show some typical uses of the rlRepresentation function to express table-based critics with discrete observation and action spaces, and how to update your code with one of the new objects instead.

• Not recommended: rep = rlRepresentation(tab), with tab containing a value table consisting of a column vector with as many elements as the number of possible observations.
  Recommended: rep = rlValueRepresentation(tab,obsInfo). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

• Not recommended: rep = rlRepresentation(tab), with tab containing a Q-value table with as many rows as the possible observations and as many columns as the possible actions.
  Recommended: rep = rlQValueRepresentation(tab,obsInfo,actInfo). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

The following examples show some typical uses of the rlRepresentation function to create critics and actors which use a custom basis function, and how to update your code with one of the new objects instead. In the recommended function calls, the first input argument is a two-element cell array containing both the handle to the custom basis function and the initial weight vector or matrix.

• Not recommended: rep = rlRepresentation(basisFcn,W0,obsInfo), where the basis function has only observations as inputs and W0 is a column vector.
  Recommended: rep = rlValueRepresentation({basisFcn,W0},obsInfo). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

• Not recommended: rep = rlRepresentation(basisFcn,W0,{obsInfo,actInfo}), where the basis function has both observations and action as inputs and W0 is a column vector.
  Recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

• Not recommended: rep = rlRepresentation(basisFcn,W0,obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix, and actInfo defines a continuous action space.
  Recommended: rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a deterministic actor representation for a continuous action space.

• Not recommended: rep = rlRepresentation(basisFcn,W0,obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix, and actInfo defines a discrete action space.
  Recommended: rep = rlStochasticActorRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a stochastic actor representation for a discrete action space.

Target update method settings for DQN agents have changed


Behavior change

Target update method settings for DQN agents have changed. The following changes require updates
to your code:


• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.

To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.

• Smoothing: TargetUpdateFrequency equal to 1, TargetSmoothFactor less than 1.
• Periodic: TargetUpdateFrequency greater than 1, TargetSmoothFactor equal to 1.
• Periodic smoothing (new method in R2020a): TargetUpdateFrequency greater than 1, TargetSmoothFactor less than 1.

The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.

Update Code

The following examples show some typical uses of rlDQNAgentOptions and how to update your code to use the new option configuration.

Not recommended:

opt = rlDQNAgentOptions(...
    'TargetUpdateMethod',"smoothing");

Recommended:

opt = rlDQNAgentOptions;

Not recommended:

opt = rlDQNAgentOptions(...
    'TargetUpdateMethod',"periodic");

Recommended:

opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 4;
opt.TargetSmoothFactor = 1;

Not recommended:

opt = rlDQNAgentOptions;
opt.TargetUpdateMethod = "periodic";
opt.TargetUpdateFrequency = 5;

Recommended:

opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 5;
opt.TargetSmoothFactor = 1;

Target update method settings for DDPG agents have changed


Behavior change

Target update method settings for DDPG agents have changed. The following changes require
updates to your code:

• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.

To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.

• Smoothing: TargetUpdateFrequency equal to 1, TargetSmoothFactor less than 1.
• Periodic: TargetUpdateFrequency greater than 1, TargetSmoothFactor equal to 1.
• Periodic smoothing (new method in R2020a): TargetUpdateFrequency greater than 1, TargetSmoothFactor less than 1.

The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code

The following examples show some typical uses of rlDDPGAgentOptions and how to update your code to use the new option configuration.

Not recommended:

opt = rlDDPGAgentOptions(...
    'TargetUpdateMethod',"smoothing");

Recommended:

opt = rlDDPGAgentOptions;

Not recommended:

opt = rlDDPGAgentOptions(...
    'TargetUpdateMethod',"periodic");

Recommended:

opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 4;
opt.TargetSmoothFactor = 1;

Not recommended:

opt = rlDDPGAgentOptions;
opt.TargetUpdateMethod = "periodic";
opt.TargetUpdateFrequency = 5;

Recommended:

opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 5;
opt.TargetSmoothFactor = 1;

getLearnableParameterValues is now getLearnableParameters


Behavior change

getLearnableParameterValues is now getLearnableParameters. To update your code, change the function name from getLearnableParameterValues to getLearnableParameters. The syntaxes are equivalent.

setLearnableParameterValues is now setLearnableParameters


Behavior change

setLearnableParameterValues is now setLearnableParameters. To update your code, change the function name from setLearnableParameterValues to setLearnableParameters. The syntaxes are equivalent.


R2019b

Version: 1.1

New Features

Bug Fixes

Parallel Agent Simulation: Verify trained policies by running multiple agent simulations in parallel
You can now run multiple agent simulations in parallel. If you have Parallel Computing Toolbox™
software, you can run parallel simulations on multicore computers. If you have MATLAB Parallel
Server™ software, you can run parallel simulations on computer clusters or cloud resources. For
more information, see rlSimulationOptions.
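As a minimal sketch (the number of simulations is an illustrative assumption, and env and agent are assumed to already exist), you can request parallel simulations through rlSimulationOptions.

simOpts = rlSimulationOptions('UseParallel',true,'NumSimulations',8);
% experiences = sim(env,agent,simOpts);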

PPO Agent: Train policies using proximal policy optimization algorithm for improved training stability
You can now train policies using proximal policy optimization (PPO). This algorithm is a type of policy
gradient training that alternates between sampling data through environmental interaction and
optimizing a clipped surrogate objective function using stochastic gradient descent. The clipped
surrogate objective function improves training stability by limiting the size of the policy change at
each step.

For more information on PPO agents, see Proximal Policy Optimization Agents.

New Examples: Train reinforcement learning policies for applications such as robotics, automated driving, and control design
The following new examples show how to train policies for robotics, automated driving, and control
design:

• Quadruped Robot Locomotion Using DDPG Agent


• Imitate MPC Controller for Lane Keep Assist

