Reinforcement Learning Toolbox™ Release Notes
R2024b
Version: 24.2
New Features
Bug Fixes
Compatibility Considerations
To create a default discrete action space SAC agent (that is, an agent with default neural networks for
actor and critics), pass to rlSACAgent the observation and action specifications from the
environment. Here the action specification is an rlFiniteSetSpec object. You can also pass an
rlAgentInitializationOptions object or an rlSACAgentOptions object as additional
arguments. For example, agent = rlSACAgent(obsInfo,actInfo).
For an example on how to use a discrete SAC agent, see “Train Discrete Soft Actor Critic Agent for
Lander Vehicle”.
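As a minimal sketch (the observation and action specifications below are illustrative, and the NumHiddenUnit initialization option name is an assumption to verify against rlAgentInitializationOptions):

% Illustrative specifications: a 6-element observation vector and a
% discrete action channel with three possible actions.
obsInfo = rlNumericSpec([6 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Default discrete-action SAC agent, optionally with initialization options.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
agent = rlSACAgent(obsInfo,actInfo,initOpts);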
The action specification of a hybrid action space environment is a vector composed of one
rlFiniteSetSpec object (defining the discrete part of the action space) followed by one
rlNumericSpec object (defining the continuous part of the action space). Therefore, a hybrid
environment must have one discrete action channel and one continuous action channel.
To create a default hybrid action space SAC agent, pass to rlSACAgent the observation and action
specifications from the environment. Here, the action specification must be a vector containing one
rlFiniteSetSpec object followed by one rlNumericSpec object. You can also pass an
rlAgentInitializationOptions or an rlSACAgentOptions object as additional arguments. For
example, agent = rlSACAgent(obsInfo,actInfo).
Alternatively, you can separately create an rlHybridStochasticActor and an array of one or two
rlVectorQValueFunction objects and pass them to rlSACAgent, optionally with an
rlSACAgentOptions object as a third argument. For example, agent = rlSACAgent(actor,
critic).
For use within custom training loops, you can also create an rlHybridStochasticActorPolicy
object from an rlHybridStochasticActor.
For an example on how to use a hybrid SAC agent, see “Train Hybrid SAC Agent for Path Following
Control”.
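As a sketch of how such a hybrid action specification might be assembled (channel sizes and set values are illustrative):

% One discrete action channel followed by one continuous action channel,
% as described above.
actInfo = [rlFiniteSetSpec([1 2 3]) rlNumericSpec([2 1])];
obsInfo = rlNumericSpec([8 1]);     % illustrative observation channel

% Default hybrid action space SAC agent.
agent = rlSACAgent(obsInfo,actInfo);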
Reinforcement Learning Designer App: Support for evaluating agents
during training
You can now evaluate agents during training from Reinforcement Learning Designer. To do so, on
the Train tab, click the Evaluate Agent button to enable agent evaluation and open the Agent
Evaluation Options dialog box. You can then select the appropriate evaluation options in the dialog
box.
For more information on how to set agent evaluation in Reinforcement Learning Designer, see
“Specify Training Options in Reinforcement Learning Designer”.
Additionally, generic reproducibility guidelines are summarized in the last section of the “Train
Reinforcement Learning Agents” topic.
R2024a
Version: 24.1
New Features
Bug Fixes
Compatibility Considerations
For more information, see the option objects of the affected agents. For example,
rlSACAgentOptions.
Compatibility Considerations
Since agents learn differently with the new training algorithms, training results are generally
different than the ones obtained using the training algorithms of previous releases. For the same
reason, data associated with learning (such as actor or critic losses) may be logged at different
iterations than before. For more information, see “Training: New training algorithms lead to
improved training results” on page 2-7.
You can now easily access the state and learnable parameters of function approximator objects using
the new State and Learnables properties. For example, to display the state of the critic myQVF, at
the MATLAB® command line, type myQVF.State. Both properties store values as cell arrays of
dlarray objects. For dlnetwork-based approximators, the State and Learnables properties
correspond to the value column of the State and Learnables tables of the dlnetwork object.
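For instance, a brief sketch using the critic of a default DQN agent (specifications are illustrative):

% Create a default agent and extract its critic.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2 3]);
agent = rlDQNAgent(obsInfo,actInfo);
myQVF = getCritic(agent);

% Both properties are cell arrays of dlarray objects.
myQVF.Learnables
myQVF.State      % empty unless the network contains stateful layers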
The functions evaluate, getAction, getValue, getMaxQValue and predict now return (cell
arrays of) dlarray objects when their inputs are (cell arrays of) dlarray objects. This enables these
functions to be used with automatic differentiation.
Specifically, all these functions can now be used within custom loss functions directly, together with
dlfeval, dlgradient, and dlaccelerate. This makes writing custom loss functions easier.
For these functions, you can now use the UseForward name-value argument, which allows you to
explicitly call a forward pass when computing gradients. Specifying UseForward=true enables
layers such as batch normalization and dropout to change their behavior for training.
You can now use the new name-value arguments ReturnDlarray and ReturnGpuArray with the
functions sample and allExperiences to return data as dlarray or gpuArray (Parallel
Computing Toolbox) directly from a replay object. For example:
replay = rlReplayMemory(obsInfo,actInfo);
[mb,mask,idx,w] = sample(replay,ReturnDlarray=true);
exp = allExperiences(replay,ReturnGpuArray=true);
Compatibility Considerations
accelerate and gradient are no longer recommended.
Instead of using gradient on a function approximator object, write an appropriate loss function that
takes as arguments both the approximation object and its input. In the loss function you typically use
evaluate to calculate the output and dlgradient to calculate the gradient. Then call dlfeval on
your loss function, supplying both the approximator object and its inputs as arguments.
For more information on how to update your code, see “Custom Training Loops: accelerate and
gradient are no longer recommended” on page 2-7. For an example, see Train Reinforcement
Learning Policy Using Custom Training Loop and Custom Training Loop with Simulink Action Noise.
The functions that handle approximator objects now enforce all the output dimensions, including the
batch dimension. When indexing the output of these functions, include the batch and sequence
dimension after the specification dimension. For more information, see “Approximators: Object
functions now enforce correct output dimensions” on page 2-11.
If upper and lower limits are defined for some channels in your observation (and, if needed, action)
specification object, then you can apply normalization, for those channels, at agent creation by
specifying the Normalization initialization option. For example:
opt = rlAgentInitializationOptions(Normalization="rescale-symmetric")
agent = rlDDPGAgent(obsInfo, actInfo, opt);
If no upper and lower limits are defined in the specification object for the channels you want to
normalize, then you can apply normalization by creating a normalization object for each channel that
you want to normalize. For example:
obsNrm1 = rlNormalizer(obsInfo(1), ...
    Normalization="zscore", ...
    Mean=1, ...
    StandardDeviation=2);
obsNrm2 = rlNormalizer(obsInfo(2), ...
    Normalization="zscore", ...
    Mean=2, ...
    StandardDeviation=3);
You then use setNormalizer to apply the normalization objects to the input channels of your
function approximator object. For example:
actor = getActor(agent);
actor = setNormalizer(actor, [obsNrm1, obsNrm2]);
setActor(agent, actor);
You can also set normalizers only for specific input channels. For more information, see
rlNormalizer, getNormalizer, setNormalizer and rlAgentInitializationOptions.
Training and Simulation: New memory management options
You can now improve memory management when training or simulating an agent. Three new options
allow you to select the storage type (for example disk instead of memory) for data generated by a
Simulink® environment. You can use these options to prevent out-of-memory issues during training or
simulation. The new options are the following.
• SimulationStorageType — This option is the type of storage used for environment data
generated during training or simulation. The default value is "memory", indicating that data is
stored in memory. To store environment data to disk instead, set this option to "file". When this
option is set to "none", environment data is not stored.
• SaveSimulationDirectory — This option specifies the directory to save environment data
when SimulationStorageType is set to "file". The default value is "savedSims".
• SaveFileVersion — This option specifies the MAT-file version for environment data files. The
default is "-v7". The other possible options are "-v7.3" and "-v6".
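A brief sketch of how these options might be set together, assuming they are specified through rlTrainingOptions as described above:

% Store environment data generated during training on disk, using
% version 7.3 MAT-files in the default "savedSims" directory.
trainOpts = rlTrainingOptions( ...
    SimulationStorageType="file", ...
    SaveSimulationDirectory="savedSims", ...
    SaveFileVersion="-v7.3");
results = train(agent,env,trainOpts);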
Compatibility Considerations
The SimulationInfo property of an rlTrainingResult or rlMultiAgentTrainingResult
object (both returned by train) is now a SimulationStorage object (unless
SimulationStorageType is set to "none").
The SimulationInfo field of the experience structure (returned by sim) is now also a
SimulationStorage object (unless SimulationStorageType is set to "none").
For more information, see “Training and Simulation Results: SimulationInfo is now a
SimulationStorage object” on page 2-9.
When used for multiagent training, train now returns a new result object,
rlMultiAgentTrainingResult.
Compatibility Considerations
The properties of the training result object that have Episode as part of their name have been
replaced by corresponding properties with Generation in their name.
For more information, see “Train with Evolution Strategy: Episode replaced with Generation in
property names and values” on page 2-6.
For more information, see “Evolution Strategy Training Results: SimulationInfo is now an
EvolutionStrategySimulationStorage object” on page 2-10. For more information on the
SimulationInfo returned by train for multiagent environments, see “Training and Simulation
Results: SimulationInfo is now a SimulationStorage object” on page 2-9.
• A new plot type, Multiple Bars, is available in the Reinforcement Learning Data Viewer.
• You can now set the StopTrainingCriteria training option to "None" from the Reinforcement
Learning Designer.
• You can now export the training plot to a MATLAB figure using the Export Plot button in the
Reinforcement Learning Training Monitor.
Data Logging: Additional learning data now available for logging

For DQN, DDPG, TD3, and SAC agents, the additional fields are: TDTarget, TDError, SampleIndex,
MaskIndex, ActorGradientStepCount, and CriticGradientStepCount.
For PPO agents, the additional fields are: TDTarget, TDError, Advantage, PolicyRatio,
AdvantageLoss, EntropyLoss, ActorGradientStepCount, and CriticGradientStepCount.
For TRPO agents, the additional fields are: TDTarget, TDError, Advantage,
ActorGradientStepCount, CriticGradientStepCount.
For Q, SARSA, AC, and PG agents, the additional fields are: ActorGradientStepCount,
CriticGradientStepCount.
trainWithEvolutionStrategy now returns a training result object with the following properties.
GenerationIndex: [500×1 double]
GenerationReward: [500×1 double]
AverageReward: [500×1 double]
Q0: [500×1 double]
SimulationInfo: [1×1 rl.storage.EvolutionStrategySimulationStorage]
TrainingOptions: [1×1 rl.option.rlEvolutionStrategyTrainingOptions]
The properties that had Episode as part of their name have been replaced by corresponding
properties with Generation in their name. For example, EpisodeReward is replaced by
GenerationReward. Using the older names is not recommended.
Training: New training algorithms lead to improved training results

Since agents learn differently with the new training algorithms, training results are generally
different than the ones obtained using the training algorithms of previous releases. For the same
reason, data associated with learning (such as actor or critic losses) may be logged at different
iterations than before.
Custom Training Loops: accelerate and gradient are no longer recommended

Instead of using gradient on a function approximator object, write an appropriate loss function that
takes as arguments both the approximator object and its inputs. In the loss function, you typically use
evaluate to calculate the output and dlgradient to calculate the gradient. Then call dlfeval,
supplying both the approximator object and its inputs as arguments.
For example, to compute the gradient of the sum of the actor outputs with respect to the actor inputs:

Not recommended:

g = gradient(actor,"output-input",u);
g{1}

Recommended:

g = dlfeval(@myOIGFcn,actor,dlarray(u));
g{1}

where:

function g = myOIGFcn(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1});
    g = dlgradient(loss,u);
end

Similarly, to compute the gradient of the sum of the actor outputs with respect to the learnable parameters:

Not recommended:

g = gradient(actor,"output-parameters",u);
g{1}

Recommended:

g = dlfeval(@myOPGFcn,actor,dlarray(u));
g{1}

where:

function g = myOPGFcn(actor,u)
    y = evaluate(actor,u);
    loss = sum(y{1});
    g = dlgradient(loss,actor.Learnables);
end

Likewise, instead of using gradient with a custom loss function, as in:

g = gradient(actor,@customLoss23b,u);

call dlfeval with a loss function rewritten to take the approximator object and its inputs (as dlarray objects) as arguments:

g = dlfeval(@customLoss24a,actor,dlarray(u));
For more information on using dlarray objects for custom deep learning training loops, see
dlfeval, AcceleratedFunction, dlaccelerate.
rlSACAgentOptions: the CriticUpdateFrequency and NumGradientStepsPerUpdate
properties are no longer effective
Behavior change
Training and Simulation Results: SimulationInfo is now a SimulationStorage object

The SimulationInfo field of the experience structure (returned by sim) is now also a
SimulationStorage object (unless SimulationStorageType is set to "none").
The SimulationStorage object has the read-only StorageType and NumSimulations properties,
indicating the storage type and the total number of episodes, respectively. This object also contains
environment information collected during simulation, which you can access by indexing into the
object using the episode number.
For example, if res is an rlTrainingResult object returned by train, you can access the
simulation information related to the second episode as:
mySimInfo2 = res.SimulationInfo(2);
Consider a Simulink environment that logs its states as xout over 10 episodes. Previously, if res was
the rlTrainingResult object returned by train, you could pack the environment simulation
information for all episodes in a single array as:

[res.SimulationInfo.xout]

Similarly, previously you could access the simulation information related to the first episode as:

res.SimulationInfo.xout

Now, index into the SimulationStorage object with the episode number instead. For example, to
access the simulation information related to the fifth episode:

res.SimulationInfo(5).xout

Evolution Strategy Training Results: SimulationInfo is now an EvolutionStrategySimulationStorage object

The SimulationInfo property of the result object returned by trainWithEvolutionStrategy is
now an EvolutionStrategySimulationStorage object. Previously, you could access the
simulation information related to the first generation as:

res.SimulationInfo.xout

Now, index into the object with the generation number instead:

res.SimulationInfo(1).xout
The Reinforcement Learning Episode Manager has been renamed to Reinforcement Learning
Training Monitor.
You can no longer set the Plots training option in the Reinforcement Learning Designer app. The
training visualization is now always displayed.
Approximators: Object functions now enforce correct output dimensions

The evaluation functions for approximator objects (for example, getAction and evaluate) now
enforce all output dimensions, including trailing singleton dimensions for cases like column vector
specifications. The first n dimensions are consistent with the specification dimensions, dimension
n+1 corresponds to the batch dimension, and dimension n+2 corresponds to the sequence dimension.
For example, suppose the output of an approximator method, gro_batch{1}, has dimensions 4, 1, 5,
and 9, which means that it has channel dimensions of 4 and 1, as well as five independent sequences,
each one made of nine sequential observations.

Previously, you accessed the third observation element of the first input channel, after the seventh
sequential observation in the fourth independent batch, as:

gro_batch{1}(3,4,7)

Now, you must also include the singleton channel dimension:

gro_batch{1}(3,1,4,7)
R2023b
Version: 23.2
New Features
Bug Fixes
Compatibility Considerations
You can create two different kinds of custom multi-agent MATLAB environments:
• Multi-agent environments with universal sample time, in which all agents execute in the same
step. For more information on custom multi-agent function environments with universal sample
time, see rlMultiAgentFunctionEnv.
• Turn-based environments, in which agents execute in turns. Specifically, the environment assigns
execution to only one group of agents at a time, and the group executes when it is its turn to do
so. For more information on custom turn-based multi-agent function environments, see
rlTurnBasedFunctionEnv. For an example, see Train Agent to Play Turn-Based Game.
Once you create your custom multiagent MATLAB environment, you can train and simulate your
agents with it using train and sim, respectively.
To configure evaluation options for your agents, first create an evaluator object using rlEvaluator.
You can specify properties such as the type of evaluation statistic, the frequency at which evaluation
episodes occur, or whether exploration is allowed during an evaluation episode.
To train the agents and evaluate them during training, pass this object to train.
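A minimal sketch, assuming an existing agent, environment, and training options (the EvaluationFrequency and NumEpisodes properties and the Evaluator argument of train are names to verify in the documentation):

% Evaluate the agent every 25 training episodes, averaging the statistic
% over 5 evaluation episodes.
evl = rlEvaluator(EvaluationFrequency=25,NumEpisodes=5);
results = train(agent,env,trainOpts,Evaluator=evl);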
You can also create a custom evaluator object, which uses a custom evaluation function that you
supply. To do so, use rlCustomEvaluator.
The EvaluationStatistic option is available when an evaluator object is used with training. When
using this option, the agent is saved (or the training is stopped) when the evaluation statistic supplied
by the evaluator object equals or exceeds the value specified in SaveAgentValue (or
StopTrainingValue).
The Custom option allows you to save the agent (or stop the training) using a custom function that
you supply with the SaveAgentValue (or StopTrainingValue) argument. When using this option,
the agent is saved (or the training is stopped) when the custom function returns true.
The LoggedSignals property of the rlFunctionEnv object is no longer active and will be removed
in a future release. To pass information from one step to the next, use the Info property instead.
R2023a
Version: 2.4
New Features
Bug Fixes
To deal with possible differences between the probability distribution of the dataset and the one
generated by the environment, use the batch data regularization options provided for off-policy
agents. For more information, see the new BatchDataRegularizerOptions property of the off-
policy agent options objects, as well as the new rlBehaviorCloningRegularizerOptions and
rlConservativeQLearningOptions options objects.
By default, built-in off-policy agents use an rlReplayMemory object as their experience buffer.
Agents uniformly sample data from this buffer. To perform uniform or nonuniform hindsight
experience replay, replace the default experience buffer with one of the following objects.
• rlHindsightReplayMemory, which uses uniform sampling.
• rlHindsightPrioritizedReplayMemory, which uses prioritized (nonuniform) sampling.
Hindsight experience replay does not support agents that use recurrent neural networks.
Replay Memory: Validate experiences before adding to replay memory
buffer
You can now validate experiences before adding them to a replay memory buffer using the
validateExperience function. If the experiences are not compatible with the replay memory,
validateExperience generates an error message in the MATLAB command window.
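A short sketch (specifications and field values are illustrative):

% Replay memory for a 4-element observation and a discrete action.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2]);
buffer = rlReplayMemory(obsInfo,actInfo,10000);

% One experience in the documented structure format.
exp.Observation     = {rand(4,1)};
exp.Action          = {1};
exp.Reward          = 1;
exp.NextObservation = {rand(4,1)};
exp.IsDone          = 0;

validateExperience(buffer,exp)   % errors if exp is incompatible
append(buffer,exp);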
R2022b
Version: 2.3
New Features
Bug Fixes
You can automatically generate and configure a Policy block using either the new function
generatePolicyBlock or the new button on the RL Agent block mask.
For more information, see generatePolicyBlock and the Policy block. For an example, see
Generate Policy Block for Deployment.
For more information, see rlDataLogger. For an example, see Log Training Data To Disk.
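A minimal logging sketch, assuming an existing agent, environment, and training options (the EpisodeFinishedFcn callback and the EpisodeInfo.CumulativeReward field it reads are assumptions to verify against the rlDataLogger documentation):

% Log the cumulative reward of each episode to disk.
logger = rlDataLogger();     % file-based logger by default
logger.EpisodeFinishedFcn = @(data) ...
    struct("EpisodeReward",data.EpisodeInfo.CumulativeReward);
results = train(agent,env,trainOpts,Logger=logger);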
Prioritized experience replay does not support agents that use recurrent neural networks.
R2022a
Version: 2.2
New Features
Bug Fixes
Compatibility Considerations
You can represent actor and critic functions using six new approximator objects. These objects
replace the previous representation objects and improve efficiency, readability, scalability, and
flexibility.
When creating a critic or an actor, you can now select and update optimization options using the new
rlOptimizerOptions object, instead of using the older rlRepresentationOptions object.
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent.
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;.
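A sketch of this workflow for an AC agent, assuming the actor and critic objects already exist (option values are illustrative):

criticOpts = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOpts  = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);

agentOpts = rlACAgentOptions( ...
    CriticOptimizerOptions=criticOpts, ...
    ActorOptimizerOptions=actorOpts);

agent = rlACAgent(actor,critic,agentOpts);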
New Policy Objects for Custom Agents and Custom Training Loops
To implement a customized agent, you can instantiate a policy using the following new policy objects.
• rlMaxQPolicy — This object implements a policy that selects the action that maximizes a
discrete state-action value function.
• rlEpsilonGreedyPolicy — This object implements a policy that selects the action that
maximizes a discrete state-action value function with probability 1-Epsilon, otherwise selects a
random action.
• rlDeterministicActorPolicy — This object implements a deterministic policy that you can use
to implement custom agents with a continuous action space.
• rlAdditiveNoisePolicy — This object is similar to rlDeterministicActorPolicy but noise
is added to the output according to an internal noise model.
• rlStochasticActorPolicy — This object implements a stochastic policy, which samples actions
from a probability distribution. You can use it to implement custom agents.
For more information on these policy objects, at the MATLAB command line, type help followed by
the policy object name.
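For instance, a greedy policy wrapper around an existing discrete-action critic (assumed to be an rlQValueFunction or rlVectorQValueFunction) might look like this:

% Policy that always selects the highest-value action.
policy = rlMaxQPolicy(critic);

% Greedy action for a single, illustrative 4-element observation.
action = getAction(policy,{rand(4,1)});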
You can use the new rlReplayMemory object to append, store, save, sample and replay experience
data. Doing so makes it easier to implement custom training loops and your own reinforcement
learning algorithms.
When creating a customized training loop or agent you can also access optimization features using
the objects created by the new rlOptimizer function. Specifically, create an optimizer algorithm
object using rlOptimizer, and optionally use dot notation to modify its properties. Then, create a
structure and set its CriticOptimizer or ActorOptimizer field to the optimizer object. When you
call runEpisode, pass the structure as an input parameter. The runEpisode function can then use
the update method of the optimizer object to update the learnable parameters of your actor or critic.
For more information, see Custom Training Loop with Simulink Action Noise and Train
Reinforcement Learning Policy Using Custom Training Loop.
Compatibility Considerations
The following representation objects are no longer recommended:
• rlValueRepresentation
• rlQValueRepresentation
• rlDeterministicActorRepresentation
• rlStochasticActorRepresentation
For more information on how to update your code to use the new objects, see “Representation objects
are not recommended” on page 6-5.
• Create an internal environment model for a model-based policy optimization (MBPO) agent. For
more information on MBPO agents, see Model-Based Policy Optimization Agents.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data. Depending on your application,
using a neural network environment as an approximation of a more complex first-principle
environment can speed up your simulation and training.
For more information on creating MBPO agents, see Model-Based Policy Optimization Agents.
Centralized learning boosts exploration and facilitates learning in applications where the agents
perform a collaborative (or the same) task.
For more information on creating training options set for multiple agents, see
rlMultiAgentTrainingOptions.
For more information and examples, see the train reference page.
For more information, see the SampleTime property of any agent options object. For more
information on conditionally executed subsystems, see Conditionally Executed Subsystems Overview
(Simulink).
In such cases, to improve learning results, you can now enable an input port to connect the last
action signal applied to the environment.
For more information on creating agents using Reinforcement Learning Designer, see Create Agents
Using Reinforcement Learning Designer.
• Train Reinforcement Learning Agents To Control Quanser QUBE™ Pendulum — Train a SAC agent
to generate a swing-up reference trajectory for an inverted pendulum and a PPO agent as a mode-
selection controller.
• Run SIL and PIL Verification for Reinforcement Learning — Perform software-in-the-loop and
processor-in-the-loop verification of trained reinforcement learning agents.
• Train SAC Agent for Ball Balance Control — Control a Kinova robot arm to balance a ball on a
plate using a SAC agent.
• Automatic Parking Valet with Unreal Engine Simulation — Implement a hybrid reinforcement
learning and model predictive control system that searches a parking lot and parks in an open
space.
Functions to create representation objects are no longer recommended. Depending on the type of
actor or critic being created, use one of the following objects instead.
• For rlValueRepresentation, use rlValueFunction.
• For rlQValueRepresentation, use rlQValueFunction (single-output critics) or
rlVectorQValueFunction (multi-output critics).
• For rlDeterministicActorRepresentation, use rlContinuousDeterministicActor.
• For rlStochasticActorRepresentation, use rlDiscreteCategoricalActor (discrete action
spaces) or rlContinuousGaussianActor (continuous action spaces).
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.
rlRepresentationOptions (not recommended):

crtOpts = rlRepresentationOptions('GradientThreshold',1);

rlOptimizerOptions (recommended):

criticOpts = rlOptimizerOptions('GradientThreshold',1);
agentOpts = rlACAgentOptions('CriticOptimizerOptions',criticOpts);
agent = rlACAgent(actor,critic,agentOpts)
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.
The following table shows some typical uses of the representation objects to create neural network-
based critics and actors, and how to update your code with one of the new function approximator
objects instead.
The following table shows some typical uses of the representation objects to express table-based
critics with discrete observation and action spaces, and how to update your code with one of the new
objects instead.
The following table shows some typical uses of the representation objects to create critics and actors
which use a (linear in the learnable parameters) custom basis function, and how to update your code
with one of the new objects instead. In these function calls, the first input argument is a two-element
cell array containing both the handle to the custom basis function and the initial weight vector or
matrix.
For more information on the new approximator objects, see rlTable, rlValueFunction,
rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor,
rlDiscreteCategoricalActor, and rlContinuousGaussianActor.
The train function now returns an object or an array of objects as the output. The properties of the
object match the fields of the structure returned in previous versions. Therefore, the code based on
dot notation works in the same way.
trainStats = train(agent,env,trainOptions);
When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.
The rlTrainingResult object contains the same training statistics previously returned in a
structure along with data to correctly recreate the training scenario and update the episode manager.
You can use trainStats as third argument for another train call, which (when executed with the
same agents and environment) will cause training to resume from the exact point at which it stopped.
For more information and examples, see train and “Training: Stop and resume agent training” on
page 6-4. For more information on training agents, see Train Reinforcement Learning Agents.
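For example, a minimal resume-training sketch:

trainStats = train(agent,env,trainOptions);   % initial training run
% ... later, with the same agent and environment ...
trainStats = train(agent,env,trainStats);     % resumes where training stopped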
Training Parallelization Options: DataToSendFromWorkers and StepsUntilDataIsSent
properties are no longer active

Attempting to set either of these properties will cause a warning. For more information, see
barrierPenalty.
The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.
R2021b
Version: 2.1
New Features
Bug Fixes
Compatibility Considerations
• Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc
(Model Predictive Control Toolbox) controller object. This feature requires Model Predictive
Control Toolbox™ software.
• Performance constraints defined in Simulink Design Optimization™ model verification blocks.
Compatibility Considerations
• getModel now returns a dlnetwork object.
• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).
For more information on creating TRPO agents, see rlTRPOAgent and rlTRPOAgentOptions.
PPO Agents: Improve agent performance by normalizing advantage
function
In some environments, you can improve PPO agent performance by normalizing the advantage
function during training. The agent normalizes the advantage function by subtracting the mean
advantage value and scaling by the standard deviation.
To enable advantage normalization, set the new NormalizedAdvantageMethod agent option to one
of the following values.
• "current" — Normalize the advantage function using the advantage function mean and standard
deviation for the current mini-batch of experiences.
• "moving" — Normalize the advantage function using the advantage function mean and standard
deviation for a moving window of recent experiences. To specify the window size, set the
AdvantageNormalizingWindow option.
For example, configure the agent options to normalize the advantage function using the mean and
standard deviation from the last 500 experiences.
opt = rlPPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 500;
For more information on PPO agents, see Proximal Policy Optimization Agents.
The built-in agents now use dlnetwork objects as actor and critic representations. In most cases this
allows for a speedup of about 30%.
R2021a
Version: 2.0
New Features
Bug Fixes
Compatibility Considerations
To open the Reinforcement Learning Designer app, at the MATLAB command line, enter the
following:
reinforcementLearningDesigner
RNNs are deep neural networks with a sequenceInputLayer input layer and at least one layer that
has hidden state information, such as an lstmLayer. These networks can be especially useful when
the environment has states that are not included in the observation vector.
For more information on creating agents with RNNs, see rlDQNAgent, rlPPOAgent, and the
Recurrent Neural Networks section in Create Policy and Value Function Representations.
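As an illustrative sketch, a recurrent critic network for a DQN agent with a 4-element observation and three discrete actions might be assembled as follows (layer sizes are arbitrary):

criticNet = [
    sequenceInputLayer(4,'Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer('Name','relu1')
    lstmLayer(16,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(3,'Name','qValues')];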
For more information on creating policies and value functions, see rlValueRepresentation,
rlQValueRepresentation, rlDeterministicActorRepresentation, and
rlStochasticActorRepresentation.
You can also use this input port to override the agent action for safe learning applications.
inspectTrainingResult Function: Plot training information from a
previous training session
You can now plot the saved training information from a previous reinforcement learning training
session using the inspectTrainingResult function.
By default, the train function shows the training progress and results in the Episode Manager. If you
configure training to not show the Episode Manager or you close the Episode Manager after training,
you can view the training results using the inspectTrainingResult function, which opens the
Episode Manager.
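For example:

trainStats = train(agent,env,trainOpts);   % or load previously saved training results
inspectTrainingResult(trainStats)          % reopens the Episode Manager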
Deterministic Exploitation: Create PG, AC, PPO, and SAC agents that
use deterministic actions during simulation and in generated policy
functions
PG, AC, PPO, and SAC agents generate stochastic actions during training. By default, these agents
also use stochastic actions during simulation and deployment. You can now configure these agents to use
deterministic actions during simulations and in generated policy function code.
To enable deterministic exploitation, in the corresponding agent options object, set the
UseDeterministicExploitation property to true. For more information, see
rlPGAgentOptions, rlACAgentOptions, rlPPOAgentOptions, or rlSACAgentOptions.
For more information on simulating agents and generating policy functions, see sim and
generatePolicyFunction, respectively.
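A short sketch for an existing trained agent and environment:

% Take deterministic actions during simulation and in generated policy code.
agent.AgentOptions.UseDeterministicExploitation = true;

simOpts = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOpts);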
New Examples: Train agent with constrained actions and use DQN
agent for optimal scheduling
This release includes the following new reference examples.
• Water Distribution System Scheduling Using Reinforcement Learning — Train a DQN agent to
learn an optimal pump scheduling policy for a water distribution system.
• Train Reinforcement Learning Agent with Constraint Enforcement — Train an agent with critical
constraints enforced on its actions.
The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.
When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the
value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square root
of the values of Variance and VarianceMin are copied to StandardDeviation and
StandardDeviationMin, respectively.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
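As a hedged sketch of the rename (assuming the Gaussian noise options live under the ExplorationModel property of the TD3 agent options object):

td3opt = rlTD3AgentOptions;

% Not recommended (old property names):
%   td3opt.ExplorationModel.Variance = 0.05;
%   td3opt.ExplorationModel.VarianceDecayRate = 1e-5;

% Recommended (new property names):
td3opt.ExplorationModel.StandardDeviation = sqrt(0.05);
td3opt.ExplorationModel.StandardDeviationDecayRate = 1e-5;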
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG and TD3 agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlDDPGAgentOptions
object ddpgopt and rlTD3AgentOptions object td3opt.
R2020b
Version: 1.3
New Features
Bug Fixes
Compatibility Considerations
For examples on training multiple agents, see Train Multiple Agents to Perform Collaborative Task,
Train Multiple Agents for Area Coverage, and Train Multiple Agents for Path Following Control.
You can create a SAC agent using the rlSACAgent function. You can also create a SAC-specific
options object with the rlSACAgentOptions function.
Default agents are available for DQN, DDPG, TD3, PPO, PG, AC, and SAC agents. For each agent, you
can call the agent creation function, passing in the observation and action specifications from the
environment. The function creates the required actor and critic representations using deep neural
network approximators.
You can specify initialization options (such as the number of hidden units for each layer, or whether to
use a recurrent neural network) for the default representations using an
rlAgentInitializationOptions object.
After creating a default agent, you can then access its properties and change its actor and critic
representations.
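A short sketch (specifications and sizes are illustrative, and the NumHiddenUnit option name is an assumption):

% Illustrative environment specifications.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],'LowerLimit',-1,'UpperLimit',1);

% Default TD3 agent whose hidden layers have 64 units each.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',64);
agent = rlTD3Agent(obsInfo,actInfo,initOpts);

% Access (and, if needed, modify) the default critic representation.
critic = getCritic(agent);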
getModel and setModel Functions: Access computational model used
by actor and critic representations
You can now access the computational model used by the actor and critic representations in a
reinforcement learning agent using the following new functions.
Using these functions, you can modify the computational model in a representation object without
recreating the representation.
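For example, for an existing critic representation:

% Extract the network used by the critic, modify it, and reinsert it.
net = getModel(critic);
% ... edit net, for example in Deep Network Designer ...
critic = setModel(critic,net);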
• Create Agent for Custom Reinforcement Learning Algorithm — Create a custom agent for your
own custom reinforcement learning algorithm.
• Tune PI Controller using Reinforcement Learning — Tune a PI controller using the twin-delayed
deep deterministic policy gradient (TD3) reinforcement learning algorithm.
• Train PPO Agent for Automatic Parking Valet — Train a PPO agent to automatically search for a
parking space and park.
• Train DDPG Agent for PMSM Control — Train a DDPG agent to control the speed of a permanent
magnet synchronous motor.
For AC agents, the default value of the NumStepsToLookAhead option is now 32.
To use the previous default value instead, create an rlACAgentOptions object and set the option
value to 1.
opt = rlACAgentOptions;
opt.NumStepsToLookAhead = 1;
R2020a
Version: 1.2
New Features
Bug Fixes
Compatibility Considerations
These objects allow you to easily implement custom training loops for your own reinforcement
learning algorithms. For more information, see Train Reinforcement Learning Policy Using Custom
Training Loop.
Compatibility Considerations
The rlRepresentation function is no longer recommended. Use one of the four new objects
instead. For more information, see “rlRepresentation is not recommended” on page 10-3.
Recurrent Neural Networks: Train DQN and PPO agents with recurrent
deep neural network policies and value functions
You can now train DQN and PPO agents using recurrent neural network policy and value function
representations. For more information, see rlDQNAgent, rlPPOAgent, and Create Policy and Value
Function Representations.
Softplus Layer: Create deep neural network layer using the softplus
activation function
You can now use the new softplusLayer layer when creating deep neural networks. This layer
implements the softplus activation function Y = log(1 + e^X), which ensures that the output is always
positive. This activation function is a smooth continuous version of reluLayer.
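For instance, a network branch that must output a positive standard deviation for a Gaussian actor might end with a softplusLayer (layer sizes are illustrative):

stdPath = [
    fullyConnectedLayer(16,'Name','fc_std')
    reluLayer('Name','relu_std')
    fullyConnectedLayer(1,'Name','fc_std_out')
    softplusLayer('Name','std_out')];   % guarantees a positive output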
• Train PPO Agent to Land Rocket — Train a PPO agent to land a rocket in an environment with a
discrete action space.
• Train DDPG Agent with Pretrained Actor Network — Train a DDPG agent using an actor network
that has been previously trained using supervised learning.
• Imitate Nonlinear MPC Controller for Flying Robot — Train a deep neural network to imitate a
nonlinear MPC controller.
The following table shows some typical uses of the rlRepresentation function to create neural
network-based critics and actors, and how to update your code with one of the new objects instead.
The following table shows some typical uses of the rlRepresentation objects to express table-
based critics with discrete observation and action spaces, and how to update your code with one of
the new objects instead.
Not recommended: rep = rlRepresentation(tab), with tab containing a Q-value table with as many
rows as the possible observations and as many columns as the possible actions.

Recommended: rep = rlQValueRepresentation(tab,obsInfo,actInfo). Use this syntax to create a
single-output state-action value representation for a critic that takes both observation and action as
input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.
The following table shows some typical uses of the rlRepresentation function to create critics and
actors which use a custom basis function, and how to update your code with one of the new objects
instead. In the recommended function calls, the first input argument is a two-element cell array
containing both the handle to the custom basis function and the initial weight vector or matrix.
Target update method settings for DQN agents have changed. The following changes require updates
to your code:
• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the
new option configuration.
Target update method settings for DDPG agents have changed. The following changes require
updates to your code:
• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
For example, for periodic smoothing (a new update method in R2020a), set TargetUpdateFrequency
to a value greater than 1 and TargetSmoothFactor to a value less than 1.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDDPGAgentOptions and how to update your code to use the
new option configuration.
R2019b
Version: 1.1
New Features
Bug Fixes
For more information on PPO agents, see Proximal Policy Optimization Agents.