doi: 10.20944/preprints202403.0914.v1
Copyright: This is an open access article distributed under the Creative Commons
Attribution License which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Article
Abstract: The numerical implementation of controllers makes it easy to use very complex control
algorithms. In practice, however, owing to its proven advantages, the PID controller (and its
variants) remains widely used in industrial control systems as well as in many other applications
that require continuous control. Most methods for tuning the parameters of PID controllers are
based on time-invariant linear models of the processes, which in practice can lead to poor
performance of the control system. This paper presents an application of reinforcement learning
algorithms to the tuning of PID controllers for the control of some classes of continuous nonlinear
systems. The PID parameters are tuned using the Twin Delayed Deep Deterministic Policy
Gradient (TD3) algorithm, which offers a series of advantages over other similar machine learning
methods dedicated to continuous state and action spaces. TD3 is an off-policy, actor-critic method
and was chosen because it does not require a model of the system. The presented technique is
applied to the control of a biotechnological system with strongly nonlinear dynamics. The proposed
tuning method is compared with classical PID tuning methods, and its performance is
demonstrated through simulations illustrating the effectiveness of the proposed methodology.
1. Introduction
Despite the existence of a wide range of advanced control methods, most industrial processes
use classical PID-type control laws as control methods. This is due to the robustness of these control
laws to disturbances, to modeling errors or to the time variation of various parameters, but also due
to the simplicity of implementation on both analog and digital devices.
Here are some specific examples of using PID controllers:
- In industrial processes, PID controllers are used to control temperature, pressure, level and
other important variables. They can help keep these variables within safe and effective limits [1,2].
- In the aerospace industry, PID controllers are used to control the flight of aircraft. They can
help keep the aircraft straight and on course, even in difficult conditions [3].
- In the automotive industry, PID controllers are used to control the engine, transmission, and
other systems. They can help improve vehicle performance, efficiency and safety [4].
Tuning of classic PID controllers consists of setting the values of only three parameters, Kp, Ki
and Kd, corresponding to the three actions specified in the name of these controllers: proportional
(P), integral (I) and derivative (D). Although for linear systems the tuning of a PID controller is
straightforward using various methods [5], the practical implementation raises numerous problems
because real systems are non-linear (or the linearity zone is very narrow) or with variable parameters.
Also, some processes allow aggressive controls while others require smooth controls. Thus, the
selection of parameters must be carried out according to the specific characteristics of the process,
which is why, in industrial practice, various approaches have been developed for tuning the
parameters of PID controllers [6]. The simplest tuning method is trial and error. In this method, the
parameters of the controller are adjusted according to the response of the system to various test
signals. It is a method that requires experienced personnel and can lead to equipment failure if not
performed carefully. Also, in this category of methods, we can include tuning using the
Ziegler–Nichols frequency response method [7]. In this case, the closed-loop system is brought to the stability
limit and the parameters are adjusted based on practical rules. Another category of methods is based
on obtaining a simplified first-order plus time delay model (FOPTD) and using preset tuning
formulas to adjust the aggressiveness and robustness of the response to various types of input signals.
While PID controllers are particularly effective for linear systems, they can also be applied to
certain nonlinear systems with some limitations. The challenge with nonlinear systems is that their
behavior may vary across different operating points, and a fixed set of PID parameters may not
provide optimal control across the entire range of system dynamics [8]. In some cases, nonlinearities
can lead to instability or poor performance when using a standard PID controller.
However, there are several strategies to use PID controllers with nonlinear systems:
Linearization: One approach is to linearize the system around an operating point and design a
PID controller for that linearized model. This can work well if the nonlinearities are relatively small
within the operating range.
Gain Scheduling: Another method is gain scheduling, where different sets of PID parameters are
used for different operating conditions or operating points in the system's state space. The controller
parameters are adjusted based on the system's nonlinearities.
Adaptive Control: Adaptive control techniques can be employed to adjust the PID parameters
online based on the changing characteristics of the nonlinear system. This requires a feedback
mechanism to continuously update the controller parameters.
This paper presents a method based on Machine Learning for tuning the parameters of PID
controllers used for the automatic control of some classes of nonlinear systems. The development of machine
learning methods in the field of artificial intelligence has had a strong echo in the field of
process control. Thus, the best-known algorithms in the field of Reinforcement Learning (the
field of AI closest to automatic control) have been tested and implemented in control system
applications [9,10]. In particular, actor-critic methods have been intensively used for system
control and optimization problems. Reinforcement learning (RL) is a type of machine learning where
an agent learns to behave in an environment by trial and error. The agent receives rewards for taking
actions that lead to desired outcomes, and punishments for taking actions that lead to undesired
outcomes. Over time, the agent learns to take actions that maximize its rewards. RL is a powerful
technique that can be used to solve a wide variety of problems, including game playing, robotics, and
finance [11].
RL is a challenging field of machine learning, but it is also one of the most promising, with the
potential to solve problems that are beyond the reach of other machine learning techniques. The
key concepts of reinforcement learning are the following (see also Figure 1; a small illustrative code sketch follows the list):
• Agent: The agent is the entity that is learning to behave in an environment. The most common
structure for the agent is composed of two elements: Critic and Actor. The critic estimates the
expected cumulative reward (value) associated with being in a certain state and following the policy
defined by the actor. The actor is responsible for learning and deciding the optimal policy – the
mapping from states to actions. It is essentially the decision-maker or policy function.
• Environment: The environment is the world that the agent interacts with.
• State: The state is the current condition of the environment.
• Action: An action is something that the agent can do in the environment.
• Reward: A reward is a signal that indicates whether an action was good or bad.
• Policy: A policy is a rule that tells the agent what action to take in a given state.
• Value function: A value function is a measure of how good it is to be in a given state.
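To make these concepts concrete, the following minimal Python sketch (our own illustration, not part of the paper's implementation) runs one episode of the agent-environment loop on a toy setpoint-tracking task; the environment dynamics, the fixed policy and all numerical values are hypothetical.

```python
import numpy as np

# Toy agent-environment loop illustrating the concepts above (our own example):
# the environment is a 1-D setpoint-tracking task, the policy is a fixed rule,
# and the reward penalizes the distance to the setpoint.
rng = np.random.default_rng(0)
SETPOINT = 1.0

def env_step(state, action):
    """Environment: returns the next state and the reward for (state, action)."""
    next_state = state + action + 0.01 * rng.standard_normal()
    return next_state, -abs(SETPOINT - next_state)

def policy(state):
    """A hand-written policy: move toward the setpoint."""
    return 0.1 if state < SETPOINT else -0.1

state, episode_return = 0.0, 0.0
for k in range(50):                           # one episode of 50 steps
    action = policy(state)                    # the agent acts ...
    state, reward = env_step(state, action)   # ... the environment responds
    episode_return += reward                  # cumulative reward collected by the agent
print(f"final state {state:.3f}, episode return {episode_return:.2f}")
```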
The basic idea of RL-type algorithms is to improve their policy, and, from this point of view, two
main approaches have been developed: on-policy and off-policy learning. On-policy and off-policy
are two categories of reinforcement learning algorithms that differ in how they use collected data for
learning. The key distinction lies in whether the learning policy (the policy being optimized) is the
Preprints.org (www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 March 2024 doi:10.20944/preprints202403.0914.v1
same as the policy used to generate the data [12]. In on-policy algorithms, the learning agent follows
a specific policy while collecting data, and this policy is the one being improved over time (the data
used for learning (experience) comes from the same policy that is being updated). In off-policy
algorithms, the learning agent has its own exploration policy for collecting data, but it learns and
improves a different target policy (the data used for learning can come from a different (possibly
older) policy than the one being updated). Representative examples are SARSA (State-Action-
Reward-State-Action) for on-policy learning and Q-learning for off-policy learning.
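The difference can be illustrated with the classical tabular updates. The sketch below is our own example with a hypothetical transition and reward model: both tables are updated on the same stream of experience, SARSA bootstrapping on the action the behaviour policy actually executes next and Q-learning bootstrapping on the greedy action.

```python
import numpy as np

# SARSA (on-policy) vs. Q-learning (off-policy) on the same experience stream.
n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.95, 0.2
Q_sarsa = np.zeros((n_states, n_actions))
Q_qlearn = np.zeros((n_states, n_actions))
rng = np.random.default_rng(1)

def eps_greedy(Q, s):
    return rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))

def step(s, a):
    # hypothetical transition/reward model, just to make the updates concrete
    s_next = (s + (1 if a == 1 else -1)) % n_states
    return s_next, 1.0 if s_next == n_states - 1 else 0.0

s = 0
a = eps_greedy(Q_sarsa, s)
for k in range(1000):
    s_next, r = step(s, a)
    a_next = eps_greedy(Q_sarsa, s_next)   # the action that will actually be executed next
    # SARSA: target uses the behaviour policy's next action (on-policy)
    Q_sarsa[s, a] += alpha * (r + gamma * Q_sarsa[s_next, a_next] - Q_sarsa[s, a])
    # Q-learning: target uses the greedy next action, regardless of behaviour (off-policy)
    Q_qlearn[s, a] += alpha * (r + gamma * Q_qlearn[s_next].max() - Q_qlearn[s, a])
    s, a = s_next, a_next
print("greedy actions (SARSA):", Q_sarsa.argmax(axis=1), "(Q-learning):", Q_qlearn.argmax(axis=1))
```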
Reinforcement learning algorithms for continuous states have evolved significantly over the
years, driven by the need to address real-world problems with continuous and high-dimensional
state spaces. The early days of RL were dominated by discrete state and action spaces. Dynamic
programming algorithms, built around the Bellman equation, were effective for solving problems with
small, discrete state spaces. However, they were not suitable for continuous spaces due to the curse
of dimensionality. To handle continuous states, researchers introduced function approximation
techniques. Value function approximation using methods like tile coding and coarse coding helped
extend RL to continuous state spaces. This approach laid the foundation for handling larger and more
complex state representations. Policy gradient methods emerged as an alternative to value-based
approaches. Instead of estimating the value function, these methods directly learn a parameterized
policy. Algorithms like REINFORCE and actor-critic methods became popular for problems with
continuous state and action spaces.
They are particularly effective when dealing with high-dimensional and complex problems. The
advent of deep neural networks brought about a revolution in RL. Deep Q-Networks (DQN)
extended RL to problems with high-dimensional state spaces, and later algorithms like Deep
Deterministic Policy Gradients (DDPG) and Trust Region Policy Optimization (TRPO) addressed
continuous action spaces. These methods leverage neural networks to approximate complex
mappings from states to actions.
Twin Delayed DDPG (TD3) is an off-policy reinforcement learning algorithm designed for
continuous action spaces. It is an extension of the Deep Deterministic Policy Gradients (DDPG)
algorithm with several modifications to enhance stability and performance. TD3 was introduced by
Scott Fujimoto et al. in their 2018 paper titled “Addressing Function Approximation Error in Actor-
Critic Methods” [13] and has been successfully used in several control applications [14]. In this paper,
the TD3 algorithm was used to tune the parameters of the PID controller from a control loop of a
nonlinear system.
Next, the paper is structured as follows: Section 2 presents the TD3 algorithm in detail,
Section 3 presents the tuning of the PID controller parameters using the TD3 algorithm, Section 4
presents the classical approach to PID tuning for a nonlinear system, Section 5 presents the simulation
results of the two approaches, and Section 6 is dedicated to the conclusions regarding the obtained
results and possible future approaches.
2. The TD3 Algorithm
The critic estimates the value function of the states or of the (state, action) pairs in the
environment. The value function indicates how much reward is expected to be obtained by starting
from a certain state and taking a certain action. The critic evaluates the actions taken by the actor in a
given environment and provides feedback regarding how well those actions performed in achieving
the desired goals or rewards. The most used technique for establishing the value function is Q-learning.
In Q-learning, a function (usually denoted Q) that associates state-action pairs with the expected
value of the future reward is estimated using a deep neural network.
In the following, we shall use the usual notations from RL, presented in Table 1.
Table 1. Notations used in the RL framework.
Notation | RL Element
s_k | current state
s_{k+1} | next state
a_k | current action
a_{k+1} | next action
r_k | reward at state s_k
Q | Q-function (critic)
π | policy function (actor)
φ_k | TD target
𝔸 | action space
𝕊 | state space
DDPG is an actor-critic algorithm that combines the advantages of policy-based and value-
based methods and learns estimates of both the optimal policy and the optimal value function. Using the
Bellman equation and off-policy data, the Q-function is learned and then used to learn the policy [15]. The
main idea is the same as in Q-learning: if the optimal Q-function (denoted $Q^{opt}(s,a)$) is known, the
optimal action is obtained by solving the following equation:

$$a^{opt}(s) = \arg\max_{a} Q^{opt}(s, a), \qquad (1)$$
The starting point in learning an approximator for $Q^{opt}(s, a)$ is the well-known Bellman
equation (the expectation being taken over the next state):

$$Q^{opt}(s_k, a_k) = \mathbb{E}\left[ r_k + \gamma \max_{a \in \mathbb{A}} Q^{opt}(s_{k+1}, a) \right], \qquad (2)$$

Q-learning solves the Bellman optimality equation using the temporal difference (TD) update:

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \alpha \cdot \delta, \qquad (3)$$

where $\delta = r_k + \gamma \max_{a \in \mathbb{A}} Q(s_{k+1}, a) - Q(s_k, a_k)$ is the TD error and $\alpha$ is the learning rate.
In order to use Q-learning for continuous state and action spaces, neural networks are used as
function approximators for the Q-values and for the policy. So, considering $Q_w$ the critic network
parametrized by $w$ and $\pi_\theta$ the actor network parametrized by $\theta$, relation (3) becomes:

$$Q_w(s_k, a_k) \leftarrow Q_w(s_k, a_k) + \alpha \cdot \left[\varphi_k - Q_w(s_k, a_k)\right], \qquad (4)$$

In relation (4), the term $\varphi_k = r_k + \gamma \max_{a \in \mathbb{A}} Q_w(s_{k+1}, a)$ represents the target (the desired or goal
value that the algorithm is trying to approximate or learn). The target $\varphi_k$ depends on the
parameter $w$ that is to be optimized, so the target itself varies, and this causes significant difficulty for
the supervised learning. The solution adopted in DDPG was to use a new neural network
(called the target critic network) that has the same structure as the critic network but whose parameters
do not change rapidly. Also, the target $\varphi_k$ contains a maximization over $a \in \mathbb{A}$, which can be
expensive since the Q-value function is a complex network with continuous inputs. To address this
problem, a target actor network is used to approximate the optimal policy that
maximizes $Q_w(s_{k+1}, a)$. The target networks are denoted in the following by:
$\hat{Q}_{\hat{w}}$ - the target critic network, parametrized by $\hat{w}$
$\hat{\pi}_{\hat{\theta}}$ - the target actor network, parametrized by $\hat{\theta}$
So, the target becomes:

$$\varphi_k = r_k + \gamma \cdot \hat{Q}_{\hat{w}}\left(s_{k+1}, \hat{\pi}_{\hat{\theta}}(s_{k+1})\right), \qquad (5)$$
For training the critic $Q_w$, using a minibatch of $N$ samples and the target critic $\hat{Q}_{\hat{w}}$, one computes

$$\varphi_i = r_i + \gamma \cdot \hat{Q}_{\hat{w}}\left(s_{i+1}, \hat{\pi}_{\hat{\theta}}(s_{i+1})\right), \qquad (6)$$

and updates $w$ by minimizing the loss function $L_C$ (calculated as the mean square error (MSE) between
the current Q-value and the target Q-value):

$$L_C = \frac{1}{N} \sum_i \left(\varphi_i - Q_w(s_i, a_i)\right)^2, \qquad (7)$$
In order to update the actor network parameters, the policy gradient is used:

$$\nabla_\theta J = \frac{1}{N} \sum_i \left( \nabla_a Q_w(s_i, a)\big|_{a=\pi_\theta(s_i)} \cdot \nabla_\theta \pi_\theta(s_i) \right), \qquad (8)$$
For updating the target network parameters, DDPG performs “soft updates” using Polyak
averaging:
$$\hat{w} \leftarrow \tau \cdot w + (1 - \tau) \cdot \hat{w}, \qquad \hat{\theta} \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \hat{\theta}, \qquad (9)$$

where $\tau \in [0, 1]$, $\tau \ll 1$.
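The following sketch illustrates one DDPG update corresponding to relations (6)-(9), written here in PyTorch with placeholder network sizes, learning rates and a random mini-batch standing in for a real replay buffer; it is only an illustrative sketch, not the implementation used in the paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# One DDPG update step following relations (6)-(9).
obs_dim, act_dim, N, gamma, tau = 3, 1, 64, 0.99, 0.005

actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# random mini-batch (s_i, a_i, r_i, s_{i+1}) standing in for replay-buffer samples
s, a = torch.randn(N, obs_dim), torch.rand(N, act_dim) * 2 - 1
r, s2 = torch.randn(N, 1), torch.randn(N, obs_dim)

# critic update, relations (6)-(7): regress Q_w towards the target phi
with torch.no_grad():
    phi = r + gamma * critic_targ(torch.cat([s2, actor_targ(s2)], dim=1))
critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), phi)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# actor update, relation (8): ascend the Q-value (i.e., descend its negative mean)
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# soft (Polyak) update of the target networks, relation (9)
with torch.no_grad():
    for net, net_targ in ((critic, critic_targ), (actor, actor_targ)):
        for p, p_t in zip(net.parameters(), net_targ.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```

Minimizing the negative mean Q-value with gradient descent is equivalent to the gradient ascent step of relation (8).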
In summary, the main steps of the DDPG algorithm are:
1. Initialize the critic and actor networks.
2. Collect data from the environment and store them in the replay buffer.
3. Sample a batch of experiences from the replay buffer.
4. Compute the target Q-value using the target networks and update the critic using a mean-squared
Bellman error loss.
5. Update the actor policy using the sampled batch, aiming to maximize the estimated Q-value.
6. Periodically update the target networks with a soft update.
7. Repeat steps 3-6 until the desired performance is achieved.
Overconfidence bias (a term that comes from psychology) denotes a disparity between one's self-
assessment of skills and abilities and the actual reality. This phenomenon extends beyond human
behavior and is also prevalent among RL agents, where it is known as "overestimation bias".
Another drawback of Q-learning algorithms is the possible numerical instability generated by using
function approximation to estimate Q-values for continuous or large state-action spaces. Instabilities
can arise when training neural networks, especially if they are deep. Issues like vanishing gradients
or exploding gradients can affect the stability of the learning process. Consequently, the TD3
algorithm was developed to address these challenges within the Actor-Critic RL framework,
specifically targeting the limitations observed in the DDPG algorithm. TD3 concentrates specifically
on the Actor-Critic framework, implementing three techniques to enhance the DDPG algorithm: 1.
Clipped Double Q-Learning, 2. Target policy smoothing and 3. Delayed policy and target updates.
Clipped double Q-learning is a way of reducing overestimation bias in the Q-functions. Double
Q-learning addresses the overestimation problem by using two sets of Q-values (Q1 and Q2), and
during the updates, it uses one set to select the best action and the other set to evaluate that action
[16]. Clipped Double Q-learning goes a step further by introducing a clipping mechanism during the
Q-value updates (as summarized in Figure 4). The idea is to prevent overestimation by limiting the
impact of the maximum estimated Q-value.
Figure 4. Flow of data through the target networks to calculate the TD-Target using the Clipped
Double Q-Learning approach.
When updating Q-values, the algorithm compares the Q-values from the two sets and chooses
the smaller one. This clipped value is then used in the update rule. Since only a single actor π is used,
a single TD-Target is then used for updating both Q1 and Q2.
$$\varphi_k = r_k + \gamma \cdot \min\left\{ \hat{Q}_{\hat{w}_1}\left(s_{k+1}, \hat{\pi}_{\hat{\theta}}(s_{k+1})\right),\ \hat{Q}_{\hat{w}_2}\left(s_{k+1}, \hat{\pi}_{\hat{\theta}}(s_{k+1})\right) \right\}, \qquad (10)$$
Target policy smoothing is a way of regularizing the learning target so that the critic does not
exploit sharp, possibly erroneous peaks of the Q-function, which can lead to instability. In
order to prevent that, a perturbation 𝜂 (usually chosen as a small Gaussian noise) is added to the
action of the next state, so that the value evaluation is more accurate. Additionally, the noise itself, as
well as the perturbed action are clipped. The noise is clipped to ensure that it applies to only a small
region around the action, while the perturbed action is clipped to ensure that it lies within the range
of valid action values.
$$\varphi_k = r_k + \gamma \cdot \min\left\{ \hat{Q}_{\hat{w}_1}\left(s_{k+1}, \hat{\pi}_{\hat{\theta}}(s_{k+1}) + \eta\right),\ \hat{Q}_{\hat{w}_2}\left(s_{k+1}, \hat{\pi}_{\hat{\theta}}(s_{k+1}) + \eta\right) \right\}, \qquad (11)$$

where $\eta \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$.
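A compact sketch of the resulting TD3 target, combining target policy smoothing and clipped double Q-learning as in relations (10)-(11), is given below; the target networks are assumed to have the same interface as in the DDPG sketch above, and sigma, c and the action bounds are placeholder values of our own.

```python
import torch

# TD3 target, relations (10)-(11): target policy smoothing + clipped double Q-learning.
def td3_target(r, s_next, actor_targ, critic1_targ, critic2_targ,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    with torch.no_grad():
        a_next = actor_targ(s_next)
        eta = torch.clamp(sigma * torch.randn_like(a_next), -c, c)   # eta ~ clip(N(0, sigma), -c, c)
        a_smooth = torch.clamp(a_next + eta, a_low, a_high)          # perturbed action kept in the valid range
        q1 = critic1_targ(torch.cat([s_next, a_smooth], dim=1))
        q2 = critic2_targ(torch.cat([s_next, a_smooth], dim=1))
        return r + gamma * torch.min(q1, q2)                         # pessimistic (clipped) estimate
```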
So, the update of $w_1$ and $w_2$ (the critics' parameters) is realized by minimizing two loss functions:

$$L_{C1} = \frac{1}{N} \sum_i \left(\varphi_i - Q_{w_1}(s_i, a_i)\right)^2, \qquad L_{C2} = \frac{1}{N} \sum_i \left(\varphi_i - Q_{w_2}(s_i, a_i)\right)^2, \qquad (12)$$
Delayed policy and target updates is a technique by which the target networks are updated less
often than the main networks. In TD3, the policy update is delayed. Instead of updating the policy
every time step, the policy network is updated less frequently, typically after a fixed number of
iterations or time steps. This delay helps in reducing the correlation between consecutive policy
updates, leading to more stable learning and better exploration of the action space. In addition to
delaying the policy updates, TD3 also introduces delayed updates to the target networks (both the
actor and the critic networks). In DDPG, the target networks are updated by directly copying the
weights from the main networks periodically. However, in TD3, the target networks are updated less
frequently than the main networks. Specifically, the target networks are updated less often than the
policy network updates. This delayed updating of the target networks helps in stabilizing the
learning process by providing more consistent target values for the temporal difference (TD) error
calculation, thus reducing the variance in the update process. These two strategies, delayed policy
updates and delayed target updates, work together to improve the stability and performance of the
TD3 algorithm, making it more effective in training deep reinforcement learning agents for complex
tasks. As in DDPG, the Polyak averaging technique is used, with the formulas below (τ has a very
small value):
$$\hat{w}_1 \leftarrow \tau \cdot w_1 + (1 - \tau) \cdot \hat{w}_1, \qquad \hat{w}_2 \leftarrow \tau \cdot w_2 + (1 - \tau) \cdot \hat{w}_2, \qquad \hat{\theta} \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \hat{\theta}, \qquad (13)$$
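The delayed updates can be expressed as a simple gate on the iteration counter, as in the sketch below (our own illustration; the variable names follow the previous sketches, and policy_delay and tau are placeholder values, with the original TD3 paper updating the policy every second critic update).

```python
import torch

# Delayed policy and target updates (relation (13)): the actor and the target
# networks are refreshed only every `policy_delay` critic updates.
def delayed_update(step, s, actor, actor_opt, critic1, net_pairs,
                   policy_delay=2, tau=0.005):
    if step % policy_delay != 0:
        return                                       # only the critics were updated at this step
    actor_loss = -critic1(torch.cat([s, actor(s)], dim=1)).mean()   # delayed actor update
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    with torch.no_grad():                            # delayed Polyak update of all target networks
        for net, net_targ in net_pairs:              # e.g. [(critic1, critic1_targ), (critic2, critic2_targ), (actor, actor_targ)]
            for p, p_t in zip(net.parameters(), net_targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```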
TD3 is a complex algorithm, but it is relatively easy to implement. Summarizing the above, the
TD3 algorithm is as follows [13]:
Algorithm - TD3:
1. Initialize the critic networks $Q_{w_1}$, $Q_{w_2}$ and the actor network $\pi_\theta$ with random parameters $w_1$, $w_2$, $\theta$.
2. Initialize the parameters of the target networks: $\hat{w}_1 \leftarrow w_1$, $\hat{w}_2 \leftarrow w_2$, $\hat{\theta} \leftarrow \theta$.
3. Initialize the replay buffer B.
for k = 1 to T do
4. Select the action with exploration noise, $a_k \leftarrow \pi_\theta(s_k) + \eta$, $\eta \sim \mathcal{N}(0, \sigma)$, and observe the reward $r_k$ and the new state $s_{k+1}$.
5. Store the transition $(s_k, a_k, r_k, s_{k+1})$ in B.
6. Sample a mini-batch of N transitions $(s_i, a_i, r_i, s_{i+1})$ from B and compute
$\tilde{a}_i \leftarrow \hat{\pi}_{\hat{\theta}}(s_{i+1}) + \eta$, with $\eta \leftarrow \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$
$\varphi_i \leftarrow r_i + \gamma \cdot \min\{\hat{Q}_{\hat{w}_1}(s_{i+1}, \tilde{a}_i),\ \hat{Q}_{\hat{w}_2}(s_{i+1}, \tilde{a}_i)\}$
7. Update the critics:
$w_j \leftarrow \arg\min_{w_j} \frac{1}{N} \sum_i \left(\varphi_i - Q_{w_j}(s_i, a_i)\right)^2$, $j = 1, 2$
8. Every d iterations (delayed updates):
- update $\theta$ using the deterministic policy gradient (relation (8)) computed with $Q_{w_1}$;
- update the target networks using relation (13).
end for
3. Tuning of the PID Controller Parameters Using the TD3 Algorithm
The command signal generated by the PID controller has the standard form

$$u(t) = K_p \cdot e(t) + K_i \cdot \int_0^t e(\tau)\, d\tau + K_d \cdot \frac{de(t)}{dt}$$

Here:
• u is the output of the actor neural network (the command signal);
• Kp, Ki and Kd are the PID controller parameters;
• e(t) = v(t) − y(t), where e(t) is the system error, y(t) is the system output and v(t) is the reference
signal (a small numerical sketch of this PID-as-actor computation follows the list).
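A minimal numerical sketch of this interpretation (our own example, with placeholder gains and error values): the PID command is simply the dot product between the actor weights and the observation vector.

```python
import numpy as np

# PID controller seen as a one-layer actor: the weights are the gains and the
# observation collects the error, its integral and its derivative.
def pid_actor(weights, e, e_int, e_der):
    """u = Kp*e + Ki*integral(e) + Kd*de/dt, i.e. a fully connected layer without bias."""
    observation = np.array([e, e_int, e_der])
    return float(np.dot(weights, observation))    # command signal u (the action)

Kp, Ki, Kd = 1.0, 0.5, 0.01                       # placeholder gains (the actor's weights)
u = pid_actor(np.array([Kp, Ki, Kd]), e=0.2, e_int=0.05, e_der=-0.01)
print(f"u = {u:.4f}")
```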
The proposed control structure with tuning of the PID controller parameters using the TD3
algorithm is presented in Figure 5. Given observations, a TD3 agent decides which action to take
using an actor representation. The weights of the actor are, in fact, the gains of the PID controller so,
one can model the PID controller as a neural network with one fully-connected layer with error, error
integral and error derivative as inputs (state of the system). The output of the actor neural network
represents the command signal (action). The proposed scheme has two time scales: a scale for
adapting the PID controller parameters (the weights of the actor) and a scale for analyzing the
system's response to a step input.
The reward can be defined using point-based metrics (e.g., settling time, overshoot, etc.) or
metrics based on the system trajectory. In our study, a reward function based on an LQG criterion
was used (see Section 5).
4. Classical Approach to PID Tuning for the Nonlinear System
The classical tuning approach is based on the linearization of the nonlinear biosystem around an
operating point. The modelling of bioprocesses that take place in a fed-batch bioreactor is based on the
general mass-balance equations [18]. Starting from these equations, the general model is represented
by a set of nonlinear differential equations of the following form:
𝜉̇ (𝑡) = 𝑓(𝜉, 𝐷, 𝐹) = Γ ⋅ 𝛷(𝜉, 𝑡) − 𝐷𝜉(𝑡) + 𝐹(𝑡), (19)
where:
• 𝜉(𝑡) ∈ ℜ𝑛×1 - represents the state vector (the concentrations of the systems variables);
• 𝛷 = [𝜇1 𝜇2 ⋯ 𝜇𝑚 ]𝑇 - denotes the vector of reactions kinetics (the rates of the reactions);
• Γ = [𝛾𝑖𝑗 ], 𝑖 = 1, 𝑛; 𝑗 = 1, 𝑚, is the matrix of the yield coefficients;
• Γ ⋅ 𝛷(𝜉, 𝑡) represents the rate of production;
• −𝐷𝜉(𝑡) + 𝐹(𝑡) is the exchange between the bioreactor and the exterior.
This model is strongly nonlinear, a specific characteristic of most biotechnological processes. In
order to design a PID controller, the system represented by relation (19) can be linearized around
an equilibrium point $(\tilde{\xi}, \tilde{D}, \tilde{F})$ (i.e., a solution of the equation $f(\xi, D, F) = 0$). One gets the
following equation:

$$\frac{d}{dt}\left(\xi - \tilde{\xi}\right) = A(\tilde{\xi}) \cdot \left(\xi - \tilde{\xi}\right) - \left(D - \tilde{D}\right)\tilde{\xi} + \left(F - \tilde{F}\right), \qquad (20)$$
with

$$A(\tilde{\xi}) = \Gamma \cdot \left[\frac{\partial \Phi(\xi)}{\partial \xi}\right]_{\xi=\tilde{\xi}} - \tilde{D} \cdot I_2, \qquad (21)$$

where $\tilde{\xi}$ represents the equilibrium value of $\xi$ and $I_2$ is the identity matrix of order 2. The
linear model (20)-(21) can be used to tune the PID parameters using classical design formulas or
computer-aided design approaches.
In the following, the model of a bacterial growth bioprocess that takes place in a
continuous stirred-tank bioreactor is presented. The dynamical model of the process can be expressed
as a set of differential equations as follows:

$$\frac{d\xi_B}{dt} = \mu(\xi_S) \cdot \xi_B - D \cdot \xi_B, \qquad \frac{d\xi_S}{dt} = -k_1 \cdot \mu(\xi_S) \cdot \xi_B - D \cdot \xi_S + F_{in}, \qquad (22)$$
where $\xi_B$ is the biomass concentration, $\xi_S$ is the substrate concentration and $F_{in}$ is the input feed rate,
with $F_{in} = D \cdot S_{in}$, where $D$ denotes the dilution rate and $S_{in}$ the concentration of the influent substrate.
The model can be written in the matrix form:

$$\frac{d}{dt}\begin{bmatrix} \xi_B \\ \xi_S \end{bmatrix} = \begin{bmatrix} 1 \\ -k_1 \end{bmatrix} \cdot \mu(\xi_S) \cdot \xi_B - D \cdot \begin{bmatrix} \xi_B \\ \xi_S \end{bmatrix} + \begin{bmatrix} 0 \\ D \cdot S_{in} \end{bmatrix}, \qquad (23)$$

and if one denotes

$$\xi = \begin{bmatrix} \xi_B \\ \xi_S \end{bmatrix}; \quad \Gamma = \begin{bmatrix} 1 \\ -k_1 \end{bmatrix}; \quad \Phi(\xi) = \mu(\xi_S) \cdot \xi_B; \quad F = \begin{bmatrix} 0 \\ D \cdot S_{in} \end{bmatrix}, \qquad (24)$$
one gets the form (19).
Because it takes into account the inhibitory effect of the substrate at high concentrations, one of
the most used models for the specific growth rate is the Haldane model, described by the following
relation:

$$\mu(\xi_S) = \mu_0 \cdot \frac{\xi_S}{K_M + \xi_S + \xi_S^2 / K_S}, \qquad (25)$$

where $K_M$ represents the Michaelis-Menten constant, $K_S$ the inhibition constant, and $\mu_0$ the maximum
specific growth rate.
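For reference, a short Python sketch of relation (25); the numerical constants are the bioprocess parameters reported in Section 5 and are used here only as an example.

```python
import numpy as np

# Haldane specific growth rate, relation (25), with example parameter values.
def mu_haldane(xi_S, mu0=6.0, K_M=10.0, K_S=100.0):
    return mu0 * xi_S / (K_M + xi_S + xi_S**2 / K_S)

xi_S = np.linspace(0.0, 100.0, 2001)
print(f"maximum growth rate near xi_S = {xi_S[np.argmax(mu_haldane(xi_S))]:.1f} g/l")
```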
Since

$$\left[\frac{\partial \Phi(\xi)}{\partial \xi_B}\right]_{\xi=\tilde{\xi}} = \mu(\tilde{\xi}_S) = \mu_0 \cdot \frac{\tilde{\xi}_S}{K_M + \tilde{\xi}_S + \tilde{\xi}_S^2 / K_S} \quad \text{and} \quad \left[\frac{\partial \Phi(\xi)}{\partial \xi_S}\right]_{\xi=\tilde{\xi}} = \tilde{\xi}_B \cdot \frac{d\mu}{d\xi_S}(\tilde{\xi}_S) \triangleq \beta,$$

the matrix $A(\tilde{\xi})$ from relation (21) can be computed explicitly.
For the bacterial growth bioprocess, it is important to control the substrate concentration, since
too high a value of this concentration leads to the inhibition of biomass growth in the bioreactor
[18]. So, one considers $\xi_S$ as the controlled variable (output of the system) and the dilution rate $D$ as
the input. Denoting $y = \xi_S - \tilde{\xi}_S$ and $u = D - \tilde{D}$, one gets the standard state-space representation of a
linear system:
$$\frac{dx}{dt} = A \cdot x + B \cdot u, \qquad y = C \cdot x, \qquad (27)$$

where

$$A = \begin{bmatrix} \mu(\tilde{\xi}_S) - \tilde{D} & \beta \\ -k_1 \cdot \mu(\tilde{\xi}_S) & -k_1 \cdot \beta - \tilde{D} \end{bmatrix}, \qquad B = \begin{bmatrix} -\tilde{\xi}_B \\ -\tilde{\xi}_S + S_{in} \end{bmatrix}, \qquad C = \begin{bmatrix} 0 & 1 \end{bmatrix}.$$
The linearized model (27) can be used to design a PID controller, for example using the PID
Tuner app from Matlab [19].
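Before applying such computer-aided tuning, the matrices of the linearized model can be evaluated numerically, as in the sketch below (our own illustration); the equilibrium substrate concentration is a placeholder chosen so that the steady-state conditions of model (22) hold, and beta is the sensitivity of the reaction rate to the substrate at that equilibrium.

```python
import numpy as np

# Numerical construction of the linearized model (27) around an equilibrium point.
mu0, K_M, K_S, k1, S_in = 6.0, 10.0, 100.0, 1.0, 100.0

def mu(xi_S):
    return mu0 * xi_S / (K_M + xi_S + xi_S**2 / K_S)

def linearized_matrices(xi_S_eq):
    xi_B_eq = (S_in - xi_S_eq) / k1                 # equilibrium biomass from the substrate balance
    D_eq = mu(xi_S_eq)                              # equilibrium dilution rate from the biomass balance
    den = K_M + xi_S_eq + xi_S_eq**2 / K_S
    dmu = mu0 * (K_M - xi_S_eq**2 / K_S) / den**2   # d(mu)/d(xi_S) at the equilibrium
    beta = xi_B_eq * dmu
    A = np.array([[mu(xi_S_eq) - D_eq, beta],
                  [-k1 * mu(xi_S_eq), -k1 * beta - D_eq]])
    B = np.array([[-xi_B_eq],
                  [-xi_S_eq + S_in]])
    C = np.array([[0.0, 1.0]])
    return A, B, C

A, B, C = linearized_matrices(xi_S_eq=10.0)   # example operating point
print(np.linalg.eigvals(A))                   # open-loop poles of the linearized model
```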
5. Simulation Results
Tuning approaches presented in Sections 3 and 4 were implemented in Matlab/Simulink
environment. The Simulink implementation of bacterial growth bioprocess is presented in Figure 6.
The bioprocess parameters used in the simulations were the following:

$$\mu_0 = 6\ \mathrm{h}^{-1}, \quad K_M = 10\ \mathrm{g/l}, \quad K_S = 100\ \mathrm{g/l}, \quad k_1 = 1, \quad S_{in} = 100\ \mathrm{g/l}$$
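As an illustration of the plant behaviour, the sketch below integrates the nonlinear model (22) with the parameters above using a simple Euler scheme; the constant dilution rate and the initial conditions are placeholders of our own, and this is not the authors' Simulink model.

```python
import numpy as np

# Open-loop Euler simulation of the nonlinear bioprocess model (22).
mu0, K_M, K_S, k1, S_in = 6.0, 10.0, 100.0, 1.0, 100.0
D = 1.0                                    # constant dilution rate [1/h] (example input)
dt, t_end = 1e-3, 5.0                      # integration step and horizon [h]
xi_B, xi_S = 1.0, 5.0                      # initial concentrations [g/l]

def mu(s):
    return mu0 * s / (K_M + s + s**2 / K_S)

for _ in range(int(t_end / dt)):
    dxi_B = mu(xi_S) * xi_B - D * xi_B
    dxi_S = -k1 * mu(xi_S) * xi_B - D * xi_S + D * S_in
    xi_B += dt * dxi_B
    xi_S += dt * dxi_S
print(f"approximate steady state: xi_B = {xi_B:.2f} g/l, xi_S = {xi_S:.2f} g/l")
```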
5.1. RL Approach
The Matlab implementation of the PID tuning structure shown in Figure 5 is presented in Figure
7 and is based on RL-TD3 agent implemented in Matlab/Simulink [20].
Figure 7. Matlab/Simulink implementation block diagram of the proposed control system using RL-
TD3 agent.
The Observation block provides the three elements $[P(t_n)\ I(t_n)\ D(t_n)]$ (see Figure 8), while the Reward
block implements the formula for the reward. A reward function based on an LQG criterion was used:

$$J = \lim_{T \to \infty} \left( \frac{1}{T} \int_0^T \left( a \cdot (ref - y(t))^2 + b \cdot u^2(t) \right) dt \right), \qquad (28)$$
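A discrete-time sketch of how such a reward can be computed from one evaluation window of the step response (our own illustration; the weights a and b and the example signals are placeholders):

```python
import numpy as np

# Discrete-time sketch of the LQG-type criterion above, returned as a (negative) reward.
def lqg_reward(ref, y, u, a=1.0, b=0.1):
    """y, u: arrays sampled over one evaluation window of the step response."""
    return -float(np.mean(a * (ref - y) ** 2 + b * u ** 2))

t = np.arange(0.0, 2.0, 0.01)
y = 1.0 - np.exp(-3.0 * t)        # example output approaching the reference
u = 0.5 * np.exp(-3.0 * t)        # example command signal
print(f"reward = {lqg_reward(1.0, y, u):.4f}")
```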
The parameters used for the actor and critic networks of the TD3 agent are presented in Table 2.

Parameter | Value
Mini-batch size | 128
Experience buffer length | 500000
Gaussian noise variance | 0.1
The gains of the PID controller are the absolute values of the weights of the actor representation. After 200
training episodes, the following values of the PID controller parameters were obtained:

$$K_P = 0.62; \quad K_I = 3.57; \quad K_D = 0.00023$$

With the gains obtained from the RL agent, a step-response simulation was performed. The time
evolution of the system output (substrate concentration) is presented in Figure 10, while the command
signal and the biomass concentration are presented in Figure 11.
Figure 11. Time evolution of command signal and the biomass concentration.
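The closed-loop behaviour can be reproduced approximately with the short sketch below, in which the PID gains reported above drive the dilution rate of the nonlinear model (22); the reference value, the initial state and the Euler step are placeholders, so the resulting trajectory is only indicative of the Simulink results.

```python
import numpy as np

# Closed-loop Euler simulation: PID control of the substrate concentration of model (22).
mu0, K_M, K_S, k1, S_in = 6.0, 10.0, 100.0, 1.0, 100.0
Kp, Ki, Kd = 0.62, 3.57, 0.00023           # gains obtained from the RL agent
dt, t_end = 1e-3, 3.0                      # [h]
ref = 10.0                                 # desired substrate concentration [g/l] (placeholder)
xi_B, xi_S = 20.0, 5.0                     # initial state [g/l] (placeholder)
e_int, e_prev = 0.0, ref - xi_S

def mu(s):
    return mu0 * s / (K_M + s + s**2 / K_S)

for _ in range(int(t_end / dt)):
    e = ref - xi_S
    e_int += e * dt
    e_der = (e - e_prev) / dt
    e_prev = e
    D = np.clip(Kp * e + Ki * e_int + Kd * e_der, 0.0, 10.0)   # dilution rate kept non-negative
    xi_B += dt * (mu(xi_S) * xi_B - D * xi_B)
    xi_S += dt * (-k1 * mu(xi_S) * xi_B - D * xi_S + D * S_in)
print(f"xi_S = {xi_S:.2f} g/l (reference {ref} g/l)")
```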
5.2. Classical Approach
For the classical tuning approach presented in Section 4, the time evolution of the system output
(substrate concentration) is presented in Figure 13, while the command signal and the biomass
concentration are presented in Figure 14.
Figure 14. Time evolution of command signal and the biomass concentration.
Figures 11 and 14 show why the control of the substrate concentration is important: too high a
value inhibits the growth of the biomass, while the main purpose of the regulation system is to
increase biomass production.
The main results and response characteristics are summarized in Table 3.

Tuning method | Kp | Ki | Kd | Overshoot [%] | Settling time [h]
TD3-based tuning | 0.62 | 3.57 | 0.00023 | 0 | 0.8
The response obtained through the classic approach is more aggressive, with a large overshoot
and a slightly shorter response time. The response in the case of TD3 based tuning is slower and it
has no overshoot.
6. Conclusions
In this paper, the use of learning techniques from the field of artificial intelligence was proposed
for tuning the parameters of the PID controllers used to control nonlinear systems. The proposed
method is very useful from a practical point of view because many industrial processes still use PID
controllers, and their tuning is a particularly important step. Also, in practice, empirical tuning
methods are often used that cannot be applied to certain classes of nonlinear systems. The RL-based
tuning method is compared with a classical technique that uses the linearization of the nonlinear
system around an operating point. In the first case, tuning the controller is done using Twin Delayed
Deep Deterministic Policy Gradients (TD3) algorithm which presents a series of advantages
compared to other similar RL approaches dedicated to continuous systems. This method is an off-
policy Actor-Critic based method and it was chosen as it does not require a system model and works
on environments with continuous action and state spaces. In the classical approach, the nonlinear
system was linearized around an equilibrium point and then standard tuning methods were used.
The presented techniques were applied to the control of a biotechnological system, a bacterial
growth process that takes place in a fed-batch bioreactor. The simulations demonstrate the possibility
of using machine learning algorithms for tuning the parameters of the PID controllers with the aim
of controlling non-linear systems.
Author Contributions: Conceptualization, G.B. and D.S.; methodology, D.S.; software, G.B.; validation, G.B. and
D.S.; formal analysis, D.S.; investigation, G.B. and D.S.; resources, G.B. and D.S.; data curation, G.B.; writing—
original draft preparation, D.S.; writing—review and editing, G.B.; visualization, G.B.; supervision, D.S.; project
administration, D.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The data used to support the findings of this study are available from the
corresponding author upon request.
Conflicts of Interest: The authors declare that they have no conflicts of interest.
References
1. Borase RP, Maghade D, Sondkar S, Pawar S. A review of PID control, tuning methods and applications.
International Journal of Dynamics and Control 2020:1–10.
2. Bucz Š, Kozáková A. Advanced methods of PID controller tuning for specified performance. PID Control
for Industrial Processes 2018:73–119
3. Noordin, A.; Mohd Basri, M.A.; Mohamed, Z. Real-Time Implementation of an Adaptive PID Controller
for the Quadrotor MAV Embedded Flight Control System. Aerospace 2023, 10, 59.
https://doi.org/10.3390/aerospace10010059.
4. Amanda Danielle O. S. D.; André Felipe O. A. D.; João Tiago L. S. C.; Domingos L. A. N. and Carlos
Eduardo T. D. PID Control for Electric Vehicles Subject to Control and Speed Signal Constraints, Journal of
Control Science and Engineering. 2018, Article ID 6259049. https://doi.org/10.1155/2018/6259049.
5. Aström, K.J., Hägglund, T. Advanced PID Control, vol. 461. Research Triangle Park, NC: ISA-The
Instrumentation, Systems, and Automation Society. 2006.
6. Liu, G.; Daley, S. Optimal-tuning PID control for industrial systems. Control Eng. Pract. 2001, 9, 1185–1194.
7. Aström, K.J., Hägglund, T. Revisiting the Ziegler–Nichols step response method for PID control. J. Process
Control. 2004. 14 (6), 635–650.
8. Sedighizadeh M, Rezazadeh A. Adaptive PID controller based on reinforcement learning for wind turbine
control. In: Proceedings of World Academy of Science, Engineering and Technology. vol. 27. 2008. pp. 257–
62.
9. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous control with deep reinforcement
learning. arXiv Preprint, arXiv:150902971 2015.
10. Brunton, S.L.; Kutz, J.N. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and
Control. Cambridge University Press: Cambridge, UK, 2019.
11. H. Dong and H. Dong, Deep Reinforcement Learning. Singapore: Springer, 2020.
12. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. MIT Press: Cambridge, MA, USA, 2018.
13. Fujimoto, S. Hoof, H. and Meger, D. Addressing function approximation error in actor-critic methods. In
Proc. Int. Conf. Mach. Learn., 2018, pp. 1587–1596.
14. Muktiadji, R.F.; Ramli, M.A.M.; Milyani, A.H. Twin-Delayed Deep Deterministic Policy Gradient
Algorithm to Control a Boost Converter in a DC Microgrid. Electronics 2024, 13, 433.
https://doi.org/10.3390/electronics13020433.
15. Yao, J.; Ge, Z. Path-Tracking Control Strategy of Unmanned Vehicle Based on DDPG Algorithm. Sensors
2022, 22, 7881.
16. H. V. Hasselt, A. Guez, and D. Silver, ‘‘Deep reinforcement learning with double Q-learning,’’ in Proc.
AAAI Conf. Artif. Intell., vol. 30, 2016, pp. 1–7.
17. Rathore, A.S.; Mishra, S.; Nikita, S.; Priyanka, P. Bioprocess Control: Current Progress and Future
Perspectives. Life 2021, 11, 557. https://doi.org/10.3390/life11060557.
18. Sendrescu, D.; Petre, E.; Selisteanu, D. Nonlinear PID controller for a Bacterial Growth Bioprocess. In
Proceedings of the 2017 18th International Carpathian Control Conference (ICCC), Sinaia, Romania, 28–31
May 2017; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2017; pp. 151–155.
19. MathWorks—PID Tuner. Available online: https://www.mathworks.com/help/control/ref/pidtuner-app.html.
20. MathWorks—Twin-Delayed Deep Deterministic Policy Gradient Reinforcement Learning Agent.
Available online: https://www.mathworks.com/help/reinforcement-learning/ug/td3-agents.html
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those
of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s)
disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or
products referred to in the content.