Data-Driven Model Predictive Control of DC-To-DC Buck-Boost Converter
ABSTRACT A data-driven model predictive control (DDMPC) scheme is proposed to obtain fast
convergence to a desired reference and to mitigate the destabilising effects experienced by a DC-to-DC
buck-boost converter (BBC) with an active load. The DDMPC strategy uses the observed state
to derive an optimal control policy using a reinforcement learning (RL) algorithm. The employed Proximal
Policy Optimisation (PPO) algorithm's performance is benchmarked against a PI controller. From the
simulation results obtained using the MATLAB Simulink solver, the most robust methods with respect to short
settling time and stability were the hybrid methods. These methods take advantage of the short settling time
provided by the PPO algorithm and the stability provided by the PI controller or the filtering mechanism over the
transient time. The source code for this study is available on GitHub to support reproducible research in the
industrial electronics community.
INDEX TERMS Adaptive control, data-driven model predictive control, DC-to-DC buck-boost converter,
proximal policy optimisation, reinforcement learning.
that of the input or source voltage Vin, due to how the inductor discharges charge [15]. The output voltage Vout over the load in a buck-boost converter is defined as follows:

Vout = −(D / (1 − D)) · Vin, (1)

where Vin is the input or source voltage and D is the duty ratio. Comparing buck and boost mode, it is found that the duty ratio D is greater in boost mode than in buck mode. In boost mode, the switch's on-state lasts for a longer duration than in buck mode, thus storing more energy in the inductor, which prevents a rapid change in current from being passed to the capacitor. The output voltage is increased when enough energy is built up in the inductor and transferred to the capacitor.

In the buck-boost converter circuit, the flow of charge in the circuit is determined by the MOSFET switch state, and the diode controls the direction of the flow of charge. When the MOSFET switch in the buck-boost converter is in the on-state in the initial cycle, the circuit is closed. In this state, current flows only to the inductor, as the input voltage source is directly connected to the inductor, and the diode prevents current from flowing to the output of the circuit because the diode is reverse biased. Furthermore, while the circuit is closed and the MOSFET switch is in the on-state, the inductor accumulates charge and stores energy in the form of a magnetic field. When the MOSFET switch is in the off-state, the diode allows current to flow from the inductor to the rest of the components of the circuit [16]. While in this state, the inductor's polarity is reversed, and the diode is forward biased. The inductor provides the stored energy and works as the source, allowing current to flow from the inductor to the capacitor and the load. In this state, the capacitor now accumulates charge and stores energy.

When the MOSFET switch is in the off-state, the inductor experiences a sudden drop in current, thus inducing a voltage at the output. In this state, the change in the inductor's current iL,

diL/dt = Vin/L − ((Ron + rL)/L) · iL, (2)

is the difference between the input voltage Vin and the product of the inductor's current iL and the sum of the resistance when the switch is on, Ron, and the resistance of the inductor, rL, divided by the inductance L [17]. The change in the capacitor's voltage vc,

dvc/dt = −vc/(RC), (3)

is the negative of the capacitor's voltage divided by the product of the resistance R and the capacitance C.

The state-space representation of the system in the off-state is given by:

ẋ = A1 x + B1 Vin,
y = C1 x, (4)

where A1, B1 and C1 are the system matrices, x the state variable, ẋ the derivative of the state variable, and y the output, which are obtained using Eqns. (2)-(3) and defined as:

A1 = [ −(Ron + rL)/L, 0 ; 0, −1/(RC) ],
B1 = [ 1/L, 0 ]^T, x = [ iL, vc ]^T,
C1 = [ 0, 1 ], y = Vout = vc. (5)

In the consecutive cycles, where the MOSFET is in the on-state, the capacitor then supplies energy to the load [17]. The state-space representation for when the MOSFET switch is in the on-state is given by:

ẋ = A2 x + B2 Vin,
y = C2 x, (6)

where A2, B2 and C2 are the system matrices, x the state variable, ẋ the derivative of the state variable, and y the output, which are respectively defined as follows:

A2 = [ −rL/(RC), 1/L ; −1/C, −1/(RC) ],
B2 = [ 1/L, 0 ]^T, x = [ iL, vc ]^T,
C2 = [ 0, 1 ], y = Vout = vc. (7)

The average matrices A, B and C obtained using Eqn. (5) and Eqn. (6) are given by:

A = D A1 + (1 − D) A2
  = [ −D(Ron + rL)/L − (1 − D)rL/(RC), (1 − D)/L ; −(1 − D)/C, −1/(RC) ], (8)
B = D B1 + (1 − D) B2 = D B1, (9)
C = D C1 + (1 − D) C2 = C1 = C2. (10)

The steady state is given by:

X = −A^(−1) B Vin
  = [ D Vin / ( (1 − D)^2 R + D(Ron + rL) + (1 − D) rL L/(RC) ) ;
      −D(1 − D) Vin / ( D(Ron + rL)/R + (1 − D) rL L/(R^2 C) + (1 − D)^2 ) ], (11)

which gives the average state variables as the product of the inverse of A, the matrix B and the input voltage Vin.

The transfer function G from the input voltage Vin to the output capacitor voltage vc [17] is given by:

GVin(s) = D(D − 1) / ( LC s^2 + (L/R) s + (1 − D)^2 ). (12)
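As a concrete illustration of the duty-ratio averaging in Eqns. (8)-(11), the short sketch below builds the two sets of system matrices of Eqns. (5) and (7), averages them and evaluates the steady state numerically. It is a minimal Python sketch rather than the Simulink model used in this study, and the component values (Vin, D, L, C, R, Ron, rL) are assumed purely for illustration; they are not the Table 2 parameters.

```python
import numpy as np

# Assumed, illustrative component values (not the Table 2 values).
Vin, D = 48.0, 0.625          # input voltage [V], duty ratio
L, C, R = 1e-3, 470e-6, 50.0  # inductance [H], capacitance [F], load [ohm]
Ron, rL = 0.1, 0.05           # switch on-resistance, inductor resistance [ohm]

# State-space matrices of Eqns. (5) and (7), with state x = [iL, vc].
A1 = np.array([[-(Ron + rL) / L, 0.0],
               [0.0, -1.0 / (R * C)]])
A2 = np.array([[-rL / (R * C), 1.0 / L],
               [-1.0 / C, -1.0 / (R * C)]])
B1 = np.array([[1.0 / L], [0.0]])

# Duty-ratio averaging of Eqns. (8)-(10).
A = D * A1 + (1.0 - D) * A2
B = D * B1

# Steady state X = -A^(-1) B Vin of Eqn. (11).
X = -np.linalg.solve(A, B) * Vin
iL_ss, vc_ss = X.ravel()
print(f"steady-state inductor current ~ {iL_ss:.2f} A, output voltage ~ {vc_ss:.2f} V")

# Ideal output of Eqn. (1) (no parasitics) for comparison.
print(f"ideal Vout from Eqn. (1): {-D / (1.0 - D) * Vin:.2f} V")
```

With the assumed parasitic resistances, the averaged model settles slightly short of the ideal gain of Eqn. (1), which is the behaviour the steady-state expression of Eqn. (11) captures.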
The DDMPC scheme's RL policy considered is the PPO algorithm. Furthermore, two hybrid cases are considered.

A. PROPORTIONAL INTEGRAL CONTROL
The details of a general PID controller are discussed in Section II-C. For the DC-to-DC buck-boost converter, a PI controller is used to eliminate the steady-state error et and reduce the forward gain. The pulse generator generates rectangular wave pulses in a duty cycle. The integrator integrates the proportional gain over the current time step. This value is then subtracted from the output voltage value and fed into a relay function, allowing its output to switch between two states. The relay function compares the input to a threshold value to determine which corresponding actuation output the controller should return. A summary of the PI controller is given in Fig. 5.

FIGURE 5. PI controller for DC-to-DC buck-boost converter.

The applied PI controller's design uses only the output voltage value from the DC-to-DC buck-boost converter and does not consider the dynamics of the model, hence making this a model-free PI controller implementation.
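A minimal discrete-time sketch of the loop summarised in Fig. 5 is given below, assuming a proportional-integral error term followed by a relay that maps the corrected signal to an on/off switch pulse. The gains kp and ki, the sample time and the example values are placeholders for illustration, not the tuned parameters reported in Table 3.

```python
def pi_relay_controller(kp, ki, v_ref, dt):
    """Sketch of a PI term followed by a relay, per the loop summarised in Fig. 5."""
    integral = 0.0

    def step(v_out):
        nonlocal integral
        error = v_ref - abs(v_out)      # output polarity is inverted (Section II-A)
        integral += error * dt          # integrate the error over the time step
        correction = kp * error + ki * integral
        # Relay: compare against a threshold and return a binary switch pulse.
        return 1 if correction > 0.0 else 0

    return step

# Usage: one controller per reference case, called once per sample time step.
controller = pi_relay_controller(kp=0.05, ki=20.0, v_ref=30.0, dt=1e-5)
pulse = controller(v_out=-12.4)   # -> 1 while the output magnitude is below 30 V
```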
B. PROXIMAL POLICY OPTIMISATION ALGORITHM
The PPO algorithm is a model-free, online, on-policy RL method employed in the DDMPC scheme. The algorithm entails using small batches of experiences from interacting with the environment to update the decision-making policy. Iteratively, once the policy is updated, past experiences are discarded, and a new batch is generated to update the policy. The PPO algorithm is a class of policy gradient training methods that try to reduce the gradient estimations' variance towards better policies, causing consistent progress and ensuring that the policy does not drastically change from the previous policy or go down irrecoverable paths [40].

The PPO algorithm alternates between sampling data through interacting with the environment and optimising a clipped surrogate objective function which employs stochastic gradient ascent [43]. The stability of training the agent is improved by utilising a clipped surrogate objective function and limiting the size of the policy change at each iteration [44].

The PPO algorithm maintains two function approximators, the actor and the critic networks.

• The actor network maps action choices directly to the observed state. At any particular time t, the actor takes the observed state s and returns the probability of taking action a in the action space when in this state. However, this function does not measure how good an action is compared to the other available actions; hence, the critic network is employed to critique the actions returned by the actor network.
• The critic takes the observed state st and the action at returned by the actor network as an input and returns the corresponding expectation of the discounted long-term reward. The critic network is trained to predict the value function shown by Eqn. (17), which measures how good it is to be in a specific state st.

The actor-critic network aims to maximise the surrogate objective function:

L(θ) = Êt[ min( rt(θ) Ât, clip( rt(θ), 1 − ε, 1 + ε ) Ât ) ], (14)

which is an expectation over the advantage function Â, the policy parameters θ, and the probability ratio rt(θ), which is defined as:

rt(θ) = πθ(at | st) / πθold(at | st), (15)

which is the ratio between the current policy and the policy based on past experiences. The comprehensive definition of the probability ratio is: a ratio between the probability of taking an action a when in state s at time t, given the policy parameters θ, and the probability of taking action a when in state s at time t using the past or old policy parameters θold from the previous epoch.
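The probability ratio of Eqn. (15) and the clipped surrogate objective of Eqn. (14) can be written compactly as in the NumPy sketch below, which operates on a batch of sampled log-probabilities and advantage estimates. The clip parameter ε = 0.2 is assumed here for illustration and is not taken from Table 4.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate of Eqn. (14) for one mini-batch.

    logp_new: log pi_theta(a_t | s_t) under the current policy
    logp_old: log pi_theta_old(a_t | s_t) under the policy that sampled the data
    advantages: advantage estimates A_hat_t of Eqn. (18)
    """
    ratio = np.exp(logp_new - logp_old)                 # Eqn. (15)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # maximised by gradient ascent

# Usage with dummy batch values:
L = clipped_surrogate(np.array([-0.9, -1.2]), np.array([-1.0, -1.0]),
                      np.array([0.5, -0.3]))
```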
The general algorithmic structure of the PPO algorithm [40] is as follows:

1) For the PPO algorithm, the first step is to initialise the parameters of the actor-critic network.
2) The next step is to generate N experiences:
   {st1, at1, rt1}, {st2, at2, rt2}, . . . , {stN, atN, rtN},
   where the sequence of experiences consists of tuples of the state-action pairs and their corresponding reward values.
3) Calculate the action-value function and the advantage function for each time instance t.
   • For each instance, the action-value function and the advantage function are computed at each time step t. The action-value function, defined as the expected return of starting at state s and taking action a following the policy π, is given by:
     Qπ(s, a) = Σt Eπθ[ R(st, at) | s, a ], (16)
     where the function is the sum of the expected rewards given the corresponding state-action pair.
   • The value function, the expected return reflecting how good it is to be in a particular state, is shown by:
     Vπ(s) = Σt Eπθ[ R(st, at) | s ], (17)
     where the function is the sum of the expected rewards given the state. The advantage function A, given by:
     Aπ(s, a) = Qπ(s, a) − Vπ(s), (18)
     is the difference between the action-value function Q and the value function V.
4) Over K epochs, learn from the mini-batch experiences.
   • Randomly sample a set of M experiences to form part of the mini-batch which is used to estimate the gradient.
   • The critic network's parameters can be updated using the critic loss function Lc:
     Lc(θv) = (1/M) Σ(t=1..M) ( Qπ(s, a) − V(s | θv) )^2, (19)
     which minimises the loss over the sampled mini-batch.
   • The actor network's parameters are updated by:
     La(θ) = −(1/M) Σ(t=1..M) [ min( rt(θ) Ât, clip( rt(θ), 1 − ε, 1 + ε ) Ât ) ], (20)
     which minimises the loss La over the sampled mini-batch.
5) Repeat steps (2) through (4) until the terminating criterion is met.

Sampling actions trains the PPO-based RL agent according to the updated stochastic policy; hence it is considered a stochastic policy trained in an on-policy manner. During the initial stage of training, the state-action space is explored through randomly selecting actions. As policy training progresses, the policy becomes less random, and the update rule exploits actions found to yield higher rewards.

During training, the PPO agent estimates the associated probabilities of taking each action in the action space. An action is randomly selected based on the probability distribution over actions. The actor and critic properties are updated after training over multiple epochs, using mini-batches, as the PPO agent interacts with the environment. The PPO agent aims to train the coefficients of the actor-critic neural networks to reduce the error e between the desired output Vref and the actual value Vout.
C. HYBRID APPROACHES
The hybrid approach uses the PPO algorithm, discussed in Section III-B, with either a PI controller or a filter, which are discussed in Section III-C1 and Section III-C2, respectively.

1) HYBRID I
This hybrid approach applies the PPO algorithm until a stopping condition is met, which is when the output voltage value is greater than or equal to the reference voltage in magnitude, and then employs the PI controller, as discussed in Section III-A, to determine the actions to be applied to the buck-boost converter.

2) HYBRID II
This hybrid approach conditionally utilises a filtering mechanism with the PPO algorithm. The PPO algorithm is solely used to determine the actions to be applied until the reference voltage is reached; then the filtering mechanism is used to filter the pulse signal dictated by the PPO algorithm. The filter sends a 0 pulse to the buck-boost converter if the stipulated conditional statement is violated, else it applies the signal dictated by the PPO algorithm. The reason for applying a 0 pulse if the measured output voltage is greater than the reference voltage is that when the MOSFET switch is open, the inductor dissipates current to the capacitor, which powers the fixed load, thus reducing the stored energy in the inductor and capacitor.
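A compact sketch of the two hybrid supervisors described above is given below, assuming ppo_action(state) and pi_action(v_out) stand in for the trained PPO policy and the PI controller of Section III-A; both names are placeholders used for illustration only.

```python
def hybrid_step(mode, v_out, v_ref, state, ppo_action, pi_action, ref_reached):
    """One control decision for Hybrid I or Hybrid II.

    mode: "hybrid1" hands over to the PI controller once |Vout| >= Vref;
          "hybrid2" keeps the PPO policy but filters its pulse to 0
          whenever |Vout| exceeds Vref after the reference has been reached.
    """
    if abs(v_out) >= v_ref:
        ref_reached = True

    if mode == "hybrid1" and ref_reached:
        pulse = pi_action(v_out)            # PI controller dictates the pulse
    else:
        pulse = ppo_action(state)           # PPO policy dictates the pulse
        if mode == "hybrid2" and ref_reached and abs(v_out) > v_ref:
            pulse = 0                       # filter: force the switch off
    return pulse, ref_reached
```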
IV. RESULTS
The DC-to-DC buck-boost converter model's performance with an active load or fixed resistance, detailed in Section II-A, is analysed using the different applied control techniques, which are discussed in Section III. The procedure for experimentally testing the applied control techniques is outlined in Section IV-A; the corresponding experimental results, quantitative results and result analysis are reported for three different cases in Section IV-C. The three cases are for the different reference voltages used in the model, which are 30V, 80V and 110V.

A. EXPERIMENTAL PROCEDURE
The setup of the DC-to-DC buck-boost converter and the procedure followed to experimentally compare the performance of the four applied control methods are described in this section.

1) DC-TO-DC BUCK-BOOST CONVERTER
The model of a DC-to-DC buck-boost converter with a passive load used is as per Fig. 1. The corresponding parameters of the circuit are tabulated in Table 2.

TABLE 2. DC-to-DC buck-boost converter circuit parameters.

The model was constructed using the computation engine in MATLAB and Simulink R2021a. The motivation for using the simulated model rather than a state-space model is to take advantage of the native matrix computation engine in MATLAB/Simulink [15].

2) PI CONTROLLER
The PI controller uses the output voltage from the buck-boost converter and the reference voltage value to determine the action signal pulse to be applied to the model. Details of the PI controller are given in Section III-A, and the corresponding parameters used are detailed in this section. The PI controller's corresponding parameters were selected after performing a grid search. The corresponding optimised PI controller parameters utilised in the experiments are tabulated in Table 3. The MATLAB/Simulink solver used for the PI controller is ODE23 (stiff/TR-BDF2).

TABLE 3. PI controller's parameters for the DC-to-DC buck-boost converter.
3) PPO
The DDMPC scheme's RL-based controller employs the PPO algorithm. The corresponding details of the PPO algorithm are delineated in Section III-B. The PPO RL agent parameters are consistent for the three different reference voltage cases: 30V, 80V and 110V. The duration of the simulation was 0.3s, with a sample time of 1E−5s.

The actor-critic network architecture is built using three fully connected hidden layers for both the actor network and the critic network. Each of these hidden layers is built using 256 neurons. The non-linear mapping function used in both these networks is a rectified linear unit (ReLU). The output layer of the actor network employs a softmax activation function. The parameters of the PPO algorithm and the neural networks are tabulated in Table 4.

TABLE 4. PPO parameters for the DC-to-DC buck-boost converter.
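A minimal sketch of the actor-critic architecture described above (three fully connected hidden layers of 256 ReLU units each, with a softmax output over the discrete switch actions) is given below. The framework choice (PyTorch) is an assumption made for illustration, not the toolbox used in this study, and the critic is sketched here as a state-value network, which is one common realisation of the value function of Eqn. (17).

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch of the actor-critic networks: 3 hidden layers x 256 ReLU units."""

    def __init__(self, state_dim=3, n_actions=2, hidden=256):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim))
        self.actor = mlp(n_actions)   # logits over the discrete pulse actions
        self.critic = mlp(1)          # state-value estimate V(s)

    def forward(self, state):
        probs = torch.softmax(self.actor(state), dim=-1)  # softmax output layer
        value = self.critic(state)
        return probs, value

# Usage with the state vector s_t = {Vout, e_t, de/dt}:
net = ActorCritic()
probs, value = net(torch.tensor([[-12.4, 17.6, 0.3]]))
```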
In the PPO implementation, the error value calculated at each sample time instance t is given by:

et = Vref − abs(Vout), (21)

which is the difference between the reference voltage value and the magnitude of the output voltage value. The absolute value of the measured output voltage is used in both the reward and error value calculations, as the output voltage is reversed in polarity to that of the input voltage, as discussed in Section II-A.

At each sample time step t, a vector representing the state st is constructed. The PPO agent measures and calculates the following parameters of the DC-to-DC buck-boost converter model, which form the state st: the output voltage Vout, the error value et and the change in error de/dt; thus the state vector is represented as st = {Vout, et, de/dt}. Eqn. (22) shows how the change in the error value is calculated:

de/dt = (et−1 − et) / (t − (t − 1)). (22)

The training of the PPO RL agent uses a fixed number of sample steps T unless the termination criterion is met. Should the output voltage value exceed the upper bound uB, the training for that episode is terminated. During the training of the PPO agent, at each sample time step t, the PPO RL agent takes the current state st and the awarded reward value rt as inputs. The reward function is given in Algorithm 1.

Algorithm 1 PPO Reward Function
Input: Vref, Vout, et, et−1, ε, VrefReached
1: uB = Vref · (1 + ε)            ▷ Upper Bound
2: lB = Vref · (1 − ε)            ▷ Lower Bound
3: if (VrefReached == True & Vout < lB) or (Vout > uB) then
4:     rt = −1
5: else
6:     rt = 1/abs(et)
7: end if
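The reward of Algorithm 1 and the state vector st = {Vout, et, de/dt} translate into the short helpers sketched below. This is an illustration of the per-step quantities only, with ε = 0.02 as stated in Section IV-B; the episode bookkeeping of the RL toolbox is omitted, and the one-step denominator of Eqn. (22) is taken as the sample time.

```python
def reward(v_ref, v_out, e_t, ref_reached, eps=0.02):
    """Algorithm 1: penalise leaving the error band, otherwise reward small error."""
    upper = v_ref * (1 + eps)   # uB
    lower = v_ref * (1 - eps)   # lB
    if (ref_reached and abs(v_out) < lower) or abs(v_out) > upper:
        return -1.0
    return 1.0 / abs(e_t)       # as written in Algorithm 1; e_t = 0 would need guarding

def build_state(v_ref, v_out, e_prev, dt=1e-5):
    """State s_t = {Vout, e_t, de/dt}, with the error of Eqn. (21) and slope of Eqn. (22)."""
    e_t = v_ref - abs(v_out)
    de_dt = (e_prev - e_t) / dt
    return (v_out, e_t, de_dt), e_t
```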
4) HYBRID I
This hybrid approach uses the PPO RL agent with a PI controller, as described in Section III-C1. The parameters used for the RL agent and the PI controller are given in Section IV-A3 and Section IV-A2, respectively. The PI controller is implemented to determine the action to be applied to the buck-boost converter once the magnitude of the output voltage exceeds that of the reference voltage, abs(Vout) > Vref.

5) HYBRID II
This hybrid approach uses the PPO RL agent with a filter, as described in Section III-C2. The parameters used for the RL agent are given in Section IV-A3. The filter mechanism is applied if the following condition is violated: the absolute output voltage is greater than the reference voltage, abs(Vout) > Vref.

The simulation time for the PI controller is 3s; for the PPO, Hybrid I and Hybrid II, each of the conducted experiments uses the simulation duration of 0.3s given in Section IV-A3.
B. SETTLING TIME
The settling time is the time elapsed from the instantaneous step to when the outputs of the considered dynamical control system remain within a specified error range. The error range used is 2% of the reference voltage value; thus, the value of ε used in the reward function, Algorithm 1, is 0.02. In Fig. 6, two time periods of interest for a dynamical control system are highlighted: the settling time and the transient time. All the responses, or observed states of a control system, from the end of the simulation duration back to the first data point that does not fall within the error band (the range of accepted values with respect to the reference value) make up the transient time. The time taken to reach the transient time is known as the settling time.
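The settling-time convention above can be computed from a simulated voltage trace as sketched below: scanning backwards from the end of the run, the settled (transient) window is the tail that stays inside the 2% band, and the settling time is the instant at which that window begins. This is an illustrative NumPy helper under those assumptions, not the measurement code of the study.

```python
import numpy as np

def settling_time(t, v_out, v_ref, band=0.02):
    """Return the time at which |Vout| enters the +/-2% band and stays there."""
    inside = np.abs(np.abs(v_out) - v_ref) <= band * v_ref
    if not inside[-1]:
        return None                       # never settles within the simulation
    # Index of the last sample violating the band; settling starts just after it.
    violations = np.flatnonzero(~inside)
    start = violations[-1] + 1 if violations.size else 0
    return t[start]

# Usage with a dummy first-order response towards -30 V:
t = np.linspace(0.0, 0.3, 3001)
v = -30.0 * (1.0 - np.exp(-t / 0.02))
print(settling_time(t, v, v_ref=30.0))
```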
C. EXPERIMENTAL RESULTS
The experimental results for the four different control techniques applied are presented in this section, followed by the analysis of the results obtained.

The results presented in Table 5 record the settling time and the error values for both the entire duration of the simulation time and for the transient duration after the settling time period. For each of the attributes in the results table, the average of the validation experiment values is recorded. The lowest obtained value for each corresponding quantitative measure is highlighted for the respective reference voltage cases.

The quantitative measurements used to evaluate the applied control techniques' performance are the real elapsed time, settling time, MSE, mean absolute error (MAE) and integral absolute error (IAE); the standard deviation σ is calculated for each of these attributes, respectively. These results are tabulated in Table 5.

TABLE 5. Average quantitative measurements of the applied control techniques for the DC-to-DC buck-boost converter using Algorithm 1.

FIGURE 7. Buck-boost converter output voltage for the employed control techniques for reference voltages: (a) 30V, (b) 80V, (c) 110V.

The relationship between time and the output voltage of the buck-boost converter when employing the various controllers is presented for the three different reference voltage values, respectively, in Fig. 7.

1) PI CONTROLLER
The PI controller's settling time is proportional to the magnitude of the reference voltage. It is observed that the average settling time is greater than that of the respective reference values for all three cases. The PI controller can be seen as a stable control technique, based on the numerical output voltage values, the plots illustrating the relationship between time and the output voltage, and the MSE, MAE and IAE values for the transient time. A disadvantage of the PI controller is that its settling time is the longest compared to the other employed control techniques. The PI controller stabilises the output voltage at a value less than that of the desired voltage; hence it does not necessarily guarantee the lowest error values over the transient time.

2) PPO
The PPO algorithm is found to have the shortest settling time when the converter is used for the boost cases. The variance and standard deviation values for the corresponding error values are considered when discussing stability. It is found that these values, for the entire simulation duration, are lower than when the PI or hybrid approaches are used for the boost cases, which can be attributed to the short settling time for these cases. However, if only the transient time is considered, the PPO quantitative measurements are not the lowest, indicating that this is not the most robust method in terms of stability. From the relationship between time and output voltage illustrated in Fig. 7, it can be seen that the PI controller resembles a parabolic decay rate, whilst for the PPO algorithm, exponential decay is seen over the settling time duration.

It is highlighted that the PPO algorithm does not always converge to the desired reference voltage. Hence, 100% of the experiments do not fall within the settling time, as seen for the case when the converter is used for a buck case.

3) HYBRID I
This hybrid approach uses the PPO algorithm and employs the PI controller to determine the actions to be applied to the buck-boost converter once the magnitude of the output voltage reaches or exceeds that of the desired voltage. The shortcoming of the PI controller is that it stabilises to a lower absolute output voltage than the reference voltage; this features in the boost mode instances using this hybrid technique, as was seen for the vanilla PI control method. Taking advantage of the short settling time provided by the PPO and the stability provided by the PI controller, the error values are significantly smaller than those of both the individual implementations of the PPO and PI controller for the 30V case, for both the entire simulation duration and for that after the settling time, making it a robust method for when the converter is used in buck mode.

4) HYBRID II
Combining the PPO algorithm with a filter mechanism in this hybrid approach, it has been found that this method's variance and standard deviation values for the corresponding error values are generally less than those of the PPO algorithm and Hybrid I for the boost cases. This indicates that this approach is the most robust control method with respect to stability, as the mean settling time values, output voltage values and corresponding quantitative tabulated values substantiate the performance of this control technique.

5) REWARD FUNCTION
In [40] the reward function used is the same as Algorithm 1, whilst in [39] it is Algorithm 2; both these methods, respectively, have been applied without the conditional statements to similar DC-DC converters. The results obtained using this alternative reward function for the PPO algorithm are tabulated in Table 6.

The results of the applied PPO algorithm for the DC-to-DC buck-boost converter using the reward function defined in Algorithm 1 are tabulated in Table 5, and the results when Algorithm 2 is used are tabulated in Table 6. From these results, it is found that the PPO used with the reward function defined in Algorithm 1 has a lower settling time and lower quantitative measurements in comparison to when Algorithm 2 is used; hence it was the reward function employed.
Algorithm 2 PPO Updated Reward Function
Input: Vref, Vout, et, et−1, ε, VrefReached
1: uB = Vref · (1 + ε)            ▷ Upper Bound
2: lB = Vref · (1 − ε)            ▷ Lower Bound
3: if (VrefReached == True & Vout < lB) or (Vout > uB) then
4:     rt = −1
5: else
6:     rt = 1/abs(et)^2
7: end if

6) SENSITIVITY TO NOISE
The robustness of the applied controllers to the BBC can be evaluated based on the controllers' performance when experiencing noise. Additive White Gaussian Noise (AWGN) is an added linear noise model applied to the transmitted signal, which has a uniform power across the frequency band of the output signal and has a Gaussian distribution with respect to time. The AWGN channel model is represented by the outputs Hk at discrete time steps k. The value of Hk is the sum of the input Fk and the noise Gk:

Hk = Fk + Gk, (23)

where Gk is independently and identically distributed from a zero-mean normal distribution with variance N, that is, Gk ∼ N(0, N). AWGN is added to the transmitted signal to measure and compare the controllers' performance when experiencing such an impairment. The measurement parameter, signal-to-noise ratio (SNR), compares the power of the desired information signal to the power of the undesired signal or background noise, and is denoted as:

SNR = Psignal / Pnoise. (24)
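A short sketch of how such an impairment can be injected at a prescribed SNR is given below: the noise variance N follows from Eqn. (24) by dividing the measured signal power by the linear SNR. It is an illustrative helper under those assumptions, not the Simulink AWGN block used in the experiments.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=np.random.default_rng(0)):
    """Add zero-mean Gaussian noise per Eqns. (23)-(24) at the requested SNR [dB]."""
    p_signal = np.mean(np.square(signal))            # measured signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # Eqn. (24) rearranged for Pnoise
    noise = rng.normal(0.0, np.sqrt(p_noise), size=len(signal))   # Gk ~ N(0, N)
    return signal + noise                            # Hk = Fk + Gk

# Usage: corrupt a measured output-voltage trace at 30 dB and 25 dB SNR.
v_meas = -30.0 * np.ones(1000)
v_30db, v_25db = add_awgn(v_meas, 30), add_awgn(v_meas, 25)
```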
TABLE 7. Quantitative measurements of the PI control technique with AWGN for the DC-to-DC buck-boost converter with reference voltages: Table (7a) 30V, Table (7b) 80V and Table (7c) 110V.

To compare the performance of the controllers, AWGN with a range of SNRs has been applied to the measured output voltage of the DC-to-DC BBC. Table 7 records the quantitative measurements when applying the PI controller with the AWGN model.

Comparing the results in Table 7 to Table 5, it can be seen that as the SNR decreases, both the average error and the settling time increase, as a result of the increasing noise signal. The settling time is used to decide the threshold SNR of the PI controller. The settling time of the PI controller remains unchanged from the case without AWGN when AWGN with an SNR greater than or equal to 30 dB is applied. However, when the SNR is set to 25 dB, the settling time exceeds that of the PI controller with no AWGN.

TABLE 8. Average quantitative measurements of the applied control techniques with AWGN for the DC-to-DC buck-boost converter with reference voltages: Table (8a) 30V, Table (8b) 80V and Table (8c) 110V.

Table 8 records the results of the PPO controller with AWGN when applied to a BBC for the respective reference voltages. When the Hybrid I controller with AWGN of 25 dB SNR was applied to a BBC with a reference voltage of 30V, the results obtained indicated that the simulation was terminated, as the stopping conditions described in Algorithm 1 were met before reaching the full simulation duration of 0.3s. Thus the results indicate a bias towards the signal value; this is a result of the error band being calculated relative to the reference voltage value, so the cases with a lower reference voltage are more sensitive to noise. Comparing the performance of the applied controllers with the AWGN model, it can be seen that the PPO and Hybrid II controllers are the most robust when considering their quantitative measurements, particularly the percentage of episodes that converged and the error values for the settled time duration.

In summary, applying the DDMPC scheme with the PPO algorithm has been found to give a shorter settling time for the boost cases, as does Hybrid I, compared to the PI controller. However, it is found that the hybrid approaches are the most robust in light of both settling time and stability, as they take advantage of the short settling time provided by the PPO algorithm and the stability ensured by the PI controller and the filtering mechanism. Given that the literature does not document the performance of buck-boost converters with an active load or VPL, there is no direct comparison to previous work in this regard. However, considering similar work [40], where the buck-boost converter has a CPL and the PI controller is tuned using the PPO algorithm, and comparing the error values and the inferred settling time, it is found that the hybrid approaches' performance is comparable. Furthermore, from observing the impact of the reward function, we find that investigating the impact of the employed reward function and optimising the reward function does hold promise in improving the quality of the results found using the DDMPC techniques. With respect to the robustness of the controllers when experiencing noise, the PPO and Hybrid II controllers were found to be the most robust.

V. CONCLUSION
The popularity of renewable energy plants and the increasing number of electronic applications, which are DC in nature, makes the study of DC-to-DC buck-boost converters with active loads nascent. The buck-boost converter converts an input voltage to the desired lower reference output voltage when in buck mode, and to the desired reference with a greater output voltage magnitude when in boost mode. The quality of these converters is based on the settling time to reach the reference voltage and the ability of the controller to maintain a constant output voltage. The impact of the reward function on controllers using the PPO algorithm opens up interesting lines of follow-up research for future development, as does applying and testing the robustness of the discussed control methods on a physical BBC prototype.

DDMPC techniques have been considered to improve the quality of these converters. The applied control techniques' performance on the buck-boost converter was evaluated based on each control technique's settling time and stability. The PI controller's performance was used as a benchmark to compare the performance of the vanilla DDMPC technique using the PPO algorithm. The PPO algorithm was found to provide a short settling time to reach the reference voltage and outperformed the PI controller in this respect.
Taking advantage of the short settling time of the PPO method and the stability provided by the PI controller and the filtering mechanism, merit was found in the hybrid techniques, as their performance surpasses that of the PI controller with respect to settling time and that of the PPO algorithm with respect to stability. Furthermore, the PPO and Hybrid II controllers were found to perform comparably to the controllers without noise when AWGN was applied to the feedback signal; thus, in general, the PPO and Hybrid II controllers have merit with respect to short settling time, stability and sensitivity to noise.

REFERENCES
[1] J. Berberich, J. Kohler, M. A. Muller, and F. Allgower, "Data-driven model predictive control with stability and robustness guarantees," IEEE Trans. Autom. Control, vol. 66, no. 4, pp. 1702-1717, Apr. 2021.
[2] Z. Hou, H. Gao, and F. Lewis, "Data-driven control and learning systems," IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4070-4075, May 2017.
[3] D. Marx, P. Magne, B. Nahid-Mobarakeh, S. Pierfederici, and B. Davat, "Large signal stability analysis tools in DC power systems with constant power loads and variable power loads—A review," IEEE Trans. Power Electron., vol. 27, no. 4, pp. 1773-1787, Apr. 2012.
[4] S. R. Huddy and J. D. Skufca, "Amplitude death solutions for stabilization of DC microgrids with instantaneous constant-power loads," IEEE Trans. Power Electron., vol. 28, no. 1, pp. 247-253, Jan. 2013.
[5] S. Singh, N. Rathore, and D. Fulwani, "Mitigation of negative impedance instabilities in a DC/DC buck-boost converter with composite load," J. Power Electron., vol. 16, no. 3, pp. 1046-1055, May 2016.
[6] Q. Xu, C. Zhang, C. Wen, and P. Wang, "A novel composite nonlinear controller for stabilization of constant power load in DC microgrid," IEEE Trans. Smart Grid, vol. 10, no. 1, pp. 752-761, Jan. 2019.
[7] V. C. Kotak and P. Tyagi, "DC to DC converter in maximum power point tracker," Int. J. Adv. Res. Electr., Electron. Instrum. Eng., vol. 3297, no. 12, pp. 6115-6125, 2007. [Online]. Available: https://www.ijareeie.com
[8] R. F. Coelho, F. Concer, and D. C. Martins, "A study of the basic DC-DC converters applied in maximum power point tracking," in Proc. Brazilian Power Electron. Conf., Sep. 2009, pp. 673-678.
[9] J. M. Carrasco, L. G. Franquelo, J. T. Bialasiewicz, E. Galván, R. C. P. Guisado, M. N. M. Prats, J. I. León, and N. Moreno-Alfonso, "Power-electronic systems for the grid integration of renewable energy sources: A survey," IEEE Trans. Ind. Electron., vol. 53, no. 4, pp. 1002-1016, Jun. 2006.
[10] O. Ibrahim, N. Z. Yahaya, and N. Saad, "State-space modelling and digital controller design for DC-DC converter," Telkomnika, Telecommun. Comput. Electron. Control, vol. 14, no. 2, pp. 497-506, 2016.
[11] A. Kwasinski and C. N. Onwuchekwa, "Dynamic behavior and stabilization of DC microgrids with instantaneous constant-power loads," IEEE Trans. Power Electron., vol. 26, no. 3, pp. 822-834, Mar. 2011.
[12] Z. Zhang, D. Zhang, and R. C. Qiu, "Deep reinforcement learning for power system applications: An overview," CSEE J. Power Energy Syst., vol. 6, no. 1, pp. 213-225, 2019.
[13] K. Osmani, A. Haddad, T. Lemenand, B. Castanier, and M. Ramadan, "An investigation on maximum power extraction algorithms from PV systems with corresponding DC-DC converters," Energy, vol. 224, Jun. 2021, Art. no. 120092, doi: 10.1016/j.energy.2021.120092.
[14] K. Rouzbehi, A. Miranian, J. M. Escaño, E. Rakhshani, N. Shariati, and E. Pouresmaeil, "A data-driven based voltage control strategy for DC-DC converters: Application to DC microgrid," Electronics, vol. 8, no. 5, p. 493, Apr. 2019.
[15] R. H. G. Tan and L. Y. H. Hoo, "DC-DC converter modeling and simulation using state space approach," in Proc. IEEE Conf. Energy Convers. (CENCON), Oct. 2015, pp. 42-47.
[16] S. Arora, P. T. Balsara, and D. K. Bhatia, "Effect of sampling time and sampling instant on the frequency response of a boost converter," in Proc. 42nd Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Oct. 2016, pp. 7155-7160.
[17] X. Zhou and Q. He, "Modeling and simulation of buck-boost converter with voltage feedback control," in Proc. MATEC Web Conf., vol. 31, 2015, pp. 5-9.
[18] K. J. Astrom and T. Hägglund, "Advanced PID control," IEEE Control Syst., vol. 26, no. 1, pp. 98-101, Feb. 2006.
[19] E. F. Camacho and C. Bordons, "Nonlinear model predictive control: An introductory review," in Assessment and Future Directions of Nonlinear Model Predictive Control. Berlin, Germany: Springer, 2007, pp. 1-16.
[20] S.-K. Kim, C. R. Park, J.-S. Kim, and Y. I. Lee, "A stabilizing model predictive controller for voltage regulation of a DC/DC boost converter," IEEE Trans. Control Syst. Technol., vol. 22, no. 5, pp. 2016-2023, Jan. 2014.
[21] S. Bououden, O. Hazil, S. Filali, and M. Chadli, "Modelling and model predictive control of a DC-DC boost converter," in Proc. 15th Int. Conf. Sci. Techn. Automat. Control Comput. Eng. (STA), Dec. 2014, pp. 643-648.
[22] S. L. Brunton and J. N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge, U.K.: Cambridge Univ. Press, 2019.
[23] G. C. Calafiore and L. Fagiano, "Stochastic model predictive control of LPV systems via scenario optimization," Automatica, vol. 49, no. 6, pp. 1861-1866, 2013.
[24] G. Schildbach, L. Fagiano, C. Frei, and M. Morari, "The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations," Automatica, vol. 50, no. 12, pp. 3009-3018, 2014.
[25] S. Grammatico, X. Zhang, K. Margellos, P. Goulart, and J. Lygeros, "A scenario approach for non-convex control design," IEEE Trans. Autom. Control, vol. 61, no. 2, pp. 334-345, Feb. 2016.
[26] M. Lorenzen, F. Dabbene, R. Tempo, and F. Allgöwer, "Stochastic MPC with offline uncertainty sampling," Automatica, vol. 81, no. 1, pp. 176-183, 2017.
[27] L. Buşoniu, T. de Bruin, D. Tolić, J. Kober, and I. Palunko, "Reinforcement learning for control: Performance, stability, and deep approximators," Annu. Rev. Control, vol. 46, pp. 8-28, Jan. 2018.
[28] Z. Wu, J. Zhao, and J. Zhang, "Cascade PID control of buck-boost-type DC/DC power converters," in Proc. 6th World Congr. Intell. Control Automat., vol. 2, 2006, pp. 8467-8471.
[29] M. A. A. Mohamed, Q. Guan, and M. Rashed, "Control of DC-DC converter for interfacing supercapacitors energy storage to DC micro grids," in Proc. IEEE Int. Conf. Electr. Syst. Aircr., Railway, Ship Propuls. Road Vehicles Int. Transp. Electrific. Conf. (ESARS-ITEC), Nov. 2018, pp. 1-8.
[30] R. Sumita and T. Sato, "PID control method using predicted output voltage for digitally controlled DC/DC converter," in Proc. 1st Int. Conf. Electr., Control Instrum. Eng. (ICECIE), Nov. 2019, pp. 1-7.
[31] R. D. Bhagiya and R. M. Patel, "PWM based double loop PI control of a bidirectional DC-DC converter in a standalone PV/battery DC power system," in Proc. IEEE 16th India Council Int. Conf. (INDICON), Dec. 2019, pp. 1-4.
[32] T. Kobaku, R. Jeyasenthil, S. Sahoo, R. Ramchand, and T. Dragicevic, "Quantitative feedback design-based robust PID control of voltage mode controlled DC-DC boost converter," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 1, pp. 286-290, Jan. 2021.
[33] Q. Xu, Y. Yan, C. Zhang, T. Dragicevic, and F. Blaabjerg, "An offset-free composite model predictive control strategy for DC/DC buck converter feeding constant power loads," IEEE Trans. Power Electron., vol. 35, no. 5, pp. 5331-5342, May 2020.
[34] N. Boutchich, A. Moufid, N. Bennis, and S. E. Hani, "A constrained MPC approach applied to buck DC-DC converter for greenhouse powered by photovoltaic source," in Proc. Int. Conf. Electr. Inf. Technol. (ICEIT), Mar. 2020, pp. 1-6.
[35] Z. Zhou, L. Zhang, Z. Liu, Q. Chen, R. Long, and H. Su, "Model predictive control for the receiving-side DC-DC converter of dynamic wireless power transfer," IEEE Trans. Power Electron., vol. 35, no. 9, pp. 8985-8997, Sep. 2020.
[36] J. Fan, S. Li, J. Wang, and Z. Wang, "A GPI based sliding mode control method for boost DC-DC converter," in Proc. IEEE Int. Conf. Ind. Technol. (ICIT), Mar. 2016, pp. 1826-1831.
[37] M. M. Mardani, N. Vafamand, M. H. Khooban, T. Dragičević, and F. Blaabjerg, "Design of quadratic d-stable fuzzy controller for DC microgrids with multiple CPLs," IEEE Trans. Ind. Electron., vol. 66, no. 6, pp. 4805-4812, Jun. 2019.
[38] R. F. Bastos, C. R. Aguiar, A. F. Q. Gonçalves, and R. Q. Machado, "An intelligent control system used to improve energy production from alternative sources with DC/DC integration," IEEE Trans. Smart Grid, vol. 5, no. 5, pp. 2486-2495, Sep. 2014.
[39] M. Gheisarnejad, H. Farsizadeh, M.-R. Tavana, and M. H. Khooban, "A novel deep learning controller for DC-DC buck-boost converters in wireless power transfer feeding CPLs," IEEE Trans. Ind. Electron., vol. 68, no. 7, pp. 6379-6384, Jul. 2021.
[40] M. Hajihosseini, M. Andalibi, M. Gheisarnejad, H. Farsizadeh, and M.-H. Khooban, "DC/DC power converter control-based deep machine learning techniques: Real-time implementation," IEEE Trans. Power Electron., vol. 35, no. 10, pp. 9971-9977, Oct. 2020.
[41] C. Cui, N. Yan, and C. Zhang, "An intelligent control strategy for buck DC-DC converter via deep reinforcement learning," 2020, arXiv:2008.04542. [Online]. Available: http://arxiv.org/abs/2008.04542
[42] M. Gheisarnejad, H. Farsizadeh, M.-R. Tavana, and M. H. Khooban, "A novel deep learning controller for DC/DC buck-boost converters in wireless power transfer feeding CPLs," IEEE Trans. Ind. Electron., vol. 68, no. 7, pp. 6379-6384, Jul. 2021.
[43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: http://arxiv.org/abs/1707.06347
[44] B. Liu, Q. Cai, Z. Yang, and Z. Wang, "Neural proximal/trust region policy optimization attains globally optimal policy," 2019, arXiv:1906.10306. [Online]. Available: http://arxiv.org/abs/1906.10306

KRUPA PRAG is currently a postgraduate student at the University of the Witwatersrand, Johannesburg, South Africa. She is an Associate Lecturer with the School of Computer Science and Applied Mathematics, University of the Witwatersrand. Her research interests include optimization, optimal control theory, and computational intelligence.

MATTHEW WOOLWAY received the Ph.D. degree from the University of the Witwatersrand, Johannesburg, South Africa, in 2018. He is currently a Teaching Fellow in applied mathematics with the Department of Mathematics, Imperial College London, and a Research Associate with the Faculty of Engineering and the Built Environment, University of Johannesburg. His research interests include computational intelligence, artificial intelligence, and optimization.

TURGAY CELIK (Member, IEEE) received the second Ph.D. degree from the University of Warwick, Coventry, U.K., in 2011. He is currently a Professor of digital transformation and the Director of the Wits Institute of Data Science, University of the Witwatersrand, Johannesburg, South Africa. His research interests include signal and image processing, computer vision, machine intelligence, robotics, data science and engineering, and remote sensing. He is an Associate Editor of the ELL (IET), IEEE ACCESS, IEEE GEOSCIENCE AND REMOTE SENSING LETTERS (GRSL), IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS), and SIVP (Springer).