Energy Plus With RL
Supervisor:
Prof. Alfonso Capozzoli
Co-supervisors:
Ing. Silvio Brandi
Ing. Giuseppe Pinto
Candidate:
Davide Borello
After the training phase, the agent is tested in the deployment phase by analysing four different scenarios to examine its adaptability under different conditions.
In conclusion, the agent obtains a significant reduction in the cumulative sum of temperature violations, ranging between 750 °C and 950 °C across all scenarios, while at the same time achieving energy savings between 2% and 6%.
Summary
1. Introduction ....................................................................................................... 10
5.3 Deployment phase ....................................................................................... 67
6 Results ................................................................................................................ 70
7. Discussion ......................................................................................................... 94
8. Conclusion ........................................................................................................ 98
List of figures
Figure 1 - Italian energy consumption by sector [1] ............................................. 10
Figure 2 - General scheme of a single-level control [2] ....................................... 11
Figure 3 - Classification of control methods for HVAC systems ......................... 14
Figure 4 - On/Off control logic [5] ....................................................................... 15
Figure 5 - Proportional action [6] ......................................................................... 15
Figure 6 - Integral action [6] ................................................................................. 16
Figure 7 - Derivative action [6] ............................................................................ 16
Figure 8 - Schematic representation of the standard closed-loop system with MPC
[13] ........................................................................................................................ 18
Figure 9 - Example of RC analogy for a radiant floor system [4] ........................ 19
Figure 10 - Number of publications for action-selection method ......................... 21
Figure 11 - Number of publications for control timestep ..................................... 22
Figure 12 - Number of publications for reward term ............................................ 23
Figure 13 - Equations' coefficient [58] ................................................................. 29
Figure 14 - Windows opening model coefficient [60] .......................................... 31
Figure 15 - Windows closing model coefficient [60] ........................................... 31
Figure 16 - Window's model comparison ............................................................. 33
Figure 17 - Machine Learning branches [62]........................................................ 35
Figure 18 - Typical control loop based on RL ...................................................... 36
Figure 19 - Example of Deep Neural Network [65] ............................................. 40
Figure 20 - Reinforcement Learning Deep Q-Network [30] ................................ 40
Figure 21 - Actor and critic neural network structure [69] ................................... 42
Figure 22 - Framework of the application of SAC control. .................................. 45
Figure 23 - Simulation environment for SAC controller [34] .............................. 48
Figure 24 - Geometrical model ............................................................................. 50
Figure 25 - Building's boundary conditions .......................................................... 50
Figure 26 - Example of the Material definition .................................................... 54
Figure 27 - Example of the Construction definition ............................................. 54
Figure 28 - Example of window definition ........................................................... 54
Figure 29 - Windows' model equations ................................................................ 56
Figure 30 - Implementation of windows' model ................................................... 57
Figure 31 - Example of OtherEquipment definition ............................................. 58
Figure 32 - Case study heating system ................................................................. 59
Figure 33 - Baseline Logic - Climatic curve ......................................................... 60
Figure 34 - Reward function structure .................................................................. 64
Figure 35 - Heating energy consumption of the heating season ........................... 71
Figure 36 - Daily heating consumption comparison ............................................. 71
Figure 37 - Comparison between the energy consumption and the outdoor
temperature ........................................................................................................... 72
Figure 38 - Windows Heat Gain: Comparison between direct solar radiation and
ground floor heat gain ........................................................................................... 73
Figure 39 - Window heat gain comparison between floors in December ............. 74
Figure 40 - Window heat gain comparison between floors in March ................ 74
Figure 41 - Infiltration Heat Loss: Comparison between outdoor temperature and
total infiltration heat loss ...................................................................................... 75
Figure 42 - Infiltration Heat Loss: Comparison between different floors in
December .............................................................................................................. 76
Figure 43 - Infiltration Heat Loss: Comparison between different floors in March
............................................................................................................................... 76
Figure 44 - Daily ventilation heat loss comparison .............................................. 77
Figure 45 - SAC control performance in the last episode of the training phase ... 79
Figure 46 - Comparison of cumulative reward energy-term and temperature-term
............................................................................................................................... 80
Figure 47 - Comparison between three agents during a training day ................... 82
Figure 48 – Comparison of heating energy consumption and cumulative sum of
temperature violations between the deployed control agent in the four scenarios and
the baseline ............................................................................................................ 84
Figure 49 - Energy saving in each scenario .......................................................... 85
Figure 50 - Differences in cumulative sum of temperature violations in each
scenario ................................................................................................................. 86
Figure 51 - Comparison between SAC control agent and baseline in Scenario 1 of
the deployment phase ............................................................................................ 87
Figure 52 - Comparison between SAC control agent and baseline in Scenario 2 of
the deployment phase ............................................................................................ 88
Figure 53 - Comparison between SAC control agent and baseline in Scenario 3 of
the deployment phase ............................................................................................ 89
Figure 54 - Comparison between SAC control agent in Scenario 1 and Scenario 3
............................................................................................................................... 90
Figure 55 - Zone temperature comparison in different floors for Scenario 1 and
Scenario 3 .............................................................................................................. 91
Figure 56 - Comparison between SAC control agent and baseline in Scenario 4 of
the deployment phase ............................................................................................ 92
Figure 57 - Comparison between SAC control agent in Scenario 1 and Scenario 4
............................................................................................................................... 92
List of tables
Chapter 1:
Introduction
1. Introduction
As the importance of climate change themes has grown, programmes focused on reducing primary energy consumption and CO2 emissions have been encouraged in recent years. In this context, the use of Renewable Energy Sources (RES) has also been incentivised.
Since people spend most of their time in buildings, buildings consume a considerable amount of energy; this is due to different factors such as occupants' behaviour, building characteristics and the context in which the building is located.
In particular, as shown in Figure 1, residential buildings in Italy are responsible for around 30% of the total energy consumption.
This energy consumption is mainly due to the HVAC systems that have to satisfy the occupants' comfort requirements. However, these systems are very often either inefficient or not optimally controlled.
In recent years, HVAC systems have become increasingly complex and, consequently, so has the design of control systems, which have to take into account several factors related to grid requirements (Demand Response), occupant preferences and external forcing variables.
These factors, being stochastic, lead to non-linearity of the system, further complicating the control actions. Therefore, recent research has focused on adaptive control systems and new control methods for HVAC systems that maintain the comfort conditions for the occupants while reducing the energy consumption.
Figure 2 shows the general scheme of a single-level control and the principal factors that influence the controller.
function over a prediction time horizon. The main drawback of this method is the high computational cost of constructing the model.
Therefore, researchers started to study the application of Reinforcement Learning (RL) to the control of HVAC systems, a control technique belonging to the Machine Learning family. This technique does not require prior knowledge of either the system or the building to be controlled. In fact, an agent learns an optimal policy directly through interaction with the environment, receiving a reward influenced by the action taken from a certain state.
In this thesis, a residential building located in Turin was controlled through a control algorithm based on SAC (Soft Actor-Critic), a branch of RL that allows the use of continuous action and state spaces and introduces an entropy term in the reward definition. The objective of the control agent is to maintain the comfort conditions by controlling the supply power provided to each floor.
The geometrical model of the building was first built in SketchUp, while the energy model was created in EnergyPlus v9.2.0.
Then, the reinforcement learning control logic is implemented by coupling EnergyPlus and Python. These two software tools are connected through the Building Controls Virtual Test Bed (BCVTB) and the ExternalInterface of EnergyPlus.
The objective is to maintain the indoor temperature within a comfort range when occupants are present, while also trying to obtain a reduction in energy consumption. To do this, the designed control agent chooses the supply power to be delivered to each floor.
Moreover, given the large influence of occupants on energy consumption, some models that simulate the window opening and closing behaviour were tested and implemented to represent reality as closely as possible.
Sections 1.1 and 1.2 present a description, respectively, of the main control techniques for HVAC systems and an overview of previous works on Reinforcement Learning. Section 2 provides a description of the different window models analysed and the qualitative choice of the best one. In contrast, section 3 illustrates the background of this work, introducing RL and SAC. In section 4, the framework of the analysis is described, in particular the case study and the construction of the simulation environment.
In section 5, the development of the SAC control agent is illustrated, together with the introduction of the training and deployment phases.
The results of the simulation environment and the results of both phases are described in section 6. Finally, sections 7 and 8 provide a discussion of the results and the conclusions.
Classical control;
Hard control;
Soft control;
Hybrid control.
Figure 3 highlights the classification of the control techniques for HVAC systems.
Figure 3 - Classification of control methods for HVAC systems
The first category includes On/Off control and PID control. On/Off controllers regulate the process within a defined lower and upper value, in order to keep the process inside this range.
Figure 4 shows the On/Off control logic and the behaviour of the controlled variable. The system stays on until the controlled variable reaches the upper limit of the threshold; then the system is turned off and remains in this state until the variable reaches the lower limit.
On the other hand, the PID controls a variable by using error dynamics, in particular applying three different actions: proportional, integral and derivative. The signal u(t) provided by the controller can be defined with the following equation:

$$u(t) = K_P \, e(t) + K_I \int e(t)\,dt + K_D \frac{de(t)}{dt}$$
in which e(t) indicates the control error, given by the difference between the controlled variable and the setpoint value, K_P is the proportional gain, K_I is the integral gain, and K_D is the derivative gain.
The proportional action alone leaves a residual difference between the actual value of the variable and the desired one. This difference can be reduced by increasing the proportional gain. Figure 5 illustrates the response of the variable subject to the proportional action.
The integral action tends to reduce the offset between the controlled variable and its setpoint. Figure 6 shows the response of the system to both the integral and the proportional action. The following equation defines the parameter T_I:
$$T_I = \frac{K_P}{K_I}$$

With a higher value of K_I, and consequently a lower value of T_I, the offset can be reduced.
Finally, the derivative action increases the stability of the response, decreasing its oscillations. Figure 7 shows the response of a PID controller with all three actions. The parameter T_D is defined by the following equation:

$$T_D = \frac{K_D}{K_P}$$
The stability of the response increases as T_D increases.
The main drawback of this control logic is the tuning of the three parameters K_P, K_I and K_D so as to minimize the offset, respond quickly to disturbances and increase the stability. The two most widely used tuning methods are those proposed by Ziegler and Nichols.
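As an illustration of how the three actions combine, the following is a minimal discrete-time PID sketch in Python; the gains, the sampling time and the variable names are hypothetical and are not taken from the thesis:

```python
class PID:
    """Minimal discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        # Proportional term: reacts to the current error
        p = self.kp * error
        # Integral term: accumulates the error to remove the steady-state offset
        self.integral += error * self.dt
        i = self.ki * self.integral
        # Derivative term: damps oscillations by reacting to the error rate
        d = self.kd * (error - self.prev_error) / self.dt
        self.prev_error = error
        return p + i + d

# Hypothetical usage: track a 20 degC setpoint with a 30-minute control step
controller = PID(kp=0.8, ki=0.05, kd=0.1, dt=1800.0)
u = controller.update(setpoint=20.0, measurement=18.5)
```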
The Hard control category includes control techniques such as Optimal Control,
Robust Control and Gain Scheduling PID, but the most important control method
in this family is Model Predictive Control (MPC). It applies a building model to
predict the future states and to optimise the cost function over a prediction time
horizon; it also takes into account disturbances and constraints [7].
The goal of the MPC is to minimize the cost function, which is influenced by
different factors, such as building dynamics, type of the HVAC system and user
preference. For example, Picard and Helsen [8] modelled only the building envelope, and their cost function aims to minimize the heat inputs from two different heating and cooling systems, each of which has an associated cost. In contrast, Jorissen [9] also modelled the HVAC system and controlled the setpoints of different components to minimize energy consumption.
The objective of the cost function can differ in each case; for example, Cigler et al. [10] and Yang et al. [11] introduced the PMV value to maximize the occupants' thermal comfort. Jorissen et al. [12] developed a model based on statistical data to estimate future air quality, together with an occupancy model.
Another objective of the cost function can be the minimization of the energy cost. This objective is particularly important for electricity-based systems, such as chillers and heat pumps, because the electricity price is variable [13].
Avci et al. [14] and Bianchini et al. [15] studied the response of MPC to demand-response problems applying real-time pricing. Instead, Oldewurtel et al. [16] focused their work on the reduction of the peak electricity demand by optimizing the economic cost. Moreover, Qureshi and Jones [17] and Patteeuw et al. [18] studied the effects of the introduction of RES on the stability and flexibility of the MPC controller.
In recent years, with the introduction of programmes aimed at reducing greenhouse gas emissions, Knudsen and Petersen [19] and Vogler-Finck et al. [20] introduced the minimization of greenhouse gas emissions as the objective function.
Finally, Vandermeulen et al. [21] and Vogler-Finck et al. [22] designed a cost
function that aims to maximize the use of RES, or to minimize the use of fossil
fuels.
Figure 8 shows the schematic representation of the standard closed loop with MPC, which can describe most of the applications in building control. The building is affected by disturbances, such as weather conditions, and it is subject to some constraints, such as the acceptability range for the indoor temperature.
Figure 8 - Schematic representation of the standard closed-loop system with MPC [13]
The most important feature of this technique is the building model, which can be obtained using three different modelling paradigms:
White box models describe the building in detail with physical knowledge; therefore, they are based on the conservation of mass and energy and on the principles of heat transfer. They require information about building geometry, material properties and equipment. The resulting models often include thousands of parameters, so there are a lot of potential sources of inaccuracy. These models are difficult to implement, and they have a high computational cost.
Grey box models simplify the physical representation of the building using the RC (resistance and capacitance) analogy. Figure 9 illustrates an example of this electrical analogy. The thermal mass of the construction is represented by a capacitor, the resistors represent the building elements, such as walls or floors, and the nodes of the network indicate the temperatures of the construction. The order of the dynamic system is defined by the number of capacitors. These models require less computational time than white box models, but they are less accurate.
Black box models use mathematical and empirical equations to describe the
buildings. They require a large dataset to calibrate and train the model.
One of the main drawbacks is the computational time required both for the construction of the building model and for solving the optimization problem. Furthermore, a sudden change in the variables can lead to instability of the solution. Other issues are the availability of data, the possible building-model mismatch and inaccurate measurements. Since the fundamental element of MPC techniques is the building model, an inaccurate model can lead to wrong future predictions.
As regards Soft control, it is an emerging control approach based on the application of Neural Networks (NNs), Fuzzy Logic (FL) and Genetic Algorithms (GAs) [23].
NNs are a mathematical representation of biological neurons which associate input and output actions. They are used to control systems in which the models are not fully known. Curtiss et al. [24] compared the control performance of an NN-based controller and a PID controller for the decentralized and centralized control of an HVAC system. So et al. [25] designed an NN-based controller for an air handling unit (AHU) to minimize both the offset between the temperature and its setpoint and the energy consumption.
FL is a control method which uses a series of if-then rules to imitate human actions in controlling the output variable. Table 1 illustrates an example of Fuzzy Logic.
… … … …
Table 1 - Example of Fuzzy Logic
Huang and Nelson [26] and Arima et al. [27] designed FL controllers to maintain the temperature near the setpoint of an HVAC system.
GAs are derivative-free optimization methods for multi-objective functions. Wright et al. [28] designed a GA controller for a single-zone AHU to define the supply air temperature and the flow rate in order to maximize thermal comfort and minimize the operating cost.
The last category refers to the fusion of Hard and Soft controls.
Finally, in recent years, studies on RL-based techniques for the control of HVAC systems have increased. RL is a model-free control technique which can be implemented without a priori knowledge of the controlled environment or process. In this control approach, a designed agent learns a control policy from its interactions with the environment through a reward.
This typology will be described in the following sections. In particular, an analysis of the existing works is presented in section 1.2, while chapter 2 goes into the details of the methodology.
control, has raised interest in recent years, even though its applications remain limited. Consequently, the reinforcement learning method is becoming increasingly relevant and applicable in building control. Furthermore, this trend is even stronger for Deep Reinforcement Learning, which is the most rapidly developing branch of these algorithms.
Table 2 lists some works which use RL and their respective objectives.
One characteristic that distinguishes different RL algorithms is the action-selection method; in particular, the two most applied approaches are the ε-greedy and the Boltzmann method. Among the analysed works, the ε-greedy method is the most used, while in some cases the implemented method is not specified. The number of publications using each technique is illustrated in Figure 10.
Figure 11 - Number of publications for control timestep
The formulation of the reward equation varies among RL works. Among the studies analysed in this dissertation, five different reward terms are identified:
Energy consumption;
Thermal comfort, related to the occupants' comfort conditions;
Cost function, a term strictly related to the first but focused more on the energy price;
CO2, a term which aims to control the CO2 concentration in the controlled environment;
RES, which takes into account the energy produced by renewable energy sources.
In some cases, the reward equation is formed by two or more competing terms. In fact, among the 27 analysed studies, 20 use multiple terms in the reward.
As can be observed in Figure 12, the most common terms are the energy-related and the comfort-related ones.
Figure 12 - Number of publications for reward term
Among the studies cited above, some deserve more attention for the results obtained. For example, Vazquez-Canteli et al. [29] used a batch reinforcement learning (BRL) algorithm with fitted Q-iteration to control a heat pump and two water tanks (one for heating and one for cooling). In the end, they obtained a reduction in energy consumption while maintaining adequate thermal comfort.
A similar goal was reached by Ki Uhn Ahn and Cheol Soo Park [30], who reduced the building's energy usage by 15.7% in comparison with the baseline operation while maintaining the indoor CO2 concentration below 1,000 ppm. These targets were reached through a deep Q-network (DQN), designed with two hidden layers, for model-free optimal control balancing between different HVAC systems.
An example of a multi-agent RL problem was proposed by Nagarathinam et al. [31]. In their work they introduced MARCO (Multi-Agent Reinforcement learning based COntrol), which is based on the Double Deep Q-Network algorithm and uses two separate control agents to control both the AHUs and the chillers, setting the building and chiller setpoints to optimize the HVAC operation. The control objective was the minimization of the HVAC energy consumption while respecting the comfort constraints. The designed agents were trained and deployed in a real configuration and, as a result, MARCO learned the optimal policy, improving comfort and obtaining an energy saving of about 17%.
Park and Nagy [32] carried out another study on the application of a Q-Learning (Tabular Q-Learning) algorithm to an HVAC system. The agent learns the occupant behaviour and the indoor environment by monitoring indoor air temperature, occupancy and thermal votes, and estimates adaptive thermostat set-points to balance occupant comfort and energy efficiency.
Zhang et al. [33] proposed a control algorithm based on A3C (Asynchronous Advantage Actor-Critic) and implemented it on a radiant heating system. In particular, this control agent regulates the supply water temperature set-point of the Mullion system in order to reduce the heating demand and to maintain indoor thermal comfort.
An example of Deep Reinforcement Learning (DRL) applied to the control of a radiant heating system with a boiler and radiators is provided by Brandi et al. [34]. The controller is implemented to manage the supply water temperature setpoint to the terminal units. Moreover, two sets of input variables are analysed to estimate their impact on the adaptability of the DRL controller, and both a static and a dynamic deployment of the DRL controller are performed. The trained control agent is tested in four different scenarios to assess its adaptability to the variation of the forcing variables. When the set of variables is appropriately selected, the energy saving ranges between 5% and 12%.
Last but not least, a more complex usage of Q-Learning (Double Deep Q-Network) was made by Ding et al. [35], who developed a system called OCTOPUS, which uses a data-driven method to find the optimal control series of all the building's subsystems, including HVAC, lighting, blinds and window systems. Overall, they demonstrated that OCTOPUS can achieve 14.26% and 8.1% energy savings compared with a state-of-the-art rule-based system and the latest DRL-based method available in the literature, respectively, while maintaining human comfort within the desired range.
Work | RL algorithm | Objective
Thermal and Energy Management Based on Bimodal Airflow-Temperature Sensing and Reinforcement Learning [51] | A3C | Minimize the energy consumption
Advanced Building Control via Deep Reinforcement Learning [52] | Not described in detail | Reducing energy consumption
Energy optimization associated with thermal comfort and indoor air control via a deep reinforcement learning algorithm [53] | DQN | Optimization of energy consumption of air-conditioning systems in association with thermal comfort and indoor air quality
Towards optimal control of air handling units using deep reinforcement learning and recurrent neural network [54] | Q-Learning (Deep Q-Network with LSTM) | Minimizing energy consumption while maintaining thermal comfort for occupants
HVACLearn: A reinforcement learning based occupant-centric control for thermostat set-points [32] | Q-Learning (Tabular Q-Learning) | Calculating thermostat set-points to balance between occupant comfort and energy efficiency
MARCO - Multi-Agent Reinforcement learning based Control of building HVAC systems [31] | Q-Learning (Double Deep Q-Network) | Maintaining thermal comfort with the lowest energy consumption
OCTOPUS: Deep Reinforcement Learning for Holistic Smart Building Control [35] | Q-Learning (Double Deep Q-Network) | Minimize the energy consumed by all subsystems in the building and maintain the human comfort metrics within a particular range
Deep Reinforcement Learning to optimise indoor temperature control and heating energy consumption in buildings [34] | Q-Learning (Double Deep Q-Network) | Reduce the amount of thermal energy while maintaining indoor air temperature within an acceptability range during occupied periods
Chapter 2:
Modelling occupants’ behaviour
2. Modelling occupants’ behaviour
As many researchers describe, occupant behaviour has a considerable influence on residential consumption. In particular, van den Brom et al. [56] found that at least 54% of the variance in energy consumption in similar buildings can be explained by building characteristics, 17% by the occupants' lifestyle, 15% by the change of occupants and 13% by house-related quality differences. Therefore, occupants contribute approximately 50% of the variance.
While in older buildings the physical characteristics have a more decisive impact on the variance, in more recent ones the households cause a larger share of it.
Within the occupants' lifestyle, one of the actions with the greatest impact on energy consumption is the window opening behaviour.
In this study, to better represent the occupant behaviour, and in particular the window behaviour, different window models found in the literature are analysed.
The first one is the model proposed by Rouleau and Gosselin [57], based on the study of eight apartments in Quebec City (Canada). This article aims to create a probabilistic window opening model based on a logistic regression to predict the state (open/closed) of the windows according to different parameters related to the indoor and outdoor environments and to time-related terms. After a sensitivity analysis, the two most relevant parameters turn out to be the indoor and outdoor temperatures, and consequently two equations are created, one for the opening of the windows and one for the closing:

$$\mathrm{logit}(p_{open}) = \ln\left(\frac{p_{open}}{1-p_{open}}\right) = -6.216 + 0.059 \cdot T + 0.33 \cdot T'$$

$$\mathrm{logit}(p_{close}) = \ln\left(\frac{p_{close}}{1-p_{close}}\right) = -0.871 - 0.091 \cdot T - 0.028 \cdot T'$$

where T and T' denote the indoor and outdoor temperatures as defined in [57].
Instead, Andersen et al. [58] carried out their study in 15 dwellings in Denmark and proposed a model based on logistic regression. However, unlike the previous model, the coefficients of the equations vary with both the day of the week and the time of the day. The parameters involved in the equations and their coefficients are shown in Figure 13.
Calì et al. [59] carried out their study in 90 dwellings in Germany, applying the same method as the other two models. They found that the most common drivers influencing the opening were the time of the day and the carbon dioxide concentration, while the most common drivers leading to closure were the outdoor temperature and the time of the day. The days are divided into:
Night, low probability of action: 7 hours, between 11:00 p.m. and 5:59 a.m.
The following two equations show, respectively, the model for the opening and for the closure:

$$\mathrm{logit}(p_{open}) = \alpha - 551.15 \cdot \frac{1}{CO_2} + 0.134 \cdot T$$

$$\mathrm{logit}(p_{close}) = \alpha - 785.7 \cdot \frac{1}{CO_2} - 0.268 \cdot T - 0.058 \cdot RH - 0.105 \cdot T' + 0.022 \cdot RH'$$

where the two temperature (T, T') and relative humidity (RH, RH') terms refer to the indoor and outdoor values as defined in [59].
in which α varies with the time of the day; its values are reported in Table 3.

             | Open    | Close
α night      | -10.089 | 2.539
α morning    | -8.214  | 3.317
α rest of day| -7.795  | 3.955

Table 3 - Value of the coefficient α [59]
Lastly, Jones et al. [60] worked on 10 dwellings located in Torquay (south-west UK). They applied the logistic regression model and studied the influence of the indoor and outdoor temperature and relative humidity, wind speed, solar radiation and rainfall on the two probabilities. In both cases, the equation coefficients vary with both the time of the day and the season.
Figure 14 shows the coefficients of the window opening model, while the coefficients of the closing model are illustrated in Figure 15.
Figure 14 - Windows opening model coefficient [60]
The main problem in the application of these models is the climatic difference between the locations, and these differences can lead to a broad divergence between the simulated conditions and the real ones.
To show the weather differences, Table 4 presents the Heating Degree Days (HDD) of the various localities.
To choose the model that best fits the case study, each of them was implemented in a different simulation and, through a qualitative analysis of the evolution of the window state, the one with the most realistic behaviour was chosen.
Figure 16 shows the comparison of the window states obtained by applying the four different models. In these plots, the open state is represented by 1, while the closed state is represented by 0.
The implementation of the model proposed by Rouleau and Gosselin [57], illustrated in Figure 16a, leads the simulation to a prolonged open state of the windows, in contradiction with the night hours and the autumn season.
Meanwhile, the use of the model proposed by Calì et al. [59] does not give good results because the state of the window never changes; in fact, it always stays closed, as can be seen in Figure 16c.
On the other hand, applying the English model proposed by Jones et al. [60], it turns out that the window state changes too frequently and the windows stay open for a long time, in contrast with the season.
The best model seems to be the Danish one proposed by Andersen et al. [58], illustrated in Figure 16b, which gives results consistent with reality, with less frequent openings, each lasting a single time step.
Therefore, this last model was implemented in the simulation environment for both the training and the deployment phase of the SAC control agent.
Chapter 3:
Reinforcement Learning
3. Reinforcement Learning
Reinforcement learning is a branch of Machine Learning, as illustrated in Figure 17, together with supervised learning and unsupervised learning. Compared with the other two types of machine learning, it requires inputs and provides outputs together with a score for them [61].
Essential for this technique is a control policy, which is learnt through interaction with the environment. Consequently, an agent is created to choose the best actions to achieve a specific objective; Figure 18 shows a typical RL loop structure. Since RL does not require a model known a priori, it is defined as model-free.
Figure 18 - Typical control loop based on RL
RL is based on the Markov Decision Process (MDP), in which the future state is independent of what has already happened, being conditioned only on the current state. In particular, both the reward and the transition probability between two states depend only on the current state and the chosen action.
In the MDP, the mathematical formalization of the interaction between environment and agent is described by the following components [50]:
Action space (a ∈ A), which is the set of all possible actions. The agent selects one of them at each time step;
State space (s ∈ S), which is the set of all possible environment states. It can be divided into time-dependent, controllable and exogenous (uncontrollable) state information;
Reward (r), which is a scalar value released by the environment after the evaluation of both the chosen action and the new state;
Policy (π), which is a mapping between states and the probability of selecting each action. The agent aims to learn the optimal policy;
Transition probability distribution, which represents the likelihood of the transition to the next state when the agent is in the initial state and takes a given action.
Furthermore, the objective of the control agent is to learn the optimal policy that maximizes the total reward (return).
In addition, the two value functions, called the state-value function and the action-value function respectively, are very valuable for determining the optimal policy [50]:
The first one represents the expected return of the agent starting from a state s and following the policy π [34]:

$$V_\pi(s) = E\left[r_{t+1} + \gamma V_\pi(s_{t+1}) \mid S_t = s\right]$$

where γ, ranging in [0,1], is the discount factor for future rewards:
a) if γ = 0, the agent gives greater importance to the immediate reward, neglecting the future ones;
b) if γ = 1, the agent gives more weight to the future rewards.
The second one represents the expected return of the agent when it takes the action a in a certain state s and then follows the policy π [34]:

$$Q_\pi(s, a) = E\left[r_{t+1} + \gamma Q_\pi(s', a') \mid S_t = s, A_t = a\right]$$
These two functions are obtained through the experience of the agent, and they are updated online during the training phase.
The RL agent is trained through a trial-and-error approach using a technique called on-policy learning, which means that, after having tried and evaluated the performance of various policies, it improves them as much as possible. On the other hand, in the analogous method named off-policy learning, the agent learns from policies already created for other cases, but the main issue is the limited ability to explore the action space.
Furthermore, all RL problems can be divided into two main categories [63]:
One of the peculiarities that characterize reinforcement learning is the trade-off between exploration and exploitation in the action selection, which a well-designed control agent has to balance.
During exploration, the agent selects new random actions, neglecting the maximization of the reward, while during exploitation the agent selects actions already undertaken in order to maximize the reward.
Exploration is concentrated in the initial phase, when the agent explores the whole action space. Consequently, after a certain period, exploitation becomes more significant in order to reach the objective.
Moreover, to balance these two behaviours, two methods are commonly used (a code sketch of both is given after this list):
In the ε-greedy method, the agent chooses the currently known action with the highest estimated value with a probability of 1-ε and selects a random action with a probability of ε [64]. ε represents the exploration rate, and it can decrease over time in order to favour the exploitation phase. This method can be described by the following two equations:

$$P(a_t = \text{random action}) = \epsilon$$

$$P\left(a_t = \arg\max_a Q(s_t, a)\right) = 1 - \epsilon$$

The Soft-max method selects the action based on the action's estimated value and on τ, which is the Boltzmann temperature constant. The agent tends to exploit more when most of the action space has already been explored [64]. The next equation represents this method:

$$P(a_i) = \frac{e^{Q(a_i)/\tau}}{\sum_j e^{Q(a_j)/\tau}}$$
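A minimal Python sketch of the two action-selection rules; the Q-value array and the parameter values are hypothetical and used only for illustration:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def softmax_boltzmann(q_values, tau):
    """Sample an action with probability proportional to exp(Q / tau)."""
    prefs = np.exp((q_values - np.max(q_values)) / tau)  # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))

# Hypothetical Q-values for a 4-action problem
q = np.array([1.0, 0.5, 2.0, 1.5])
a1 = epsilon_greedy(q, epsilon=0.1)
a2 = softmax_boltzmann(q, tau=0.5)
```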
3.1 Q-Learning
Q-learning is one of the most popular model-free RL methods. It belongs to the Temporal Difference (TD) techniques, and it is used when the model provides incomplete data.
Compared to the other two RL techniques (Monte-Carlo (MC) and Dynamic Programming (DP)), the TD methods converge towards an optimal policy faster [63].
In Q-learning, all transitions are represented by a table, called the Q-Table, in which each entry represents a state-action tuple and the corresponding state-action value, or Q-value, is stored.
Overall, Q-learning tries to estimate the Q-values from experience, and they are updated according to Bellman's equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where α, the learning rate, lies in [0,1]. It determines the extent to which new information overrides old knowledge: for example, α = 1 means that the new data completely overrides the old one, while α = 0 means that no learning occurs [34].
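A minimal tabular Q-learning update in Python, following the equation above; the state/action sizes, the hyperparameter values and the example transition are hypothetical:

```python
import numpy as np

n_states, n_actions = 50, 4            # hypothetical sizes
alpha, gamma = 0.1, 0.9                # hypothetical learning rate and discount factor
Q = np.zeros((n_states, n_actions))    # Q-table initialised to zero

def update(state, action, reward, next_state):
    """One Bellman update of the Q-table entry for (state, action)."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

# One hypothetical transition: from state 3, action 1 gives reward -0.5 and leads to state 7
update(state=3, action=1, reward=-0.5, next_state=7)
```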
Figure 19 - Example of Deep Neural Network [65]
Moreover, the Q-values are indicated with the following formula, taken from [66]:

$$Q(s, a) = Q(s, a; \theta)$$

The equation represents the Q-network, in which the term θ, the set of weights of the network, parameterizes the Q-value function. The number of neurons in the input layer is equal to the number of variables that compose the state space, while the number of neurons in the output layer corresponds to the size of the action space [34].
This structure, which is represented in Figure 20, is helpful because the network allows learning the relation between the states and the Q-value of each action, which is unknown a priori and is learnt over successive interactions with the environment. Overall, the Q-value is updated following Bellman's equation (introduced in section 3.1).
3.3 Soft Actor-Critic
Model-free deep reinforcement learning (RL) algorithms suffer from two significant hurdles, very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains.
In particular, algorithms like TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) have stochastic policies, and they use an on-policy process to improve them. As a consequence, they suffer from low sample efficiency, because they require new samples to be collected after each policy update.
On the other hand, algorithms like DDPG (Deep Deterministic Policy Gradient) and TD3 (Twin Delayed DDPG) use deterministic policies, and they adopt an off-policy approach for the optimization. Compared to the previous algorithms, they present a better sample efficiency thanks to the replay buffer, but they are extremely brittle and suffer from hyperparameter sensitivity [67].
To try to overcome these obstacles, a new algorithm called Soft Actor-Critic (SAC) was proposed as a combination of the properties of the two previous groups. SAC is an off-policy algorithm which combines a stochastic policy with a replay buffer, and it introduces entropy regularization. This algorithm allows the use of a continuous action space instead of the discrete one used in traditional RL algorithms.
SAC aims to maximize a new objective function composed of two terms, the expected reward and the entropy term. This last term expresses the attitude towards choosing random actions.
High entropy is necessary to encourage exploration, to push the policy to assign equal probabilities to actions with the same Q-values, and to guarantee that it does not always select a particular action that could lead to inconsistency in the approximated Q-function. Consequently, SAC encourages the policy network to
explore and not to assign a very high probability to any one part of the range of actions [67].
As proposed by Haarnoja et al. [68], the soft actor-critic algorithm includes three principal elements: an actor-critic architecture with separate policy and value function networks, an off-policy formulation that allows the reuse of previously collected samples, and entropy maximization to encourage stability and exploration. The structure of the actor and critic neural networks is shown in Figure 21.
where 𝒟 is the distribution of previously sampled states and actions, or a
replay buffer.
2. The soft Q-function parameters can be trained to minimize the soft Bellman residual (a numerical sketch of the target $\hat{Q}$ is given after this list):

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t)\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right]$$

3. The policy parameters can be learned by minimizing the expected KL-divergence:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D}}\left[D_{KL}\left(\pi_\phi(\cdot\,|\,s_t)\,\middle\|\,\frac{\exp\left(Q_\theta(s_t,\cdot)\right)}{Z_\theta(s_t)}\right)\right]$$
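To make the soft Bellman residual concrete, the following NumPy sketch shows how the soft target $\hat{Q}$ used in $J_Q$ can be computed for a batch of non-terminal transitions. It assumes the common formulation with two target critics and a fixed entropy coefficient α; all numerical values and array names are hypothetical placeholders:

```python
import numpy as np

gamma = 0.9   # discount factor (hypothetical value)
alpha = 0.2   # entropy temperature (hypothetical value)

def soft_q_target(rewards, q1_next, q2_next, logp_next):
    """Soft Bellman target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    # Taking the minimum of the two target critics mitigates overestimation,
    # and the -alpha * log pi term is the entropy bonus characteristic of SAC.
    soft_value_next = np.minimum(q1_next, q2_next) - alpha * logp_next
    return rewards + gamma * soft_value_next

# Hypothetical batch of 3 transitions
r = np.array([-0.5, -0.2, -0.8])      # rewards
q1n = np.array([-3.0, -2.5, -4.0])    # target critic 1 evaluated at (s', a' ~ pi)
q2n = np.array([-2.8, -2.7, -3.9])    # target critic 2 evaluated at (s', a' ~ pi)
logp = np.array([-1.2, -0.9, -1.5])   # log pi(a'|s')
target = soft_q_target(r, q1n, q2n, logp)
# The critic loss J_Q is then 0.5 * mean((Q_theta(s, a) - target) ** 2)
```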
Chapter 4:
Framework of the analysis and Case Study
4. Framework of the analysis and Case Study
In this section the methodological framework is illustrated, with the goal of introducing the stages of the development of the SAC control agent. The framework is based on three different levels, as shown in Figure 22.
The first phase of the framework is the problem formulation, which aims to define the principal components of the learning problem. The action space includes all the possible control actions that the agent can take. The reward is the function that describes the performance of the control agent with respect to the objectives. Lastly, the state space is the set of variables that describe the environment. These variables are sent to the control agent.
The second stage of the procedure is the training phase, in which the control agent was trained. In this phase, a sensitivity analysis was performed on the most relevant hyperparameters by training the control agent with different configurations.
The training process repeats a training episode multiple times in order to improve the agent's control policy. At the end of this phase, after the comparison between the different configurations and the baseline, the best configuration was selected.
In the last phase, the trained agent was tested through a static deployment on one episode covering a different period from the training episode. The deployment was carried out in four different scenarios. Finally, a comparison between the different scenarios and the baseline was performed.
Control time step, which represents the time step at which the agent takes an action; in this case, the control time step is set equal to 30 minutes;
Episode, which represents the duration of the simulation performed by EnergyPlus. During the training phase, one episode is repeated multiple times in order to explore different paths. In this case, a training episode lasts two months.
For the exchange of data between the control agent and the EnergyPlus simulation, the model proposed by [34] is used; it is illustrated in Figure 23, in which the green lines represent the exchange of data between Python and EnergyPlus, managed using BCVTB.
In particular, the loop is characterized by four functions in Python:
Step(), which receives the action selected by the agent and translates the encoded value into a physical control action. This function also returns four objects: next state, reward, done (True/False) and info;
If the episode is the last one, the process ends here; otherwise a new episode starts from the reset() function.
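A minimal sketch of how such a loop can be organised on the Python side with a Gym-like interface. The environment and agent classes below are hypothetical stand-ins (the real data exchange with EnergyPlus through BCVTB happens inside step() and reset(), and the real agent is the SAC algorithm described in chapter 5):

```python
import random

class EnergyPlusEnv:
    """Hypothetical stand-in for the BCVTB-coupled EnergyPlus environment."""
    def reset(self):
        return [0.0]                              # dummy initial state
    def step(self, action):
        next_state = [random.random()]            # dummy next state
        reward = -abs(action)                     # dummy reward
        done = random.random() < 0.01             # dummy end-of-episode flag
        return next_state, reward, done, {}

class SACAgent:
    """Hypothetical stand-in exposing an act()/learn() interface."""
    def act(self, state):
        return random.uniform(-1.0, 1.0)          # dummy continuous action
    def learn(self, *transition):
        pass                                      # the real agent would update its networks here

env, agent = EnergyPlusEnv(), SACAgent()
for episode in range(10):                         # number of training episodes (hypothetical)
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)   # one 30-minute control step
        agent.learn(state, action, reward, next_state, done)
        state = next_state
```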
The building under study is located in Turin, Italy, and it is representative of a large portion of the Italian building stock in terms of both heating system configuration and building construction characteristics. It is a five-level building with a net heated surface of 527 m², and each floor represents a thermal zone. Table 5 shows both the volume and the floor surface of each level.
Volume [m³] | Surface [m²]
For a better understanding, Figure 24 shows the developed model, in which the various floors are easily recognizable and it is already possible to distinguish the opaque walls, the windows and the ceiling.
Figure 24 - Geometrical model
Figure 25 represents the boundary condition of each surface, required for the energy balance calculation. Blue surfaces indicate the external ones, while the brown ones are in contact with the ground. Furthermore, the external walls that are in contact with nearby buildings have been considered adiabatic, assuming that the adjacent structures are also air-conditioned; they are represented by pink surfaces.
season, which in Turin goes from 15th October to 15th April, and the simulation time step is set equal to 30 minutes.
To complete the definition of the building, every material present in the constructions was inserted. For these materials, some properties that are fundamental to determine the dynamics of the building are defined, such as density, conductivity and specific heat. Afterwards, the constructions are created using these materials according to the following stratigraphies.
The external walls are composed of two layers of lime plaster and one layer of brick. There is no insulating material; in fact, the transmittance value of this type of construction (0.985 W/m²K) is higher than its limit value (0.300 W/m²K). Table 7 summarises the thermal properties of the external walls.
Layer | Thickness [mm] | Density [kg/m³] | Conductivity [W/mK] | Thermal resistance [m²K/W] | Specific heat [J/kgK]
Internal surface Rt | - | - | - | 0.13 | -
Lime plaster | 15 | 1800 | 0.90 | 0.017 | 840
Table 8 shows the roof stratigraphy of this building. It is a wooden roof with two layers of air gap, and the outside layer is made of tiles. This construction also presents a U-value (1.215 W/m²K) higher than the limit one (0.26 W/m²K).
Layer | Thickness [mm] | Density [kg/m³] | Conductivity [W/mK] | Thermal resistance [m²K/W] | Specific heat [J/kgK]
Internal surface Rt | - | - | - | 0.10 | -
Lime plaster | 15 | 1800 | 0.90 | 0.017 | 840
The internal vertical partitions have a construction similar to that of the exterior walls; the only difference is the thickness of the brick layer, which in this case is 120 mm. Given the absence of insulation material and the small thickness of the wall, this construction presents a very high transmittance value (2.174 W/m²K) compared to the limit one (0.80 W/m²K). Table 9 shows the properties of this type of construction.
Layer | Thickness [mm] | Density [kg/m³] | Conductivity [W/mK] | Thermal resistance [m²K/W] | Specific heat [J/kgK]
Internal surface Rt | - | - | - | 0.13 | -
Lime plaster | 15 | 1800 | 0.90 | 0.017 | 840
The horizontal partitions host the radiant floor system, which provides the heating power to each floor. The stratigraphy of this construction is described in Table 10. The pipes of the heating system are located in the screed layer. The construction has a transmittance value of 1.376 W/m²K which, also in this case, is higher than the limit value (0.80 W/m²K).
Layer | Thickness [mm] | Density [kg/m³] | Conductivity [W/mK] | Thermal resistance [m²K/W] | Specific heat [J/kgK]
Internal surface Rt | - | - | - | 0.13 | -
Lime plaster | 15 | 1800 | 0.90 | 0.017 | 840
The resulting transmittance is 2.681 W/m²K, which does not respect the limit value imposed by the standards.
Table 11 shows the thermal properties of the window glazing stratigraphy.
Layer | Thickness [mm] | Density [kg/m³] | Conductivity [W/mK] | Thermal resistance [m²K/W] | Specific heat [J/kgK]
Internal surface Rt | - | - | - | 0.13 | -
Glass | 4 | 2500 | 1.00 | 0.004 | 840
In order to simplify the design of the energy model, the windows are defined using the object WindowMaterial:SimpleGlazingSystem, in which the global U-value of the windows is specified. Figure 28 shows an example of the definition of the window properties.
Furthermore, the building was split into five thermal zones, each of them corresponding to a different floor. Each thermal zone is characterised by its own schedules and temperature setpoint.
Since the presence of occupants in the building is stochastic, a different occupancy schedule is implemented on each floor. The schedules differ in both the arrival and leaving hours and in the number of people in the zone; this is done to represent different occupants' behaviours. It has been hypothesised that each flat is occupied by a family of four people. Table 12 illustrates the occupancy schedule of each floor.
Floor | Range | Number of people
Ground floor | 7.00 – 17.00 | 0
Ground floor | 19.00 – 7.00 | 4
First floor | 8.00 – 19.00 | 0
First floor | 22.00 – 8.00 | 4
Second floor | 6.00 – 7.00 | 3
Second floor | 7.00 – 16.00 | 0
Second floor | 16.00 – 18.00 | 2
Second floor | 18.00 – 6.00 | 4
Third floor | 6.00 – 8.00 | 2
Third floor | 8.00 – 18.00 | 0
Third floor | 18.00 – 20.00 | 3
Third floor | 20.00 – 6.00 | 4
Fourth floor | 7.00 – 16.00 | 0
Fourth floor | 18.00 – 7.00 | 4
Table 12 - Occupancy schedules
The window state, open or closed, is passed to EnergyPlus through a schedule that is updated at each simulation timestep according to the model proposed by Andersen et al. [58], which can be represented by the following equations:
$$\log\left(\frac{p_{open}}{1-p_{open}}\right) = \alpha_0 + \alpha_1 T_{in} + \alpha_2 CO_{2,in} + \alpha_3 T_{out} + \alpha_4 v_{wind} + \alpha_5 I_{sol}$$

$$\log\left(\frac{p_{close}}{1-p_{close}}\right) = \alpha_0 + \alpha_1 T_{in} + \alpha_2 RH_{in} + \alpha_3 CO_{2,in} + \alpha_4 T_{out} + \alpha_5 v_{wind} + \alpha_6 RH_{out} + \alpha_7 I_{sol}$$
where the coefficients α_i can be found in Figure 13. These values change with both the hour of the day and the day of the week.
Figure 29 shows the Python function used for the implementation of the two window equations.
Through a probabilistic function, the model receives the value 1 if the windows are open and 0 if the windows are closed. The state value of the windows is determined by the Python function illustrated in Figure 30, and it is passed through BCVTB to the EnergyPlus schedule that regulates the ZoneVentilation:DesignFlowRate object.
In this function, the variable pc verifies the presence of occupants in the thermal zone, ss indicates the state of the window, ww is the schedule value to be sent to the building model, and pp is the value obtained by the application of the equation. Being a probabilistic model, the value of pp is compared with a random number, which is generated at each timestep, and if the probability is higher than the random number, the state of the window changes.
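The logic described above can be sketched as follows. This is a hedged reconstruction of the functions shown in Figures 29 and 30, reusing the variable names pc, ss, ww and pp mentioned in the text; the function signature, the coefficient dictionary and the reduced set of regressors are hypothetical simplifications, not the exact implementation of the thesis:

```python
import math
import random

def window_state(pc, ss, t_in, t_out, co2_in, coeffs):
    """Update the window state ss (1 = open, 0 = closed) with a logistic model.

    pc     : 1 if occupants are present in the zone, 0 otherwise
    ss     : current window state
    coeffs : alpha coefficients for the current hour/day (hypothetical structure)
    """
    if pc == 0:
        return ss                                   # no occupants: the state is not changed
    if ss == 0:                                     # window closed -> opening equation
        logit = (coeffs["a0_open"] + coeffs["a1_open"] * t_in
                 + coeffs["a2_open"] * co2_in + coeffs["a3_open"] * t_out)
    else:                                           # window open -> closing equation
        logit = (coeffs["a0_close"] + coeffs["a1_close"] * t_in
                 + coeffs["a2_close"] * t_out)
    pp = 1.0 / (1.0 + math.exp(-logit))             # probability of changing state
    if pp > random.random():                        # probabilistic switch
        ss = 1 - ss
    ww = ss                                         # schedule value sent to EnergyPlus via BCVTB
    return ww
```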
The application of the window model mainly affects the natural ventilation losses and, consequently, the load required for heating. Each thermal zone has its own schedule, independent from the others.
The air flow due to the opening of the window was modelled in EnergyPlus by using the object ZoneVentilation:DesignFlowRate, assuming a natural air flow of 2 Air Changes per Hour (ACH):

$$\dot V_{vent} = 2 \ \mathrm{ACH}$$

These air flows are considered only when the window model returns the open state for the windows on a given floor. The window states are updated every 30 minutes, corresponding to the simulation timestep. Therefore, if the window is open, there will be a natural ventilation heat loss which lasts until the state changes again.
The other sources of internal heat gain considered are the lights and the electric equipment. Both are controlled by two schedules, which regulate the heat gain coherently with the presence of occupants.
Concerning the lights, it has been assumed that standard lights with a lighting power of 7 W/m² are installed throughout the building.
Infiltrations are the unplanned air flows from the external environment directly into the thermal zone. They are generally caused by the imperfect sealing of windows and doors or occur through the building elements. When the outdoor temperature is lower than the indoor one, the infiltrations lead to heat losses. In this energy model they are modelled by using the object ZoneInfiltration:DesignFlowRate, and it has been hypothesised that the air flow is equal to 0.1 ACH in each thermal zone:

$$\dot V_{inf} = 0.1 \ \mathrm{ACH}$$
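For reference, EnergyPlus derives the volumetric flow rate from an ACH specification using the zone volume, $\dot V = n_{ACH} \cdot V_{zone} / 3600$ in m³/s. As a worked example with a purely hypothetical zone volume of 300 m³ (the real per-floor volumes are those listed in Table 5):

$$\dot V_{inf} = \frac{0.1 \cdot 300}{3600} \approx 0.0083\ \mathrm{m^3/s}, \qquad \dot V_{vent} = \frac{2 \cdot 300}{3600} \approx 0.167\ \mathrm{m^3/s}$$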
Since the objective of the SAC control agent will be to maintain the temperature within the acceptability range by controlling the supply power provided to each thermal zone, the simplest way to regulate this power is the introduction of an internal heat gain source in each zone. For this purpose the EnergyPlus object OtherEquipment is used. The design power level in Watts defined in this object must be equal to the maximum power that can be supplied by the existing system in that zone. The schedule related to this object specifies the percentage of power to be supplied at each timestep. Figure 31 shows an example of the definition of the OtherEquipment object.
However, for the application of this strategy in real cases, the control of the supply power depends on the variables that the control system can regulate, such as the flow rate or the supply temperature.
4.1.4 Heating system and Baseline control logic
The building is heated through a radiant floor heating system. The hot water loop is composed of a 70 kW gas-fired boiler, a variable speed pump and a collector that splits the flow towards the five radiant floors. A simplified scheme of the building's heating system is provided in Figure 32.
Since the radiant floor can only act on the sensible load, and no comfort parameter is monitored, the work focuses on the internal temperature of the thermal zones.
The baseline control logic is a combination of a rule-based and a climatic-based logic for the control of the supply power.
The climatic curve follows a step function in which the fraction of nominal power depends on the outdoor air temperature. Figure 33 gives a graphical representation of this function.
Figure 33 - Baseline Logic - Climatic curve
The time at which the system is switched on/off is based on the indoor temperature when occupants are present, following this logic (a code sketch of the combined baseline is given below):
If the indoor temperature is higher than 21 °C, the system is switched off;
If the indoor temperature is lower than 19 °C, the system is switched on.
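A minimal sketch of this baseline logic in Python; the breakpoints and power fractions of the climatic curve are hypothetical placeholders (the real step function is the one shown in Figure 33), while the 19–21 °C hysteresis follows the rules above:

```python
def climatic_curve(t_out):
    """Fraction of nominal power as a step function of outdoor temperature.
    The breakpoints and fractions below are hypothetical, not those of Figure 33."""
    if t_out < 0.0:
        return 1.0
    elif t_out < 5.0:
        return 0.8
    elif t_out < 10.0:
        return 0.6
    else:
        return 0.4

def baseline_power(t_in, t_out, occupied, system_on, nominal_power):
    """Rule-based + climatic baseline: hysteresis on indoor temperature during occupancy."""
    if occupied:
        if t_in > 21.0:
            system_on = False
        elif t_in < 19.0:
            system_on = True
    power = nominal_power * climatic_curve(t_out) if system_on else 0.0
    return power, system_on

# Hypothetical call for one floor with a 6.5 kW nominal power
power, on = baseline_power(t_in=18.4, t_out=3.0, occupied=True, system_on=False, nominal_power=6.5)
```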
Chapter 5:
SAC Development
5. SAC development
This chapter describes the design of the principal elements of the SAC control algorithm and introduces the methodologies used in the training and deployment phases.
$$A = \{0 \leq SP \leq 6.5\}$$
$$A = \{0 \leq SP \leq 5.0\}$$
$$A = \{0 \leq SP \leq 5.0\}$$
$$A = \{0 \leq SP \leq 6.5\}$$
These values are selected to provide the SAC agent with the same range of supply power as the baseline controller. Furthermore, the simulation environment is set to shut down the system when the supply power falls below 30% of the nominal power.
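A sketch of how a continuous agent action could be mapped to the supply power actually sent to the simulation, including the 30% shutdown rule; the assumption that the raw SAC action lies in [-1, 1] and the function name are hypothetical:

```python
def action_to_power(raw_action, max_power_kw):
    """Map a raw SAC action in [-1, 1] to a supply power in [0, max_power_kw]."""
    power = (raw_action + 1.0) / 2.0 * max_power_kw   # rescale to the physical range
    # The environment shuts the system down below 30% of the nominal power
    if power < 0.3 * max_power_kw:
        power = 0.0
    return power

# Hypothetical example for a floor with a 6.5 kW maximum supply power
print(action_to_power(raw_action=0.2, max_power_kw=6.5))   # -> 3.9 kW
```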
5.1.2 Design of reward function
The reward that the agent receives after taking an action at each control time step depends on two competing terms: an energy-related and a temperature-related term. The energy-related term is proportional to the energy provided to each floor to reach the desired setpoint, while the temperature-related term is quadratically proportional to the distance between the zone air temperature setpoint and its actual value on each floor.
The coefficients δ and β are introduced to weight the importance of the two terms of the reward function. The weight factor β determines the relative importance of the indoor temperature requirement with respect to energy consumption. A high value of this factor guarantees lower temperature violations at the expense of lower energy savings, and vice-versa.
While the energy-related term is always present, the temperature-related one is included only when occupants are present in the zone and the temperature falls outside the acceptability range.
The following equation expresses the reward function:
$$R = \begin{cases} -\,\delta \cdot \dfrac{\sum \mathrm{Supply\ Energy}}{3\,600\,000} - \beta \cdot \left(T_{set} - T_{zone}\right)^2 & \text{if } occ \ge 1 \\[6pt] -\,\delta \cdot \dfrac{\sum \mathrm{Supply\ Energy}}{3\,600\,000} & \text{if } occ = 0 \end{cases}$$
Figure 34 shows the structure of the reward function graphically to facilitate its
understanding.
Figure 34 - Reward function structure
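As a compact reference, a minimal sketch of this reward computation is reported below. It assumes the supply energy is expressed in J (hence the division by 3 600 000 to obtain kWh) and that temp_in_range flags whether the zone temperature lies inside the acceptability band; the default weights mirror those of run 8 in Table 15.

def compute_reward(supply_energy_j: float, t_zone: float, t_setpoint: float,
                   occupants: int, temp_in_range: bool,
                   delta: float = 0.1, beta: float = 1.0) -> float:
    """Weighted energy penalty plus, during occupancy and outside the acceptability
    range, a quadratic penalty on the setpoint tracking error."""
    reward = -delta * supply_energy_j / 3_600_000.0        # energy term, in kWh
    if occupants >= 1 and not temp_in_range:
        reward -= beta * (t_setpoint - t_zone) ** 2        # temperature term
    return reward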
The variables must provide the agent with all the information necessary to predict
the immediate future rewards;
Moreover, to take into account past information about the indoor temperature, the
above-mentioned difference with one, two and four hours of lag is also included in
the state-space. This information helps the agent to correct its previous actions.
To guarantee a satisfactory indoor air temperature during the occupancy period, it
is helpful for the agent to learn when it is convenient to pre-heat the zone and
when to switch off the system in the final phase of occupancy.
For this purpose, Brandi et al. [34] suggest the introduction of two variables:
Time to Occupancy Start and Time to Occupancy End.
When the zone is not occupied, Time to Occupancy Start represents the number of
hours left before the return of the occupants; during occupancy periods this
variable is equal to zero.
Conversely, when the building is occupied, Time to Occupancy End represents the
number of hours left before the occupants' leaving time; during the absence of
occupants this variable is equal to zero.
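A minimal sketch of how these two indicators could be derived from a daily occupancy profile is shown below; the 24-value hourly list occupancy_schedule is a hypothetical input, since the actual schedule handling of the simulation environment is not detailed here.

def time_to_occupancy(hour: int, occupancy_schedule: list[int]) -> tuple[float, float]:
    """Return (Time to Occupancy Start, Time to Occupancy End) in hours, given the
    current hour and a 24-value hourly occupant-count profile."""
    occupied_now = occupancy_schedule[hour] >= 1
    to_start, to_end = 0.0, 0.0
    if not occupied_now:
        # Hours until the next occupied hour (wrapping over midnight).
        for ahead in range(1, 25):
            if occupancy_schedule[(hour + ahead) % 24] >= 1:
                to_start = float(ahead)
                break
    else:
        # Hours until the first unoccupied hour (wrapping over midnight).
        for ahead in range(1, 25):
            if occupancy_schedule[(hour + ahead) % 24] == 0:
                to_end = float(ahead)
                break
    return to_start, to_end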
Since the heating system is based on radiant floors, which can control only the
sensible load, the Relative Humidity cannot be included in the state-space.
Table 13 - State-Space
Lastly, in order to feed the variables to the neural network, they are scaled in the
(0, 1) range according to a min-max normalization.
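A minimal sketch of this scaling step, assuming the per-variable bounds are collected in two arrays (the bound values in the example are illustrative only):

import numpy as np

def normalize_state(state: np.ndarray, s_min: np.ndarray, s_max: np.ndarray) -> np.ndarray:
    """Min-max scaling of each state variable to the (0, 1) range."""
    return (state - s_min) / (s_max - s_min)

# Example with two variables: outdoor temperature in [-10, 30] degC and
# setpoint-indoor temperature difference in [-5, 5] degC.
scaled = normalize_state(np.array([2.0, -1.5]),
                         np.array([-10.0, -5.0]),
                         np.array([30.0, 5.0]))     # -> [0.3, 0.35]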
Furthermore, Table 15 lists in detail the hyperparameters used in each run of the
sensitivity analysis. The hyperparameters varied in this analysis are the discount
factor γ, the learning rate, the reward weights β and δ, the batch size, the number
of neurons per hidden layer and the number of training episodes.
The zone setpoint temperature was set equal to 20 °C and the temperature
acceptability range between 19 °C and 21 °C.
Run   γ      Learning rate   β    δ      Batch size   Neurons per hidden layer   Episodes
1     0.9    0.001           1    0.1    256          256                        10
2     0.95   0.001           1    0.1    256          256                        10
3     0.99   0.001           1    0.1    256          256                        10
4     0.9    0.001           1    0.5    256          256                        10
5     0.9    0.001           1    0.1    512          256                        10
6     0.9    0.001           1    0.1    128          256                        10
7     0.9    0.0001          1    0.1    128          256                        10
8     0.9    0.0001          1    0.1    256          256                        25
9     0.9    0.001           1    0.1    256          256                        25
10    0.9    0.0001          1    0.01   256          256                        25
11    0.9    0.0001          5    0.1    256          256                        25
12    0.9    0.0005          1    0.1    256          256                        25
13    0.9    0.0001          10   0.1    256          256                        25
14    0.9    0.0001          1    0.1    256          128                        25
15    0.9    0.0001          1    0.1    256          512                        25
Table 15 - Different hyperparameter configurations
Scenario 1: in this scenario the control environment does not change; the aim is to
evaluate the adaptability of the control agent to different weather conditions,
such as outdoor temperature and direct solar radiation. The parameters related to
the occupancy schedule and to the building do not change.
Scenario 2: in this case the zone setpoint temperature was set equal to 21 °C;
consequently, the new acceptability range was between 20 and 22 °C. The goal of
this test is to evaluate the adaptability of the SAC controller in satisfying
different temperature requirements.
Scenario 3: more efficient windows are installed. Since the thermal transmittance
of windows for buildings located in Turin (climate zone E) must be lower than
1.9 W/(m²K), the new windows were selected accordingly, with a solar factor g of
0.33.
Static deployment: the control policy is not updated during the deployment phase,
which requires less computational time;
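A minimal sketch of such a static deployment loop is reported below; env and policy are placeholders for the EnergyPlus-Python co-simulation wrapper and the trained actor network, and a gym-like step interface is assumed.

def static_deployment(env, policy, n_steps: int) -> None:
    """Run the trained SAC policy without any further learning update:
    observe, act deterministically, step the environment."""
    obs = env.reset()
    for _ in range(n_steps):
        action = policy.select_action(obs, deterministic=True)   # no exploration noise
        obs, reward, done, info = env.step(action)                # no replay buffer, no gradient step
        if done:
            obs = env.reset()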
Chapter 6:
Results
6 Results
This chapter illustrates the results of the different phases of the framework. In
particular, section 6.1 highlights the results of the energy model implementing the
baseline control logic, which is a combination of rule-based control and the
climatic curve, as described in section 4.2.
The designed SAC algorithm was implemented in the simulation environment described
in section 4.3.1.
Sections 6.2 and 6.3 show the results of the training and deployment phases and the
comparison between the SAC control agent and the baseline control logic.
Figure 35 - Heating energy consumption of the heating season
Analysing the heating consumption in more detail, Figure 36 shows the mean daily
energy consumption of each floor.
Energy consumption has a similar shape on all floors. The system is switched on at
the end of October. As mentioned before, the ground floor presents the highest
consumption, while the other floors show similar consumptions in pairs: the first
with the fourth and the second with the third.
Figure 37 highlights the comparison between the daily heating energy consumption
pattern and the outdoor temperature one. These two variables are strongly
correlated: when the temperature decreases, the energy consumption increases, and
vice versa. When the mean daily outdoor temperature reaches a minimum, the mean
daily energy consumption reaches a maximum.
Figure 37 - Comparison between the energy consumption and the outdoor temperature
The most relevant building loads are now analysed, including the natural
ventilation and infiltration losses and the window heat gains.
Since the building under study has a large number of windows on each floor, the
heat gain from the transparent surfaces plays an important role in the building
energy balance. This term is strongly correlated with the amount of solar
radiation, as can be observed in Figure 38, in which the daily solar radiation is
compared with the ground floor daily window heat gain. The two patterns are quite
similar during the heating season; in particular, an increase in the mean daily
solar radiation leads to a higher mean daily heat gain. The highest values are
reached at the end of the heating season, when the temperature and the solar
radiation are higher.
Figure 38 - Windows Heat Gain: Comparison between direct solar radiation and ground floor heat gain
Figure 39 and Figure 40 show the comparison of the window heat gain across the
different floors. This analysis was performed in two different periods:
Figure 39 - Window heat gain comparison between floors in December
Infiltrations are unplanned air flows from the external environment directly into
the thermal zone. These flows are generally caused by the imperfect sealing of
windows and doors or by leaks through building elements.
Since the outdoor air temperature is lower than the indoor one during the heating
season, infiltration leads to heat losses that affect the building energy balance.
Figure 41 shows the relation between the outdoor temperature and the daily
infiltration loss.
The two variables have opposite patterns: when the external temperature reaches its
lowest value, the mean daily infiltration heat loss significantly increases. These
losses become particularly relevant during the winter period.
Figure 41 - Infiltration Heat Loss: Comparison between outdoor temperature and total infiltration heat loss
Figure 42 and Figure 43 show the comparison of the infiltration heat loss across
the different floors. This analysis was performed over the same two periods used
for the window heat gain analysis.
The other parameter that influences the infiltration losses is the volume of the
thermal zone: the higher the volume, the higher the loss.
The infiltration heat loss shows similar behaviour on all floors. On each floor,
the heat loss reaches its highest value at midnight and its lowest value around
midday.
Having the largest volume, the ground floor presents the highest infiltration heat
loss, while the last floor, having the smallest volume, has the lowest one. In
fact, at midday, the ground floor loss is three times that of the fourth floor.
During the period between the 1st and the 6th of December, these losses reach
values of about 2.5-3 kW on the first floor, roughly twice the value reached in
March, which is about 1.5 kW.
of the window opening and closing models identified in section 2. Therefore, the
ventilation heat loss is present only when a window is open, and its magnitude is
influenced by both the external temperature and the opening time.
Figure 44 illustrates the mean daily ventilation heat loss on three different
floors. The heat loss is higher during the winter period on all floors. These
losses do not occur on the same days on each floor, because the window opening and
closing models are implemented separately for each floor. The ground floor reaches
the highest values.
6.2 Results of the training phase
As mentioned in section 5.4, in the training phase a sensitivity analysis of the
most relevant SAC hyperparameters was performed to study their impact on the
performance of the control algorithm.
Two indicators were used to evaluate the behaviour of the different configurations:
the energy saving with respect to the baseline and the cumulative sum of
temperature violations. Both indicators were also calculated for the baseline
control algorithm as reference values, while the latter was introduced to evaluate
the comfort control performance.
A temperature violation is counted only when the indoor temperature is outside the
acceptability range and occupants are present. The magnitude of the temperature
violation is then determined as the absolute difference between the indoor
temperature and the designed setpoint value at each simulation step. The cumulative
value over a whole episode expresses the performance of the control algorithm
in °C.
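A minimal sketch of how these two evaluation metrics can be computed from the simulation outputs is given below; the 19-21 °C band and the 20 °C setpoint correspond to the values used in the training phase.

def cumulative_temperature_violation(temps, occupancy, t_setpoint=20.0,
                                     t_low=19.0, t_high=21.0):
    """Sum, over all simulation steps, the absolute setpoint deviation whenever
    occupants are present and the indoor temperature is outside the range."""
    violation = 0.0
    for t, occ in zip(temps, occupancy):
        if occ >= 1 and not (t_low <= t <= t_high):
            violation += abs(t - t_setpoint)
    return violation

def energy_saving(agent_energy_kwh: float, baseline_energy_kwh: float) -> float:
    """Percentage energy saving of the agent with respect to the baseline."""
    return 100.0 * (baseline_energy_kwh - agent_energy_kwh) / baseline_energy_kwh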
Figure 45 shows the cumulative sum of temperature violations for the last training
episode as a function of the energy saving with respect to the baseline control
logic, for the different configurations listed in Table 15.
The graph uses a logarithmic y-axis for better readability. The performance of the
baseline, indicated with black dashed lines, divides the plot into four quadrants,
each corresponding to a different agent behaviour.
Since the objectives of the designed control agent are the energy saving and the
reduction of the temperature violations, the bottom-left quadrant contains the
configurations that performed better than the baseline. Conversely, the worst cases
fall in the top-right quadrant, which includes the solutions with both higher
energy consumption and higher temperature violations. The other two quadrants
contain configurations that fulfil only one of the two requirements.
However, in this study no configuration falls in the worst region.
Figure 45 - SAC control performance in the last episode of the training phase
A high value of the weight of the energy-related term (δ) leads to significant
energy savings at the expense of temperature violations, as in the case of run 4
with a δ of 0.5.
On the other hand, a low value of δ causes a relevant reduction in the sum of
temperature violations, but the agent consumes more heating energy than the
baseline, as for run 10 with a δ of 0.01.
A similar argument holds for the weight of the temperature-related term β: with the
other hyperparameters being equal, an excessively high value of β leads to a
relevant reduction of the temperature violations but to an increase in consumption,
as in the cases of run 11 (β = 5) and run 13 (β = 10).
In configurations 2 and 3, with a discount factor γ of 0.95 and 0.99 respectively,
only one requirement is satisfied, because the controllers consume more energy than
the baseline. Therefore, solutions with γ equal to 0.9 are preferred.
In particular, the solutions with a learning rate of 0.0001, a discount factor (γ)
of 0.9 and a weight of the energy-related term (δ) of 0.1, corresponding to runs 8,
14 and 15, show the best trade-off between energy saving and reduction of
temperature violations. These three solutions, characterized by 256, 128 and 512
neurons per hidden layer respectively, are analysed in detail.
Firstly, a good indicator for evaluating the learning process of the SAC control
agents is the evolution of the cumulative reward during each training episode. This
term has no physical meaning, but it accounts for both the energy consumption and
the zone temperature, combining them into a single value. Furthermore, it gives an
indication of the convergence of the control policy; conversely, a non-convergent
trend may indicate that the agent failed to reach the optimal control policy. The
higher the reward, the better the performance of the agent.
For this purpose, the convergence of the three different solutions is analysed in
Figure 46, in which the trends of both the temperature-related cumulative reward
and the energy-related one are shown for each configuration. The former is
represented by blue lines, the latter by red lines.
In all the analysed configurations, the agent starts the exploration with high
values for both the energy-related and the temperature-related terms.
Overall, run 8 reaches the highest value for the energy term, while run 14 reaches
the peak for the temperature one.
In all solutions, the agent first learns how to maintain the zone temperature
within the comfort range during the first 10 episodes. This can be seen by looking
at the temperature-related terms, which reach convergence before the energy-related
ones, given the more oscillatory pattern of the latter. Only in the final episodes
of runs 8 and 15 does the energy term reach convergence.
Taking these results into account, the control agent with the highest stability is
the one proposed in solution 8.
Since the stability of the reward terms is not a complete indicator for assessing
the SAC agent performance, the analysis continues with a graphical representation
of the results of the last training episode for these three configurations.
The three control agents are compared on the same day of the training episode. The
day was chosen as the one with the coldest outdoor temperature in the training
period, while the selected zone is the level with the largest volume, which
corresponds to the ground floor.
Figure 47 shows the comparison between the three configurations: the first row
illustrates the daily heating energy consumption, the second the zone temperature
and the last the supply power. In the zone temperature graphs, the occupancy period
is represented by the yellow area, while the green area represents the temperature
acceptability range.
Figure 47 - Comparison between three agents during a training day
Overall, as can be noted in the central row of the figure, all agents have learnt
how to maintain the zone temperature within the comfort range.
In addition, the agent of configuration 8 keeps the zone temperature as close as
possible to the setpoint of 20 °C, whereas the agent of configuration 15 maintains
the temperature around the lower threshold. Run 14 presents the highest daily
heating energy consumption, 0.51 MWh, while the other two configurations have a
similar consumption. The supply power has a similar pattern in all solutions;
however, in configuration 15 the system switches on earlier than in the other two,
and during the occupants' presence it reaches lower power values.
In conclusion, run 8 appears to adapt better to the occupancy: when there are no
occupants, the system is switched off most of the time, and it switches on to
pre-heat the zone, ensuring that the comfort range is reached.
To choose the best solution, Table 16 compares the three control agents over the
entire last episode (the 25th).
SAC Control Agent - Baseline control logic comparison
Since the cumulative sum of temperature violations of each configuration is
significantly lower than the baseline one, and their orders of magnitude are
similar, the energy saving takes on more importance. Therefore, solution 8 is
selected as the best control agent: it presents the largest energy saving, shows
the greatest stability of the rewards, and has learnt how to maintain the
temperature within the comfort range. It is therefore implemented in the
deployment phase.
Characteristics       Values
DNN Architecture      3 layers
Number of episodes    1
Episode Length        2832 control steps (59 days)
Figure 48 shows the results of the deployment of the SAC control agent in the four
scenarios, compared with the baseline control logic. The first graph illustrates
the heating energy consumption in MWh, while the second one shows the cumulative
sum of the temperature violations. The blue bars represent the control agent, while
the baseline is illustrated with orange bars.
In all scenarios, the control agent leads to a reduction in both the heating energy
consumption and the cumulative sum of temperature violations compared with the
baseline control logic.
Figure 48 – Comparison of heating energy consumption and cumulative sum of temperature violations
between the deployed control agent in the four scenarios and the baseline
As illustrated in Figure 49, in the fourth scenario the control agent obtains the
highest energy saving, of about 6%, while in the third scenario the energy saving
achieved by the agent is the lowest (2%).
In the second scenario, both the control agent and the baseline show the highest
heating energy consumption, due to the increase of the zone setpoint temperature.
In all scenarios, the cumulative sum of temperature violations for the control
agent has a similar order of magnitude; however, while in the fourth scenario this
term is quite similar to the baseline one, in the other three the reduction is very
significant, as can be seen in Figure 50. In fact, in the third scenario the agent
achieves a reduction in temperature violations of about 950 °C, while in the last
scenario this difference is only 4 °C.
Figure 50 - Differences in cumulative sum of temperature violations in each scenario
The following figures illustrate the comparison between the control agent and the
baseline in the four deployment scenarios. In particular, in the indoor temperature
graphs, the yellow area represents the occupancy period, while the green one
indicates the temperature acceptability range. Moreover, the red lines indicate the
performance of the baseline, while the blue ones indicate the SAC control agent.
Figure 51 shows the comparison between the baseline control and the deployed SAC
agent in the first scenario, over five days on the ground floor. Since in this
scenario there is no change in the environment, the zone temperature profile and
the supply power trend are compared with the outdoor temperature, represented by
the green line in the graph.
The SAC control agent reduces the temperature violations on all days, and its
temperature profile around the setpoint value (20 °C) is more stable than that of
the baseline, which shows a more oscillatory pattern.
Moreover, the control agent has learnt how to optimize the pre-heating and
switch-off phases. In fact, the SAC agent tends to switch the system on later, but
with a higher value of the supply power, reducing the time needed to reach the
comfort range. On the other hand, the agent tends to switch the system off earlier
than the baseline, while occupants are still present, decreasing the energy
consumption.
When the outdoor temperature reaches its lowest value, the supply power provided by
the control agent is lower than that provided by the baseline, which leads to a
reduction in energy consumption. In general, the SAC control agent shows good
adaptability to changes in the exogenous factors.
Figure 51 - Comparison between SAC control agent and baseline in Scenario 1 of the deployment phase
Figure 52 highlights the comparison between the deployed agent and the baseline in
the second scenario, in which the setpoint temperature (21 °C) is higher than the
one used in training (20 °C). The analysis was performed over the same five days as
in the previous case.
As can be seen in Figure 48, in this scenario the control agent reached the lowest
value of the cumulative sum of temperature violations. In fact, the agent has
learnt how to maintain the temperature within the new comfort range and, in
particular, how to optimise the pre-heating phase; when people arrive in the zone,
the temperature is always very close to the lower limit of the threshold (20 °C).
A similar observation holds for the switch-off phase: when people leave the
building, the temperature returns close to 20 °C.
In the middle of the occupancy period, the temperature lies between the setpoint
and the lower bound of the range, reaching its highest values near midnight.
In the first two days of this period, the baseline has some difficulty in
maintaining the temperature within the acceptability range, in particular on the
second day, when it decreases to 19 °C.
Looking at the supply power graph, the baseline turns the heating system on earlier
than the control agent, while its switch-off starts later.
Overall, the SAC agent maintains the indoor temperature within the acceptability
range while providing a lower supply power to the zone.
Figure 52 - Comparison between SAC control agent and baseline in Scenario 2 of the deployment phase
The comparison between the baseline and the deployed agent in the third scenario,
in which more efficient windows are installed, is illustrated in Figure 53. The
analysis was performed on the same days as in the first case.
As in the previous scenarios, the agent obtains a temperature profile within the
comfort range during the occupancy period. Furthermore, this profile is more stable
than the one provided by the baseline.
Here, both the control agent and the baseline consume less energy than in the other
scenarios, thanks to the better performance of the windows.
Moreover, the SAC agent has learnt how to maintain the desired temperature while
consuming less heating energy than the baseline.
Figure 53 - Comparison between SAC control agent and baseline in Scenario 3 of the deployment phase
Figure 54 illustrates the comparison of the control agent in the first and third
scenarios. In this case, the red line indicates the first scenario, while the third
scenario is represented by the blue one.
Both the supply power pattern and the zone temperature profile are very similar.
Moreover, even though the supply power is the same, the indoor temperature is
slightly higher in the third scenario, and it also decreases slightly more slowly.
Figure 54 - Comparison between SAC control agent in Scenario 1 and Scenario 3
To analyse these two scenarios in more detail, Figure 55 highlights the temperature
comparison on each floor on the same day.
In general, the agent maintains the desired temperature in all zones during the
occupancy period, and the two temperature patterns are very similar. Except for the
second floor, the agent deployed in the third scenario reaches higher zone
temperatures than in the first one. Furthermore, the temperature decreases more
slowly on each floor.
Figure 55 - Zone temperature comparison in different floors for Scenario 1 and Scenario 3
Figure 56 shows the comparison of the two control logics in the last scenario.
While with the baseline control logic the temperature frequently reaches the upper
threshold of the acceptability range, the SAC control agent keeps the temperature
close to the setpoint during the occupancy time.
As mentioned before and reported in Figure 49, in this scenario the agent obtains
the highest energy saving compared with the baseline; in fact, for most of the time
the supply power of the agent is lower than that of the baseline.
Furthermore, with respect to the other three scenarios, the temperature decreases
more slowly when the system is switched off, mainly due to the increased thermal
inertia.
Figure 56 - Comparison between SAC control agent and baseline in Scenario 4 of the deployment phase
Finally, Figure 57 shows the comparison of the control agent in two different
scenarios: the first and the fourth. In these graphs, it is easy to observe the
difference in the temperature decrease between the two scenarios when the system is
switched off.
However, during the occupancy period, both the zone temperature and the supply
power have similar profiles in the two scenarios.
Chapter 7:
Discussion
7. Discussion
This work focuses on the development of a SAC control agent for a heating system,
controlling the supply power provided to each zone of a residential building
located in Turin, Italy.
The developed control agent was trained and deployed in a simulation environment
which integrates EnergyPlus and Python.
The goal of this controller is to optimize both the energy consumption and the
maintenance of the desired indoor air temperature within the building.
The main challenge that the controller must face is the identification of the best
trade-off between these two contrasting objectives.
In RL algorithms, the selection of the hyperparameters and the design of the reward
have a fundamental role in identifying the optimal configuration of the SAC control
agent. Therefore, a sensitivity analysis on the most important hyperparameters was
performed to study their influence on the controller performance.
Among these hyperparameters, the two weight factors of the two terms of the reward
appear to be the most influential. In fact, a higher value of the weight of the
energy-related term leads to a significant reduction in heating energy consumption
at the expense of the zone temperature; such configurations give too much
importance to the energy saving, compromising the maintenance of comfort in the
zone.
On the other hand, high values of the weight of the temperature-related term,
giving more importance to the temperature, lead to the opposite result, with a
reduction in temperature violations and an increase in energy consumption.
Among the fifteen configurations analysed over a training period of two months, the
hyperparameters of the eighth solution lead to the control policy with the highest
stability, and the corresponding control agent presents the best trade-off between
energy saving and reduction of temperature violations. Furthermore, this control
agent has learnt how to adapt to the occupancy pattern: the heating system is
switched off when people leave the zone, and it is turned on before their arrival
to pre-heat the floor. During the occupancy period, the indoor temperature remains
inside the acceptability range. Overall, the control agent performs better than the
baseline. The eighth configuration was therefore chosen and tested in the static
deployment.
The SAC control agent was tested in four different scenarios to prove its
adaptability to modifications of the environment, such as different weather
conditions, changes in the indoor setpoint temperature and changes in the building
characteristics.
Thanks to the adoption of an adaptive set of variables for the state-space, the
deployment could be performed in a static configuration, which avoids the
instability problems of the control policy and the high computational time of a
dynamic deployment.
The SAC control agent proved to be adaptable in all the tested scenarios, and it
presents a better performance than the baseline controller: both the cumulative sum
of temperature violations and the heating energy consumption were lower than the
baseline ones.
In particular, in the fourth scenario the agent learnt how to optimise the
pre-heating phase and, thanks to the increased thermal inertia, it obtained the
highest energy saving.
On the other hand, in the second scenario the agent achieved the lowest value of
the cumulative sum of temperature violations. Therefore, the tested control agent
adapts effectively to a change in the desired indoor temperature setpoint.
Overall, an accurate design of the state-space variables can increase the
flexibility and adaptability of the SAC control agent to changes in the boundary
conditions, even in static deployment conditions.
To implement the designed control agent in a real building, it is necessary to
monitor some variables, which can be collected thanks to low-cost sensors available
on the market.
The external temperature and the solar radiation can be measured with outdoor
ambient sensors, or obtained from an external weather data provider.
Similar solutions are available for the zone temperature.
Given the stochastic nature of the occupancy in residential buildings, the biggest
obstacle to overcome is the identification of the Time to Occupancy Start and Time
to Occupancy End variables.
These values can be estimated using weekly schedules based on the monitoring of the
occupants' arrival and leaving times, updating them as new data are collected.
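As an illustrative sketch of this estimation, assuming a simple log of arrival hours per weekday (the same approach applies to leaving times); the variable names and values below are hypothetical:

from statistics import mean

def weekly_arrival_schedule(arrival_log: dict[str, list[float]]) -> dict[str, float]:
    """Estimate, for each weekday, the expected arrival hour as the mean of the
    monitored arrival times collected so far."""
    return {day: mean(hours) for day, hours in arrival_log.items() if hours}

# Example: arrival hours logged over three weeks (illustrative values).
log = {"Mon": [18.0, 18.5, 17.75], "Tue": [19.0, 18.5, 18.75]}
expected_arrival = weekly_arrival_schedule(log)   # {'Mon': 18.08..., 'Tue': 18.75}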
Chapter 8:
Conclusion
8. Conclusion
In this dissertation, the application of a SAC control agent to a radiant floor
heating system was developed and analysed in a simulation environment which
combines EnergyPlus and Python.
The residential building was modelled in SketchUp to obtain the geometric structure
for the energy model. Through the External Interface and BCVTB, the building model
communicates with the simulation environment in Python.
Furthermore, to represent the occupants' behaviour as close to reality as possible,
the window opening and closing events were simulated with a probabilistic model.
The control agent was designed to choose the optimal action, in this case the
supply power to be provided to each zone, in order to maintain the temperature
within the acceptability range.
Ten adaptive variables were included in the definition of the state-space. They
include exogenous parameters, indoor air temperature-related terms and two
variables which describe the occupancy status of the building.
During the training phase, a sensitivity analysis was performed on the main
hyperparameters to highlight their influence on the controller's performance. The
best configuration was then tested in four different scenarios in the deployment
phase, to evaluate the flexibility and adaptability of the SAC control agent to
changes in outdoor conditions, indoor requirements and building characteristics.
Depending on the scenario, the deployed control agent reaches a significant
reduction in the cumulative sum of temperature violations compared with the
baseline and, at the same time, it also achieves an energy saving that varies
between 2% and 6%. These results are obtained by the SAC control agent in a static
deployment.
The designed state-space allows the implementation of a static deployment instead
of a dynamic one, which may cause instability and require a high computational
time.
To extend and improve this work, future studies will focus on the following
aspects:
Comparing the results obtained with the static deployment with those of a dynamic
deployment, using the same scenarios. This comparison can show the differences in
both the stability of the control policy and the adaptability of the control agent.
Comparing the results obtained with the adaptive set of state-space variables with
those of a non-adaptive set.
Acronyms
A3C Asynchronous Advantage Actor-Critic
ACH Air Changes per Hour
AHU Air handling Unit
BCVTB Building Control Virtual Test Bed
BRL Batch Reinforcement Learning
DDPG Deep Deterministic Policy Gradient
DNN Deep Neural Networks
DP Dynamic Programming
DQN Deep Q-Network
DRL Deep Reinforcement Learning
FL Fuzzy Logic
GA Genetic Algorithm
HDD Heating Degree Days
HVAC Heating, Ventilation and Air Conditioning
MC Monte-Carlo
MDP Markov Decision Process
MPC Model Predictive Control
NN Neural Networks
PMV Predicted Mean Vote
PPD Predicted Percentage of Dissatisfied
PPO Proximal Policy Optimization
RC Resistance and Capacitors
RES Renewable Energy Source
RL Reinforcement Learning
SAC Soft Actor-Critic
SP Supply Power
TD Temporal Difference
TD3 Twin Delayed DDPG
TRPO Trust Region Policy Optimization
References
[8] D. Picard and L. Helsen, “MPC performance for hybrid GEOTABS buildings,” 2018.
[9] F. Jorissen, “Toolchain for Optimal Control and Design of Energy Systems in Buildings,” 2018.
[14] M. Avci, M. Erkoc, A. Rahmani and S. Asfour, “Model predictive HVAC load control in buildings using real-time electricity pricing,” Energy and Buildings, 2013.
[17] F. A. Qureshi and C. N. Jones, “Hierarchical control of building HVAC system for ancillary services provision,” Energy and Buildings, 2018.
[25] A. So, W. Chan, T. Chow and W. Tse, “A neural-network-based identifier/controller for modern HVAC control,” ASHRAE Transactions Research, 1994.
[27] M. Arima, E. Hara and J. Katzberg, “A fuzzy logic and rough sets controller for HVAC systems,” 1995.
[33] Z. Zhang, A. Chong, Y. Pan, C. Zhang and K. P. Lam, “Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning,” Elsevier, Energy & Buildings 199, 2019.
[36] S. Qiu, Z. Li, Z. Li, J. Li, S. Long and X. Li, “Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation,” Elsevier, Energy & Buildings 218, 2020.
[41] S. Lu, W. Wang, C. Lin and E. C. Hameen, “Data-driven simulation of a thermal comfort-based temperature set-point control with ASHRAE RP884,” Elsevier, Building & Environment 156, 2019.
[42] Y. Sun, A. Somani and T. E. Carroll, “Learning Based Bidding Strategy for HVAC Systems in Double Auction Retail Energy Markets,” IEEE, American Control Conference, 2015.
“… Deep Reinforcement Learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, 2019.
[49] Z. Yu and A. Dexter, “Online tuning of a supervisory fuzzy controller for low-energy building system using reinforcement learning,” Elsevier, Control Engineering Practice 18, 2010.
[52] R. Jia, M. Jin, K. Sun, T. Hong and C. Spanos, “Advanced Building Control via Deep Reinforcement Learning,” Elsevier, Energy Procedia 158, 2019.
[53] W. Valladares, M. Galindo, J. Gutiérrez, W.-C. Wu, K.-K. Liao, J.-C. Liao, K.-C. Lu and C.-C. Wang, “Energy optimization associated with thermal comfort and indoor air control,” Elsevier, Building & Environment 155, 2019.
[54] Z. Zou, X. Yu and S. Ergan, “Towards optimal control of air handling units using deep reinforcement learning and recurrent neural network,” Elsevier, Building & Environment 168, 2020.
[55] L. Yu, Y. Sun, C. Shen, D. Yue, T. Jiang and X. Guan, “Multi-Agent Deep Reinforcement Learning for HVAC Control in Commercial Buildings,” IEEE Transactions on Smart Grid, vol. 20, no. 20, 2020.
“… building characteristics and occupants analysed by movers and stayers,” Applied Energy, 2019.
[64] Z. Wang and T. Hong, “Reinforcement learning for building controls: The opportunities and challenges,” Elsevier, Applied Energy, 2020.
[65] J. J. Moolayil, “A Layman’s Guide to Deep Neural Networks,” [Online]. Available: https://towardsdatascience.com/a-laymans-guide-to-deep-neural-networks-ddcea24847fb. [Accessed 10 September 2020].
Acknowledgements
For the completion of this Thesis, my sincere gratitude goes to Professor Capozzoli
for having believed in me and for his constant clarification throughout.
A special thanks belongs to Silvio and Giuseppe, for the enormous help and support
they provided during this period; it would not have been possible to complete this
journey without them.
A heartfelt thank you to my parents, for the love you have always shown and for the
great sacrifices you have made, and thanks to my sister, for sharing the problems
and nervousness during this period.
Thanks also to the rest of my family for their love and encouragement over the
years.
Special thanks to my sweetheart Anita, for the enormous patience over the years,
for believing and supporting me in achieving goals I thought I was not capable of,
and especially for always making me happy.
For making these years lighter, for the enormous help you have always given me, and
for the good times spent outside the classrooms, I thank the members of the “Royal
Team”: Giuseppe B., Roberto and Davide, and a big thanks also to Matteo, despite
his escape to Paris. I also thank Mariacarla and the “Termoteam”: Francesco,
Giuseppe D. and Enzo, with the hope that these friendships will last forever.
Finally, thanks to all my friends in Villastellone, in particular the “96 group”,
who, despite my limited presence in these years, have always made me feel part of
the group and have always believed in and supported me. Thanks for all the
adventures we have had over the years, and for those we will have in the future!