Towards Optimal HVAC Control in Non-stationary Building Environments Combining Active Change Detection and Deep Reinforcement Learning
Keywords: Heating, ventilation and air conditioning (HVAC); Non-stationary environments; Deep reinforcement learning (DRL); Change point detection

Abstract: Energy consumption for heating, ventilation and air conditioning (HVAC) has increased significantly and has accounted for a large proportion of building energy growth. Advanced control strategies are needed to reduce HVAC energy consumption while maintaining occupant thermal comfort. Compared to other control problems, HVAC control faces numerous restrictions imposed by real building environments. One key restriction is non-stationarity, i.e., the varying HVAC system dynamics. Researchers have made efforts to solve the non-stationarity problem through different approaches, among which deep reinforcement learning gains traction for its advantages in capturing real-time information, adapting control to system feedback, avoiding tedious modeling work, and combining with deep learning techniques. However, current research handles non-stationarity in a passive manner, which hinders its potential and adds instability in real applications. To fill this research gap, we propose a novel HVAC control method combining active building environment change detection and a deep Q network (DQN), named non-stationary DQN. This method aims to disentangle the non-stationarity by actively identifying the change points of building environments and learning effective control strategies for the corresponding building environments. The simulation results demonstrate that the developed non-stationary DQN method outperforms the state-of-the-art DQN method in both single-zone and multi-zone control tasks by saving unnecessary energy use and reducing thermal comfort violations caused by non-stationarity. The improvement reaches 13% in energy saving and 9% in thermal comfort. Besides, according to the results, our proposed method obtains stability against disturbance and generalization to an unseen building environment, which shows its robustness and potential in real-life applications.
∗ Corresponding author at: Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, 518055, PR China.
E-mail address: [email protected] (Y. Zhang).
https://doi.org/10.1016/j.buildenv.2021.108680
Nomenclature

A3C     Asynchronous advantage actor critic
BCNN    Bayesian convolutional neural network
BCVTB   Building control virtual test bed
DQN     Deep Q network
DRL     Deep reinforcement learning
HVAC    Heating, ventilation, and air conditioning
LBNL    Lawrence Berkeley National Laboratory
LSTM    Long Short-Term Memory
MDP     Markov decision process
MPC     Model predictive control
ODCP    Online parametric Dirichlet change point
PID     Proportional–Integral–Derivative
PPD     Predicted percentage of dissatisfied
QL      Q-learning
RBC     Rule-based control
RL      Reinforcement learning
RL-CD   Reinforcement Learning with Context Detection
RMPC    Robust model predictive control
SMPC    Stochastic model predictive control
TMY     Typical meteorological year
UCRL    Upper Confidence Reinforcement Learning
VAV     Variable air volume

… a statistic time series analysis. [6] improved the prediction accuracy of HVAC energy consumption by considering non-stationary patterns.

In HVAC control research, efforts have mainly been devoted to three approaches: rule-based control (RBC), model predictive control (MPC) and reinforcement learning (RL) [7]. RBC methods are widely used in HVAC systems; they adopt a set of artificial rules designed from expert experience [8–11]. For non-stationary environments, seasonal climate changes (e.g., temperature variations) are considered in control rules for more comfortable indoor environments [12]. Besides, data-driven methods are also employed to design the setpoint of the air-conditioning system under different human activity levels [13] and to benchmark energy performance [14]. However, as HVAC systems and building environments become more complicated, it would take endless effort to cover all possible situations with control rules. Besides, artificial rules lack the flexibility to capture the evolving HVAC system information in real time.

Model predictive control (MPC) is more promising in HVAC control for its capability of exploiting predictive information and making decisions adaptively [15,16]. MPC is based on a rigorous HVAC system model including building structures and HVAC schematics, and it optimizes control strategies using predictions of the system's future evolution. The modeling methods can be categorized into three kinds: the white-box model developed from physical laws, the black-box model based on data-driven methods, and the gray-box model fitting load data into a physical model [17]. To overcome non-stationarity, MPC takes the nonlinear dynamics, the time-varying system dynamics and the time-varying disturbances into consideration during modeling [18]. Variants of MPC such as robust MPC (RMPC) and stochastic MPC (SMPC) are employed to handle the uncertainty caused by non-stationarity [19]. [20] proposed an SMPC design methodology which can minimize the expected energy cost and maintain comfort using uncertain predictions of weather conditions and building loads. In [21], the authors showed that their scenario-based MPC capturing building dynamics outperformed vanilla MPC methods in robustness and energy efficiency. However, MPC highly depends on the accuracy of the system models. Model errors might prevent a designed MPC controller from being applied in real buildings. Besides, modeling is time-consuming and labor-intensive. Furthermore, it might be insufficient to model building dynamics with non-stationarity by the first- or second-order approximations widely used in current MPC methods [22].

Reinforcement learning (RL) has achieved significant success in many areas, from Go [23] to video games [24]. In recent years, deep reinforcement learning (DRL) has gained traction in HVAC control research. Compared to RBC and MPC, DRL exhibits advantages in learning from collected data for decision-making [7]. Besides, the latest developments in thriving deep learning can be combined with DRL to improve model capacity and decision efficiency. DRL is a data-driven approach which consists of model-based RL and model-free RL. Model-based RL employs models to accelerate the learning process, which is similar to MPC but less computationally expensive. Recently, [25] adopted Long Short-Term Memory (LSTM) networks to model the HVAC systems for DRL. In [26], the authors proposed to model the air-conditioning and room temperature by a Bayesian Convolutional Neural Network (BCNN) to train an uncertainty-aware deep Q-learning agent. Related works on model-based RL have provided insights into solving non-stationarity. In [27], the authors claimed that DRL controllers can solve the non-stationarity by relearning the environment dynamics and the control models. However, relearning building environment models could be time-consuming and decrease the control efficiency. In addition, it did not provide a practical method for explicitly distinguishing non-stationary building environments or identifying the timings when the environment changes happen.

Model-free RL, in contrast, learns optimal control policies through trial-and-error interactions with building environments without a model, which reduces tedious modeling work [7] and provides better scalability and generalization [28]. For air-conditioning, [28] showed that popular model-free RL algorithms can reliably maintain the temperature while reducing energy consumption by more than 13% compared to model-based methods. In [29], a model-free DRL-based algorithm was developed for controlling energy consumption and tenant thermal comfort with a co-simulation framework based on EnergyPlus [30] for DRL training and validation. In [31], the authors implemented an Asynchronous Advantage Actor-Critic (A3C) based control method for a radiant heating system, achieving considerable heat demand savings in an office building deployment. Moreover, in [32], model-free DRL is extended to multi-zone cases and achieves more than 56% energy cost reduction for 30 zones in commercial buildings. The literature has shown the significant potential of model-free DRL. As for non-stationarity, some researchers believe that model-free RL algorithms can passively adapt to the latest building environment by updating models with newly collected data. However, this assumption might only work theoretically or in idealized experimental environments where enough training time and data are given. [27] pointed out that the performance of DRL controllers degrades when the building environments evolve periodically. Besides, with passive adaptation the RL controllers are constantly busy converging from one environment to another without explicitly learning optimal policies for different situations.

In a nutshell, non-stationary building environments are a vital problem hindering the further development of HVAC control methods. After investigating the existing literature, we find some limitations in how non-stationarity is handled. On the one hand, modeling non-stationarity in the building models might add complexity and decrease efficiency for the control policies. On the other hand, passive adaptation without modeling does not work for highly non-stationary building environments.

To overcome these gaps, an active and practical environment change detection method is needed for model-free DRL to exploit its potential and ensure the robustness of DRL-based HVAC control methods. We formulate HVAC control as a Markov decision process (MDP) in non-stationary environments and propose a new DRL-based method
focusing on optimizing HVAC control strategies in non-stationary building environments. The main contributions of this study are summarized as follows:

• We develop a novel method called non-stationary DQN. To the best of our knowledge, it is the first method to actively detect the dynamic changes of building environments and exploit this information in HVAC control optimization. Meanwhile, it is a model-free DRL method, which saves time and avoids errors from inaccurate modeling.
• We conduct a thorough simulation-based comparison between the proposed method and the state-of-the-art DQN method in both single-zone and multi-zone control problems. Thermal comfort and energy consumption are adopted as key metrics to demonstrate the effectiveness of the proposed method in eliminating the effects of non-stationary environments.
• We verify stability by examining the restorability of the proposed method after extreme disturbance. In addition, we verify generalization by testing the proposed method in an unseen building environment with stochastic operation schedules.

The rest of the paper is organized as follows. Section 2 presents a review of reinforcement learning (RL) and RL algorithms in static and non-stationary environments, together with the concept of the MDP in non-stationary environments. The formulations of the single-zone and multi-zone control problems are presented in Section 3. In Section 4, the two algorithms adopted in our proposed method, DQN and ODCP, are briefly introduced, followed by a detailed explanation of our proposed HVAC control method combining environment change detection and DRL. Section 5 discusses our simulation results, and Section 6 concludes the paper and presents directions for future work.

2. RL model for non-stationary environments

2.1. Reinforcement learning

Reinforcement learning is a machine learning approach that learns to perform sequential actions according to situations in order to maximize a reward signal through trial-and-error search [33]. Unlike supervised learning, which learns from a training set of labeled examples, reinforcement learning learns from its own experience of interacting with all kinds of situations; in addition, reinforcement learning differs from unsupervised learning in that it aims to maximize the reward signal rather than find the hidden structure of the data [33]. In recent decades, reinforcement learning has been proved effective in solving decision-making problems.

Most reinforcement learning problems can be modeled in the mathematical form of the Markov decision process. In an MDP, an agent learns and makes decisions; everything else the agent interacts with is called the environment [33]. For example, in a building, the agent can be a controller determining temperature setpoints, while the environment can be the combination of occupants, the HVAC system, weather, and so on. At decision epoch t, the agent chooses an action a_t according to the environment state s_t. At the next epoch, the agent receives a reward r_t and a new state s_{t+1} (see the stationary case in Fig. 1). Thus, an MDP can be represented by a tuple ⟨S, A, P, R⟩, where S and A are the state and action spaces. P: S × A × S → [0, 1] is the transition probability function; it defines the probability distribution of the next state s′ at t+1 that the system evolves into, given the current state s and action a (see Eq. (1)). R: S × A → ℝ is the reward function; it models the numerical reward yielded by applying action a to state s, i.e., R ∼ r(s, a).

p(s′|s, a) = Pr{s_{t+1} = s′ | s_t = s, a_t = a}    (1)

Reinforcement learning algorithms can also be categorized into value-based methods (e.g., Q-learning) and policy-based methods (e.g., policy gradient). These algorithms have been combined with deep neural networks for more complicated control problems.

2.2. MDP and RL algorithms in non-stationary environments

Technically speaking, the non-stationarity of the building environment mainly refers to the fact that the state transition probability P and the reward function R are non-stationary: they change along with internal or external factors. To be more concrete, even though the instantaneous states (e.g., indoor air temperature) are similar in different periods, the building state evolutions can be different. It is insufficient to describe such a control problem as a stationary MDP. Thus, [34] defined the MDP in non-stationary environments as below:
Given a family of MDPs {M_θ}_{θ∈ℕ+}, where θ takes values from a finite index set Θ, the MDPs M_θ = ⟨S, A, P_θ, R_θ⟩ share the same state and action spaces but have different transition probability distributions and reward functions. Each MDP in the family is called a context. We assume that only one context is active during a time interval (see Fig. 1). There will be a change point between two sequentially active contexts. For example, the environment may change from M_{θ0} to M_{θ1} at timestep T_1. Thus, the non-stationary dynamics P, R will be:

P(s_{t+1} = s′ | s_t = s, a_t = a) =
    P_{θ0}(s′|s, a),    t < T_1
    P_{θ1}(s′|s, a),    T_1 ≤ t < T_2
    ⋮                                         (2)

and, for s_t = s, a_t = a,

R(s, a) =
    R_{θ0}(s, a),    t < T_1
    R_{θ1}(s, a),    T_1 ≤ t < T_2
    ⋮                                         (3)

Fig. 1 demonstrates the non-stationary case of environments in reinforcement learning.

In the literature, there have been some pioneering works addressing RL in non-stationary environments. One approach aims to minimize a performance criterion called regret (see Eq. (4)), which refers to the difference between the expected optimal rewards collected over a finite horizon T from a start state s_0 and those produced by the current policy. [35] proposed the UCRL2 algorithm, a variant of Upper Confidence Reinforcement Learning (UCRL) proposed by [36]. UCRL2 adopts this approach and gives a sublinear regret upper bound by finding the optimal MDP from a set of plausible MDPs for the current environment and its corresponding optimal policy. The set of plausible MDPs is built by estimating the transition probability function and the reward function. UCRL2 restarts learning when it exceeds the diameter of the current MDP (see Eq. (5), where M is the environment context and T is the timestep of the first arrival at s′ from s); this, however, discards all the former estimates.

Regret = V*_T(s_0) − Σ_{t=0}^{T−1} R(s_t, a_t)    (4)

D_M = max_{s≠s′} min_{π: S→A} E[ T(s′|s, π) ]    (5)

Another approach is to learn an optimal control policy for non-stationary environments by detecting the environmental changes. [37] proposed a context-detection-based algorithm called Reinforcement Learning with Context Detection (RL-CD). Similar to UCRL2, it statistically estimates the transition probability functions and the reward functions for a set of partial MDP models. It evaluates the quality of each partial model, that is, the difference between the current transition and reward and the learned transition probability and reward functions. Then it updates the Q-values of the partial model with the highest quality. However, it is difficult to apply to complex control problems like HVAC control because high-dimensional and continuous state variables are intractable for probability function estimation. A more efficient algorithm called Context QL was proposed by [34]. It addresses the context detection problem not by statistical estimation but by finding the change points of the state–reward tuples over a horizon T. An experience tuple (s_t, r_t, s_{t+1}) consists of the current state, reward, and next state in an iteration. Non-stationary MDPs are modeled as the combination of piecewise MDP contexts, as mentioned before. These RL algorithms have been proved effective in simple non-stationary control problems.

3. Problem formulation

3.1. A brief introduction of the non-stationary HVAC control problem

In this work, we consider the HVAC system in a commercial building with 5 zones. The heating and cooling setpoints of the thermostat in each zone are controlled. The goal is to minimize the HVAC energy consumption and satisfy occupant thermal comfort in non-stationary building environments. We consider two problems for the experiments, the single-zone control problem and the multi-zone control problem. The experiment settings are as follows:

• The single-zone problem. Only one of the 5 zones (Zone 1) is controlled by the RL controller, while the others are controlled by a default fixed schedule (see Fig. 3). Only Zone 1 needs to do the learning, so as to investigate the performance of our proposed method in a single zone with a relatively stable influence from the other zones.
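To make the piecewise-context dynamics of Eqs. (2)–(3) concrete, the minimal sketch below wraps one transition function and one reward function per context behind a single interface and switches them at the change points. The class, function names and toy dynamics are illustrative assumptions, not part of the paper, and the change points are taken as known here (in the proposed method they are detected by ODCP).

```python
import random

class NonStationaryMDP:
    """Piecewise-context dynamics: context k is active for T_k <= t < T_{k+1} (Eqs. (2)-(3))."""

    def __init__(self, contexts, change_points):
        # contexts: list of (transition_fn, reward_fn) pairs, one per MDP context M_theta
        # change_points: sorted timesteps [T_1, T_2, ...] where the active context switches
        self.contexts = contexts
        self.change_points = change_points

    def active_context(self, t):
        # The number of change points already passed gives the index of the active context.
        k = sum(1 for T in self.change_points if t >= T)
        return self.contexts[k]

    def step(self, t, state, action):
        transition_fn, reward_fn = self.active_context(t)
        next_state = transition_fn(state, action)   # draw s' according to P_theta(.|s, a)
        reward = reward_fn(state, action)           # r = R_theta(s, a)
        return next_state, reward

# Toy usage: two contexts whose dynamics and rewards differ, switching at t = 100.
ctx0 = (lambda s, a: s + a + random.gauss(0.0, 0.1), lambda s, a: -abs(s))
ctx1 = (lambda s, a: s - a + random.gauss(0.0, 0.1), lambda s, a: -abs(s - 1.0))
env = NonStationaryMDP([ctx0, ctx1], change_points=[100])
s = 0.0
for t in range(200):
    a = random.choice([-1.0, 0.0, 1.0])
    s, r = env.step(t, s, a)
```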
Fig. 3. The fixed operation schedule used in the single-zone problem: on weekdays, the heating and cooling setpoints are set to 21.1 °C and 23.9 °C from 7:00 to 18:00 and to 15.0 °C and 30.0 °C for the remaining hours; on weekends, they are set to 21.1 °C and 23.9 °C from 7:00 to 13:00 and to 15.0 °C and 30.0 °C for the remaining hours.
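For reference, the fixed schedule in Fig. 3 can be expressed as a small lookup helper. This is only a sketch derived from the caption above; the function name is not from the paper.

```python
def default_setpoints(is_weekend: bool, hour: float):
    """Fixed operation schedule of Fig. 3: returns (heating, cooling) setpoints in deg C."""
    occupied_until = 13 if is_weekend else 18   # occupied period ends at 13:00 on weekends, 18:00 on weekdays
    if 7 <= hour < occupied_until:
        return 21.1, 23.9                       # occupied-hours setpoints
    return 15.0, 30.0                           # setback setpoints for the remaining hours

print(default_setpoints(is_weekend=False, hour=9.5))   # -> (21.1, 23.9)
print(default_setpoints(is_weekend=True, hour=20.0))   # -> (15.0, 30.0)
```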
Table 2
RL network design and training hyperparameters.

Hyperparameter               Value
Size of input                15
No. of hidden layers         4
Size of each hidden layer    [512, 512, 512, 512]
Activation layer             ReLU
Size of output               4
Memory capacity              10000
Optimizer                    Adam
Batch size                   128
Learning rate                0.0001
Discount factor              0.99
Soft update rate             0.1
Soft update iteration        200
Initial epsilon              0.15
Reward weights for energy    0.4
Reward weights for comfort   0.6
PPD threshold                0.15

… actions are more like options to choose from instead of variables. Given a state at time t, a DQN agent decides to take one of the options. The 4 tuples listed below are the 4 options. It is similar to a maze game, where we choose to move up, move down, move left, or move right at a crossing.

A = {(−1.0, −1.0), (1.0, 1.0), (−1.0, 1.0), (0.0, 0.0)}

For each action tuple, the two numbers represent the increments to the previous heating setpoint HtSpt and cooling setpoint ClSpt, respectively. For example, if the RL agent chooses the first tuple as the action at time t, the heating setpoint and cooling setpoint at t+1 will be lowered by 1 °C compared to those at t:

HtSpt_{t+1} = HtSpt_t − 1
ClSpt_{t+1} = ClSpt_t − 1

PPD = 100 − 95 × exp{−0.03353 × PMV⁴ − 0.2179 × PMV²}    (7)

In our paper, PPD can be obtained by EnergyPlus simulation. In addition, the total HVAC electric power demand is used as the energy consumption metric. The details of the total HVAC electric power demand are introduced in Section 5.1. Considering both occupant thermal comfort and HVAC energy consumption, the reward function is:

r(t) =
    −w_e P_HVAC(t),                       Occup = 0
    −w_e P_HVAC(t) − w_c C_comfort(t),    otherwise        (8)

In Eq. (8), w_e and w_c are the weights for the total HVAC electric demand power P_HVAC(t) and the occupant thermal comfort cost C_comfort(t); Occup is the occupant status flag, which equals 0 when no people are in the controlled zone and equals 1 as long as there is at least one person. Only when there are occupants is the thermal comfort cost considered in the optimization. C_comfort(t) is calculated as follows:

C_comfort(t) =
    1,      PPD > PPD_thres
    PPD,    otherwise                                       (9)

In Eq. (9), PPD_thres is the threshold of PPD, that is, the level of dissatisfaction that the occupant can tolerate. Note that both P_HVAC and PPD are normalized here. We set PPD_thres ∈ (0, 1), usually close to 0.

w_e and w_c are heuristically designed to weigh the importance of the two reward signals and make the reward function a linear combination of them. The design of the weights is tunable for different optimization objectives according to the user's preference. In this paper, we use 0.4 and 0.6 (see Table 2) as an example to prove the effectiveness of our proposed method. For convenience, w_e and w_c add up to 1 in our paper.

4. Methodology

4.1. ODCP

In [40], the authors propose an online parametric Dirichlet change point algorithm (ODCP). They model the compositional data as samples from a family of Dirichlet distributions with a group of parameters α^(1), α^(2), …, α^(k). A Dirichlet distribution with parameter α^(r), r = 1, …, k, generates sample data x_i, τ_{r−1} < i ≤ τ_r.

Based on this data model, ODCP detects a single change point through a hypothesis testing framework. For an active observation window {x_1, …, x_t}, the null hypothesis and the alternative hypothesis are as follows:

• H_0: the window is generated by a single Dirichlet distribution.
• H_τ: the window is generated by two Dirichlet distributions divided at some τ, 1 < τ < t.

4.2. DQN

A deep Q network (DQN) combines Q-learning and a deep neural network. As mentioned before, Q-learning is a value-based method. It learns a lookup table, called a Q-table, to store Q-values for specific state–action (s, a) pairs; thus, Q-learning is also called tabular Q-learning. The Q-values are updated with the experience tuple (s_t, a_t, r_t, s_{t+1}) at every iteration in which the agent interacts with the environment. The update rule is based on the Bellman equation:

Q(s, a) = Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a))    (10)

where α is the learning rate and γ is the discount factor. Tabular Q-learning methods suffer from explosive memory use for tracking the tremendous number of state–action pairs when faced with a large MDP. For example, in our HVAC control problem, each zone has 15 state variables (see Table 1), among which most are continuous variables such as
outdoor air temperature (OAT) and Predicted Percentage of Dissatisfied (PPD). Thus, our RL controller faces a high-dimensional and continuous state space. It would be trapped by high computational complexity and memory expense if we used tabular Q-learning. A deep neural network can relieve us of those troubles and guarantee performance through its approximation ability.

[24] proposes a two-network design for the DQN algorithm: one network is called the Q network Q(s, a; θ) and the other is called the target network Q̂(s, a; θ⁻). The former selects the optimal action with respect to the state, while the latter generates the target value as a reference for the update of the former. At update iteration i, the Q network is updated by minimizing the mean-square error between Q(s, a; θ) and Q̂(s, a; θ⁻), i.e., the DQN loss (see Eq. (11)). With this loss, the parameters of the Q network are updated by a stochastic gradient descent algorithm. The parameters of the target network are slowly replaced with those of the Q network.

L_i(θ_i) = E[(r + γ max_{a′} Q̂(s′, a′; θ_i⁻) − Q(s, a; θ_i))²]    (11)

Besides the deep neural network, DQN adopts an experience replay buffer [24], which allows a batch of experience tuples to be randomly sampled from the buffer for the Q-network update. The replay buffer prevents high correlation of the learning experiences and thus helps the agent learn more efficiently. As for the action selection of the Q network, exploration and exploitation should be well balanced throughout the whole training process of the agent. Exploration means trying new actions, whereas exploitation prefers to employ the best actions known from learning. One strategy for balancing them is ε-greedy: with probability 1 − ε, the agent selects the action with the highest Q-value, while with probability ε, the agent selects a random action. We adopt ε-greedy in our DQN design.

4.3. Non-stationary DQN method

Our proposed non-stationary DQN algorithm for HVAC control in non-stationary building environments is shown in Algorithm 1. Before we further explain the details of the algorithm, some preliminary assumptions on Algorithm 1 must be stated.

• We assume that an MDP context in non-stationary environments lasts long enough, i.e., at least one week. This is reasonable for collecting enough samples for ODCP to detect non-stationarity. Emergencies are not considered in this paper as they have only a negligible influence on the control policy when long-term optimization is considered.
• We assume that the major factor causing the non-stationary building environment is periodic, like the local weather conditions, so an MDP context will become active again after a period. For example, if the local weather transitions follow a distinct seasonal pattern, then an MDP context working for the current building environment will be activated again within a year. Therefore, we set the learning timesteps T (see line 1 in Algorithm 1) based on some prior knowledge of the period.

In the beginning, the Q network and the target network are initialized randomly with the same parameters, together with a memory buffer for experience storage. Then the initial tuple ⟨D, Q, Q̂⟩ for a DQN agent is added into an empty agent list, as shown in line 6.

The training process of the DQN agent can be divided into 3 stages (see Fig. 5). Over the first M0 episodes, we apply the baseline DQN algorithm to learn a basic DQN model called the pretrained DQN (see lines 11–12), which does not consider non-stationarity in weather conditions and serves to collect experience tuples {(s_t, r_t, s_{t+1})}, t = 1, 2, …, T, for change point detection. At Stage 2, the ODCP algorithm is performed on the experience tuples collected throughout episode M0 to find the change points, as shown in lines 30–32. Meanwhile, the number of MDP contexts and their activated time intervals become known. Finally, at Stage 3, we replace L with a list of ⟨D, Q, Q̂⟩ tuples. The number of tuples in L is equal to the number of contexts, and each tuple is initialized as in lines 3–5. To make a distinction, we call each DQN in the agent list a context DQN. Over the training episodes after episode M0, a context DQN is updated only in the time interval when its corresponding MDP context is active, as shown in lines 13–15.

Here we explain the advantage of applying ODCP to experience tuples from a pretrained DQN instead of experience tuples from the very beginning.
Table 3
Cumulative rewards of the two methods after convergence.

                        Non-stationary DQN    Baseline DQN
Single-zone problem
  Mean                  −407.40               −430.65
  Std.                  4.68                  7.63
Multi-zone problem
  Mean                  −2476.12              −2882.82
  Std.                  30.05                 69.68
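The simulation framework described below wraps the EnergyPlus building model behind an OpenAI Gym interface and converts raw observations into states and rewards for the controllers. As a rough orientation, one 90-day training episode at a 15-min timestep could be driven as in the sketch below; the environment id and the controller object with its methods are assumptions for illustration, not the authors' implementation.

```python
import gym  # the co-simulation (EnergyPlus + BCVTB) is exposed through this interface

def run_episode(env_id, controller, steps_per_episode=8640):
    """One 90-day episode at 15-min granularity (90 days * 96 steps/day = 8640 steps)."""
    env = gym.make(env_id)                      # hypothetical registered EnergyPlus environment
    raw_obs = env.reset()
    state = controller.process(raw_obs)         # processor: raw observations -> state/reward info
    total_reward = 0.0
    for _ in range(steps_per_episode):
        action = controller.act(state)          # e.g., epsilon-greedy choice of setpoint increments
        raw_obs, reward, done, info = env.step(action)
        next_state = controller.process(raw_obs)
        controller.remember(state, action, reward, next_state)  # fill the replay buffer
        controller.learn()                      # one DQN update from a sampled batch
        state = next_state
        total_reward += reward
        if done:
            break
    return total_reward
```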
Fig. 8. Simulation framework.

… the EnergyPlus [30] model in an OpenAI Gym interface [44]. The simulation environment includes EnergyPlus and BCVTB (Building Controls Virtual Test Bed). BCVTB serves as a connector for different simulation programs to run co-simulation [45].

In a simulation run, an EnergyPlus instance is created using the input definition file and the data exchange file. The input definition file defines the building physics, HVAC system and simulation settings. The data exchange file defines the observation variables and control variables. In the data-exchange stage, variables can be read and written by Python scripts for control. The processor converts the raw observations into state and reward information for the controllers. Every simulation run starts at 0:00, January 1st and ends at 0:00, April 1st in a non-leap year, that is, 90 days. The granularity of the EnergyPlus simulation is 15 min. For RL agent training, a simulation run is called a training episode.

In addition to the whole simulation framework and simulation settings, we also present the detailed design of the neural network and related hyperparameters (see Table 2). Both the neural network design and the hyperparameters are the same for non-stationary DQN and baseline DQN.

For the single-zone problem, the training episode number is 400. When the non-stationary DQN agent finishes the 100th training episode, i.e., M0 = 100 (see Algorithm 1), we apply ODCP to the experience tuples throughout the episode. The minimum length of the active observation window for a single change point detection is 1344, that is, two weeks, as we assume that there is at most one change point in any window of two weeks for our building environment. For the multi-zone problem, the training episode number is 250. When the agent finishes the 50th training episode, i.e., M0 = 50, we apply ODCP to the experience tuples.

5.3. Convergence of non-stationary DQN

Fig. 9. Convergence of non-stationary DQN and baseline DQN. Both the average values and two standard deviation bounds are presented.

As shown in Fig. 9, both curves become steady as training evolves. Notice that the convergence of baseline DQN is faster compared with non-stationary DQN. The reasons lie in two aspects. First, non-stationary DQN has to train multiple context DQNs, and there are even more DQNs in the multi-zone problem. In each training episode, the context DQNs are updated respectively by the experiences in their activated time intervals. Therefore, they need more episodes to collect enough experiences for learning. Second, except for the first context DQN, the start state of a context DQN is exactly the end state of the previous one. Context DQNs have to spend more time learning control strategies for varying start states. The details of the time cost are as follows: a computer with a 32-core Intel Xeon(R) Gold 6130 CPU and an NVIDIA GTX 3090 GPU is used for training. For non-stationary DQN, it takes 200 episodes to converge in the single-zone problem and 175 episodes in the multi-zone problem; one iterative training process takes 42 s in the single-zone problem and 77 s in the multi-zone problem. For baseline DQN, it takes 100 episodes to converge in the single-zone problem and also 100 episodes in the multi-zone problem; one iterative training process takes 35 s in the single-zone problem and 79 s in the multi-zone problem.

Considering the episodes at the end of training, we can see that the average cumulative reward of non-stationary DQN is higher than that of baseline DQN. As shown in Table 3, in the single-zone problem the average cumulative reward of non-stationary DQN is 5% higher than that of baseline DQN while the standard deviation is 39% smaller. Meanwhile, in the multi-zone problem, the average cumulative reward of non-stationary DQN is 15% higher than that of baseline DQN while the standard deviation is 57% smaller.

5.4. Performance in the single-zone problem

Fig. 10. Total HVAC electric demand power, PPD, heating and cooling setpoints and indoor air temperature of baseline DQN and non-stationary DQN for single-zone control.

We apply the well-trained models to the training environment. As shown in Fig. 10, both controllers satisfy the occupant thermal comfort, maintaining PPD lower than 10%. The energy consumption curves are nearly periodic as both controllers learn to work during the occupied hours every weekday. Checking the rest of the hours, such as weekends when there is no occupant inside the zone, both controllers learn to keep the heating and cooling setpoints at the limits of the setpoint band to idle the coils of the VAV boxes, avoiding unnecessary energy consumption.

In Fig. 11, the baseline DQN controller conducts heating actions from 0:00 to 4:00 on March 20th (see Fig. 11(a)). The state resulting in the heating actions turns out to be similar to the state at 6:00 on February 28th (see Fig. 11(b)), which is the very moment when both the baseline DQN controller and the non-stationary controller have learned to start heating for the working hours. It seems that heating actions are appropriate for the state under the environment context on February 28th but not for the state under the environment context on March 20th, since the baseline DQN controller has to conduct cooling actions from 0:00 to 4:00 on March 20th to correct the mistakes. We call this situation 'mistake and correct' in this study; it causes unnecessary energy use and can be seen frequently under the baseline DQN controller, but hardly occurs under the non-stationary controller.

In Fig. 12, the baseline DQN controller fails to raise the temperature before the working hours and causes large short-term discomfort (see Fig. 12(a)). This is because the baseline DQN keeps the heating setpoint at the limit until 7:15, as it believes it is facing a state similar to that at 3:30 on February 26th (see Fig. 12(b)). The baseline DQN fails to recognize the different environment contexts in these two periods, which thus delays the heating behaviors. In contrast, the non-stationary DQN learns the right timings to raise the temperature for different environment contexts.

Considering the two examples above, the control policy of the baseline DQN controller is not stable when faced with some states even though the controller has been trained for a long time. In contrast, the non-stationary DQN controller can react properly to similar states from different contexts. The reason is that the baseline DQN controller gets confused when it learns from experiences of different contexts. Although some states seem similar, the rewards and next states can differ greatly since the experiences come from contexts with distinct dynamics. Errors then occur when the Bellman equation (see Eq. (11)) is used to update the Q-values. As a result, the action selection of the baseline DQN controller is ambiguous, as it tends to be neutral towards the optimal control policies for different contexts.

Besides, the non-stationary DQN controller is compared with a rule-based controller, which maintains the heating setpoint at 24 °C and the cooling setpoint at 26 °C during working hours when the occupant number is larger than 0. When there is no occupant, it sets the heating and cooling setpoints to 15 °C and 30 °C. The results are shown in Fig. 13 and the related discussion is presented in the next subsection.

Fig. 13. Cumulative reward, average total HVAC electric demand power (W) and average PPD (%) of the rule-based controller, the baseline DQN controller and the non-stationary DQN controller in the training building environment.

5.5. Performance in the multi-zone problem

Fig. 13 shows the cumulative reward, average total HVAC electric demand power and average PPD of the three controllers in the single-zone problem and the multi-zone problem. In both problems, the cumulative rewards of non-stationary DQN and baseline DQN are higher than that of the rule-based controller. Although the average total HVAC electric demand power of the rule-based controller is lower, its average PPD and standard deviation of PPD are much higher than those of the two DQN-based methods and remain far from the desired comfort range for the indoor thermal environment; occupants' thermal comfort cannot be guaranteed by the rule-based controller. Compared to baseline DQN, in the single-zone problem non-stationary DQN reduces the average total HVAC electric demand power by 2% and the average PPD by 9%; in the multi-zone problem, non-stationary DQN reduces the average total HVAC electric demand power by 13% and the average PPD by around 7% for each zone. The improvement in energy saving might seem marginal for the single-zone problem, but this is because the improvement of a single zone is diluted in the total HVAC demand of all zones. Besides, non-stationary DQN reduces the variation in occupant comfort, with a smaller standard deviation in PPD. As for the total demand, the average total HVAC electric demand power in the multi-zone problem is larger than that in the single-zone problem because more energy is required to satisfy the thermal comfort in 5 zones.

As can be seen in Fig. 14, the 5 zones detect similar change points and take similar action series throughout the 90 days. In addition, the change points of Zone 1 in the multi-zone problem are different from those in the single-zone problem. The reason is that besides weather conditions, there is another non-stationary factor in the multi-zone problem, that is, the actions of the other controllers, since the other zones are controlled by RL agents rather than by a fixed schedule. However, the non-stationarity of controller actions also seems to be highly influenced by the weather conditions (see Fig. 10 and Fig. 14). Therefore, although a new non-stationary factor is introduced, the multi-zone problem still conforms to our assumption about periodic factors in the non-stationary DQN algorithm (see Section 4.3).

Fig. 14. The heating and cooling setpoints and indoor air temperatures of the baseline DQN controller and the non-stationary DQN controller for 5 zones in the multi-zone problem.

Comparing the results of the single-zone and multi-zone problems, we find that when controlled by the non-stationary DQN controller, the energy consumption of Zone 1 in the multi-zone problem is higher than the energy consumption of Zone 1 in the single-zone problem. Checking the actions in January, when the outdoor air temperature is low, we find that the extra energy consumption is spent on forward heating actions during the unoccupied hours between two weekdays (as shown in Fig. 15). A forward heating action means that the controller brings the heating timing forward from around 6:00, when working hours are about to begin, to around 0:00. The explanation might be that under low outdoor air temperature, more heating time is required for all 5 zones to raise and stabilize the indoor air temperature. The forward heating actions of the non-stationary DQN controller can hardly be seen after January, as shown in Fig. 16(a). However, the forward heating actions of the baseline DQN controller can still be found in February (see Fig. 16(b)). This is further evidence that baseline DQN fails to learn optimal actions for different environment contexts, which leads to unwanted energy waste.

Fig. 15. Comparison of the non-stationary DQN controller in Zone 1 between the single-zone problem and the multi-zone problem.

Fig. 16. Comparison of the performance between non-stationary DQN and baseline DQN in the multi-zone problem.

5.6. Performance in areas with non-stationarity of different levels

The weather conditions in Pittsburgh are of a temperate continental climate, which often has a significant annual variation in temperature (hot summers and cold winters). In addition, every simulation starts at 0:00, January 1st and ends at 0:00, April 1st, which covers the period when the temperature rises significantly. Therefore, we believe that the weather conditions in Pittsburgh show typical non-stationarity for this HVAC control problem.

Nevertheless, an experiment is conducted to compare non-stationarity of different levels caused by two kinds of weather conditions, that is, middle-latitude continental climate and low-latitude marine climate. We use the typical weather profiles of 6 cities for simulation in the single-zone control problem. Among the 6 cities, Pittsburgh, Beijing, and Dunhuang have a middle-latitude continental climate, while Shenzhen, Miami, and Bangkok have a low-latitude marine climate. To show the difference between the two climates, we pick outdoor air temperature as an example, which has a major influence on the building thermal model. Fig. 17(a) shows the temperature evolution from January 1st to March 31st in the 6 cities. The temperature rise is significant in the cities with a middle-latitude continental climate, while the temperature trend seems steady in the cities with a low-latitude marine climate. This suggests that the former have more non-stationarity caused by weather conditions. Fig. 17(b) shows the convergence curves of non-stationary DQN and baseline DQN in the single-zone problem for the 6 cities. Non-stationary DQN increases the average episode reward after convergence for all 6 cities. For those cities with more obvious non-stationarity, the improvement is accordingly more significant. Therefore, non-stationary DQN works for HVAC control in areas with non-stationarity of different levels.

5.7. Stability of non-stationary DQN

To prove the stability of non-stationary DQN, we conduct some experiments where a disturbance happens at random timesteps. Disturbance means that the setpoints are forced to boundary values before some timestep during the simulation. To be more concrete, we adopt 3 boundary settings:

(i) heating setpoint = 15 °C, cooling setpoint = 15 °C;
(ii) heating setpoint = 15 °C, cooling setpoint = 30 °C;
(iii) heating setpoint = 30 °C, cooling setpoint = 30 °C.

In addition, we randomly choose 4 timesteps as recovery points from the 8640 timesteps (one every 15 min over 90 days): 264, 1800, 5796, 6766. For example, if we assign the recovery point to be 1800 and adopt boundary setting (i), before recovery point 1800 the setpoints are set to 15 °C, 15 °C, while after the recovery point the setpoints are controlled by the trained RL controller.

As shown in Fig. 18, non-stationary DQN can adjust the indoor temperature to get everything back on track. Shortly after the disturbance, the state trajectories and reward trajectories return to the no-disturbance trajectories. This indicates that non-stationary DQN obtains deployment stability: first, the size of the state space and the action space in the HVAC control problem is moderate, so the RL controller can explore and learn the Q-values for nearly all the states; second, the deep neural network can approximate the Q-values for similar states even if they were not explored.

Fig. 18. State (indoor air temperature) trajectories and reward trajectories in 3 boundary settings: (i) heating setpoint = 15 °C, cooling setpoint = 15 °C; (ii) heating setpoint = 15 °C, cooling setpoint = 30 °C; (iii) heating setpoint = 30 °C, cooling setpoint = 30 °C.

5.8. Generalization of non-stationary DQN

To validate the generalization of our non-stationary DQN algorithm to an unseen environment, both the well-trained non-stationary DQN controller and the baseline DQN controller are further tested in a new building environment. Here we specify some similarities and differences between the training and test environment settings.

Considering the model assumptions in Section 2.2, a similar non-stationary MDP is required in the test environment for the application of the trained DRL controller. In Section 4.3, we assume the major cause of the non-stationary environment is periodic weather conditions. Thus, in the test environment we adopt the same weather profile, building model and VAV system used in training.

The major difference lies in the schedules of occupancy and electric equipment. As shown in Fig. 20, the occupants and equipment in the training environment follow a periodic schedule; the working hours start at 8:15 and end at 21:15 on weekdays. However, it is more realistic for occupants and equipment to follow a schedule with randomness. In the test environment, we test our well-trained DQN controllers with a stochastic occupancy schedule and a stochastic equipment schedule provided by [46]. The stochastic occupancy schedule is generated by the LBNL simulator [47] and the stochastic equipment schedule is based on the occupancy with added Gaussian noise [46] (see Fig. 20). All five zones in the building adopt the new schedules of occupancy and equipment.

The same 90-day simulation is applied in the test case. The test results are shown in Fig. 19. Similar to the results in Fig. 13, the average total HVAC electric demand power and average PPD of the non-stationary DQN controller are lower than those of the baseline DQN controller in both the single-zone and multi-zone problems. Therefore, when faced with an unseen stochastic schedule, non-stationary DQN still exceeds baseline DQN after offline training with a fixed schedule.

Fig. 19. Cumulative reward, average total HVAC electric demand power (W) and average PPD (%) of the rule-based controller, the baseline DQN controller and the non-stationary DQN controller in the test building environment.

Fig. 20. The occupancy schedule and equipment schedule for the training environment and the test environment.

6. Conclusion

In this study, a novel HVAC control method combining active building environment change detection and a deep Q network is proposed to minimize HVAC energy consumption while maintaining occupant thermal comfort. This non-stationary DQN method identifies the environment contexts by detecting the change points of the building environment and learns an effective control strategy for each context. Case study results prove the effectiveness of our proposed method: (i) compared to the state-of-the-art DQN method, our proposed method ends up training with a higher episode cumulative reward; (ii) our proposed method can save up to 13% of unnecessary HVAC energy consumption and reduce comfort violation by up to 9%; (iii) our proposed method shows very good performance in both single-zone and multi-zone HVAC control problems; (iv) it also obtains stability against disturbance and generalization to unseen building environments.

In future work, we aim to improve the method by considering more complicated situations and to make it more applicable in the real world. First, the trade-off between energy savings and comfort satisfaction should be further studied by varying the weights of the energy consumption and the thermal comfort. Second, different from the simulation environment, where the states can be acquired easily by input or calculation, real building deployment requires an efficient design of the sensor system and the ability to reduce potential errors in the sensor data. Third, to further enhance the learning efficiency, including distributed learning in the framework can be investigated, especially in scenarios with large and complex HVAC systems. Moreover, in district operation management involving multiple buildings, the RL agents should learn to make group decisions to achieve complicated objectives, for example, electric power peak-load shifting. By exploring these directions, the non-stationary control method will become more robust and efficient for real-world deployment.

CRediT authorship contribution statement

Xiangtian Deng: Investigation, Methodology, Software, Writing – original draft. Yi Zhang: Resources, Supervision. Yi Zhang: Conceptualization, Methodology, Writing – review & editing. He Qi: Project administration, Funding acquisition, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

References

[1] A. Costa, M.M. Keane, J.I. Torrens, E. Corry, Building operation and energy performance: Monitoring, analysis and optimisation toolkit, Appl. Energy 101 (2013) 310–316, http://dx.doi.org/10.1016/j.apenergy.2011.10.037.
[2] L. Pérez-Lombard, J. Ortiz, C. Pout, A review on buildings energy consumption information, Energy Build. 40 (2008) 394–398, http://dx.doi.org/10.1016/j.enbuild.2007.03.007.
[3] Y. Chen, H. Tan, Short-term prediction of electric demand in building sector via hybrid support vector regression, Appl. Energy 204 (2017) 1363–1374, http://dx.doi.org/10.1016/j.apenergy.2017.03.070.
[4] F. Jalaei, G. Guest, A. Gaur, J. Zhang, Exploring the effects that a non-stationary climate and dynamic electricity grid mix has on whole building life cycle assessment: A multi-city comparison, Sustainable Cities Soc. 61 (2020) 102294, http://dx.doi.org/10.1016/j.scs.2020.102294.
[5] Y. Zhou, Z. Kang, L. Zhang, C. Spanos, Causal analysis for non-stationary time series in sensor-rich smart buildings, in: 2013 IEEE International Conference on Automation Science and Engineering (CASE), 2013, pp. 593–598, http://dx.doi.org/10.1109/CoASE.2013.6654000.
[6] Y. Chen, F. Zhang, U. Berardi, Day-ahead prediction of hourly subentry energy consumption in the building sector using pattern recognition algorithms, Energy 211 (2020) 118530, http://dx.doi.org/10.1016/j.energy.2020.118530.
[7] Z. Wang, T. Hong, Reinforcement learning for building controls: The opportunities and challenges, Appl. Energy 269 (2020) 115036, http://dx.doi.org/10.1016/j.apenergy.2020.115036.
[8] S. Wang, Z. Ma, Supervisory and optimal control of building HVAC systems: A review, HVAC&R Res. 14 (2008) 3–32, http://dx.doi.org/10.1080/10789669.2008.10390991.
[9] Jian Liu, Wen-Jian Cai, Gui-Qing Zhang, Design and application of handheld auto-tuning PID instrument used in HVAC, in: 2009 4th IEEE Conference on Industrial Electronics and Applications, 2009, pp. 1695–1698, http://dx.doi.org/10.1109/ICIEA.2009.5138484.
[10] Jiangjiang Wang, Chunfa Zhang, Youyin Jing, Application of an intelligent PID control in heating ventilating and air-conditioning system, in: 2008 7th World Congress on Intelligent Control and Automation, 2008, pp. 4371–4376, http://dx.doi.org/10.1109/WCICA.2008.4593624.
[11] Guang Geng, G.M. Geary, On performance and tuning of PID controllers in HVAC systems, in: Proceedings of IEEE International Conference on Control and Applications, vol. 2, 1993, pp. 819–824, http://dx.doi.org/10.1109/CCA.1993.348229.
[12] C. Bae, C. Chun, Research on seasonal indoor thermal environment and residents' control behavior of cooling and heating systems in Korea, Build. Environ. 44 (2009) 2300–2307, http://dx.doi.org/10.1016/j.buildenv.2009.04.003.
[13] W.-T. Li, S.R. Gubba, W. Tushar, C. Yuen, N.U. Hassan, H.V. Poor, K.L. Wood, C.-K. Wen, Data driven electricity management for residential air conditioning systems: An experimental approach, IEEE Trans. Emerg. Top. Comput. 7 (2019) 380–391, http://dx.doi.org/10.1109/TETC.2017.2655362.
[14] Y. Zhou, C. Lork, W.-T. Li, C. Yuen, Y.M. Keow, Benchmarking air-conditioning energy performance of residential rooms based on regression and clustering techniques, Appl. Energy 253 (2019) 113548, http://dx.doi.org/10.1016/j.apenergy.2019.113548.
[15] Y. Ma, F. Borrelli, B. Hencey, B. Coffey, S. Bengea, P. Haves, Model predictive control for the operation of building cooling systems, IEEE Trans. Control Syst. Technol. 20 (2012) 796–803, http://dx.doi.org/10.1109/TCST.2011.2124461.
[16] M. Maasoumy, M. Razmara, M. Shahbakhti, A.S. Vincentelli, Handling model uncertainty in model predictive control for energy efficient buildings, Energy Build. 77 (2014) 377–392, http://dx.doi.org/10.1016/j.enbuild.2014.03.057.
[17] B. Rajasekhar, W. Tushar, C. Lork, Y. Zhou, C. Yuen, N.M. Pindoriya, K.L. Wood, A survey of computational intelligence techniques for air-conditioners energy management, IEEE Trans. Emerg. Top. Comput. Intell. 4 (2020) 555–570, http://dx.doi.org/10.1109/TETCI.2020.2991728.
[18] A. Afram, F. Janabi-Sharifi, Theory and applications of HVAC control systems – a review of model predictive control (MPC), Build. Environ. 72 (2014) 343–355, http://dx.doi.org/10.1016/j.buildenv.2013.11.016.
[19] Y. Yao, D.K. Shekhar, State of the art review on model predictive control (MPC) in heating ventilation and air-conditioning (HVAC) field, Build. Environ. 200 (2021) 107952, http://dx.doi.org/10.1016/j.buildenv.2021.107952.
[20] Y. Ma, J. Matuško, F. Borrelli, Stochastic model predictive control for building HVAC systems: Complexity and conservatism, IEEE Trans. Control Syst. Technol. 23 (2015) 101–116, http://dx.doi.org/10.1109/TCST.2014.2313736.
[21] A. Parisio, D. Varagnolo, M. Molinari, G. Pattarello, L. Fabietti, K.H. Johansson, Implementation of a scenario-based MPC for HVAC systems: an experimental case study, IFAC Proc. Vol. 47 (2014) 599–605, http://dx.doi.org/10.3182/20140824-6-ZA-1003.02629.
[22] X. Ding, W. Du, A.E. Cerpa, MB2C: Model-based deep reinforcement learning for multi-zone building control, in: Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 50–59, http://dx.doi.org/10.1145/3408308.3427986.
[23] E. Gibney, Google AI algorithm masters ancient game of Go, Nature (2016).
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533, http://dx.doi.org/10.1038/nature14236.
[25] Z. Zou, X. Yu, S. Ergan, Towards optimal control of air handling units using deep reinforcement learning and recurrent neural network, Build. Environ. 168 (2020) 106535, http://dx.doi.org/10.1016/j.buildenv.2019.106535.
[26] C. Lork, W.-T. Li, Y. Qin, Y. Zhou, C. Yuen, W. Tushar, T.K. Saha, An uncertainty-aware deep reinforcement learning framework for residential air conditioning energy management, Appl. Energy 276 (2020) 115426, http://dx.doi.org/10.1016/j.apenergy.2020.115426.
[27] A. Naug, M. Quiñones Grueiro, G. Biswas, A relearning approach to reinforcement learning for control of smart buildings, 2020, arXiv:2008.01879.
[28] M. Biemann, F. Scheller, X. Liu, L. Huang, Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control, Appl. Energy 298 (2021) 117164, http://dx.doi.org/10.1016/j.apenergy.2021.117164.
[29] T. Wei, Yanzhi Wang, Q. Zhu, Deep reinforcement learning for building HVAC control, in: 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6, http://dx.doi.org/10.1145/3061639.3062224.
[30] EnergyPlus, 2020, URL https://energyplus.net/.
[31] Z. Zhang, A. Chong, Y. Pan, C. Zhang, K.P. Lam, Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning, Energy Build. 199 (2021) 472–490, http://dx.doi.org/10.1016/j.enbuild.2019.07.029.
[32] L. Yu, Y. Sun, Z. Xu, C. Shen, D. Yue, T. Jiang, X. Guan, Multi-agent deep reinforcement learning for HVAC control in commercial buildings, IEEE Trans. Smart Grid 12 (2021) 407–419, http://dx.doi.org/10.1109/TSG.2020.3011739.
[33] R. Sutton, A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[34] S. Padakandla, K.J. Prabuchandran, S. Bhatnagar, Reinforcement learning algorithm for non-stationary environments, Appl. Intell. 50 (2020).
[35] T. Jaksch, R. Ortner, P. Auer, Near-optimal regret bounds for reinforcement learning, J. Mach. Learn. Res. 11 (2010) 1563–1600, http://jmlr.org/papers/v11/jaksch10a.html.
[36] P. Auer, R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, in: B. Schölkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems, vol. 19, MIT Press, 2007, https://proceedings.neurips.cc/paper/2006/file/c1b70d965ca504aa751ddb62ad69c63f-Paper.pdf.
[37] B.C. da Silva, E.W. Basso, A.L.C. Bazzan, P.M. Engel, Dealing with non-stationary environments using context detection, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 217–224, http://dx.doi.org/10.1145/1143844.1143872.
[38] C.M. Bishop, Neural Networks for Pattern Recognition, 1995.
[39] P.O. Fanger, Thermal Comfort: Analysis and Applications in Environmental Engineering, Danish Technical Press, Copenhagen, Denmark, 1970.
[40] P.K.J.N. Singh, P. Dayama, V. Pandit, Change point detection for compositional multivariate data, 2019, arXiv:1901.04935.
[41] Department of Energy, Commercial reference buildings, 2020, https://www.energy.gov/eere/buildings/commercial-reference-buildings.
[42] S. Wilcox, W. Marion, Users manual for TMY3 data sets, 2008, http://dx.doi.org/10.2172/928611.
[43] Z. Zhang, K.P. Lam, Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system, in: Proceedings of the 5th Conference on Systems for Built Environments, BuildSys '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 148–157, http://dx.doi.org/10.1145/3276774.3276775.
[44] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016.
[45] Michael Wetter, Coffey Philip, Brian, Building controls virtual test bed, 2008.
[46] URL https://github.com/zhangzhizza/HVAC-RL-Control/tree/a3c/src/eplus-env, 2019.
[47] Y. Chen, T. Hong, X. Luo, An agent-based stochastic occupancy simulator, Building Simulation, 2018.