
Building and Environment 211 (2022) 108680


Towards optimal HVAC control in non-stationary building environments combining active change detection and deep reinforcement learning

Xiangtian Deng a, Yi Zhang a,c, Yi Zhang a,b,∗, He Qi d

a Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, 518055, PR China
b Institute of Future Human Habitats, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, PR China
c Department of Automation, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, 100084, PR China
d China Construction Science and Technology Group Cooperation, Shenzhen 518000, China

ARTICLE INFO ABSTRACT

Keywords: Energy consumption for heating, ventilation and air conditioning (HVAC) has increased significantly and
Heating accounted for a large proportion of building energy growth. Advanced control strategies are needed to reduce
Ventilation and air conditioning (HVAC) energy consumption with maintaining occupant thermal comfort. While compared to other control problems,
Non-stationary environments
HVAC control is faced with numerous restrictions from real building environments. One key restriction is
Deep reinforcement learning (DRL)
non-stationarity, i.e., the varying HVAC system dynamics. Researchers have paid efforts to solve the non-
Change point detection
stationarity problems through different approaches, among which deep reinforcement learning gains traction
for its advantages in capturing real-time information, controlling adaptively to system feedbacks, avoiding
tedious modeling works and combining with deep learning techniques. However, current researches solved
non-stationarity in a passive manner which hinders its potential and adds instability in real application. To fill
this research gap, we propose a novel HVAC control method combining active building environment change
detection and deep Q network (DQN), named non-stationary DQN. This method aims to disentangle the non-
stationarity by actively identifying the change points of building environments and learning effective control
strategies for corresponding building environments. The simulation results demonstrate that this developed
non-stationary DQN method outperforms the state-of-art DQN method in both single-zone control and multi-
zone control tasks by saving unnecessary energy use and reducing thermal violation caused by non-stationarity.
The improvement can reach 13% in energy-saving and 9% in thermal comfort. Besides, according to the
results, our proposed method obtains stability against disturbance and generalization to an unseen building
environment, which shows its robustness and potential in real-life applications.

1. Introduction

Building energy consumption has been playing a major role in the rapid growth of world energy use. Buildings account for 40% of global primary energy consumption and 30% of CO2 emissions [1]. Among all building services, the heating, ventilation, and air conditioning (HVAC) system contributes heavily to the growth of energy use: it accounts for 50% of building consumption and 20% of total consumption in energy use growth in the USA [2]. Therefore, improving the efficiency of HVAC systems while guaranteeing occupant thermal comfort can greatly benefit energy saving and carbon reduction.

However, compared to other control problems, HVAC control is more difficult in finding the optimal control policies. One of the vital difficulties is the non-stationary building environment. For simplification, most works assume that control systems are stationary over an infinite time horizon, which means the building environment will evolve into the same future states if the current states and control actions are the same. However, HVAC systems in real life are highly non-stationary. A recent study [3] clarified that some buildings show non-stationary electrical temporal features caused by endogenous factors (e.g., activities) or exogenous factors (e.g., weather conditions). [4] pointed out that the whole building life cycle can be influenced by a non-stationary climate. To be more concrete, the system dynamics (e.g., the thermal model) can evolve into a different stage periodically: the building environment can change monthly with operation schedules, seasonally with weather conditions, or yearly with equipment aging. In addition, emergencies like sudden occupant increases can also degrade control effectiveness. Therefore, non-stationarity has gained traction in recent studies. [5] proposed to capture the causal relations in non-stationary sensor data collected in smart buildings using a statistical time-series analysis.
∗ Corresponding author at: Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, 518055, PR China. E-mail address: [email protected] (Y. Zhang).
https://doi.org/10.1016/j.buildenv.2021.108680
Received 31 August 2021; Received in revised form 8 November 2021; Accepted 9 December 2021; Available online 3 January 2022
0360-1323/© 2021 Elsevier Ltd. All rights reserved.

Nomenclature

A3C      Asynchronous advantage actor critic
BCNN     Bayesian convolutional neural network
BCVTB    Building control virtual test bed
DQN      Deep Q network
DRL      Deep reinforcement learning
HVAC     Heating, ventilation, and air conditioning
LBNL     Lawrence Berkeley National Laboratory
LSTM     Long Short-Term Memory
MDP      Markov decision process
MPC      Model predictive control
ODCP     Online parametric Dirichlet change point
PID      Proportional–Integral–Derivative
PPD      Predicted percentage of dissatisfied
QL       Q-learning
RBC      Rule-based control
RL       Reinforcement learning
RL-CD    Reinforcement Learning with Context Detection
RMPC     Robust model predictive control
SMPC     Stochastic model predictive control
TMY      Typical meteorological year
UCRL     Upper Confidence Reinforcement Learning
VAV      Variable air volume

Similarly, [6] improved the prediction accuracy of HVAC energy consumption by considering non-stationary patterns.

In HVAC control research, efforts have mainly been devoted to three approaches: rule-based control (RBC), model predictive control (MPC), and reinforcement learning (RL) [7]. RBC methods are widely used in HVAC systems; they adopt a set of artificial rules designed from expert experience [8–11]. For non-stationary environments, seasonal climate changes (e.g., temperature variations) are considered in the control rules to provide more comfortable indoor environments [12]. Besides, data-driven methods are also employed to design the setpoints of the air-conditioning system under different human activity levels [13] and to benchmark the energy performance [14]. However, as HVAC systems and building environments become more complicated, it takes endless effort to cover all possible situations in control rules. Moreover, artificial rules lack the flexibility to capture the evolving HVAC system information in real time.

Model predictive control (MPC) is more promising in HVAC control for its capability of exploiting predictive information and making decisions adaptively [15,16]. MPC is based on a rigorous HVAC system model, including building structures and HVAC schematics, and optimizes control strategies by predicting the future evolution of the system. The modeling methods can be categorized into three kinds: the white-box model developed from physical laws, the black-box model based on data-driven methods, and the gray-box model fitting load data into a physical model [17]. To overcome non-stationarity, MPC takes the nonlinear dynamics, the time-varying system dynamics, and the time-varying disturbances into consideration in modeling [18]. Variants of MPC such as robust MPC (RMPC) and stochastic MPC (SMPC) are employed to handle the uncertainty caused by non-stationarity [19]. [20] proposed an SMPC design methodology that can minimize the expected energy cost and maintain comfort using uncertain predictions of weather conditions and building loads. In [21], the authors showed that their scenario-based MPC capturing building dynamics outperformed vanilla MPC methods in robustness and energy efficiency. However, MPC highly depends on the accuracy of the system models. Model errors might prevent a designed MPC controller from real application. Besides, modeling is time-demanding and labor-expensive. Furthermore, it might be insufficient to model building dynamics with non-stationarity by the first- or second-order approximations widely used in current MPC methods [22].

Reinforcement learning (RL) has achieved significant success in many areas, from Go [23] to video games [24]. In recent years, deep reinforcement learning (DRL) has gained traction in HVAC control research. Compared to RBC and MPC, DRL exhibits advantages in learning from collected data for decision-making [7]. Besides, the latest developments in deep learning can be combined with DRL to improve model complexity and decision efficiency. DRL is a data-driven method that comprises model-based RL and model-free RL. Model-based RL employs models to accelerate the learning process, which is similar to MPC but less computationally expensive. Recently, [25] adopted Long Short-Term Memory (LSTM) networks to model the HVAC systems for DRL. In [26], the authors proposed to model the air-conditioning and room temperature by a Bayesian Convolutional Neural Network (BCNN) to train an uncertainty-aware deep Q-learning agent. Related works on model-based RL have provided insights into handling non-stationarity. In [27], the authors claimed that DRL controllers can solve non-stationarity by relearning the environment dynamics and the control models. However, relearning building environment models can be time-demanding and decrease control efficiency. In addition, it did not provide a practical method for explicitly telling non-stationary building environments apart or identifying the timings when the environment changes happen.

In contrast, model-free RL learns optimal control policies by trial-and-error interactions with building environments without a model, which means it reduces tedious modeling work [7] and possesses better scalability and generalization [28]. For air-conditioning, [28] showed that popular model-free RL algorithms can reliably maintain the temperature while reducing energy consumption by more than 13% compared to model-based methods. In [29], a model-free DRL-based algorithm was developed for controlling energy consumption and tenant thermal comfort with a co-simulation framework based on EnergyPlus [30] for DRL training and validation. In [31], the authors implemented an Asynchronous Advantage Actor-Critic (A3C) based control method for a radiant heating system, achieving considerable heat demand savings in an office building deployment. Moreover, in [32], model-free DRL is extended to multi-zone cases and achieves more than 56% energy cost reduction for 30 zones in commercial buildings. The literature has shown the significant potential of model-free DRL. As for non-stationarity, some researchers believe that model-free RL algorithms can passively adapt to the latest building environment by updating the models with newly collected data. However, this assumption might only work theoretically or in idealized experimental environments where enough training time and data are given. [27] pointed out that the performance of DRL controllers degrades when the building environments evolve periodically. Besides, with passive adaptation the RL controllers are constantly converging from one environment to another without explicitly learning optimal policies for the different situations.

In a nutshell, non-stationary building environments are a vital problem hindering the further development of HVAC control methods, and we find some limitations in how the existing literature handles non-stationarity. On the one hand, modeling non-stationarity in the building models might add complexity and decrease the efficiency of the control policies. On the other hand, passive adaptation without modeling does not work for highly non-stationary building environments.

To overcome these gaps, an active and practical environment change detection method is needed for model-free DRL to exploit its potential and ensure the robustness of DRL-based HVAC control methods. We formulate HVAC control as a Markov decision process (MDP) in non-stationary environments and propose a new DRL-based method focusing on optimizing HVAC control strategies in non-stationary building environments.


Fig. 1. Reinforcement learning in stationary and non-stationary environments.

The main contributions of this study are summarized as follows:

• We develop a novel method called non-stationary DQN. To the best of our knowledge, it is the first method to actively detect dynamic changes of the building environment and exploit this information in HVAC control optimization. Meanwhile, it is a model-free DRL method, which is time-saving and avoids errors from inaccurate modeling.
• We conduct a thorough simulation-based comparison between the proposed method and the state-of-the-art DQN method in both single-zone and multi-zone control problems. Thermal comfort and energy consumption are adopted as key metrics to demonstrate the effectiveness of the proposed method in eliminating the effects of non-stationary environments.
• We verify the stability of the proposed method by examining its restorability after extreme disturbance. In addition, we verify its generalization by testing it in an unseen building environment with stochastic operation schedules.

The rest of the paper is organized as follows. Section 2 presents a review of reinforcement learning (RL) and RL algorithms in stationary and non-stationary environments, together with the concept of an MDP in non-stationary environments. The formulations of the single-zone and multi-zone control problems are presented in Section 3. In Section 4, the two algorithms adopted in our proposed method, DQN and ODCP, are briefly introduced, followed by a detailed explanation of our proposed HVAC control method combining environment change detection and DRL. Section 5 discusses our simulation results, and Section 6 concludes the paper and presents directions for future work.

2. RL model for non-stationary environments

2.1. Reinforcement learning

Reinforcement learning is a machine learning method that learns to perform sequential actions according to situations in order to maximize a reward signal through trial-and-error search [33]. Unlike supervised learning, which learns from a training set of labeled examples, reinforcement learning learns from its own experience of interacting with all kinds of situations; in addition, reinforcement learning differs from unsupervised learning as it aims to maximize the reward signal rather than find the hidden structure of the data [33]. In recent decades, reinforcement learning has been proved effective in solving decision-making problems.

Most reinforcement learning problems can be modeled in the mathematical form of a Markov decision process. In an MDP, an agent learns and makes decisions; everything the agent interacts with is called the environment [33]. For example, in a building, the agent can be a controller determining temperature setpoints, while the environment can be the combination of occupants, the HVAC system, weather, and so on. At decision epoch t, the agent chooses an action a_t according to the environment state s_t. At the next epoch, the agent receives a reward r_t and a new state s_{t+1} (see the stationary case in Fig. 1). Thus, an MDP can be represented by a tuple ⟨S, A, P, R⟩, where S and A are the state and action spaces. P: S × A × S → [0, 1] is the transition probability function; it defines the probability distribution of the next state s' at t + 1 that the system evolves into, given the current state s and action a (see Eq. (1)). R: S × A → ℝ is the reward function; it models the numerical reward yielded by applying action a in state s, i.e., R ∼ r(s, a).

p(s'|s, a) = Pr{s_{t+1} = s' | s_t = s, a_t = a}   (1)

Reinforcement learning algorithms can be categorized into value-based methods (e.g., Q-learning) and policy-based methods (e.g., policy gradient). These algorithms have been combined with deep neural networks for more complicated control problems.

2.2. MDP and RL algorithms in non-stationary environments

Technically speaking, the non-stationarity of the building environment mainly refers to the fact that the state transition probability P and the reward function R are non-stationary: they change along with some internal or external factors. To be more concrete, even though the instantaneous states (e.g., indoor air temperature) are similar in different periods, the building state evolutions can be different. It is insufficient to describe such a control problem as a stationary MDP.


Fig. 2. The VAV system of the simulation building.

Thus, [34] defined the MDP in non-stationary environments as follows. Given a family of MDPs {M_θ}, where θ takes values from a finite index set Θ, the MDPs M_θ = ⟨S, A, P_θ, R_θ⟩ share the same state and action spaces but have different transition probability distributions and reward functions. Each MDP in the family is called a context. We assume that only one context is active during a time interval (see Fig. 1), and there is a change point between two sequentially active contexts. For example, the environment may change from M_θ0 to M_θ1 at timestep T_1. Thus, the non-stationary dynamics P, R are:

P(s_{t+1} = s' | s_t = s, a_t = a) = { P_θ0(s'|s, a) for t < T_1;  P_θ1(s'|s, a) for T_1 ≤ t < T_2;  … }   (2)

and, for s_t = s, a_t = a,

R(s, a) = { R_θ0(s, a) for t < T_1;  R_θ1(s, a) for T_1 ≤ t < T_2;  … }   (3)

Fig. 1 demonstrates the non-stationary case of environments in reinforcement learning.
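To make the context structure of Eqs. (2) and (3) concrete, the following sketch wraps a family of per-context dynamics behind a single step interface and switches the active context at the given change points. It is only a toy illustration under assumed names (NonStationaryEnv, change_points) and toy dynamics, not an implementation of a building environment.

import random

class NonStationaryEnv:
    """Toy non-stationary MDP: one context is active per time interval (Eqs. (2)-(3))."""

    def __init__(self, contexts, change_points):
        # contexts: list of (transition_fn, reward_fn) pairs, one per MDP context M_theta
        # change_points: sorted timesteps [T1, T2, ...] at which the active context switches
        self.contexts = contexts
        self.change_points = change_points
        self.t = 0
        self.state = 0.0

    def active_context(self):
        # Index of the context whose time interval contains the current timestep t
        idx = 0
        for cp in self.change_points:
            if self.t >= cp:
                idx += 1
        return idx

    def step(self, action):
        transition_fn, reward_fn = self.contexts[self.active_context()]
        reward = reward_fn(self.state, action)
        self.state = transition_fn(self.state, action)
        self.t += 1
        return self.state, reward

# Two toy contexts with different dynamics and rewards (stand-ins for M_theta0, M_theta1)
contexts = [
    (lambda s, a: s + a + random.gauss(0, 0.1), lambda s, a: -abs(s)),
    (lambda s, a: s - a + random.gauss(0, 0.1), lambda s, a: -abs(s) - 0.5),
]
env = NonStationaryEnv(contexts, change_points=[50])  # context switches at T1 = 50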
In the literature, there have been some pioneering works addressing RL in non-stationary environments. One approach aims to minimize a performance criterion called regret (see Eq. (4)), which is the difference between the expected optimal reward collected over a finite horizon T from a start state s_0 and the reward produced by the current policy. [35] proposed the UCRL2 algorithm, a variant of Upper Confidence Reinforcement Learning (UCRL) proposed by [36]. UCRL2 adopts this approach and gives a sublinear regret upper bound by finding the optimal MDP from a set of plausible MDPs for the current environment, together with its corresponding optimal policy. The set of plausible MDPs is built by estimating the transition probability function and the reward function. UCRL2 restarts learning when it exceeds the diameter of the current MDP (see Eq. (5), where M is the environment context and T(s'|s, π) is the timestep of the first arrival at s' from s under policy π), which however discards all the former estimations.

Regret = V*_T(s_0) − Σ_{t=0}^{T−1} R(s_t, a_t)   (4)

D_M = max_{s≠s'} min_{π: S→A} E[T(s'|s, π)]   (5)

Another approach is to learn an optimal control policy for non-stationary environments by detecting the environmental changes. [37] proposed a context-detection-based algorithm called Reinforcement Learning with Context Detection (RL-CD). Similar to UCRL2, it statistically estimates the transition probability functions and the reward functions for a set of partial MDP models. It evaluates the quality of each partial model, that is, the difference between the current transition and reward and the learned transition probability and reward functions, and then updates the Q-values of the partial model with the highest quality. However, it is difficult to apply to complex control problems like HVAC control, because high-dimensional and continuous state variables are intractable for probability function estimation. A more efficient algorithm called Context QL is proposed by [34]. It addresses the context detection problem not by statistical estimation but by finding the change points of the state–reward tuples over a horizon T. An experience tuple (s_t, r_t, s_{t+1}) consists of the current state, the reward, and the next state in an iteration. Non-stationary MDPs are modeled as the combination of piecewise MDP contexts, as mentioned before. These RL algorithms have been proved effective in simple non-stationary control problems.

3. Problem formulation

3.1. A brief introduction of the non-stationary HVAC control problem

In this work, we consider the HVAC system in a commercial building with 5 zones. The heating and cooling setpoints of the thermostat in each zone are controlled. The goal is to minimize the HVAC energy consumption and satisfy occupant thermal comfort in non-stationary building environments. We consider two problems for the experiments, the single-zone control problem and the multi-zone control problem. The experiment settings are as follows:

• The single-zone problem. Only one of the 5 zones (Zone 1) is controlled by the RL controller, while the others are controlled by a default fixed schedule (see Fig. 3). Only Zone 1 needs to do the learning, so that we can investigate the performance of our proposed method in a single zone with relatively stable influence from the other zones.


Fig. 3. The fixed operation schedule used in the single-zone problem: on weekdays, the heating setpoint and cooling setpoint are set to be 21.1 ◦ C and 23.9 ◦ C from 7:00 to
18:00, and set to be 15.0 ◦ C and 30.0 ◦ C for the rest hours; on weekends, the heating setpoint and cooling setpoint are set to be 21.1 ◦ C and 23.9 ◦ C from 7:00 to 13:00 and
set to be 15.0 ◦ C and 30.0 ◦ C for the rest hours.

• The multi-zone problem. All zones are controlled by RL controllers, which is a distributed multi-agent control problem (see Fig. 2). We train 5 agents, one per zone, from the beginning, to investigate the performance of our method under varying interactions between different zones. Moreover, the state space differs from that of the single-zone problem (see Section 3.2.1).

3.2. Markov decision process formulation for the HVAC control problem

As mentioned in Section 2.1, the MDP formulation includes the state space, the action space and the reward function.

3.2.1. State space

The state is the input for the RL controller. In the single-zone problem, the state vector includes 15 variables, as shown in Table 1. They can be categorized into 2 types: (a) system states denoting HVAC system dynamics, which can be changed by the RL controller's actions. The heating setpoint and cooling setpoint are included in the system states because we use increments to the setpoints for control (see Section 3.2.2). (b) environment states, which do not change with the RL controller's actions but can influence the HVAC system dynamics through some mechanisms (natural evolution, plans, etc.). The relationship between the 2 types of states is shown in Fig. 4. It is therefore proper to model the HVAC problem as an MDP with an extended state space for the RL controller. The state space of this MDP not only includes the HVAC dynamics S, but also includes the actions of the exogenous environment E. The transition function of a Markov decision process with exogenous variables is:

P(s'|s, a, e) = Pr{s_{t+1} = s' | s_t = s, a_t = a, e_t = e}

where s ∈ S, e ∈ E. Moreover, the Bellman equation (Eq. (10)) for the Q-value update still holds if e, e' are given:

Q(s, a, e) = r + γ max_{a'} Q(s', a', e')

Fig. 4. Exogenous environment in the HVAC control problem.

Table 1
State variables. All listed variables are available from input data or simulation in EnergyPlus.

Type          Variable name                          Notation   Unit
Environment   Day of the week                        WD         –
Environment   Hour of the day                        DH         –
Environment   Outdoor air temperature                OAT        °C
Environment   Outdoor air relative humidity          ORH        %
Environment   Wind speed                             WS         m/s
Environment   Wind direction                         WD         degree from north
Environment   Diffuse solar radiation                DifS       W/m2
Environment   Direct solar radiation                 DirS       W/m2
System        Heating setpoint                       HtSpt      °C
System        Cooling setpoint                       ClSpt      °C
System        Indoor air temperature                 IAT        °C
System        Indoor air relative humidity           IRH        %
System        Predicted Percentage of Dissatisfied   PPD        %
Environment   Occupant flag                          Occup      –
System        Total HVAC energy consumption          P          W

As for the state space of the multi-zone problem, besides the above 15 state variables of the single-zone problem, the heating setpoints and cooling setpoints of the other 4 zones are also included. The actions and states of the other agents also influence the evolution of the building thermal dynamics. Thus, from the perspective of one agent, the other agents are also part of the environment, and their actions or states can provide extra information for the control policy. Moreover, the goal of the multi-zone problem is to minimize the total energy consumption and maintain thermal comfort for each zone. The problem becomes more complex when the group behaviors of the agents have to be taken into consideration. The group behaviors of the agents may also affect the non-stationarity of the building environment, which will be demonstrated in detail in Section 5.5.

In order to facilitate the learning of the deep neural network, all state variables are normalized into [0, 1] by min–max normalization (see Eq. (6)):

v_norm = (v − v_min) / (v_max − v_min)   (6)

where v is one of the state variables, and v_max and v_min for each variable can be estimated from historical data. This is a widely used pre-processing technique in machine learning [38].
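As a concrete illustration of the state pre-processing, the short sketch below assembles a state vector from the variables in Table 1 and applies the min–max normalization of Eq. (6). It is only a sketch: the variable ordering, the example bounds, and the names (normalize, build_state, WDir for wind direction) are assumptions made here, and in practice v_min and v_max would be estimated from historical data as described above.

import numpy as np

# Example bounds per variable (illustrative values; in practice estimated from history)
V_MIN = np.array([0, 0, -20.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 15.0, 10.0, 0.0, 0.0, 0, 0.0])
V_MAX = np.array([6, 23, 40.0, 100.0, 20.0, 360.0, 800.0, 1000.0, 30.0, 30.0, 35.0, 100.0, 100.0, 1, 50000.0])

def normalize(v, v_min=V_MIN, v_max=V_MAX):
    """Min-max normalization of Eq. (6), mapping each state variable into [0, 1]."""
    return np.clip((v - v_min) / (v_max - v_min), 0.0, 1.0)

def build_state(raw):
    """raw: dict holding the 15 variables of Table 1; returns the normalized state vector."""
    keys = ["WD", "DH", "OAT", "ORH", "WS", "WDir", "DifS", "DirS",
            "HtSpt", "ClSpt", "IAT", "IRH", "PPD", "Occup", "P"]
    return normalize(np.array([raw[k] for k in keys], dtype=float))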


Table 2
RL network design and training hyperparameters.

Hyperparameter               Value
Size of input                15
No. of hidden layers         4
Size of each hidden layer    [512, 512, 512, 512]
Activation layer             ReLU
Size of output               4
Memory capacity              10000
Optimizer                    Adam
Batch size                   128
Learning rate                0.0001
Discount factor              0.99
Soft update rate             0.1
Soft update iteration        200
Initial epsilon              0.15
Reward weight for energy     0.4
Reward weight for comfort    0.6
PPD threshold                0.15

3.2.2. Action space

According to our experiments, DQN obtains better efficiency than other DRL algorithms. DQN requires discretized actions. For DQN, actions are more like options to choose from than variables. Given a state at time t, a DQN agent decides to take one of the options; the 4 tuples listed below are the 4 options. This is similar to the options in a maze game, where we choose to move up, move down, move left, or move right at a crossing.

A = {(−1.0, −1.0), (1.0, 1.0), (−1.0, 1.0), (0.0, 0.0)}

For each action tuple, the two numbers represent the increments to the previous heating setpoint HtSpt and cooling setpoint ClSpt, respectively. For example, if the RL agent chooses the first tuple as the action at time t, the heating setpoint and cooling setpoint at t + 1 will be lowered by 1 °C compared to those at t:

HtSpt_{t+1} = HtSpt_t − 1

ClSpt_{t+1} = ClSpt_t − 1

Note that the action setting also ensures that the cooling setpoint is never less than the heating setpoint. There is also a value limit for the heating and cooling setpoints: the heating setpoint should not be lower than 15 °C and the cooling setpoint should not be higher than 30 °C.
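The setpoint update rule above can be summarized in a few lines of code. The sketch below applies a chosen action tuple to the current setpoints and enforces the 15–30 °C limits and the ordering constraint; it is an illustrative sketch (the function name apply_action and the clipping order are assumptions made here), not the authors' implementation.

# Discrete action set: increments to (heating setpoint, cooling setpoint) in degrees C
ACTIONS = [(-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (0.0, 0.0)]
SETPOINT_MIN, SETPOINT_MAX = 15.0, 30.0

def apply_action(ht_spt, cl_spt, action_index):
    """Apply the selected increment tuple and keep the setpoints within their limits."""
    d_ht, d_cl = ACTIONS[action_index]
    ht_new = min(max(ht_spt + d_ht, SETPOINT_MIN), SETPOINT_MAX)
    cl_new = min(max(cl_spt + d_cl, SETPOINT_MIN), SETPOINT_MAX)
    # Keep the cooling setpoint at or above the heating setpoint
    cl_new = max(cl_new, ht_new)
    return ht_new, cl_new

# Example: choosing action 0 at (21.1, 23.9) lowers both setpoints by 1 degree C
print(apply_action(21.1, 23.9, 0))  # -> (20.1, 22.9)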
3.2.3. Reward function

In our work, the Predicted Percentage of Dissatisfied (PPD) [39] is used as the thermal comfort metric. PPD is adopted as one of the thermal comfort indices by the globally recognized ASHRAE 55 and ISO 7730 standards for evaluating indoor environments. PPD is calculated from the Predicted Mean Vote (PMV) (see Eq. (7)). PMV quantifies thermal comfort by a mathematical formula with multiple factors, including metabolic rate, insulation, air temperature, air speed, mean radiant temperature, and relative humidity.

PPD = 100 − 95 × exp{−0.03353 × PMV⁴ − 0.2179 × PMV²}   (7)

In our paper, PPD can be obtained from the EnergyPlus simulation. In addition, the total HVAC electric power demand is used as the energy consumption metric; its details are introduced in Section 5.1. Considering both occupant thermal comfort and HVAC energy consumption, the reward function is:

r(t) = { −w_e P_HVAC(t) if Occup = 0;  −w_e P_HVAC(t) − w_c C_comfort(t) otherwise }   (8)

In Eq. (8), w_e and w_c are the weights for the total HVAC electric demand power P_HVAC(t) and the occupant thermal comfort cost C_comfort(t). Occup is the occupant status flag, which equals 0 when no people are in the controlled zone and equals 1 as long as there is at least one person. Only when there are occupants is the thermal comfort cost considered in the optimization. C_comfort(t) is calculated as follows:

C_comfort(t) = { 1 if PPD > PPD_thres;  PPD otherwise }   (9)

In Eq. (9), PPD_thres is the threshold of PPD, that is, the lower bound of thermal comfort that the occupant can tolerate. Note that both P_HVAC and PPD are normalized here. We set PPD_thres ∈ (0, 1), usually close to 0.

w_e and w_c are heuristically designed to weigh the importance of the two reward signals and make the reward function a linear combination of them. The weights are tunable for different optimization objectives according to the user's preference. In this paper, we use 0.4 and 0.6 (see Table 2) as an example to prove the effectiveness of our proposed method. For convenience, w_e and w_c add up to 1 in our paper.
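A direct transcription of Eqs. (7)–(9) into code may help to make the reward computation unambiguous. The following sketch assumes that P_HVAC and PPD have already been normalized into [0, 1] as stated above; the function names are introduced here for illustration only.

import math

W_E, W_C, PPD_THRES = 0.4, 0.6, 0.15  # weights and PPD threshold from Table 2

def ppd_from_pmv(pmv):
    """Eq. (7): Predicted Percentage of Dissatisfied (in %) from the Predicted Mean Vote."""
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv**4 - 0.2179 * pmv**2)

def comfort_cost(ppd_norm):
    """Eq. (9): saturate the (normalized) PPD at 1 once it exceeds the tolerance threshold."""
    return 1.0 if ppd_norm > PPD_THRES else ppd_norm

def reward(p_hvac_norm, ppd_norm, occupied):
    """Eq. (8): energy-only penalty when the zone is empty, energy plus comfort otherwise."""
    if not occupied:
        return -W_E * p_hvac_norm
    return -W_E * p_hvac_norm - W_C * comfort_cost(ppd_norm)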
4. Methodology

4.1. ODCP

In [40], the authors propose an online parametric Dirichlet change point algorithm (ODCP). They model compositional data as samples from a family of Dirichlet distributions with parameters α(1), α(2), …, α(k). The Dirichlet distribution with parameter α(r), r = 1, …, k, generates the sample data x_i with τ_{r−1} < i ≤ τ_r.

Based on this data model, ODCP detects a single change point within a hypothesis-testing framework. For an active observation window {x_1, …, x_t}, the null hypothesis and the alternative hypothesis are as follows:

• H_0: the window is generated by a single Dirichlet distribution.
• H_τ: the window is generated by two Dirichlet distributions divided at some τ, 1 < τ < t.

Computing the log-likelihood ratio of the two hypotheses, a significance test determines whether an alternative change point should be accepted.

In ODCP, multiple change points are identified by performing a sequence of single change point detections (see Stage 2 in Fig. 5): if a change point τ is detected in an active observation window of size I, {x_1, …, x_I}, then the active observation window is reset to begin at τ; otherwise, a new observation is added to the active observation window.

With respect to real data, [40] justifies that general multivariate data can be transformed into compositional data while preserving its statistical properties. In addition, although an MDP does not produce the independent and identically distributed samples of multivariate data required by ODCP, [40] explains that it is still reasonable to apply ODCP to experience tuples. Experiments in [40] also show the efficiency of ODCP in detecting change points between alternating MDPs.
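The Stage-2 logic described above, i.e., turning a single change point detector into a multiple change point detector by resetting the active window, can be sketched as follows. The single_change_point function below is only a stand-in for the Dirichlet likelihood-ratio test of [40] (here a naive mean-shift score with a fixed threshold), and all names, thresholds, and window sizes are assumptions for illustration.

import numpy as np

def single_change_point(window, min_seg=8, threshold=3.0):
    """Stand-in single change point test: returns an index tau or None.
    A real implementation would use the Dirichlet log-likelihood ratio test of ODCP."""
    best_tau, best_score = None, threshold
    for tau in range(min_seg, len(window) - min_seg):
        left, right = window[:tau], window[tau:]
        pooled = np.std(window, axis=0).mean() + 1e-8
        score = np.linalg.norm(left.mean(axis=0) - right.mean(axis=0)) / pooled
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

def detect_change_points(samples, window_size=64):
    """Sequential multiple change point detection: reset the active window at each detection."""
    change_points, start = [], 0
    for t in range(len(samples)):
        window = samples[start:t + 1]
        if len(window) < window_size:
            continue  # keep growing the active observation window
        tau = single_change_point(np.asarray(window))
        if tau is not None:
            change_points.append(start + tau)  # global index of the detected change point
            start = start + tau                # active window is reset to begin at tau
    return change_points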
4.2. Deep Q network

Deep Q network (DQN) combines Q-learning and deep neural networks. As mentioned before, Q-learning is a value-based method. It learns a lookup table, called a Q-table, that stores the Q-values of specific state–action pairs (s, a); thus it is also called tabular Q-learning. The Q-values are updated from the experience tuple (s_t, a_t, r_t, s_{t+1}) at every iteration in which the agent interacts with the environment. The update rule is based on the Bellman equation:

Q(s, a) = Q(s, a) + α(r + γ max_{a'} Q(s', a') − Q(s, a))   (10)

where α is the learning rate and γ is the discount factor. Tabular Q-learning methods suffer from explosive memory requirements when tracking the tremendous number of state–action pairs of a large MDP. For example, in our HVAC control problem, each zone has 15 state variables (see Table 1), most of which are continuous, such as the outdoor air temperature (OAT) and the Predicted Percentage of Dissatisfied (PPD).


Fig. 5. Schematic of the non-stationary DQN method.

Thus, our RL controller faces a high-dimensional and continuous state space and would be trapped by high computational complexity and memory expense if we used tabular Q-learning. A deep neural network relieves us from these troubles and guarantees the performance through its approximation ability.

[24] proposes a two-network design for the DQN algorithm: one is called the Q network Q(s, a; θ) and the other is called the target network Q̂(s, a; θ⁻). The former selects the optimal action with respect to the state, while the latter generates the target value as a reference for the update of the former. At update iteration i, the Q network is updated by minimizing the mean-square error between Q(s, a; θ) and Q̂(s, a; θ⁻), i.e., the DQN loss (see Eq. (11)). With this loss, the parameters of the Q network are updated by a stochastic gradient descent algorithm. The parameters of the target network are slowly replaced with those of the Q network.

L_i(θ_i) = E[(r + γ max_{a'} Q̂(s', a'; θ_i⁻) − Q(s, a; θ_i))²]   (11)

Besides the deep neural network, DQN adopts an experience replay buffer [24], which allows a batch of experience tuples to be randomly sampled from the buffer for the Q-network update. The replay buffer reduces the correlation between learning experiences and thus helps the agent learn more efficiently. As for the action selection of the Q network, exploration and exploitation should be well balanced throughout the whole training process of the agent. Exploration means trying new actions, whereas exploitation prefers to employ the best actions known from learning. One strategy for balancing them is ε-greedy: with probability 1 − ε, the agent selects the action with the highest Q-value, while with probability ε, the agent selects a random action. We adopt ε-greedy in our DQN design.
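Putting the pieces of this section together, the sketch below shows a DQN agent with the two-network design, replay buffer, ε-greedy action selection, and soft target update, sized according to Table 2 (15 inputs, four hidden layers of 512 units, 4 outputs, Adam, learning rate 1e-4, γ = 0.99, β = 0.1). It is a minimal PyTorch sketch written for illustration, not the authors' code.

import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, n_state=15, n_action=4, hidden=(512, 512, 512, 512)):
        super().__init__()
        layers, last = [], n_state
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        layers.append(nn.Linear(last, n_action))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, gamma=0.99, lr=1e-4, beta=0.1, capacity=10000, batch_size=128):
        self.q, self.q_target = QNet(), QNet()
        self.q_target.load_state_dict(self.q.state_dict())
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.memory = deque(maxlen=capacity)
        self.gamma, self.beta, self.batch_size = gamma, beta, batch_size

    def act(self, state, epsilon):
        # epsilon-greedy action selection over the 4 setpoint-increment options
        if random.random() < epsilon:
            return random.randrange(4)
        with torch.no_grad():
            return int(self.q(torch.as_tensor(state, dtype=torch.float32)).argmax())

    def update(self):
        if len(self.memory) < self.batch_size:
            return
        s, a, r, s2, done = zip(*random.sample(self.memory, self.batch_size))
        s = torch.tensor(s, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64)
        r = torch.tensor(r, dtype=torch.float32)
        s2 = torch.tensor(s2, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32)
        q_sa = self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + (1 - done) * self.gamma * self.q_target(s2).max(1).values
        loss = nn.functional.mse_loss(q_sa, target)  # Eq. (11)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

    def soft_update(self):
        # theta_target <- (1 - beta) * theta_target + beta * theta
        for p_t, p in zip(self.q_target.parameters(), self.q.parameters()):
            p_t.data.mul_(1 - self.beta).add_(self.beta * p.data)

In the non-stationary DQN of Section 4.3, one such agent would be kept per detected MDP context.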
4.3. Non-stationary DQN method

Our proposed non-stationary DQN algorithm for HVAC control in non-stationary building environments is shown in Algorithm 1. Before we explain the details of the algorithm, some preliminary assumptions on Algorithm 1 must be stated.

• We assume that an MDP context in the non-stationary environment lasts long enough, i.e., at least one week. This is reasonable for collecting enough samples for ODCP to detect non-stationarity. Emergencies are not considered in this paper as they have a negligible influence on the control policy when we consider long-term optimization.
• We assume that the major factor causing the non-stationary building environment is periodic, like the local weather conditions, so an MDP context will become active again after a period. For example, if the local weather transitions follow a distinct seasonal pattern, then an MDP context suited to the current building environment will be activated again within a year. Therefore, we set the learning timesteps T (see line 1 in Algorithm 1) based on some prior knowledge of the period.

In the beginning, the Q network and the target network are initialized randomly with the same parameters, together with a memory buffer for experience storage. Then the initial tuple ⟨D, Q, Q̂⟩ for a DQN agent is added into an empty agent list, as shown in line 6.

The training process of the DQN agent then starts, which can be divided into 3 stages (see Fig. 5). Over the first M0 episodes, we apply the baseline DQN algorithm to learn a basic DQN model called the pretrained DQN (see lines 11–12), which does not consider non-stationarity in weather conditions and serves to collect experience tuples {(s_t, r_t, s_{t+1})}, t = 1, 2, …, T, for change point detection. At Stage 2, the ODCP algorithm is performed on the experience tuples collected throughout episode M0 to find the change points, as shown in lines 30–32. From these, the number of MDP contexts and their activation time intervals are known. Finally, at Stage 3, we replace L with a list of ⟨D, Q, Q̂⟩ tuples. The number of tuples in L is equal to the number of contexts, and each tuple is initialized as in lines 3–5. To make a distinction, we call each DQN in the agent list a context DQN. Over the training episodes after episode M0, a context DQN is updated only in the time interval when its corresponding MDP context is active, as shown in lines 13–15.

Here we explain the advantage of applying ODCP to experience tuples from a pretrained DQN instead of experience tuples from the very first episode.


As ODCP detects change points by likelihood estimation and a significance test, it is sensitive to noise in the sampled data. We assume that the pretrained DQN has learned a basic policy for common situations in all MDP contexts, so its experience tuples approximately follow the typical distributions sampled from the MDP contexts under the control strategy. Thus, it reduces the potential noise for change point detection. If we executed ODCP after the first episode, the confidence of the likelihood estimation might suffer from the large variation of data brought by random exploration in the early episodes.

As for the learning steps of the DQN agent, shown in lines 17–28, the Q network is updated by sampling a minibatch of experiences from the replay buffer, as mentioned in the last section, maintaining the independent and identically distributed sample assumption for the learning model. In line 28, the target network is updated K steps more slowly than the Q network with a soft update parameter β, 0 < β < 1. Unlike the common ε-greedy strategy, we decrease the ε value linearly along the whole training process, as shown in line 34.

Algorithm 1 Non-stationary DQN.
1: Set the total episode number M, the episode number M0 for pretraining, and the HVAC system running timesteps T
2: Initialize greedy parameter ε to ε0
3: Initialize replay memory D to capacity N
4: Initialize Q network Q with random weights θ
5: Initialize target network Q̂ with weights θ⁻ = θ
6: Add tuple ⟨D, Q, Q̂⟩ into agent list L
7: Initialize change point detection memory C as empty
8: for episode = 1 to M do
9:     Initialize state s_t, reward r_t, and terminal signal done
10:    for t = 1 to T do
11:        if episode ≤ M0 then
12:            Set D, Q, Q̂ as L[0]
13:        else
14:            Find the corresponding context index i with respect to t and CP
15:            Set D, Q, Q̂ as L[i]
16:        end if
17:        With probability ε select a random action a_t
18:        otherwise select a_t = arg max_a Q(s_t, a; θ)
19:        Execute a_t and get the next state s_{t+1}, reward r_t, and terminal signal done
20:        Store transition (s_t, a_t, r_t, s_{t+1}, done) in D
21:        if episode = M0 then
22:            Store experience tuple (s_t, r_t, s_{t+1}) in C
23:        end if
24:        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}, done) from D
25:        Set y_j = r_j + (1 − done) γ max_{a'} Q̂(s_{j+1}, a'; θ⁻)
26:        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to the network parameters θ
27:        Set s_t = s_{t+1}
28:        Every K episodes perform a soft update on the target network: θ⁻ = (1 − β)θ⁻ + βθ
29:    end for
30:    if episode = M0 then
31:        Perform the ODCP algorithm on C, obtaining the change point list CP and the context number, i.e., the change point number plus 1
32:        Replace L with a list of ⟨D, Q, Q̂⟩ tuples of length equal to the context number, each tuple of which is initialized as in lines 3–5
33:    end if
34:    Set ε = ε0 (1 − episode / M)
35: end for
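Lines 13–15 of Algorithm 1 require mapping the current timestep t to the context whose time interval contains it, given the change point list CP produced by ODCP. A minimal helper for this lookup, together with the linear ε schedule of line 34, might look as follows; the function names and the example change points are introduced here purely for illustration.

import bisect

def context_index(t, change_points):
    """Index of the MDP context active at timestep t, given the sorted change point list CP."""
    return bisect.bisect_right(change_points, t)

def epsilon(episode, total_episodes, eps0=0.15):
    """Linear decay of the exploration rate over the whole training run (line 34)."""
    return eps0 * (1.0 - episode / total_episodes)

# Example: with two detected change points (three contexts), timestep 3000 falls into the
# second context, so the agent tuple L[1] is used for that part of the episode.
CP = [2688, 5376]
assert context_index(3000, CP) == 1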
5.2. Implementation details
5. Case study

The effectiveness of our proposed method is demonstrated through simulations. Our proposed method is compared with the vanilla DQN and a rule-based controller. For clarity, we call them the non-stationary DQN, the baseline DQN and the rule-based controller in the following content.

5.1. Simulation building environment

We consider a reference building model designed by the US DOE [41]. It is a 463 m2 single-story building divided into 4 exterior and one interior conditioned zones and a return plenum (see Fig. 6 for the 3D view and Fig. 7 for the plan view). It is equipped with electric equipment and lights in the different zones. The lights, electric equipment and occupant activity produce heat inside the system.

Fig. 6. 3D view of the simulation building.

Fig. 7. Plan view of the simulation building.

The building climate is controlled by a packaged variable air volume (VAV) system with a DX cooling coil and gas heating coils. The VAV system is shown in Fig. 2. The air handling unit brings in outdoor air, which is cooled by the electric main cooling coil or heated by the electric main heating coil that contains an evaporative condenser with a pump and a basin heater. Then, the air is supplied to the different zones by the supply fan. Each zone heats the supplied air according to its thermostat setpoints. Thus, the total HVAC electric power demand is defined as the sum of the demand from all the HVAC facilities, including the main electric heating coil, the electric heating coils in Zones 1–5, the supply fan, the main electric cooling coil, the cooling coil evaporative condenser pump, and the cooling coil basin heater.

Supposing that the building is located in Allegheny County, Pittsburgh, Pennsylvania, USA, the yearly TMY3 weather profile for the local area is used for training the RL agent. TMY3, short for Typical Meteorological Year 3, is a weather data set of hourly values of solar radiation and meteorological elements for one year, which represents typical weather conditions [42].

5.2. Implementation details

We refer to the simulation framework used in [43]. This simulation framework (see Fig. 8) consists of the control methods implemented in Python scripts and the simulation environment, which wraps the EnergyPlus [30] model in an OpenAI Gym interface [44].
8
X. Deng et al. Building and Environment 211 (2022) 108680

Fig. 8. Simulation framework.

The simulation environment includes EnergyPlus and BCVTB (the Building Control Virtual Test Bed). BCVTB serves as a connector that allows different simulation programs to run a co-simulation [45].

In a simulation run, an EnergyPlus instance is created using the input definition file and the data exchange file. The input definition file defines the building physics, the HVAC system and the simulation settings. The data exchange file defines the observation variables and the control variables. In the data-exchange stage, variables can be read and written by Python scripts for control. The processor converts the raw observations into state and reward information for the controllers. Every simulation run starts at 0:00, January 1st and ends at 0:00, April 1st in a non-leap year, that is, 90 days. The granularity of the EnergyPlus simulation is 15 min. For RL agent training, a simulation run is called a training episode.

In addition to the whole simulation framework and simulation settings, we also present the detailed design of the neural network and the related hyperparameters (see Table 2). Both the neural network design and the hyperparameters are the same for the non-stationary DQN and the baseline DQN.

For the single-zone problem, the number of training episodes is 400. When the non-stationary DQN agent finishes the 100th training episode, i.e., M0 = 100 (see Algorithm 1), we apply ODCP to the experience tuples throughout that episode. The minimum length of the active observation window for a single change point detection is 1344, that is, two weeks, as we assume that there is at most one change point in any window of two weeks for our building environment. For the multi-zone problem, the number of training episodes is 250. When the agent finishes the 50th training episode, i.e., M0 = 50, we apply ODCP to the experience tuples.
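The control side of such a co-simulation framework typically interacts with the wrapped EnergyPlus model through the standard Gym reset/step interface. The sketch below shows one training episode in that style, reusing the DQNAgent sketch from Section 4.2; the environment object env and its interface are placeholders assumed here, since the actual wrapper comes from [43].

# One training episode against a Gym-style EnergyPlus wrapper (names are placeholders).
# A 90-day run at a 15-minute timestep gives 8640 control steps per episode.

def run_episode(env, agent, eps):
    state = env.reset()          # starts the EnergyPlus/BCVTB co-simulation run
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state, eps)                      # epsilon-greedy setpoint increment
        next_state, reward, done, info = env.step(action)   # advances the simulation by 15 min
        agent.memory.append((state, action, reward, next_state, float(done)))
        agent.update()                                      # minibatch DQN update from replay
        state = next_state
        total_reward += reward
    return total_reward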
5.3. Convergence of non-stationary DQN

As shown in Fig. 9, both curves become steady as training evolves. Notice that the convergence of the baseline DQN is faster than that of the non-stationary DQN. The reasons lie in two aspects. First, the non-stationary DQN has to train multiple context DQNs, and there are even more DQNs in the multi-zone problem. In each training episode, the context DQNs are updated separately with the experiences from their activation time intervals; therefore, they need more episodes to collect enough experience for learning. Second, except for the first context DQN, the start state of a context DQN is exactly the end state of the previous one, so the context DQNs have to spend more time learning control strategies for varying start states. The details of the time cost are as follows: a computer with an Intel 32-core Xeon(R) Gold 6130 CPU and an NVIDIA GTX 3090 GPU is used for training. The non-stationary DQN takes 200 episodes to converge in the single-zone problem and 175 episodes in the multi-zone problem; one training iteration takes 42 s in the single-zone problem and 77 s in the multi-zone problem. The baseline DQN takes 100 episodes to converge in the single-zone problem and also 100 episodes in the multi-zone problem; one training iteration takes 35 s in the single-zone problem and 79 s in the multi-zone problem.

Considering the episodes at the end of training, we can see that the average cumulative reward of the non-stationary DQN is higher than that of the baseline DQN. As shown in Table 3, in the single-zone problem the average cumulative reward of the non-stationary DQN is 5% higher than that of the baseline DQN, while the standard deviation is 39% smaller. In the multi-zone problem, the average cumulative reward of the non-stationary DQN is 15% higher than that of the baseline DQN, while the standard deviation is 57% smaller.

Table 3
Cumulative rewards of the two methods after convergence.

                               Non-stationary DQN   Baseline DQN
Single-zone problem   Mean     −407.40              −430.65
                      Std.     4.68                 7.63
Multi-zone problem    Mean     −2476.12             −2882.82
                      Std.     30.05                69.68

5.4. Performance in the single-zone problem

We apply the well-trained models to the training environment. As shown in Fig. 10, both controllers satisfy the occupant thermal comfort requirement, maintaining the PPD below 10%. The energy consumption curves are nearly periodic, as both controllers learn to work during the occupied hours on every weekday. Checking the remaining hours, such as the weekends when there is no occupant inside the zone, both controllers learn to hold the heating and cooling setpoints at the limits of the setpoint band, idling the coils of the VAV boxes and avoiding unnecessary energy consumption.

In Fig. 11, the baseline DQN controller conducts heating actions from 0:00 to 4:00 on March 20th (see Fig. 11(a)). The state resulting in these heating actions turns out to be similar to the state at 6:00 on February 28th (see Fig. 11(b)), which is the very moment when both the baseline DQN controller and the non-stationary controller have learned to start heating for the working hours. It seems that heating actions are appropriate for this state under the environment context of February 28th but not under the environment context of March 20th, since the baseline DQN controller has to conduct cooling actions from 0:00 to 4:00 on March 20th to correct the mistake. We call this situation 'mistake and correct' in this study; it causes unnecessary energy use and is seen frequently under the baseline DQN controller, but hardly occurs under the non-stationary controller.

In Fig. 12, the baseline DQN controller fails to raise the temperature before the working hours and causes a large short-term discomfort (see Fig. 12(a)). This is because the baseline DQN keeps the heating setpoint at the limit until 7:15, believing it is facing a state similar to that at 3:30 on February 26th (see Fig. 12(b)). The baseline DQN fails to recognize the different environment contexts of these two periods, which thus delays the heating behavior. In contrast, the non-stationary DQN learns the right timing to raise the temperature in each environment context.

Considering the two examples above, the control policy of the baseline DQN controller is not stable when faced with some states, even though the controller has been trained for a long time. In contrast, the non-stationary DQN controller can react properly to similar states from different contexts. The reason is that the baseline DQN controller gets confused when it learns from experiences of different contexts. Although some states seem similar, the rewards and next states can differ greatly, since the experiences come from contexts with distinct dynamics. Errors then occur when the Bellman equation (see Eq. (11)) is used to update the Q-values. As a result, the action selection of the baseline DQN controller is ambiguous, as it tends to be neutral towards the optimal control policies of the different contexts.


Fig. 9. Convergence of non-stationary DQN and baseline DQN. Both the average values and two standard deviation bounds are presented.

Fig. 10. Total HVAC electric demand power, PPD, heating and cooling setpoints and indoor air temperature of baseline DQN and non-stationary DQN for single-zone control.

Fig. 11. The effects of non-stationarity on energy consumption.


Fig. 12. The effects of non-stationarity on PPD.

Fig. 13. Cumulative reward, average total HVAC electric demand power (W) and average PPD (%) of the rule-based controller, the baseline DQN controller and the non-stationary
DQN controller in the training building environment.

Besides, the non-stationary DQN controller is compared with a rule-based controller, which maintains the heating setpoint at 24 °C and the cooling setpoint at 26 °C during working hours whenever the occupant number is larger than 0. When there is no occupant, it sets the heating and cooling setpoints at 15 °C and 30 °C. The results are shown in Fig. 13, and the related discussion is presented in the next subsection.

5.5. Performance in the multi-zone problem

Fig. 13 shows the cumulative reward, the average total HVAC electric demand power and the average PPD of the three controllers in the single-zone problem and the multi-zone problem. In both problems, the cumulative rewards of the non-stationary DQN and the baseline DQN are higher than that of the rule-based controller. Although the average total HVAC electric demand power of the rule-based controller is lower, its average PPD and the standard deviation of its PPD are much higher than those of the two DQN-based methods and lie outside the desired comfort range for the indoor thermal environment: occupants' thermal comfort cannot be guaranteed by the rule-based controller. Compared to the baseline DQN, in the single-zone problem the non-stationary DQN reduces the average total HVAC electric demand power by 2% and the average PPD by 9%; in the multi-zone problem, the non-stationary DQN reduces the average total HVAC electric demand power by 13% and the average PPD by around 7% for each zone. The improvement in energy saving might seem marginal for the single-zone problem, but this is because the improvement of a single zone is diluted in the total HVAC demand of all zones. Besides, the non-stationary DQN reduces the variation in occupant comfort, with a smaller standard deviation in PPD. As for the total demand, the average total HVAC electric demand power in the multi-zone problem is larger than that in the single-zone problem, because more energy is required to satisfy the thermal comfort in 5 zones.

As can be seen in Fig. 14, the 5 zones detect similar change points and take similar action series throughout the 90 days. In addition, the change points of Zone 1 in the multi-zone problem are different from those in the single-zone problem. The reason is that, besides the weather conditions, there is another non-stationary factor in the multi-zone problem, namely the actions of the other controllers, since the other zones are controlled by RL agents rather than by a fixed schedule. However, the non-stationarity of the controller actions appears to be itself highly influenced by the weather conditions (see Fig. 10 and Fig. 14). Therefore, although a new non-stationary factor is introduced, the multi-zone problem is still consistent with our assumption about periodic factors in the non-stationary DQN algorithm (see Section 4.3).

Comparing the results of the single-zone and multi-zone problems, we find that, when controlled by the non-stationary DQN controller, the energy consumption of Zone 1 in the multi-zone problem is higher than the energy consumption of Zone 1 in the single-zone problem.


Fig. 14. The heating and cooling setpoints and indoor air temperatures of the baseline DQN controller and the non-stationary DQN controller for 5 zones in the multi-zone
problem.

Fig. 15. Comparison of the non-stationary DQN controller in Zone 1 between the single-zone problem and the multi-zone problem.

Checking the actions in January, when the outdoor air temperature is low, we find that the extra energy consumption is spent on forward heating actions during the rest hours between two weekdays (as shown in Fig. 15). A forward heating action means the controller brings the heating timing forward from around 6:00, when working hours are about to begin, to around 0:00. The explanation might be that under a low outdoor air temperature, more heating time is required for all 5 zones to raise and stabilize the indoor air temperature. The forward heating actions of the non-stationary DQN controller can hardly be seen after January, as shown in Fig. 16(a). However, forward heating actions of the baseline DQN controller can still be found in February (see Fig. 16(b)). This is further evidence that the baseline DQN fails to learn optimal actions for the different environment contexts, which leads to unwanted energy waste.

5.6. Performance in areas with non-stationarity of different levels

The weather in Pittsburgh is of the temperate continental climate, which often has a significant annual variation in temperature (hot summers and cold winters). In addition, every simulation starts at 0:00, January 1st and ends at 0:00, April 1st, which covers the period when the temperature rises significantly.


Fig. 16. Comparison of the performance between non-stationary DQN and baseline DQN in the multi-zone problem.

Fig. 17. Performance of non-stationary DQN in non-stationarity of different levels.

5.6. Performance in areas with non-stationarity of different levels

The weather conditions in Pittsburgh are of the temperate continental climate, which often has a significant annual variation in temperature (hot summers and cold winters). In addition, every simulation starts at 0:00, January 1st and ends at 0:00, April 1st, which covers the period when the temperature rises significantly. Therefore, we believe that the weather conditions in Pittsburgh show typical non-stationarity in this HVAC control problem.

Nevertheless, an experiment is conducted to compare the non-stationarity of different levels caused by two kinds of weather conditions, that is, the middle latitude continental climate and the low latitude marine climate. We use the typical weather profiles of 6 cities for simulation in the single-zone control problem. Among the 6 cities, Pittsburgh, Beijing, and Dunhuang are of the middle latitude continental climate, while Shenzhen, Miami, and Bangkok are of the low latitude marine climate. To show the difference between the two climates, we take the outdoor air temperature as an example, since it has a major influence on the building thermal model. Fig. 17(a) shows the temperature evolution from January 1st to March 31st in the 6 cities. The temperature rise is significant in the cities of the middle latitude continental climate, while the temperature trend seems steady in the cities of the low latitude marine climate. This suggests that the former have more non-stationarity caused by weather conditions. Fig. 17(b) shows the convergence curves of non-stationary DQN and baseline DQN in the single-zone problem for the 6 cities. Non-stationary DQN increases the average episode reward after it converges for all 6 cities. For the cities with more obvious non-stationarity, the improvement is more significant accordingly. Therefore, non-stationary DQN works for HVAC control in areas with non-stationarity of different levels.
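As a rough, illustrative way to quantify what "non-stationarity of different levels" means here, one can fit a linear trend to the outdoor air temperature over the simulated 90 days; a steeper trend indicates stronger weather-driven drift. The sketch below uses synthetic temperature series as placeholders, since the actual comparison in Fig. 17(a) is based on the TMY weather files of the 6 cities.

import numpy as np

def temperature_trend(outdoor_temp, steps_per_day=96):
    # Least-squares temperature trend in degC per day over the whole series.
    t_days = np.arange(len(outdoor_temp)) / steps_per_day
    slope, _ = np.polyfit(t_days, outdoor_temp, 1)
    return slope

# Synthetic 90-day series at 15-min resolution, for illustration only.
t = np.arange(90 * 96) / 96.0
continental = -5.0 + 0.20 * t + 5.0 * np.sin(2 * np.pi * t)  # clear warming trend plus daily cycle
marine = 25.0 + 0.01 * t + 2.0 * np.sin(2 * np.pi * t)       # nearly flat trend plus daily cycle

print(f"continental-like trend: {temperature_trend(continental):.2f} degC/day")
print(f"marine-like trend:      {temperature_trend(marine):.2f} degC/day")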
5.7. Stability of non-stationary DQN

To prove the stability of non-stationary DQN, we conduct experiments in which a disturbance happens at random timesteps. A disturbance means that the setpoints are forced to boundary values before some timestep of the simulation. To be more concrete, we adopt 3 boundary settings:

(i) heating setpoint = 15 °C, cooling setpoint = 15 °C.
(ii) heating setpoint = 15 °C, cooling setpoint = 30 °C.
(iii) heating setpoint = 30 °C, cooling setpoint = 30 °C.

In addition, we randomly choose 4 timesteps as recovery points from the 8640 timesteps (one every 15 min over 90 days): 264, 1800, 5796 and 6766. For example, if we assign the recovery point to be 1800 and adopt boundary setting (i), the setpoints are set to 15 °C and 15 °C before recovery point 1800, while after the recovery point the setpoints are controlled by the trained RL controller.
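This disturbance protocol can be sketched as a thin wrapper around the simulation loop, as below; env and agent are Gym-style placeholders that we assume for illustration only, not the actual EnergyPlus co-simulation or the paper's implementation.

# Boundary settings (heating setpoint, cooling setpoint) in degC and the four
# recovery points out of the 8640 15-min timesteps (90 days).
BOUNDARY_SETTINGS = {"i": (15.0, 15.0), "ii": (15.0, 30.0), "iii": (30.0, 30.0)}
RECOVERY_POINTS = [264, 1800, 5796, 6766]

def run_stability_test(env, agent, boundary="i", recovery_point=1800, n_steps=8640):
    obs = env.reset()
    rewards = []
    for step in range(n_steps):
        if step < recovery_point:
            action = BOUNDARY_SETTINGS[boundary]   # forced disturbance
        else:
            action = agent.act(obs)                # trained controller resumes control
        obs, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards

class DummyEnv:
    # Minimal stand-in so the sketch runs; the real environment is the
    # EnergyPlus co-simulation used in the paper.
    def reset(self):
        return 22.0
    def step(self, action):
        heat_sp, cool_sp = action
        obs = 0.5 * (heat_sp + cool_sp)   # crude placeholder dynamics
        reward = -abs(obs - 22.0)
        return obs, reward, False, {}

class DummyAgent:
    def act(self, obs):
        return (20.0, 26.0)               # fixed comfortable band as a stand-in policy

rewards = run_stability_test(DummyEnv(), DummyAgent(), boundary="i", recovery_point=1800)
print(len(rewards), "steps simulated, final reward:", rewards[-1])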
As shown in Fig. 18, non-stationary DQN can adjust the indoor temperature to get everything back on track. Shortly after the disturbance, the state trajectories and reward trajectories return to their respective no-disturbance trajectories.


Fig. 18. State (indoor air temperature) trajectories and reward trajectories in 3 boundary settings: (i) heating setpoint = 15 °C, cooling setpoint = 15 °C; (ii) heating setpoint = 15 °C, cooling setpoint = 30 °C; (iii) heating setpoint = 30 °C, cooling setpoint = 30 °C.

Fig. 19. Cumulative reward, average total HVAC electric demand power (W) and average PPD (%) of the rule-based controller, the baseline DQN controller and the non-stationary
DQN controller in the test building environment.

This indicates that non-stationary DQN obtains deployment stability: first, the size of the state space and the action space in this HVAC control problem is moderate, so the RL controller can explore and learn the Q values for nearly all the states; second, the deep neural network can approximate the Q values of similar states even if they have not been explored.

5.8. Generalization of non-stationary DQN

To validate the generalization of our non-stationary DQN algorithm to an unseen environment, both the well-trained non-stationary DQN controller and the baseline DQN controller are further tested in a new building environment. Here we specify the similarities and differences between the training and test environment settings.

Considering the model assumptions in Section 2.2, applying the trained DRL controller requires a similar non-stationary MDP in the test environment. In Section 4.3, we assume the major cause of the non-stationary environment is periodic weather conditions. Thus, in the test environment we adopt the same weather profile, building model and VAV system used in training.

The major difference lies in the schedules of occupancy and electric equipment. As shown in Fig. 20, in training the occupants and equipment follow a periodic schedule: the working hours start at 8:15 and end at 21:15 on weekdays. However, it is more realistic for occupants and equipment to follow a schedule with randomness. In the test environment, we therefore test our well-trained DQN controllers with a stochastic occupancy schedule and a stochastic equipment schedule provided by [46]. The stochastic occupancy schedule is generated by the LBNL simulator [47], and the stochastic equipment schedule is based on the occupancy with added Gaussian noise [46] (see Fig. 20). All five zones in the building adopt the new schedules of occupancy and equipment.
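As a rough illustration of how a stochastic equipment schedule can be derived from an occupancy schedule plus Gaussian noise (the exact procedure of [46] is not reproduced here), consider the following sketch; the base load fraction, the noise level and the 15-min resolution are our own assumptions.

import numpy as np

def stochastic_equipment_schedule(occupancy_fraction, base_load=0.2, noise_std=0.05, seed=None):
    # Map an occupancy fraction in [0, 1] to an equipment-usage fraction:
    # an assumed always-on base load plus an occupancy-driven part, perturbed
    # by Gaussian noise and clipped back to [0, 1].
    rng = np.random.default_rng(seed)
    occ = np.asarray(occupancy_fraction, dtype=float)
    schedule = base_load + (1.0 - base_load) * occ
    schedule = schedule + rng.normal(0.0, noise_std, size=occ.shape)
    return np.clip(schedule, 0.0, 1.0)

# Toy weekday occupancy at 15-min resolution: occupied between 8:15 and 21:15.
hours = np.arange(0, 24, 0.25)
occupancy = ((hours >= 8.25) & (hours < 21.25)).astype(float)
equipment = stochastic_equipment_schedule(occupancy, seed=0)
print(equipment[:4])      # night-time fractions, close to the base load
print(equipment[40:44])   # working-hours fractions, close to 1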
The same 90-day simulation is applied in the test case, and the test results are shown in Fig. 19. Similar to the results in Fig. 13, the average total HVAC electric demand power and the average PPD of the non-stationary DQN controller are lower than those of the baseline DQN controller in both the single-zone and multi-zone problems. Therefore, when faced with an unseen stochastic schedule, non-stationary DQN still exceeds baseline DQN after the offline training with a fixed schedule.


Fig. 20. The occupancy schedule and equipment schedule for the training environment and the test environment.

6. Conclusion

In this study, a novel HVAC control method combining active building environment change detection and a deep Q network is proposed to minimize HVAC energy consumption while maintaining occupant thermal comfort. This non-stationary DQN method identifies the environment contexts by detecting the change points of the building environment and learns effective control strategies for each context. The case study results prove the effectiveness of the proposed method: (i) compared to the state-of-the-art DQN method, our method ends training with a higher episode cumulative reward; (ii) it can save extra HVAC energy consumption by up to 13% and reduce comfort violation by up to 9%; (iii) it performs well in both single-zone and multi-zone HVAC control problems; and (iv) it obtains stability against disturbance and generalization to unseen building environments.

In future work, we aim to improve the method with consideration of more complicated situations and make it more applicable in the real world. First, the trade-off between energy savings and comfort satisfaction should be further studied by varying the weights of the energy consumption and the thermal comfort. Second, different from the simulation environment, where the states can be acquired easily by input or calculation, real building deployment requires an efficient design of the sensor system and the ability to reduce potential errors in the sensor data. Third, to further enhance the learning efficiency, distributed learning can be investigated within the framework, especially in scenarios with large and complex HVAC systems. Moreover, in district operation management involving multiple buildings, the RL agents should learn to make group decisions to achieve complicated objectives, for example, electric power peak load shifting. By exploring these directions, the non-stationary control method will become more robust and efficient for real-world deployment.

CRediT authorship contribution statement

Xiangtian Deng: Investigation, Methodology, Software, Writing – original draft. Yi Zhang: Resources, Supervision. Yi Zhang: Conceptualization, Methodology, Writing – review & editing. He Qi: Project administration, Funding acquisition, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Grant No. 51838007), Shenzhen Science and Technology Program, China (Grant No. KQTD20170810150821146) and Scientific Research Funds of Tsinghua Shenzhen International Graduate School (Grant No. QD2021007N).

References

[1] A. Costa, M.M. Keane, J.I. Torrens, E. Corry, Building operation and energy performance: Monitoring, analysis and optimisation toolkit, Appl. Energy 101 (2013) 310–316, http://dx.doi.org/10.1016/j.apenergy.2011.10.037, http://www.sciencedirect.com/science/article/pii/S030626191100691X, Sustainable Development of Energy, Water and Environment Systems.
[2] L. Pérez-Lombard, J. Ortiz, C. Pout, A review on buildings energy consumption information, Energy Build. 40 (2008) 394–398, http://dx.doi.org/10.1016/j.enbuild.2007.03.007, http://www.sciencedirect.com/science/article/pii/S0378778807001016.
[3] Y. Chen, H. Tan, Short-term prediction of electric demand in building sector via hybrid support vector regression, Appl. Energy 204 (2017) 1363–1374, http://dx.doi.org/10.1016/j.apenergy.2017.03.070, https://www.sciencedirect.com/science/article/pii/S0306261917303082.
[4] F. Jalaei, G. Guest, A. Gaur, J. Zhang, Exploring the effects that a non-stationary climate and dynamic electricity grid mix has on whole building life cycle assessment: A multi-city comparison, Sustainable Cities Soc. 61 (2020) 102294, http://dx.doi.org/10.1016/j.scs.2020.102294, https://www.sciencedirect.com/science/article/pii/S2210670720305151.
[5] Y. Zhou, Z. Kang, L. Zhang, C. Spanos, Causal analysis for non-stationary time series in sensor-rich smart buildings, in: 2013 IEEE International Conference on Automation Science and Engineering (CASE), 2013, pp. 593–598, http://dx.doi.org/10.1109/CoASE.2013.6654000.
[6] Y. Chen, F. Zhang, U. Berardi, Day-ahead prediction of hourly subentry energy consumption in the building sector using pattern recognition algorithms, Energy 211 (2020) 118530, http://dx.doi.org/10.1016/j.energy.2020.118530, https://www.sciencedirect.com/science/article/pii/S0360544220316388.
[7] Z. Wang, T. Hong, Reinforcement learning for building controls: The opportunities and challenges, Appl. Energy 269 (2020) 115036, http://dx.doi.org/10.1016/j.apenergy.2020.115036, http://www.sciencedirect.com/science/article/pii/S0306261920305481.
[8] S. Wang, Z. Ma, Supervisory and optimal control of building HVAC systems: A review, HVAC&R Res. 14 (2008) 3–32, http://dx.doi.org/10.1080/10789669.2008.10390991, https://www.tandfonline.com/doi/abs/10.1080/10789669.2008.10390991.
[9] J. Liu, W.-J. Cai, G.-Q. Zhang, Design and application of handheld auto-tuning PID instrument used in HVAC, in: 2009 4th IEEE Conference on Industrial Electronics and Applications, 2009, pp. 1695–1698, http://dx.doi.org/10.1109/ICIEA.2009.5138484.
[10] J. Wang, C. Zhang, Y. Jing, Application of an intelligent PID control in heating ventilating and air-conditioning system, in: 2008 7th World Congress on Intelligent Control and Automation, 2008, pp. 4371–4376, http://dx.doi.org/10.1109/WCICA.2008.4593624.
[11] G. Geng, G.M. Geary, On performance and tuning of PID controllers in HVAC systems, in: Proceedings of IEEE International Conference on Control and Applications, vol. 2, 1993, pp. 819–824, http://dx.doi.org/10.1109/CCA.1993.348229.
[12] C. Bae, C. Chun, Research on seasonal indoor thermal environment and residents' control behavior of cooling and heating systems in Korea, Build. Environ. 44 (2009) 2300–2307, http://dx.doi.org/10.1016/j.buildenv.2009.04.003, https://www.sciencedirect.com/science/article/pii/S0360132309000973, Special Issue for 2008 International Conference on Building Energy and Environment (COBEE).
[13] W.-T. Li, S.R. Gubba, W. Tushar, C. Yuen, N.U. Hassan, H.V. Poor, K.L. Wood, C.-K. Wen, Data driven electricity management for residential air conditioning systems: An experimental approach, IEEE Trans. Emerg. Top. Comput. 7 (2019) 380–391, http://dx.doi.org/10.1109/TETC.2017.2655362.
[14] Y. Zhou, C. Lork, W.-T. Li, C. Yuen, Y.M. Keow, Benchmarking air-conditioning energy performance of residential rooms based on regression and clustering techniques, Appl. Energy 253 (2019) 113548, http://dx.doi.org/10.1016/j.apenergy.2019.113548, https://www.sciencedirect.com/science/article/pii/S030626191931222X.
[15] Y. Ma, F. Borrelli, B. Hencey, B. Coffey, S. Bengea, P. Haves, Model predictive control for the operation of building cooling systems, IEEE Trans. Control Syst. Technol. 20 (2012) 796–803, http://dx.doi.org/10.1109/TCST.2011.2124461.
[16] M. Maasoumy, M. Razmara, M. Shahbakhti, A.S. Vincentelli, Handling model uncertainty in model predictive control for energy efficient buildings, Energy Build. 77 (2014) 377–392, http://dx.doi.org/10.1016/j.enbuild.2014.03.057, http://www.sciencedirect.com/science/article/pii/S0378778814002771.


[17] B. Rajasekhar, W. Tushar, C. Lork, Y. Zhou, C. Yuen, N.M. Pindoriya, K.L. Wood, A survey of computational intelligence techniques for air-conditioners energy management, IEEE Trans. Emerg. Top. Comput. Intell. 4 (2020) 555–570, http://dx.doi.org/10.1109/TETCI.2020.2991728.
[18] A. Afram, F. Janabi-Sharifi, Theory and applications of HVAC control systems – a review of model predictive control (MPC), Build. Environ. 72 (2014) 343–355, http://dx.doi.org/10.1016/j.buildenv.2013.11.016, https://www.sciencedirect.com/science/article/pii/S0360132313003363.
[19] Y. Yao, D.K. Shekhar, State of the art review on model predictive control (MPC) in heating ventilation and air-conditioning (HVAC) field, Build. Environ. 200 (2021) 107952, http://dx.doi.org/10.1016/j.buildenv.2021.107952, https://www.sciencedirect.com/science/article/pii/S0360132321003565.
[20] Y. Ma, J. Matuško, F. Borrelli, Stochastic model predictive control for building HVAC systems: Complexity and conservatism, IEEE Trans. Control Syst. Technol. 23 (2015) 101–116, http://dx.doi.org/10.1109/TCST.2014.2313736.
[21] A. Parisio, D. Varagnolo, M. Molinari, G. Pattarello, L. Fabietti, K.H. Johansson, Implementation of a scenario-based MPC for HVAC systems: an experimental case study, IFAC Proc. Vol. 47 (2014) 599–605, http://dx.doi.org/10.3182/20140824-6-ZA-1003.02629, https://www.sciencedirect.com/science/article/pii/S1474667016416800, 19th IFAC World Congress.
[22] X. Ding, W. Du, A.E. Cerpa, MB2C: Model-based deep reinforcement learning for multi-zone building control, in: Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 50–59, http://dx.doi.org/10.1145/3408308.3427986.
[23] E. Gibney, Google AI algorithm masters ancient game of Go, Nature (2016).
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533, http://dx.doi.org/10.1038/nature14236.
[25] Z. Zou, X. Yu, S. Ergan, Towards optimal control of air handling units using deep reinforcement learning and recurrent neural network, Build. Environ. 168 (2020) 106535, http://dx.doi.org/10.1016/j.buildenv.2019.106535, https://www.sciencedirect.com/science/article/pii/S0360132319307474.
[26] C. Lork, W.-T. Li, Y. Qin, Y. Zhou, C. Yuen, W. Tushar, T.K. Saha, An uncertainty-aware deep reinforcement learning framework for residential air conditioning energy management, Appl. Energy 276 (2020) 115426, http://dx.doi.org/10.1016/j.apenergy.2020.115426, https://www.sciencedirect.com/science/article/pii/S0306261920309387.
[27] A. Naug, M. Quiñones Grueiro, G. Biswas, A relearning approach to reinforcement learning for control of smart buildings, 2020, arXiv:2008.01879.
[28] M. Biemann, F. Scheller, X. Liu, L. Huang, Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control, Appl. Energy 298 (2021) 117164, http://dx.doi.org/10.1016/j.apenergy.2021.117164, https://www.sciencedirect.com/science/article/pii/S0306261921005961.
[29] T. Wei, Y. Wang, Q. Zhu, Deep reinforcement learning for building HVAC control, in: 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6, http://dx.doi.org/10.1145/3061639.3062224.
[30] EnergyPlus, 2020, URL https://energyplus.net/.
[31] Z. Zhang, A. Chong, Y. Pan, C. Zhang, K.P. Lam, Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning, Energy Build. 199 (2021) 472–490, http://dx.doi.org/10.1016/j.enbuild.2019.07.029.
[32] L. Yu, Y. Sun, Z. Xu, C. Shen, D. Yue, T. Jiang, X. Guan, Multi-agent deep reinforcement learning for HVAC control in commercial buildings, IEEE Trans. Smart Grid 12 (2021) 407–419, http://dx.doi.org/10.1109/TSG.2020.3011739.
[33] R. Sutton, A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[34] S. Padakandla, K.J. Prabuchandran, S. Bhatnagar, Reinforcement learning algorithm for non-stationary environments, Appl. Intell. 50 (2020).
[35] T. Jaksch, R. Ortner, P. Auer, Near-optimal regret bounds for reinforcement learning, J. Mach. Learn. Res. 11 (2010) 1563–1600, http://jmlr.org/papers/v11/jaksch10a.html.
[36] P. Auer, R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, in: B. Schölkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems, vol. 19, MIT Press, 2007, https://proceedings.neurips.cc/paper/2006/file/c1b70d965ca504aa751ddb62ad69c63f-Paper.pdf.
[37] B.C. da Silva, E.W. Basso, A.L.C. Bazzan, P.M. Engel, Dealing with non-stationary environments using context detection, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 217–224, http://dx.doi.org/10.1145/1143844.1143872.
[38] C.M. Bishop, Neural Networks for Pattern Recognition, 1995.
[39] P.O. Fanger, Thermal Comfort: Analysis and Applications in Environmental Engineering, Danish Technical Press, Copenhagen, Denmark, 1970.
[40] P.K.J.N. Singh, P. Dayama, V. Pandit, Change point detection for compositional multivariate data, 2019, arXiv:1901.04935.
[41] Department of Energy, Commercial reference buildings, 2020, https://www.energy.gov/eere/buildings/commercial-reference-buildings.
[42] S. Wilcox, W. Marion, Users manual for TMY3 data sets, 2008, http://dx.doi.org/10.2172/928611.
[43] Z. Zhang, K.P. Lam, Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system, in: Proceedings of the 5th Conference on Systems for Built Environments, BuildSys '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 148–157, http://dx.doi.org/10.1145/3276774.3276775.
[44] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016.
[45] Michael Wetter, Coffey Philip, Brian, Building controls virtual test bed, 2008.
[46] URL https://github.com/zhangzhizza/HVAC-RL-Control/tree/a3c/src/eplus-env, 2019.
[47] Y. Chen, T. Hong, X. Luo, An Agent-Based Stochastic Occupancy Simulator, Building Simulation, 2018.
