Safe HVAC Control Via Batch Reinforcement Learning
Hsin-Yu Liu∗, Bharathan Balaji†, Sicun Gao, Rajesh Gupta, Dezhi Hong
University of California, San Diego, La Jolla, CA, USA; †Amazon
∗Corresponding authors. †Work unrelated to Amazon.
ABSTRACT
Reinforcement learning (RL) has improved upon traditional control methods in increasing the energy efficiency of HVAC systems. However, prior works use online RL methods that require configuring complex thermal simulators to train, or use historical data-driven thermal models that can take at least 10^4 time steps to reach rule-based performance. Also, due to the distribution drift from simulators to real buildings, RL solutions are seldom deployed in the real world. On the other hand, batch RL (BRL) methods can learn from historical data and improve upon the existing policy without any interaction with real buildings or simulators during training. With the existing rule-based policy as the prior, the policies learned with batch RL are better than the existing control from the first day of deployment, with very few training steps compared with online methods.
Our algorithm incorporates a Kullback-Leibler (KL) regularization term to penalize policies that deviate far from the previous ones. We evaluate our framework on a real multi-zone, multi-floor building: it achieves a 7.2% energy reduction compared with the state-of-the-art batch RL method and a 16.7% energy reduction compared to the default rule-based control, while outperforming other BRL methods in occupants' thermal comfort.

KEYWORDS
HVAC control, Batch Reinforcement Learning, Deep Reinforcement Learning

1 INTRODUCTION
Buildings account for 28% of the global carbon emissions [56], and HVAC (heating, ventilation, and air conditioning) systems account for the majority of building energy consumption¹. Modern data-driven algorithms have the potential to improve the energy efficiency of traditional HVAC control algorithms. Here we focus on HVAC control in office buildings.
¹ https://www.eia.gov/energyexplained/use-of-energy/commercial-buildings.php

An office building is typically divided into multiple thermal zones, each of which can be controlled locally with a variable air volume unit. The control policy is based on sensor measurements (temperature, humidity, CO2, airflow, etc.) in the thermal zone. Rule-based control (RBC) is widely used to control the actuators [51], typically in conjunction with proportional-integral-derivative (PID) controllers [17, 33]. Such controls are interpretable but rely on experience and rules of thumb. It becomes challenging to develop and maintain a fine-grained RBC policy for dynamic environments. RBC is also a reactive algorithm, as it changes the control only after conditions have already changed.

We can predict thermal characteristics based on weather conditions, expected usage, and thermal insulation properties. In model-based approaches, thermal states of the building are simulated with simplified linear models, and methods such as model predictive control (MPC) [1, 4, 38, 39, 43, 65] and fuzzy control [7, 53] are used to improve upon RBC policies. However, based on heating/cooling physics, we know that thermal evolution is non-linear with respect to indoor/outdoor conditions and depends on factors such as orientation with respect to the sun, use of blinds, and wall insulation properties. Therefore, we can devise more accurate models to improve control performance further. Simulators such as EnergyPlus [10] and TRNSYS [29] have been designed to capture the thermal properties of a building, but designing and calibrating such models for a large building requires significant time and expertise. With advances in sensing technologies and machine learning, data-driven models have become popular in recent research.

Reinforcement learning (RL) methods learn via direct interaction with the environment and thus have been studied extensively [25, 62, 67]. They are categorized into model-based RL (MBRL) [15, 42] and model-free RL (MFRL) [9, 24, 70] algorithms. MFRL requires the use of a simulator such as EnergyPlus [10] or TRNSYS [29]; without offline pre-training, its tendency to take exploratory control actions can cause occupant discomfort. MBRL relies on a thermal model learned from historical data and converges faster than MFRL methods. However, even with MBRL, the initial policy is worse than the existing control policy, and it can take weeks or months to improve and converge [16]. By contrast, batch RL can learn directly from historical data. Previous studies have shown that BRL methods can improve on existing policies [20] by exploiting the behavioral policy to identify actions that maximize the reward over an episode with TD-error updates (Q-learning), while ensuring that the chosen actions do not deviate too far from the existing policy so that the value estimation is more accurate. Typically, there is a hyperparameter that decides the trade-off between Q-learning and behavior cloning. Therefore, batch RL is a more efficient method for deployment when historical data is available. To the best of our knowledge, BRL methods have not been explored for HVAC control.

We design a BRL-based solution that effectively learns from available historical data without requiring the use of a simulator or explicit modeling of the HVAC system. Our framework guarantees safe system operations by avoiding random setpoint exploration that could damage the equipment and/or make occupants uncomfortable.

Our main contributions are summarized as follows:
• We propose and develop our framework, a simulation-free control algorithm for energy reduction and thermal comfort co-optimization. Our framework learns from existing historical data only, without requiring a simulator or complex modeling of the space.
• Our method, Batch Constrained Munchausen deep Q-learning, outperforms state-of-the-art BRL methods by penalizing policies that deviate too far from the previous policy. It outperforms existing controls from the first day of deployment.
• We compare our framework with several state-of-the-art BRL methods. Our framework reduces energy consumption by 16.7% compared to the default control, a 7.2% improvement over the state-of-the-art, while maintaining thermal comfort during the entire evaluation period.

2 BACKGROUND AND RELATED WORK
To the best of our knowledge, there is no previous work studying how to co-optimize HVAC energy consumption and occupants' thermal comfort with a completely simulation-free framework deployed on a real multi-zone, multi-floor building.

2.1 Model Predictive Control
MPC methods use a model to forecast the outdoor and indoor conditions and optimize for a sequence of control actions that maximizes the given objective. MPC has been studied by several prior works for HVAC control. Aswani et al. [1] use learning-based MPC to control the room temperature to optimize energy consumption. Beltran et al. [4] use occupancy prediction models derived from occupancy data traces and minimize energy consumption while staying within the comfort bounds of the occupants. Maasoumy et al. [39] propose a model-based hierarchical control strategy that balances comfort and energy consumption; they linearize their thermal dynamics model around its operating point and use an LQR supervisory controller that selects the optimal setpoints for the lower-level PID controllers. Privara et al. [43] interconnect building simulation software and traditional identification methods to avoid the statistical problems with data gathered from the real building. Winkler et al. [65] develop a data-driven gray-box model whose parameters are learned from building operational data; together with weather forecast information, this data is fed into their framework to minimize energy costs while satisfying user comfort constraints.

Overall, the known issues of MPC are that it requires an accurate dynamic model, makes convexity assumptions, and incurs a high computation cost for each control decision [3]. RL solutions have been shown to overcome these limitations and outperform MPC methods [41], and the computation cost of a control decision is low as it only requires a neural network inference.

2.2 Reinforcement Learning
2.2.1 Online RL Methods. Online RL methods have been studied extensively for HVAC control [25, 62, 67]. Zhang et al. [69] jointly optimize visual comfort, thermal comfort, and energy consumption by training for ∼12K days in a simulator. OCTOPUS [15] co-optimizes HVAC, lighting, blinds, and window systems and needs ∼3.5K days of training. Valladares et al. [57] co-optimize thermal comfort and indoor air quality, requiring ∼3K days of training. Nagarathinam et al. [41] train a multi-agent policy that takes water-side chiller control into account, reducing convergence time to 2 years (∼700 days) using domain knowledge-based pruning. DeepComfort [24] uses DDPG [35] to co-optimize thermal comfort and energy consumption, with 10^4 hours (∼417 days) of training. MB2C [16] compares MBRL and MFRL methods with multi-zone control and shows that at least 10^4 15-minute time steps (∼100 days) are needed to converge. Zhang et al. [68] train in an online fashion to control airflow and temperature; they also take ∼100 days to converge.

All prior works need a simulator or a data-driven model to predict the thermal dynamics. Zhang et al. [70] use A3C [40] on a real building deployment with a model pre-trained in a simulator. HVACLearn [42] proposes an RL-based occupant-centric controller (OCC) for thermostats using tabular Q-learning with the EnergyPlus simulator. Raman et al. [44] implement Zap-Q [14] with ε-greedy exploration and compare the model with MPC methods using EnergyPlus. Lu et al. [37] compare on-policy and off-policy RL models on simulated air-conditioned buildings with data-driven models.

Online RL methods, either model-free or model-based, rely on exploration of the state-action space to improve the control policy. Model-free approaches are particularly data-inefficient (months to years of convergence time) and therefore require the use of a simulation model to learn a policy that can be practically deployed. But deploying such policies to a real building requires careful calibration of the simulation model, which is prohibitively time-consuming and expensive. Model-based methods are comparably data-efficient and can use a thermal dynamics model trained with historical data. However, even these methods require weeks to months of real-world interaction for convergence. The initial control policy performance is considerably worse than the existing rule-based policy [16, 41], which becomes a large impediment to adoption. To set up an EnergyPlus model, we need building-specific information, such as the materials used to construct the building, which requires consulting blueprints. Even after modeling with such details, a separate calibration step is required to ensure the accuracy of the model. For our reward function model, in contrast, we use standard heat transfer equations and sensor data already available from the building management system. The reward function can be reused in a new building, whereas EnergyPlus would require redoing the work. Without a model to simulate airflow, we use the readings and setpoints from the building management system. These are standard data points available in modern buildings, and our method can be reused as is in other buildings.

2.2.2 Offline RL Methods. Offline methods have not been explored much yet in the building controls domain. GNU-RL [9] implements behavior cloning for HVAC control. In contrast to behavior cloning [52], where the agent simply learns to copy the behavioral policy with an ML model, a BRL method is able to learn from the existing data with Q-updates and compensate for the lack of diversity in the buffer by perturbing the selected action with a perturbation network. BRL maximizes the returned values by selecting a policy that improves upon the existing policy rather than imitating it.
Previously, Ruelens et al.'s works focus on electricity cost optimization [49], demand response [48], and energy efficiency of heat pumps [50] using fitted Q-iteration (FQI) [46]. Vázquez-Canteli et al. [58, 59] balance comfort and energy consumption of a heat pump using FQI. Yang et al. [66] implement batch Q-learning for low-exergy buildings. The closest work to ours is Wei et al.'s [63], where they control airflow using offline training with a modified Q-learning algorithm that clips and shrinks the reward value. Unlike our method, their experiments are done in simulators, do not control the zone temperature setpoint, and only consider temperature as a proxy for thermal comfort.

Algorithms such as FQI, batch Q-learning, and Wei et al.'s DQN heuristic are all pure off-policy algorithms. Fujimoto et al. [23] show that off-policy methods exacerbate the extrapolation error in a completely offline setting. The errors occur because the Q-network is trained on historical data, but exploratory actions yield policies that differ from the behavioral ones. They propose Batch Constrained Q-learning (BCQ) [23], which restricts selected actions to be close to those in the historical data and outperforms prior approaches. BCQ uses a Variational AutoEncoder (VAE) [28] to reconstruct the predicted actions given the current states according to the existing data.

BCQ is designed for completely offline, off-policy learning and penalizes policies that are far from the behavioral policies in the replay buffer. We build on top of the BCQ algorithm to further constrain the new policy to be close to the previous one. We enforce this constraint through the Kullback-Leibler (KL) divergence between the learned policy and the historical policy [61]. We show that our algorithm's performance is more stable than BCQ's in our real building evaluation.

We use the existing dataset as the prior experience; since its rules are made by domain experts, its behavioral policy is safe compared with a randomly initialized online policy. In this paper, we focus on the performance of the algorithm in the initial days (one week) of deployment and leave the long-term performance evaluation as future work.

3 DESIGN OF OUR FRAMEWORK

3.1 BRL-based Control Framework Setup
As shown in Fig. 1, we first obtain historical data and process them into a replay buffer containing the transition tuples. At each time step, the BRL model randomly samples a mini-batch from the replay buffer and trains the target networks with the sampled transitions. Periodically (according to eval_freq in Alg. 1), we evaluate the trained agent's policy (the select_action function in Alg. 2) on real building zones, observe the states from our system's readings, and calculate the reward. The average rewards over time are shown in Fig. 5.

We use the episodic formulation as this is the standard procedure in the BRL literature [23, 32, 36]. In our formulation, the episode ends if the predicted thermal comfort is out of the comfort range, i.e., its absolute value is larger than 0.5. Therefore, the agent is trained for an arbitrarily long episode as long as it does not impact comfort. If we used a fixed episode length such as 24 hours, the agent would optimize for that period only. We use a time step of 9 minutes because that is the data-writing period of our building management system. We choose the minimum possible time step to minimize system response time and reduce any discomfort to occupants.

We represent the agent and its environment as a Markov Decision Process (MDP) defined by a tuple M_B = (S, S′, A, P, R, γ), where A is the action space in the batch B, S is the state space, and S′ is the arriving state space, in which every s′ ∈ S′ corresponds to an s ∈ S at a certain time step t such that s = s_t and s′ = s_{t+1}. P(s′|s, a) is the transition distribution, R(s, a) is the reward function, and γ ∈ [0, 1) is the discount factor. The goal of our BRL model is to find an optimal policy π*(s) = argmax_{a s.t. (s,a)∈B} Q^π_B(s, a), which maximizes the expected accumulated discounted reward.

More specifically, we have the following:
• State: We use the following attributes for the RL process to evaluate the policy: indoor air temperature, actual supply airflow, outside air temperature, and humidity. These states include the features needed for thermal comfort estimation, s_t^TC, and those that represent the responses of actions as RL states, s_t^RL.
• Action: We control two important parameters, namely the zone air temperature setpoint (a_t^ZNT) and the actual supply airflow setpoint (a_t^Sup). Both are in continuous space, and the action spaces are normalized to the range [−1, 1].
• Environment: Real building HVAC zones across three different floors. Every room is a single HVAC zone, and all these rooms are used as lab space and work offices.
• Reward: We monitor the thermal states of the space as well as the thermal comfort index predicted by a regression model, and then make control decisions with the actions selected by the BRL model. Our reward function penalizes high HVAC energy use and discourages a large absolute value of the thermal comfort index, which indicates discomfort to occupants. Our reward function at time step t is:

R_t = −α |TC_t| − β P_t,   (1)

where α and β are the weights balancing the different objectives and can be tuned to meet specific goals, TC_t is the thermal comfort index at time t, and P_t is the HVAC power consumption at time t. We compute the P_t attributed to a thermal zone using heat transfer equations [2]. The DRL agent co-optimizes HVAC energy reduction and occupants' thermal comfort.
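As a concrete illustration, the sketch below computes the reward of Eqn. (1) and the episode-termination check from one time step of zone readings. The weight values are illustrative defaults rather than the tuned values from our experiments, and zone_power_kw stands in for the heat-transfer-based power estimate of [2].

    # Minimal sketch of Eqn. (1): R_t = -alpha * |TC_t| - beta * P_t.
    def reward(tc_index: float, zone_power_kw: float,
               alpha: float = 1.0, beta: float = 0.1) -> float:
        return -alpha * abs(tc_index) - beta * zone_power_kw

    def episode_done(tc_index: float, comfort_bound: float = 0.5) -> bool:
        # The episode terminates once the predicted PMV leaves [-0.5, 0.5].
        return abs(tc_index) > comfort_bound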
3.2 Thermal Comfort Prediction
As we need to calculate the thermal comfort level required by our reward function, we adopt the widely used predicted mean vote (PMV) [19] as our thermal comfort index. In this metric, the degrees of satisfaction range from −3 (cold) to 3 (hot), and a PMV within the range −0.5 to 0.5 is considered thermally comfortable.

We adopt the ASHRAE RP-884 thermal comfort data set [13] and train a simple gradient boosting tree (GBT) model [27] to predict thermal comfort from the current thermal states given by our building system in real time. We show the effectiveness of this simple GBT-based thermal comfort index in Fig. 2.
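A minimal sketch of this comfort model is given below. It assumes the RP-884 records have been exported to a CSV with a recorded thermal-sensation vote and a few thermal features; the file name and column names are illustrative placeholders rather than our exact preprocessing.

    # Sketch: fit a gradient boosting tree on ASHRAE RP-884 to predict the comfort index.
    import pandas as pd
    from lightgbm import LGBMRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("rp884_records.csv")                    # hypothetical export of RP-884
    features = ["air_temp", "rel_humidity", "air_velocity"]  # assumed column names
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["thermal_sensation"], test_size=0.2, random_state=0)

    comfort_model = LGBMRegressor(n_estimators=200)
    comfort_model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, comfort_model.predict(X_test)))

    # At control time, the current zone readings yield the comfort index TC_t:
    # tc_index = comfort_model.predict(current_zone_features)[0]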
3.3 Batch Reinforcement Learning for Control
We take a BRL-based method, namely batch-constrained deep Q-learning (BCQ) [23], as our foundation and make improvements on it.
Figure 1: Overview: Our batch reinforcement learning model selects actions that co-optimize thermal comfort for occupants and energy consumption of the HVAC system. (Historical batch data are used as prior knowledge in a replay buffer; the BRL model, consisting of a VAE, a perturbation model ξ_φ, two Q-networks, and a set of hyperparameters, selects actions for the real multi-zone building environment, and the RL state feedback is used to calculate the reward.)
BCQ is a pure offline, off-policy RL method that avoids the extrapolation errors induced by the incorrect value estimation of out-of-distribution actions selected outside the existing dataset.

As illustrated in Fig. 1, for each time step t, we obtain state information from the sensors in the building; this is used only to calculate the reward, not to update the models. BCQ first samples a mini-batch of data (the size of the mini-batch is set as a hyperparameter) from the entire set of historical data. Then, it trains a parametric generative model G_ω, a conditional VAE, on the batch to model the distribution by transforming an underlying latent space. The encoder E_ω1(s, a) takes a distribution of state-action pairs and outputs the mean μ and standard deviation σ of a Gaussian distribution N(μ, σ). A latent vector z sampled from the Gaussian is passed to the decoder D_ω2(s, z), which outputs an action. The loss function of the VAE consists of two parts: the reconstruction loss and the KL regularization term:

L_recon = Σ_{(s,a)∈B} (D_ω2(s, z) − a)²,  with z = μ + σ·ε,  ε ∼ N(0, 1),
L_KL = D_KL(N(μ, σ) || N(0, 1)),
L_VAE = L_recon + λ L_KL.

The VAE here aims to produce only actions that are similar to existing actions in the batch given the current state. The purpose of the perturbation model ξ_φ(s, a, Φ) is to increase the diversity of seen actions: it adjusts the value of the selected action a within the range [−Φ, Φ]. It can compensate for the lack of diversity in the batch data, at the cost of less accurate value estimation. By adjusting the hyperparameters n and Φ, it can behave similarly to behavior cloning (with n = 1 and Φ = 0) or similarly to traditional Q-learning (when n → ∞ and Φ → a_max − a_min). The perturbation model is trained to maximize the value of the perturbed action:

φ ← argmax_φ Σ_{(s,a)∈B} Q_θ1(s, a + ξ_φ(s, a, Φ)),  a ∼ G_ω(s).
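The sketch below renders the generative and perturbation components in PyTorch under assumed layer sizes and a clipped-latent decode step; it mirrors the structure described above rather than reproducing the exact architecture of [23].

    # Sketch of BCQ's conditional VAE and perturbation model (assumed layer sizes).
    import torch
    import torch.nn as nn

    class ConditionalVAE(nn.Module):
        def __init__(self, state_dim, action_dim, latent_dim=8, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 2 * latent_dim))
            self.decoder = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, action_dim), nn.Tanh())  # actions in [-1, 1]
            self.latent_dim = latent_dim

        def forward(self, state, action):
            mu, log_std = self.encoder(torch.cat([state, action], dim=1)).chunk(2, dim=1)
            std = log_std.clamp(-4, 4).exp()
            z = mu + std * torch.randn_like(std)            # reparameterization trick
            return self.decoder(torch.cat([state, z], dim=1)), mu, std

        def decode(self, state, z=None):
            # Sample a clipped latent and decode one candidate action per state.
            if z is None:
                z = torch.randn(state.size(0), self.latent_dim,
                                device=state.device).clamp(-0.5, 0.5)
            return self.decoder(torch.cat([state, z], dim=1))

    def vae_loss(recon, action, mu, std, lam=0.5):
        recon_loss = ((recon - action) ** 2).sum(dim=1).mean()
        kl = -0.5 * (1 + 2 * std.log() - mu.pow(2) - std.pow(2)).sum(dim=1).mean()
        return recon_loss + lam * kl

    class Perturbation(nn.Module):
        def __init__(self, state_dim, action_dim, phi=0.05, hidden=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, action_dim), nn.Tanh())
            self.phi = phi

        def forward(self, state, action):
            # Adjust the candidate action within [-phi, phi], then clip to the action range.
            delta = self.phi * self.net(torch.cat([state, action], dim=1))
            return (action + delta).clamp(-1.0, 1.0)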
At the core of BCQ is a pair of value-estimation Q-networks, Q_θ1(s, a) and Q_θ2(s, a); a weighted minimum of the two values is taken as the learning target y for both networks. For the actor update, n actions are first sampled from the generative model and then adjusted by the target perturbation model before being passed to the target Q-networks:

y = r + γ max_{a_i} [ λ min_{j=1,2} Q_{θ'_j}(s′, ã_i) + (1 − λ) max_{j=1,2} Q_{θ'_j}(s′, ã_i) ],   (2)

where r is the reward, γ is the discount factor, λ is the minimum weighting in double-Q learning, and θ_{j=1,2} are the weights of the two critic Q-networks.

We propose an improvement on the BCQ algorithm, called Batch Constrained Munchausen RL (BCM), that encourages the agent to keep the updated policy similar to the previous one through a regularization term in the Q-update. In other respects, BCM inherits BCQ's characteristics and acts as an intermediate state between behavior cloning and Q-learning.

The idea of the BCM algorithm is the following: we adopt the regularization term of Munchausen RL (M-RL) [61], which penalizes policies that deviate far from the previous policy via the Kullback-Leibler (KL) divergence [31, 60], KL(π1 || π2) = ⟨π1, ln π1 − ln π2⟩. M-RL uses the current policy as one of the Q-update's learning signals. The other term added in M-RL is an entropy term, H(π) = −⟨π, ln π⟩, which penalizes policies that are too far from the uniform distribution. In offline settings, this term does not help improve the Q-update, since we cannot accurately estimate a uniform policy from static data alone, and we do not encourage exploration as in the online mode of the original M-RL setting. Our problem focuses on conservative and safe policies selected exclusively from the batch with a small amount of perturbation, which helps compensate for the lack of diversity of state-action visitation in the batch distribution. The resulting target is:

y = r + α_m [τ_m ln π_θ̂(a_t | s_t)]_{l_0}^{0} + γ max_{a_i} [ λ min_{j=1,2} Q_{θ'_j}(s′, ã_i) + (1 − λ) max_{j=1,2} Q_{θ'_j}(s′, ã_i) ],   (3)

where π_θ̂ = softmax(Q_θ̂ / τ) is computed from the target Q after soft clipping in double Q-learning, α_m is the M-RL scaling parameter, τ_m is the entropy temperature parameter, and l_0 is the clipping minimum: since the log-policy term is not bounded, it can cause numerical issues if the policy becomes too close to deterministic, so we replace τ ln π(a|s) by [τ ln π(a|s)]_{l_0}^{0}, where [·]_x^y is the clipping function.

The other term added in the original M-RL algorithm is the entropy term that encourages policies to be close to the uniform distribution; we do not use it, as it is not applicable in offline settings [61]. Once we choose the action using BCM, we adjust the corresponding setpoints through a building operating system (BOS) [30, 64]. The environment reflects the real response of the applied action with a time delay d, so our framework waits for d to get the data s_t from the sensors. A PMV feature vector PMV_t is also fed into the regression model for thermal comfort prediction.
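To make the update concrete, the following PyTorch-style sketch computes the BCM target of Eqn. (3) for one mini-batch. It assumes the candidate next actions have already been sampled from the VAE and perturbed, that q1_t and q2_t are target critics mapping (state, action) batches to values, and that the log-policy term is approximated by a softmax over the Q-values of sampled candidate actions, a simplification on our part with illustrative default hyperparameters.

    # Sketch of the BCM target in Eqn. (3).
    # reward: [B, 1]; next_states_rep: [B*n, s_dim]; next_actions: [B*n, a_dim];
    # cur_q: [B, n] Q-values used to approximate the log-policy, column 0 = logged action.
    import torch

    def bcm_target(reward, q1_t, q2_t, next_states_rep, next_actions, cur_q, n,
                   gamma=0.99, lam=0.75, alpha_m=0.9, tau_m=0.03, l0=-1.0):
        q1 = q1_t(next_states_rep, next_actions)                 # [B*n, 1]
        q2 = q2_t(next_states_rep, next_actions)
        # Weighted min/max over the two critics (soft-clipped double Q) ...
        q = lam * torch.min(q1, q2) + (1.0 - lam) * torch.max(q1, q2)
        # ... then the best of the n perturbed candidates for each next state.
        q = q.reshape(-1, n).max(dim=1, keepdim=True).values     # [B, 1]
        # Munchausen bonus: clipped, scaled log-policy of the logged action.
        log_pi = torch.log_softmax(cur_q / tau_m, dim=1)         # [B, n]
        bonus = (tau_m * log_pi[:, :1]).clamp(min=l0, max=0.0)
        return reward + alpha_m * bonus + gamma * q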
Algorithm 1: HVAC control via our framework
Input: Batch data B_f for a certain floor f, time horizon T, floor set F, zone/room set Z, delayed response time d, target network update rate τ, mini-batch size b, max perturbation to selected actions Φ, number of sampled actions n, minimum weighting λ, evaluation frequency eval_freq, M-RL scaling factor α_m ∈ [0, 1], and entropy temperature parameter τ_m
Output: Reward, next state, and action selected by BCM
Initialize: HVAC environment Env, RL agent BCM
d_a = dim(a), d_s = dim(s);
for f ∈ F do
    BCM_f = BCM(d_s, d_a, γ, τ, λ, φ, α_m, τ_m);
    for z ∈ Z do
        t ← 0;
        while train_iteration < T do
            BCM_f.train(B_f, b, n);
            if t % eval_freq == 0 then
                s_t^z = Env_z.getThermalState(t);
                TC_t^z = Env_z.getPredictedTC(s_t^z);
                a_t^z = BCM_f.select_action(s_t^z);
                s_{t+1}^z, r_t^z = Env_z.step(a_t^z, s_t^z, d);
            end
            t += 1;
        end
    end
end

Algorithm 2: BCM training algorithm
Input: Batch data B_f for a certain floor f, target network update rate τ, mini-batch size N, max perturbation to selected actions Φ, number of sampled actions n, minimum weighting λ, evaluation frequency eval_freq, M-RL scaling parameter α_m, and entropy temperature parameter τ_m
Output: Updated target networks
Initialize: RL agent BCM; Q-networks Q_θ1, Q_θ2; VAE generative network G_ω = {E_ω1, D_ω2}; perturbation network ξ_φ; random parameters ω, φ, θ1, θ2; and target networks Q_θ'1, Q_θ'2, ξ_φ' with θ'1 ← θ1, θ'2 ← θ2, φ' ← φ
for t ← 0 to T do
    Sample a mini-batch of N transitions (s, a, r, s′) from B_f;
    μ, σ = E_ω1(s, a), ã = D_ω2(s, z), z ∼ N(μ, σ);
    ω ← argmin_ω Σ (a − ã)² + D_KL(N(μ, σ) || N(0, 1));
    Sample n actions: {a_i ∼ G_ω(s′)}_{i=1}^n;
    Perturb each action: {a_i = a_i + ξ_φ(s′, a_i, Φ)}_{i=1}^n;
    Set the value target y (Eqn. 3);
    θ ← argmin_θ Σ (y − Q_θ(s, a))²;
    φ ← argmax_φ Σ Q_θ1(s, a + ξ_φ(s, a, Φ)), a ∼ G_ω(s);
    Update target networks: θ'_i ← τθ_i + (1 − τ)θ'_i;
    φ' ← τφ + (1 − τ)φ';
end
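For readers who prefer code, the fragment below is a condensed Python rendering of one Algorithm 2 iteration. The network objects (vae, perturb, q1, q2 and their *_t target copies), their optimizers, the replay buffer, and the helpers vae_loss and bcm_target are assumed to be defined elsewhere (for instance, along the lines of the earlier sketches), so this is an outline of the control flow rather than a drop-in implementation.

    # Condensed sketch of one BCM training iteration (cf. Algorithm 2).
    import torch

    def soft_update(net, target, tau=0.005):
        # Polyak averaging of the target-network parameters.
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(p.data, alpha=tau)

    def train_step(buffer, vae, perturb, perturb_t, q1, q2, q1_t, q2_t,
                   vae_opt, critic_opt, actor_opt, batch_size=256, n=10):
        s, a, r, s2 = buffer.sample(batch_size)

        # 1. Fit the conditional VAE on the mini-batch.
        recon, mu, std = vae(s, a)
        vae_opt.zero_grad(); vae_loss(recon, a, mu, std).backward(); vae_opt.step()

        # 2. Build the BCM target from n perturbed candidate actions per next state.
        with torch.no_grad():
            s2_rep = s2.repeat_interleave(n, dim=0)
            cand = perturb_t(s2_rep, vae.decode(s2_rep))
            s_rep = s.repeat_interleave(n, dim=0)
            cur_q = q1_t(s_rep, perturb_t(s_rep, vae.decode(s_rep))).reshape(-1, n)
            cur_q[:, 0] = q1_t(s, a).squeeze(-1)        # column 0: the logged action
            y = bcm_target(r, q1_t, q2_t, s2_rep, cand, cur_q, n)

        # 3. Regress both critics toward the shared target.
        critic_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # 4. Train the perturbation network to maximize Q1 over generated actions.
        actor_loss = -q1(s, perturb(s, vae.decode(s))).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # 5. Soft-update every target network.
        for net, tgt in ((q1, q1_t), (q2, q2_t), (perturb, perturb_t)):
            soft_update(net, tgt)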
Figure 5: Reward comparison of various algorithms.
Figure 6: We find the top-5 most similar weeks regarding OAT to our experiment week (last figure) for evaluating energy consumption and thermal comfort.

• Pessimistic Q-Learning (PQL) [36]: While BRL yields a new policy other than those in the batch, it might visit states and actions that are outside the distribution of the batch data. In addition, function approximation with a limited number of samples leads to overly optimistic estimates. PQL thus uses pessimistic value estimates in the low-data regions in the Bellman optimality equation as well as in the evaluation back-up. It can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. PQL learns from policies that satisfy a bounded density-ratio assumption, akin to on-policy policy gradient methods. PQL improves on BCQ's architecture by adding a state-VAE to predict the arriving state given the current state-action pair, filtering on the state-action distribution μ̃(s, a) instead of μ̃(s|a). The filtration is implemented by setting a hyperparameter b as the 2nd percentile of the state-VAE Evidence Lower Bound (ELBO): if the ELBO is larger than b, the Q-update is executed; otherwise, it is not.
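The filtering step can be pictured with the small sketch below, which assumes per-transition ELBO scores from the state-VAE have already been computed; it conveys the thresholding idea rather than PQL's exact procedure.

    # Sketch of PQL-style filtration: only transitions whose state-VAE ELBO
    # clears the 2nd-percentile threshold b contribute to the Q-update.
    import numpy as np

    def q_update_mask(elbo_scores: np.ndarray, percentile: float = 2.0) -> np.ndarray:
        b = np.percentile(elbo_scores, percentile)
        return elbo_scores > b     # True -> run the Bellman backup for this sample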
4.5.2 Comparison Methodology. We run each algorithm in a single room on each floor in the same week so that the outside air temperature (OAT) is the same. For instance, in one week we run our BCM in rooms in the same stack on different floors, e.g., 2144, 3144, and 4144, while at the same time a different BRL algorithm, e.g., BEAR, runs in rooms in an adjacent stack, say 2146, 3146, and 4146. In each room, we run the algorithm for 1,000 time steps, which is about one week. To reduce performance variations, we evaluate each algorithm in three different rooms (one room from each floor: 2F, 3F, and 4F). These rooms have the same functionality (lab or office spaces) and are of roughly the same size and occupancy capacity. The entire evaluation time of all the experiments is from September 28th to October 19th, 2021.
Appendix A.1 lists the hyperparameters for each method.
4.6 Results and Analysis
4.6.1 Reward Comparison. Fig. 5 shows the evaluation results of each algorithm, where each solid line is the average reward of all runs for the same method and the semi-transparent bands represent the range of all runs for a particular algorithm. Gray dotted vertical lines indicate 00:00 of each day, and the horizontal black dotted line is the average reward in the buffer. The figure shows that BCM outperforms the other methods by providing a relatively stable learning curve.

PQL constrains the Bellman update to state-action pairs that are sufficiently covered by the conditional probability of action given state under the data-generating policy. It adds a state-VAE and a statistical filtration on top of BCQ's architecture with pessimistic value approximation, which may over-penalize a near-optimal policy that lacks sufficient visitation; however, as time evolves, PQL gradually learns better. BEAR is only guaranteed to outperform BCQ on medium-quality data sets collected from a partially trained policy, a middle ground between an optimal policy and a random policy. In our case, however, the replay buffers are closer to data generated by an expert policy, which reasonably explains the outcomes. BCQ, as an ablation version of our BCM algorithm, yields performance comparable to BCM but fails to keep a stable outcome due to the lack of a strong learning signal.

The comparison between algorithms in our experiments is distinct from the results shown in the original papers, where PQL outperforms BCQ and BEAR in two out of the three simulated environments. By contrast, on our real building HVAC system, BCQ provides a more stable and continuously improving performance than the other two BRL methods. This is because those experiments were conducted in simulation environments where data are effectively unlimited, consequences for poor actions are non-existent, and system dynamics are clean and often deterministic [18]. In real-world problems, however, systems are stochastic and non-stationary, and it is not guaranteed that these algorithms behave the same as, or similarly to, the simulated cases.

4.6.2 Energy Consumption and Thermal Comfort Comparison. Outside Air Temperature (OAT) is a key factor affecting zone temperature; therefore, it affects both the thermal comfort and the energy usage of the HVAC system. It is thus reasonable to compare energy consumption against baseline time periods whose OAT trend is most similar to the period during which these BRL methods are evaluated. To do so, we adopt Dynamic Time Warping (DTW) [5] to find historical weeks with similar OATs, as DTW is a widely used method to measure the similarity of time-series data of different lengths. In addition to considering the "shape" of the historical OAT, we also consider the mean OAT difference between our experiment time period and the historical weeks. In summary, we find historical time periods whose OAT trend is similar and whose average weekly OAT is close to that of our experiment period.
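The week-matching step can be reproduced with a few lines of NumPy. The small DTW routine below is a textbook dynamic-programming implementation rather than the specific library used in our pipeline, and the OAT inputs are assumed to be equally sampled arrays, one per candidate week.

    # Sketch: rank historical weeks by DTW distance to the experiment week's OAT,
    # breaking ties toward weeks with a similar mean OAT.
    import numpy as np

    def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
        n, m = len(x), len(y)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(x[i - 1] - y[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return float(cost[n, m])

    def top_k_similar_weeks(experiment_oat, historical_weeks, k=5):
        # historical_weeks: dict mapping a week label to its OAT array.
        scores = {
            week: (dtw_distance(experiment_oat, oat),
                   abs(np.mean(experiment_oat) - np.mean(oat)))
            for week, oat in historical_weeks.items()
        }
        return sorted(scores, key=scores.get)[:k]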
Figure 13: Room Batch vs. Floor Batch.

One might argue that using only the data from the same season as our evaluation period is more suitable because of the similar seasonal weather conditions. Thus, we use only the data from the same season as an ablation.

Fig. 11 shows that a batch of the entire year's data produces better performance than using only the same season. A narrower distribution of state-action visitation in a single season cannot update the Q-value as accurately as an entire year's data could, and incorrect Q-value estimation leads to a lower return. In summary, it is essential to ensure enough state-action visitation diversity in the batch data in order to estimate the value more accurately.

4.8 Generalization Experiments
4.8.1 In-batch/Out-of-batch Experiment. To examine the generalization of the BRL model, we test the learned policy on rooms for which no data exist in the batch. Fig. 12 shows that out-of-batch (OOB) rooms cannot select proper actions to compensate for the OAT fluctuation during the week: the reward curves follow the OAT trend periodically, with clear peaks and valleys. This is reasonable, since different zones might respond differently to the same VAV control action due to the thermal dynamics in the HVAC system and the distance from the VAV to the zones.
4.8.2 Room-specific/Floor-specific Experiment. We validate whether a room-specific policy is needed. We use room-specific batch data as our expert policy and evaluate on the same rooms. In Fig. 13 we observe that although both the floor and room models yield consistent outcomes above the average, it is better to use a room-specific buffer for a better fit of the room/zone thermal dynamics.

ACKNOWLEDGEMENT
This work was supported in part by the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

REFERENCES
[1] Anil Aswani, Neal Master, Jay Taneja, David Culler, and Claire Tomlin. 2011. Reducing transient and steady state electricity consumption in HVAC using learning-based model-predictive control. Proc. IEEE 100, 1 (2011), 240–253.
[2] Bharathan Balaji, Hidetoshi Teraoka, Rajesh Gupta, and Yuvraj Agarwal. 2013. ZonePAC: Zonal power estimation and control via HVAC metering and occupant feedback. In BuildSys. 1–8.
[3] Farinaz Behrooz, Norman Mariun, Mohammad Hamiruce Marhaban, Mohd Amran Mohd Radzi, and Abdul Rahman Ramli. 2018. Review of control techniques for HVAC systems—Nonlinearity approaches based on Fuzzy cognitive maps. Energies 11, 3 (2018), 495.
[4] Alex Beltran and Alberto E Cerpa. 2014. Optimal HVAC building control with occupancy prediction. In BuildSys. 168–171.
[5] Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In KDD Workshop, Vol. 10. Seattle, WA, USA, 359–370.
[6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).
[7] Francesco Calvino, Maria La Gennusa, Gianfranco Rizzo, and Gianluca Scaccianoce. 2004. The control of indoor thermal comfort conditions: introducing a fuzzy adaptive controller. Energy and Buildings 36, 2 (2004), 97–102.
[8] CDC. 2003. Guidelines for Environmental Infection Control in Health-Care Facilities. https://www.cdc.gov/infectioncontrol/guidelines/environmental/background/air.html.
[9] Bingqing Chen, Zicheng Cai, and Mario Bergés. 2019. Gnu-RL: A precocial reinforcement learning solution for building HVAC control using a differentiable MPC policy. In BuildSys. 316–325.
[10] Drury B Crawley, Linda K Lawrie, Frederick C Winkelmann, Walter F Buhl, Y Joe Huang, Curtis O Pedersen, Richard K Strand, Richard J Liesen, Daniel E Fisher, Michael J Witte, et al. 2001. EnergyPlus: creating a new-generation building energy simulation program. Energy and Buildings 33, 4 (2001), 319–331.
[11] Hui Dai and Bin Zhao. 2020. Association of the infection probability of COVID-19 with ventilation rates in confined spaces. In Building Simulation, Vol. 13.
[12] Megan Dawe, Paul Raftery, Jonathan Woolley, Stefano Schiavon, and Fred Bauman. 2020. Comparison of mean radiant and air temperatures in mechanically-conditioned commercial buildings from over 200,000 field and laboratory measurements. Energy and Buildings 206 (2020), 109582.
[13] Richard J De Dear. 1998. A global database of thermal comfort field experiments. ASHRAE Transactions 104 (1998), 1141.
[14] Adithya M Devraj, Ana Bušić, and Sean Meyn. 2019. Zap Q-Learning: A User's Guide. In 2019 Fifth Indian Control Conference (ICC). IEEE, 10–15.
[15] Xianzhong Ding, Wan Du, and Alberto Cerpa. 2019. OCTOPUS: Deep reinforcement learning for holistic smart building control. In BuildSys. 326–335.
[16] Xianzhong Ding, Wan Du, and Alberto E Cerpa. 2020. MB2C: Model-Based Deep Reinforcement Learning for Multi-zone Building Control. In BuildSys. 50–59.
[17] Anastasios I Dounis, M Bruant, M Santamouris, G Guarracino, and P Michel. 1996. Comparison of conventional and fuzzy control of indoor air quality in buildings. Journal of Intelligent & Fuzzy Systems 4, 2 (1996), 131–140.
[18] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019).
[19] Povl O Fanger et al. 1970. Thermal comfort. Analysis and applications in environmental engineering. (1970).
[20] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020).
[21] Xiaohan Fu, Jason Koh, Francesco Fraternali, Dezhi Hong, and Rajesh Gupta. 2020. Zonal Air Handling in Commercial Buildings. In BuildSys. 302–303.
[22] Scott Fujimoto, Herke van Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In ICML. PMLR, 1587–1596.
[23] Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In ICML. PMLR, 2052–2062.
[24] Guanyu Gao, Jie Li, and Yonggang Wen. 2020. DeepComfort: Energy-Efficient Thermal Comfort Control in Buildings via Reinforcement Learning. IEEE Internet of Things Journal 7, 9 (2020), 8472–8484.
[25] Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. 2019. A review of reinforcement learning methodologies for controlling occupant comfort in buildings. Sustainable Cities and Society 51 (2019), 101748.
[26] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. 2018. Deep reinforcement learning that matters. In AAAI.
[27] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. NIPS 30 (2017), 3146–3154.
[28] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[29] SA Klein. 1976. University of Wisconsin-Madison Solar Energy Laboratory. TRNSYS: A transient simulation program. Eng. Experiment Station (1976).
[30] Andrew Krioukov, Gabe Fierro, Nikita Kitaev, and David Culler. 2012. Building application stack (BAS). In BuildSys. 72–79.
[31] Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[32] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. 2019. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949 (2019).
[33] Geoff J Levermore. 1992. Building energy management systems. (1992).
[34] Yuguo Li, Hua Qian, Jian Hang, Xuguang Chen, Ling Hong, Peng Liang, Jiansen Li, Shenglan Xiao, Jianjian Wei, Li Liu, et al. 2020. Evidence for probable aerosol transmission of SARS-CoV-2 in a poorly ventilated restaurant. MedRxiv (2020).
[35] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[36] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. 2020. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202 (2020).
[37] Siliang Lu, Weilong Wang, Chaochao Lin, and Erica Cochran Hameen. 2019. Data-driven simulation of a thermal comfort-based temperature set-point control with ASHRAE RP884. Building and Environment 156 (2019), 137–146.
[38] Mehdi Maasoumy, Alessandro Pinto, and Alberto Sangiovanni-Vincentelli. 2011. Model-based hierarchical optimal control design for HVAC systems. In Dynamic Systems and Control Conference, Vol. 54754. 271–278.
[39] Mehdi Maasoumy, M Razmara, M Shahbakhti, and A Sangiovanni Vincentelli. 2014. Handling model uncertainty in model predictive control for energy efficient buildings. Energy and Buildings 77 (2014), 377–392.
[40] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In ICML. PMLR, 1928–1937.
[41] Srinarayana Nagarathinam, Vishnu Menon, Arunchandar Vasan, and Anand Sivasubramaniam. 2020. MARCO: Multi-Agent Reinforcement learning based COntrol of building HVAC systems. In e-Energy. 57–67.
[42] June Young Park and Zoltan Nagy. 2020. HVACLearn: A reinforcement learning based occupant-centric control for thermostat set-points. In e-Energy. 434–437.
[43] Samuel Prívara, Zdeněk Váňa, Dimitrios Gyalistras, Jiří Cigler, Carina Sagerschnig, Manfred Morari, and Lukáš Ferkl. 2011. Modeling and identification of a large multi-zone office building. In 2011 IEEE International Conference on Control Applications (CCA). IEEE, 55–60.
[44] Naren Srivaths Raman, Adithya M Devraj, Prabir Barooah, and Sean P Meyn. 2020. Reinforcement learning for control of building HVAC systems. In 2020 American Control Conference (ACC). IEEE, 2326–2332.
[45] REHVA. 2021. REHVA COVID-19 Guidance v4.1. https://www.rehva.eu/fileadmin/user_upload/REHVA_COVID-19_guidance_document_V4.1_15042021.pdf.
[46] Martin Riedmiller. 2005. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In ECML. Springer, 317–328.
[47] Brian C Ross. 2014. Mutual information between discrete and continuous data sets. PLoS ONE 9, 2 (2014), e87357.
[48] Frederik Ruelens, Bert J Claessens, Stijn Vandael, Bart De Schutter, Robert Babuška, and Ronnie Belmans. 2016. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Transactions on Smart Grid 8, 5 (2016), 2149–2159.
[49] Frederik Ruelens, Bert J Claessens, Stijn Vandael, Sandro Iacovella, Pieter Vingerhoets, and Ronnie Belmans. 2014. Demand response of a heterogeneous cluster of electric water heaters using batch reinforcement learning. In 2014 Power Systems Computation Conference. IEEE, 1–7.
[50] Frederik Ruelens, Sandro Iacovella, Bert J Claessens, and Ronnie Belmans. 2015. Learning agent for a heat-pump thermostat with a set-back strategy using model-free reinforcement learning. Energies 8, 8 (2015), 8300–8318.
[51] Jyri Salpakari and Peter Lund. 2016. Optimal and rule-based control strategies for energy flexibility in buildings with PV. Applied Energy 161 (2016), 425–436.
[52] Claude Sammut. 2010. Behavioral Cloning. Springer US, 93–97.
[53] AB Shepherd and WJ Batty. 2003. Fuzzy control strategies to provide cost and energy efficient high quality indoor environments in buildings with high occupant densities. Building Services Engineering Research and Technology 24, 1 (2003).
[54] Ravid Shwartz-Ziv and Amitai Armon. 2021. Tabular Data: Deep Learning is Not All You Need. arXiv preprint arXiv:2106.03253 (2021).
[55] Muthusamy V Swami and Subrato Chandra. 1987. Procedures for calculating natural ventilation airflow rates in buildings. ASHRAE final report FSEC-CR-163-86, ASHRAE research project (1987), 130.
[56] IEA UN. 2020. Global status report for buildings and construction (2019). Available at https://www.gbpn.org/china/newsroom/2019-global-status-report-buildings-and-construction. Access date 15 (2020).
[57] William Valladares, Marco Galindo, Jorge Gutiérrez, Wu-Chieh Wu, Kuo-Kai Liao, Jen-Chung Liao, Kuang-Chin Lu, and Chi-Chuan Wang. 2019. Energy optimization associated with thermal comfort and indoor air control via a deep reinforcement learning algorithm. Building and Environment 155 (2019), 105–117.
[58] José Vázquez-Canteli, Jérôme Kämpf, and Zoltán Nagy. 2017. Balancing comfort and energy consumption of a heat pump using batch reinforcement learning with fitted Q-iteration. Energy Procedia 122 (2017), 415–420.
[59] José Vázquez-Canteli, Stepan Ulyanin, Jérôme Kämpf, and Zoltán Nagy. 2018. Adaptive multi-agent control of HVAC systems for residential demand response using batch reinforcement learning. (2018).
[60] Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. 2020. Leverage the average: an analysis of KL regularization in reinforcement learning. In NeurIPS.
[61] Nino Vieillard, Olivier Pietquin, and Matthieu Geist. 2020. Munchausen reinforcement learning. arXiv preprint arXiv:2007.14430 (2020).
[62] Zhe Wang and Tianzhen Hong. 2020. Reinforcement learning for building controls: The opportunities and challenges. Applied Energy 269 (2020), 115036.
[63] Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017. 1–6.
[64] Thomas Weng, Anthony Nwokafor, and Yuvraj Agarwal. 2013. BuildingDepot 2.0: An integrated management system for building analysis and control. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings. 1–8.
[65] Daniel A Winkler, Ashish Yadav, Claudia Chitu, and Alberto E Cerpa. 2020. OFFICE: Optimization framework for improved comfort & efficiency. In IPSN. IEEE, 265–276.
[66] Lei Yang, Zoltan Nagy, Philippe Goffin, and Arno Schlueter. 2015. Reinforcement learning for optimal control of low exergy buildings. Applied Energy 156 (2015), 577–586.
[67] Liang Yu, Shuqi Qin, Meng Zhang, Chao Shen, Tao Jiang, and Xiaohong Guan. 2020. Deep Reinforcement Learning for Smart Building Energy Management: A Survey. arXiv preprint arXiv:2008.05074 (2020).
[68] Chi Zhang, Sanmukh R Kuppannagari, Rajgopal Kannan, and Viktor K Prasanna. 2019. Building HVAC scheduling using reinforcement learning via neural network based model approximation. In BuildSys. 287–296.
[69] Tianyu Zhang, Gaby Baasch, Omid Ardakanian, and Ralph Evins. 2021. On the Joint Control of Multiple Building Systems with Reinforcement Learning. (2021).
[70] Zhiang Zhang and Khee Poh Lam. 2018. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In BuildSys.
A APPENDIX
A.1 Experiment Details
A.1.1 Parameters. For researchers to better reproduce our results, we provide the hyperparameters used in our experiments. For most of the models, we follow their default settings unless otherwise recommended. We do not fine-tune the hyperparameters of the BRL algorithms and use the reported values in the literature [23, 32, 36], and we keep the architecture of the actor-critic networks for a fair comparison. Modifying the architecture or any detail of the implementation might lead to a large difference in performance [26]. In PQL, we scale the maximum state-VAE training steps according to the ratio of PQL's MuJoCo buffer size to our building buffer size. For all the network architectures, we follow the original setups. The details of the hyperparameters are listed in Table 1.

Figure 14: An example of historical thermal comfort trends in top-5 similar OAT weeks.
The state, action, and environment setups are all the same as in our main experiments, except that the reward function at time step t is calculated with the following equation: