
2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS)

Safe HVAC Control via Batch Reinforcement Learning

Hsin-Yu Liu∗, Bharathan Balaji†, Sicun Gao, Rajesh Gupta, Dezhi Hong
University of California, San Diego, La Jolla, CA, USA; †Amazon
∗Corresponding authors. †Work unrelated to Amazon.

ABSTRACT

Buildings account for 30% of energy use worldwide, and approximately half of it is ascribed to HVAC systems. Reinforcement Learning (RL) has improved upon traditional control methods in increasing the energy efficiency of HVAC systems. However, prior works use online RL methods that either require configuring complex thermal simulators to train, or use historical data-driven thermal models that can take at least 10^4 time steps to reach rule-based performance. Also, due to the distribution drift from simulator to real buildings, RL solutions are seldom deployed in the real world. On the other hand, batch RL methods can learn from historical data and improve upon the existing policy without any interaction with real buildings or simulators during training. With the existing rule-based policy as the prior, the policies learned with batch RL are better than the existing control from the first day of deployment, with very few training steps compared with online methods.

Our algorithm incorporates a Kullback-Leibler (KL) regularization term to penalize policies that deviate far from the previous ones. We evaluate our framework on a real multi-zone, multi-floor building: it achieves a 7.2% energy reduction compared to the state-of-the-art batch RL method and a 16.7% energy reduction compared to the default rule-based control, while outperforming other BRL methods in occupants' thermal comfort.

KEYWORDS

HVAC control, Batch Reinforcement Learning, Deep Reinforcement Learning

1 INTRODUCTION

Buildings account for 28% of the global carbon emissions [56], and HVAC (heating, ventilation, and air conditioning) systems account for the majority of building energy consumption¹. Modern data-driven algorithms have the potential to improve the energy efficiency of traditional HVAC control algorithms. Here we focus on HVAC control in office buildings.

¹https://www.eia.gov/energyexplained/use-of-energy/commercial-buildings.php

An office building is typically divided into multiple thermal zones, each of which can be controlled locally with a variable air volume unit. The control policy is based on sensor measurements (temperature, humidity, CO2, airflow, etc.) in the thermal zone. Rule-based control (RBC) is widely used to drive the actuators [51], typically in conjunction with proportional-integral-derivative (PID) controllers [17, 33]. Such controls are interpretable but rely on experience and rules of thumb, and it becomes challenging to develop and maintain a fine-grained RBC policy for dynamic environments. RBC is also a reactive algorithm, as it changes the control settings based only on the current measurements. We can improve the control performance if we can forecast the thermal environment characteristics.

We can predict thermal characteristics based on weather conditions, expected usage, and thermal insulation properties. In model-based approaches, thermal states of the building are simulated with simplified linear models, and methods such as model predictive control (MPC) [1, 4, 38, 39, 43, 65] and fuzzy control [7, 53] are used to improve upon RBC policies. However, based on heating/cooling physics, we know that thermal evolution is non-linear with respect to indoor/outdoor conditions and depends on factors such as orientation with respect to the sun, use of blinds, and wall insulation properties. Therefore, we can devise more accurate models to improve control performance further. Simulators such as EnergyPlus [10] and TRNSYS [29] have been designed to capture the thermal properties of a building, but designing and calibrating such models for a large building requires significant time and expertise. With advances in sensing technologies and machine learning, data-driven models have become popular in recent research.

Reinforcement learning (RL) methods learn via direct interaction with the environment and have thus been studied extensively [25, 62, 67]. They are categorized into model-based RL (MBRL) [15, 42] and model-free RL (MFRL) [9, 24, 70] algorithms. MFRL requires the use of a simulator such as EnergyPlus [10] or TRNSYS [29]; without offline pre-training, its exploratory control actions can cause occupant discomfort. MBRL relies on a thermal model learned from historical data and converges faster than MFRL methods. However, even with MBRL, the initial policy is worse than the existing control policy, and it can take weeks or months to improve and converge [16]. By contrast, batch RL (BRL) can learn directly from historical data. Previous studies have shown that BRL methods can improve on existing policies [20] by exploiting the behavioral policy to identify actions that maximize the reward over an episode with TD-error updates (Q-learning), while ensuring that the chosen actions do not deviate too far from the existing policy so that the value estimation is more accurate. Typically, a hyperparameter decides the learning trade-off between Q-learning and behavior cloning. Therefore, batch RL is a more efficient method for deployment when historical data is available. To the best of our knowledge, BRL methods have not been explored for HVAC control.

We design a BRL-based solution that effectively learns from available historical data without requiring the use of a simulator or explicit modeling of the HVAC system. Our framework guarantees safe system operations by avoiding random setpoint exploration that could damage the equipment and/or make occupants uncomfortable.

Our main contributions are summarized as follows:


• We propose and develop our framework, a simulation-free control algorithm for energy reduction and thermal comfort co-optimization. Our framework learns from existing historical data only, without requiring a simulator or complex modeling of the space.
• Our method, Batch Constrained Munchausen deep Q-learning, outperforms state-of-the-art BRL methods by penalizing policies that deviate too far from the previous policy. It outperforms existing controls from the first day of deployment.
• We compare our framework with several state-of-the-art BRL methods. Our framework reduces energy consumption by 16.7% compared to the default control, which is a 7.2% improvement over the state-of-the-art, while maintaining thermal comfort during the entire evaluation period.

2 BACKGROUND AND RELATED WORK

To the best of our knowledge, there is no previous work studying how to co-optimize HVAC energy consumption and occupants' thermal comfort with a completely simulation-free framework deployed on a real multi-zone, multi-floor building.

2.1 Model Predictive Control

MPC methods use a model to forecast the outdoor and indoor conditions and optimize for a sequence of control actions that maximizes the given objective. MPC has been studied by several prior works for HVAC control. Aswani et al. [1] use learning-based MPC to control the room temperature to optimize energy consumption. Beltran et al. [4] use occupancy prediction models derived from occupancy data traces and minimize energy consumption while staying within the comfort bounds of the occupants. Maasoumy et al. [39] propose a model-based hierarchical control strategy that balances comfort and energy consumption: they linearize their thermal dynamics model around its operating point and use an LQR supervisory controller that selects the optimal setpoints for the lower-level PID controllers. Privara et al. [43] interconnect building simulation software and traditional identification methods to avoid the statistical problems with data gathered from the real building. Winkler et al. [65] develop a data-driven gray-box model whose parameters are learned from building operational data; together with weather forecast information, this data is fed into the framework to minimize energy costs while satisfying user comfort constraints.

Overall, the known issues of MPC are that it requires an accurate dynamic model, makes convexity assumptions, and incurs a high computation cost for each control decision [3]. RL solutions have been shown to overcome these limitations and outperform MPC methods [41], and the computation cost of an RL control decision is low as it only requires a neural network inference.

2.2 Reinforcement Learning

2.2.1 Online RL Methods. Online RL methods have been studied extensively for HVAC control [25, 62, 67]. Zhang et al. [69] jointly optimize visual comfort, thermal comfort, and energy consumption by training for ∼12K days in a simulator. OCTOPUS [15] co-optimizes HVAC, lighting, blinds, and window systems and needs ∼3.5K days of training. Valladares et al. [57] co-optimize thermal comfort and indoor air quality, requiring ∼3K days of training. Nagarathinam et al. [41] train a multi-agent policy that takes water-side chiller control into account, reducing convergence time to 2 years (∼700 days) using domain knowledge-based pruning. DeepComfort [24] uses DDPG [35] to co-optimize thermal comfort and energy consumption with 10^4 hours (∼417 days) of training. MBBC [16] compares MBRL and MFRL methods with multi-zone control and shows that at least 10^4 15-minute time steps (∼100 days) are needed to converge. Zhang et al. [68] train in an online fashion to control airflow and temperature; they also take ∼100 days to converge.

All prior works need a simulator or a data-driven model to predict the thermal dynamics. Zhang et al. [70] use A3C [40] on a real building deployment with a model pre-trained on a simulator. HVACLearn [42] proposes an RL-based occupant-centric controller (OCC) for thermostats using tabular Q-Learning with the EnergyPlus simulator. Raman et al. [44] implement Zap-Q [14] with ε-greedy exploration and compare the model with MPC methods using EnergyPlus. Lu et al. [37] compare on-policy and off-policy RL models on simulated air-conditioned buildings with data-driven models.

Online RL methods, either model-free or model-based, rely on exploration of the state-action space to improve the control policy. Model-free approaches are particularly data inefficient (months to years of convergence time) and therefore require the use of a simulation model to learn a policy that can be practically deployed. But deploying such policies to a real building requires careful calibration of the simulation model, which is prohibitively time-consuming and expensive. Model-based methods are comparably data-efficient and can use a thermal dynamics model trained with historical data. However, even these methods require weeks to months of real-world interaction for convergence. The initial control policy performance is considerably worse than the existing rule-based policy [16, 41], which becomes a large impediment to adoption. To set up an EnergyPlus model, we need building-specific information, such as the materials used to construct the building, which requires consulting blueprints. Even after modeling with such details, a separate calibration step is required to ensure the accuracy of the model. For our reward function model, in contrast, we use standard heat transfer equations and sensor data already available from the building management system. The reward function can be reused in a new building, whereas EnergyPlus would require redoing the modeling work. Without a model to simulate airflow, we use the readings and setpoints from the building management system. These are standard data points available in modern buildings, and our method can be reused as is in other buildings.

2.2.2 Offline RL Methods. Offline methods have not been explored much yet in the building controls domain. GNU-RL [9] implements behavior cloning for HVAC control. In contrast to behavior cloning [52], where the agent simply learns to copy the behavioral policy with an ML model, a BRL method is able to learn from the existing data with Q-updates and compensate for the lack of diversity in the buffer by perturbing the selected action with a perturbation network. BRL maximizes the returned values by selecting a policy that improves upon the existing policy, rather than imitating it.
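To make the distinction concrete, the sketch below contrasts the two learning signals on the same static batch: a behavior-cloning loss that regresses the logged action, and a TD (Q-learning) loss that bootstraps from the logged next state, with a single hyperparameter weighting the two. It is a minimal illustration in PyTorch; the network sizes, variable names, and the trade-off value are assumptions for this example, not the implementation of any specific method discussed here.

```python
import torch
import torch.nn as nn

# Toy actor/critic networks; dimensions are placeholders for this illustration.
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
critic_target = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
critic_target.load_state_dict(critic.state_dict())

gamma, trade_off = 0.99, 2.5  # trade_off weighs Q-maximization vs. behavior cloning

def losses(s, a, r, s_next, done):
    # Critic: TD-error update using only logged transitions (no environment interaction).
    with torch.no_grad():
        a_next = actor(s_next)
        q_next = critic_target(torch.cat([s_next, a_next], dim=1))
        target = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, target)

    # Actor: maximize Q while staying close to the behavioral (logged) actions.
    pi = actor(s)
    q_pi = critic(torch.cat([s, pi], dim=1))
    bc_loss = nn.functional.mse_loss(pi, a)           # pure behavior cloning term
    actor_loss = -trade_off * q_pi.mean() + bc_loss   # hyperparameter sets the trade-off
    return critic_loss, actor_loss
```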


Previously, Ruelens et al.'s works focus on electricity cost optimization [49], demand response [48], and the energy efficiency of a heat pump [50] using fitted Q-iteration (FQI [46]). Vázquez-Canteli et al. [58, 59] balance comfort and energy consumption of a heat pump using FQI. Yang et al. [66] implement Batch Q-Learning for low-exergy buildings. The closest work to ours is Wei et al.'s [63], where airflow is controlled using offline training with a modified Q-learning algorithm that clips and shrinks the reward value. Unlike our method, their experiments are done in simulators, do not control the zone temperature setpoint, and only consider temperature as a proxy for thermal comfort.

Algorithms such as FQI, Batch Q-learning, and Wei et al.'s DQN heuristic are all based on purely off-policy algorithms. Fujimoto et al. [23] show that off-policy methods exacerbate the extrapolation error in a completely offline setting. The errors occur because the Q-network is trained on historical data but exploratory actions yield policies that differ from the behavioral ones. They propose Batch Constrained Q-learning (BCQ) [23], which restricts selected actions to be close to those in the historical data and outperforms prior approaches. BCQ uses a Variational AutoEncoder (VAE) [28] to reconstruct the predicted actions given current states according to the existing data.

BCQ is designed for completely offline, off-policy learning to penalize policies that are far from the behavioral policies in the replay buffer. We build on top of the BCQ algorithm to further constrain the new policy to be close to the previous one. We enforce this constraint through the Kullback-Leibler (KL) divergence between the learned policy and the historical policy [61]. We show that our algorithm's performance is more stable than BCQ in our real building evaluation.

We use the existing dataset as the prior experience; since its rules are made by domain experts, its behavioral policy is safe cf. a randomly initialized online policy. In this paper, we focus on the performance of the algorithm in the initial days (one week) of deployment and leave the long-term performance evaluation as future work.

3 DESIGN OF OUR FRAMEWORK

3.1 BRL-based Control Framework Setup

As shown in Fig. 1, we first obtain historical data and process it into a replay buffer containing the transition tuples. At each time step, the BRL model randomly samples a mini-batch from the replay buffer and trains the target networks with the sampled transitions. Periodically (according to eval_freq in Alg. 1), we evaluate the trained agent's policy (the select_action function in Alg. 2) on real building zones to observe the states from our system's readings and calculate the reward. The average rewards over time are shown in Fig. 5.

We use the episodic formulation as this is the standard procedure in the BRL literature [23, 32, 36]. In our formulation, the episode ends if the predicted thermal comfort is out of the thermal comfort range, i.e., its absolute value is larger than 0.5. Therefore, the agent is trained for an arbitrarily long episode length as long as it does not impact comfort. If we used a fixed episode length such as 24 hours, the agent would optimize only for that period. We use a time step of 9 minutes because that is the data-writing period for our building management system. We choose the minimum possible time step to minimize system response time and reduce any discomfort to occupants.

We represent the agent and its environment as a Markov Decision Process (MDP) defined by a tuple $M_\mathcal{B} = (\mathcal{S}, \mathcal{S}', \mathcal{A}, P, R, \gamma)$, where $\mathcal{A}$ is the action space in the batch $\mathcal{B}$, $\mathcal{S}$ is the state space, and $\mathcal{S}'$ is the arriving state space, where every $s' \in \mathcal{S}'$ corresponds to an $s \in \mathcal{S}$ at a certain time step $t$ such that $s = s_t$, $s' = s_{t+1}$. $P(s'|s,a)$ is the transition distribution, $R(s,a)$ is the reward function, and $\gamma \in [0,1)$ is the discount factor. The goal of our BRL model is to find an optimal policy $\pi^*(s) = \operatorname{argmax}_{a\ \mathrm{s.t.}\ (s,a)\in\mathcal{B}} Q^{\pi_\mathcal{B}}(s,a)$, which maximizes the expected accumulated discounted reward.

More specifically, we have the following:
• State: We use the following attributes for the RL process to evaluate the policy: indoor air temperature, actual supply airflow, outside air temperature, and humidity. These states include the features needed for thermal comfort estimation, $s_t^{TC}$, and those that represent the responses of actions as RL states, $s_t^{RL}$.
• Action: We control two important parameters, namely the zone air temperature setpoint ($a_t^{ZNT}$) and the actual supply airflow setpoint ($a_t^{Sup}$). Both are in continuous space, and the action spaces are normalized to the range $[-1, 1]$.
• Environment: Real building HVAC zones across three different floors. Every room is a single HVAC zone, and all these rooms are used as lab space and work offices.
• Reward: We monitor the thermal states of the space as well as the thermal comfort index predicted by a regression model, and then make control decisions with the actions selected by the BRL model. Our reward function penalizes high HVAC energy use and discourages a large absolute value of the thermal comfort index, which indicates discomfort to occupants. Our reward function at time step $t$ is

$$R_t = -\alpha\,|TC_t| - \beta P_t, \tag{1}$$

where $\alpha, \beta$ are weights balancing the different objectives and can be tuned to meet specific goals, $TC_t$ is the thermal comfort index at time $t$, and $P_t$ is the HVAC power consumption at time $t$. We compute the $P_t$ attributed to a thermal zone using heat transfer equations [2]. The DRL agent co-optimizes HVAC energy reduction and occupants' thermal comfort (a minimal sketch of this reward computation is given below).
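The following sketch shows how the reward of Eq. (1) and the episode-termination test can be computed from a thermal comfort prediction and a zone power estimate. The weights α and β and the example readings are illustrative assumptions rather than the values used in the deployment.

```python
ALPHA, BETA = 1.0, 0.1  # assumed trade-off weights; the paper tunes these per objective

def reward(tc_index: float, power_kw: float) -> float:
    """Eq. (1): penalize discomfort (|TC_t|) and HVAC power use (P_t)."""
    return -ALPHA * abs(tc_index) - BETA * power_kw

def episode_terminated(tc_index: float) -> bool:
    """An episode ends when the predicted comfort leaves the [-0.5, 0.5] PMV band."""
    return abs(tc_index) > 0.5

# Example usage with hypothetical readings for one 9-minute time step.
tc_t = 0.31   # predicted thermal comfort index (PMV scale)
p_t = 4.2     # estimated zone HVAC power draw, kW
r_t = reward(tc_t, p_t)
done = episode_terminated(tc_t)
```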


𝑇𝐶" Regression Model : 𝑃𝑀𝑉"


Thermal comfort index Predict thermal comfort PMV features
with current thermal states

Replay buffer
Selected action decided by a VAE model, Real multi-zone
BRL Model: a perturbation model 𝜉! , two Q-networks, building environment
Batch data used Finding the optimized policy that maximizes returns and a set of hyperparameters
as prior knowledge
𝑠"
RL states feedback to calculate reward
Figure 1: Overview: Our batch reinforcement learning model selects actions that co-optimize thermal comfort for occupants
and energy consumption of HVAC system.
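As an illustration of this predictor, the sketch below fits a gradient-boosted regression tree to the six PMV input features and predicts a comfort index for a live reading. It uses scikit-learn's GradientBoostingRegressor for brevity (the paper's model is a LightGBM-style GBT [27]); the column names and the rp884 DataFrame are assumptions for the example.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

FEATURES = ["air_temp", "mrt", "rel_humidity", "air_velocity", "met", "clo"]  # PMV inputs

# rp884: assumed DataFrame holding ASHRAE RP-884 field records with a "pmv" label.
rp884 = pd.read_csv("rp884_subset.csv")          # hypothetical pre-processed export
X, y = rp884[FEATURES], rp884["pmv"]

gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
mse = -cross_val_score(gbt, X, y, cv=10, scoring="neg_mean_squared_error").mean()
gbt.fit(X, y)
print(f"10-fold CV MSE: {mse:.3f}")

# Predict the comfort index TC_t for one real-time reading from the building system.
reading = pd.DataFrame([[23.5, 23.5, 48.0, 0.12, 1.1, 0.5]], columns=FEATURES)
tc_t = float(gbt.predict(reading)[0])
```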

3.3 Batch Reinforcement Learning for Control

We take a BRL-based method, namely batch-constrained deep Q-learning (BCQ) [23], as our foundation and make improvements on it. BCQ is a purely offline, off-policy RL method that avoids the extrapolation errors induced by incorrect value estimation of out-of-distribution actions selected outside the existing dataset.

As illustrated in Fig. 1, at each time step $t$ we obtain state information from the sensors in the building, which is used only to calculate the reward and not to update the models. BCQ first samples a mini-batch of data (the size of the mini-batch is set as a hyperparameter) from the entire set of historical data. Then, it trains a parametric generative model $G_\omega$, a conditional VAE, on the batch to model the distribution by transforming an underlying latent space. The encoder $E_{\omega_1}(s, a)$ takes a distribution of state-action pairs and outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution $\mathcal{N}(\mu, \sigma)$. A latent vector $z$ sampled from the Gaussian is passed to the decoder $D_{\omega_2}(s, z)$, which outputs an action. The loss function of the VAE consists of two parts: the reconstruction loss and the KL regularization term.

$$\mathcal{L}_{recon} = \sum_{(s,a)\in\mathcal{B}} \big(D_{\omega_2}(s,z) - a\big)^2,\quad z = \mu + \sigma\cdot\epsilon,\ \epsilon \sim \mathcal{N}(0,1),$$
$$\mathcal{L}_{KL} = D_{KL}\big(\mathcal{N}(\mu,\sigma)\,\|\,\mathcal{N}(0,1)\big),$$
$$\mathcal{L}_{VAE} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{KL}.$$

The VAE here aims to produce only actions that are similar to existing actions in the batch given the current state. The purpose of the perturbation model $\xi_\phi(s, a, \Phi)$ is to increase the diversity of seen actions: it adjusts the value of the selected action $a$ within the range $[-\Phi, \Phi]$. It can compensate for the lack of diversity in the batch data, as a trade-off against inaccurate value estimation. By adjusting the hyperparameters $n$ and $\Phi$, it can behave similarly to behavior cloning (with $n = 1$ and $\Phi = 0$) or similarly to traditional Q-learning (when $n \to \infty$ and $\Phi \to a_{max} - a_{min}$). The perturbation network is trained to maximize the value of the perturbed action:

$$\phi \leftarrow \operatorname*{argmax}_{\phi} \sum_{(s,a)\in\mathcal{B}} Q_{\theta_1}\big(s, a + \xi_\phi(s,a,\Phi)\big),\quad a \sim G_\omega(s).$$
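A compact sketch of these two generative components is given below: a conditional VAE that reconstructs batch actions given the state, and a perturbation network bounded by Φ. The layer sizes and names are illustrative assumptions; only the structure (encoder/decoder conditioned on the state, tanh-bounded perturbation) mirrors the description above.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """G_w: reconstructs actions that resemble those stored in the batch, given s."""
    def __init__(self, state_dim, action_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_std = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh())  # actions are normalized to [-1, 1]

    def forward(self, s, a):
        h = self.encoder(torch.cat([s, a], dim=1))
        mu, std = self.mu(h), self.log_std(h).clamp(-4, 4).exp()
        z = mu + std * torch.randn_like(std)          # reparameterization trick
        recon = self.decoder(torch.cat([s, z], dim=1))
        kl = -0.5 * (1 + 2 * std.log() - mu.pow(2) - std.pow(2)).sum(dim=1).mean()
        return recon, kl

class Perturbation(nn.Module):
    """xi_phi: shifts a candidate action by at most +/- phi_max in each dimension."""
    def __init__(self, state_dim, action_dim, phi_max=0.05):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim), nn.Tanh())
        self.phi_max = phi_max

    def forward(self, s, a):
        return (a + self.phi_max * self.net(torch.cat([s, a], dim=1))).clamp(-1.0, 1.0)
```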
At the core of BCQ is a pair of value estimation Q-networks, $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$, which take a weighted minimum between the two values as the learning target $y$ for both networks. For the actor, $n$ actions are first sampled from the generative model and then adjusted by the target perturbation model before being passed to the target Q-networks for the update:

$$y = r + \gamma \max_{a_i}\Big[\lambda \min_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i) + (1-\lambda)\max_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i)\Big], \tag{2}$$

where $r$ is the reward, $\gamma$ is the discount factor, $\lambda$ is the minimum weighting in double-Q learning, and $\theta_{j=1,2}$ are the weights of the two critic Q-networks.

We propose an improvement on the BCQ algorithm, called Batch Constrained Munchausen RL (BCM), that encourages the agent to update the policy similarly to the previous one using a regularization term in the Q-update. In other respects, BCM inherits BCQ's characteristics and acts as an intermediate between behavior cloning and Q-learning.

The idea of the BCM algorithm is the following: we adopt the regularization term from Munchausen RL (M-RL) [61], which penalizes policies that deviate far from the previous policy with the Kullback-Leibler (KL) divergence [31, 60], $KL(\pi_1 \| \pi_2) = \langle \pi_1, \ln\pi_1 - \ln\pi_2 \rangle$. M-RL utilizes the current policy as one of the Q-update's learning signals. The other term added in M-RL is the entropy term, $\mathcal{H}(\pi) = -\langle \pi, \ln\pi \rangle$, which penalizes policies that are too far from the uniform distribution. In offline settings, this term does not help improve the Q-update, since we cannot accurately estimate a uniform policy if we have only static data, and we do not encourage exploration as in the online mode of the original M-RL setting. Our problem is focused on conservative and safe policies selected exclusively from the batch with a small amount of perturbation; this helps to mitigate the lack of diversity in state-action visitation within the batch distribution.

$$y = r + \alpha_m\big[\tau_m \ln \pi_{\hat\theta}(a_t|s_t)\big]_{l^0}^{0} + \gamma \max_{a_i}\Big[\lambda \min_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i) + (1-\lambda)\max_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i)\Big], \tag{3}$$

where $\pi_{\hat\theta} = \mathrm{softmax}(Q_{\hat\theta}/\tau)$ is computed from the target Q after soft clipping in double Q-learning, $\alpha_m$ is the M-RL scaling parameter, $\tau_m$ is the entropy temperature parameter, and $l^0$ is the clipping minimum: since the log-policy term is not bounded, it can cause numerical issues if the policy becomes too close to deterministic, so we replace $\tau \ln\pi(a|s)$ by $[\tau \ln\pi(a|s)]_{l^0}^{0}$, where $[\cdot]_x^y$ is the clipping function.

The other term added in the original M-RL algorithm is the entropy term, which encourages policies to be close to the uniform distribution; we do not use it as it is not applicable in offline settings [61]. Once we choose the action using BCM, we adjust the corresponding setpoints through a building operating system (BOS) [30, 64]. The environment reflects the real response to the applied action with a time delay $d$, so our framework waits for $d$ to get the data $s_t$ from the sensors. Also, a PMV feature vector $PMV_t$ is fed into the regression model for thermal comfort prediction. According to the prediction of the regression model, $TC_t$, and the RL states $s_t$, we calculate the reward using Eq. (1). We repeat this process until reaching the maximum number of time steps $T$. Details of HVAC control via the BCM algorithm are described in Algorithm 1.
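The sketch below shows one way to assemble the target of Eq. (3) for a mini-batch: the clipped, scaled log-policy bonus from the softmax of target Q-values is added to the usual BCQ soft-double-Q backup. Tensor shapes, the candidate-action indexing, and the hyperparameter values are assumptions for illustration; it is not the framework's exact implementation.

```python
import torch
import torch.nn.functional as F

def bcm_target(r, done, q1_next, q2_next, q_hat_logits, gamma=0.99, lam=0.75,
               alpha_m=0.9, tau_m=0.03, l0=-1.0):
    """
    r, done:           (N, 1) reward and terminal flags for the sampled transitions.
    q1_next, q2_next:  (N, n) target-critic values of n perturbed candidate actions at s'.
    q_hat_logits:      (N, n) target Q-values at s used to form pi_hat = softmax(Q/tau).
    """
    # Munchausen bonus: clipped, scaled log-probability of the logged action under pi_hat.
    log_pi = F.log_softmax(q_hat_logits / tau_m, dim=1)
    # Assume the logged action corresponds to candidate index 0 in this illustration.
    munchausen = (tau_m * log_pi[:, :1]).clamp(min=l0, max=0.0)

    # BCQ-style soft double-Q backup over the n candidate actions at the next state.
    soft_min = lam * torch.min(q1_next, q2_next) + (1.0 - lam) * torch.max(q1_next, q2_next)
    backup = soft_min.max(dim=1, keepdim=True).values

    return r + alpha_m * munchausen + gamma * (1.0 - done) * backup
```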


Algorithm 1: HVAC control via our framework
Input: Batch data B_f for a certain floor f, time horizon T, floor set F, zone/room set Z, delayed response time d, target network update rate τ, mini-batch size b, max perturbation to selected actions Φ, number of sampled actions n, minimum weighting λ, evaluation frequency eval_freq, M-RL scaling factor α_m ∈ [0, 1], and entropy temperature parameter τ_m
Output: Reward, next state, and action selected by BCM
Initialize: HVAC environment Env, RL agent BCM
  d_a = dim(a), d_s = dim(s);
  for f ∈ F do
    BCM_f = BCM(d_s, d_a, γ, τ, λ, φ, α_m, τ_m);
    for z ∈ Z do
      t ← 0;
      while train_iteration < T do
        BCM_f.train(B_f, b, n);
        if t % eval_freq == 0 then
          s_t^z = Env_z.getThermalState(t);
          TC_t^z = Env_z.getPredictedTC(s_t^z);
          a_t^z = BCM_f.select_action(s_t^z);
          s_{t+1}^z, r_t^z = Env_z.step(a_t^z, s_t^z, d);
        end
        t += 1;
      end
    end
  end

Algorithm 2: BCM training algorithm
Input: Batch data B_f for a certain floor f, target network update rate τ, mini-batch size N, max perturbation to selected actions Φ, number of sampled actions n, minimum weighting λ, evaluation frequency eval_freq, M-RL scaling parameter α_m, and entropy temperature parameter τ_m
Output: Updated target networks
Initialize: RL agent BCM, Q-networks Q_θ1 and Q_θ2, VAE generative network G_ω = {E_ω1, D_ω2}, perturbation network ξ_φ, random parameters ω, φ, θ1, θ2, and target networks Q_θ'1, Q_θ'2, ξ_φ' with θ'1 ← θ1, θ'2 ← θ2, φ' ← φ
  for t ← 0 to T do
    Sample a mini-batch of N transitions (s, a, r, s') from B_f;
    μ, σ = E_ω1(s, a), ã = D_ω2(s, z), z ∼ N(μ, σ);
    ω ← argmin_ω Σ (a − ã)² + D_KL(N(μ, σ) || N(0, 1));
    Sample n actions: {a_i ∼ G_ω(s')}, i = 1..n;
    Perturb each action: {a_i = a_i + ξ_φ(s', a_i, Φ)}, i = 1..n;
    Set value target y (Eq. 3);
    θ ← argmin_θ Σ (y − Q_θ(s, a))²;
    φ ← argmax_φ Σ Q_θ1(s, a + ξ_φ(s, a, Φ)), a ∼ G_ω(s);
    Update target networks: θ'_i ← τθ + (1 − τ)θ'_i;
    φ' ← τφ + (1 − τ)φ';
  end
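For completeness, the sketch below illustrates the select_action step used during evaluation (sample n candidate actions from the conditional VAE for the current state, perturb them, and return the candidate with the highest Q1 value) together with the Polyak update of the target networks. Names reuse the classes from the earlier sketch and remain illustrative assumptions, not the deployed implementation.

```python
import torch

@torch.no_grad()
def select_action(state, vae_decoder, perturb, q1, n=10, latent_dim=8):
    """Pick the highest-value candidate among n VAE-generated, perturbed actions."""
    s = state.repeat(n, 1)                 # (n, state_dim)
    z = torch.randn(n, latent_dim)         # latent samples for the decoder
    candidates = perturb(s, vae_decoder(torch.cat([s, z], dim=1)))
    scores = q1(torch.cat([s, candidates], dim=1)).squeeze(1)
    return candidates[scores.argmax()]

def polyak_update(net, target_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```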

4 EVALUATION

4.1 Data Collection and Pre-processing

The data we use from all the sensors and control points is recorded every 9 minutes via a BOS. We obtain data for an entire year, from the beginning of July 2017 to the end of June 2018, for fifteen rooms across three different floors in a building. The batch for each floor, or the buffer, contains around 200K transitions (2F: ∼260K, 3F: ∼193K, 4F: ∼172K); the counts differ because of varied system maintenance durations throughout the year. Since the rooms on the same side of a floor often share similar thermal dynamics, we create batch data for each floor to ensure that the replay buffer reflects each variable air volume (VAV) unit's thermal dynamics precisely. We set each room to its maximum occupancy, which is obtained from our campus facility information management system, and in evaluation we assume full occupancy the entire time. We could easily modify the problem formulation to take occupancy into account in both our policy and reward function: the airflow CFM (cubic feet per minute) needed is simply multiplied by the number of people in the room. However, at this moment we have no occupancy sensor data, so we assume the strictest condition of full capacity. We standardize our actions in a batch to the range of [−1, 1] as a standard procedure in the RL setup. Each action sample a_i is converted to z_i such that z_i = (a_i − μ)/s, where μ is the mean and s is the standard deviation of the batch. In the replay buffer, there are several main matrices required: action A, state S, next state S′, reward R (calculated with our thermal comfort prediction model, power consumption, and RL states), index I (which records the indices as time stamps), and episode terminal status N (which labels whether an episode is terminated—in our setting, when the predicted thermal comfort metric does not satisfy the criterion, i.e., |PMV| > 0.5, the episode is considered terminated). To summarize, the batch data is a set consisting of the above-mentioned matrices, i.e., B = {A, S, S′, R, I, N} (a sketch of this pre-processing is given at the end of this subsection).

We use Intel Xeon Gold 6230 CPUs (2.10GHz) and NVidia Quadro RTX 8000 GPUs with Ubuntu 18.04 OS for our experiments.
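A minimal sketch of this pre-processing is shown below: standardize the logged actions, compute per-step rewards and terminal flags, and assemble the floor-level buffer B = {A, S, S′, R, I, N}. The argument layout and the reward weights are assumptions carried over from the earlier examples.

```python
import numpy as np

def build_buffer(actions, states, next_states, comfort, power, timestamps,
                 alpha=1.0, beta=0.1):
    """Assemble the floor-level batch B = {A, S, S', R, I, N} from raw logs."""
    mu, sigma = actions.mean(axis=0), actions.std(axis=0) + 1e-8
    A = (actions - mu) / sigma                      # standardized logged actions
    R = -alpha * np.abs(comfort) - beta * power     # Eq. (1) per time step
    N = (np.abs(comfort) > 0.5).astype(np.float32)  # terminal when |PMV| > 0.5
    return {"A": A.astype(np.float32),
            "S": states.astype(np.float32),
            "S_next": next_states.astype(np.float32),
            "R": R.astype(np.float32),
            "I": timestamps,
            "N": N}

# Example with a toy 9-minute log of 5 transitions and 2-dimensional actions.
rng = np.random.default_rng(0)
buf = build_buffer(rng.normal(size=(5, 2)), rng.normal(size=(5, 4)),
                   rng.normal(size=(5, 4)), rng.uniform(-1, 1, size=5),
                   rng.uniform(0, 5, size=5), np.arange(5))
```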


Figure 2: Performance comparison of regression models for predicting thermal comfort based on PMV

4.2 Thermal Comfort Prediction

We compare five different regression models for predicting thermal comfort, namely Linear Regression (LR), Support Vector Regression (SVR), Bayesian Regression (BR), Deep Neural Network (DNN), and Gradient Boosting (GB) (Fig. 2). The input features of the models are zone air temperature, humidity, mean radiant temperature (MRT), air velocity, metabolic rate (Met), and clothing insulation (Clo). We set the clothing level as "typical summer indoor clothing". The metabolic rate is set as "typing", since the zones evaluated are all student lab and office spaces. There are in total 30,650 data points with complete feature information in the ASHRAE RP-884 thermal comfort data set [13] that we adopt for evaluation. All models are trained and tested with 10-fold cross-validation. Hyperparameter optimization is conducted via either grid search or Bayesian optimization. According to Fig. 2, the best model is the gradient boosting tree [27] with an MSE of 1.147, which supports our choice of a GB-based model to predict the thermal comfort index for the RL reward function. It is reasonable that the gradient boosting method outperforms its deep learning counterpart on tabular data because of selection bias and hyperparameter optimization [54]. The reported MSE metrics are averaged over 3 runs for each model.

Figure 3: Importance of features to thermal comfort via mutual information regression analysis. The features are clothing level (Clo), metabolic rate (Met), indoor air temperature (Air temp.), mean radiant temperature (MRT), relative humidity (RH), and air velocity (Air velo.).

4.3 Importance of Airflow Control

Few prior works quantitatively study the importance of airflow control in maintaining occupants' thermal comfort; almost all research focuses on temperature and humidity control. Here, we empirically analyze how airflow impacts thermal comfort based on the PMV features.

We conduct the analysis via mutual information-based regression. Between two random variables (X, Y), their dependency, which is a non-negative value, is calculated as:

$$I(X;Y) = \int_y \int_x p_{(X,Y)}(x,y)\,\log\frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)}\,dx\,dy,$$

where $p_{(X,Y)}$ is the joint probability density function of X and Y, and $p_X$, $p_Y$ are the corresponding marginal density functions. It equals zero if and only if the two random variables are independent, and higher values mean higher dependency [47]. Fig. 3 indicates that air velocity is the second most important factor after air temperature and mean radiant temperature (MRT) (here we approximate MRT with air temperature [12]). Thus, by controlling zone air temperature and airflow (air velocity can be converted to airflow rate with the room area), we control the two most important features affecting occupants' thermal comfort.
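The feature-importance analysis can be reproduced with a k-nearest-neighbor mutual information estimator [47], as sketched below; the DataFrame and column names are the same illustrative assumptions used in the earlier examples.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

FEATURES = ["clo", "met", "air_temp", "mrt", "rel_humidity", "air_velocity"]

rp884 = pd.read_csv("rp884_subset.csv")  # hypothetical pre-processed RP-884 export
X, y = rp884[FEATURES], rp884["pmv"]

# Estimate I(feature; PMV) for each feature and rank them (cf. Fig. 3).
mi = mutual_info_regression(X, y, random_state=0)
for name, score in sorted(zip(FEATURES, mi), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>13s}: {score:.3f}")
```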
Figure 4: Performance comparison with VAE simulators

4.4 Preliminary Experiments

We first investigate how BRL methods compare with online RL methods; specifically, we compare the BRL methods with the state-of-the-art online RL methods TD3 [22] and DDPG [35]. Our approach is to build a data-driven simulator environment with two VAEs: the first predicts the RL and thermal comfort states, and the second predicts the power/energy consumption. These two VAEs together function as the thermal-state simulator.

We evaluate with 200 episodes, and the evaluation frequency is every five time steps. We run each algorithm with three randomly initialized initial conditions in the range of our dataset. As shown in Fig. 4, the solid line is the average of these three runs, and the half-transparent regions indicate the range of the runs. The results show the following performance ranking: PQL > BEAR > BCQ/BCM > DDPG > TD3. While the BRL methods reach a stable state, the online RL methods TD3 and DDPG are still exploring new policies and thus yield a continuously declining performance over a short period of time. The BRL methods (details of PQL and BEAR are elaborated in Sec. 4.5), which learn exclusively from the batch, provide stable and safe policies. The reason the performance is constant is that in the simulation environment the responses of the system are deterministic, which is different from the real building environments (Fig. 5); in real building systems, the responses are stochastic.

4.5 Baseline Methods

4.5.1 State-of-the-art BRL Methods. After BCQ was proposed, several studies outperformed it in the OpenAI Gym [6] simulation environments. We implement these methods as baselines to be compared with BCM.
• Bootstrapping Error Accumulation Reduction (BEAR) [32]: BEAR identifies bootstrapping error as a key source of BRL instability; it is due to bootstrapping from actions that lie outside the training data distribution. The algorithm mitigates out-of-distribution action selection by searching over the set of policies akin to the behavior policy. BEAR's ultimate goal is to search over the set of policies Π that shares the same support (the set of values the random variable can take on) as the behavior policy. Its performance is outstanding with medium-quality static datasets (medium-quality meaning the data is generated by an agent trained for half the number of time steps of an expert RL agent/human expert, or by an agent trained to yield half the average return of the expert agent).


• Pessimistic Q-Learning (PQL) [36]: While BRL yields a new policy other than those in the batch, it might visit states and actions that are outside the distribution of the batch data. In addition, function approximation with a limited number of samples leads to overly optimistic estimates. PQL therefore uses pessimistic value estimates in the low-data regions in the Bellman optimality equation as well as in the evaluation back-up. It can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. PQL learns from policies that satisfy a bounded density-ratio assumption, akin to on-policy policy gradient methods. PQL improves on BCQ's architecture by adding a state-VAE to predict the arriving state given the current state-action pair, filtering on the state-action distribution μ̃(s, a) instead of μ̃(s|a). The filtration is implemented by setting a hyperparameter b as the 2nd percentile of the state-VAE Evidence Lower Bound (ELBO): if the ELBO is larger than b, the Q-update is executed; otherwise, it is not.

4.5.2 Comparison Methodology. We run each algorithm in a single room on each floor in the same week so that the outside air temperature (OAT) is the same. For instance, in one week we run our BCM in rooms in the same stack on different floors, e.g., 2144, 3144, and 4144, and at the same time a different BRL algorithm, e.g., BEAR, runs in rooms in a different adjacent stack, say, 2146, 3146, and 4146. In each room, we run the algorithm for 1,000 time steps, which is about one week. To reduce performance variations, we evaluate each algorithm in three different rooms (one room from each floor: 2F, 3F, and 4F). These rooms have the same functionality (lab or office spaces) and are of roughly the same size and occupancy capacity. The entire evaluation time of all the experiments is from September 28th to October 19th, 2021.
Appendix A.1 lists the hyperparameters for each method.

Figure 5: Reward comparison of various algorithms

4.6 Results and Analysis

4.6.1 Reward Comparison. Fig. 5 shows the evaluation results of each algorithm, where each solid line is the average reward of all runs for the same method, semi-transparent bands represent the range of all runs for a particular algorithm, gray dotted vertical lines indicate 00:00 of each day, and the horizontal black dotted line is the average reward in the buffer. It shows that BCM outperforms other methods by providing a relatively stable learning curve. PQL constrains the Bellman update to state-action pairs that are sufficiently covered by the conditional probability of action given state under the data-generating policy. It adds a state-VAE and a statistical filtration over BCQ's architecture with pessimistic value approximation, which might over-penalize near-optimal policies that lack enough visitation; however, as time evolves, PQL gradually learns better. BEAR is only guaranteed to outperform BCQ on medium-quality data sets collected from a partially trained policy, a middle ground between an optimal policy and a random policy. However, in our case, the replay buffers are closer to data generated with an expert policy, which explains the outcomes. BCQ, as an ablation of our BCM algorithm, yields a performance comparable to BCM but fails to keep a stable outcome due to the lack of a strong learning signal.

The comparison between algorithms in our experiments is distinct from the results shown in the original papers, where PQL outperforms BCQ and BEAR in two out of the three simulated environments. By contrast, on our real building HVAC system, BCQ provides a more stable and continuously improving performance than the other two BRL methods. This is because those experiments were conducted in simulation environments where data are effectively unlimited, consequences for poor actions are non-existent, and system dynamics are clean and often deterministic [18]. However, in real-world problems, systems are stochastic and non-stationary, and it is not guaranteed that these algorithms behave the same as in simulated settings.

Figure 6: We find the top-5 most similar weeks regarding OAT to our experiment week (last figure) for evaluating energy consumption and thermal comfort.

4.6.2 Energy Consumption and Thermal Comfort Comparison. Outside Air Temperature (OAT) is a key factor affecting zone temperature; therefore, it affects both the thermal comfort and the energy usage of the HVAC system. It is thus reasonable to compare energy consumption against baseline time periods with the most similar OAT trend to the period during which the BRL methods are evaluated. To do so, we adopt Dynamic Time Warping (DTW) [5] to find historical weeks with similar OATs, as DTW is a widely used method to measure the similarity of time series of different lengths. In addition to considering the "shape" of the historical OAT, we also consider the mean OAT difference between our experiment time period and the historical weeks. In summary, we find historical time periods whose OAT trend is similar to, and whose average weekly OAT is close to, our experiment week.
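This baseline-week search can be sketched as below, combining a plain dynamic-programming DTW distance with a mean-OAT penalty; the weighting between the two terms and the data layout are assumptions for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def top_k_similar_weeks(experiment_oat, historical_weeks, k=5, mean_weight=10.0):
    """Rank historical weeks by DTW shape distance plus a mean-OAT difference penalty."""
    scores = []
    for week_id, oat in historical_weeks.items():
        score = dtw_distance(experiment_oat, oat) \
                + mean_weight * abs(np.mean(experiment_oat) - np.mean(oat))
        scores.append((score, week_id))
    return [week_id for _, week_id in sorted(scores)[:k]]
```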


Fig. 6 shows an example of the historical weeks found using the above metrics; in this figure, a tuple of (min, max) OAT is labeled on top of each week's OAT data. Once we have the top-5 weeks with the most similar OAT trend to our experiment period, we compare all methods and estimate energy consumption and thermal comfort.

Figure 7: Energy consumption and thermal comfort comparisons among different control methods

In Fig. 7, we normalize the historical energy use to one as the reference. BCM consumes the least energy compared with the other methods: a 16.7% energy consumption reduction is achieved, and BCQ also outperforms RBC by 9.5%. The occupants' thermal comfort, on the other hand, is shown in real average absolute values. The standard deviations (marked as error bars) of all BRL methods are smaller than their historical counterparts.

Figure 8: Thermal comfort achieved by our BCM model during evaluation

We also examine the thermal comfort during the entire time period of every experiment and keep track of changes and violations as time evolves. Fig. 8 is an example showing that BCM maintains the thermal comfort level persistently during the entire evaluation time period.

4.7 Sensitivity Analysis

4.7.1 Perturbation to Action. In our main evaluation, we used Φ = 0.05, which is the parameter controlling the degree of perturbation applied to selected actions. To inspect how perturbation impacts the performance of BCM, we evaluate two other values for Φ, 0.1 and 0.2. The result in Fig. 9 indicates that Φ = 0.1, on average, does not yield a higher reward than Φ = 0.05. For Φ = 0.2, the agent cannot learn efficiently until around 700 time steps because the range of actions to select from is too large. In our buffer, there is enough diversity since it is extracted from an entire year of data. Thus, we choose Φ = 0.05 in our main experiment.

Figure 9: Effect of perturbation to selected actions

4.7.2 Amount of Data. We randomly sample data points by fractions of {1/10, 1/100, 1/1000} and evaluate rooms on the same floor in the same week to observe the impact. Fig. 10 shows the information loss from smaller buffer data: e.g., the 1/1000 buffer hardly reaches the average of the original buffer, while the 1/10 and 1/100 cases show comparable performance but have difficulty being consistent.

Figure 10: Effect of buffer data size

4.7.3 Diversity of Batch Data. Originally, we use the thermal states of a set of rooms/zones from an entire year as our batch data. Intuitively, a replay buffer containing data from the same season as our evaluation period might be more suitable because of the similar seasonal weather conditions.


Thus, as an ablation, we use only the data from the same season.

Figure 11: Same Season vs. Entire Year

Fig. 11 shows that a batch of the entire year's data produces better performance than using only the same season. A narrower distribution of state-action visitation in a single season cannot update the Q-value as accurately as an entire year's data can, and incorrect Q-value estimation leads to a lower return. In summary, it is essential to ensure enough state-action visitation diversity in the batch data in order to estimate the value more accurately.

4.8 Generalization Experiments

4.8.1 In-batch/Out-of-batch Experiment. To examine the generalization of the BRL model, we test the learned policy on rooms for which no data exist in the batch. Fig. 12 shows that out-of-batch (OOB) rooms cannot select proper actions to compensate for the OAT fluctuation during the week: the reward curves follow the OAT trend periodically, with clear peaks and valleys. This is reasonable since different zones might respond differently under the same VAV control action, due to the thermal dynamics in the HVAC and the distance from the VAV to the zones.

Figure 12: Out-of-batch (OOB) vs In-batch

4.8.2 Room-specific/Floor-specific Experiment. We validate whether a room-specific policy is needed: we use room-specific batch data as our expert policy and evaluate those same rooms. In Fig. 13 we observe that although both floor and room models yield consistent outcomes above the average, it is better to use a specific room buffer for a better fit of the room/zone thermal dynamics.

Figure 13: Room Batch vs Floor Batch

5 CONCLUSION AND FUTURE WORKS

Our simulator-free, multi-zone, BRL-based framework uses existing data as prior knowledge to learn the optimal policy without setting up complex, parameterized simulators. It saves energy compared with the default rule-based control method and maintains thermal comfort. To the best of our knowledge, our work is the first to improve and implement state-of-the-art BRL methods on real building HVAC control. We hope our research encourages domain experts to adopt BRL for real-world problems.

To further improve our control framework, we will update our building operating system to achieve a more frequent data-writing rate. This way, we could train the model for the same number of time steps in a shorter time, and hence obtain faster convergence. In addition, we will include rooms of different functionality, e.g., conference rooms, individual offices, and study areas, in our evaluation to create a more generalized model for HVAC control. We could also expand the action spaces by including chiller system control and economizers for more comprehensive optimization.

For methodology improvement, we plan to further investigate model-based methods in offline mode, which use dynamics models to generate a model buffer that is then also used to update the BRL model.

ACKNOWLEDGEMENT

This work was supported in part by the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

REFERENCES
[1] Anil Aswani, Neal Master, Jay Taneja, David Culler, and Claire Tomlin. 2011. Reducing transient and steady state electricity consumption in HVAC using learning-based model-predictive control. Proc. IEEE 100, 1 (2011), 240–253.
[2] Bharathan Balaji, Hidetoshi Teraoka, Rajesh Gupta, and Yuvraj Agarwal. 2013. Zonepac: Zonal power estimation and control via HVAC metering and occupant feedback. In BuildSys. 1–8.
[3] Farinaz Behrooz, Norman Mariun, Mohammad Hamiruce Marhaban, Mohd Amran Mohd Radzi, and Abdul Rahman Ramli. 2018. Review of control techniques for HVAC systems—Nonlinearity approaches based on Fuzzy cognitive maps. Energies 11, 3 (2018), 495.
[4] Alex Beltran and Alberto E Cerpa. 2014. Optimal HVAC building control with occupancy prediction. In BuildSys. 168–171.
[5] Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In KDD Workshop, Vol. 10. Seattle, WA, USA, 359–370.
[6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).
[7] Francesco Calvino, Maria La Gennusa, Gianfranco Rizzo, and Gianluca Scaccianoce. 2004. The control of indoor thermal comfort conditions: introducing a fuzzy adaptive controller. Energy and Buildings 36, 2 (2004), 97–102.
[8] CDC. 2003. Guidelines for Environmental Infection Control in Health-Care Facilities. https://www.cdc.gov/infectioncontrol/guidelines/environmental/background/air.html.
[9] Bingqing Chen, Zicheng Cai, and Mario Bergés. 2019. Gnu-RL: A precocial reinforcement learning solution for building HVAC control using a differentiable MPC policy. In BuildSys. 316–325.
[10] Drury B Crawley, Linda K Lawrie, Frederick C Winkelmann, Walter F Buhl, Y Joe Huang, Curtis O Pedersen, Richard K Strand, Richard J Liesen, Daniel E Fisher, Michael J Witte, et al. 2001. EnergyPlus: creating a new-generation building energy simulation program. Energy and Buildings 33, 4 (2001), 319–331.
[11] Hui Dai and Bin Zhao. 2020. Association of the infection probability of COVID-19 with ventilation rates in confined spaces. In Building Simulation, Vol. 13.
[12] Megan Dawe, Paul Raftery, Jonathan Woolley, Stefano Schiavon, and Fred Bauman. 2020. Comparison of mean radiant and air temperatures in mechanically-conditioned commercial buildings from over 200,000 field and laboratory measurements. Energy and Buildings 206 (2020), 109582.


[13] Richard J De Dear. 1998. A global database of thermal comfort field experiments. ASHRAE Transactions 104 (1998), 1141.
[14] Adithya M Devraj, Ana Bušić, and Sean Meyn. 2019. Zap Q-Learning—A User's Guide. In 2019 Fifth Indian Control Conference (ICC). IEEE, 10–15.
[15] Xianzhong Ding, Wan Du, and Alberto Cerpa. 2019. OCTOPUS: Deep reinforcement learning for holistic smart building control. In BuildSys. 326–335.
[16] Xianzhong Ding, Wan Du, and Alberto E Cerpa. 2020. MB2C: Model-Based Deep Reinforcement Learning for Multi-zone Building Control. In BuildSys. 50–59.
[17] Anastasios I Dounis, M Bruant, M Santamouris, G Guarracino, and P Michel. 1996. Comparison of conventional and fuzzy control of indoor air quality in buildings. Journal of Intelligent & Fuzzy Systems 4, 2 (1996), 131–140.
[18] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019).
[19] Povl O Fanger et al. 1970. Thermal comfort. Analysis and applications in environmental engineering. (1970).
[20] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020).
[21] Xiaohan Fu, Jason Koh, Francesco Fraternali, Dezhi Hong, and Rajesh Gupta. 2020. Zonal Air Handling in Commercial Buildings. In BuildSys. 302–303.
[22] Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In ICML. PMLR, 1587–1596.
[23] Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In ICML. PMLR, 2052–2062.
[24] Guanyu Gao, Jie Li, and Yonggang Wen. 2020. DeepComfort: Energy-Efficient Thermal Comfort Control in Buildings via Reinforcement Learning. IEEE Internet of Things Journal 7, 9 (2020), 8472–8484.
[25] Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. 2019. A review of reinforcement learning methodologies for controlling occupant comfort in buildings. Sustainable Cities and Society 51 (2019), 101748.
[26] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. 2018. Deep reinforcement learning that matters. In AAAI.
[27] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. NIPS 30 (2017), 3146–3154.
[28] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[29] SA Klein. 1976. University of Wisconsin-Madison Solar Energy Laboratory. TRNSYS: A transient simulation program. Eng. Experiment Station (1976).
[30] Andrew Krioukov, Gabe Fierro, Nikita Kitaev, and David Culler. 2012. Building application stack (BAS). In BuildSys. 72–79.
[31] Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[32] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. 2019. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949 (2019).
[33] Geoff J Levermore. 1992. Building energy management systems. (1992).
[34] Yuguo Li, Hua Qian, Jian Hang, Xuguang Chen, Ling Hong, Peng Liang, Jiansen Li, Shenglan Xiao, Jianjian Wei, Li Liu, et al. 2020. Evidence for probable aerosol transmission of SARS-CoV-2 in a poorly ventilated restaurant. MedRxiv (2020).
[35] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[36] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. 2020. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202 (2020).
[37] Siliang Lu, Weilong Wang, Chaochao Lin, and Erica Cochran Hameen. 2019. Data-driven simulation of a thermal comfort-based temperature set-point control with ASHRAE RP884. Building and Environment 156 (2019), 137–146.
[38] Mehdi Maasoumy, Alessandro Pinto, and Alberto Sangiovanni-Vincentelli. 2011. Model-based hierarchical optimal control design for HVAC systems. In Dynamic Systems and Control Conference, Vol. 54754. 271–278.
[39] Mehdi Maasoumy, M Razmara, M Shahbakhti, and A Sangiovanni Vincentelli. 2014. Handling model uncertainty in model predictive control for energy efficient buildings. Energy and Buildings 77 (2014), 377–392.
[40] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In ICML. PMLR, 1928–1937.
[41] Srinarayana Nagarathinam, Vishnu Menon, Arunchandar Vasan, and Anand Sivasubramaniam. 2020. MARCO—Multi-Agent Reinforcement learning based
[…] Applications (CCA). IEEE, 55–60.
[44] Naren Srivaths Raman, Adithya M Devraj, Prabir Barooah, and Sean P Meyn. 2020. Reinforcement learning for control of building HVAC systems. In 2020 American Control Conference (ACC). IEEE, 2326–2332.
[45] REHVA. 2021. REHVA COVID-19 Guidance v4.1. https://www.rehva.eu/fileadmin/user_upload/REHVA_COVID-19_guidance_document_V4.1_15042021.pdf.
[46] Martin Riedmiller. 2005. Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method. In ECML. Springer, 317–328.
[47] Brian C Ross. 2014. Mutual information between discrete and continuous data sets. PLoS ONE 9, 2 (2014), e87357.
[48] Frederik Ruelens, Bert J Claessens, Stijn Vandael, Bart De Schutter, Robert Babuška, and Ronnie Belmans. 2016. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Transactions on Smart Grid 8, 5 (2016), 2149–2159.
[49] Frederik Ruelens, Bert J Claessens, Stijn Vandael, Sandro Iacovella, Pieter Vingerhoets, and Ronnie Belmans. 2014. Demand response of a heterogeneous cluster of electric water heaters using batch reinforcement learning. In 2014 Power Systems Computation Conference. IEEE, 1–7.
[50] Frederik Ruelens, Sandro Iacovella, Bert J Claessens, and Ronnie Belmans. 2015. Learning agent for a heat-pump thermostat with a set-back strategy using model-free reinforcement learning. Energies 8, 8 (2015), 8300–8318.
[51] Jyri Salpakari and Peter Lund. 2016. Optimal and rule-based control strategies for energy flexibility in buildings with PV. Applied Energy 161 (2016), 425–436.
[52] Claude Sammut. 2010. Behavioral Cloning. Springer US, 93–97.
[53] AB Shepherd and WJ Batty. 2003. Fuzzy control strategies to provide cost and energy efficient high quality indoor environments in buildings with high occupant densities. Building Services Engineering Research and Technology 24, 1 (2003).
[54] Ravid Shwartz-Ziv and Amitai Armon. 2021. Tabular Data: Deep Learning is Not All You Need. arXiv preprint arXiv:2106.03253 (2021).
[55] Muthusamy V Swami and Subrato Chandra. 1987. Procedures for calculating natural ventilation airflow rates in buildings. ASHRAE final report FSEC-CR-163-86, ASHRAE research project (1987), 130.
[56] IEA UN. 2020. Global status report for buildings and construction (2019). Available at https://www.gbpn.org/china/newsroom/2019-global-status-report-buildings-and-construction. Access date 15 (2020).
[57] William Valladares, Marco Galindo, Jorge Gutiérrez, Wu-Chieh Wu, Kuo-Kai Liao, Jen-Chung Liao, Kuang-Chin Lu, and Chi-Chuan Wang. 2019. Energy optimization associated with thermal comfort and indoor air control via a deep reinforcement learning algorithm. Building and Environment 155 (2019), 105–117.
[58] José Vázquez-Canteli, Jérôme Kämpf, and Zoltán Nagy. 2017. Balancing comfort and energy consumption of a heat pump using batch reinforcement learning with fitted Q-iteration. Energy Procedia 122 (2017), 415–420.
[59] José Vázquez-Canteli, Stepan Ulyanin, Jérôme Kämpf, and Zoltán Nagy. 2018. Adaptive multi-agent control of HVAC systems for residential demand response using batch reinforcement learning. (2018).
[60] Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. 2020. Leverage the average: an analysis of KL regularization in reinforcement learning. In NeurIPS.
[61] Nino Vieillard, Olivier Pietquin, and Matthieu Geist. 2020. Munchausen reinforcement learning. arXiv preprint arXiv:2007.14430 (2020).
[62] Zhe Wang and Tianzhen Hong. 2020. Reinforcement learning for building controls: The opportunities and challenges. Applied Energy 269 (2020), 115036.
[63] Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017. 1–6.
[64] Thomas Weng, Anthony Nwokafor, and Yuvraj Agarwal. 2013. BuildingDepot 2.0: An integrated management system for building analysis and control. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings. 1–8.
[65] Daniel A Winkler, Ashish Yadav, Claudia Chitu, and Alberto E Cerpa. 2020. OFFICE: Optimization framework for improved comfort & efficiency. In IPSN. IEEE, 265–276.
[66] Lei Yang, Zoltan Nagy, Philippe Goffin, and Arno Schlueter. 2015. Reinforcement learning for optimal control of low exergy buildings. Applied Energy 156 (2015), 577–586.
[67] Liang Yu, Shuqi Qin, Meng Zhang, Chao Shen, Tao Jiang, and Xiaohong Guan. 2020. Deep Reinforcement Learning for Smart Building Energy Management: A Survey. arXiv preprint arXiv:2008.05074 (2020).
[68] Chi Zhang, Sanmukh R Kuppannagari, Rajgopal Kannan, and Viktor K Prasanna. 2019. Building HVAC scheduling using reinforcement learning via neural network based model approximation. In BuildSys. 287–296.
[69] Tianyu Zhang, Gaby Baasch, Omid Ardakanian, and Ralph Evins. 2021. On the
COntrol of building HVAC systems. In e-Energy. 57–67. Joint Control of Multiple Building Systems with Reinforcement Learning. (2021).
[42] June Young Park and Zoltan Nagy. 2020. HVACLearn: A reinforcement learning [70] Zhiang Zhang and Khee Poh Lam. 2018. Practical implementation and evaluation
based occupant-centric control for thermostat set-points. In e-Energy. 434–437. of deep reinforcement learning control for a radiant heating system. In BuildSys.
[43] Samuel Prívara, Zdeněk Váňa, Dimitrios Gyalistras, Jiří Cigler, Carina Sager-
schnig, Manfred Morari, and Lukáš Ferkl. 2011. Modeling and identification of a
large multi-zone office building. In 2011 IEEE International Conference on Control

A APPENDIX
A.1 Experiment Details
A.1.1 Parameters. To help researchers reproduce our results, we provide the hyperparameters used in our experiments. For most of the models, we follow their default settings unless otherwise recommended. We do not fine-tune the hyperparameters of the BRL algorithms; we use the values reported in the literature [23, 32, 36] and keep the same actor-critic network architecture across methods for a fair comparison, since modifying the architecture or any implementation detail can lead to large differences in performance [26]. In PQL, we scale the maximum number of state-VAE training steps by the ratio of PQL's MuJoCo buffer size to our building buffer size. For all the network architectures, we follow the original setups. The hyperparameters are listed in Table 1.

Figure 14: An example of historical thermal comfort trends in top-5 similar OAT weeks
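For concreteness, the settings in Table 1 can be grouped into per-algorithm configuration dictionaries. The sketch below is illustrative only: the field names and the get_config helper are our own and do not come from any released implementation of these methods; the values are copied from Table 1.

# Illustrative grouping of the Table 1 hyperparameters (names are ours, values from Table 1).
COMMON = {
    "gamma": 0.99,         # discount factor
    "batch_size": 100,     # mini-batch size N
    "tau": 0.005,          # target network update rate
    "lambda_min_q": 0.75,  # minimum weighting between the two Q-networks
}

PER_ALGORITHM = {
    "BCM":  {"phi": 0.05, "alpha_m": 0.9, "tau_m": 0.03, "clip_value_min": -1.0},
    "BCQ":  {"phi": 0.05},
    "BEAR": {"policy_update_version": 0, "mmd_num_samples": 5, "mmd_sigma": 20,
             "kernel_type": "laplacian", "lagrange_threshold": 10, "distance_type": "MMD"},
    "PQL":  {"phi": 0.1, "backup": "Q-max", "ql_noise": 0.15, "b_percentile": 2,
             "max_state_vae_trainstep": int(2e4)},
}

def get_config(algo: str) -> dict:
    """Merge the shared settings with the algorithm-specific ones (illustrative helper)."""
    return {**COMMON, **PER_ALGORITHM[algo]}

# Example: get_config("BCQ") yields {"gamma": 0.99, ..., "phi": 0.05}

Grouping the shared settings this way makes explicit that only the method-specific regularization terms differ across BCM, BCQ, BEAR, and PQL.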

Table 1: Hyperparameter Settings of evaluated methods

Hyperparameter              BCM       BCQ       BEAR        PQL
γ                           0.99      0.99      0.99        0.99
N                           100       100       100         100
τ                           0.005     0.005     0.005       0.005
λ                           0.75      0.75      0.75        0.75
Φ                           0.05      0.05      –           0.1
α_m                         0.9       –         –           –
τ_m                         0.03      –         –           –
clip value min.             -1        –         –           –
backup                      –         –         –           Q-max
QL noise                    –         –         –           0.15
b percentile                –         –         –           2
max state VAE trainstep     –         –         –           2e4
Policy update version       –         –         0           –
MMD matching # samples      –         –         5           –
MMD sigma                   –         –         20          –
Kernel type                 –         –         Laplacian   –
Lagrange threshold          –         –         10          –
Distance type               –         –         MMD         –

γ: discount factor; N: mini-batch size; τ: target network update rate; λ: minimum weighting between the two Q-networks; Φ: maximum perturbation on the action; α_m: Munchausen scaling term; τ_m: entropy temperature; clip value min.: minimum clipping value on the Munchausen term.

Figure 15: States in BCM evaluation week

A.1.2 Data Monitored. During the evaluation, we monitored all the states as time series to check for any abnormality and to inspect how the BRL methods optimize the target objectives. Fig. 14 shows an example of how thermal comfort varies across historical weeks under rule-based control. Apparent periodic patterns follow the OAT trends during the week, indicating that RBC cannot compensate for OAT variations as the BRL method does (Fig. 8).
Fig. 15 shows the time series of the states observed during BRL evaluation. Our BRL method BCM keeps the zone air temperature setpoint (ZNT StPt) stably within a narrow range and thus keeps the zone air temperature readings (ZNT) in a reasonable range that maintains thermal comfort, even though no constraints are applied. This differs from online RL methods, where the range of actions is constrained by human experts as hard rules.

A.2 Experiment with safe minimum airflow

A.2.1 Motivation. Indoor environments and indoor gatherings present a disease-spreading risk, as virus-laden aerosol lingers in indoor air for hours at high concentrations [34] rather than being quickly dispersed and destroyed through UV (sun)light outdoors. Accumulated exposure to viral load over time is an important risk determinant for an individual to be infected [11]. In the context of the current pandemic caused by the spread of the SARS-CoV-2 virus that causes COVID-19, many efforts are underway to control its spread so that the public healthcare system can maintain its capacity and reduce fatalities. We believe that a well-designed operation of the HVAC system can be a critical means to reduce the likelihood of spreading events by appropriately directing airflows. HVAC societies such as ASHRAE and REHVA have recommended high rates of air circulation and an increased fraction of fresh air. This is typically measured by air changes per hour (ACH) in a given enclosed space or the entire building. ACH is computed as the air volume added to or removed from a space in an hour divided by the total volume of the space [55]. For air impurities removed by fresh air, unit ACH is then a time constant that represents the rate of dilution of infectious particles caused by the introduction of fresh air [11]. ACH is increased primarily by increasing the fraction of fresh air and the speed of airflow supplied to a given space. Typically, commercial buildings are designed to achieve ACH levels of 3-5, whereas more sensitive areas in hospital settings could be as high as 12 ACH [8].
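To make the ACH bookkeeping concrete, the short sketch below computes ACH from a supply airflow and a room volume, and converts the per-person fresh-air rate used later in this appendix (10 L/s per person, about 21.19 CFM) into a per-room minimum airflow. The room volume and occupancy in the example are made-up illustration values, not measurements from our testbed.

# Illustrative ACH and minimum-airflow arithmetic (example numbers, not testbed data).
CFM_PER_LPS = 2.11888  # 1 L/s is roughly 2.11888 CFM, so 10 L/s per person is about 21.19 CFM

def ach(supply_airflow_m3_per_h: float, room_volume_m3: float) -> float:
    """Air changes per hour: air volume supplied per hour divided by the room volume."""
    return supply_airflow_m3_per_h / room_volume_m3

def min_safe_airflow_cfm(occupancy: int, lps_per_person: float = 10.0) -> float:
    """Per-room minimum fresh-air requirement at full occupancy, in CFM."""
    return occupancy * lps_per_person * CFM_PER_LPS

# Example: a 125 m^3 office supplied with 500 m^3/h of air gives 4 ACH, and
# 6 occupants at 10 L/s per person give roughly 127 CFM of minimum airflow.
print(ach(500.0, 125.0))        # 4.0
print(min_safe_airflow_cfm(6))  # ~127.1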

Achieving a substantially high ACH level in a typical office building is challenging due to the cooling capacity of the equipment [21], and thus in our study we seek to fulfill a minimum safe airflow requirement.

A.2.2 Safe Airflow Level Guidelines. Various guidelines have been issued by ASHRAE², CDC³, and the European Union REHVA⁴ on building operation to lower the risk of occupants getting infected by respiratory disease through the air during the COVID-19 pandemic. These guidelines provide detailed recommendations regarding multiple aspects of building operation and share much in common, including, but not limited to, the use of high-rating minimum efficiency reporting value (MERV) filters and/or UV-C lighting to treat the return air, 24/7 HVAC operation, no use of recirculated air (i.e., use of 100% outside air), and an increased air change (ACH) rate during occupancy.
While comprehensive, these recommendations are difficult to implement altogether, if not completely impossible. The effects of these measures and their implications for the building systems with respect to energy consumption and occupants' thermal comfort still largely remain unclear to practitioners and residents. In our work, we maintain a safe airflow level in the zones we evaluate by requiring a minimum of 21.19 CFM per person (10 L/s per person) [45] of airflow in a space, which satisfies ASHRAE's, REHVA's, and CDC's requirements.

A.2.3 Experiment Results. In Fig. 16, we compare several state-of-the-art BRL methods as we did in our main experiments. The minimum safe airflow is calculated from the number of people occupying the room, where we assume full occupancy.

Figure 16: Reward comparison (considering safe airflow)

The state, action, and environment setups are all the same as in our main experiments, except that the reward function at time step t is calculated with the following equation:

R_t = -\alpha \, \mathrm{ReLU}(|TC_t| - TC_c) - \beta \, s_t^{Sup} - \delta \, \mathrm{ReLU}(A_{min}^{safe} - s_t^{Sup}),    (4)

In Eq. (4), α, β, and δ are weights that balance the different objectives and can be tuned to meet specific goals; TC_t is the thermal comfort index at time t; TC_c is the thermal comfort requirement, set to 0.5; and s_t^{Sup} is the supply airflow at time t. We assume each room is fully occupied, leading to a constant A_{min}^{safe} for each room based on the ACH requirement and the number of people at full occupancy. The ReLU (Rectified Linear Unit) activation function is used here to penalize any thermal comfort index that falls outside the comfortable range and any airflow value that is lower than the minimum safe airflow. A minimal sketch of this reward computation is given at the end of this subsection.
The results are run with two stacks of rooms per algorithm, and each stack of runs lasts approximately a week. The experiment results motivate us to improve upon BCQ, since it outperforms the others in the real HVAC environment. The buffer is the same as in our main experiments, with an entire year of records, and the evaluation period is from June 1st to June 14th, 2021.

Figure 17: Energy, thermal comfort, and airflow comparison

To further analyze the improvements on each of the target objectives, Fig. 17 compares energy consumption, thermal comfort, and airflow readings. In this figure, the RBC value of each category is normalized to one. We observe that in summer OAT weeks, BRL methods save more energy compared with the results of our main experiments, where evaluation is done in the Fall. BCQ achieves a 24 percent energy reduction cf. RBC, owing to a more efficient control policy with more stable airflow and thermal comfort, as the error bars in the figure show.
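The following sketch illustrates how the reward in Eq. (4) can be evaluated at a single time step. It is a minimal illustration under stated assumptions: the weight values and the example occupancy figures are placeholders, the ReLU is written out explicitly, and the function names are ours rather than taken from our control code.

def relu(x: float) -> float:
    """Rectified Linear Unit used for the one-sided penalties in Eq. (4)."""
    return max(0.0, x)

def reward(tc_t: float, supply_airflow_t: float, a_min_safe: float,
           alpha: float = 1.0, beta: float = 1.0, delta: float = 1.0,
           tc_c: float = 0.5) -> float:
    """Eq. (4): penalize discomfort, energy use (via supply airflow), and unsafe airflow.
    alpha, beta, delta are placeholder weights; tc_c = 0.5 is the comfort requirement."""
    comfort_penalty = alpha * relu(abs(tc_t) - tc_c)
    energy_penalty = beta * supply_airflow_t
    safety_penalty = delta * relu(a_min_safe - supply_airflow_t)
    return -(comfort_penalty + energy_penalty + safety_penalty)

# Example: comfort index 0.7, 100 CFM supplied, 127 CFM minimum at full occupancy
# gives -(1.0*0.2 + 1.0*100 + 1.0*27) = -127.2
print(reward(tc_t=0.7, supply_airflow_t=100.0, a_min_safe=127.0))

In practice, the weights (and the scale of the airflow term) would be tuned so that no single penalty dominates, consistent with the note above that α, β, and δ can be adjusted to meet specific goals.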
2 https://tinyurl.com/yy8f5faq  3 https://tinyurl.com/y9lczbwp  4 https://tinyurl.com/yy8nzlmj
