Exploring Deep Reinforcement Learning For Holistic Smart Building Control
Abstract—Recently, significant efforts have been made to improve the quality of comfort for commercial buildings' users while also trying to reduce energy use and costs. Most of these efforts have concentrated on energy-efficient control of the [...]

[...] control rules may not be optimal and have to be adapted to new buildings at commissioning time. Many times these rules are updated in an ad-hoc manner, based on experience and feedback from occupants and/or trial and error performed by [...]

[...] control. It is a planning-based method that solves an optimal control problem iteratively over a receding time horizon. Some of the advantages of MPC are that it takes into consideration future disturbances and that it can handle multiple constraints and objectives, e.g., energy and comfort [5].

However, it can be argued that the main roadblock preventing widespread adoption of MPC is its reliance on a model [14], [15]. By some estimates, modeling can account for up to 75% of the time and resources required for implementing MPC in practice [16]. Because buildings are highly heterogeneous, a custom model is required for each thermal zone or building under control.

There are two paradigms for modeling building dynamics: physics-based and statistics-based [14]. Physics-based models, e.g., EnergyPlus, utilize physical knowledge and material properties of a building to create a detailed representation of the building dynamics. A major shortcoming is that such models are not control-oriented. Nonetheless, it is not impossible to use such models for control [17]. For instance, exhaustive search optimization has been used to derive a control policy for an EnergyPlus model [18]. Furthermore, physics-based models require significant modeling effort, because they have a large number of free parameters to be specified by engineers (e.g., 2,500 parameters for a medium-sized building [19]), and the information required for determining these parameters is scattered across different design documents [20].

Statistical models assume a parametric model form, which may or may not have physical underpinnings, and identify model parameters directly from data. Dinh et al. [21] propose a hybrid control that combines MPC and direct imitation learning to reduce energy cost while maintaining a comfortable indoor temperature. While this approach is potentially scalable, a practical problem is that the experimental conditions required for accurate identification of building systems fall outside of normal building operations [22].

Conventional control of multiple subsystems. The blind system should be considered an integral part of fenestration system design for commercial and office buildings, in order to balance daylighting requirements against the need to reduce solar gains. The impact of glazing area, shading device properties and shading control on building cooling and lighting demand was calculated using a coupled lighting and thermal simulation module [11]. The interactions between cooling and lighting energy use in perimeter spaces were evaluated as a function of window-to-wall ratio and shading parameters.

The impact of window operation on building performance was investigated [13] for different types of ventilation systems, including natural ventilation, mixed-mode ventilation, and conventional VAV systems, in a medium-size reference office building. While the results highlighted the impacts of window operation on energy use and comfort and identified HVAC energy savings with mixed-mode ventilation during summer for various climates, the control of the window opening fraction was estimated by experience and is not scalable to different kinds of buildings.

Kolokotsa et al. [23] develop an energy-efficient fuzzy controller based on a genetic algorithm to control four subsystems (HVAC, lighting, window, and blind) and meet the occupant requirements of human comfort. However, the genetic algorithm requires a few minutes to hours to generate one control action and thus is not practical for real building control.

RL-based control of the HVAC system. With the development of deep learning [24], [25] and deep reinforcement learning [26], [27], many works apply RL for HVAC control. RL control can be a "model-free" control method, i.e., an RL agent has no prior knowledge about the controlled process. RL learns an optimal control strategy by "trial-and-error". Therefore, it can be an online learning method that learns an optimal control strategy during actual building operations. Fazenda et al. [28] investigated the application of a reinforcement-learning-based supervisory control approach, which actively learns how to appropriately schedule thermostat temperature setpoints. However, in HVAC control, online learning may introduce unstable and poor control actions at the initial stage of learning. In addition, it may take a long time (e.g., over 50 days reported in [28]) for an RL agent to converge to a stable control policy in some cases. Therefore, some studies choose to use an HVAC simulator to train the RL agent offline [29].

Unlike MPC, simulators of arbitrarily high complexity can be used directly to train RL agents because of RL's "model-free" nature. Li et al. [6] adopt Q-learning for HVAC control. Dalamagkidis et al. [30] design a Linear Reinforcement Learning Controller (LRLC) using linear function approximation of the state-action value function to meet thermal comfort with minimal energy consumption. However, tabular Q-learning approaches are not suitable for problems with a large state space, like the state of four subsystems. Le et al. [31], [32] propose a control method for air free-cooled data centers in the tropics via DRL. Vazquez-Canteli et al. [33] develop a multi-agent RL implementation for load shaping of grid-interactive connected buildings. Ding et al. [34] design a model-based RL method for multi-zone building control. Zhang et al. [7], [35] implement and deploy a DRL-based control method for radiant heating systems in a real-life office building. Gao et al. [36] propose a deep deterministic policy gradient (DDPG)-based approach for learning a thermal comfort control policy. Although the above works can improve the performance of HVAC control, they focus only on the HVAC subsystem.

III. MOTIVATION

In this section, we perform a set of preliminary simulations in EnergyPlus [37] in order to understand the relationships between the different subsystems and their impact on human comfort in a building, as described in Figure 2. This is also used to gain trust that the simulator is being run correctly, with intuitive results that can be understood.

Our goal is to study the effect of the different subsystems on three human comfort metrics. A single-floor office building of 100 m² at Merced, California is modeled. The building is equipped with a north-facing single-panel window of 2 m² and an interior blind. The simulations are conducted with weather data for the month of October. This is a shoulder season, with outdoor temperatures being a bit cold, but mostly sunny days, i.e., high solar gain.
Fig. 3: Thermal Comfort, PMV
Fig. 4: Visual Comfort, Illuminance
Fig. 5: Temperature Effect
Figure 3 shows the effect of the three subsystems on thermal comfort. Predictive Mean Vote (PMV) is used to evaluate thermal comfort. A PMV value close to zero represents the best thermal comfort, with higher positive values meaning people are hot, and lower negative values meaning people are cold. A detailed description of PMV values and ranges will be provided in Section IV-D2. The baseline case (green-solid) in Figure 3 shows the case when all three subsystems are closed. This case acts like a "fishtank" model, where the only effect in the room is due to the solar gain during the day, with no other interactions through any system but the window.

When only the blind is open (blue-dashed), the PMV value rises from about 1.45 to 1.75, showing an increase in temperature due to the increased solar gain. This is more prominent in the middle of the day, when the sun is at its apex. When the window is open (red-dash-dot), the PMV value is lowered due to the temperature effect: colder outside air enters the room, producing a colder, more comfortable temperature. The HVAC system (black-dot) can maintain the PMV value within an acceptable range (between -0.5 and +0.5) by forcing air to be at the correct temperature through the room vents. From the results of Figure 3, we can conclude that all three subsystems have an obvious impact on thermal comfort.

Figure 4 shows the illuminance measured at a place close to the window from 5 am to 7 pm when the blind is open (green-solid) and the room has natural light. Illuminance values of 500-1000 lux or higher are acceptable in most environments. We clearly see that with the blind open, the values are within this range for most of the day.

Figure 5 shows the indoor temperature when the blind is open (red-dashed) or closed (blue-solid). The outdoor temperature (green-dash-dot) is lower than the indoor temperature, due to the "fish tank" effect and the lack of an open window or a running HVAC system during the day. Combining the results from Figures 4 and 5, we see that the blind system can save energy consumed by the lighting system by reducing the need for artificial light, but it may also increase the energy used by the HVAC system in order to maintain the load. However, for lower outdoor temperatures in winter, the sunlight through the blind can increase the indoor temperature and save energy in the HVAC system.

The simulations are conducted to show some examples of the non-trivial interactions between subsystems and human comfort. It is challenging to quantify the complex relationships among the different subsystems and the three human comfort metrics, and this challenge serves as motivation for our work.

IV. DESIGN OF OCTOPUS

In this section, we describe in detail the design of OCTOPUS, including a system overview, DRL-based building control, the branching dueling Q-network, and the reward function calculation.

A. OCTOPUS Overview

The design goal of OCTOPUS is to meet the requirements of human comfort through energy-efficient control of four subsystems in a building. Our goal is to minimize the energy E consumed by all subsystems in the building, including the energy used in heating/cooling coils to heat and cool the air, the electricity used in the water pumps and flow fans in the HVAC system, the electricity used by the lights, and the electricity used by the motors to adjust the blinds and windows.

The value of E is constantly affected by the vector As, which is an action combination for the four subsystems and which belongs to the set Aall of all possible action combinations.

In addition to the minimization of energy, we would like to maintain the human comfort metrics within a particular range. This can be expressed as Pmin ≤ PMV ≤ Pmax, Vmin ≤ V ≤ Vmax, and Imin ≤ I ≤ Imax. PMV is a parameter that measures thermal comfort; V measures visual comfort; and I measures indoor air quality. The consumed energy E and the human comfort metrics (PMV, V, and I) are determined by the current state of all four subsystems, the outdoor weather, and the action we are about to take. They can be measured in real buildings or calculated in a building simulator, like EnergyPlus, after the action is executed.

The achieved human comfort results should fall into an acceptable range to meet the requirements of users. We use [Pmin, Pmax], [Vmin, Vmax], and [Imin, Imax] to denote the accepted ranges for thermal comfort, visual comfort and indoor air quality. They can be set by individual users according to their preference, or by facility managers based on building standards. The details on the calculation of the above parameters (E, PMV, V and I), the definition of an action (As), and the settings of the human comfort ranges (e.g., [Pmin, Pmax]) will be introduced in Section IV-D.

Our goal is to find the best As from Aall for each action interval (15 mins in our implementation). The best As should maintain the three human comfort metrics in their acceptable ranges for the entire control interval with the lowest energy consumption.
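In symbols, the overview above amounts to the following constrained problem (a compact restatement of the text; the per-term definitions and the exact bounds are those introduced in Section IV-D):

```latex
\min_{A_s \in A_{all}} \; E(s, A_s)
\quad \text{s.t.} \quad
P_{min} \le PMV \le P_{max}, \;\;
V_{min} \le V \le V_{max}, \;\;
I_{min} \le I \le I_{max}
```

Here s is the current building and weather state, and the constraints must hold over each 15-minute control interval; Equation 3 later folds these constraints into a single reward signal.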
Fig. 6: OCTOPUS Architecture with Four Subsystems (including HVAC, lighting, blind and window systems)
Fig. 8: The Specific Action Branching Network Implemented for the Proposed BDQ Agent
[...] Q-Network (DQN) in [8] and Asynchronous Advantage Actor-Critic (A3C) in [7], cannot work efficiently in our problem, because the large number of actions would have to be explicitly represented in the agent's DNN, which significantly increases the number of DNN parameters to be learned and, consequently, the training time [38]. To solve this problem, we leverage a novel neural architecture featuring a shared representation followed by four network branches, one for each action dimension.

4) Reward Function in OCTOPUS: The reward gives the immediate evaluation of the control effect of each action under a certain state. Both human comfort and energy consumption should be incorporated. To define the reward function, a common approach is to use the Lagrangian multiplier method [39] to first convert the constrained formulation into an unconstrained one:

R = −[ρ1 Norm(E) + ρ2 Norm(Tc) + ρ3 Norm(Vc) + ρ4 Norm(Ic)],   (3)

where ρ1, ρ2, ρ3 and ρ4 are the Lagrangian multipliers; E is energy consumption, Tc is thermal comfort, Vc is visual comfort and Ic is indoor air quality. Norm(x) is a normalization process, i.e., Norm(x) = (x − xmin)/(xmax − xmin), used to transform energy and the three human comfort metrics to the same scale. This reward function merges the objective (e.g., energy consumption) and constraint satisfaction (e.g., human comfort). The reward consists of four parts, namely, the penalty for the energy consumption of the HVAC and lighting systems, the penalty for the occupants' thermal discomfort, the penalty for the occupants' visual discomfort, and the penalty for the occupants' indoor air condition discomfort. Specifically, the reward should be lower if more energy is consumed by the HVAC system or the occupants feel uncomfortable about the building's thermal, visual or indoor air conditions. The details about how to define and formulate energy consumption E, thermal comfort Tc, visual comfort Vc and indoor air condition Ic are explained in Section IV-D.

C. Branching Dueling Q-Network

To solve the high-dimensional action problem described in Section IV-B3, OCTOPUS adopts a Branching Dueling Q-Network (BDQ), which is a branching variant of the dueling Double Deep Q-Network (DDQN). BDQ is a new neural architecture featuring a shared decision module followed by several network branches, one for each action dimension. BDQ can scale robustly to environments with high-dimensional action spaces and even outperforms the Deep Deterministic Policy Gradient (DDPG) algorithm in the most challenging tasks [40]. In our current implementation, we use a simulated building model developed in EnergyPlus as the environment for training and validation. Our BDQ-based agent interacts with the EnergyPlus model. At each control step, it processes the state (building and weather parameters) and generates a combined action set for the four subsystems.

Figure 8 shows the action branching network of the BDQ agent. When a state is input, the shared decision module computes a latent representation that is then used for the calculation of the state value and of the network outputs (the Advantages dimension in Figure 8) for each dimension branch. The state value and the factorized advantages are then combined, via a special aggregation layer, to output the Q-values for each action dimension. These Q-values are then queried for the generation of a joint-action tuple. The weights of the fully connected neural layers are denoted by the gray trapezoids, and the size of each layer (i.e., number of units) is depicted in the figure.

Training Process: The training process of the BDQ-based control agent is outlined in Algorithm 1. At the beginning, we first initialize a neural network Q with random weights θ. Another neural network Q− with the same architecture is also created. The outer "for" loop controls the number of training episodes, and the inner "for" loop performs control at each control time step within one training episode. During the training process, the recent transition tuples (St, At, St+1, Rt+1) are stored in the replay memory Λ, from which mini-batches of samples are generated for neural network training. The variable At stores the control action of the last step, and St and St+1 represent the building state at the previous and current control time steps, respectively. At the beginning of each time slot t, we first update the four actions and obtain the current state St+1. In line 7, the immediate reward Rt+1 is calculated by Equation 3. A training mini-batch is built by randomly drawing transition tuples from the memory. We calculate the target vector and update the weights of the neural network Q using an Adam optimizer at every control step t. Formally, for an action dimension d ∈ {1, ..., N} with n discrete actions, a branch's Q-value at state s ∈ S with action ad ∈ Ad is expressed in terms of the common state value V(s) and the corresponding branch advantage.
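Concretely, following the action-branching architecture in [40], each branch's Q-value is assembled from the shared state value and the branch's mean-subtracted advantages; in the notation above this is commonly written as:

```latex
Q_d(s, a_d) \;=\; V(s) \;+\; \Big( A_d(s, a_d) \;-\; \tfrac{1}{n} \sum_{a_d' \in A_d} A_d(s, a_d') \Big)
```

The joint-action tuple is then formed by selecting, independently in every branch, the action with the highest Q-value.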
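As an illustration of Equation 3, the sketch below shows how the normalized, weighted penalty could be computed in Python (the normalization bounds and the ρ weights here are placeholders; the paper's values come from Section IV-D and the tuning discussed with Table VI):

```python
def norm(x, x_min, x_max):
    # Min-max normalization, Norm(x) = (x - x_min) / (x_max - x_min).
    return (x - x_min) / (x_max - x_min)

def reward(E, Tc, Vc, Ic, bounds, rho):
    # Weighted penalty on normalized energy use and the three comfort terms
    # (thermal Tc, visual Vc, indoor-air Ic), as in Equation 3.
    return -(rho[0] * norm(E,  *bounds["E"])
             + rho[1] * norm(Tc, *bounds["Tc"])
             + rho[2] * norm(Vc, *bounds["Vc"])
             + rho[3] * norm(Ic, *bounds["Ic"]))

# Example with placeholder bounds and equal weights:
# r = reward(E=2.1, Tc=0.4, Vc=120.0, Ic=300.0,
#            bounds={"E": (0, 5), "Tc": (0, 3), "Vc": (0, 500), "Ic": (0, 600)},
#            rho=(0.25, 0.25, 0.25, 0.25))
```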
[...] (VAV) unit that regulates the amount of air that flows into a zone. Terminal reheat occurs when the heating coil increases the temperature before discharging air into a zone. A discharge setpoint temperature is selected for each zone, and the VAV ensures that the air is heated to this temperature for each zone. The air supplied to the zone is mixed with the current zone air, and some of the air is exhausted out of the zone to maintain a constant static pressure. The return air from each zone is mixed in the return duct, and then portions of it may enter the economizer.

Fig. 10: HVAC Single Duct VAV Terminal Reheat Layout.

TABLE II: Calibration Parameters, Ranges and Selected Values
| Parameter | Calibration Range | Selected Value |
| Window Area | 1 m² ∼ 4 m² | 2 m² |
| Window Thickness | 3 mm ∼ 6 mm | 3 mm |
| Fan Efficiency | 0.5 ∼ 0.8 | 0.7 |
| Blind Type | Interior/Exterior Blind | Interior |
| Blind Thickness | 1 mm ∼ 6 mm | 1 mm |

D. HVAC Modeling and Calibration

The purpose of the calibration is to ensure that the building energy model can generate energy use results close to the measured values in the target building using actual inputs, including weather, occupancy schedule, and the HVAC system parameters and controls.

The building model calibration process is shown in Figure 11. The first step of the calibration is to collect real weather data from a public weather station for the period to be tested. We use Dark Sky's API, a public weather website, to collect real weather data for three months. The second step is to replace the default occupancy schedules in the simulator with the actual occupancy schedules collected from the real target building using ThermoSense [48]. This system was installed in the target building on our campus and allows the collection of fine-grain occupancy data at the zone level in the building, allowing the evaluation to use accurate measured occupancy patterns. We used the hourly occupancy data from 3 months as the occupancy schedule in our simulated building in EnergyPlus. The third step is to calibrate certain system and control parameters to match those in the target building we want to replicate. This involves multiple issues, including (a) the selection of the parameters to be calibrated, (b) the range of those parameters, and (c) the step used within the range. In our work, we use an N-factorial design with 5 parameters and ranges to be tested based on operational experience. We tested different combinations of HVAC system parameters (infiltration rate) and controls (mass flow rate, heating and cooling setpoints) and found the combination that minimized the calibration error (see below). The selected calibration parameters are listed in Table II with their calibration ranges and selected values. The final step is to compare the calibration error between the calibrated model and the actual measured zone temperature and energy consumption stored in the operational building database. The whole calibration process of modeling our building takes nearly one month.

Fig. 11: Building Model Calibration Process (initial proposed model → interim model #1: weather data from a weather station, https://darksky.net → interim model #2: occupancy schedule from Panasonic PIR and Grid-EYE sensors → interim model #3: zone temperature and HVAC energy from WebCTRL and an InfluxDB database → calibrated model, with the calibration error evaluated as MBE and CVRMSE against actual measured data).

TABLE III: Modeling Error after Calibration
| Period | MBE | CVRMSE |
| February (hourly temperature) | -1.48% | 5.32% |
| March (hourly temperature) | -0.26% | 4.95% |
| April (hourly temperature) | 1.20% | 5.06% |
| May (hourly temperature) | 0.48% | 4.38% |
| February - May (monthly energy) | -3.83% | 12.33% |

ASHRAE Guideline 14-2002 [49] defines the evaluation criteria for calibrating BEM models. According to the Guideline, monthly and hourly data can be used for calibration. Mean Bias Error (MBE) and Coefficient of Variation of the Root Mean Squared Error (CVRMSE) are used as evaluation indices. The guideline states that the model should have an MBE within 5% and a CVRMSE within 15% relative to monthly calibration data. If hourly calibration data are used, these requirements are 10% and 30%, respectively. In our case, hourly data is used to calculate the error metrics for the zone temperature.
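For reference, a minimal sketch of the two indices as they are commonly computed for Guideline 14-style calibration is given below (some formulations subtract the number of model parameters from n in the RMSE denominator; this sketch uses the simple form):

```python
import numpy as np

def mbe(measured, simulated):
    # Mean Bias Error (%): total bias relative to the total measured value.
    m = np.asarray(measured, dtype=float)
    s = np.asarray(simulated, dtype=float)
    return 100.0 * np.sum(m - s) / np.sum(m)

def cvrmse(measured, simulated):
    # Coefficient of Variation of the RMSE (%): RMSE relative to the measured mean.
    m = np.asarray(measured, dtype=float)
    s = np.asarray(simulated, dtype=float)
    rmse = np.sqrt(np.mean((m - s) ** 2))
    return 100.0 * rmse / np.mean(m)
```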
TABLE IV: Human Comfort Statistical Results for Rule Based, DDQN-HVAC and OCTOPUS Schemes

Merced:
| Method | Metric | PMV Jan | PMV Jul | Illuminance (lux) Jan | Illuminance (lux) Jul | CO2 (ppm) Jan | CO2 (ppm) Jul | Energy (kWh) Jan | Energy (kWh) Jul |
| Rule Based Method | Mean | 0.03 | -0.25 | 576.78 | 646.45 | 623.61 | 668.03 | 1990.99 | 3583.03 |
| | Std | 0.11 | 0.13 | 152.54 | 157.11 | 120.64 | 181.22 | | |
| | Violation rate | 0 | 2% | 0.94% | 0 | 0.3% | 3.629% | | |
| DDQN-HVAC [7] | Mean | -0.19 | 0.28 | 576.78 | 646.45 | 625.62 | 648.01 | 1859.10 | 3335.58 |
| | Std | 0.21 | 0.11 | 152.54 | 157.11 | 122.62 | 120.57 | | |
| | Violation rate | 2.99% | 4.4% | 0.94% | 0 | 0 | 0.2% | | |
| OCTOPUS | Mean | -0.31 | 0.27 | 587.12 | 569.88 | 594.77 | 612.33 | 1756.24 | 2941.46 |
| | Std | 0.2 | 0.10 | 382.27 | 75.83 | 111.59 | 110.35 | | |
| | Violation rate | 5.7% | 2.5% | 0.26% | 0.2% | 1.31% | 0.33% | | |

Chicago:
| Method | Metric | PMV Jan | PMV Jul | Illuminance (lux) Jan | Illuminance (lux) Jul | CO2 (ppm) Jan | CO2 (ppm) Jul | Energy (kWh) Jan | Energy (kWh) Jul |
| Rule Based Method | Mean | -0.28 | -0.15 | 583.27 | 637.07 | 610.26 | 638.33 | 3848.61 | 3309.56 |
| | Std | 0.11 | 0.02 | 163.96 | 151.37 | 63.94 | 151.37 | | |
| | Violation rate | 3.09% | 0 | 1.1% | 0 | 0 | 0 | | |
| DDQN-HVAC [7] | Mean | -0.32 | 0.24 | 583.27 | 637.07 | 612.74 | 649.32 | 3605.21 | 3078.67 |
| | Std | 0.08 | 0.07 | 163.96 | 151.37 | 65.09 | 90.16 | | |
| | Violation rate | 3.7% | 2.9% | 1.1% | 0 | 0 | 0 | | |
| OCTOPUS | Mean | -0.4 | 0.29 | 598.34 | 544.09 | 640.31 | 633.71 | 3496.54 | 2722.03 |
| | Std | 0.1 | 0.11 | 259.88 | 55.37 | 99.85 | 111.04 | | |
| | Violation rate | 4.2% | 1.47% | 1.6% | 0 | 1% | 1.31% | | |

(Energy consumption is a per-month total for each method, shown on the Mean row.)
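The violation rates in Table IV follow the definition given in Section VI-A, i.e., the time during which a comfort metric falls outside its acceptable range divided by the total simulated time; a minimal sketch:

```python
def violation_rate(values, low, high):
    # Fraction of simulated time steps for which a comfort metric
    # falls outside its acceptable range [low, high].
    violations = sum(1 for v in values if v < low or v > high)
    return violations / len(values)

# e.g., for the PMV band of [-0.5, 0.5] used in the paper:
# violation_rate(pmv_series, -0.5, 0.5)
```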
Monthly data is used to calculate the energy error metrics because energy data can only be obtained monthly. The calibration results for zone temperature and energy consumption are shown in Table III. Less than 2% NMBE and less than 6% CVRMSE for the zone temperature can be achieved with the optimal parameter setting. We found that the CVRMSE for the monthly heating and cooling energy demand is relatively large, but the NMBE and CVRMSE are still within the acceptable range. This means the model can achieve an accurate calculation of the monthly energy.

TABLE V: Parameter Settings in DRL Algorithms
| Δtc | 15 min | β1 | 0.9 |
| Minibatch Size | 64 | β2 | 0.999 |
| Learning Rate | 10^-4 | Action Dimension | 35040 |
| γ | 0.99 | Action Space | 2.37 × 10^7 |

E. OCTOPUS Training

The 10-year weather data for training from the two locations tested (Merced, CA and Chicago, IL) is randomly divided, with eight years used for training and the remaining two years used for testing. The parameter settings of our DRL algorithms are shown in Table V. In our implementation of OCTOPUS, we use the Adam optimizer [50] for gradient-based optimization with a learning rate of 10^-4. We train the agent with a minibatch size of 64 and a discount factor γ = 0.99. The target network is updated every 10^3 time steps. We use the rectified non-linearity (ReLU) [51] for all hidden layers and linear activation on the output layers. The network has two hidden layers with 512 and 256 units in the shared network module and one hidden layer per branch with 128 units. The weights are initialized using the Xavier initialization [52] and the biases are initialized to zero. We use prioritized replay with a buffer size of 10^6 and linear annealing of β from β0 = 0.4 to 1 over 2 × 10^6 steps. While an ε-greedy policy is often used with Q-learning, random exploration (with an exploration probability) in physical, continuous-action domains can be inefficient. To explore actions well in our building environment, we decided to sample actions from a Gaussian distribution with its mean at the greedy actions and with a small fixed standard deviation throughout training to encourage life-long exploration. We used a fixed standard deviation of 0.2 during training and zero during evaluation. This exploration strategy yielded mildly better performance than an ε-greedy policy with a fixed or linearly annealed exploration probability. The duration of each time (action) slot is 15 minutes. We achieved convergence of our reward function after 1000 episodes, as explained in Section VI-F.
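A minimal sketch of the Gaussian exploration strategy described above (mean at the greedy per-branch actions, standard deviation 0.2 during training and 0 during evaluation); the rounding and clipping back to discrete action indices are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def explore(greedy_actions, branch_sizes, std=0.2, training=True):
    # Sample each branch's action from a Gaussian centered at the greedy
    # action index, then round and clip back to a valid discrete action.
    if not training or std == 0.0:
        return list(greedy_actions)
    noisy = np.random.normal(loc=greedy_actions, scale=std)
    return [int(np.clip(round(a), 0, n - 1))
            for a, n in zip(noisy, branch_sizes)]
```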
VI. EVALUATION

In this section, we compare the performance of OCTOPUS with the rule-based method and the latest DRL-based method.

Fig. 12: Daily Energy Consumption of Control Methods.
Fig. 14: The Convergence of OCTOPUS.
Fig. 13: Performance Contribution of Each Subsystem.

A. Experiment Setting

The implementation of the rule-based HVAC control has been introduced in Section V-B. The rule-based method only controls the HVAC system. For the conventional DRL-based method, we implement the dueling DQN architecture used in [7], which controls the water-based heating system. We refer to that work as DDQN-HVAC in our comparison. Since these two benchmarks do not control the lighting system, for a fair comparison, we initialize the lights on in all experiments. OCTOPUS may dim the lights if the blind is open during the day. In addition, the two benchmarks always leave the blind and window systems closed.

The three human comfort metrics are measured by PMV, illuminance, and carbon dioxide concentration. We set the acceptable ranges of the three human comfort metrics according to building standards and previous experience in related work. The comfort range of PMV is set to -0.5 to 0.5 [53]. The comfort range of illuminance is set to 500-1000 lux [43]. The comfort range of carbon dioxide concentration is set to 400-1000 ppm [45].

We use the three control methods to control the building we modeled in Section V for two months (January and July) at two places with distinct weather patterns. Table IV shows the human comfort results of the three control methods and their energy consumption. The violation rate is calculated as the time during which the value of a human comfort metric falls beyond its acceptable range divided by the total simulated time. Other quality-of-service metrics, including the amount by which a violation occurred, or a combination of amount and time, will be explored in future work.

B. Human Comfort

From the results in Table IV, we see that all three methods can maintain the PMV value in the desired range most of the time, since the violation rate is low. The average PMV violation rate of OCTOPUS and DDQN-HVAC is higher than that of the rule-based method by 2.19% and 2.22%, respectively. The reason for this is that the DRL-based methods try to save more energy by setting the PMV to a value close to the boundary of the acceptable range. It can be observed in Table IV that the average PMV value of OCTOPUS and DDQN-HVAC (-0.36 and -0.26) is closer to the range boundary (-0.5), compared with the rule-based method (-0.13).

For both visual comfort and indoor air quality, the three control methods provide a very small violation rate. For illuminance, the mean illuminance value of OCTOPUS and DDQN-HVAC is 590.69 lux and 610.89 lux, respectively. OCTOPUS saves energy by utilizing natural light as much as possible. For indoor air quality, the average CO2 concentration of OCTOPUS, DDQN-HVAC, and the rule-based method is 620.28 ppm, 633.92 ppm, and 635.06 ppm, respectively. OCTOPUS adjusts both the window system and the HVAC system to maintain the CO2 concentration level within the desired range; DDQN-HVAC and the rule-based method only use the HVAC system.

C. Energy Efficiency

The results in Table IV reveal that OCTOPUS saves 14.26% and 8.1% energy on average, compared with the rule-based control method and DDQN-HVAC, respectively. In both cities, OCTOPUS achieves a similar performance gain. OCTOPUS reduces the energy consumption of HVAC by using the other subsystems. Figure 12 shows the daily energy consumption of the three control methods in January at Merced. On most days, OCTOPUS consumes less energy than the other two methods; however, OCTOPUS is not always the best, although we see clear gains towards the second half of the month due to a change in weather temperature. The average range of outdoor temperature changes from 2 °C ∼ 13 °C in the first half of the month to -1 °C ∼ 18 °C in the second half of the month. OCTOPUS could use external air with the window open for more natural ventilation.

In Table IV, compared to the rule-based method and DDQN-HVAC, OCTOPUS saves more energy in July (17.6% and 11.7%) than in January (10.05% and 3.9%). In July, the outdoor air temperature range at Merced and Chicago is 15 °C
∼ 42 °C and 15 °C ∼ 40 °C, respectively. The window can be opened when the temperature is within the acceptable range, in order to save the energy consumed by the HVAC system. However, in January, due to the cold weather at both places, the windows stay closed most of the time and cannot make much contribution to energy savings.

D. Performance Decomposition

We implement four versions of OCTOPUS to study the energy-saving contribution of each subsystem, i.e., OCTOPUS just with the HVAC system (OCTOPUS_HVAC), OCTOPUS with HVAC and lighting (OCTOPUS_HVAC_L), OCTOPUS with HVAC, lighting and blind (OCTOPUS_HVAC_L_B), and OCTOPUS with all four subsystems (OCTOPUS_HVAC_L_B_W). Figure 13 depicts the energy consumption of these four versions in two different months and at two different places (Merced and Chicago). Compared with the rule-based method, OCTOPUS_HVAC can save 6.16% more energy by only considering HVAC. When the lighting system is added in OCTOPUS_HVAC_L, 2.73% more energy can be saved. If the blind system is further added in OCTOPUS_HVAC_L_B, 1.93% more energy can be saved. Finally, when the window system is added in OCTOPUS_HVAC_L_B_W, 3.44% more energy can be saved. The four subsystems make different contributions to energy saving in January and July. In January, the four subsystems (i.e., HVAC, lighting, blind and window) make 6.16%, 2.73%, 1.93% and 0% contributions to energy savings, respectively. In July, the contributions of these subsystems change to 5.9%, 3.31%, 1.99%, and 6.4%, respectively. The most obvious difference between these two months is made by the window system (6.4%). The reason for this has been explained above: in January, the windows are closed almost all the time; in July, the cooler outdoor air is used to cool down the building instead of using the HVAC system.

E. Hyperparameters Setting

The hyperparameters in the reward function (Equation 3) are tuned to balance energy consumption and human comfort. Table VI shows the performance results of the trained DRL agents in the selected experiments of the hyperparameter tuning. The total energy consumption and the mean and standard deviation of the PMV, illuminance and carbon dioxide concentration are used as the evaluation metrics. It is interesting to find that the control performance results of the different hyperparameters are not intuitive. For example, we would expect a bigger ρ1 and smaller ρ2, ρ3, ρ4 to lead to lower energy consumption while just meeting the requirements of thermal comfort, visual comfort and indoor air condition. However, the results in Table VI show that when increasing the weight of energy, energy consumption does not necessarily decrease. Such counter-intuitive results are possibly caused by the delayed reward problem: the DRL agents get stuck in local optima during training. Out of the five experiments in Table VI, the fourth row saves 17.9% of the energy consumption with only slightly worse human comfort quality in the testing model, which achieves comparably the best balance between human comfort and energy consumption. Therefore, the parameters in the fourth row are used for the trained agent.

F. Convergence of OCTOPUS Training

Figure 14 shows the accumulated reward of OCTOPUS in each episode during a training process. We calculate the reward function every control time step (15 minutes), and thus one episode (one month) contains 2880 time steps. The accumulated reward of one episode (the episode reward in Figure 14) is the sum of the rewards of these 2880 time steps. From the results in Figure 14, we see that the episode reward increases and tends to become stable as the number of training episodes increases. When the episode reward does not change much, it means that we cannot do more to improve the learned control policy, and thus the training process has converged. As indicated in Figure 14, the training reward fluctuates between two adjacent episodes, because the number of time steps in one episode is large, i.e., 2880. The rewards calculated at some of these 2880 time steps may vary dynamically because we randomly choose some time steps according to an exploration rate (determined by a Gaussian distribution with a standard deviation of 0.2). At these time steps, we do not use the action generated by the agent, but randomly choose an action to avoid converging to a local minimum. If we smooth the episode reward using a sliding window of 10 episodes, the average reward in Figure 14 is more stable during training.

VII. DISCUSSION

Deploying in a Real Building. Although we have developed a calibrated simulation model of a real building on our campus for training and evaluation, we have not deployed OCTOPUS in the building, because we do not have
access to automatic blind and window systems at the moment. We are seeking financial support to work with our facility team for a possible upgrade. OCTOPUS is designed for real deployment in buildings. For a new building, we need to build an EnergyPlus model for it and calibrate the model using real building operation data. After training the OCTOPUS control agent using the calibrated simulation model and real weather data, we can deploy the trained agent in the building for real-time control. For a certain action interval (e.g., every 10 mins), the OCTOPUS control agent takes the state of the building as input and generates the control actions of the four subsystems. OCTOPUS can provide real-time control, as one inference only takes 22 ms. We plan to deploy OCTOPUS in a real building in our future work.

Scalability of OCTOPUS. OCTOPUS can work in a one-zone building with one HVAC system, lighting zone, blind and window. However, a realistic building (or even a small home) is usually equipped with many lighting zones, blinds and windows, which may take different actions within one subsystem. OCTOPUS may solve this scalability problem by increasing the number of BDQ branches, i.e., each branch corresponds to one subsystem in each zone of a building. We will tackle this scalability problem in our future work.

Building Model Calibration. A critical component of our architecture is the use of a calibrated building model that is close to the target building, allowing us to generate sufficient data for our training needs. However, getting a calibrated model "right" is a tedious process of trial-and-error over a large number of parameters. Out of the thousands of parameters available in EnergyPlus, we used our experience and consulted experts to determine both the most important parameters and a sensible range of values to explore (it took us four weeks to get it "right"). However, there is no magic bullet, and this may become a problem, especially for unusual building architectures or specialized HVAC systems that may not be trivial to replicate in a simulation environment.

Accepting Users' Feedback. Some existing work [54] allows users to send their feedback to the control server. The feedback can represent a user's personalized preference on different human comfort metrics and will be considered in the control decision process. OCTOPUS can easily accept users' feedback to train a better agent model by making a small modification, i.e., changing the calculated comfort values in the reward function according to the users' feedback. This can be used for the initial training or for updated training (once deployed). For example, the OCTOPUS control agent can be trained incrementally at a certain time interval (e.g., one month). The newly-trained agent will then be used for real-time control.

VIII. CONCLUSIONS

This paper proposes OCTOPUS, a DRL-based control system for buildings that holistically controls many subsystems in modern buildings (e.g., HVAC, light, blind, window) and manages the trade-offs between energy use and human comfort. As part of our architecture, we develop a system that addresses the issues of a large action space, a novel reward function based on energy and comfort, and the data requirements for training, using existing historical weather data together with a calibrated simulator for the target building. We compare our results with both a state-of-the-art rule-based control scheme obtained from a LEED Gold certified building and a DRL scheme used for optimized heating in the literature, and show that we can get 14.26% and 8.1% energy savings while maintaining (and sometimes even improving) human comfort values for temperature, air quality and lighting.

REFERENCES

[1] X. Ding, W. Du, and A. Cerpa, "Octopus: Deep reinforcement learning for holistic smart building control," in ACM BuildSys, 2019.
[2] O. Lucon, D. Ürge-Vorsatz, A. Z. Ahmed, H. Akbari, P. Bertoldi, L. F. Cabeza, N. Eyre, A. Gadgil, L. Harvey, Y. Jiang et al., "Buildings," 2014.
[3] H. Rajabi, Z. Hu, X. Ding, S. Pan, W. Du, and A. Cerpa, "MODES: Multi-sensor occupancy data-driven estimation system for smart buildings," in Proceedings of the Thirteenth ACM International Conference on Future Energy Systems, 2022, pp. 228–239.
[4] W. W. Shein, Y. Tan, and A. O. Lim, "PID controller for temperature control with multiple actuators in cyber-physical home system," in IEEE NBiS, 2012.
[5] A. Beltran and A. E. Cerpa, "Optimal HVAC building control with occupancy prediction," in ACM BuildSys, 2014.
[6] B. Li and L. Xia, "A multi-grid reinforcement learning method for energy conservation and comfort of HVAC in buildings," in IEEE CASE, 2015.
[7] Z. Zhang and K. P. Lam, "Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system," in ACM BuildSys, 2018.
[8] T. Wei, Y. Wang, and Q. Zhu, "Deep reinforcement learning for building HVAC control," in ACM DAC, 2017.
[9] (2019). [Online]. Available: https://www.geze.com/en/discover/topics/natural-ventilation/
[10] (2019). [Online]. Available: https://www.buildings.com/article-details/articleid/12969/title/operable-windows-for-operating-efficiency
[11] A. Tzempelikos and A. K. Athienitis, "The impact of shading design and control on building cooling and lighting demand," Solar Energy, 2007.
[12] Z. Cheng, Q. Zhao, F. Wang, Y. Jiang, L. Xia, and J. Ding, "Satisfaction based Q-learning for integrated lighting and blind control," Energy and Buildings, 2016.
[13] L. Wang and S. Greenberg, "Window operation and impacts on building energy consumption," Energy and Buildings, vol. 92, pp. 313–321, 2015.
[14] S. Privara, J. Cigler, Z. Váňa, F. Oldewurtel, C. Sagerschnig, and E. Žáčeková, "Building modeling as a crucial part for building predictive control," Energy and Buildings, vol. 56, pp. 8–22, 2013.
[15] D. Kumar, X. Ding, W. Du, and A. Cerpa, "Building sensor fault detection and diagnostic system," in Proceedings of the 8th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, 2021, pp. 357–360.
[16] P. Rockett and E. A. Hathway, "Model-predictive control for non-domestic buildings: a critical review and prospects," Building Research & Information, 2017.
[17] E. Atam and L. Helsen, "Control-oriented thermal modeling of multizone buildings: methods and issues: intelligent control of a building system," IEEE Control Systems Magazine, 2016.
[18] J. Zhao, K. P. Lam, and B. E. Ydstie, "EnergyPlus model-based predictive control (EPMPC) by using MATLAB/Simulink and MLE+," 2013.
[19] O. T. Karaguzel and K. P. Lam, "Development of whole-building energy performance models as benchmarks for retrofit projects," in Proceedings of the 2011 Winter Simulation Conference (WSC). IEEE, 2011.
[20] B. Gu, S. Ergan, and B. Akinci, "Generating as-is building information models for facility management by leveraging heterogeneous existing information sources: A case study," in Construction Research Congress 2014: Construction in a Global Network, 2014.
[21] H. T. Dinh and D. Kim, "MILP-based imitation learning for HVAC control," IEEE Internet of Things Journal, 2021.
[22] C. Agbi, Z. Song, and B. Krogh, "Parameter identifiability for multi-zone building models," in 2012 IEEE 51st Conference on Decision and Control (CDC). IEEE, 2012.
[23] D. Kolokotsa, G. Stavrakakis, K. Kalaitzakis, and D. Agoris, "Genetic algorithms optimized fuzzy controller for the indoor environmental management in buildings implemented using PLC and local operating networks," Engineering Applications of Artificial Intelligence, 2002.
[24] M. Liu, X. Ding, and W. Du, "Continuous, real-time object detection on mobile devices without offloading," in 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2020, pp. 976–986.
[25] H. Zhu, V. Gupta, S. S. Ahuja, Y. Tian, Y. Zhang, and X. Jin, "Network planning with deep reinforcement learning," in Proceedings of the 2021 ACM SIGCOMM Conference, 2021, pp. 258–271.
[26] X. Ding and W. Du, "DRLIC: Deep reinforcement learning for irrigation control," in 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 2022, pp. 41–53.
[27] Z. Shen, K. Yang, W. Du, X. Zhao, and J. Zou, “Deepapp: A
deep reinforcement learning framework for mobile application usage
prediction,” in ACM SenSys, 2019.
[28] P. Fazenda, K. Veeramachaneni, P. Lima, and U.-M. O’Reilly, “Using
reinforcement learning to optimize occupant comfort and energy usage
in hvac systems,” Journal of Ambient Intelligence and Smart Environ-
ments, vol. 6, no. 6, pp. 675–690, 2014.
[29] Z. Zhang, A. Chong, Y. Pan, C. Zhang, and K. P. Lam, “Whole building
energy model for hvac optimal control: A practical framework based on
deep reinforcement learning,” Energy and Buildings, 2019.
[30] K. Dalamagkidis, D. Kolokotsa, K. Kalaitzakis, and G. S. Stavrakakis,
“Reinforcement learning for energy conservation and comfort in build-
ings,” Building and environment, 2007.
[31] D. Van Le, Y. Liu, R. Wang, R. Tan, Y.-W. Wong, and Y. Wen,
“Control of air free-cooled data centers in tropics via deep reinforcement
learning,” in Proceedings of the 6th ACM International Conference on
Systems for Energy-Efficient Buildings, Cities, and Transportation, 2019.
[32] D. V. Le, R. Wang, Y. Liu, R. Tan, Y.-W. Wong, and Y. Wen, “Deep
reinforcement learning for tropical air free-cooled data center control,”
ACM Transactions on Sensor Networks (TOSN), 2021.
[33] J. R. Vazquez-Canteli, G. Henze, and Z. Nagy, “Marlisa: Multi-agent
reinforcement learning with iterative sequential action selection for
load shaping of grid-interactive connected buildings,” in Proceedings of
the 7th ACM International Conference on Systems for Energy-Efficient
Buildings, Cities, and Transportation, 2020.
[34] X. Ding, W. Du, and A. E. Cerpa, “Mb2c: Model-based deep reinforce-
ment learning for multi-zone building control,” in Proceedings of the 7th
ACM international conference on systems for energy-efficient buildings,
cities, and transportation, 2020, pp. 50–59.
[35] Z. Zhang, A. Chong, Y. Pan, C. Zhang, S. Lu, and K. P. Lam, “A
deep reinforcement learning approach to using whole building energy
model for hvac optimal control,” in 2018 Building Performance Analysis
Conference and SimBuild, 2018.
[36] G. Gao, J. Li, and Y. Wen, “Deepcomfort: Energy-efficient thermal
comfort control in buildings via reinforcement learning,” IEEE Internet
of Things Journal, 2020.
[37] U. D. of Energy. (2016) Energyplus 8.6.0. [Online]. Available:
https://energyplus.net/
[38] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning,” in AAAI 2016, 2016.
[39] K. Ito and K. Kunisch, Lagrange multiplier approach to variational
problems and applications. Siam, 2008.
[40] A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures
for deep reinforcement learning,” in AAAI, 2018.
[41] P. O. Fanger et al., “Thermal comfort. analysis and applications in
environmental engineering.” Thermal comfort. Analysis and applications
in environmental engineering., 1970.
[42] P. Fanger, “Moderate thermal environments determination of the pmv
and ppd indices and specification of the conditions for thermal comfort,”
ISO 7730, 1984.
[43] D. C. Pritchard, Lighting. Routledge, 2014.
[44] S. J. Emmerich and A. K. Persily, State-of-the-art review of CO2 demand
controlled ventilation technology and application. Citeseer, 2001.
[45] A. S. 62.1, “Ventilation for acceptable indoor air quality,” 2016.
[46] (2018) SketchUp. [Online]. Available: https://www.sketchup.com
[47] M. Wetter, “Co-simulation of building energy and control systems with
the building controls virtual test bed,” Journal of Building Performance
Simulation, 2011.
[48] A. Beltran, V. L. Erickson, and A. E. Cerpa, "ThermoSense: Occupancy thermal based sensing for HVAC control," in ACM BuildSys Workshop, 2013.
[49] ASHRAE, "Guideline 14-2002, measurement of energy and demand savings," American Society of Heating, Ventilating, and Air Conditioning Engineers, Atlanta, Georgia, 2002.
[50] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[51] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in AISTATS, 2011.
[52] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, 2010.
[53] ASHRAE Standard 55, "Thermal environmental conditions for human occupancy," 2017.
[54] D. A. Winkler, A. Beltran, N. P. Esfahani, P. P. Maglio, and A. E. Cerpa, "FORCES: Feedback and control for occupants to refine comfort and energy savings," in ACM UbiComp, 2016.