
Exploring Deep Reinforcement Learning for Holistic Smart Building Control

Xianzhong Ding, Alberto Cerpa, Member, IEEE, and Wan Du, Member, IEEE

Abstract—Recently, significant efforts have been made to improve the quality of comfort for commercial buildings' users while also trying to reduce energy use and costs. Most of these efforts have concentrated on energy efficient control of the HVAC (Heating, Ventilation, and Air Conditioning) system, which is usually the core system in charge of controlling buildings' conditioning and ventilation. However, in practice, HVAC systems alone cannot control every aspect of conditioning and comfort that affects buildings' occupants. Modern lighting, blind and window systems, usually considered as independent systems, can, when present, significantly affect building energy use and, perhaps more importantly, user comfort in terms of thermal, air quality and illumination conditions. For example, it has been shown that a blind system can provide a 12%∼35% reduction in cooling load in summer while also improving visual comfort. In this paper, we take a holistic approach to deal with the trade-offs between energy use and comfort in commercial buildings. We developed a system called OCTOPUS, which employs a novel deep reinforcement learning (DRL) framework that uses a data-driven approach to find the optimal control sequences of all of a building's subsystems, including the HVAC, lighting, blind and window systems. The DRL architecture includes a novel reward function that allows the framework to explore the trade-offs between energy use and users' comfort, while at the same time enabling the solution of the high-dimensional control problem due to the interactions of four different building subsystems. In order to cope with OCTOPUS's data training requirements, we argue that calibrated simulations that match the target building's operational points are the vehicle to generate enough data to train our DRL framework to find the control solution for the target building. In our work, we trained OCTOPUS with 10 years of weather data and a building model implemented in the EnergyPlus building simulator, which was calibrated using data from a real production building. Through extensive simulations we demonstrate that OCTOPUS can achieve 14.26% and 8.1% energy savings compared with the state-of-the-art rule-based method in a LEED Gold Certified building and the latest DRL-based method available in the literature, respectively, while maintaining human comfort within a desired range.

Index Terms—HVAC, Energy efficiency, Optimal control, Deep reinforcement learning

A preliminary version of this work was published in the Proceedings of ACM BuildSys 2019 [1].

I. INTRODUCTION

Energy saving in buildings is important to society, as buildings consume 32% of energy and 51% of electricity demand worldwide [2], [3]. Rule-based control (RBC) is widely used to set the actuators (e.g., heating or cooling temperature, and fan speed) in the HVAC (heating, ventilation, and air-conditioning) system. The "rules" in RBC are usually set as static thresholds or simple control loops based on the experience of engineers and facility managers. The thresholds and simple control rules may not be optimal and have to be adapted to new buildings at commissioning time. Many times these rules are updated in an ad-hoc manner, based on experience, feedback from occupants, and/or trial and error performed by HVAC engineers during the operational use of the building. As a result, many model-based approaches have been developed to model the thermal dynamics of a building and execute a control algorithm on top of the model, such as Proportional Integral Derivative (PID) control [4] and Model Predictive Control (MPC) [5]. However, the complexity of the thermal dynamics and the various influencing factors are hard to model precisely, which is why the models tend to be simplified in order to deal with the parameter-fitting data requirements and computational complexity when solving the optimization problem [5].

To tackle the limitations of model-based methods, some model-free approaches have been proposed based on reinforcement learning (RL) for HVAC control, including Q-learning [6] and Deep Reinforcement Learning (DRL) [7]. With RL, an optimal control policy can be learned through trial-and-error interaction between a control agent and a building, without explicitly modeling the system dynamics. By adopting a deep neural network as the control agent, DRL-based schemes can handle the large state and action spaces in building control [7]. Some recent work [7], [8] has shown that DRL can provide real-time control for building energy efficiency. However, all existing methods only consider a single subsystem in buildings, e.g., the HVAC system [8] or the heating system [7], ignoring other subsystems that can affect performance from the energy use and/or user comfort point of view.

At present, more and more buildings are being equipped with automatically-adjustable windows and blinds. For example, motor-operated windows and blinds, like the intelligent products from GEZE [9], have been installed to implement an effective natural ventilation strategy [10]. In addition, researchers have studied the potential for energy saving by jointly controlling the HVAC system and one other subsystem, like blinds [11], lighting [12], or windows [13]. For example, the energy consumed by HVAC can be reduced by 17%∼47% if window-based natural ventilation is enabled [13].

In this work, we argue that a holistic approach that considers all available subsystems (HVAC, blinds, windows, lights) in buildings, which have complex and non-trivial interactions, should be used in coordination to achieve a specific energy efficiency/comfort goal. Figure 1 shows a depiction of a modern building that includes multiple subsystems (e.g., HVAC, window, blind and lighting) that work together to guarantee human comfort goals, including thermal comfort, visual comfort, and indoor air quality goals.
Fig. 1: Four Subsystems in a Typical Building.

Fig. 2: Relationship between Four Subsystems and Three Human Comfort Metrics.

For example, indoor temperature can be influenced by three subsystems: setting the HVAC temperature (adjusting the discharge temperature set points at the VAV level), adjusting the blind slats (allowing external sunlight to heat indoor air), and/or the window system (enabling exchange of indoor and outdoor air).

To achieve more efficient energy management in buildings, we propose to study the joint control problem of four subsystems of a building to meet three human comfort metrics, as depicted in Figure 2. The energy consumption of a building is determined by the four subsystems and their interaction. It is challenging to control four subsystems jointly, since they may have opposite outcomes on different human comfort metrics. For example, opening the window can improve indoor air quality and save the energy consumed by the HVAC system for ventilation, but it may also reduce (in winter) or increase (in summer) the indoor temperature. To handle the temperature variation caused by the open window, the HVAC system may need to spend more energy than the energy saved by natural ventilation.

This paper presents a customized DRL-based control system, named OCTOPUS, which controls the four subsystems of a building to meet three human comfort requirements with the best energy efficiency. It leverages all the advantages of DRL-based control, including fast adaptation to new buildings, real-time actuation and the ability to handle a large state space. However, to control four subsystems jointly in a unified framework, we need to tackle three main challenges:

High-Dimension Control Actions. With a uniform DRL framework, OCTOPUS needs to decide a control action for the four subsystems jointly and periodically, including the heating/cooling air temperature of the HVAC system, the brightness level of the electric lights, the blind slat angle, and the open proportion of the window. Each subsystem adds one dimension to the action space. The goal of OCTOPUS is to select the best action combination As from the set of all possible combinations Aall that meets the requirement of human comfort with the lowest energy consumption. Since each subsystem can set its actuator to a large number of discrete values, e.g., we have 66 possible values to set the zone temperature with the HVAC system, the set of all possible action combinations Aall is extremely large, i.e., 2,371,842 actions in our case.

To solve this problem, we leverage a novel neural architecture featuring a shared representation followed by four network branches, one for each action dimension. In addition, from the shared representation, a state value is obtained that links the joint interrelations in the action space, and it is added to the output of the four branches. This approach achieves a linear increase in the number of network outputs by allowing independence for each action dimension.

Reward Function. To explore the potential energy savings across the four subsystems while considering the three human comfort metrics, we formulate this problem as an optimization problem. We define a reward function in our DRL framework to solve the optimization problem. The novel reward function jointly combines energy consumption, thermal comfort, visual comfort, and indoor air quality, offering better control and more flexibility to meet the unique requirements of users.

Data Training Requirements. While model-free approaches in general, and RL techniques in particular, are very powerful, their main weakness is the amount of data required to train them properly. The amount of training data should be in proportion to the action space, which in our case is very large. This issue is very important since we cannot expect building stakeholders to have years of building data readily available so we can use OCTOPUS. Instead, we use a calibrated building simulator combined with readily available weather data in order to generate as much training data as we need. We trained our OCTOPUS system with 10 years of weather data from two areas, Merced, CA and Chicago, IL, due to their distinct weather characteristics. The critical point is that this method allows us to train OCTOPUS for any building under any weather profile, as long as there is a repository of weather data for the location and a few months of building data to perform the calibration of the simulator.

We highlight the main contributions of the paper as follows:
• To the best of our knowledge, this is the first work that leverages DRL to balance the tradeoff between energy use and human comfort in a holistic manner.
• OCTOPUS adopts a special reward function and a new DRL architecture to tackle the challenges imposed by the combined joint control of four subsystems with a very large action space.
• We tackle the issue of data training requirements by adopting a simulation strategy for data generation, and spending effort in calibrating the simulations to make them as close as possible to the target building. This allows our system to generate as much data as needed within a finite amount of time.

II. RELATED WORK

Conventional control of the HVAC system. Model predictive control (MPC) models have been developed for HVAC control.
MPC is a planning-based method that solves an optimal control problem iteratively over a receding time horizon. Some of the advantages of MPC are that it takes future disturbances into consideration and that it can handle multiple constraints and objectives, e.g., energy and comfort [5].

However, it can be argued that the main roadblock preventing widespread adoption of MPC is its reliance on a model [14], [15]. By some estimates, modeling can account for up to 75% of the time and resources required for implementing MPC in practice [16]. Because buildings are highly heterogeneous, a custom model is required for each thermal zone or building under control.

There are two paradigms for modeling building dynamics: physics-based and statistics-based [14]. Physics-based models, e.g., EnergyPlus, utilize physical knowledge and material properties of a building to create a detailed representation of the building dynamics. A major shortcoming is that such models are not control-oriented. Nonetheless, it is not impossible to use such models for control [17]. For instance, exhaustive search optimization has been used to derive a control policy for an EnergyPlus model [18]. Furthermore, physics-based models require significant modeling effort, because they have a large number of free parameters to be specified by engineers (e.g., 2,500 parameters for a medium-sized building [19]), and the information required for determining these parameters is scattered across different design documents [20].

Statistical models assume a parametric model form, which may or may not have physical underpinnings, and identify model parameters directly from data. Dinh et al. [21] propose a hybrid control that combines MPC and direct imitation learning to reduce energy cost while maintaining a comfortable indoor temperature. While this approach is potentially scalable, a practical problem is that the experimental conditions required for accurate identification of building systems fall outside of normal building operations [22].

Conventional control of multiple subsystems. Blind systems should be considered an integral part of fenestration system design for commercial and office buildings, in order to balance daylighting requirements against the need to reduce solar gains. The impact of glazing area, shading device properties and shading control on building cooling and lighting demand was calculated using a coupled lighting and thermal simulation module [11]. The interactions between cooling and lighting energy use in perimeter spaces were evaluated as a function of window-to-wall ratio and shading parameters.

The impact of window operation on building performance was investigated [13] for different types of ventilation systems, including natural ventilation, mixed-mode ventilation, and conventional VAV systems, in a medium-size reference office building. While the results highlighted the impacts of window operation on energy use and comfort and identified HVAC energy savings with mixed-mode ventilation during summer for various climates, the control of the window opening fraction was estimated by experience and is not scalable to different kinds of buildings.

Kolokotsa et al. [23] develop an energy efficient fuzzy controller based on a genetic algorithm to control four subsystems (HVAC, lighting, window, and blind) and meet the occupant requirements of human comfort. However, the genetic algorithm requires a few minutes to hours to generate one control action and thus is not practical for real building control.

RL-based control of the HVAC system. With the development of deep learning [24], [25] and deep reinforcement learning [26], [27], many works apply RL to HVAC control. RL can be a "model-free" control method, i.e., an RL agent has no prior knowledge about the controlled process. RL learns an optimal control strategy by "trial-and-error". Therefore, it can be an online learning method that learns an optimal control strategy during actual building operations. Pedro et al. [28] investigated the application of a reinforcement-learning-based supervisory control approach, which actively learns how to appropriately schedule thermostat temperature setpoints. However, in HVAC control, online learning may introduce unstable and poor control actions at the initial stage of learning. In addition, it may take a long time (e.g., over 50 days reported in [28]) for an RL agent to converge to a stable control policy in some cases. Therefore, some studies choose to use an HVAC simulator to train the RL agent offline [29].

Unlike MPC, simulators of arbitrarily high complexity can be directly used to train RL agents because of RL's "model-free" nature. Li et al. [6] adopt Q-learning for HVAC control. Dalamagkidis et al. [30] design a Linear Reinforcement Learning Controller (LRLC) using linear function approximation of the state-action value function to meet thermal comfort with minimal energy consumption. However, tabular Q-learning approaches are not suitable for problems with a large state space, like the state of four subsystems. Le et al. [31], [32] propose a DRL-based control method for air free-cooled data centers in the tropics. Vazquez-Canteli et al. [33] develop a multi-agent RL implementation for load shaping of grid-interactive connected buildings. Ding et al. [34] design a model-based RL method for multi-zone building control. Zhang et al. [7], [35] implement and deploy a DRL-based control method for radiant heating systems in a real-life office building. Gao et al. [36] propose a deep deterministic policy gradient (DDPG)-based approach for learning a thermal comfort control policy. Although the above works can improve the performance of HVAC control, they only focus on the HVAC subsystem.

III. MOTIVATION

In this section, we perform a set of preliminary simulations in EnergyPlus [37] in order to understand the relationships between the different subsystems and their impact on human comfort in a building, as described in Figure 2. This also serves to gain trust that the simulator is being run correctly, with intuitive results that can be understood.

Our goal is to study the effect of the different subsystems on the three human comfort metrics. A single-floor office building of 100 m2 at Merced, California is modeled. The building is equipped with a north-facing single-panel window of 2 m2 and an interior blind. The simulations are conducted with weather data for the month of October. This is a shoulder season, with outdoor temperatures being a bit cold, but mostly sunny days, i.e., high solar gain.
Fig. 3: Thermal Comfort, PMV
Fig. 4: Visual Comfort, Illuminance
Fig. 5: Temperature Effect

Figure 3 shows the effect of the three subsystems on thermal comfort. Predictive Mean Vote (PMV) is used to evaluate thermal comfort. A PMV value close to zero represents the best thermal comfort, with higher positive values meaning people are hot, and lower negative values meaning people are cold. A detailed description of PMV values and ranges will be provided in Section IV-D2. The baseline case (green-solid) in Figure 3 shows the case when all three subsystems are closed. This case acts like a "fishtank" model, where the only effect in the room is due to the solar gain during the day, with no other interactions through any system but the window.

When only the blind is open (blue-dashed), the PMV value changes from 1.45 to 1.75, showing an increase in temperature due to the increased solar gain. This is most prominent in the middle of the day, when the sun is at its apex. When the window is open (red-dashed-dot), the PMV value is lowered due to the temperature effect: colder outside air enters the room, producing a colder, more comfortable temperature. The HVAC system (black-dot) can maintain the PMV value within an acceptable range (between -0.5 and +0.5) by forcing air at the correct temperature through the room vents. From the results of Figure 3, we can conclude that all three subsystems have an obvious impact on thermal comfort.

Figure 4 shows the illuminance measured at a place close to the window from 5 am to 7 pm when the blind is open (green-solid) and the room has natural light. Illuminance values from 500-1000 lux or higher are acceptable in most environments. We clearly see that with the blind open, the values are within this range for most of the day.

Figure 5 shows the indoor temperature when the blind is open (red-dashed) or closed (blue-solid). The outdoor temperature (green-dash-dot) is lower than the indoor temperature, due to the "fish tank" effect and the lack of an open window or a running HVAC system during the day. Combining the results from Figures 4 and 5, we see that the blind system can save the energy consumed by the lighting system by reducing the need for artificial light, but it may also increase the energy used by the HVAC system in order to maintain the load. However, for lower outdoor temperatures in winter, the sunlight through the blind can increase the indoor temperature and save energy in the HVAC system.

These simulations are conducted to show some examples of the non-trivial interactions between subsystems and human comfort. It is challenging to quantify the complex relationships among the different subsystems and the three human comfort metrics, and this serves as motivation for our work.

IV. DESIGN OF OCTOPUS

In this section, we describe in detail the design of OCTOPUS, including a system overview, DRL-based building control, the branching dueling Q-network, and the reward function calculation.

A. OCTOPUS Overview

The design goal of OCTOPUS is to meet the requirement of human comfort through energy efficient control of the four subsystems in a building.

Our goal is to minimize the energy E consumed by all subsystems in the building, including the energy used in the heating/cooling coils to heat and cool the air, the electricity used by the water pumps and flow fans in the HVAC system, the electricity used by the lights, and the electricity used by the motors that adjust the blinds and windows.

The value of E is constantly being affected by the vector As, which is an action combination for the four subsystems and belongs to the set Aall of all possible action combinations.

In addition to the minimization of energy, we would like to maintain the human comfort metrics within a particular range. This can be expressed as Pmin ≤ PMV ≤ Pmax, Vmin ≤ V ≤ Vmax, and Imin ≤ I ≤ Imax. PMV is a parameter that measures thermal comfort; V measures visual comfort; and I measures indoor air quality. The consumed energy E and the human comfort metrics (PMV, V, and I) are determined by the current state of all four subsystems, the outdoor weather and the action we are about to take. They can be measured in real buildings or calculated in a building simulator, like EnergyPlus, after the action is executed.

The achieved human comfort results should fall into an acceptable range to meet the requirements of users. We use [Pmin, Pmax], [Vmin, Vmax], [Imin, Imax] to represent the accepted ranges for thermal comfort, visual comfort and indoor air quality. They can be set by individual users according to their preference, or by facility managers based on building standards. The details on the calculation of the above parameters (E, PMV, V and I), the definition of an action (As) and the settings of the human comfort ranges (e.g., [Pmin, Pmax]) will be introduced in Section IV-D.
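Putting the overview together, the control goal just described can be written as a constrained optimization problem. The following formulation is a sketch assembled from the quantities defined in this subsection (E, PMV, V, I and the user-selected comfort bounds); it is not an equation quoted from the paper:

```latex
% Sketch of the holistic control objective in Section IV-A.
% A_s is one joint action for the four subsystems, chosen from A_all
% at every control interval; the bounds are the accepted comfort ranges.
\min_{A_s \in A_{all}} \; E(A_s)
\quad \text{subject to} \quad
P_{min} \le PMV \le P_{max}, \quad
V_{min} \le V \le V_{max}, \quad
I_{min} \le I \le I_{max}
```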
Our goal is to find the best As from Aall for each action interval (15 minutes in our implementation). The best As should maintain the three human comfort metrics in their acceptable ranges for the entire control interval with the lowest energy consumption (E).
Fig. 6: OCTOPUS Architecture with Four Subsystems (including HVAC, lighting, blind and window systems)

To achieve this goal, we implement a DRL-based control system for buildings. Figure 6 shows the overview of OCTOPUS as a building control system. It consists of three layers, i.e., the building layer, the control layer, and the user demand layer. The building layer is composed of the real building or a building simulation model, and the sensor data management components. It provides sensor data to the control layer and executes the control actions generated by the latter. The user demand layer quantifies the user requirements for the three human comfort metrics. The range of each human comfort metric is then passed to the control layer, which searches for the optimal control to meet the human comfort ranges with minimal energy consumption.

B. DRL-based Building Control

1) Basics of DRL and DQN: In a standard RL framework, as shown in Figure 7, an agent learns an optimal control policy by trying different control actions on the environment. In our case, the environment is a building simulation model, due to the extensive data requirements to train the system. With DRL, the agent is implemented as a deep neural network (DNN). The agent-environment interaction of one step can be expressed as a tuple (St, At, St+1, Rt+1), where St is the environment's state at time t, At is the control action performed by the agent at time t, St+1 is the resulting environment's state after the agent has taken the action, and Rt+1 is the reward received by the agent from the environment. The goal of DNN agent training is to learn an optimal control policy that maximizes the accumulated returned reward by taking different control actions.

Fig. 7: Reinforcement Learning Model.

2) State in OCTOPUS: The state is what the DRL agent takes as input at each control step. In this study, the state is a stack of the current and historical observations, as shown below:

S = \{ob_t, ob_{t-1}, \ldots, ob_{t-n}\},   (1)

where t is the current time step, n is the number of historical time steps to be considered, and each ob consists of the following 15 items: outdoor air temperature (°C), outdoor air relative humidity (%), indoor air temperature (°C), indoor air relative humidity (%), diffuse solar radiation (W/m2), direct solar radiation (W/m2), solar incident angle (°), wind speed (m/s), wind direction (degrees from north), average PMV, heating setpoint of the HVAC system (°C), cooling setpoint of the HVAC system (°C), the dimming level of the lights (%), the window open percentage (%), and the blind open angle (°). All of these values can be calculated by the EnergyPlus simulation model. Min-max normalization is used to convert each item to a value within 0-1.
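The state construction just described (Equation 1 plus min-max normalization of each observation item) can be sketched as follows. The observation field names and the fixed min/max bounds used here are illustrative assumptions, not values taken from the paper; in practice the bounds would come from the building model and weather data:

```python
import numpy as np

# Illustrative per-item (min, max) bounds used for min-max normalization.
OB_BOUNDS = {
    "outdoor_temp_c": (-20.0, 45.0),
    "outdoor_rh_pct": (0.0, 100.0),
    "indoor_temp_c": (10.0, 35.0),
    # ... remaining items of the 15-element observation omitted for brevity
}

def normalize(ob):
    """Min-max normalize one observation dict to values in [0, 1]."""
    return np.array([(ob[k] - lo) / (hi - lo)
                     for k, (lo, hi) in OB_BOUNDS.items()], dtype=np.float32)

def build_state(history, n):
    """Stack ob_t with the n most recent observations (Equation 1).

    `history` is assumed to be a list of observation dicts, oldest first.
    """
    recent = history[-(n + 1):]          # the last n+1 observations
    return np.concatenate([normalize(ob) for ob in reversed(recent)])
```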
3) Action in OCTOPUS: The action is how the DRL agent controls the environment. Given the state, we want the agent to find the most suitable action combination among the HVAC, lighting, blind and window systems to balance energy consumption and the three human comfort metrics. There are four action dimensions when considering these four subsystems, represented as

A_t = \{H_t, L_t, B_t, W_t\},   (2)

where At is the action combination of the four subsystems at time t. Ht is the temperature set-point of the HVAC system, which can be set to 66 values. Lt is the dimming level of the electric lights. Bt is the blind slat angle, which can be adjusted from 0° to 180°. Wt is the open percentage of the window. Each of the latter three actuation parameters can be set to 33 values in our current implementation to achieve a proper balance between control granularity and calculation complexity. According to Equation 2, the total number of possible actions in the action space is 2,371,842 (66 × 33 × 33 × 33).
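To make the size of this joint action space concrete, the sketch below enumerates the per-branch discretizations and decodes one joint-action tuple into actuator set-points. The physical ranges mapped to each index (HVAC set-point span, 0-100% dimming, 0-180° slat angle, 0-100% window opening) are assumptions for illustration only:

```python
import numpy as np

# Number of discrete choices per action branch (HVAC, lights, blind, window).
BRANCH_SIZES = (66, 33, 33, 33)
print(int(np.prod(BRANCH_SIZES)))        # 2371842 possible joint actions

def decode_action(indices):
    """Map a joint-action tuple of branch indices to actuator set-points."""
    h, l, b, w = indices
    return {
        "hvac_setpoint_c": 15.0 + h * (33.0 - 15.0) / (BRANCH_SIZES[0] - 1),
        "light_dimming_pct": 100.0 * l / (BRANCH_SIZES[1] - 1),
        "blind_slat_deg": 180.0 * b / (BRANCH_SIZES[2] - 1),
        "window_open_pct": 100.0 * w / (BRANCH_SIZES[3] - 1),
    }

print(decode_action((32, 16, 8, 0)))
```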
Fig. 8: The Specific Action Branching Network Implemented for the Proposed BDQ Agent. (The state feeds a shared representation of 512 and 256 units; four advantage branches of 128 units each, one per subsystem (HVAC, lighting, blind, window), are combined with a state-value stream, and the argmax over each branch's Q-values yields the joint action combination.)

Existing DRL architectures, like the Deep Q-Network (DQN) in [8] and Asynchronous Advantage Actor-Critic (A3C) in [7], cannot work efficiently for our problem, because the large number of actions needs to be explicitly represented in the agent's DNN, which significantly increases the number of DNN parameters to be learned and consequently the training time [38]. To solve this problem, we leverage a novel neural architecture featuring a shared representation followed by four network branches, one for each action dimension.

4) Reward Function in OCTOPUS: The reward provides the immediate evaluation of the control effect of each action under a certain state. Both human comfort and energy consumption should be incorporated. To define the reward function, a common approach is to use the Lagrangian multiplier method [39] to first convert the constrained formulation into an unconstrained one:

R = -[\rho_1 Norm(E) + \rho_2 Norm(T_c) + \rho_3 Norm(V_c) + \rho_4 Norm(I_c)],   (3)

where ρ1, ρ2, ρ3 and ρ4 are the Lagrangian multipliers, E is energy consumption, Tc is thermal comfort, Vc is visual comfort and Ic is indoor air quality. Norm(x) is a normalization process, i.e., Norm(x) = (x - x_min)/(x_max - x_min), used to transform the energy and the three human comfort terms to the same scale. This reward function merges the objective (e.g., energy consumption) and constraint satisfaction (e.g., human comfort). The reward consists of four parts, namely, the penalty for the energy consumption of the HVAC and lighting systems, the penalty for the occupants' thermal discomfort, the penalty for the occupants' visual discomfort, and the penalty for the occupants' indoor air condition discomfort. Specifically, the reward should be lower if more energy is consumed by the HVAC system or the occupants feel uncomfortable about the building's thermal, visual or indoor air condition. The details about how to define and formulate the energy consumption E, thermal comfort Tc, visual comfort Vc and indoor air condition Ic are explained in Section IV-D.
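A minimal sketch of the reward in Equation 3 is given below. The normalization bounds and the default multiplier values are placeholders that would be set per deployment, not values from the paper:

```python
def norm(x, x_min, x_max):
    """Min-max normalization used to bring each term to a common scale."""
    return (x - x_min) / (x_max - x_min)

def reward(E, Tc, Vc, Ic, bounds, rho=(1.0, 1.0, 1.0, 1.0)):
    """Equation 3: negative weighted sum of normalized energy and discomfort.

    `bounds` maps each term name to its (min, max) normalization range;
    the default `rho` weights are illustrative.
    """
    terms = {"E": E, "Tc": Tc, "Vc": Vc, "Ic": Ic}
    return -sum(r * norm(v, *bounds[k])
                for r, (k, v) in zip(rho, terms.items()))
```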
C. Branching Dueling Q-Network

To solve the high-dimensional action problem described in Section IV-B3, OCTOPUS adopts a Branching Dueling Q-Network (BDQ), which is a branching variant of the dueling Double Deep Q-Network (DDQN). BDQ is a new neural architecture featuring a shared decision module followed by several network branches, one for each action dimension. BDQ can scale robustly to environments with high-dimensional action spaces and even outperforms the Deep Deterministic Policy Gradient (DDPG) algorithm in the most challenging tasks [40]. In our current implementation, we use a simulated building model developed in EnergyPlus as the environment for training and validation. Our BDQ-based agent interacts with the EnergyPlus model. At each control step, it processes the state (building and weather parameters) and generates a combined action set for the four subsystems.

Figure 8 shows the action branching network of the BDQ agent. When a state is input, the shared decision module computes a latent representation that is then used to calculate the state value and the output of each dimension branch (the advantage dimensions in Figure 8). The state value and the factorized advantages are then combined, via a special aggregation layer, to output the Q-values for each action dimension. These Q-values are then queried to generate a joint-action tuple. The weights of the fully connected neural layers are denoted by the gray trapezoids, and the size of each layer (i.e., number of units) is depicted in the figure.

Training Process: The training process of the BDQ-based control agent is outlined in Algorithm 1. At the beginning, we first initialize a neural network Q with random weights θ. Another neural network Q− with the same architecture is also created. The outer "for" loop controls the number of training episodes, and the inner "for" loop performs control at each control time step within one training episode. During the training process, the recent transition tuples (St, At, St+1, Rt+1) are stored in the replay memory Λ, from which mini-batches of samples are drawn for neural network training. The variable At stores the control action of the last step, and St and St+1 represent the building state in the previous and current control time steps, respectively. At the beginning of each time slot t, we first apply the four actions and obtain the current state St+1. In line 7, the immediate reward Rt+1 is calculated by Equation 3. A training mini-batch can be built by randomly drawing some transition tuples from the memory. We calculate the target vector and update the weights of the neural network Q using an Adam optimizer at every control step t.
Algorithm 1: The Training Process of Our BDQ-Based Agent
Input: The ranges of the human comfort metrics and the maximum acceptable energy consumption
Output: A trained DRL agent
1 Initialize BDQ's prediction network Q with random weights θ;
2 Initialize BDQ's target network Q− with weights θ− = θ;
3 for episode = 0, 1, ..., M do
4   Obtain the initial state St and action At randomly;
5   for control time step t = 0, 1, ..., T do
6     Update Ht, Lt, Bt, Wt according to the control action At;
7     Calculate reward Rt+1 by Equation 3;
8     Obtain the current state observation St+1;
9     Store (St, At, St+1, Rt+1) in replay memory Λ;
10    Draw mini-batch sample transitions from Λ;
11    Calculate the target vector and update the weights of the neural network Q;
12    Update the target network Q−_d(s, a_d) using Equation 5;
13    Perform gradient descent iteratively to tune BDQ by Equation 6.

Formally, for an action dimension d ∈ {1, ..., N} with n discrete actions, a branch's Q-value at state s ∈ S with action a_d ∈ A_d is expressed in terms of the common state value V(s) (the result of the shared representation layer in Figure 8) and the corresponding (state-dependent) action advantage A_d(s, a_d) of each branch (the result of each advantage dimension in Figure 8) by:

Q_d(s, a_d) = V(s) + \left( A_d(s, a_d) - \frac{1}{n} \sum_{a'_d \in A_d} A_d(s, a'_d) \right).   (4)

The target network Q− is updated with the latest weights of the network Q every c control time steps; c is set to 50 in our current implementation. Q− is used to infer the target value for the next c control steps. We use y_d to represent the maximum accumulative reward we can obtain in the next c steps. y_d can be calculated by temporal-difference (TD) targets in a recursive fashion:

y_d = R + \gamma \frac{1}{N} \sum_d Q^-_d \left( s', \arg\max_{a'_d \in A_d} Q_d(s', a'_d) \right),   (5)

where Q^-_d denotes branch d of the target network Q−, R is the reward function result, and γ is the discount factor. Finally, at the end of the inner "for" loop, we calculate the following loss function every c control steps:

L = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \sum_d \left( y_d - Q_d(s, a_d) \right)^2 \right],   (6)

where D denotes a (prioritized) experience replay buffer and a denotes the joint-action tuple (a_1, a_2, ..., a_N). The loss function L should decrease as more training episodes are performed.
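The following NumPy sketch mirrors Equations 4-6 for a single transition: per-branch advantages are mean-centered and combined with the shared state value (Equation 4), TD targets are formed with the target network (Equation 5), and the squared errors are summed over branches (Equation 6). The array shapes and the toy values are assumptions for illustration, not part of the OCTOPUS implementation:

```python
import numpy as np

def branch_q_values(state_value, advantages):
    """Equation 4: Q_d(s,a_d) = V(s) + (A_d(s,a_d) - mean over a'_d of A_d)."""
    return [state_value + (a - a.mean()) for a in advantages]

def td_targets(reward, gamma, q_online_next, q_target_next):
    """Equation 5: bootstrap with the target net at the online net's argmax."""
    n = len(q_online_next)
    bootstrap = sum(qt[np.argmax(qo)]
                    for qo, qt in zip(q_online_next, q_target_next)) / n
    return [reward + gamma * bootstrap for _ in range(n)]

def loss(q_values, actions, targets):
    """Equation 6: sum over branches of the squared TD error for taken actions."""
    return sum((y - q[a]) ** 2 for q, a, y in zip(q_values, actions, targets))

# Toy example with four branches of sizes (66, 33, 33, 33).
rng = np.random.default_rng(0)
adv = [rng.normal(size=n) for n in (66, 33, 33, 33)]
q_sa = branch_q_values(state_value=0.5, advantages=adv)
y = td_targets(reward=-1.2, gamma=0.99, q_online_next=q_sa, q_target_next=q_sa)
print(loss(q_sa, actions=[3, 0, 10, 32], targets=y))
```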
D. Reward Calculation

This section describes how we calculate the reward function in Equation 3, including the energy cost E, thermal comfort Tc, visual comfort Vc and indoor air condition Ic.

TABLE I: PMV Constants
Parameter | Value | Units
Metabolic rate | 70 | W/m2
Clothing level | 0.5 | clo

1) Energy Consumption: The energy consumption of a building includes the heating coil power Ph, the cooling coil power Pc and the fan power Pf from the HVAC system, and the electric light power Pl from the lighting system. We calculate the reward term for energy consumption E during a time slot as

E = P_h + P_c + P_f + P_l.   (7)

The heating and cooling coils are used to heat or cool the air, and the fan is used to distribute the heated or cooled air to the zone. The electric lights are used for normal work in the zone. These quantities are calculated by the EnergyPlus simulator in our training and evaluation. In our current implementation, we ignore the power consumed by the water pumps and the motors that adjust the blinds and windows, because it is relatively small compared with the power consumption of the HVAC system or the lighting system and can be safely ignored (less than 1% of the total).

2) Human Comfort: We define and explain the measurement of the three human comfort metrics.

Thermal Comfort: It is determined by the PMV (Predictive Mean Vote) index, which is calculated by Fanger's equation [41]. PMV predicts the mean thermal sensation vote on a standard scale for a large group of persons. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) developed the thermal comfort index using the coding -3 for cold, -2 for cool, -1 for slightly cool, 0 for neutral, +1 for slightly warm, +2 for warm, and +3 for hot. PMV has been adopted by the ISO 7730 standard [42]. The ISO recommends maintaining PMV at level 0 with a tolerance of 0.5 as the best thermal comfort. We calculate the reward term for thermal comfort Tc during a time slot as

T_c = \begin{cases} 0, & PMV \le P \\ |PMV - P|, & PMV > |P| \end{cases}   (8)

The occupants feel comfortable when the PMV value is within an acceptable range. We denote the range as [-P, P], where P is the threshold for the PMV value. If the PMV value lies within [-P, P], it does not incur a penalty. Otherwise, it incurs a penalty for the occupants' dissatisfaction with the building's thermal condition. There are six primary factors that directly affect thermal comfort, which can be grouped in two categories: personal factors, because they are characteristics of the occupants, and environmental factors, which are conditions of the thermal environment. The former are metabolic rate and clothing level; the latter are air temperature, mean radiant temperature, air speed and humidity. The PMV personal factor parameters are shown in Table I. The PMV environmental factors are obtained in real time from EnergyPlus.
Visual Comfort: The research on visual comfort is dominated by studies analyzing the presence of an adequate amount of light, where discomfort can be caused by either too low a level of light or too high a level of light, i.e., glare. In this paper, the major glare metric is the illuminance range [43]. The illuminance sources include daylight and electric light. Thus, the main subsystems that can have an impact on visual comfort are the blind system and the lighting system. We calculate the reward term for visual comfort Vc during a time slot as

V_c = \begin{cases} -(F - M_L), & F < M_L \\ 0, & M_L \le F \le M_H \\ F - M_H, & F > M_H \end{cases}   (9)

The occupants feel comfortable when the illuminance value F is within an acceptable range. We denote the range as [M_L, M_H], where M_L and M_H are the thresholds for the illuminance value. If the illuminance value lies within [M_L, M_H], it does not incur a penalty. Otherwise, it incurs a penalty for the occupants' dissatisfaction with the building's illuminance condition.

Indoor Air Quality: Carbon dioxide (CO2) concentration in a building is used as a proxy for air quality [44]. The carbon dioxide concentration comes from the building's users. There are various other sources of pollution (NOx, Total Volatile Organic Compounds (TVOC), respirable particles, etc.). Ventilation is an important means for controlling indoor air quality (IAQ) in buildings [45]. Ventilation in this work mainly comes from the HVAC system and the window system. We calculate the reward term for the indoor air condition Ic during a time slot as

I_c = \begin{cases} -(C - A_L), & C < A_L \\ 0, & A_L \le C \le A_H \\ C - A_H, & C > A_H \end{cases}   (10)

The occupants feel comfortable when the carbon dioxide concentration value C is within an acceptable range. We denote the range as [A_L, A_H], where A_L and A_H are the thresholds for the carbon dioxide concentration. If the concentration value lies within [A_L, A_H], it does not incur a penalty. Otherwise, it incurs a penalty for the occupants' dissatisfaction with the building's indoor air quality.
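The three comfort penalties in Equations 8-10 share the same shape: zero inside the accepted band and growing linearly outside it. A small sketch is shown below; Equation 8 is written one-sidedly in the text, so the sketch treats the PMV band symmetrically around zero, matching the accompanying description of the range [-P, P]. The thresholds are passed in rather than fixed:

```python
def band_penalty(value, low, high):
    """Out-of-range penalty used for Vc (Eq. 9) and Ic (Eq. 10)."""
    if value < low:
        return low - value          # i.e., -(value - low)
    if value > high:
        return value - high
    return 0.0

def thermal_penalty(pmv, p):
    """Tc as in Equation 8: no penalty while PMV stays within [-p, p]."""
    return 0.0 if abs(pmv) <= p else abs(pmv) - p

# Examples with the comfort ranges used later in the evaluation:
print(thermal_penalty(0.7, 0.5))           # PMV outside [-0.5, 0.5]
print(band_penalty(450.0, 500.0, 1000.0))  # illuminance below 500 lux
print(band_penalty(850.0, 400.0, 1000.0))  # CO2 within 400-1000 ppm -> 0.0
```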
V. IMPLEMENTATION OF OCTOPUS

In this section, we describe in detail the implementation of OCTOPUS, including the platform setup, HVAC modeling and calibration, and OCTOPUS training.

A. Platform Setup

Figure 9 shows a conceptual flow diagram of our building simulation and control platform. Our building model is rendered using SketchUp [46]. It replicates a LEED Gold Certified building on our university campus. Using OpenStudio, the HVAC, lighting, blind and window systems are installed in the building/zones. The control scheme, OCTOPUS, is implemented using TensorFlow, an open-source machine learning library for Python. Using the Building Controls Virtual Test Bed (BCVTB), a Ptolemy II platform that enables co-simulation across different models [47], we implement the control of each zone's temperature set points, blinds, lighting and window schedule during each action interval in EnergyPlus for our building, alongside weather data. OCTOPUS is modeled using EnergyPlus version 8.6 [37]. We train OCTOPUS based on 10 years of weather data from two different cities, Merced, CA and Chicago, IL, due to their distinct weather characteristics. The weather data for Merced has intensive solar radiation and a large variance in temperature, while Chicago is classified as hot-summer humid continental with four distinct seasons. To train our model, we define an "episode" as one inner for loop of Algorithm 1.

Fig. 9: Workflow of OCTOPUS (floor plan in SketchUp, thermal zones in OpenStudio, building model in EnergyPlus, and the controller in Matlab/Python connected through the BCVTB gateway).

B. Rule Based Method

We implement a rule-based method based on our current campus building control policy. This policy was first set up at commissioning time by a mechanical engineering company, and then it was further optimized by two experienced HVAC engineers when going over the LEED certification process. First, we assign different zone temperature setpoints. Each zone has a separate heating and cooling setpoint. The heating setpoint is set to 70 °F and the cooling setpoint to 74 °F during the warm-up stage. The cooling setpoint is limited to between 72 °F and 80 °F, and the heating setpoint is limited to between 65 °F and 72 °F. Second, we set control restrictions and actuator limits, and the control inputs are subject to the following constraint: the heating setpoint should not exceed the cooling setpoint minus 1 °F. An adjustment moves both the existing heating and cooling setpoints upwards or downwards by the same amount, unless a limit has been reached. Third, for the control loops, two separate control loops operate to maintain the space temperature at its setpoint, the Cooling Loop and the Heating Loop. Both loops are continuously active.
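The rule-based baseline just described essentially clamps the zone heating and cooling set-points and keeps them a fixed distance apart. A sketch of that logic, in °F and using the limits quoted above, is:

```python
def adjust_setpoints(heating_f, cooling_f, delta_f=0.0):
    """Shift both set-points by the same amount, then enforce the RBC limits:
    cooling in [72, 80] F, heating in [65, 72] F, heating <= cooling - 1 F."""
    heating = min(max(heating_f + delta_f, 65.0), 72.0)
    cooling = min(max(cooling_f + delta_f, 72.0), 80.0)
    if heating > cooling - 1.0:
        heating = cooling - 1.0
    return heating, cooling

# Warm-up defaults of 70 F / 74 F nudged upward by 2 F:
print(adjust_setpoints(70.0, 74.0, delta_f=2.0))   # (72.0, 76.0)
```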
C. HVAC System Description

The HVAC system we modeled is a single-duct central cooling HVAC system with terminal reheat, as shown in Figure 10. The process begins at the supply fan in the air handling unit (AHU), which supplies air to the zone. The supply air first goes through a cooling coil, which cools the air to the minimum temperature required for the zone. Before air enters a zone, it passes through a variable air volume (VAV) unit that regulates the amount of air that flows into the zone.
Terminal reheat occurs when the heating coil increases the temperature before discharging air into a zone. A discharge setpoint temperature is selected for each zone, and the VAV unit ensures that the air is heated to this temperature for each zone. The air supplied to the zone is mixed with the current zone air, and some of the air is exhausted out of the zone to maintain a constant static pressure. The return air from each zone is mixed in the return duct, and then portions of it may enter the economizer.

Fig. 10: HVAC Single Duct VAV Terminal Reheat Layout.

TABLE II: Model Calibration Parameters
Parameter | Range | Adoption
Infiltration Rate | 0.01 m3 ∼ 0.5 m3 | 0.05 m3
Window Type | Single/Double Pane | Single
Window Area | 1 m2 ∼ 4 m2 | 2 m2
Window Thickness | 3 mm ∼ 6 mm | 3 mm
Fan Efficiency | 0.5 ∼ 0.8 | 0.7
Blind Type | Interior/Exterior Blind | Interior
Blind Thickness | 1 mm ∼ 6 mm | 1 mm

TABLE III: Modeling Error after Calibration
Period (data) | MBE | CVRMSE
February (hourly temperature) | -1.48% | 5.32%
March (hourly temperature) | -0.26% | 4.95%
April (hourly temperature) | 1.20% | 5.06%
May (hourly temperature) | 0.48% | 4.38%
February - May (monthly energy) | -3.83% | 12.33%

Fig. 11: Building Model Calibration Process (an initial proposed model is refined by interim models using weather data from a public weather station, https://darksky.net; occupancy schedules from Panasonic PIR and Grid-EYE sensors; and zone temperature and HVAC energy from WebCTRL and an InfluxDB database; the calibrated model is then compared with actual measured data using MBE and CVRMSE).

D. HVAC Modeling and Calibration

The purpose of the calibration is to ensure the building energy model can generate energy use results close to the measured values in the target building using actual inputs, including weather, occupancy schedule, and the HVAC system parameters and controls.

The building model calibration process is shown in Figure 11. The first step of the calibration is to collect the real weather data from a public weather station for the period to be tested. We use Dark Sky's API, a public weather website, to collect real weather data for three months. The second step is to replace the default occupancy schedules in the simulator with the actual occupancy schedules collected from the real target building using ThermoSense [48]. This system was installed in the target building on our campus and allows the collection of fine-grained occupancy data at the zone level in the building, allowing the evaluation to use accurate occupancy patterns. We used the hourly occupancy data from 3 months as the occupancy schedule in our simulated building in EnergyPlus. The third step is to calibrate certain system and control parameters to match those in the target building we want to replicate. This involves multiple issues, including (a) the selection of the parameters to be calibrated, (b) the range of those parameters, and (c) the step used within the range. In our work, we use an N-factorial design with 5 parameters and ranges to be tested based on operational experience. We tested different combinations of HVAC system parameters (infiltration rate) and control parameters (mass flow rate, heating and cooling setpoints) and found the combination that minimized the calibration error (see below). The selected calibration parameters are listed in Table II with their calibration ranges and selected values. The final step is to compare the calibration error between the calibrated model and the actual measured zone temperature and energy consumption stored in the operational building database. The whole calibration process of modeling our building takes nearly one month.

ASHRAE Guideline 14-2002 [49] defines the evaluation criteria to calibrate BEM models. According to the Guideline, monthly or hourly data can be used for calibration. The Mean Bias Error (MBE) and the Coefficient of Variation of the Root Mean Squared Error (CVRMSE) are used as evaluation indices. The guideline states that the model should have an MBE of 5% and a CVRMSE of 15% relative to monthly calibration data. If hourly calibration data are used, these requirements are 10% and 30%, respectively. In our case, hourly data is used to calculate the error metrics for the average zone temperature.
TABLE IV: Human Comfort Statistical Results for Rule Based, DDQN-HVAC and OCTOPUS Schemes (values given as January / July)
Location | Method | Metric | PMV | Illuminance (lux) | CO2 Concentration (ppm) | Energy Consumption (kWh)
Merced | Rule Based Method | Mean | 0.03 / -0.25 | 576.78 / 646.45 | 623.61 / 668.03 | 1990.99 / 3583.03
Merced | Rule Based Method | Std | 0.11 / 0.13 | 152.54 / 157.11 | 120.64 / 181.22 |
Merced | Rule Based Method | Violation rate | 0 / 2% | 0.94% / 0 | 0.3% / 3.629% |
Merced | DDQN-HVAC [7] | Mean | -0.19 / 0.28 | 576.78 / 646.45 | 625.62 / 648.01 | 1859.10 / 3335.58
Merced | DDQN-HVAC [7] | Std | 0.21 / 0.11 | 152.54 / 157.11 | 122.62 / 120.57 |
Merced | DDQN-HVAC [7] | Violation rate | 2.99% / 4.4% | 0.94% / 0 | 0 / 0.2% |
Merced | OCTOPUS | Mean | -0.31 / 0.27 | 587.12 / 569.88 | 594.77 / 612.33 | 1756.24 / 2941.46
Merced | OCTOPUS | Std | 0.2 / 0.10 | 382.27 / 75.83 | 111.59 / 110.35 |
Merced | OCTOPUS | Violation rate | 5.7% / 2.5% | 0.26% / 0.2% | 1.31% / 0.33% |
Chicago | Rule Based Method | Mean | -0.28 / -0.15 | 583.27 / 637.07 | 610.26 / 638.33 | 3848.61 / 3309.56
Chicago | Rule Based Method | Std | 0.11 / 0.02 | 163.96 / 151.37 | 63.94 / 151.37 |
Chicago | Rule Based Method | Violation rate | 3.09% / 0 | 1.1% / 0 | 0 / 0 |
Chicago | DDQN-HVAC [7] | Mean | -0.32 / 0.24 | 583.27 / 637.07 | 612.74 / 649.32 | 3605.21 / 3078.67
Chicago | DDQN-HVAC [7] | Std | 0.08 / 0.07 | 163.96 / 151.37 | 65.09 / 90.16 |
Chicago | DDQN-HVAC [7] | Violation rate | 3.7% / 2.9% | 1.1% / 0 | 0 / 0 |
Chicago | OCTOPUS | Mean | -0.4 / 0.29 | 598.34 / 544.09 | 640.31 / 633.71 | 3496.54 / 2722.03
Chicago | OCTOPUS | Std | 0.1 / 0.11 | 259.88 / 55.37 | 99.85 / 111.04 |
Chicago | OCTOPUS | Violation rate | 4.2% / 1.47% | 1.6% / 0 | 1% / 1.31% |

We choose monthly data to calculate the energy error metrics because energy data can only be obtained monthly. The calibration results for zone temperature and energy consumption are shown in Table III. Less than 2% NMBE and less than 6% CVRMSE for the zone temperature can be achieved with the optimal parameter setting. We found that the CVRMSE for the monthly heating and cooling energy demand is relatively large, but the NMBE and CVRMSE are still within the acceptable range. This means the model can achieve an accurate calculation of the monthly energy.
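For reference, the two calibration indices from ASHRAE Guideline 14 used above can be computed as follows. This is a standard formulation sketched for illustration, not code taken from the paper's calibration tooling:

```python
import numpy as np

def mbe(measured, simulated):
    """Mean Bias Error, as a percentage of the total measured value."""
    measured, simulated = np.asarray(measured), np.asarray(simulated)
    return 100.0 * np.sum(measured - simulated) / np.sum(measured)

def cvrmse(measured, simulated):
    """Coefficient of Variation of the RMSE, as a percentage."""
    measured, simulated = np.asarray(measured), np.asarray(simulated)
    rmse = np.sqrt(np.mean((measured - simulated) ** 2))
    return 100.0 * rmse / np.mean(measured)

zone_temp_meas = [21.0, 21.5, 22.3, 23.0]   # hypothetical hourly samples
zone_temp_sim = [20.8, 21.9, 22.1, 23.4]
print(mbe(zone_temp_meas, zone_temp_sim), cvrmse(zone_temp_meas, zone_temp_sim))
```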
TABLE V: Parameter Settings in DRL Algorithms
Δt_c | 15 min
Minibatch Size | 64
Learning Rate | 10^-4
γ | 0.99
β1 | 0.9
β2 | 0.999
Action Dimension | 35040
Action Space | 2.37 × 10^7

E. OCTOPUS Training

The 10-year weather data for training from the two locations tested (Merced, CA and Chicago, IL) is randomly divided, with eight years used for training and the remaining two years used for testing. The parameter settings of our DRL algorithms are shown in Table V. In our implementation of OCTOPUS, we use the Adam optimizer [50] for gradient-based optimization with a learning rate of 10^-4. We train the agent with a minibatch size of 64 and a discount factor γ = 0.99. The target network is updated every 10^3 time steps. We use the rectified linear unit (ReLU) [51] for all hidden layers and linear activation on the output layers. The network has two hidden layers with 512 and 256 units in the shared network module and one hidden layer per branch with 128 units. The weights are initialized using Xavier initialization [52] and the biases are initialized to zero.

We use prioritized replay with a buffer size of 10^6 and linear annealing of β from β0 = 0.4 to 1 over 2 × 10^6 steps. While an ε-greedy policy is often used with Q-learning, random exploration (with an exploration probability) in physical, continuous-action domains can be inefficient. To explore actions well in our building environment, we decided to sample actions from a Gaussian distribution with its mean at the greedy actions and a small fixed standard deviation throughout training to encourage life-long exploration. We used a fixed standard deviation of 0.2 during training and zero during evaluation. This exploration strategy yielded mildly better performance than using an ε-greedy policy with a fixed or linearly annealed exploration probability. The duration of each time (action) slot is 15 minutes. We achieved convergence of our reward function after 1000 episodes, as explained in Section VI-F.
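A sketch of the Gaussian exploration strategy described above, applied per action branch to the greedy (argmax) action index and then mapped back onto the discrete grid; the normalization, clipping and rounding details here are assumptions for illustration:

```python
import numpy as np

def explore(greedy_indices, branch_sizes, std=0.2, rng=np.random.default_rng()):
    """Perturb each branch's greedy action index with Gaussian noise.

    std=0.2 matches the fixed standard deviation used during training;
    std=0.0 recovers the greedy policy used during evaluation.
    """
    noisy = []
    for idx, size in zip(greedy_indices, branch_sizes):
        # Noise is applied on the normalized [0, 1] action scale, then
        # rounded back to the nearest valid discrete index.
        x = idx / (size - 1) + rng.normal(0.0, std)
        noisy.append(int(round(np.clip(x, 0.0, 1.0) * (size - 1))))
    return tuple(noisy)

print(explore((32, 16, 8, 0), (66, 33, 33, 33)))
```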
VI. EVALUATION

In this section, we compare the performance of OCTOPUS with the rule-based method and the latest DRL-based method.
Fig. 12: Daily Energy Consumption of Control Methods.
Fig. 14: The Convergence of OCTOPUS.

Fig. 13: Performance Contribution of Each Subsystem.

A. Experiment Setting

The implementation of the rule-based HVAC control was introduced in Section V-B. The rule-based method only controls the HVAC system. For the conventional DRL-based method, we implement the dueling DQN architecture used in [7], which controls a water-based heating system. We refer to that work as DDQN-HVAC in our comparison. Since these two benchmarks do not control the lighting system, for a fair comparison we initialize the lights on in all experiments. OCTOPUS may dim the lights if the blind is open during the day. In addition, the two benchmarks always leave the blind and window systems closed.

The three human comfort metrics are measured by PMV, illuminance, and carbon dioxide concentration. We set the acceptable range of the three human comfort metrics according to building standards and previous experience in related work. The comfort range of PMV is set to -0.5 to 0.5 [53]. The comfort range of illuminance is set to 500-1000 lux [43]. The comfort range of carbon dioxide concentration is set to 400-1000 ppm [45].

We use the three control methods to control the building we modeled in Section V for two months (January and July) and at two places with distinct weather patterns. Table IV shows the human comfort results of the three control methods and their energy consumption. The violation rate is calculated as the time during which the value of a human comfort metric falls beyond its acceptable range divided by the total simulated time. Other quality-of-service metrics, including the amount by which the violation occurred, or a combination of amount and time, will be explored in future work.
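The violation rate defined above is simply the fraction of simulated control intervals in which a comfort metric leaves its acceptable range; as a sketch:

```python
def violation_rate(values, low, high):
    """Share of control intervals in which a comfort metric is out of range."""
    out_of_range = sum(1 for v in values if v < low or v > high)
    return out_of_range / len(values)

pmv_trace = [0.1, 0.4, 0.6, -0.2, -0.7, 0.3]   # hypothetical PMV samples
print(violation_rate(pmv_trace, -0.5, 0.5))    # 2 of 6 intervals -> 0.333...
```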
For both visual comfort and indoor air quality, the three
A. Experiment Setting control methods provide a very small violation rate. For illumi-
nance, the mean illuminance value of OCTOPUS and DDQN-
The implementation of the rule-based HVAC control has HAVC is 590.69 lux and 610.89 lux respectively. OCTOPUS
been introduced in Section V-B. The rule-based method only saves energy by utilizing natural light as much as possible.
controls the HVAC system. For the conventional DRL-based For indoor air quality, the average of CO2 concentration of
method, we implement the dueling DQN architecture used OCTOPUS, DDQN-HVAC, and rule-based method is 620.28
in [7], which controls the water-based heating system. We ppm, 633.92 ppm, and 635.06 ppm. OCTOPUS adjusts both
name that work as DDQN-HVAC in our comparison. Since window system and HVAC system to maintain the CO2
these two benchmarks do not control the light system, for a concentration level within the desired range. DDQN-HVAC
fair comparison, we initialize the lights on in all experiments. and the rule-based method only use the HVAC system.
OCTOPUS may dim the lights if the blind is open during the
day. In addition, the two benchmarks always leave the blind C. Energy Efficiency
and window system closed.
C. Energy Efficiency

The results in Table IV reveal that OCTOPUS saves 14.26% and 8.1% energy on average, compared with the rule-based control method and DDQN-HVAC respectively. In both cities, OCTOPUS achieves a similar performance gain. OCTOPUS reduces the energy consumption of HVAC by using the other subsystems. Figure 12 shows the daily energy consumption of the three control methods in January at Merced. On most days, OCTOPUS consumes less energy than the other two methods; however, OCTOPUS is not always the best, although we see clear gains towards the second half of the month due to a change in weather temperature. The average range of outdoor temperature changes from 2 °C ~ 13 °C in the first half of the month to -1 °C ~ 18 °C in the second half of the month, so OCTOPUS could use external air with the window open for more natural ventilation.

In Table IV, compared to the rule-based method and DDQN-HVAC, OCTOPUS saves more energy in July (17.6% and 11.7%) than in January (10.05% and 3.9%). In July, the outdoor air temperature range at Merced and Chicago is 15 °C ~ 42 °C and 15 °C ~ 40 °C respectively. The window can be opened when the temperature is within the acceptable range, in order to save the energy consumed by the HVAC system. However, in January, due to the cold weather at both places, the windows stay closed most of the time and cannot make much contribution to energy savings.
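The relative savings quoted above follow the usual definition, i.e., energy saved as a fraction of the baseline's consumption. The sketch below illustrates the arithmetic with made-up monthly totals that roughly reproduce the reported percentages; the numbers are not from Table IV.

```python
# Illustrative savings computation (baseline totals are invented).
def savings_pct(baseline_kwh: float, octopus_kwh: float) -> float:
    """Percentage of the baseline's energy saved by OCTOPUS."""
    return 100.0 * (baseline_kwh - octopus_kwh) / baseline_kwh

print(savings_pct(3500.0, 3001.0))   # ~14.3% vs. a rule-based-style baseline
print(savings_pct(3265.7, 3001.0))   # ~8.1%  vs. a DDQN-HVAC-style baseline
```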
TABLE VI: Different Parameters for Reward Function in OCTOPUS

Parameters           PMV              Illuminance (lux)     CO2 Concentration (ppm)   Energy
(ρ1, ρ2, ρ3, ρ4)     Mean    Std      Mean      Std         Mean      Std             (kWh)
1, 1, 1, 1           -0.36   0.15     587.35    94.52       587.25    101.14          3250.55
5, 1, 1, 1           -0.33   0.16     611.71    131         608.48    175.1           3221.20
10, 1, 1, 1          -0.31   0.16     624.97    189.04      647.77    150.33          3150.62
2, 3, 1, 1           -0.383  0.10     569.88    75.83       636.5     179.46          2941.46
2, 5, 1, 1           -0.481  0.13     689.23    146.66      616.02    177.32          2900.44
D. Performance Decomposition

We implement four versions of OCTOPUS to study the energy saving contribution of each subsystem, i.e., OCTOPUS just with the HVAC system (OCTOPUS_HVAC), OCTOPUS with HVAC and lighting (OCTOPUS_HVAC_L), OCTOPUS with HVAC, lighting and blind (OCTOPUS_HVAC_L_B), and OCTOPUS with all four subsystems (OCTOPUS_HVAC_L_B_W). Figure 13 depicts the energy consumption of these four versions in two different months and at two different places (Merced and Chicago). Compared with the rule-based method, OCTOPUS_HVAC can save 6.16% more energy by only considering HVAC. When the lighting system is added in OCTOPUS_HVAC_L, 2.73% more energy can be saved. If the blind system is further added in OCTOPUS_HVAC_L_B, 1.93% more energy can be saved. Finally, when the window system is added in OCTOPUS_HVAC_L_B_W, 3.44% more energy can be saved. The four subsystems make different contributions to energy saving in January and July. In January, the four subsystems (i.e., HVAC, lighting, blind and window) make 6.16%, 2.73%, 1.93% and 0% contributions to energy savings respectively. In July, the contributions of these subsystems change to 5.9%, 3.31%, 1.99%, and 6.4% respectively. The most obvious difference between these two months is made by the window system (6.4%). The reason for this has been explained above: in January, the windows are closed almost all the time, while in July the cool outdoor air is used to cool down the building instead of using the HVAC system.
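This decomposition can be read as cumulative savings over the rule-based baseline, where each variant adds the increment of the newly enabled subsystem. The sketch below simply replays the January numbers from the text to make that reading explicit.

```python
# Cumulative savings vs. the rule-based method (January figures from the text),
# and the increment contributed by each newly added subsystem.
variants = {
    "OCTOPUS_HVAC":       6.16,
    "OCTOPUS_HVAC_L":     6.16 + 2.73,
    "OCTOPUS_HVAC_L_B":   6.16 + 2.73 + 1.93,
    "OCTOPUS_HVAC_L_B_W": 6.16 + 2.73 + 1.93 + 0.0,
}

prev = 0.0
for name, cumulative in variants.items():
    print(f"{name:20s} adds {cumulative - prev:.2f}% savings")
    prev = cumulative
```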
E. Hyperparameters Setting

The hyperparameters in the reward function (Equation 3) are tuned to balance between energy consumption and human comfort. Table VI shows the performance results of the trained DRL agents in the selected experiments of the hyperparameter tuning. The total energy consumption and the mean and standard deviation of the PMV, illuminance and carbon dioxide concentration are used as the evaluation metrics. It is interesting to find that the control performance results of the different hyperparameters are not intuitive. For example, we would expect a bigger ρ1 and smaller ρ2, ρ3, ρ4 to lead to lower energy consumption while just meeting the requirements of thermal comfort, visual comfort and indoor air quality. However, the results in Table VI show that when increasing the weight of energy, energy consumption does not necessarily decrease. Such counter-intuitive results are possibly caused by the delayed reward problem, i.e., the DRL agents get stuck in some local optimal areas during training. Out of the five experiments in Table VI, the fourth row saves 17.9% of the energy consumption with only slightly worse human comfort quality in the testing model, which achieves the best balance between human comfort and energy consumption. Therefore, the parameters in the fourth row are used for the trained agent.
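Equation 3 itself is not reproduced in this section, so the sketch below only illustrates how weights of this kind trade off the four terms. Both the weighted-sum form and the mapping of ρ1 to energy and ρ2-ρ4 to the comfort terms are assumptions made for this sketch, not the paper's actual reward.

```python
# Illustrative stand-in for a weighted reward; the form and the weight-to-term
# mapping are assumptions, not the paper's Equation 3.
def reward(energy_kwh, thermal_penalty, visual_penalty, iaq_penalty,
           rho=(2.0, 3.0, 1.0, 1.0)):          # Table VI, fourth row
    rho1, rho2, rho3, rho4 = rho
    return -(rho1 * energy_kwh + rho2 * thermal_penalty
             + rho3 * visual_penalty + rho4 * iaq_penalty)

print(reward(1.2, 0.1, 0.05, 0.02))   # -> -2.77
```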
F. Convergence of OCTOPUS Training

Figure 14 shows the accumulated reward of OCTOPUS in each episode during a training process. We calculate the reward function at every control time step (15 minutes), and thus one episode (one month) contains 2880 time steps. The accumulated reward of one episode (the episode reward in Figure 14) is the sum of the rewards of the 2880 time steps. From the results in Figure 14, we see that the episode reward increases and tends to become stable as the number of training episodes increases. When the episode reward does not change much, it means that we cannot further improve the learned control policy and thus the training process has converged. As indicated in Figure 14, the training reward fluctuates between two adjacent episodes because the number of time steps in one episode is large, i.e., 2880. The rewards calculated at some of these 2880 time steps may vary dynamically because we randomly choose some time steps according to an exploration rate (determined by a Gaussian distribution with a standard deviation of 0.2). At these time steps, we do not use the action generated by the agent, but randomly choose an action to avoid converging to a local minimum. If we smooth the episode reward using a sliding window of 10 episodes, the average reward in Figure 14 is more stable during the training.
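The bookkeeping behind Figure 14 can be sketched as follows. Only the episode length (2880 steps) and the 10-episode smoothing window come from the text; the per-step rewards below are synthetic.

```python
# Episode-reward accumulation and sliding-window smoothing (synthetic rewards).
import random

STEPS_PER_EPISODE = 30 * 24 * 4          # one month at 15-minute steps = 2880

def episode_reward(step_rewards):
    """Episode reward = sum of the per-step rewards within one episode."""
    assert len(step_rewards) == STEPS_PER_EPISODE
    return sum(step_rewards)

def smooth(episode_rewards, window=10):
    """Sliding-window average used for the 'average reward' curve."""
    out = []
    for i in range(len(episode_rewards)):
        lo = max(0, i - window + 1)
        out.append(sum(episode_rewards[lo:i + 1]) / (i + 1 - lo))
    return out

episodes = [episode_reward([random.uniform(-1.2, -0.8)
                            for _ in range(STEPS_PER_EPISODE)])
            for _ in range(50)]
print(smooth(episodes)[-1])
```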
VII. DISCUSSION

Deploying in a Real Building. Although we have developed a calibrated simulation model of a real building on our campus for training and evaluation, we have not deployed OCTOPUS in the building, because we do not have access to automatic blind and window systems at the moment. We are seeking financial support to work with our facility team for a possible upgrade. OCTOPUS is designed for real deployment in buildings. For a new building, we need to build an EnergyPlus model for it and calibrate the model using real building operation data. After training the OCTOPUS control agent using the calibrated simulation model and real weather data, we can deploy the trained agent in the building for real-time control. At a certain action interval (e.g., every 10 minutes), the OCTOPUS control agent takes the state of the building as input and generates the control actions of the four subsystems. OCTOPUS can provide real-time control, as one inference only takes 22 ms. We plan to deploy OCTOPUS in a real building in our future work.
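A minimal sketch of such a deployment loop is shown below. The state and actuation interfaces and the agent object are placeholders for a building management system integration, not existing APIs.

```python
# Hypothetical real-time control loop for a deployed agent.
import time

ACTION_INTERVAL_S = 10 * 60   # e.g., act every 10 minutes

def control_loop(agent, get_building_state, apply_actions):
    """get_building_state/apply_actions are placeholders for a BMS interface."""
    while True:
        state = get_building_state()   # temperatures, CO2, lux, weather, ...
        actions = agent.act(state)     # HVAC, light, blind, window setpoints
        apply_actions(actions)         # one inference takes roughly 22 ms
        time.sleep(ACTION_INTERVAL_S)
```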
Scalability of OCTOPUS. OCTOPUS can work in a one-zone building with one HVAC system, lighting zone, blind and window. However, a realistic building (or even a small home) is usually equipped with many lighting zones, blinds and windows, which may take different actions within one subsystem. OCTOPUS may solve this scalability problem by increasing the number of BDQ branches, i.e., each branch corresponds to one subsystem in each zone of a building. We will tackle this scalability problem in our future work.
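To make the branching idea concrete, the sketch below shows one way a BDQ-style network [40] could grow one branch per subsystem per zone: a shared trunk, a common state-value stream, and one advantage head per branch. This is an illustration of the scaling argument under assumed dimensions, not the OCTOPUS network.

```python
# Sketch of a branching dueling Q-network with one branch per (zone, subsystem).
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    def __init__(self, state_dim, zones, actions_per_branch=(5, 3, 3, 2)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)
        # One advantage head per subsystem (HVAC, light, blind, window) in every zone.
        self.branches = nn.ModuleList(
            nn.Linear(128, n) for _ in range(zones) for n in actions_per_branch)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        # Dueling aggregation per branch: Q = V + A - mean(A).
        return [v + a - a.mean(dim=-1, keepdim=True)
                for a in (branch(h) for branch in self.branches)]

net = BranchingQNet(state_dim=16, zones=4)
q_per_branch = net(torch.randn(1, 16))
print(len(q_per_branch))   # 4 zones x 4 subsystems = 16 branches
```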
Building Model Calibration. A critical component of our architecture is the use of a calibrated building model that is close to the target building, allowing us to generate sufficient data for our training needs. However, getting a calibrated model "right" is a tedious process of trial-and-error over a large number of parameters. Out of the thousands of parameters available in EnergyPlus, we used our experience and consulted experts to determine both the most important parameters and a sensible range of values to explore (it took us four weeks to get it "right"). However, there is no magic bullet, and this may become a problem, especially for unusual building architectures or specialized HVAC systems that may not be trivial to replicate in a simulation environment.
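One common way to judge such a calibration, for example following ASHRAE Guideline 14 [49], is to compare simulated and measured energy using the normalized mean bias error (NMBE) and CV(RMSE). The sketch below shows the computation on illustrative monthly totals; the data, and the simplified denominators, are assumptions of this sketch.

```python
# Calibration-quality metrics on illustrative monthly energy totals.
import math

def nmbe(measured, simulated):
    """Normalized mean bias error, in percent."""
    n, mean = len(measured), sum(measured) / len(measured)
    return 100.0 * sum(m - s for m, s in zip(measured, simulated)) / (n * mean)

def cv_rmse(measured, simulated):
    """Coefficient of variation of the RMSE, in percent."""
    n, mean = len(measured), sum(measured) / len(measured)
    rmse = math.sqrt(sum((m - s) ** 2 for m, s in zip(measured, simulated)) / n)
    return 100.0 * rmse / mean

measured  = [310.0, 295.0, 330.0, 280.0]   # monthly kWh, illustrative
simulated = [302.0, 300.0, 321.0, 291.0]
print(nmbe(measured, simulated), cv_rmse(measured, simulated))
```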
Accepting Users' Feedback. Some existing work [54] allows users to send their feedback to the control server. The feedback can represent a user's personalized preference on different human comfort metrics and will be considered in the control decision process. OCTOPUS can easily accept users' feedback to train a better agent model by making a small modification, i.e., changing the calculated comfort values in the reward function based on the users' feedback. This can be used for the initial training or for updated training (once deployed). For example, the OCTOPUS control agent can be trained incrementally at a certain time interval (e.g., one month), and the newly trained agent will then be used for real-time control.
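The "small modification" could look like the sketch below, where a per-user offset biases the comfort value fed into the reward. The sign convention and the step size are assumptions of this sketch, not the protocol of [54].

```python
# Hypothetical feedback adjustment of the comfort value used in the reward.
def comfort_with_feedback(calculated_pmv: float, too_cold_votes: int,
                          step: float = 0.1) -> float:
    """Each 'too cold' vote biases the PMV used in the reward downward,
    pushing the agent to heat more for this user (assumed convention)."""
    return calculated_pmv - too_cold_votes * step

print(comfort_with_feedback(-0.36, too_cold_votes=2))   # -> -0.56
```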
VIII. CONCLUSIONS

This paper proposes OCTOPUS, a DRL-based control system for buildings that holistically controls many subsystems in modern buildings (e.g., HVAC, light, blind, window) and manages the trade-offs between energy use and human comfort. As part of our architecture, we develop a system that addresses the issues of a large action space, a novel reward function based on energy and comfort, and the data requirements for training, using existing historical weather data together with a calibrated simulator for the target building. We compare our results with both a state-of-the-art rule-based control scheme obtained from a LEED Gold certified building and a DRL scheme used for optimized heating in the literature, and show that we can get 14.26% and 8.1% energy savings while maintaining (and sometimes even improving) human comfort values for temperature, air quality and lighting.

REFERENCES

[1] X. Ding, W. Du, and A. Cerpa, “Octopus: Deep reinforcement learning for holistic smart building control,” in ACM BuildSys, 2019.
[2] O. Lucon, D. Ürge-Vorsatz, A. Z. Ahmed, H. Akbari, P. Bertoldi, L. F. Cabeza, N. Eyre, A. Gadgil, L. Harvey, Y. Jiang et al., “Buildings,” 2014.
[3] H. Rajabi, Z. Hu, X. Ding, S. Pan, W. Du, and A. Cerpa, “Modes: Multi-sensor occupancy data-driven estimation system for smart buildings,” in Proceedings of the Thirteenth ACM International Conference on Future Energy Systems, 2022, pp. 228–239.
[4] W. W. Shein, Y. Tan, and A. O. Lim, “Pid controller for temperature control with multiple actuators in cyber-physical home system,” in IEEE NBiS, 2012.
[5] A. Beltran and A. E. Cerpa, “Optimal hvac building control with occupancy prediction,” in ACM BuildSys, 2014.
[6] B. Li and L. Xia, “A multi-grid reinforcement learning method for energy conservation and comfort of hvac in buildings,” in IEEE CASE, 2015.
[7] Z. Zhang and K. P. Lam, “Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system,” in ACM BuildSys, 2018.
[8] T. Wei, Y. Wang, and Q. Zhu, “Deep reinforcement learning for building hvac control,” in ACM DAC, 2017.
[9] (2019). [Online]. Available: https://www.geze.com/en/discover/topics/natural-ventilation/
[10] (2019). [Online]. Available: https://www.buildings.com/article-details/articleid/12969/title/operable-windows-for-operating-efficiency
[11] A. Tzempelikos and A. K. Athienitis, “The impact of shading design and control on building cooling and lighting demand,” Solar Energy, 2007.
[12] Z. Cheng, Q. Zhao, F. Wang, Y. Jiang, L. Xia, and J. Ding, “Satisfaction based q-learning for integrated lighting and blind control,” Energy and Buildings, 2016.
[13] L. Wang and S. Greenberg, “Window operation and impacts on building energy consumption,” Energy and Buildings, vol. 92, pp. 313–321, 2015.
[14] S. Privara, J. Cigler, Z. Váňa, F. Oldewurtel, C. Sagerschnig, and E. Žáčeková, “Building modeling as a crucial part for building predictive control,” Energy and Buildings, vol. 56, pp. 8–22, 2013.
[15] D. Kumar, X. Ding, W. Du, and A. Cerpa, “Building sensor fault detection and diagnostic system,” in Proceedings of the 8th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, 2021, pp. 357–360.
[16] P. Rockett and E. A. Hathway, “Model-predictive control for non-domestic buildings: a critical review and prospects,” Building Research & Information, 2017.
[17] E. Atam and L. Helsen, “Control-oriented thermal modeling of multizone buildings: methods and issues: intelligent control of a building system,” IEEE Control Systems Magazine, 2016.
[18] J. Zhao, K. P. Lam, and B. E. Ydstie, “Energyplus model-based predictive control (epmpc) by using matlab/simulink and mle+,” 2013.
[19] O. T. Karaguzel and K. P. Lam, “Development of whole-building energy performance models as benchmarks for retrofit projects,” in Proceedings of the 2011 Winter Simulation Conference (WSC). IEEE, 2011.
[20] B. Gu, S. Ergan, and B. Akinci, “Generating as-is building information models for facility management by leveraging heterogeneous existing information sources: A case study,” in Construction Research Congress 2014: Construction in a Global Network, 2014.
[21] H. T. Dinh and D. Kim, “Milp-based imitation learning for hvac control,” IEEE Internet of Things Journal, 2021.
[22] C. Agbi, Z. Song, and B. Krogh, “Parameter identifiability for multi-zone building models,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012.

[23] D. Kolokotsa, G. Stavrakakis, K. Kalaitzakis, and D. Agoris, “Genetic algorithms optimized fuzzy controller for the indoor environmental management in buildings implemented using plc and local operating networks,” Engineering Applications of Artificial Intelligence, 2002.
[24] M. Liu, X. Ding, and W. Du, “Continuous, real-time object detection on mobile devices without offloading,” in 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2020, pp. 976–986.
[25] H. Zhu, V. Gupta, S. S. Ahuja, Y. Tian, Y. Zhang, and X. Jin, “Network planning with deep reinforcement learning,” in Proceedings of the 2021 ACM SIGCOMM 2021 Conference, 2021, pp. 258–271.
[26] X. Ding and W. Du, “Drlic: Deep reinforcement learning for irrigation control,” in 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 2022, pp. 41–53.
[27] Z. Shen, K. Yang, W. Du, X. Zhao, and J. Zou, “Deepapp: A
deep reinforcement learning framework for mobile application usage
prediction,” in ACM SenSys, 2019.
[28] P. Fazenda, K. Veeramachaneni, P. Lima, and U.-M. O’Reilly, “Using
reinforcement learning to optimize occupant comfort and energy usage
in hvac systems,” Journal of Ambient Intelligence and Smart Environ-
ments, vol. 6, no. 6, pp. 675–690, 2014.
[29] Z. Zhang, A. Chong, Y. Pan, C. Zhang, and K. P. Lam, “Whole building
energy model for hvac optimal control: A practical framework based on
deep reinforcement learning,” Energy and Buildings, 2019.
[30] K. Dalamagkidis, D. Kolokotsa, K. Kalaitzakis, and G. S. Stavrakakis,
“Reinforcement learning for energy conservation and comfort in build-
ings,” Building and environment, 2007.
[31] D. Van Le, Y. Liu, R. Wang, R. Tan, Y.-W. Wong, and Y. Wen,
“Control of air free-cooled data centers in tropics via deep reinforcement
learning,” in Proceedings of the 6th ACM International Conference on
Systems for Energy-Efficient Buildings, Cities, and Transportation, 2019.
[32] D. V. Le, R. Wang, Y. Liu, R. Tan, Y.-W. Wong, and Y. Wen, “Deep
reinforcement learning for tropical air free-cooled data center control,”
ACM Transactions on Sensor Networks (TOSN), 2021.
[33] J. R. Vazquez-Canteli, G. Henze, and Z. Nagy, “Marlisa: Multi-agent
reinforcement learning with iterative sequential action selection for
load shaping of grid-interactive connected buildings,” in Proceedings of
the 7th ACM International Conference on Systems for Energy-Efficient
Buildings, Cities, and Transportation, 2020.
[34] X. Ding, W. Du, and A. E. Cerpa, “Mb2c: Model-based deep reinforce-
ment learning for multi-zone building control,” in Proceedings of the 7th
ACM international conference on systems for energy-efficient buildings,
cities, and transportation, 2020, pp. 50–59.
[35] Z. Zhang, A. Chong, Y. Pan, C. Zhang, S. Lu, and K. P. Lam, “A
deep reinforcement learning approach to using whole building energy
model for hvac optimal control,” in 2018 Building Performance Analysis
Conference and SimBuild, 2018.
[36] G. Gao, J. Li, and Y. Wen, “Deepcomfort: Energy-efficient thermal
comfort control in buildings via reinforcement learning,” IEEE Internet
of Things Journal, 2020.
[37] U. D. of Energy. (2016) Energyplus 8.6.0. [Online]. Available:
https://energyplus.net/
[38] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning,” in AAAI 2016, 2016.
[39] K. Ito and K. Kunisch, Lagrange multiplier approach to variational
problems and applications. Siam, 2008.
[40] A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures
for deep reinforcement learning,” in AAAI, 2018.
[41] P. O. Fanger et al., “Thermal comfort. analysis and applications in
environmental engineering.” Thermal comfort. Analysis and applications
in environmental engineering., 1970.
[42] P. Fanger, “Moderate thermal environments determination of the pmv
and ppd indices and specification of the conditions for thermal comfort,”
ISO 7730, 1984.
[43] D. C. Pritchard, Lighting. Routledge, 2014.
[44] S. J. Emmerich and A. K. Persily, State-of-the-art review of CO2 demand
controlled ventilation technology and application. Citeseer, 2001.
[45] A. S. 62.1, “Ventilation for acceptable indoor air quality,” 2016.
[46] (2018) sketchup. [Online]. Available: https://www.sketchup.com
[47] M. Wetter, “Co-simulation of building energy and control systems with
the building controls virtual test bed,” Journal of Building Performance
Simulation, 2011.
[48] A. Beltran, V. L. Erickson, and A. E. Cerpa, “Thermosense: Occupancy
thermal based sensing for hvac control,” in ACM BuildSys Workshop,
2013.
[49] A. Guideline, “Guideline 14-2002, measurement of energy and demand savings,” American Society of Heating, Ventilating, and Air Conditioning Engineers, Atlanta, Georgia, 2002.
[50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[51] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS, 2011.
[52] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010.
[53] R. American Society of Heating and A.-C. E. S. 55, “Thermal environmental conditions for human occupancy,” 2017.
[54] D. A. Winkler, A. Beltran, N. P. Esfahani, P. P. Maglio, and A. E. Cerpa, “Forces: feedback and control for occupants to refine comfort and energy savings,” in ACM UbiComp, 2016.
