Variational Autoencoder Generative Adversarial Network for Synthetic Data Generation in Smart Home
Abstract—Data is the fuel of data science and machine learning techniques for smart grid applications, similar to many other fields. However, the availability of data can be an issue due to privacy concerns, data size, data quality, and so on. To this end, in this paper, we propose a Variational AutoEncoder Generative Adversarial Network (VAE-GAN) as a smart grid data generative model which is capable of learning various types of data distributions and generating plausible samples from the same distribution without performing any prior analysis on the data before the training phase. We compared the Kullback–Leibler (KL) divergence, maximum mean discrepancy (MMD), and Wasserstein distance between the synthetic data (electrical load and PV production) distribution generated by the proposed model, vanilla GAN network, and the real data distribution, to evaluate the performance of our model. Furthermore, we used five key statistical parameters to describe the smart grid data distribution and compared them between synthetic data generated by both models and real data. Experiments indicate that the proposed synthetic data generative model outperforms the vanilla GAN network. The distribution of VAE-GAN synthetic data is the most comparable to that of real data.

Index Terms—synthetic data, load consumption, smart grid, deep learning, generative adversarial network

I. INTRODUCTION

As an important part of the smart grid, the smart home is expected to improve household energy usage efficiency, reduce energy cost, and enhance the user comfort level [1]. Widespread smart meter devices are considered a key enabler of smart homes because they collect critical data of household devices, such as energy consumption profiles. These data can be further used by smart home controllers or utility companies for load consumption and generation forecasting, demand-side management, and economic dispatch [2]. Consequently, the availability of fine-grained data becomes a prerequisite for building a smart home.

However, access to real-world data is a challenging issue due to privacy concerns. Furthermore, the size and the quality of real-world data can also be bottlenecks for applying data science techniques to the smart grid [3]. To this end, generating synthetic data emerges as a promising alternative. The generated data can then be leveraged by machine learning algorithms, for instance, to decide when to implement demand response, when to charge an EV, etc. [4]. The existing works on synthetic data generation for smart homes can be divided into two approaches: model-based and data-driven methods. The model-based method uses mathematical equations to describe the features of household devices, including Markov chain [5], [6], statistical model [7], and physical simulator-based methods [8]. This approach requires extensive knowledge to build a dedicated generation model, which lacks flexibility and generalization capability. Moreover, a model-based method has difficulty in capturing the effect of user habits, which has a great impact on household power consumption. In contrast, the data-driven approaches require no prior knowledge or assumptions about the device's operation and energy consumption. They avoid the complexity of building a dedicated physical operation model of household devices and consequently increase flexibility by eliminating tedious assumptions. Recently flourishing machine learning techniques provide useful tools for synthetic data generation, among which the Generative Adversarial Network (GAN) is one of the promising data generation solutions [9]. For instance, [10] proposed a deep GAN-based method to generate synthetic data for energy consumption and generation. The main idea behind the GAN network is to use a discriminator to indirectly train the generator network to produce synthetic data. The generator must deceive the discriminator so that it cannot distinguish between fake and real samples, until the two networks reach an equilibrium point.

In this work, we propose a novel Variational Autoencoder GAN (VAE-GAN) technique for the synthetic time series data generation of smart homes. Different from the aforementioned schemes, this approach is capable of learning various types of data distributions in a smart home and generating plausible samples from the same distribution without performing any prior analysis on the data before the training phase. In addition, utilizing a variational autoencoder network in the GAN generator module helps the network to avoid mode collapse, which is a common failure in GAN networks. More specifically, it prevents the generator from finding only one output that seems most plausible to the discriminator and generating it every time.

The main contributions of this paper are as follows:

• A variational autoencoder GAN-based scheme is proposed to generate different types of time series synthetic data for smart homes with high temporal resolution.
• The performance of the proposed model and effectiveness
Accepted by 2022 IEEE International Conference on Communications (ICC), ©2022 IEEE
II. RELATED WORK

A wide variety of methods have been developed for synthetic data generation of smart grid applications. For example, a Markov chain-based user behavior simulation method is proposed in [5] for home energy consumption modeling. [6] develops a bottom-up analysis method for the residential building energy consumption. A statistical synthetic data generator is defined in [7] for electric vehicle load modeling, in which the Gaussian mixture model is used to estimate the connection time. In addition, [8] defines a smart residential load simulator based on MATLAB-Simulink, and it includes dedicated physical models of various household devices.

On the other hand, considering the high complexity of the above-mentioned model-based methods, the data-driven methods become a favorable replacement as they do not require prior knowledge. [11] proposes a GAN-based scheme to generate synthetic labeled load patterns and usage habits, which requires no model assumptions. Furthermore, [12] introduces a model-free method for scenario generation of smart grid, and GAN is used to capture the spatial and temporal correlations of renewable power plants. Similarly, GAN is deployed in [13] to generate realistic energy consumption data by learning from actual data.

In our former work, we proposed a sequence-to-sequence learning-based method for load prediction in [14], and a Q-learning based scheme for smart home energy management in [15]. We used a limited real dataset for our algorithm training in these former works. In this paper, we apply a novel VAE-GAN method for the synthetic data generation of the smart home. This is different from GAN-based approaches as a variational autoencoder network is deployed in the GAN generator, to overcome the mode collapse issue of the traditional GAN.

III. SMART HOME SYNTHETIC DATA GENERATION

To generate realistic smart home data, we adopt a Variational Autoencoder-Generative Adversarial Network (VAE-GAN) as a data-driven approach. VAE-GAN is used to generate daily overall electricity consumption and PV production data.

Deep learning-based generative models, which use unsupervised learning to learn data distribution and underlying patterns, have attracted a lot of attention in recent years. GAN and Variational Autoencoders are two of the well-known deep learning-based generative models. In the following sections, we explain the details of these techniques.

Fig. 1: VAE-GAN model architecture. In this network, the encoder module maps the input sequence to the mean and the variance of a latent space with Gaussian distribution. The generator module reconstructs the input sequence from the latent space and tries to mislead the discriminator module to discriminate the generated sequence as a real sample. The discriminator module learns the distribution difference in real and fake samples.

A. Generative Adversarial Network (GAN)

GAN networks employ an unsupervised learning method to detect and learn patterns in input data and produce new samples that have the same distribution as the original dataset. GAN is composed of two main modules, namely the generator and the discriminator, and it actively seeks an equilibrium between the two modules.

• Generator (G): It maps a prior probability distribution, that is defined on input noise $p_z(z)$, to a data space $G(z; \theta_g)$, where $z$ is the input noise, and $\theta_g$ is the network parameter.
• Discriminator (D): $D(x; \theta_d)$ produces a single scalar indicating the probability of $x$ being a member of the original data.

G and D play an adversarial game shown by equation (1), where D maximizes the probability of assigning true labels, $\log D(x)$, and G tries to minimize the same probability:

$$\min_G \max_D L_{GAN}(D, G) = \mathbb{E}_x[\log(D(x))] + \mathbb{E}_z[1 - \log(D(G(z)))] \tag{1}$$

B. Variational Autoencoder (VAE)

Autoencoder neural networks consist of two deep-learning based modules: encoder and decoder. The encoder module maps the input sequence into a meaningful latent space based on the original input sequence distribution, allowing the decoder module to reconstruct the input sequence with minimal error. However, vanilla autoencoders suffer from a lack of regularity in the latent space, which means the latent space may not be continuous to interpolate for data points that are not present in the input sequence.

Variational autoencoders overcome this shortcoming by adding a regularization parameter, Kullback–Leibler (KL) divergence (equation (10)), in the training process, to ensure the latent space follows a Gaussian distribution. Instead of mapping the input sequence ($x$) to a vector, the VAE encoder (E) maps the data to two different vectors that are mean and
standard deviation parameters of a Gaussian distribution. By minimizing the $L_{prior}$ loss, the encoder network is forced to compress the input sequence into a Gaussian distribution. In addition, it helps the decoder with reconstruction robustness, since the decoder module samples from a continuous distribution. The decoder loss is computed based on the distance between the reconstructed sequence ($\hat{x}$) and $x$. $L_{prior}$ and $L_{reconstruction}$ are backpropagated through the network to train the VAE parameters.

$$L_{prior} = D_{KL}(E(x) \,\|\, \mathcal{N}(0, 1)) \tag{2}$$

The VAE-GAN architecture that is used for smart home synthetic data generation is shown in Fig. 1. This network architecture includes a GAN network with the generator module being a VAE neural network. As previously stated, the vanilla GAN network suffers from mode collapse. The main reason is that the discriminator is trapped in a local minimum, and the generator module repeatedly produces the output that is most likely to mislead the discriminator. As a result, training the GAN network becomes challenging and problematic. To address this problem, Larsen et al. [16] inserted a variational autoencoder into the GAN's generator module to leverage the VAE latent space's regularity.

The encoder module (E) compresses the input sequence into two vectors that are $mean_z$ and $variance_z$ of a Gaussian distribution by minimizing $L_{prior}$.

The generator module (G) reconstructs the input sequence from the latent space $z$ so that the reconstructed and original sequences have the lowest Mean Squared Error (MSE) by minimizing $L_{reconstruction}$. In addition, the input sequence cannot be considered as a generated sequence by the discriminator module, hence $L_{dG}$ must be kept to a minimum. $L_{Generator}$ is computed as in equation (5).

$$L_{dG} = \mathbb{E}_x[\log(D(G(z)))] \tag{4}$$

The discriminator module (D) needs to distinguish the original input sequence ($L_{real}$) from the generator output sequence ($L_{fake}$). Meanwhile, to prevent the discriminator from failing to converge, $L_{noise}$ is added to the discriminator's loss function. This term enforces D to distinguish a random sample from normal distribution from the real input sequence. The overall discriminator loss function is computed based on equation (9).

$$L_{fake} = \mathbb{E}_z[1 - \log(D(G(z)))] \tag{7}$$

$$L_{noise} = \mathbb{E}_z[1 - \log(D(\mathcal{N}(0, 1)))] \tag{8}$$

$$L_D = L_{real} + L_{fake} + L_{noise} \tag{9}$$

[Figs. 2 and 3: architecture diagrams, images not recovered. Recovered labels: five stacked Dilated CNN blocks, each with Conv1D, Batch Norm, and Leaky ReLU layers, followed by Linear layers and Sigmoid and Tanh activations.]

Fig. 2 presents the encoder structure of VAE-GAN. The generator and discriminator modules have similar structures, as shown in Fig. 3. Dilated one-dimensional convolutional (Dilated CONV1D) neural network is used in the structure of the encoder, generator, and discriminator of the VAE-GAN network. This architecture is inspired by the WaveNet network [17] and utilizes dilated causal convolution layers to capture long-term dependencies in the input sequence. The 1-dimensional convolution slides a filter on an input series by one stride. However, in the dilated convolution, the sliding filter skips the input sequence with certain steps while keeping the order of the input data. Furthermore, multiple stacked dilated convolutional layers allow for longer input sequences, which reduces network complexity and training time compared with other long-term learning neural networks.
TABLE I: Distance between real and synthetic smart grid data distribution.

Model      KL divergence      Wasserstein distance    MMD
           Load     PV        Load      PV            Load     PV
GAN        0.543    0.273     530.3     961.5         0.227    0.224
VAE-GAN    0.017    0.006     255.3     266.6         0.101    0.117

MMD results, in Table I, show that the difference between the synthetic data and real data distributions is lower for the VAE-GAN network, demonstrating the network's superior performance in generating synthetic data distributions compared to the GAN network.

Tables II and III summarize the mean and standard deviation of five essential statistical parameters for synthetic and real electricity consumption and PV production data, respectively. For each parameter, we bold the result that is closer to the true distribution, for convenience. Because the data is min-max normalized before the training phase, the base load is extremely close to zero, which explains the difference in base load mean and standard deviation between synthetic and actual data for aggregated load consumption. The VAE-GAN network's ability to learn the load consumption and PV production data pattern is demonstrated by better peak load and high load duration results. Both models perform similarly and satisfactorily for the rise and fall time parameters.

V. CONCLUSION

Synthetic data generation is an important capability for applying advanced data science and machine learning techniques to smart home management. In this work, we apply a variational autoencoder GAN (VAE-GAN) for synthetic time series generation of smart home data, including electrical load and PV generation. The main advantage of this technique is that the network does not require any prior analysis of the data before training, and can be utilized to generate different forms of smart home data, including household electricity load and PV power generation. The model performance and synthetic data effectiveness are assessed with respect to the vanilla GAN network. The simulations show that the proposed scheme outperforms the vanilla GAN on the KL divergence, Wasserstein distance, and MMD metrics reported in Table I.

ACKNOWLEDGEMENT

Hao Zhou and Melike Erol-Kantarci were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), Collaborative Research and Training Experience Program (CREATE) under Grant 497981 and Canada Research Chairs Program.

REFERENCES

[1] B. Sovacool and D. Rio, “Smart home technologies in Europe: A critical review of concepts, benefits, risks and policies,” Renewable and Sustainable Energy Reviews, vol. 148, pp. 1–22, Mar. 2020.
[2] A. Al-Ali, I. Zualkernan, M. Rashid, R. Gupta, and M. Alikarar, “A smart home energy management system using IoT and big data analytics approach,” IEEE Transactions on Consumer Electronics, vol. 63, no. 4, pp. 426–434, Nov. 2017.
[3] N. Komninos, E. Philippou, and A. Pitsillides, “Survey in smart grid and smart home security: Issues, challenges and countermeasures,” IEEE Communications Surveys & Tutorials, vol. 16, no. 4, pp. 1933–1954, Apr. 2014.
[4] H. Zhou and M. Erol-Kantarci, “Decentralized microgrid energy management: A multi-agent correlated Q-learning approach,” in Proc. of IEEE conference on Smart Grid Communication, Nov. 2020, pp. 1–7.
[5] L. Diao, Y. Sun, Z. Chen, and C. Jiayu, “Modeling energy consumption in residential buildings: A bottom-up analysis based on occupant behavior pattern clustering and stochastic simulation,” Energy and Buildings, vol. 147, pp. 47–66, Jul. 2017.
[6] S. Ge, J. Li, H. Liu, X. Liu, Y. Wang, and H. Zhou, “Domestic energy consumption modeling per physical characteristics and behavioral factors,” Energy Procedia, vol. 158, pp. 2512–2517, Feb. 2019.
[7] M. Lahariya, D. Benoit, and C. Develder, “Synthetic data generator for electric vehicle charging sessions: Modeling and evaluation using real-world data,” Energies, vol. 13, pp. 1–18, Aug. 2020.
[8] J. López, E. Pouresmaeil, C. Cañizares, K. Bhattacharya, A. Mosaddegh, and B. Solanki, “Smart residential load simulator for energy management in smart grids,” IEEE Transactions on Industrial Electronics, vol. 66, no. 2, pp. 1443–1452, Mar. 2018.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. of the 27th International Conference on Neural Information Processing Systems, Dec. 2014, pp. 1–18.
[10] C. Zhang, S. Kuppannagari, R. Kannan, and V. Prasanna, “Generative adversarial network for synthetic time series data generation in smart grids,” in Proc. of 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, Oct. 2018, pp. 1–6.
[11] S. Kababji and P. Srikantha, “A data-driven approach for generating synthetic load patterns and usage habits,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 4984–4995, Nov. 2020.
[12] Y. Chen, Y. Wang, D. Kirschen, and B. Zhang, “Model-free renewable scenario generation using generative adversarial networks,” IEEE Transactions on Power Systems, vol. 33, no. 3, pp. 3265–3275, May 2018.
[13] M. Fekri, A. Ghosh, and K. Grolinger, “Generating energy data for machine learning with recurrent generative adversarial networks,” Energies, vol. 13, pp. 1–23, Dec. 2019.
[14] M. Razghandi, H. Zhou, M. Erol-Kantarci, and D. Turgut, “Short-term load forecasting for smart home appliances with sequence to sequence learning,” in Proc. of IEEE ICC, Jun. 2021, pp. 1–6.
[15] M. Razghandi, H. Zhou, M. Erol-Kantarci, and D. Turgut, “Smart home energy management: Sequence-to-sequence load forecasting and Q-learning,” in Proc. of IEEE Globecom, Dec. 2021, pp. 1–6.
[16] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in Proc. of the 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 1558–1566.
[17] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499 [cs.SD], Sep. 2016.
[18] P. Huber, M. Ott, M. Friedli, A. Rumsch, and A. Paice, “Residential power traces for five houses: the iHomeLab RAPT dataset,” Data, vol. 5, no. 1, pp. 1–14, Feb. 2020.
[19] Z. Wang and T. Hong, “Generating realistic building electrical load profiles through the generative adversarial network (GAN),” Energy and Buildings, vol. 224, pp. 1–15, Oct. 2020.
[Plots not recovered: probability density function (PDF) versus normalized power for real and synthetic data.]
Fig. 5: PV power production real and synthetic data probability density function for (a) GAN, and (b) VAE-GAN generative models. The
blue line shows the real data PDF, the orange line shows the synthetic data PDF.
TABLE II: Aggregated Load Consumption Evaluation Results. The numbers that are closest to the real data are highlighted in bold.

Model        Base Load         Peak Load            High-Load Duration   Rise Time       Fall Time
             mean    std       mean      std        mean    std          mean    std     mean    std
GAN          3.53    22.34     257.84    1658.12    0.01    0.08         0.45    0.80    0.49    0.96
VAE-GAN      1.73    10.99     119.85    752.92     0.00    0.04         0.44    0.74    0.52    0.97
Real data    9.75    51.13     151.39    1008.20    0.02    0.33         0.48    0.87    0.49    0.91
TABLE III: PV Power Production Evaluation Results. The numbers that are closest to the real data are highlighted in bold.

Model        Base Load         Peak Load            High-Load Duration   Rise Time       Fall Time
             mean    std       mean      std        mean    std          mean    std     mean    std
GAN          0.00    0.00      148.53    927.53     0.02    0.50         0.49    0.91    0.45    0.81
VAE-GAN      0.04    0.29      205.63    1283.73    0.00    0.00         0.42    0.79    0.54    1.02
Real data    0.00    0.00      197.91    1236.58    0.01    0.35         0.73    5.46    0.68    4.80
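The three distribution distances reported in Table I (KL divergence, Wasserstein distance, and MMD) can be estimated from samples roughly as in the sketch below. This is a hedged illustration, not the procedure used in the paper: the histogram binning, the Gaussian kernel with bandwidth `sigma`, the equal-sample-size requirement, and all function names are assumptions, so the values it produces are not expected to match Table I.

```python
import numpy as np

def kl_divergence(real, synth, bins=50, eps=1e-10):
    """Histogram estimate of KL(real || synth) on a shared bin grid (assumed binning)."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    # Smooth empty bins with eps, then renormalize to valid probabilities.
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def wasserstein_1d(real, synth):
    """1-D Wasserstein-1 distance for two equally sized samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(np.sort(real) - np.sort(synth))))

def mmd_rbf(real, synth, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel (assumed kernel)."""
    x = real.reshape(-1, 1)
    y = synth.reshape(-1, 1)

    def kernel(a, b):
        return np.exp(-((a - b.T) ** 2) / (2.0 * sigma ** 2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

# Toy data standing in for real and synthetic normalized power samples.
rng = np.random.default_rng(0)
real = rng.normal(0.5, 0.1, 500)
synthetic = rng.normal(0.55, 0.12, 500)
print(kl_divergence(real, synthetic),
      wasserstein_1d(real, synthetic),
      mmd_rbf(real, synthetic))
```

All three quantities shrink toward zero as the synthetic sample approaches the real distribution, which is the sense in which lower values in Table I indicate a better generative model.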