
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

A Deep Probabilistic Transfer Learning Framework for Soft Sensor Modeling With Missing Data

Zheng Chai, Chunhui Zhao, Senior Member, IEEE, Biao Huang, Fellow, IEEE, and Hongtian Chen, Member, IEEE

Abstract— Soft sensors have been extensively developed and applied in the process industry. One of the main challenges of data-driven soft sensors is the lack of labeled data and the need to absorb knowledge from a related source operating condition to enhance the soft sensing performance on the target application. This article introduces deep transfer learning to soft sensor modeling and proposes a deep probabilistic transfer regression (DPTR) framework. In DPTR, a deep generative regression model is first developed to learn Gaussian latent feature representations and model the regression relationship under the stochastic gradient variational Bayes framework. Then, a probabilistic latent space transfer strategy is designed to reduce the discrepancy between the source and target latent features such that the knowledge from the source data can be explored and transferred to enhance the target soft sensor performance. Besides, considering the missing values in the process data in the target operating condition, the DPTR is further extended to handle the missing data problem by utilizing the strong generation and reconstruction capability of the deep generative model. The effectiveness of the proposed method is validated through an industrial multiphase flow process.

Index Terms— Deep learning, industrial processes, missing data, probabilistic transfer learning (TL), soft sensor.
I. INTRODUCTION

EFFECTIVE soft sensors are of great importance in the process industry. By building regression models, the key process variables can be predicted in a timely manner for further process monitoring and control [1]. Generally, soft sensors follow two alternative routes: first-principle model-based methods [2] and data-driven methods [3]. Due to the increasing scale of modern industrial processes, first-principle models are difficult to obtain. On the contrary, with the availability of massive process measurements, data-driven soft sensors have been extensively studied and successfully applied to the process industry [4]–[6].

Currently, many data-driven approaches have been established for industrial process modeling through machine learning algorithms [7]–[10]. However, as most of them are developed from statistical approaches, an underlying assumption is that the training and testing data are drawn from the same distribution. In practice, because of changes in operating conditions, the distribution of the data collected from a new operating condition (target domain) may show certain discrepancies in comparison with the data collected from the original condition (source domain) [11]–[13]. This discrepancy can lead to inferior soft sensing performance on the target application if models built under the source domain are used. Under such circumstances, the soft sensor model has to be rebuilt from scratch with training data collected from the new target domain. This strategy, on one hand, would be costly in terms of computation. On the other hand, it is generally expensive or even impossible to collect sufficient labeled data in a new operating condition within a short time period.

To tackle the distribution mismatch across various domains, transfer learning (TL), as a recently developed learning paradigm, has shown potential in leveraging knowledge from different but related source-domain data [14], [15]. Because of its nature, TL has attracted considerable attention in many fields recently [16], [17]. Specifically, it has been applied to remaining useful life prediction [18] and fault diagnosis [11], [19] in manufacturing processes and has demonstrated prominent success. To address the soft sensor problem across domains, limited studies have made attempts at learning from both the source- and target-domain datasets. Liu et al. [20] developed a domain adaptation extreme learning machine (DAELM) model for soft sensing in multigrade chemical processes. In [21], a modified partial least squares (PLS) method is designed for the incremental learning and TL of industrial batch processes. In these methods, the distribution discrepancy that naturally exists between the two domains has not been considered and resolved properly. In [22], an adversarial TL strategy is used to reduce the distribution difference first, and then a DAELM regressor is conducted for soft sensing. However, the separated feature extraction and regressor modeling with shallow structures may show suboptimal performance in comparison with end-to-end deep soft sensors [23].

Manuscript received August 17, 2020; revised February 2, 2021 and May 10, 2021; accepted May 24, 2021. This work was supported in part by the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization under Grant U1709211, in part by the Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China, under Grant ICT2021A15, and in part by the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China, under Grant ICT2021B52. (Corresponding author: Chunhui Zhao.)

Zheng Chai and Chunhui Zhao are with the State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]).

Biao Huang and Hongtian Chen are with the Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada (e-mail: [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3085869.

Digital Object Identifier 10.1109/TNNLS.2021.3085869

In industrial applications, missing data are also a significant problem that should be considered for soft sensing across different operating conditions [3]. Due to possible failures or maintenance of hardware sensors, or signal transmission errors, missing values are frequently encountered in certain variables, leading to incomplete training samples [24]. In this case, most machine learning or TL methods become difficult to implement because of the assumption that the training samples should be structurally complete. To deal with the missing data problem, the downsampling technique provides a simple solution by discarding the records with missing values. This solution, although it can be readily implemented in practice, can generally cause information loss or asymmetry. Thus, instead of discarding data directly, imputing the missing values has become a popular alternative [3], and various approaches have been developed for data imputation. Representative methods include principal component analysis (PCA)-based approaches, such as probabilistic PCA [25]. In recent years, because of their efficacy in nonlinear information processing and data reconstruction, deep models, such as autoencoders (AEs), have been actively researched for performing data imputation and soft sensor modeling [26]. Despite their popularity, most deep models are developed in a deterministic fashion. This means that deterministic feature representations are expressed, which do not contain uncertainty information and thus may lead to weak robustness of the soft sensors [27]. As a probabilistic counterpart of the deterministic AE, the variational autoencoder (VAE) [28] naturally provides an uncertain data modeling solution for industrial applications [29], [30]. Developed under the framework of stochastic gradient variational Bayes (SGVB), the VAE learns nonlinear reconstructive latent variables that are expected to follow standard normal distributions. Therefore, its capacities for characterization, reconstruction, and complex nonlinear feature extraction of uncertain data make it well suited for probabilistic TL with missing data.

In this article, a deep probabilistic transfer regression (DPTR) framework is proposed for the TL problem in modeling soft sensors with missing data. First, formulating the learning objective under the framework of SGVB, a deep generative regression model (DGRM) is developed in an end-to-end fashion, which is structured by a deep EncoderNet, PriorNet, and DecoderNet. The developed model can not only extract distributions of nonlinear feature representations from raw inputs but also model the data regression relationship for the labeled training data. Second, a probabilistic latent space transfer strategy is designed to make the probabilistic latent features transferrable across different operating conditions. The AdversaryNet is designed to reduce the discrepancy between the probabilistic representations under different operating conditions together with the sampling-based reparameterization trick. Thus, the DPTR model can be established based on the DGRM and the probabilistic latent space transfer strategy, which can then be deployed for the soft sensor task under the target operating condition. Furthermore, considering missing values in certain sampling instances, a regression modeling strategy with consideration of missing data is developed under the TL framework to fully exploit the generation and reconstruction capacities of the deep model. This work contributes in two aspects.

1) A DPTR framework is developed to tackle the distribution discrepancy in developing soft sensors considering data uncertainties. Thus, the probabilistic latent space and model knowledge in the source domain can be transferred for enhancing the performance of the soft sensor in the target domain.

2) The missing data are explored under the TL framework for industrial soft sensor modeling. With the capability of data generation and reconstruction, the proposed method can naturally impute the missing values that are prevalent in industrial applications.

The remainder of this article starts with brief introductions to the related works in Section II. In Section III, the detailed description of the DPTR using both complete data and missing data is introduced. Section IV presents the illustrations on a multiphase flow process (MFP) to verify the efficacy of the DPTR. Finally, Section V concludes this article.

II. RELATED WORKS

A. Data-Driven Soft Sensors

Data-driven soft sensors have been widely developed in recent years [3], [4]. To predict the key quality-relevant variables in industrial processes, data-driven soft sensors generally build regression models between the easy-to-measure process variables and the hard-to-measure quality variables based on the offline training data [23]. With the advances of machine learning techniques, various approaches have been proposed for soft sensor development. The most representative machine learning-based soft sensors include PLS [4], [7], support vector regression [8], and slow feature analysis [9], [31].

In recent years, due to their capability of learning complex feature representations, deep neural networks (DNNs) have been extensively researched and applied in industrial applications [32]. The widely used model structures for building industrial soft sensors include AEs [6], [23], VAEs [27], [30], and recurrent neural networks (NNs) [33], [34]. For example, Yan et al. [32] pretrained multiple denoising AEs as the backbone, and then an additional output layer is added for prediction of the oxygen content in flue gases. Yuan et al. [23] developed a quality-driven AE-based soft sensor, and the effectiveness is validated through an industrial debutanizer column case. Despite these advances, the conventional data-driven and machine learning-based soft sensors generally assume that the training and testing data are drawn from the same distribution, which is challenged in many industrial applications due to the changes in operating conditions.

B. Domain Adaptation

Domain adaptation aims to minimize the distribution discrepancy between the source and the target domain such that the knowledge from the relevant source domain can be adopted and transferred to the target domain [15], [16].


As a representative technique of TL [14], domain adaptation algorithms have been widely developed and applied to many fields, such as computer vision [16] and natural language processing [17].

Many conventional machine learning-based domain adaptation algorithms have been developed [16], [35]. For example, Pan et al. [15] designed a transfer component analysis method for extracting transferrable components, and then the classifiers or regression models in the source domain can be applied to the target domain. In the locality-preserving joint transfer method, the feature and sample levels of knowledge transfer are jointly considered and optimized to improve the performance [16]. In recent years, with the rapid progress of deep learning techniques, deep learning-based domain adaptation has attracted considerable attention. For example, Li et al. [36] developed faster domain adaptation networks to deal with the limited computing resource problem and accelerate the knowledge adaptation. Among these deep model-based domain adaptation methods, common solutions include optimizing distribution discrepancy metrics [11], [37] and using domain adversarial training (DAT) [38]. For the discrepancy metric-based methods, existing approaches generally embed the metrics into the deep structures to learn transferable feature spaces. For example, by incorporating the multikernel maximum mean discrepancy (MMD) metric into the loss function of DNNs, Long et al. [39] proposed a deep adaptation network to extract transferable features, and an optimal multikernel selection strategy is designed to further improve the feature matching performance. Recently, different from the abovementioned solution, the DAT, which is inspired by generative adversarial nets [40], is utilized to make the feature representations from the source and target domains unrecognizable. It draws wide research interest due to its superior performance in comparison with some MMD-based works [38] and its lesser need for specified hyperparameters, such as the kernel parameters in MMD [15], [37], [39]. Specifically, it has been applied to manufacturing processes and has demonstrated prominent effectiveness. For example, Li et al. [41] developed a DAT-based method to leverage the knowledge from different but related equipment to improve the diagnostic performance in rotating machinery. By treating each grade as a domain in multigrade industrial processes, Liu et al. [22] used the DAT to extract transferrable representations, which benefits the performance of industrial quality inference. Due to its advantages, popularity, and effectiveness in manufacturing process modeling, this article follows the DAT approach and develops a probabilistic counterpart to model the cross-domain soft sensor of process data.

C. Variational Autoencoders

The VAE [28] is an NN that is composed of a recognition model qφ(z|x) that approximates the intractable posterior pθ(z|x) and a generation model pθ(x|z) that provides a distribution for the generated x, where φ and θ signify the variational and generative parameters, i.e., the parameters regarding the recognition and the generation processes [28]. Based on these two models, which are also referred to as the probabilistic encoder and decoder in the literature, the VAE aims to optimize the variational evidence lower bound (ELBO) on the marginal likelihood. Specifically, the data likelihood is calculated as the sum over all the marginal likelihoods of the samples

$$\log p_\theta(x_1, x_2, \ldots, x_N) = \sum_{i=1}^{N} \log p_\theta(x_i) \tag{1}$$

where N signifies the number of samples and each marginal likelihood log pθ(x_i) is given by

$$\log p_\theta(x_i) = D_{\mathrm{KL}}\big(q_\phi(z|x_i) \,\|\, p_\theta(z|x_i)\big) + \mathcal{L}(\theta, \phi; x_i). \tag{2}$$

On the right-hand side (RHS) of (2), the first term indicates the Kullback–Leibler (KL) divergence between the approximation and the true posterior. Because the KL term is nonnegative, the ELBO on log pθ(x_i) can then be formulated as

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z|x_i)}\big[-\log q_\phi(z|x_i) + \log p_\theta(x_i, z)\big] = -D_{\mathrm{KL}}\big[q_\phi(z|x_i) \,\|\, p_\theta(z)\big] + \mathbb{E}_{q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] \tag{3}$$

where θ and φ are optimized such that

$$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(\theta, \phi; x_i). \tag{4}$$

To accomplish the optimization, note that the second RHS term in (3) requires sampling a latent variable z ∼ qφ(z|x), while this is problematic in practice as the random sampling is nondifferentiable in NN training. Typically, the approximation qφ(z|x) is designed as some parameterized distribution, and it is possible to introduce an auxiliary variable ε to sample a deterministic z from x using z = gφ(ε, x). This reparameterization trick yields a low-variance SGVB estimator, which is the learning objective of the VAE model

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{\mathrm{KL}}\big(q_\phi(z|x_i) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x_i \,\big|\, z_i^l\big) \tag{5}$$

where z_i^l = gφ(ε^l, x_i) with ε^l sampled from the independent marginal distribution p(ε), and L is the number of samplings. Generally, the VAE assumes the prior pθ(z) = N(0, I) and the true posterior to be a multivariate Gaussian distribution, and the approximate posterior is also a multivariate Gaussian with an isotropic covariance qφ(z|x_i) = N(μ_i, (σ_i)²I). Then, z_i can be sampled using z_i^l = μ_i + σ_i ⊙ ε^l, where ε^l ∼ N(0, I) and ⊙ indicates elementwise multiplication.

To incorporate supervision information into the VAE model, the conditional VAE (CVAE) [42]–[44] extends the distributions to dependence on external information, i.e., a condition label c. The ELBO of the CVAE can be readily formulated through the following expression according to (3):

$$\log p_\theta(x|c) \geq -D_{\mathrm{KL}}\big[q_\phi(z|x, c) \,\|\, p_\theta(z|c)\big] + \mathbb{E}_{q_\phi(z|x, c)}\big[\log p_\theta(x|z, c)\big]. \tag{6}$$
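For concreteness, the following is a minimal PyTorch-style sketch of the SGVB estimator in (5) with the reparameterization trick. It is an illustrative sketch under the Gaussian assumptions above, not the implementation used in this article; the function and argument names are assumptions, and the Gaussian reconstruction term is written as a squared error up to constants.

import torch

def vae_loss(mu, log_var, x, decoder, L=1):
    """Negative SGVB estimator of (5) for one minibatch.

    mu, log_var: encoder outputs parameterizing q_phi(z|x) = N(mu, sigma^2 I).
    decoder:     network realizing the mean of p_theta(x|z).
    """
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)

    # Monte Carlo reconstruction term; z = mu + sigma * eps keeps the
    # sampling step differentiable with respect to the encoder (the trick).
    rec = 0.0
    for _ in range(L):
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        rec = rec + torch.sum((x - decoder(z)) ** 2, dim=1)

    return torch.mean(kl + rec / L)  # minimizing this maximizes the ELBO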


III. METHODOLOGY

In this section, a DGRM is first designed to obtain the soft sensor in a supervised end-to-end fashion. Second, a probabilistic latent space transfer strategy is developed to drive the probabilistic features to be transferrable such that the soft sensor built in the source domain can be adapted to the application in the target operating condition. Then, a DPTR model can be established to conduct soft sensing across different operating conditions. Finally, taking the missing data problem into account, we extend the proposed method to the TL scenario with missing data.

A. Deep Generative Regression Model

In this section, we present the DGRM that extends the plain VAE to a probabilistic soft sensor model under the SGVB framework. The plain VAE is an unsupervised generative model aiming at maximizing the data likelihood log pθ(x). Considering the supervised regression task in soft sensing, we first deduce the DGRM under the SGVB framework, whose graphical model is shown in Fig. 1 in comparison with the VAE and CVAE.

[Fig. 1. Graphical models of (a) plain VAE, (b) CVAE, (c) DGRM, and (d) DGRM-MD. The black arrowed lines denote the generative process. The green and red arrowed lines indicate the conditional prior of z and the approximate inference of z, respectively.]

Specifically, given a labeled sample pair {x, y}, let the hidden variable z be conditioned on x in the DGRM. In the inference phase, a hidden variable z is sampled from the prior density pθ(z|x). Then, the output y can be generated based on pθ(y|z). Given the description of the DGRM, whose graphical model is shown in Fig. 1(c), the goal is to optimize the parameters such that the conditional likelihood pθ(y|x) is maximized. To this end, for an individual sample pair {x, y}, we first deduce the conditional log likelihood as follows:

$$\log p_\theta(y|x) = \mathbb{E}_{q_\phi(z|x,y)}\left[\log \frac{p_\theta(x, y, z)}{p_\theta(z|x, y)\, p_\theta(x)} \cdot \frac{q_\phi(z|x, y)}{q_\phi(z|x, y)}\right] = \mathbb{E}_{q_\phi(z|x,y)}\left[\log \frac{p_\theta(y, z|x)}{q_\phi(z|x, y)}\right] + D_{\mathrm{KL}}\big(q_\phi(z|x, y) \,\|\, p_\theta(z|x, y)\big) = \mathrm{ELBO}_{\mathrm{DGRM}}(\theta, \phi) + D_{\mathrm{KL}}\big(q_\phi(z|x, y) \,\|\, p_\theta(z|x, y)\big). \tag{7}$$

In (7), the ELBO term indicates the variational lower bound of log pθ(y|x). Following the model structure in Fig. 1(c) and recalling that the output y is generated through the conditional density pθ(y|z) in the DGRM, the ELBO of the DGRM can be further written as

$$\mathrm{ELBO}_{\mathrm{DGRM}}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x,y)}\left[\log \frac{p_\theta(z|x)}{q_\phi(z|x, y)}\right] + \mathbb{E}_{q_\phi(z|x,y)}\big[\log p_\theta(y|z)\big] = -D_{\mathrm{KL}}\big(q_\phi(z|x, y) \,\|\, p_\theta(z|x)\big) + \mathbb{E}_{q_\phi(z|x,y)}\big[\log p_\theta(y|z)\big]. \tag{8}$$

The ELBO of the DGRM method contains two parts. The first RHS term performs regularization on the divergence between qφ(z|x, y) and pθ(z|x). The second RHS term is an expected negative reconstruction error of y, given the sampled hidden variable z. Multilayered perceptrons are used to realize the generative processes pθ(z|x) and pθ(y|z) and the recognition process qφ(z|x, y) in the DGRM. Specifically, we refer to qφ(z|x, y) as an EncoderNet, which is parameterized by W_E. The generative process pθ(y|z) is realized using a DecoderNet parameterized by W_D. The prior density pθ(z|x) is provided by a PriorNet parameterized by W_P.

For analytically solving the KL term in (8), similar to the plain VAE, we assume the prior density to be a parameterized Gaussian distribution N(μ_x, σ_x²I). The parameters μ_x and σ_x are estimated using the PriorNet in the DGRM. The approximate posterior qφ(z|x, y) is a multivariate Gaussian with isotropic covariance N(μ_xy, σ_xy²I), where μ_xy and σ_xy are realized using the EncoderNet in the DGRM. Denoting J as the dimensionality of z, the KL divergence in (8) is given by

$$-D_{\mathrm{KL}}\big(q_\phi(z|x, y) \,\|\, p_\theta(z|x)\big) = -\int \mathcal{N}\big(\mu_{xy}, \sigma_{xy}^2 I\big) \log \frac{\mathcal{N}\big(\mu_{xy}, \sigma_{xy}^2 I\big)}{\mathcal{N}\big(\mu_x, \sigma_x^2 I\big)}\, dz = \frac{1}{2} \sum_{j=1}^{J} \left[\log \frac{\sigma_{xy}^{(j)2}}{\sigma_x^{(j)2}} - \frac{\sigma_{xy}^{(j)2}}{\sigma_x^{(j)2}} - \frac{\big(\mu_{xy}^{(j)} - \mu_x^{(j)}\big)^2}{\sigma_x^{(j)2}} + 1\right]. \tag{9}$$

According to (9), the SGVB lower bound estimator of the DGRM in (8) can be rewritten as

$$\mathcal{L}_{\mathrm{DGRM}}(W_P, W_E, W_D; x, y) = \frac{1}{2} \sum_{j=1}^{J} \left[\log \frac{\sigma_{xy}^{(j)2}}{\sigma_x^{(j)2}} - \frac{\sigma_{xy}^{(j)2}}{\sigma_x^{(j)2}} - \frac{\big(\mu_{xy}^{(j)} - \mu_x^{(j)}\big)^2}{\sigma_x^{(j)2}} + 1\right] + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(y \,\big|\, z^l\big) \tag{10}$$

and the parameters W_P, W_E, and W_D are optimized such that

$$(\hat{W}_P, \hat{W}_E, \hat{W}_D) = \arg\max_{W_P, W_E, W_D} \mathcal{L}_{\mathrm{DGRM}}(W_P, W_E, W_D; x, y). \tag{11}$$

As the first RHS term in (10) forces qφ(z|x, y) to be close to the prior pθ(z|x), in the online testing phase, the latent variable z can be obtained directly through the efficient mapping pθ(z|x) and the reparameterization technique [27], [44]. Moreover, in the proposed DGRM, the output y is generated through the probabilistic DecoderNet pθ(y|z). We note that this shares a similar property with the conditional multimodal AE [45], which naturally enables us to transfer the latent space z across datasets collected under different operating conditions and thus adapt the learned deep generative regression knowledge from the source-domain dataset to the target.
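As an illustration of how the DGRM terms in (9)–(11) can be evaluated, the sketch below computes the closed-form KL divergence between the EncoderNet posterior N(μ_xy, σ_xy²I) and the PriorNet density N(μ_x, σ_x²I), together with the Monte Carlo reconstruction term of y. It is a minimal PyTorch-style sketch assuming each network returns a mean and a log-variance (with the DecoderNet here returning only a mean); all function names and interfaces are illustrative assumptions.

import torch

def dgrm_loss(encoder_net, prior_net, decoder_net, x, y, L=1):
    """Negative SGVB estimator of (10) for one labeled minibatch {x, y}."""
    mu_xy, log_var_xy = encoder_net(x, y)   # q_phi(z|x, y)
    mu_x,  log_var_x  = prior_net(x)        # p_theta(z|x)

    # Closed-form KL between the two diagonal Gaussians, cf. (9).
    var_xy, var_x = log_var_xy.exp(), log_var_x.exp()
    kl = 0.5 * torch.sum(
        (log_var_x - log_var_xy)              # log(sigma_x^2 / sigma_xy^2)
        + var_xy / var_x                      # sigma_xy^2 / sigma_x^2
        + (mu_xy - mu_x).pow(2) / var_x - 1.0,
        dim=1)

    # Monte Carlo estimate of the reconstruction term of y via the
    # reparameterization trick; squared error for a Gaussian decoder
    # (up to additive constants).
    rec = 0.0
    for _ in range(L):
        z = mu_xy + torch.exp(0.5 * log_var_xy) * torch.randn_like(mu_xy)
        rec = rec + torch.sum((y - decoder_net(z)) ** 2, dim=1)

    return torch.mean(kl + rec / L)  # minimizing this maximizes (10)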

B. Probabilistic Latent Space Transfer

To achieve the latent space transfer on the probabilistic representation z, we first give some necessary notations for the soft sensor in the TL scenario. Denote the training data in the source domain as {X_S, Y_S} = {x_S^i, y_S^i}_{i=1}^{n_S} and the training data in the target domain as {X_T, Y_T} = {x_T^i, y_T^i}_{i=1}^{n_T}, where x and y signify the input and the output variables in a soft sensor, respectively. n_S and n_T denote the numbers of training samples in the two datasets, and n_T ≪ n_S, which thus makes it less accurate to train a soft sensor model using target or source data only.

Denote the marginal distributions of X_S and X_T as P(X_S) and P(X_T), respectively. It is noted that in the soft sensing problem under variable operating conditions, the two distributions are naturally different, i.e., P(X_S) ≠ P(X_T), which is also the primary assumption in TL [15], [46]. Under this condition, a feasible solution is to adapt the regression knowledge from the source domain to the target domain to enhance the soft sensing performance by transferring the latent feature space [15], [47]. This means that the goal of the probabilistic latent space transfer is to minimize the discrepancy between the two datasets in the feature representation space, i.e., the probabilistic latent space z. To this end, we first assign an operating condition label a^i ∈ {0, 1} to each x_S^i and x_T^i to denote whether the sample belongs to the source- or target-domain dataset, considering that the operating condition information is readily available prior to the training phase. Thus, a training sample is represented by {x^i, y^i, a^i}. Then, the latent variables of both datasets can be trained adversarially to enable the adaptation from the source to the target domain, motivated by the deterministic DAT idea proposed by Ganin et al. [38]. Let G_A(W_A; z^i), parameterized by W_A, be an operating condition label predictor modeled by an NN, which is termed the AdversaryNet in this article. Then, the logistic loss function can be used to measure the prediction loss of G_A, which is formulated by

$$\mathcal{L}_{\mathrm{DAT}}(W_A; z^i) = -\big[a^i \log G_A(W_A; z^i) + (1 - a^i) \log\big(1 - G_A(W_A; z^i)\big)\big]. \tag{12}$$

The vanilla DAT strategy is motivated by GAN [40], which aims to find a mapping from input noises to fake samples by competing the generator against the discriminator. In the DAT framework, on one hand, the feature z^i is optimized to fool the AdversaryNet G_A(W_A; z^i), which thus makes the operating condition label of z^i unrecognizable. On the other hand, W_A is optimized to make G_A(W_A; z^i) accurately distinguish the operating condition label of an input z^i. Such a competition is expected to achieve a Nash equilibrium that makes the learned z^i from both the source- and target-domain datasets transferrable.

It is noted that the vanilla DAT is conducted on a deterministic feature space. As we are interested in achieving probabilistic latent space transfer, a Monte Carlo DAT (MC-DAT) approach is designed in which the EncoderNet and the reparameterization trick are utilized to sample z, and then the prediction losses of the AdversaryNet G_A are averaged as follows:

$$\mathcal{L}_{\mathrm{MC\text{-}DAT}}(W_A; \tilde{z}^i) = -\frac{1}{L} \sum_{l=1}^{L} \big[a^i \log G_A\big(W_A; z_l^i\big) + (1 - a^i) \log\big(1 - G_A\big(W_A; z_l^i\big)\big)\big] \tag{13}$$

where L is the number of samplings, z̃^i denotes the empirically sampled z for the ith sample, and each z is achieved by

$$z_l^i = \mu_{xy}^i + \varepsilon^l \odot \sigma_{xy}^i, \quad \varepsilon^l \sim \mathcal{N}(0, I). \tag{14}$$

C. DPTR-Based Soft Sensor

In this section, we present the DPTR model along with the corresponding soft sensing procedure. The basic idea is to learn a transferrable probabilistic latent space z for the DGRM such that the model knowledge from the source operating condition can be adopted and transferred to the target operating condition where the labeled data are limited. To achieve this goal, the DPTR integrates the EncoderNet, PriorNet, and DecoderNet of the DGRM and the AdversaryNet of the MC-DAT. Let d signify the dimensionality of an input sample x; a sketch of the DPTR is presented in Fig. 2.

[Fig. 2. Schematic of the proposed DPTR. The arrowed solid lines indicate the feedforward propagation during the training stage, and the arrowed dashed flow is used in online testing.]

Based on the DGRM and the MC-DAT strategy, the overall objective function of the DPTR model is given by

$$\mathcal{L}_{\mathrm{DPTR}}(W_P, W_E, W_D, W_A; X_S, Y_S, X_T, Y_T) = \mathcal{L}_{\mathrm{DGRM}}(W_P, W_E, W_D; X_S, Y_S, X_T, Y_T) + \mathcal{L}_{\mathrm{MC\text{-}DAT}}(W_A; Z_S, Z_T) = \frac{1}{n_S + n_T} \sum_{i=1}^{n_S + n_T} \Bigg[\frac{1}{2} \sum_{j=1}^{J} \Bigg(\log \frac{\sigma_{xy}^{i,(j)2}}{\sigma_x^{i,(j)2}} - \frac{\sigma_{xy}^{i,(j)2}}{\sigma_x^{i,(j)2}} - \frac{\big(\mu_{xy}^{i,(j)} - \mu_x^{i,(j)}\big)^2}{\sigma_x^{i,(j)2}} + 1\Bigg) + \frac{1}{L} \sum_{l=1}^{L} \Big(\log p_\theta\big(y^i \,\big|\, z_l^i\big) - a^i \log G_A\big(W_A; z_l^i\big) - (1 - a^i) \log\big(1 - G_A\big(W_A; z_l^i\big)\big)\Big)\Bigg] \tag{15}$$

where the parameters W_P, W_E, and W_D of the DGRM and W_A of the AdversaryNet can be learned in an adversarial fashion such that

$$(\hat{W}_P, \hat{W}_E, \hat{W}_D) = \arg\max_{W_P, W_E, W_D} \mathcal{L}_{\mathrm{DPTR}}(W_P, W_E, W_D, W_A), \quad \hat{W}_A = \arg\min_{W_A} \mathcal{L}_{\mathrm{DPTR}}(W_P, W_E, W_D, W_A). \tag{16}$$

In the offline modeling phase, based on the available data {X_S, Y_S} and {X_T, Y_T}, the DPTR model is trained according to (15) and (16) through gradient descent.
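The MC-DAT loss of (13) and (14) can be sketched as follows. This is a hedged PyTorch-style illustration with assumed interfaces, not the authors' implementation; the min–max game of (16) can then be realized, e.g., with alternating optimizer steps or a gradient reversal layer.

import torch

def mc_dat_loss(mu_xy, log_var_xy, a, adversary_net, L=1):
    """Monte Carlo domain-adversarial loss of (13) for one minibatch.

    a:             operating-condition labels a^i in {0, 1} (source/target).
    adversary_net: AdversaryNet G_A mapping z to a probability in (0, 1).
    """
    loss = 0.0
    for _ in range(L):
        # z_l^i = mu_xy^i + eps^l * sigma_xy^i with eps^l ~ N(0, I), cf. (14).
        z = mu_xy + torch.randn_like(mu_xy) * torch.exp(0.5 * log_var_xy)
        p = adversary_net(z).squeeze(-1)
        loss = loss - (a * torch.log(p + 1e-8)
                       + (1 - a) * torch.log(1 - p + 1e-8))
    return torch.mean(loss / L)

Per (16), W_A is updated to decrease this loss while the EncoderNet parameters are updated to increase it, so that the sampled latent features of the two operating conditions become indistinguishable.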


 
Then, the well-trained PriorNet and DecoderNet are deployed for online applications.

In the online application phase, denote x_T^test as the online input sample in the target operating condition. The sample is first passed through the PriorNet, and the latent variable can then be obtained as z_T^test = E[z|x_T^test]. Finally, the output ŷ_T^test is estimated for x_T^test with the DecoderNet pθ(y|z_T^test).
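The two-step online prediction described above amounts to a single deterministic forward pass. A minimal sketch, assuming prior_net returns the mean and log-variance of pθ(z|x) and decoder_net realizes pθ(y|z) (names are illustrative):

import torch

@torch.no_grad()
def predict_online(x_test, prior_net, decoder_net):
    """Online soft sensing in the target condition: x -> E[z|x] -> y_hat."""
    mu_x, _ = prior_net(x_test)   # latent variable taken as the prior mean
    return decoder_net(mu_x)      # DecoderNet p_theta(y|z) yields the estimate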
D. Extension to Transferring Soft Sensor With Missing Data

A TL-based probabilistic regression model, the DPTR, has been built so far. Like most TL methods, it uses structurally complete data samples to build the model. In industrial applications, however, missing data are a common problem due to hardware sensor failures or signal transmission errors. Previous works [27], [48] have applied the VAE for data imputation, using complete data to train a network in advance and then feeding the incomplete data iteratively or directly into the stacked model. However, the complete training data in the target operating condition are generally too limited to learn a valid VAE model for imputation. Note that there are generally sufficient complete data in the source domain, which are labeled and provide the potential to build a valid soft sensor with data imputation ability. Here, the "labeled data" in the source domain refer to the source-domain samples having the corresponding soft sensor outputs y, which can thus be used for soft sensor training. Specifically, for a complete sample x in the source and target domain, let x̄ signify the incomplete version in which certain variables are intentionally missed, and let y denote the corresponding output of x̄. Thus, unlike the existing VAE-based imputation methods, extending from the DGRM in (7) and (8), we consider the DGRM with missing data (DGRM-MD), where the conditional probability pθ(x, y|x̄) is maximized. The graphical model of the DGRM-MD is presented in Fig. 1(d). With the model description, the corresponding ELBO of the DGRM-MD on the source data is deduced as follows:

$$\log p_\theta(x, y|\bar{x}) = \mathbb{E}_{q_\phi(z|\bar{x},y)}\left[\log \frac{p_\theta(x, y, z|\bar{x})}{p_\theta(z|x, y, \bar{x})} \cdot \frac{q_\phi(z|\bar{x}, y)}{q_\phi(z|\bar{x}, y)}\right] = \mathbb{E}_{q_\phi(z|\bar{x},y)}\left[\log \frac{p_\theta(x, y, z|\bar{x})}{q_\phi(z|\bar{x}, y)}\right] + D_{\mathrm{KL}}\big(q_\phi(z|\bar{x}, y) \,\|\, p_\theta(z|x, y, \bar{x})\big) \geq -D_{\mathrm{KL}}\big(q_\phi(z|\bar{x}, y) \,\|\, p_\theta(z|\bar{x})\big) + \mathbb{E}_{q_\phi(z|\bar{x},y)}\big[\log p_\theta(x, y|z)\big]. \tag{17}$$

In comparison with the DGRM, note that these two model structures are designed for different scenarios. The DGRM is developed for soft sensing using structurally complete data, and the DGRM-MD is used for soft sensing using incomplete data with missing variables. Thus, for the ELBO of the DGRM in (8), the goal is to estimate the soft sensor output y based on the structurally complete process variables x, and the DecoderNet is realized through pθ(y|z). In (17), as missing variables are contained in x̄, both the recovery of the clean x and the prediction of y are crucial to build a valid soft sensor, and thus, the DecoderNet is realized using pθ(x, y|z).

Similar to the DGRM, the optimization objective of the DPTR with missing data (DPTR-MD) can be achieved by substituting (17) for the DGRM loss in (15)

$$\mathcal{L}_{\mathrm{DPTR\text{-}MD}} = \mathcal{L}_{\mathrm{DGRM\text{-}MD}} + \mathcal{L}_{\mathrm{MC\text{-}DAT}} \tag{18}$$

where L_DGRM-MD is defined by

$$\mathcal{L}_{\mathrm{DGRM\text{-}MD}} = \frac{1}{n_S + n_T} \sum_{i=1}^{n_S + n_T} \Bigg[\frac{1}{2} \sum_{j=1}^{J} \Bigg(\log \frac{\sigma_{\bar{x}y}^{i,(j)2}}{\sigma_{\bar{x}}^{i,(j)2}} - \frac{\sigma_{\bar{x}y}^{i,(j)2}}{\sigma_{\bar{x}}^{i,(j)2}} - \frac{\big(\mu_{\bar{x}y}^{i,(j)} - \mu_{\bar{x}}^{i,(j)}\big)^2}{\sigma_{\bar{x}}^{i,(j)2}} + 1\Bigg) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^i, y^i \,\big|\, z_l^i\big)\Bigg]. \tag{19}$$

The aim of the DGRM-MD method can be regarded as the reconstruction of x and the prediction of y given the incomplete x̄ under the SGVB framework, both of which are crucial to build a valid soft sensor with missing data. Thus, besides the restriction on the latent feature z, the loss is summed over the recovery of x and the prediction of y, as shown in (19). Many sophisticated strategies can be adopted to optimize this objective. In this article, a two-stage training strategy is conducted on the reconstruction task of x and the prediction task of y. In the first stage, the losses regarding the latent space regularization and the reconstruction of x in (19) are optimized to impute the missing values first. With the structurally complete source and target data, in the second stage, the DPTR model can be applied to learn probabilistic transferable feature representations across the source and target data by optimizing the losses with respect to the prediction of y and the latent regularization, and the cross-domain soft sensor can thus be established.

To assess the soft sensing performance, two metrics, including the root-mean-squared error (RMSE) and the mean absolute error (MAE), are used for evaluation

$$\mathrm{RMSE} = \sqrt{\frac{1}{n_T^{\mathrm{test}}} \sum_{i=1}^{n_T^{\mathrm{test}}} \big(y_T^{i,\mathrm{test}} - \hat{y}_T^{i,\mathrm{test}}\big)^2} \tag{20}$$

$$\mathrm{MAE} = \frac{1}{n_T^{\mathrm{test}}} \sum_{i=1}^{n_T^{\mathrm{test}}} \big|y_T^{i,\mathrm{test}} - \hat{y}_T^{i,\mathrm{test}}\big| \tag{21}$$

where n_T^test signifies the number of testing data in the target operating condition.
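The two evaluation metrics translate directly into code; for instance, in NumPy:

import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error over the target testing data, cf. (20)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over the target testing data, cf. (21)."""
    return np.mean(np.abs(y_true - y_pred))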

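Before turning to the case study, the two-stage optimization of the DPTR-MD objective in (18) and (19) described in Section III-D can be summarized by the following schematic sketch. The model interface (the dgrm_md_terms and mc_dat_loss methods and the two optimizer-step helpers) is hypothetical and stands in for the gradient-based adversarial updates of (16).

def train_dptr_md(model, loader, epochs_stage1, epochs_stage2):
    """Schematic two-stage training of the DPTR-MD, cf. (18) and (19)."""
    # Stage 1: latent regularization + reconstruction of x, i.e., the
    # imputation-oriented part of (19), optimized on incomplete samples.
    for _ in range(epochs_stage1):
        for x_bar, x, y, a in loader:            # a: operating-condition label
            kl, rec_x, _ = model.dgrm_md_terms(x_bar, x, y)
            model.generative_step(kl + rec_x)    # impute missing values first

    # Stage 2: prediction of y + latent regularization + MC-DAT transfer,
    # trained adversarially across the source and target data.
    for _ in range(epochs_stage2):
        for x_bar, x, y, a in loader:
            kl, _, pred_y = model.dgrm_md_terms(x_bar, x, y)
            adv = model.mc_dat_loss(x_bar, y, a)
            model.adversarial_step(kl + pred_y, adv)
    return model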



IV. CASE STUDY

In this section, an industrial MFP is used to carry out the performance evaluation of the proposed DPTR framework.

A. Process Description

The MFP system is developed to supply a controlled and measured multiphase flow comprising water, oil, and air to a pressurized facility [49]. For monitoring and control purposes, some hardware sensors have been installed in the system, as shown in the simplified sketch of the facility in Fig. 3. As shown in the figure, the MFP system consists of a gas–liquid separator, a three-phase separator, and several coalescers and storage tanks that are connected through pipelines with various sizes and geometries. A flow mixture consisting of water, oil, and air can be used as the input of the MFP system at required flowrates. The test area of the facility consists of pipelines with different bore sizes and geometries and a gas and liquid separator at the top of a high platform. The mixtures are separated in the three-phase separator at the ground level. According to the system design, the pressure in the three-phase separator is selected as the output variable of the soft sensor model, and 16 related variables in the MFP are selected as input variables, which are listed in Table I with the corresponding tags and units. The sampling period of all the variables in the MFP system is 1 s.

[TABLE I: Description of the Input Variables for the MFP System [49]. (Table contents not reproduced here.)]

The MFP is an industrial system working under varying operational conditions [49]. Typically, the two set points of the process, i.e., the airflow rate and the waterflow rate, can be tuned to generate different steady operating conditions. Thus, this process provides a suitable case for evaluating the soft sensing performance of the developed DPTR framework. Among the working conditions, considering the appropriate number of training samples, two conditions are selected as the source and target domain, respectively. The details of the designed task are given in Table II. n_S = 1000 labeled samples collected from the source operating condition are available and can be used for knowledge transfer. There are 700 samples collected from the target operating condition, among which the first 200 samples are utilized for training and the remaining 500 samples are utilized as testing data.

[TABLE II: Description of the Two Operating Conditions for the MFP Plant. (Table contents not reproduced here.)]

Besides, to assess the performance of the proposed DPTR-MD, we consider three levels of missing data ratio in the training samples and online testing samples in the target operating condition. Specifically, 10%, 30%, and 50% of the samples in the target dataset are corrupted. For each corrupted sample, three randomly selected variables are missed and replaced by random noises sampled from N(0, 1). The three missing levels are denoted as the lightly, mediumly, and heavily missing cases in this article.
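The corruption scheme just described can be reproduced by a few lines of NumPy; the function below is an illustrative sketch of the setup, not the authors' released code:

import numpy as np

def corrupt(X, sample_ratio, n_missing=3, seed=0):
    """Corrupt a target dataset: a given ratio of samples each has three
    randomly selected variables replaced by N(0, 1) random noise."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    n_samples, n_vars = X.shape
    rows = rng.choice(n_samples, size=int(sample_ratio * n_samples), replace=False)
    for i in rows:
        cols = rng.choice(n_vars, size=n_missing, replace=False)
        X[i, cols] = rng.standard_normal(n_missing)
    return X

# Lightly, mediumly, and heavily missing cases:
# X_10, X_30, X_50 = corrupt(X, 0.1), corrupt(X, 0.3), corrupt(X, 0.5)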



[Fig. 3. Diagram of the multiphase flow facility [49].]

B. Experimental Setup

To assess the performance of the developed DPTR framework, two comparison experiments are designed using complete data and missing data, respectively.

For the first experiment, with complete source- and target-domain datasets for training, two groups of comparison methods are designed to fully explore the advantage of the DPTR. First, we select the conventional methods with no TL capacity. The VAE [28] and a multilayer NN are combined as the baseline structure, in which the VAE model is utilized to extract feature representations from the inputs, and then the NN model is further trained for prediction. Specifically, as both datasets are available for training, the model trained with target-domain data only (VAE-T) and the model trained with combined source- and target-domain data (VAE-C) are designed as comparing methods. Besides, the recently proposed VAE-regression (VAE-R) method [50], which uses both the source- and target-domain data for training, is used as a comparing method. The second group includes methods with TL capacity. Specifically, to verify the efficacy of the designed probabilistic modeling mechanism of the proposed DPTR, this group includes two deterministic TL approaches, i.e., the recently proposed DAELM [20] and an AE-DAT model built by integrating the deterministic AE, a multilayer NN, and the deterministic version of DAT [38]. For fair comparisons, the basic network structures of the probabilistic deep models, including VAE-T, VAE-C, VAE-R, and the proposed DPTR, are set to be the same. The EncoderNet and PriorNet structures for the DPTR are {17, 8, 4} and {16, 8, 4}, respectively. The DecoderNet and AdversaryNet structure of the DPTR is {4, 1}. The decoder structure of the plain VAE models is {4, 16}. The structures of the AE and DAT model in AE-DAT are {16, 8, 4, 8, 16} and {4, 1}, respectively. The additional NN structure in VAE-T, VAE-C, and AE-DAT is {4, 8, 10, 1}. The main challenge in adversarial training is to balance the competing components of the network [17]. In the experiments, equal weights are used on the two losses in the proposed DPTR and AE-DAT to verify the effectiveness. Other sophisticated weighting schemes for the domain adversary loss have also been reported recently [38]. To train the deep models, the Adam optimizer with a 0.0001 weight decay value is used. The learning rate is selected from {0.01, 0.001}. The rectified linear unit (ReLU) activation function is used in the hidden layers. The number of hidden nodes of DAELM is searched from {4, 8, 12}. The ridge parameters λ_S and λ_T of DAELM are selected from {10⁻¹, 10⁻², 10⁻⁴, 10⁻⁸}.

For the second experiment, with missing data, following the existing works [3], [48], three common solutions, including data deletion, mean imputation, and VAE imputation, are compared. Specifically, these methods are utilized to handle the missing data first, and then the soft sensor performance using the proposed DPTR is compared with that of the proposed DPTR-MD at the three missing levels.
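To make the reported layer structures concrete, the following sketch instantiates them as fully connected stacks with ReLU hidden activations, as stated above. It is illustrative only; in particular, it omits the separate mean/log-variance output heads that the probabilistic EncoderNet and PriorNet would additionally need.

import torch.nn as nn

def mlp(sizes, out_activation=None):
    """Fully connected stack with ReLU hidden activations, e.g., mlp([16, 8, 4])."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    if out_activation is not None:
        layers.append(out_activation)
    return nn.Sequential(*layers)

# Structures reported above (16 process variables, 1 output):
encoder_net   = mlp([17, 8, 4])   # q_phi(z|x, y): 16 inputs plus the label y
prior_net     = mlp([16, 8, 4])   # p_theta(z|x)
decoder_net   = mlp([4, 1])       # p_theta(y|z)
adversary_net = mlp([4, 1], out_activation=nn.Sigmoid())  # G_A: probability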


[TABLE III: Prediction Performance for the Six Methods. (Table contents not reproduced here.)]

[Fig. 4. Predictions and errors of VAE-T.]
[Fig. 5. Predictions and errors of VAE-C.]
[Fig. 6. Predictions of VAE-R.]
[Fig. 7. Predictions and errors of DAELM.]

C. Experimental Results and Discussion

The prediction performance on the testing dataset for the complete data experiment is presented in Table III, in which the evaluation performance based on at least three runs of experiments is reported. There are several observations from Table III. First, in comparison with VAE-T, the other methods show a significantly better performance, due to the incorporation of the source-domain training data. Second, although both the source- and target-domain datasets are utilized for learning the DAELM model, its shallow structure makes it less competitive in learning complex feature representations and regression relationships compared with VAE-C, VAE-R, and DPTR. Third, among the deep methods trained with both source and target data, the AE-DAT shows inferior performance in comparison with the probabilistic deep methods. Also, due to its separated learning of representations and regressor, it is not very competitive compared with DAELM. Then, benefiting from capturing the uncertainty distribution and simultaneously learning the regressed latent vector and the reconstructed inputs, the VAE-R shows the second-best performance among the methods, illustrating the effectiveness of deep generative models in soft sensor modeling. Finally, in comparison with the other methods, the proposed DPTR shows clear improvement in both the RMSE and MAE metrics. The potential reason is twofold.

1) The discrepancy between the source and target probabilistic feature spaces, rather than between deterministic data points, is minimized, endowing the DPTR with probabilistic transfer capacity.

2) Both the plain AE- and plain VAE-based models learn the feature space without considering the relationship to the output, while the latent distribution and the regression relationship are simultaneously learned in the DPTR.

For more comparisons, the prediction results and errors for VAE-T, VAE-C, VAE-R, DAELM, AE-DAT, and the proposed DPTR are shown in Figs. 4–9. Generally, VAE-T shows an inferior ability to track the real output curve. Besides, large deviations are found in VAE-C, VAE-R, DAELM, and AE-DAT. For the DPTR, the prediction errors of most of the testing sample points are kept within (−0.0005, 0.0005). Hence, we can conclude that the DGRM and the probabilistic latent space transfer have improved the soft sensing performance.

Besides, an ablation study is conducted to show the effectiveness of the designed MC-DAT. Specifically, the MC-DAT loss in (15) is discarded and only the DGRM loss is retained. The prediction performance and errors are shown in Fig. 10. As shown in Fig. 10, larger errors can be found in the 300th–400th samples in comparison with the performance of the DPTR shown in Fig. 9. Specifically, the RMSE and MAE of the ablated method, i.e., the DGRM, are 1.737 × 10⁻⁴ and 1.405 × 10⁻⁴, respectively. The addition of the MC-DAT further reduces the errors of the DGRM, showing the significance of the probabilistic latent space transfer.

In the second experiment, the efficacy of the proposed DPTR-MD is evaluated for the case with missing data. In this section, we select the proposed DPTR as the base soft sensor model. Then, the data deletion, mean imputation, and VAE imputation methods, which are common solutions for dealing with missing data, are compared with the proposed DPTR-MD. These comparing methods are first utilized to handle the missing values in the target data, and then the DPTR trained with complete data is applied for building the soft sensor model. The comparison results evaluated by RMSE and MAE are shown in Fig. 11. From the results, first, the data deletion method shows the worst performance among all the compared methods. This can be attributed to the insufficient mining and use of the target training data. Also, note that the deletion method cannot handle the online missing values, resulting in further inferior testing performance. Besides, it can be found that when the missing ratio is light, the mean imputation and VAE imputation show similar performances.


[Fig. 8. Predictions and errors of AE-DAT.]
[Fig. 9. Predictions and errors of DPTR.]
[Fig. 10. Predictions and errors of DGRM.]
[Fig. 11. Performance comparisons of the data deletion, mean imputation, VAE imputation, and DPTR-MD methods evaluated by (a) RMSE and (b) MAE.]
[Fig. 12. (a) Visualization of raw data. (b) Feature visualization of the proposed method. (c) Convergence procedure of the proposed method.]
[Fig. 13. Correlations of the real and the predicted values for the four methods.]

A potential reason is that only 10% of the data are corrupted in this case, and the statistics of the remaining training data are sufficient both to represent the overall data and to train a VAE model. For the proposed DPTR-MD, as both the source and target data can be utilized for learning the data reconstruction, its soft sensing performance is generally better than those of the VAE and mean imputation methods.

Moreover, the heavily missing data case is illustrated to present more details of the proposed method. First, the imputation errors are evaluated by the RMSE metric, which are 1.949 ± 0.141, 1.973 ± 0.094, and 2.189 ± 0.091 for DPTR-MD, VAE imputation, and mean imputation, respectively. The results demonstrate that the proposed method can better impute the missing values than the other two methods. The VAE method shows imputation improvement in comparison with the mean imputation method, while the soft sensing performance improvement is not very significant. Then, the feature visualizations aided by t-SNE [51] of the raw complete data and the probabilistic feature representations provided by the proposed method are shown in Fig. 12(a) and (b), which demonstrates that the proposed method has reduced the distribution gap between the two domains' data. Fig. 12(c) shows that the test RMSE of the soft sensor gradually decreases with the progress of the training epochs. Then, the prediction performance comparison on the soft sensing output is shown in Fig. 13, and part of the testing samples is selected for a clearer view. It can be seen that most predictions of the proposed DPTR-MD are located around the diagonal line and exhibit lower errors and variances, demonstrating the soft sensing effectiveness of the DPTR-MD method.
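A feature visualization in the style of Fig. 12(a) and (b) can be produced with off-the-shelf t-SNE [51]. The following is a minimal sketch assuming the latent features of the two domains are available as arrays:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_features(z_source, z_target):
    """2-D t-SNE view of source/target features; a reduced distribution
    gap shows up as overlapping point clouds, as in Fig. 12(b)."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(
        np.vstack([z_source, z_target]))
    n_s = len(z_source)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], s=8, label="source")
    plt.scatter(emb[n_s:, 0], emb[n_s:, 1], s=8, label="target")
    plt.legend()
    plt.show()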


V. CONCLUSION

In this article, considering the distribution discrepancy coupled with the missing data problem in industrial soft sensing, a DPTR framework has been developed to transfer the knowledge from a source operating condition to the target. The framework consists of a DGRM module and an MC-DAT module. The DGRM can sufficiently encode the source- and target-domain data collected under different operating conditions to a Gaussian latent representation space under the SGVB framework. Meanwhile, the MC-DAT is used to drive the source and target probabilistic latent features to be transferrable. After that, the DPTR is further extended to the case where missing data exist in the target-domain dataset. Thus, reconstruction and prediction knowledge learned from the source operating condition with sufficient data can be transferred to the target operating condition. The overall framework is jointly trained in an end-to-end fashion to transfer the model knowledge and abstract representative probabilistic features. The application on an industrial multiphase flow dataset demonstrates that the DPTR method is superior to traditional deterministic TL methods such as DAELM and probabilistic methods such as the VAE trained on a combination of datasets. For the missing data problem, the effectiveness of the DPTR-MD is verified through different missing levels. Considering the multiple historical operating conditions, future work can extend the deep probabilistic TL method to fully explore soft sensors across multiple domains. Besides, exploring other optimization strategies and more state-of-the-art discrepancy metrics [52] would be an interesting topic, which deserves deep investigation in the future.

REFERENCES

[1] P. Zhou, D. Guo, H. Wang, and T. Chai, "Data-driven robust M-LS-SVR-based NARX modeling for estimation and control of molten iron quality indices in blast furnace ironmaking," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 9, pp. 4007–4021, Sep. 2018.
[2] B. Huang, Y. Qi, and A. K. M. M. Murshed, Dynamic Modeling and Predictive Control in Solid Oxide Fuel Cells: First Principle and Data-based Approaches. Hoboken, NJ, USA: Wiley, 2013.
[3] P. Kadlec, B. Gabrys, and S. Strandt, "Data-driven soft sensors in the process industry," Comput. Chem. Eng., vol. 33, no. 4, pp. 795–814, Apr. 2009.
[4] C. Zhao, "A quality-relevant sequential phase partition approach for regression modeling and quality prediction analysis in manufacturing processes," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 4, pp. 983–991, Oct. 2014.
[5] Y. Liu, C. Yang, Z. Gao, and Y. Yao, "Ensemble deep kernel learning with application to quality prediction in industrial polymerization processes," Chemometric Intell. Lab. Syst., vol. 174, pp. 15–21, Mar. 2018.
[6] X. Yuan, J. Zhou, B. Huang, Y. Wang, C. Yang, and W. Gui, "Hierarchical quality-relevant feature representation for soft sensor modeling: A novel deep learning strategy," IEEE Trans. Ind. Informat., vol. 16, no. 6, pp. 3721–3730, Jun. 2020.
[7] C. Zhao, F. Wang, Z. Mao, N. Lu, and M. Jia, "Quality prediction based on phase-specific average trajectory for batch processes," AIChE J., vol. 54, no. 3, pp. 693–705, Mar. 2008.
[8] C. Shang, X. Gao, F. Yang, and D. Huang, "Novel Bayesian framework for dynamic soft sensor based on support vector machine with finite impulse response," IEEE Trans. Control Syst. Technol., vol. 22, no. 4, pp. 1550–1557, Jul. 2014.
[9] J. Corrigan and J. Zhang, "Integrating dynamic slow feature analysis with neural networks for enhancing soft sensor performance," Comput. Chem. Eng., vol. 139, Aug. 2020, Art. no. 106842.
[10] C. Zhao, W. Wang, C. Tian, and Y. Sun, "Fine-scale modelling and monitoring of wide-range nonstationary batch processes with dynamic analytics," IEEE Trans. Ind. Electron., early access, Jul. 21, 2020, doi: 10.1109/TIE.2020.3009564.
[11] W. Lu, B. Liang, Y. Cheng, D. Meng, J. Yang, and T. Zhang, "Deep model based domain adaptation for fault diagnosis," IEEE Trans. Ind. Electron., vol. 64, no. 3, pp. 2296–2305, Mar. 2017.
[12] Z. Chai and C. Zhao, "A fine-grained adversarial network method for cross-domain industrial fault diagnosis," IEEE Trans. Autom. Sci. Eng., vol. 17, no. 3, pp. 1432–1442, Jul. 2020.
[13] C. Zhao, J. Chen, and H. Jing, "Condition-driven data analytics and monitoring for wide-range nonstationary and transient continuous processes," IEEE Trans. Autom. Sci. Eng., early access, Aug. 4, 2021, doi: 10.1109/TASE.2020.3010536.
[14] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[15] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[16] J. Li, M. Jing, K. Lu, L. Zhu, and H. T. Shen, "Locality preserving joint transfer for domain adaptation," IEEE Trans. Image Process., vol. 28, no. 12, pp. 6103–6115, Dec. 2019.
[17] F. Alam, S. Joty, and M. Imran, "Domain adaptation with adversarial training and graph embeddings," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1077–1087.
[18] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, and X. Chen, "Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing," IEEE Trans. Ind. Informat., vol. 15, no. 4, pp. 2416–2425, Apr. 2019.
[19] Z. Chai, C. Zhao, and B. Huang, "Multisource-refined transfer network for industrial fault diagnosis under domain and category inconsistencies," IEEE Trans. Cybern., early access, May 25, 2021, doi: 10.1109/TCYB.2021.3067786.
[20] Y. Liu, C. Yang, K. Liu, B. Chen, and Y. Yao, "Domain adaptation transfer learning soft sensor for product quality prediction," Chemometric Intell. Lab. Syst., vol. 192, Sep. 2019, Art. no. 103813.
[21] J. Wang and C. Zhao, "Mode-cloud data analytics based transfer learning for soft sensor of manufacturing industry with incremental learning ability," Control Eng. Pract., vol. 98, May 2020, Art. no. 104392.
[22] Y. Liu, C. Yang, M. Zhang, Y. Dai, and Y. Yao, "Development of adversarial transfer learning soft sensor for multigrade processes," Ind. Eng. Chem. Res., vol. 59, no. 37, pp. 16330–16345, Aug. 2020.
[23] X. Yuan, C. Ou, Y. Wang, C. Yang, and W. Gui, "A layer-wise data augmentation strategy for deep learning networks and its soft sensor application in an industrial hydrocracking process," IEEE Trans. Neural Netw. Learn. Syst., early access, Dec. 13, 2020, doi: 10.1109/TNNLS.2019.2951708.
[24] W. Yu and C. Zhao, "Low-rank characteristic and temporal correlation analytics for incipient industrial fault detection with missing data," IEEE Trans. Ind. Informat., early access, Apr. 27, 2020, doi: 10.1109/TII.2020.2990975.
[25] S. Dray and J. Josse, "Principal component analysis with missing values: A comparative survey of methods," Plant Ecol., vol. 216, no. 5, pp. 657–667, May 2015.
[26] V. Miranda, J. Krstulovic, H. Keko, C. Moreira, and J. Pereira, "Reconstructing missing data in state estimation with autoencoders," IEEE Trans. Power Syst., vol. 27, no. 2, pp. 604–611, May 2012.
[27] R. Xie, N. M. Jan, K. Hao, L. Chen, and B. Huang, "Supervised variational autoencoders for soft sensor modeling with missing data," IEEE Trans. Ind. Informat., vol. 16, no. 4, pp. 2820–2828, Apr. 2020.
[28] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–14.
[29] R. D. Camino, C. A. Hammerschmidt, and R. State, "Improving missing data imputation with deep generative models," 2019, arXiv:1902.10666. [Online]. Available: http://arxiv.org/abs/1902.10666
[30] B. Shen, L. Yao, and Z. Ge, "Nonlinear probabilistic latent variable regression models for soft sensor application: From shallow to deep structure," Control Eng. Pract., vol. 94, Jan. 2020, Art. no. 104198.
[31] Y. Qin, C. Zhao, and B. Huang, "A new soft-sensor algorithm with concurrent consideration of slowness and quality interpretation for dynamic chemical process," Chem. Eng. Sci., vol. 199, no. 18, pp. 28–39, May 2019.
[32] W. Yan, D. Tang, and Y. Lin, "A data-driven soft sensor modeling method based on deep learning and its application," IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4237–4245, May 2017.
[33] L. Feng, C. Zhao, and Y. Sun, "Dual attention-based encoder-decoder: A customized sequence-to-sequence learning for soft sensor development," IEEE Trans. Neural Netw. Learn. Syst., early access, Aug. 24, 2020, doi: 10.1109/TNNLS.2020.3015929.
[34] X. Yuan, L. Li, Y. A. W. Shardt, Y. Wang, and C. Yang, "Deep learning with spatiotemporal attention-based LSTM for industrial soft sensor model development," IEEE Trans. Ind. Electron., vol. 68, no. 5, pp. 4404–4414, May 2021.
[35] J. Li, K. Lu, Z. Huang, L. Zhu, and H. T. Shen, "Heterogeneous domain adaptation through progressive alignment," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1381–1391, May 2019.
[36] J. Li, M. Jing, H. Su, K. Lu, L. Zhu, and H. T. Shen, "Faster domain adaptation networks," IEEE Trans. Knowl. Data Eng., early access, Feb. 19, 2021, doi: 10.1109/TKDE.2021.3060473.
[37] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, "A kernel method for the two-sample problem," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 513–520.
[38] Y. Ganin et al., "Domain-adversarial training of neural networks," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2030–2096, 2016.
[39] M. Long, Y. Cao, J. Wang, and M. Jordan, "Learning transferable features with deep adaptation networks," in Proc. Int. Conf. Mach. Learn., 2015, pp. 97–105.
[40] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[41] X. Li, W. Zhang, Q. Ding, and X. Li, "Diagnosing rotating machines with weakly supervised data using deep transfer learning," IEEE Trans. Ind. Informat., vol. 16, no. 3, pp. 1688–1697, Mar. 2020.
[42] X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2Image: Conditional image generation from visual attributes," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 776–791.
[43] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3483–3491.
[44] C. Du, B. Chen, B. Xu, D. Guo, and H. Liu, "Factorized discriminative conditional variational auto-encoder for radar HRRP target recognition," Signal Process., vol. 158, pp. 176–189, May 2019.
[45] G. Pandey and A. Dukkipati, "Variational methods for conditional multimodal deep learning," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 308–315.
[46] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 601–608.
[47] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, "A theory of learning from different domains," Mach. Learn., vol. 79, nos. 1–2, pp. 151–175, May 2010.
[48] J. T. McCoy, S. Kroon, and L. Auret, "Variational autoencoders for missing data imputation with application to a simulated milling circuit," IFAC-PapersOnLine, vol. 51, no. 21, pp. 141–146, 2018.
[49] C. Ruiz-Cárcel, Y. Cao, D. Mba, L. Lao, and R. T. Samuel, "Statistical process monitoring of a multiphase flow facility," Control Eng. Pract., vol. 42, pp. 74–88, Sep. 2015.
[50] Y. Yoo, S. Yun, H. J. Chang, Y. Demiris, and J. Y. Choi, "Variational autoencoded regression: High dimensional regression of visual data on complex manifold," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3674–3683.
[51] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.
[52] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, and H. T. Shen, "Maximum density divergence for domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 28, 2020, doi: 10.1109/TPAMI.2020.2991050.
Zheng Chai received the B.Eng. degree in automation from the College of Automation, Harbin Engineering University, Harbin, China, in 2017. He is currently pursuing the Ph.D. degree in control science and engineering with the College of Control Science and Engineering, Zhejiang University, Hangzhou, China.
He was a Visiting Scholar with the Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB, Canada, from 2019 to 2020. His current research interests include deep learning and its industrial applications.

Chunhui Zhao (Senior Member, IEEE) received the Ph.D. degree from Northeastern University, Shenyang, China, in 2009.
From 2009 to 2012, she was a Post-Doctoral Fellow with The Hong Kong University of Science and Technology, Hong Kong, and the University of California at Santa Barbara, Santa Barbara, CA, USA. Since January 2012, she has been a Professor with the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. Her research interests include statistical machine learning and data mining for industrial applications. She has authored or coauthored more than 140 articles in peer-reviewed international journals.
Dr. Zhao has served as a Senior Editor of the Journal of Process Control and an Associate Editor of two international journals, Control Engineering Practice and Neurocomputing.

Biao Huang (Fellow, IEEE) received the B.Sc. and M.Sc. degrees in automatic control from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1983 and 1986, respectively, and the Ph.D. degree in process control from the University of Alberta, Edmonton, AB, Canada, in 1997.
In 1997, he joined the Department of Chemical and Materials Engineering, University of Alberta, as an Assistant Professor, where he is currently a Professor and the NSERC Industrial Research Chair in Control of Oil Sands Processes. He has applied his expertise extensively in industrial practice. His current research interests include process control, system identification, control performance assessment, Bayesian methods, and state estimation.
Dr. Huang is a fellow of the Canadian Academy of Engineering and the Chemical Institute of Canada. He was a recipient of Germany's Alexander von Humboldt Research Fellowship, the Canadian Chemical Engineering Society's Syncrude Canada Innovation and D. G. Fisher Awards, the APEGA Summit Research Excellence Award, the University of Alberta McCalla and Killam Professorship Awards, the Petro-Canada Young Innovator Award, and the Best Paper Award from the Journal of Process Control.

Hongtian Chen (Member, IEEE) received the B.S. and M.S. degrees from the School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing, China, in 2012 and 2015, respectively, and the Ph.D. degree from the College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, in 2019.
He was a Visiting Scholar with the Institute for Automatic Control and Complex Systems, University of Duisburg-Essen, Duisburg, Germany, in 2018. He is currently a Post-Doctoral Fellow with the Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB, Canada. His research interests include process monitoring and fault diagnosis, data mining and analytics, and machine learning, as well as their applications in high-speed trains, new energy systems, and industrial processes.
Dr. Chen was a recipient of the Grand Prize of the Innovation Award of the Ministry of Industry and Information Technology of the People's Republic of China in 2019, the Excellent Ph.D. Thesis Award of Jiangsu Province in 2020, and the Excellent Doctoral Dissertation Award from the Chinese Association of Automation (CAA) in 2020.