…and their high-level features are introduced into the model, the fuzzy RBMs demonstrate competitive performance in both data representation capability and robustness in coping with noise. Similar merits can also be found in fuzzy regression [19] and the fuzzy support vector machine (SVM) [20]. Fuzzy RBMs, which are likewise designed to boost the development of deep learning from the building component (RBMs) of deep networks, have never been introduced before. The stochastic gradient descent method integrated with the Markov chain Monte Carlo (MCMC) approach is employed to train the proposed fuzzy RBMs; this kind of learning method is commonly used in training RBMs and has proved to be very efficient [12], [21], [22]. Other learning approaches, such as the Bayesian estimation methods of [23] and [24], have also been developed to learn RBMs. Gaussian RBMs, conditional RBMs, temporal RBMs, and recurrent RBMs are obtained by modifying the structure of RBMs; fuzzy RBMs, alternatively, are proposed from a different perspective, that of extending the relationships between visible and hidden units. Therefore, fuzzy RBMs can also be further developed by taking other variants of RBMs into consideration.

The rest of the paper is organized as follows. In Section II, the preliminaries about fuzzy sets, fuzzy functions, and their notation are presented. The proposed FRBM and its learning algorithm are introduced in Section III. After that, the performance of the FRBM model is verified by experiments on bar-and-stripe (BAS) benchmark inpainting and MNIST handwritten digit classification in Section IV. Finally, conclusions and remarks are drawn in Section V.

Fig. 2. Symmetric triangular fuzzy number.

B. Alpha-Cuts

If A is a fuzzy set, the α-cut of A, denoted by A[α], is defined as

A[α] = {x ∈ Ω | A(x) ≥ α}    (2)

where 0 < α ≤ 1.

C. Fuzzy Function

The fuzzy function f̄, which is extended from the real-valued function f: Y = f(x, W), is defined by

Ȳ = f̄(x, W̄)    (3)

where Ȳ is the dependent fuzzy output set, and W and W̄ are the parameters of the two functions [26].

1) Extension Principle: The membership function deduced from the extension principle can be expressed as

Ȳ(y) = sup_W {min(W̄_1(W_1), …, W̄_n(W_n)) | f(x, W) = y}    (4)

where W = (W_1, …, W_n)ᵀ and W̄ = (W̄_1, …, W̄_n)ᵀ.
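As a quick illustration of (4) (ours, not the paper's), the supremum can be approximated by brute force when each parameter is restricted to a finite grid; all names below are our own, and the grid search is purely didactic:

import itertools
import numpy as np

def extension_principle_membership(f, x, memberships, grids, y, tol=1e-6):
    # Y(y) = sup over W of min_i membership_i(W_i) subject to f(x, W) = y, cf. (4).
    # memberships: one membership function per parameter W_i;
    # grids: finite candidate values for each W_i (a discretization we assume).
    best = 0.0
    for W in itertools.product(*grids):
        if abs(f(x, np.array(W)) - y) < tol:
            best = max(best, min(m(w) for m, w in zip(memberships, W)))
    return best

The tolerance tol relaxes the equality constraint f(x, W) = y so that the discrete search is meaningful.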
2) Alpha-Cuts of Ȳ: If f is continuous, the α-cut of Ȳ, i.e., Ȳ[α] = [Y_1(α), Y_2(α)], has the following expression:

Y_1(α) = min{Y(W, x) | W ∈ W̄[α]}
Y_2(α) = max{Y(W, x) | W ∈ W̄[α]}    (5)

D. Interval Arithmetic

For two intervals [a, b] and [c, d], which are subsets of the real domain, the fundamental operations of interval arithmetic [27] are defined as follows:

[a, b] + [c, d] = [a + c, b + d]
[a, b] − [c, d] = [a − d, b − c]
[a, b] × [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)]
[a, b] ÷ [c, d] = [min(a/c, a/d, b/c, b/d), max(a/c, a/d, b/c, b/d)],   0 ∉ [c, d].    (6)
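For concreteness, the following minimal Python sketch (ours, not the paper's; class and function names are illustrative choices) implements a symmetric triangular fuzzy number with the α-cut of (2), plus the interval operations of (6):

class TriangularFuzzyNumber:
    # Symmetric triangular fuzzy number with center m and half-width s > 0.
    def __init__(self, m, s):
        self.m, self.s = m, s

    def membership(self, x):
        # 1 at the center, decaying linearly to 0 at m - s and m + s.
        return max(0.0, 1.0 - abs(x - self.m) / self.s)

    def alpha_cut(self, alpha):
        # A[alpha] = {x | A(x) >= alpha}, 0 < alpha <= 1, is a closed interval.
        assert 0.0 < alpha <= 1.0
        half = (1.0 - alpha) * self.s
        return (self.m - half, self.m + half)

def interval_add(i, j):
    (a, b), (c, d) = i, j
    return (a + c, b + d)

def interval_sub(i, j):
    (a, b), (c, d) = i, j
    return (a - d, b - c)

def interval_mul(i, j):
    (a, b), (c, d) = i, j
    p = (a * c, a * d, b * c, b * d)
    return (min(p), max(p))

def interval_div(i, j):
    (a, b), (c, d) = i, j
    assert not (c <= 0.0 <= d), "0 must not lie in the divisor interval"
    q = (a / c, a / d, b / c, b / d)
    return (min(q), max(q))

For example, TriangularFuzzyNumber(0.0, 1.0).alpha_cut(0.5) returns (-0.5, 0.5), and interval_add((1, 2), (3, 5)) returns (4, 7).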
Fig. 3. Fuzzy restricted Boltzmann machine (FRBM).

III. FUZZY-RESTRICTED BOLTZMANN MACHINE AND ITS LEARNING ALGORITHM

A. Fuzzy-Restricted Boltzmann Machine

The proposed FRBM is illustrated in Fig. 3, in which the connection weights and biases are fuzzy parameters denoted by θ̄. The FRBM model has several merits. The first is that the FRBM has a much better representation capability than the regular RBM in modeling probabilities over visible and hidden units; specifically, the RBM is only a special case of the FRBM in which no fuzziness exists. The second is that the robustness of the FRBM surpasses that of the RBM: the FRBM is more robust when the model is fitted to noisy data. These advantages spring from the fuzzy extension of the relationships between cross-layer variables, and inherit the characteristics of fuzzy models.

Since the FRBM is an extension of the RBM model, the discussion starts with a brief introduction of the RBM. An RBM is an energy-based probabilistic model, in which the probability distribution is defined through an energy function:

P(x, h, θ) = e^(−E(x, h, θ)) / Z    (7)

Z = Σ_{x̃, h̃} e^(−E(x̃, h̃, θ))    (8)

where E(x, h, θ) is the energy function, θ are the parameters governing the model, Z is the normalizing factor called the partition function, and x̃ and h̃ are two vector variables representing visible and hidden units that are used to traverse all the configurations of units on the graph. The energy function of the RBM is defined by

E(x, h, θ) = −bᵀx − cᵀh − hᵀWx    (9)

where b_j and c_i are the offsets, W_ij is the connection weight between the jth visible unit and the ith hidden unit, and θ = {b, c, W}.
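To make (7)-(9) concrete, here is a toy sketch (our own names; binary units assumed) that evaluates the energy and the partition function by exhaustive enumeration, which is feasible only for very small models:

import itertools
import numpy as np

def energy(x, h, b, c, W):
    # E(x, h, theta) = -b^T x - c^T h - h^T W x, as in (9).
    return -(b @ x) - (c @ h) - (h @ W @ x)

def partition_function(b, c, W):
    # Z of (8): sum exp(-E) over every binary configuration of x and h.
    # Exponential cost, so this is purely illustrative.
    nv, nh = len(b), len(c)
    return sum(
        np.exp(-energy(np.array(x), np.array(h), b, c, W))
        for x in itertools.product([0, 1], repeat=nv)
        for h in itertools.product([0, 1], repeat=nh))

The probability of a configuration is then exp(-energy(x, h, b, c, W)) / partition_function(b, c, W), as in (7).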
To establish the FRBM, it is necessary to first define the fuzzy energy function for the model. The fuzzy energy function can be extended from (9) in accordance with the extension principle as follows:

Ē(x, h, θ̄) = −b̄ᵀx − c̄ᵀh − hᵀW̄x    (10)

where Ē(x, h, θ̄) is the fuzzified energy function and θ̄ = {b̄, c̄, W̄} are the fuzzy parameters. Correspondingly, the fuzzy free energy F̄, which marginalizes the hidden units and maps (7) into a simpler form, is deduced as

F̄(x, θ̄) = −log Σ_h̃ e^(−Ē(x, h̃, θ̄))    (11)

If the fuzzy free energy function is directly employed to define the probability, it leads to a fuzzy probability [27], and the optimization in the learning process turns into a fuzzy maximum likelihood problem. However, this kind of problem is quite intractable, because the fuzzy objective function is nonlinear and the membership function is difficult to compute: the computation of its alpha-cuts becomes an NP-hard problem [29]. Therefore, it is necessary to transform the problem into a regular maximum likelihood problem by defuzzifying the fuzzy free energy function (11). The center-of-area (centroid) method [30] is employed to defuzzify the fuzzy free energy function F̄(x). Then, the likelihood function can be defined through the defuzzified fuzzy free energy function. Consequently, the fuzzy optimization problem becomes a real-valued problem, and conventional optimization approaches can be directly applied to find the optimal solutions. The centroid of the fuzzy number F̄(x) is denoted by F_c(x) and has the following form:

F_c(x, θ) = ∫ θ F̄(x, θ) dθ / ∫ F̄(x, θ) dθ,   θ ∈ θ̄.    (13)

Naturally, after the fuzzy free energy is defuzzified, the probability can be defined as

P_c(x, θ) = e^(−F_c(x, θ)) / Z,   Z = Σ_x̃ e^(−F_c(x̃, θ)).    (14)

In the fuzzy RBM model, the objective function is the negative log-likelihood, which is given by

L(θ, D) = −Σ_{x∈D} log P_c(x, θ)    (15)

where D is the training dataset. The learning problem is to find optimal solutions for the parameters θ that minimize the objective function L(θ, D), i.e.,

min_θ L(θ, D).    (16)

In the following section, the detailed procedure to address the dual problem of maximum likelihood by utilizing the stochastic gradient descent method is investigated.

B. Fuzzy-Restricted Boltzmann Machine Learning Algorithm

In order to solve the optimization problem (16), it is required to first carry out the defuzzification of the fuzzy free energy function in some viable way. However, it is infeasible to defuzzify the fuzzy free energy function by using (13), which involves integrals. Alternatively, the centroid is calculated by employing a discrete form, which is associated with a number of
alpha-cuts of the fuzzy function. Therefore, the alpha-cuts of the fuzzy free energy function and interval arithmetic are first investigated to obtain an approximation of the centroid.

1) Alpha-Cuts of the Fuzzy Free Energy Function: As assumed, θ̄ is a vector of symmetric triangular fuzzy numbers, and its α-cut is θ̄[α] = [θ_L, θ_R], where θ_L and θ_R are the lower and upper bounds of the interval with respect to α, respectively. F̄(x, θ̄) is often a triangular-shaped fuzzy number for nonlinear functions [27]. However, the fuzzy free energy is a monotonically decreasing function with respect to the parameters θ when x and h are nonnegative. Therefore, according to interval arithmetic, the α-cut of F̄(x, θ̄) can be given by

F̄(x, θ̄)[α] = F(x, θ̄[α]) = [F(x, θ_R), F(x, θ_L)].    (17)
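Because of this monotonicity, the α-cut (17) requires only the two endpoint evaluations; a one-function sketch (ours), where free_energy is any crisp routine such as (31) below:

def free_energy_alpha_cut(free_energy, x, theta_L, theta_R):
    # F is monotonically decreasing in theta, so the endpoints swap, cf. (17).
    return (free_energy(x, theta_R), free_energy(x, theta_L))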
2) Approximation of the Centroid: An approximation of the centroid of the fuzzy free energy function is obtained by discretizing the fuzzy output F̄(x, θ̄) and calculating M of its α-cuts. Combining these α-cuts, the approximate centroid is given by

F_c(x, θ) ≈ Σ_{i=1}^{M} α_i [F(x, θ_i^L) + F(x, θ_i^R)] / (2 Σ_{i=1}^{M} α_i)    (18)

where α = (α_1, …, α_M), α ∈ [0, 1]^M, and θ̄[α_i] = [θ_i^L, θ_i^R].

As all the α-cuts are bounded intervals [31], for convenience we only consider a special case, in which the fuzzy numbers are degraded into intervals (α = 1). Let θ̄ = [θ_L, θ_R]. According to (18), the free energy function can be written as

F_c(x, θ) ≈ (1/2) [F(x, θ_L) + F(x, θ_R)].    (19)

After defuzzifying the fuzzy free energy, the probability defined on it returns to (14). Then, the problem is transformed into a regular optimization problem, which can be solved by the gradient-descent-based stochastic maximum likelihood method.
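A minimal sketch of the discrete centroid (18) and its interval special case (19); free_energy stands for any crisp free-energy routine such as (31) below, and representing the α-cuts as a list of pairs is our assumption:

def centroid_free_energy(free_energy, x, theta_cuts, alphas):
    # Discrete centroid (18): weight the endpoint free energies of each
    # alpha-cut [theta_i^L, theta_i^R] by alpha_i and normalize.
    num = sum(a * (free_energy(x, tL) + free_energy(x, tR))
              for a, (tL, tR) in zip(alphas, theta_cuts))
    return num / (2.0 * sum(alphas))

def centroid_interval_case(free_energy, x, theta_L, theta_R):
    # Special case (19): one interval (alpha = 1), so the centroid is the
    # midpoint of the lower- and upper-bound free energies.
    return 0.5 * (free_energy(x, theta_L) + free_energy(x, theta_R))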
3) Gradient-Related Optimization: The gradients of the negative log-probability with respect to θ_L then have a particularly simple form (see Appendix A):

−∂ log P_c(x, θ)/∂θ_L = ∂F_c(x, θ_L)/∂θ_L − E_P[∂F_c(x, θ_L)/∂θ_L]    (20)

where E_P(·) denotes the expectation over the target probability distribution P. Similarly, the gradient of (16) with respect to θ_R is given by

−∂ log P_c(x, θ)/∂θ_R = ∂F_c(x, θ_R)/∂θ_R − E_P[∂F_c(x, θ_R)/∂θ_R].    (21)

These expectations are intractable to compute exactly, so in practice E_P(·) is approximated by an average over a set N of samples drawn from the target distribution; then, we have

−∂ log P_c(x, θ)/∂θ ≈ ∂F_c(x, θ)/∂θ − (1/|N|) Σ_{x̃∈N} ∂F_c(x̃, θ)/∂θ.    (22)

4) Conditional Probability: For the RBM, the conditional energy-based probabilities [11] are defined as

P(h|x) = e^(−E(x, h)) / Σ_h̃ e^(−E(x, h̃))    (23)

P(x|h) = e^(−E(x, h)) / Σ_x̃ e^(−E(x̃, h)).    (24)

In the commonly studied case of the RBM with binary units, where x_j, h_i ∈ {0, 1}, the probabilistic versions of the usual neuron activation functions can be derived from the conditional probabilities. They have the following forms:

P(h_i = 1|x) = e^(c_i + W_i x) / (1 + e^(c_i + W_i x)) = σ(c_i + W_i x)    (25)

P(x_j = 1|h) = e^(b_j + W_·jᵀ h) / (1 + e^(b_j + W_·jᵀ h)) = σ(b_j + W_·jᵀ h)    (26)

where W_i and W_·j denote the ith row and the jth column of W, respectively, and σ is the logistic sigmoid function

σ(x) = e^x / (e^x + 1) = 1 / (1 + e^(−x)).

For the fuzzy RBM, the conditional probabilities also become fuzzy and can be extended from (25) and (26) as

P̄(h_i = 1|x) = σ(c̄_i + W̄_i x)
P̄(x_j = 1|h) = σ(b̄_j + W̄_·jᵀ h).

After defuzzifying the objective function, the MCMC method can be employed to sample from these conditional distributions. This process is crucial to approximating the objective function, owing to the difficulty of calculating the expectations. For a predefined α, the α-cuts of the fuzzy conditional probabilities are consistent with the target probability distribution. They are given as follows:

P̄(h_i = 1|x)[α] = [P_L(h_i = 1|x), P_R(h_i = 1|x)]
P̄(x_j = 1|h)[α] = [P_L(x_j = 1|h), P_R(x_j = 1|h)]

where P_L(h_i|x), P_R(h_i|x), P_L(x_j|h), and P_R(x_j|h) are the conditional probabilities with respect to the lower and upper bounds of the parameters governing the model. They have the following forms:

P_L(h_i|x) = P(h_i|x; θ_L) = σ(c_i^L + W_i^L x)
P_R(h_i|x) = P(h_i|x; θ_R) = σ(c_i^R + W_i^R x)    (27)
In the FRBM, the parameters to be learned thus include the lower-bound weight W_ij^L, visible bias b_j^L, and hidden bias c_i^L, together with their upper bounds W_ij^R, b_j^R, and c_i^R. For simplicity, the energy function is denoted by a sum of terms each associated with only one hidden unit:

E(x, h) = −μ(x) − Σ_{i=1}^{m} φ_i(x, h_i)    (29)

where

μ(x) = bᵀx,   φ_i(x, h_i) = h_i(c_i + W_i x).    (30)

Then, the free energy of the RBM with binary units can be simplified explicitly to (see Appendix B)

F(x) = −bᵀx − Σ_{i=1}^{m} log(1 + e^(c_i + W_i x)).    (31)
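The closed form (31) is direct to implement; below, log(1 + e^z) is computed with np.logaddexp for numerical stability, an implementation choice of ours rather than something the paper prescribes:

import numpy as np

def free_energy(x, b, c, W):
    # F(x) = -b^T x - sum_i log(1 + exp(c_i + W_i x)), as in (31).
    # np.logaddexp(0, z) returns log(1 + e^z) without overflow.
    return -(b @ x) - np.sum(np.logaddexp(0.0, c + W @ x))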
The gradients of the free energy can be calculated explicitly when the RBM has binary units and the energy function has the form (29).

No matter whether the fuzzy parameters in the fuzzy RBM are symmetric or not, their α-cuts are always intervals with lower and upper bounds that need to be learned in the training phase. According to (19)–(21), all the gradients can be obtained (see details in Appendix C). Combining (20) and (31), it is easy to get the following negative log-likelihood gradients for the fuzzy RBM with binary units:

−∂ log P_c(x)/∂W_ij^L = E_P[P_L(h_i|x) · x_j^L] − P_L(h_i|x) · x_j^L
−∂ log P_c(x)/∂c_i^L = E_P[P_L(h_i|x)] − P_L(h_i|x)
−∂ log P_c(x)/∂b_j^L = E_P[P_L(x_j|h)] − x_j^L
−∂ log P_c(x)/∂W_ij^R = E_P[P_R(h_i|x) · x_j^R] − P_R(h_i|x) · x_j^R
−∂ log P_c(x)/∂c_i^R = E_P[P_R(h_i|x)] − P_R(h_i|x)
−∂ log P_c(x)/∂b_j^R = E_P[P_R(x_j|h)] − x_j^R

where P_c(x) is the centroid probability defined through (14) combined with (18) and (19).
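Reading the list above as a gradient-ascent rule on the log-likelihood, one stochastic update for the lower-bound parameters might look like the following sketch (the learning rate and all names are our assumptions; x_neg is a model sample supplied by the sampler described next, and the upper bounds θ_R are updated symmetrically):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_step_lower(x_pos, x_neg, b_L, c_L, W_L, lr=0.05):
    # Positive phase from the data x_pos; negative phase from a model
    # sample x_neg approximating the expectation E_P in the list above.
    h_pos = sigmoid(c_L + W_L @ x_pos)
    h_neg = sigmoid(c_L + W_L @ x_neg)
    W_L += lr * (np.outer(h_pos, x_pos) - np.outer(h_neg, x_neg))
    b_L += lr * (x_pos - x_neg)
    c_L += lr * (h_pos - h_neg)
    return b_L, c_L, W_L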
5) Contrastive Divergence: To approximate the expectations, samples of P_L(x) and P_R(x) can be obtained by running two Markov chains to convergence, using Gibbs sampling as the transition operator: one chain for the lower bounds and the other for the upper bounds. Gibbs sampling of the joint distribution over N random variables S = (S_1, …, S_N) is done through a sequence of N sampling substeps of the form S_i ∼ P(S_i|S_−i), where S_−i contains the N − 1 other random variables in S, excluding S_i.

For both the RBM and the fuzzy RBM, S consists of the set of visible and hidden units. However, since the units in one layer are conditionally independent given the other layer, one can perform block Gibbs sampling: in each step of the Markov chain, visible units are sampled given the hidden units, and hidden units are sampled given the visible units, as illustrated in Fig. 4:

Fig. 4. k-step Gibbs sampling.

h^(k+1) ∼ P(h^(k+1) | x^(k))    (32)
x^(k+1) ∼ P(x^(k+1) | h^(k+1)).    (33)

As k → ∞, samples (x^(k), h^(k)) are guaranteed to be accurate samples of P(x, h). However, Gibbs sampling is very time-consuming, as k needs to be large enough. An efficient learning approach, called contrastive divergence (CD) learning [32], was proposed in 2002; the learning process still performs very well even though only a small number of steps are run in the Markov chain [33]. CD learning uses two tricks to speed up the sampling process: the first is to initialize the Markov chain with a training example, and the second is to obtain samples after only k steps of Gibbs sampling. This is regarded as the CD-k learning algorithm. Many experiments show that the performance of the approximation remains very good even when k = 1.

CD is a function different from the Kullback–Leibler divergence for measuring the difference between the approximated distribution and the true distribution. Why is CD learning efficient? Because CD learning provides an approximation of the log-likelihood gradient that has been found to be a successful update rule for training probabilistic models, and a variational justification provides a theoretical proof of the convergence of the learning process [11], [34]. Conducting CD-1 learning by using (25) and (26), namely x = x(0) → h(0) → x(1) → h(1), it is easy to get the updating rules for all the parameters (θ_L and θ_R) in the FRBM model. The pseudocode is demonstrated in Algorithm 1.
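Algorithm 1 itself is not reproduced in this excerpt, so the following is only a hedged sketch of one CD-1 step, chaining (25) and (26) as x(0) → h(0) → x(1); it would be applied once to θ_L and once to θ_R, and all names are ours:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x0, b, c, W, lr=0.05):
    # One CD-1 step for a single parameter set (b, c, W): sample h(0) and
    # x(1), then update with positive-phase minus negative-phase statistics.
    h0 = (rng.random(c.shape) < sigmoid(c + W @ x0)).astype(float)    # h(0) ~ P(h|x(0)), cf. (25)
    x1 = (rng.random(b.shape) < sigmoid(b + W.T @ h0)).astype(float)  # x(1) ~ P(x|h(0)), cf. (26)
    ph0 = sigmoid(c + W @ x0)  # hidden probabilities at the data
    ph1 = sigmoid(c + W @ x1)  # hidden probabilities at the reconstruction
    W += lr * (np.outer(ph0, x0) - np.outer(ph1, x1))
    b += lr * (x0 - x1)
    c += lr * (ph0 - ph1)
    return b, c, W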
IV. EXPERIMENTAL RESULTS

In this section, the representation capabilities of the RBM and the FRBM are examined on two datasets: the BAS benchmark dataset and the MNIST handwritten digits dataset. The RBM and the proposed FRBM are trained in an unsupervised way on the BAS dataset to recover incomplete images; training on a noisy BAS dataset is also considered to compare the robustness of the two models. To compare the classification performance of the two models on the MNIST handwritten digits dataset, both the RBM and the FRBM are trained in a supervised manner.

On account of the partition function Z in (14), it is very tricky to track the RBM and FRBM training process in unsupervised learning.

[…]

The learning rates for updating W, b, and c are set to 0.05. The MSEs (the summation of the reconstruction errors over all 60 000 training samples) produced in the two learning phases (50 hidden units) are shown in Fig. 6. It is easy to see that the FRBM model generates smaller reconstruction errors than the RBM model, which means that the FRBM can learn the probability distribution more accurately than the traditional RBM. The comparative recovery results for the RBM and FRBM models are demonstrated in Figs. 7 and 8, respectively, where k […]

B. Noisy Bar-and-Stripe Benchmark Inpainting

To verify the robustness of the FRBM model, 10% and 20% of the training samples are replaced with noisy samples. Each noisy sample is generated by reversing all the pixel values in one row or column. The FRBM and the RBM are then trained with these noisy samples to investigate the robustness of the two models. The inpainting results are shown in Figs. 9 and 10; the same conclusion holds: the FRBM learns a more accurate distribution than the regular RBM.
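The corruption just described is easy to reproduce; a small sketch (image shape and binary encoding are assumptions) that reverses one randomly chosen row or column of a BAS image:

import numpy as np

rng = np.random.default_rng(0)

def corrupt_bas_sample(img):
    # Flip all pixel values in one random row or column of a binary image.
    noisy = img.copy()
    if rng.random() < 0.5:
        r = rng.integers(noisy.shape[0])
        noisy[r, :] = 1 - noisy[r, :]
    else:
        c = rng.integers(noisy.shape[1])
        noisy[:, c] = 1 - noisy[:, c]
    return noisy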
TABLE I. BAS benchmark inpainting in the testing process: mean square error (MSE) and recovery error rate (RER).
[18] I. Sutskever and G. E. Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in Proc. Int. Conf. Artif. Intell. Statist., 2007, pp. 548–555.
[19] H.-F. Wang and R.-C. Tsaur, “Insight of a fuzzy regression model,” Fuzzy Sets Syst., vol. 112, no. 3, pp. 355–369, 2000.
[20] P.-Y. Hao and J.-H. Chiang, “Fuzzy regression analysis by support vector learning approach,” IEEE Trans. Fuzzy Syst., vol. 16, no. 2, pp. 428–441, Apr. 2008.
[21] A. Fischer and C. Igel, “Training restricted Boltzmann machines: An introduction,” Pattern Recog., vol. 47, no. 1, pp. 25–39, 2014.
[22] C.-Y. Zhang and C. Chen, “An automatic setting for training restricted Boltzmann machine,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2014, pp. 4037–4041.
[23] M. Aoyagi, “Learning coefficient in Bayesian estimation of restricted Boltzmann machine,” J. Algebraic Statist., vol. 4, no. 1, pp. 31–58, 2013.
[24] M. Aoyagi and K. Nagata, “Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix-type singularity,” Neural Comput., vol. 24, no. 6, pp. 1569–1610, 2012.
[25] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Englewood Cliffs, NJ, USA: Prentice-Hall, 1995.
[26] S. Dutta and M. Chakraborty, “Fuzzy relation and fuzzy function over fuzzy sets: A retrospective,” Soft Comput., vol. 19, no. 1, pp. 99–112, 2014.
[27] J. J. Buckley, Fuzzy Probabilities: New Approach and Applications. New York, NY, USA: Springer, 2009.
[28] W. Pedrycz, A. Skowron, and V. Kreinovich, Handbook of Granular Computing. New York, NY, USA: Wiley, 2008.
[29] W. A. Lodwick and J. Kacprzyk, Fuzzy Optimization: Recent Advances and Applications. New York, NY, USA: Springer, 2010.
[30] N. N. Karnik and J. M. Mendel, “Centroid of a type-2 fuzzy set,” Inform. Sci., vol. 132, no. 1–4, pp. 195–220, 2001.
[31] O. Castillo and P. Melin, Type-2 Fuzzy Logic: Theory and Applications. New York, NY, USA: Springer, 2008.
[32] G. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
[33] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence learning,” in Proc. 10th Int. Workshop Artif. Intell. Statist., 2005, pp. 33–40.
[34] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Mach. Learning, vol. 37, no. 2, pp. 183–233, 1999.
[35] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[36] G. E. Hinton, “Learning multiple layers of representation,” Trends Cognitive Sci., vol. 11, no. 10, pp. 428–434, 2007.
[37] C. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on Big Data,” Inform. Sci., vol. 275, pp. 314–347, 2014.

C. L. Philip Chen (S’88–M’88–SM’94–F’07) received the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1985, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1988. After having worked in the U.S. for 23 years as a tenured Professor, as a Department Head, and as an Associate Dean in two different universities, he is currently the Dean of the Faculty of Science and Technology, University of Macau, Macau, China, and the Chair Professor of the Department of Computer and Information Science. From 2012 to 2013, he was the IEEE SMC Society President; currently, he is the Editor-in-Chief of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS and an Associate Editor of several IEEE Transactions. He is also the Chair of TC 9.1 Economic and Business Systems of IFAC. His research areas are systems, cybernetics, and computational intelligence. In addition, he is a Program Evaluator for the Accreditation Board of Engineering and Technology Education in computer engineering, electrical engineering, and software engineering programs. Dr. Chen is a Fellow of the AAAS.

Chun-Yang Zhang received the B.S. degree in mathematics from Beijing Normal University, Zhuhai, China, in 2010, and the M.S. and Ph.D. degrees from the Faculty of Science and Technology, University of Macau, Macau, China, in 2012 and 2015, respectively. He is currently with the College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China. His research interests include computational intelligence, machine learning, and Big Data analysis.

Long Chen (M’11) received the B.S. degree in information sciences from Peking University, Beijing, China, in 2000, the M.S.E. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2003, the M.S. degree in computer engineering from the University of Alberta, Edmonton, AB, Canada, in 2005, and the Ph.D. degree in electrical engineering from the University of Texas at San Antonio, San Antonio, TX, USA, in 2010. From 2010 to 2011, he was a Postdoctoral Fellow with the University of Texas at San Antonio. He is currently an Assistant Professor with the Department of Computer and Information Science, University of Macau, Macau, China. His current research interests include computational intelligence, Bayesian methods, and other machine learning techniques and their applications. He has been working in publication matters for many IEEE conferences and was the Publications Cochair of the IEEE International Conference on Systems, Man and Cybernetics in 2009, 2012, and 2014.

Min Gan received the B.S. degree in computer science and engineering from the Hubei University of Technology, Wuhan, China, in 2004, and the Ph.D. degree in control science and engineering from Central South University, Changsha, China, in 2010. He is currently an Associate Professor with the School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, China. His current research interests include neural networks, system identification, and nonlinear time series analysis.