Facial Expression Recognition Using Weighted Mixture Deep Neural Network Based On Double-Channel Facial Images
ABSTRACT Facial expression recognition (FER) is an important task for machines seeking to understand emotional changes in human beings. However, hand-crafted features that accurately reflect changes in expression are difficult to extract because of the influence of individual differences and variations in emotional intensity. Therefore, features that can accurately describe the changes in facial expressions are urgently required. Method: A weighted mixture deep neural network (WMDNN) is proposed to automatically extract features that are effective for FER tasks. Several pre-processing approaches, such as face detection, rotation rectification, and data augmentation, are implemented to restrict the regions for FER. Two channels of facial images, facial grayscale images and their corresponding local binary pattern (LBP) facial images, are processed by the WMDNN. Expression-related features of facial grayscale images are extracted by fine-tuning a partial VGG16 network whose parameters are initialized from the VGG16 model trained on the ImageNet database. Features of LBP facial images are extracted by a shallow convolutional neural network (CNN) built on DeepID. The outputs of both channels are fused in a weighted manner, and the final recognition result is calculated via softmax classification. Results: Experimental results indicate that the proposed algorithm can recognize the six basic facial expressions (happiness, sadness, anger, disgust, fear, and surprise) with high accuracy. The average recognition accuracies on the benchmarking datasets "CK+," "JAFFE," and "Oulu-CASIA" are 0.970, 0.922, and 0.923, respectively. Conclusions: The proposed FER method outperforms state-of-the-art FER methods based on hand-crafted features or on deep networks that use a single channel. Compared with deep networks that use multiple channels, our network achieves comparable performance with simpler procedures. Fine-tuning from a well pre-trained model is effective for FER tasks when sufficient samples cannot be collected.
INDEX TERMS Facial expression recognition, double-channel facial images, deep neural network, weighted mixture, softmax classification.
A two-stage multi-task framework to study FER was proposed by Zhong et al. [16]. Key facial regions were effectively detected through multi-task learning, and features were extracted from these regions through a sparse coding strategy. Afterward, an SVM was used as a classifier to recognize different expressions. Zhang et al. [17] extracted texture and landmark features from facial images. These two features are complementary and can capture subtle expression changes.

These FER tasks mainly involve still facial images. With the development of FER for video analysis, an increasing number of researchers have focused on motion features, such as optical flow [18], motion history images (MHI) [6], and volume LBP [19]. Dynamic models for FER tasks have also been widely studied. Walecki et al. used a conditional random field (CRF) framework to recognize different facial expressions and motion units on faces [20]. They argued that temporal variations in facial expressions could improve the accuracy of FER. Jain et al. [4] combined a linear-chain CRF, a hidden CRF, and additional hidden-layer variables to build a dynamic model. This model can describe expression changes through a similarity analysis.

B. FER APPROACHES BASED ON DEEP LEARNING
Existing FER approaches based on hand-crafted features demonstrate limited recognition performance, and considerable effort must be spent to manually extract effective features related to expression changes. Many studies have recently investigated FER based on deep learning, in consideration of deep learning's great success in pattern recognition, especially with the development of the Emotion Recognition in the Wild Challenge (EmotiW) [21]. A thorough review of deep learning is beyond the scope of this study; readers may refer to [9] and [22]. This work mainly discusses a few deep networks that can be used to implement FER tasks. Zhao et al. proposed deep belief networks (DBNs) to automatically learn facial expression features, and a multi-layer perceptron (MLP) was trained to recognize different facial expressions based on the learned features. They argued that the MLP outperforms SVM and random forest classifiers [23]. Boughrara et al. presented a constructive training algorithm for an MLP applied to FER [24]. Aside from MLPs, CNNs are also commonly used to simultaneously extract features and classify expressions. Lopes et al. presented a CNN for FER and reported its satisfactory performance on the "CK+" dataset. A data augmentation strategy was proposed to address the lack of labeled samples for CNN training, and several pre-processing technologies were used to preserve expression-related features in facial images. Later, Yu et al. combined several CNNs to study FER [25]. These CNNs were fused by learning the weights of the network responses. Kim et al. also trained multiple deep CNNs for robust FER [26]. The committee of deep CNNs was improved by varying the network architecture and the random weight initialization. To learn improved features specific to expression representation, Liu et al. proposed AU-inspired deep networks (AUDNs), motivated by the psychological theory that expressions can be decomposed into multiple facial action units [27]. However, the recognition ability of AUDN is restricted by the single modality of the input facial images. Mollahosseini et al. attempted to learn improved features for expression representation through a very deep neural network [28]. The network consisted of two convolutional layers, each followed by a max pooling layer, and four inception layers. However, this network is difficult to train without powerful machines (especially powerful GPUs). In short, recent FER approaches based on deep learning outperform traditional FER approaches based on hand-crafted features. However, only a few studies on deep learning employ facial depth images as an input of deep networks.

III. PROPOSED METHOD
A. PRE-PROCESSING
1) FACE DETECTION
Face detection is a key issue in FER. Excessive background information that is uncorrelated with expression recognition exists in a facial image, even when the image is selected from a benchmarking facial expression dataset. Thus, precise FER depends on the accuracy of face detection, which should exclude uncorrelated background information as much as possible. The commonly used Viola–Jones framework [2] is adopted for face detection in the present study. Some results of face detection (represented by a yellow rectangle) are illustrated in Fig. 3.

FIGURE 3. Illustration of detected faces with different facial expressions.

2) ROTATION RECTIFICATION
Facial images in benchmarking datasets and real environments vary in rotation, even for images of the same subject. These variations are unrelated to facial expressions and may thus affect the recognition accuracy of FER. To address this issue, the facial region is aligned via rotation rectification by means of a rotation transformation matrix defined as follows:

(Lx', Ly', 1) = (Lx, Ly, 1) · [cos θ, sin θ, 0; −sin θ, cos θ, 0; 0, 0, 1],   (1)

where (Lx, Ly) represents the original coordinate in the facial image and (Lx', Ly') represents the coordinate after rotation transformation. θ represents the rotation angle between the horizontal axis and the line segment connecting the two eye centers. We use the DRMF proposed by Cheng et al. [29] to detect both eyes in facial images with high accuracy and speed. After rotation rectification, all detected facial regions are rescaled to 72 × 72 to reduce the dimension. A smaller facial region can further accelerate FER, but it may also lead to losing expression-related detail.
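As a concrete illustration of the pre-processing steps above, the following minimal sketch chains face detection, eye-based rotation rectification (the effect of Eq. (1)), and rescaling to 72 × 72. It is not the authors' pipeline: OpenCV's bundled Haar-cascade detectors stand in for the Viola–Jones detector and for the DRMF-based eye localization, and all thresholds are illustrative.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def preprocess_face(image_bgr, out_size=72):
    """Detect the face, rectify its rotation from the eye centers, and rescale it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                        # keep the first detected face
    face = gray[y:y + h, x:x + w]

    # Rotation rectification: rotate so the line joining the eye centers becomes
    # horizontal, which is what the transformation of Eq. (1) achieves.
    eyes = eye_cascade.detectMultiScale(face)
    if len(eyes) >= 2:
        (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes[:2].tolist(), key=lambda e: e[0])
        dx = (ex2 + ew2 / 2.0) - (ex1 + ew1 / 2.0)
        dy = (ey2 + eh2 / 2.0) - (ey1 + eh1 / 2.0)
        theta = np.degrees(np.arctan2(dy, dx))   # angle between the eye line and the horizontal axis
        center = (face.shape[1] / 2.0, face.shape[0] / 2.0)
        rotation = cv2.getRotationMatrix2D(center, theta, 1.0)
        face = cv2.warpAffine(face, rotation, (face.shape[1], face.shape[0]))

    return cv2.resize(face, (out_size, out_size))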
FIGURE 4. Illustration of LBP coding.

FIGURE 6. Structure of the partial VGG16 network used to extract expression-related features from facial grayscale images.
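The sketch below outlines the two feature channels named in the captions above: a partial VGG16 extractor for grayscale images (convolutional blocks initialized from ImageNet weights) and a shallow CNN for LBP images, each ending with the two cascaded fully connected layers (100 and 6 units) described in the following paragraph. The layer counts and kernel sizes of the shallow CNN are assumptions for illustration, since the excerpt does not specify its exact DeepID-style topology, and the grayscale input is assumed to be replicated to three channels so the ImageNet-pretrained weights can be loaded.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def grayscale_channel(input_shape=(72, 72, 3)):
    """Partial VGG16: pre-trained convolutional blocks followed by fc1_1 and fc1_2."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(100, activation="relu", name="fc1_1")(x)
    x = layers.Dense(6, activation="relu", name="fc1_2")(x)    # one unit per basic expression
    return models.Model(base.input, x, name="grayscale_channel")

def lbp_channel(input_shape=(72, 72, 1)):
    """Shallow CNN for LBP facial images; the layer sizes are illustrative assumptions."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(100, activation="relu", name="fc2_1")(x)
    x = layers.Dense(6, activation="relu", name="fc2_2")(x)
    return models.Model(inputs, x, name="lbp_channel")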
LBP facial images using the shallow CNN. Each feature vector is connected to two cascaded fully connected layers for dimension reduction. These fully connected layers are fc1_1 = {s1, s2, ..., sm} (m is experimentally set to 100) and fc1_2 = {s1, s2, ..., s6} for fv1, and fc2_1 = {l1, l2, ..., lm} (m = 100) and fc2_2 = {l1, l2, ..., l6} for fv2. Distances between different

where P(y = k|x) is the probability that the class is k given that the input is x. The cross entropy is used as the cost function, which is defined as

Loss(y, z) = −∑_{i=1}^{K} z_i · log(y_i),   (6)

where z_i indicates the true label and y_i represents the output of the softmax function. In the present study, we use the back-propagation algorithm based on gradient descent optimization to minimize Eq. (6).
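A small numerical sketch of Eq. (6) follows: the two 6-dimensional channel outputs are combined by a weighted sum (the fusion rule and the weight value shown here are assumptions, since this excerpt only states that the fusion is weighted), passed through softmax to obtain P(y = k|x), and scored with the cross-entropy loss.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, z):
    # Eq. (6): Loss(y, z) = -sum_i z_i * log(y_i), with z the one-hot true label.
    return -np.sum(z * np.log(y + 1e-12))

fv1_out = np.array([2.0, 0.1, 0.3, 0.2, 0.1, 0.4])   # 6-dim output of fc1_2 (grayscale channel)
fv2_out = np.array([1.5, 0.2, 0.1, 0.3, 0.2, 0.2])   # 6-dim output of fc2_2 (LBP channel)
w = 0.5                                              # illustrative fusion weight
y = softmax(w * fv1_out + (1.0 - w) * fv2_out)       # class probabilities P(y = k | x)
z = np.eye(6)[0]                                     # one-hot label: first expression
print(cross_entropy(y, z))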
IV. EXPERIMENTAL RESULTS
A. DATASETS AND CONFIGURATIONS
We evaluate the performance of our FER method with the Keras framework on the Linux platform. All experiments are performed using a standard NVIDIA GTX 1080 GPU (8 GB), the NVIDIA CUDA framework 6.5, and the cuDNN library. To facilitate fair and effective evaluations, three benchmarking datasets composed of facial RGBD images are used. Descriptions of the datasets are listed below.

1) CK+ [32]
This fully annotated dataset includes 593 sequences that represent seven expressions (happiness, sadness, surprise, disgust, fear, anger, and neutral) of 123 subjects (males and females). We only use the six basic expressions: happiness, sadness, surprise, disgust, fear, and anger. For each sequence, we select the last frame because each sequence in this dataset begins with a neutral expression and proceeds to a peak expression. Thus, roughly 80 to 120 samples are selected for each expression. Data augmentation (using simple operations such as rotation, translation, and skewing) is used to increase the samples of each expression by 50 times. Finally, 10-fold cross validation is used for evaluation.

disgust, fear, anger, and neutral expressions. We only use the six basic expressions in this work. A total of 1960 labeled samples are captured from 28 subjects (a mix of male/female and with/without glasses), with 280 samples for each expression. Data augmentation is used to increase the samples of each expression by 10 times, and 10-fold cross validation is then used for evaluation. We capture all samples under constant illumination conditions but with heavy occlusion and drastic head deflection to test the robustness of the proposed FER approach. Notably, each practical facial image is down-sampled to 480 × 270 to reduce the amount of computation. Table 3 lists other configurations of the proposed approach, such as the learning rate, learning policy, and weight decay.

TABLE 3. Parameter settings of the proposed WMDNN.

The convergence of the proposed approach is evaluated on the three benchmarking datasets, and the results are illustrated in Figs. 9(a), 9(b), and 9(c). Each sub-figure shows the trends of accuracy (red curve) and loss (green curve) with the increase in epochs. For each dataset, the tendencies of accuracy and loss stabilize after 40–50 epochs.
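Since Table 3 is not reproduced in this excerpt, the sketch below uses placeholder hyperparameters; it only illustrates how a Keras training run yields accuracy and loss curves of the kind shown in Fig. 9, with a dummy model and random data standing in for the WMDNN and the facial datasets.

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

# Placeholder data and model: random "images" and one-hot labels for six expressions.
x = np.random.rand(200, 72, 72, 1).astype("float32")
y = np.eye(6)[np.random.randint(0, 6, size=200)]

model = models.Sequential([
    layers.Input(shape=(72, 72, 1)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="sgd",                    # actual learning settings are given in Table 3
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x, y, epochs=50, batch_size=32, verbose=0)

plt.plot(history.history["accuracy"], "r", label="accuracy")   # red curve, as in Fig. 9
plt.plot(history.history["loss"], "g", label="loss")           # green curve
plt.xlabel("epoch")
plt.legend()
plt.show()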
V. CONCLUSION
A shallow CNN is constructed to automatically extract facial expression features from LBP facial images because of the lack of an effective pre-trained model for LBP images. Subsequently, a weighted fusion strategy is proposed to fuse both features and fully use complementary facial information. The recognition results are obtained from the fused features via a softmax operation. Processing a facial image takes about 1.3 s, including 0.5 s for pre-processing and 0.8 s for recognizing different expressions. Evaluations on three benchmarking datasets verify the effectiveness of our approach in recognizing six basic expressions. On the one hand, our method outperforms FER approaches based on hand-crafted features. The ability to automatically extract features makes our method easier to implement than approaches based on hand-crafted features, which frequently require the initial detection of facial landmark points. On the other hand, by utilizing complementary facial information in a weighted fusion manner, our approach outperforms several FER approaches based on deep learning. Our future work will focus on simplifying the network to speed up the algorithm. Furthermore, we plan to investigate other channels of facial images that can further improve the fusion network.
REFERENCES
[1] C.-R. Chen, W.-S. Wong, and C.-T. Chiu, "A 0.64 mm2 real-time cascade face detection design based on reduced two-field extraction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 11, pp. 1937–1948, Nov. 2011.
[2] Y. Q. Wang, "An analysis of the Viola-Jones face detection algorithm," Image Process. Line, vol. 4, pp. 128–148, Jun. 2014.
[3] M. Demirkus, D. Precup, J. J. Clark, and T. Arbel, "Multi-layer temporal graphical model for head pose estimation in real-world videos," in Proc. IEEE Int. Conf. Image Process., Oct. 2015, pp. 3392–3396.
[4] S. Jain, C. Hu, and J. K. Aggarwal, "Facial expression recognition with temporal modeling of shapes," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 1642–1649.
[5] M. H. Siddiqi, R. Ali, A. Sattar, A. M. Khan, and S. Lee, "Depth camera-based facial expression recognition system using multilayer scheme," IETE Tech. Rev., vol. 31, no. 4, pp. 277–286, 2014.
[6] M. Valstar, M. Pantic, and I. Patras, "Motion history for facial action detection in video," in Proc. IEEE Int. Conf. Syst., Man Cybern., vol. 1, Oct. 2004, pp. 635–640.
[7] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5325–5334.
[8] J. Zhang et al., "Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition," Pattern Recognit., vol. 71, pp. 196–206, Nov. 2017.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[10] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," 2015. [Online]. Available: https://arxiv.org/abs/1502.00873
[11] M. R. Mohammadi, E. Fatemizadeh, and M. H. Mahoor, "PCA-based dictionary building for accurate facial expression recognition via sparse representation," J. Vis. Commun. Image Represent., vol. 25, no. 5, pp. 1082–1092, 2014.
[12] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," IEEE Trans. Image Process., vol. 11, no. 4, pp. 467–476, Apr. 2002.
[13] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image Vis. Comput., vol. 27, no. 6, pp. 803–816, 2009.
[14] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Trans. Affect. Comput., vol. 4, no. 2, pp. 151–160, Apr. 2013.
[15] H. Kobayashi and F. Hara, "Facial interaction between animated 3D face robot and human beings," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Comput. Cybern. Simulation, vol. 4, Oct. 1997, pp. 3732–3737.
[16] L. Zhong, Q. Liu, P. Yang, J. Huang, and D. N. Metaxas, "Learning multiscale active facial patches for expression analysis," IEEE Trans. Cybern., vol. 45, no. 8, pp. 1499–1510, Aug. 2014.
[17] W. Zhang, Y. Zhang, L. Ma, J. Guan, and S. Gong, "Multimodal learning for facial expression recognition," Pattern Recognit., vol. 48, no. 10, pp. 3191–3202, 2015.
[18] K. Mase, "Recognition of facial expression from optical flow," IEICE Trans. Inf. Syst., vol. E74-D, no. 10, pp. 3474–3483, 1991.
[19] G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007.
[20] R. Walecki, O. Rudovic, V. Pavlovic, and M. Pantic, "Variable-state latent conditional random fields for facial expression recognition and action unit detection," in Proc. 11th IEEE Int. Conf. Workshops Automat. Face Gesture Recognit. (FG), May 2015, pp. 1–8.
[21] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, "Emotion recognition in the wild challenge 2014: Baseline, data and protocol," in Proc. ACM Int. Conf. Multimodal Interact., 2014, pp. 461–466.
[22] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[23] Y. Lv, Z. Feng, and C. Xu, "Facial expression recognition via deep learning," IETE Tech. Rev., vol. 32, no. 5, pp. 347–355, 2015.
[24] H. Boughrara, M. Chtourou, C. B. Amar, and L. Chen, "Facial expression recognition based on a MLP neural network using constructive training algorithm," Multimedia Tools Appl., vol. 75, no. 2, pp. 709–731, 2016.
[25] Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proc. ACM Int. Conf. Multimodal Interact., 2015, pp. 435–442.
[26] B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, "Hierarchical committee of deep convolutional neural networks for robust facial expression recognition," J. Multimodal User Interfaces, vol. 10, no. 2, pp. 173–189, 2016.
[27] M. Liu, S. Li, S. Shan, and X. Chen, "AU-inspired deep networks for facial expression feature learning," Neurocomputing, vol. 159, pp. 126–136, Jul. 2015.
[28] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis., Mar. 2016, pp. 1–10.
[29] S. Cheng, A. Asthana, S. Zafeiriou, J. Shen, and M. Pantic, "Real-time generic face tracking in the wild with CUDA," in Proc. 5th ACM Multimedia Syst. Conf., 2014, pp. 148–151.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[32] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Proc. IEEE Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 94–101.
[33] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 200–205.
[34] S. Aly, A. L. Abbott, and M. Torki, "A multi-modal feature fusion framework for Kinect-based facial expression recognition using dual kernel discriminant analysis (DKDA)," in Proc. IEEE Winter Conf. Appl. Comput. Vis., Mar. 2016, pp. 1–10.
[35] A. R. Rivera, J. R. Castillo, and O. O. Chae, "Local directional number pattern for face analysis: Face and expression recognition," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1740–1752, May 2013.
[36] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order," Pattern Recognit., vol. 61, pp. 610–628, Jan. 2017.
BIAO YANG was born in Changzhou, Jiangsu, in 1987. He received the B.S. degree from the College of Automation, Nanjing University of Technology, in 2009, and the M.S. and Ph.D. degrees from the College of Instrument Science and Technology, Southeast University, Nanjing, China, in 2011 and 2014, respectively. Since 2015, he has been a Lecturer with the Department of Information Science and Engineering, Changzhou University. His research interests include pattern recognition and machine learning.

JINMENG CAO was born in Wuxi, China, in 1994. She received the B.S. degree in automation from Changzhou University in 2016. She is currently pursuing the master's degree in machine learning.

RONGRONG NI was born in Nantong, China, in 1987. She received the B.S. degree from the College of Instrument Science and Technology, Southeast University, Nanjing, China, in 2012. She is currently a Research Assistant with the College of Mechanical and Electrical, Changzhou Textile Garment Institute, Changzhou. Her research interests include computer vision and pattern recognition.

YUYU ZHANG was born in Huai'an, China, in 1992. He received the B.S. degree from the Suzhou University of Science and Technology in 2016. He is currently pursuing the master's degree in machine learning.