Facial Expression Recognition Using Weighted Mixture Deep Neural Network Based On Double-Channel Facial Images
ABSTRACT Facial expression recognition (FER) is an important task for machines seeking to understand emotional changes in human beings. However, hand-crafted features that accurately reflect changes in expression are difficult to extract because of the influence of individual differences and variations in emotional intensity. Therefore, features that can accurately describe the changes in facial expressions are urgently required. Method: A weighted mixture deep neural network (WMDNN) is proposed to automatically extract features that are effective for FER tasks. Several pre-processing approaches, such as face detection, rotation rectification, and data augmentation, are implemented to restrict the regions for FER. Two channels of facial images, facial grayscale images and their corresponding local binary pattern (LBP) facial images, are processed by the WMDNN. Expression-related features of facial grayscale images are extracted by fine-tuning a partial VGG16 network whose parameters are initialized from the VGG16 model trained on the ImageNet database. Features of LBP facial images are extracted by a shallow convolutional neural network (CNN) built on DeepID. The outputs of both channels are fused in a weighted manner, and the final recognition result is calculated via softmax classification. Results: Experimental results indicate that the proposed algorithm can recognize the six basic facial expressions (happiness, sadness, anger, disgust, fear, and surprise) with high accuracy. The average recognition accuracies on the benchmarking datasets "CK+," "JAFFE," and "Oulu-CASIA" are 0.970, 0.922, and 0.923, respectively. Conclusions: The proposed FER method outperforms state-of-the-art FER methods based on hand-crafted features or on deep networks that use a single channel. Compared with deep networks that use multiple channels, our network achieves comparable performance with simpler procedures. Fine-tuning from a well pre-trained model is effective for FER tasks when sufficient samples cannot be collected.
INDEX TERMS Facial expression recognition, double-channel facial images, deep neural network, weighted mixture, softmax classification.
A two-stage multi-task framework to study FER was proposed by Zhong et al. [16]. Key facial regions were effectively detected through multi-task learning, and features were extracted from these regions through a sparse coding strategy. Afterward, an SVM was used as a classifier to recognize different expressions. Zhang et al. [17] extracted texture and landmark features from facial images. These two features are complementary and can capture subtle expression changes.

These FER tasks mainly involve still facial images. With the development of FER for video analysis, an increasing number of researchers have focused on motion features, such as optical flow [18], motion history images (MHI) [6], and volume LBP [19]. Dynamic models for FER tasks have also been widely studied. Walecki et al. used a conditional random field (CRF) framework to recognize different facial expressions and motion units on faces [20]. They argued that temporal variations in facial expressions could improve the accuracy of FER. Jain et al. [4] combined a linear-chain CRF, a hidden CRF, and additional hidden-layer variables to build a dynamic model. This model can describe expression changes through a similarity analysis.

B. FER APPROACHES BASED ON DEEP LEARNING
Existing FER approaches based on hand-crafted features demonstrate limited recognition performance, and considerable effort must be spent to manually extract effective features related to expression changes. Many studies have recently investigated FER based on deep learning, in consideration of deep learning's great success in pattern recognition, especially with the development of the Emotion Recognition in the Wild Challenge (EmotiW) [21]. A thorough review of deep learning is beyond the scope of this study; readers may refer to [9] and [22]. This work mainly discusses a few deep networks that can be used to implement FER tasks. Zhao et al. proposed deep belief networks (DBNs) to automatically learn facial expression features, and a multi-layer perceptron (MLP) was trained to recognize different facial expressions based on the learned features. They argued that the MLP outperforms SVM and random forest classifiers [23]. Boughrara et al. presented a constructive training algorithm for an MLP applied to FER [24]. Aside from MLPs, CNNs are also commonly used to simultaneously extract features and classify expressions. Lopes et al. presented a CNN for FER and reported its satisfactory performance on the "CK+" dataset. A data augmentation strategy was proposed to address the lack of labeled samples for CNN training, and several pre-processing technologies were used to preserve expression-related features in facial images. Later, Yu et al. combined several CNNs to study FER [25]. These CNNs were fused by learning the weights of the network responses. Kim et al. also trained multiple deep CNNs for robust FER [26]. The committee of deep CNNs was improved by varying the network architecture and the random weight initialization. To learn improved features specific to expression representation, Liu et al. proposed AU-inspired deep networks (AUDNs), motivated by the psychological theory that expressions can be decomposed into multiple facial action units [27]. However, the recognition ability of AUDN is restricted by the single modality of the input facial images. Mollahosseini et al. attempted to learn improved features for expression representation through a very deep neural network [28]. The network consisted of two convolutional layers, each followed by a max pooling layer, and four inception layers. However, this network is difficult to train without powerful machines (especially powerful GPUs). In short, recent FER approaches based on deep learning outperform traditional FER approaches based on hand-crafted features. However, only a few studies on deep learning employ facial depth images as an input of deep networks.

III. PROPOSED METHOD
A. PRE-PROCESSING
1) FACE DETECTION
Face detection is a key issue in FER. Excessive background information that is uncorrelated with expression recognition exists in a facial image, even when the image is selected from a benchmarking facial expression dataset. Thus, precise FER depends on the accuracy of face detection, which should exclude uncorrelated background information as much as possible. The commonly used Viola–Jones framework [2] is adopted for face detection in the present study. Some results of face detection (represented by a yellow rectangle) are illustrated in Fig. 3.

FIGURE 3. Illustration of detected faces with different facial expressions.

2) ROTATION RECTIFICATION
Facial images in benchmarking datasets and real environments vary in rotation, even for images of the same subject. These variations are unrelated to facial expressions and may thus affect the recognition accuracy of FER. To address this issue, the facial region is aligned via rotation rectification by means of a rotation transformation matrix defined as follows:

(Lx', Ly', 1) = (Lx, Ly, 1) · [cos θ, sin θ, 0; −sin θ, cos θ, 0; 0, 0, 1],   (1)

where (Lx, Ly) represents the original coordinate in the facial image and (Lx', Ly') represents the coordinate after rotation transformation. θ represents the rotation angle between the horizontal axis and the line segment connecting the two eye centers. We use the DRMF proposed by Cheng et al. [29] to detect both eyes in facial images with high accuracy and speed. After rotation rectification, all detected facial regions are rescaled to 72 × 72 to reduce the dimension. A smaller facial region can further accelerate FER, but it may also lead to losing expression-related detail.
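As a concrete illustration of the pre-processing steps above, the following minimal sketch chains face detection, eye-based rotation rectification (the effect of Eq. (1)), and rescaling to 72 × 72. It is not the authors' pipeline: OpenCV's bundled Haar-cascade detectors stand in for the Viola–Jones detector and for the DRMF-based eye localization, and all thresholds are illustrative.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def preprocess_face(image_bgr, out_size=72):
    """Detect the face, rectify its rotation from the eye centers, and rescale it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                        # keep the first detected face
    face = gray[y:y + h, x:x + w]

    # Rotation rectification: rotate so the line joining the eye centers becomes
    # horizontal, which is what the transformation of Eq. (1) achieves.
    eyes = eye_cascade.detectMultiScale(face)
    if len(eyes) >= 2:
        (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes[:2].tolist(), key=lambda e: e[0])
        dx = (ex2 + ew2 / 2.0) - (ex1 + ew1 / 2.0)
        dy = (ey2 + eh2 / 2.0) - (ey1 + eh1 / 2.0)
        theta = np.degrees(np.arctan2(dy, dx))   # angle between the eye line and the horizontal axis
        center = (face.shape[1] / 2.0, face.shape[0] / 2.0)
        rotation = cv2.getRotationMatrix2D(center, theta, 1.0)
        face = cv2.warpAffine(face, rotation, (face.shape[1], face.shape[0]))

    return cv2.resize(face, (out_size, out_size))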
FIGURE 4. Illustration of LBP coding.

FIGURE 6. Structure of the partial VGG16 network used to extract expression-related features from facial grayscale images.
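The sketch below outlines the two feature channels named in the captions above: a partial VGG16 extractor for grayscale images (convolutional blocks initialized from ImageNet weights) and a shallow CNN for LBP images, each ending with the two cascaded fully connected layers (100 and 6 units) described in the following paragraph. The layer counts and kernel sizes of the shallow CNN are assumptions for illustration, since the excerpt does not specify its exact DeepID-style topology, and the grayscale input is assumed to be replicated to three channels so the ImageNet-pretrained weights can be loaded.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def grayscale_channel(input_shape=(72, 72, 3)):
    """Partial VGG16: pre-trained convolutional blocks followed by fc1_1 and fc1_2."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(100, activation="relu", name="fc1_1")(x)
    x = layers.Dense(6, activation="relu", name="fc1_2")(x)    # one unit per basic expression
    return models.Model(base.input, x, name="grayscale_channel")

def lbp_channel(input_shape=(72, 72, 1)):
    """Shallow CNN for LBP facial images; the layer sizes are illustrative assumptions."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(100, activation="relu", name="fc2_1")(x)
    x = layers.Dense(6, activation="relu", name="fc2_2")(x)
    return models.Model(inputs, x, name="lbp_channel")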
LBP facial images using the shallow CNN. Each feature vector is connected to two cascaded fully connected layers for dimension reduction. These fully connected layers are fc1_1 = {s1, s2, ..., sm} (m is experimentally set to 100) and fc1_2 = {s1, s2, ..., s6} for fv1, and fc2_1 = {l1, l2, ..., lm} (m = 100) and fc2_2 = {l1, l2, ..., l6} for fv2. Distances between different

where P(y = k|x) is the probability that the class is k given that the input is x. The cross entropy is used as the cost function, which is defined as

Loss(y, z) = −∑_{i=1}^{K} z_i · log(y_i),   (6)

where z_i indicates the true label and y_i represents the output of the softmax function. In the present study, we use the back-propagation algorithm based on gradient descent optimization to minimize Eq. (6).
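A small numerical sketch of Eq. (6) follows: the two 6-dimensional channel outputs are combined by a weighted sum (the fusion rule and the weight value shown here are assumptions, since this excerpt only states that the fusion is weighted), passed through softmax to obtain P(y = k|x), and scored with the cross-entropy loss.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, z):
    # Eq. (6): Loss(y, z) = -sum_i z_i * log(y_i), with z the one-hot true label.
    return -np.sum(z * np.log(y + 1e-12))

fv1_out = np.array([2.0, 0.1, 0.3, 0.2, 0.1, 0.4])   # 6-dim output of fc1_2 (grayscale channel)
fv2_out = np.array([1.5, 0.2, 0.1, 0.3, 0.2, 0.2])   # 6-dim output of fc2_2 (LBP channel)
w = 0.5                                              # illustrative fusion weight
y = softmax(w * fv1_out + (1.0 - w) * fv2_out)       # class probabilities P(y = k | x)
z = np.eye(6)[0]                                     # one-hot label: first expression
print(cross_entropy(y, z))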
IV. EXPERIMENTAL RESULTS
A. DATASETS AND CONFIGURATIONS
We evaluate the performance of our FER method with the Keras framework on the Linux platform. All experiments are performed using a standard NVIDIA GTX 1080 GPU (8 GB), the NVIDIA CUDA framework 6.5, and the cuDNN library. To facilitate fair and effective evaluations, three benchmarking datasets composed of facial RGBD images are used. Descriptions of the datasets are listed below.

1) CK+ [32]
This fully annotated dataset includes 593 sequences that represent seven expressions (happiness, sadness, surprise, disgust, fear, anger, and neutral) of 123 subjects (males and females). We only use the six basic expressions: happiness, sadness, surprise, disgust, fear, and anger. For each sequence, we select the last frame because each sequence in this dataset begins with a neutral expression and proceeds to a peak expression. Thus, roughly 80 to 120 samples are selected for each expression. Data augmentation (using simple operations such as rotation, translation, and skewing) is used to increase the samples of each expression by 50 times. Finally, 10-fold cross validation is used for evaluation.

disgust, fear, anger, and neutral expressions. We only use the six basic expressions in this work. A total of 1960 labeled samples are captured from 28 subjects (a mix of male/female and with/without glasses), with 280 samples for each expression. Data augmentation is used to increase the samples of each expression by 10 times, and 10-fold cross validation is then used for evaluation. We capture all samples under constant illumination conditions but with heavy occlusion and drastic head deflection to test the robustness of the proposed FER approach. Notably, each practical facial image is down-sampled to 480 × 270 to reduce the amount of computation. Table 3 lists other configurations of the proposed approach, such as the learning rate, learning policy, and weight decay.

TABLE 3. Parameter settings of the proposed WMDNN.

The convergence of the proposed approach is evaluated on the three benchmarking datasets, and the results are illustrated in Figs. 9(a), 9(b), and 9(c). Each sub-figure shows the trends of accuracy (red curve) and loss (green curve) with the increase in epochs. For each dataset, the tendencies of accuracy and loss stabilize after 40–50 epochs.
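Since Table 3 is not reproduced in this excerpt, the sketch below uses placeholder hyperparameters; it only illustrates how a Keras training run yields accuracy and loss curves of the kind shown in Fig. 9, with a dummy model and random data standing in for the WMDNN and the facial datasets.

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

# Placeholder data and model: random "images" and one-hot labels for six expressions.
x = np.random.rand(200, 72, 72, 1).astype("float32")
y = np.eye(6)[np.random.randint(0, 6, size=200)]

model = models.Sequential([
    layers.Input(shape=(72, 72, 1)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="sgd",                    # actual learning settings are given in Table 3
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x, y, epochs=50, batch_size=32, verbose=0)

plt.plot(history.history["accuracy"], "r", label="accuracy")   # red curve, as in Fig. 9
plt.plot(history.history["loss"], "g", label="loss")           # green curve
plt.xlabel("epoch")
plt.legend()
plt.show()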
V. CONCLUSION
A shallow CNN is constructed to automatically extract facial expression features from LBP facial images because of the lack of an effective pre-trained model for LBP images. Subsequently, a weighted fusion strategy is proposed to fuse both features and fully use complementary facial information. The recognition results are obtained from the fused features via a softmax operation. Processing a facial image takes about 1.3 s, including 0.5 s for pre-processing and 0.8 s for recognizing different expressions. Evaluations on three benchmarking datasets verify the effectiveness of our approach in recognizing six basic expressions. On the one hand, our method outperforms FER approaches based on hand-crafted features. The ability to automatically extract features makes our method easier to implement than approaches based on hand-crafted features, which frequently require the initial detection of facial landmark points. On the other hand, by utilizing complementary facial information in a weighted fusion manner, our approach outperforms several FER approaches based on deep learning. Our future work will focus on simplifying the network to speed up the algorithm. Furthermore, we plan to investigate other channels of facial images that can further improve the fusion network.
REFERENCES
[1] C.-R. Chen, W.-S. Wong, and C.-T. Chiu, "A 0.64 mm2 real-time cascade face detection design based on reduced two-field extraction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 11, pp. 1937–1948, Nov. 2011.
[2] Y. Q. Wang, "An analysis of the Viola-Jones face detection algorithm," Image Process. Line, vol. 4, pp. 128–148, Jun. 2014.
[3] M. Demirkus, D. Precup, J. J. Clark, and T. Arbel, "Multi-layer temporal graphical model for head pose estimation in real-world videos," in Proc. IEEE Int. Conf. Image Process., Oct. 2015, pp. 3392–3396.
[4] S. Jain, C. Hu, and J. K. Aggarwal, "Facial expression recognition with temporal modeling of shapes," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 1642–1649.
[5] M. H. Siddiqi, R. Ali, A. Sattar, A. M. Khan, and S. Lee, "Depth camera-based facial expression recognition system using multilayer scheme," IETE Tech. Rev., vol. 31, no. 4, pp. 277–286, 2014.
[6] M. Valstar, M. Pantic, and I. Patras, "Motion history for facial action detection in video," in Proc. IEEE Int. Conf. Syst., Man Cybern., vol. 1, Oct. 2004, pp. 635–640.
[7] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5325–5334.
[8] J. Zhang et al., "Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition," Pattern Recognit., vol. 71, pp. 196–206, Nov. 2017.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[10] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," 2015. [Online]. Available: https://arxiv.org/abs/1502.00873
[11] M. R. Mohammadi, E. Fatemizadeh, and M. H. Mahoor, "PCA-based dictionary building for accurate facial expression recognition via sparse representation," J. Vis. Commun. Image Represent., vol. 25, no. 5, pp. 1082–1092, 2014.
[12] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," IEEE Trans. Image Process., vol. 11, no. 4, pp. 467–476, Apr. 2002.
[13] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image Vis. Comput., vol. 27, no. 6, pp. 803–816, 2009.
[14] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Trans. Affect. Comput., vol. 4, no. 2, pp. 151–160, Apr. 2013.
[15] H. Kobayashi and F. Hara, "Facial interaction between animated 3D face robot and human beings," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Comput. Cybern. Simulation, vol. 4, Oct. 1997, pp. 3732–3737.
[16] L. Zhong, Q. Liu, P. Yang, J. Huang, and D. N. Metaxas, "Learning multiscale active facial patches for expression analysis," IEEE Trans. Cybern., vol. 45, no. 8, pp. 1499–1510, Aug. 2014.
[17] W. Zhang, Y. Zhang, L. Ma, J. Guan, and S. Gong, "Multimodal learning for facial expression recognition," Pattern Recognit., vol. 48, no. 10, pp. 3191–3202, 2015.
[18] K. Mase, "Recognition of facial expression from optical flow," IEICE Trans. Inf. Syst., vol. E74-D, no. 10, pp. 3474–3483, 1991.
[19] G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007.
[20] R. Walecki, O. Rudovic, V. Pavlovic, and M. Pantic, "Variable-state latent conditional random fields for facial expression recognition and action unit detection," in Proc. 11th IEEE Int. Conf. Workshops Automat. Face Gesture Recognit. (FG), May 2015, pp. 1–8.
[21] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, "Emotion recognition in the wild challenge 2014: Baseline, data and protocol," in Proc. ACM Int. Conf. Multimodal Interact., 2014, pp. 461–466.
[22] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[23] Y. Lv, Z. Feng, and C. Xu, "Facial expression recognition via deep learning," IETE Tech. Rev., vol. 32, no. 5, pp. 347–355, 2015.
[24] H. Boughrara, M. Chtourou, C. B. Amar, and L. Chen, "Facial expression recognition based on a MLP neural network using constructive training algorithm," Multimedia Tools Appl., vol. 75, no. 2, pp. 709–731, 2016.
[25] Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proc. ACM Int. Conf. Multimodal Interact., 2015, pp. 435–442.
[26] B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, "Hierarchical committee of deep convolutional neural networks for robust facial expression recognition," J. Multimodal User Interfaces, vol. 10, no. 2, pp. 173–189, 2016.
[27] M. Liu, S. Li, S. Shan, and X. Chen, "AU-inspired deep networks for facial expression feature learning," Neurocomputing, vol. 159, pp. 126–136, Jul. 2015.
[28] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis., Mar. 2016, pp. 1–10.
[29] S. Cheng, A. Asthana, S. Zafeiriou, J. Shen, and M. Pantic, "Real-time generic face tracking in the wild with CUDA," in Proc. 5th ACM Multimedia Syst. Conf., 2014, pp. 148–151.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[32] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Proc. IEEE Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 94–101.
[33] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 200–205.
[34] S. Aly, A. L. Abbott, and M. Torki, "A multi-modal feature fusion framework for Kinect-based facial expression recognition using dual kernel discriminant analysis (DKDA)," in Proc. IEEE Winter Conf. Appl. Comput. Vis., Mar. 2016, pp. 1–10.
[35] A. R. Rivera, J. R. Castillo, and O. O. Chae, "Local directional number pattern for face analysis: Face and expression recognition," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1740–1752, May 2013.
[36] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order," Pattern Recognit., vol. 61, pp. 610–628, Jan. 2017.
BIAO YANG was born in Changzhou, Jiangsu, in 1987. He received the B.S. degree from the College of Automation, Nanjing University of Technology, in 2009, and the M.S. and Ph.D. degrees from the College of Instrument Science and Technology, Southeast University, Nanjing, China, in 2011 and 2014, respectively. Since 2015, he has been a Lecturer with the Department of Information Science and Engineering, Changzhou University. His research interests include pattern recognition and machine learning.

JINMENG CAO was born in Wuxi, China, in 1994. She received the B.S. degree in automation from Changzhou University in 2016. She is currently pursuing the master's degree in machine learning.

RONGRONG NI was born in Nantong, China, in 1987. She received the B.S. degree from the College of Instrument Science and Technology, Southeast University, Nanjing, China, in 2012. She is currently a Research Assistant with the College of Mechanical and Electrical, Changzhou Textile Garment Institute, Changzhou. Her research interests include computer vision and pattern recognition.

YUYU ZHANG was born in Huai'an, China, in 1992. He received the B.S. degree from the Suzhou University of Science and Technology in 2016. He is currently pursuing the master's degree in machine learning.