
Combining Deep Convolutional Neural Networks With Stochastic Ensemble Weight Optimization For Facial Expression Recognition in The Wild

This document summarizes a research paper that proposes a new deep convolutional neural network (DCNN) ensemble classifier for facial expression recognition in uncontrolled environments. The key aspects of the proposed method are: (1) Formulating the process of finding optimal ensemble weights as a stochastic optimization problem solved using simulated annealing. This allows the weights to minimize generalized classification error. (2) Creating diverse DCNN ensemble members by combining different face representations and bagging, which increases ensemble diversity. Experiments on three facial expression datasets show the proposed DCNN ensemble achieves state-of-the-art performance.


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 25, 2023

Combining Deep Convolutional Neural Networks With Stochastic Ensemble Weight Optimization for Facial Expression Recognition in the Wild

Jae Young Choi, Member, IEEE, and Bumshik Lee, Member, IEEE

Abstract—Although recent emotion recognition methods (based on facial expression cues) achieve excellent performance in controlled scenarios, the recognition of emotion in the wild remains a challenging problem because of occlusion, large head poses, illumination variations, etc. Recent advances in deep learning show that combining an ensemble of deep learning models can considerably outperform the approach of using only a single deep learning model for challenging recognition problems. This paper presents a novel ensemble deep learning method, “deep convolutional neural network (DCNN) ensemble classifier”, for improved facial expression recognition (FER) in the wild. Our proposed DCNN ensemble classifier is novel in terms of the following aspects: (1) the process of finding ensemble weights for combining DCNN decision outputs is formulated as a stochastic optimization problem (via simulated annealing) in which the energy to be minimized represents the generalized (test) classification error of the DCNN ensemble and (2) for the creation of DCNN ensemble members, we propose the combined use of different types of face representations and bagging (T. G. Dietterich, 2000), which is quite useful in increasing the diversity of the DCNN ensemble. Extensive and comparative experiments on three wild FER datasets, namely FER2013, SFEW2.0, and RAF-DB, show that the proposed DCNN ensemble classifier achieves competitive FER performances when compared with other recently developed methods—76.69%, 58.68%, and 87.13% of FER accuracy under the FER2013, SFEW2.0, and RAF-DB evaluation protocols, respectively.

Index Terms—Facial expression recognition, ensemble deep learning, stochastic ensemble weight optimization, simulated annealing, energy function, deep ensemble generalization error.

Manuscript received 30 January 2021; revised 31 May 2021 and 13 August 2021; accepted 3 October 2021. Date of publication 26 October 2021; date of current version 13 January 2023. This work was supported in part by the Hankuk University of Foreign Studies Research Fund, in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under Grant 2021R1A2C1092322, in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) under Grant 1711134404. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Ramanathan Subramanian. (Corresponding author: Bumshik Lee.)
Jae Young Choi is with the Division of Computer Engineering, Hankuk University of Foreign Studies, Yongin-si, Gyeonggi-do 17025, Korea (e-mail: jy-[email protected]).
Bumshik Lee is with the Information and Communication Engineering, Chosun University, Gwangju 61452, Korea (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TMM.2021.3121547.
Digital Object Identifier 10.1109/TMM.2021.3121547

1520-9210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: HANKUK UNIVERSITY OF FOREIGN STUDIES. Downloaded on January 17,2023 at 18:31:11 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Facial expression traits play the most important role in analyzing a subject’s emotional state in a spontaneous environment [18]. Automated facial expression recognition (FER) has a wide range of applications, such as in interactive games, sociable robotics, crowd analytics, neuromarketing, and several other human-computer interaction systems [49]. Recently, with the transition of FER from laboratory-controlled to challenging in-the-wild conditions, deep convolutional neural networks (DCNNs) [12], [20] have achieved state-of-the-art performances across a variety of FER-related applications [50]. Thus, the DCNNs have been mainstreamed in the field of FER; the results so far have been promising, and most of the FER challenge winners have employed DCNNs [3]–[5], [15]. It mainly benefits from large-scale training data and an end-to-end learning framework [20].

Thus far, almost all DCNN-based FER systems are based on the design paradigm of a single classifier with only one input image representation (grayscale or RGB color representations) [54]. However, the FER problem in the wild is too complex [21], [22] to be solved by a single DCNN-based classifier. This is because facial images are often captured in the wild under natural conditions such as occlusion, large head pose or illumination variations, and lower image resolution [21], [22]. As such, it is difficult to separate the expressions’ feature space, that is, facial features from one subject in two different expressions may be quite close in the feature space, while the facial features from two subjects with the same expression may be very far from each other [51], [52].

Many examples of both natural and artificial systems show that a composite system consisting of several subsystems can reduce the total complexity of the system while satisfactorily solving a difficult problem [23], [26]. The success of a neural network ensemble in improving a classifier’s generalization [24], [25] is a typical example. In pattern classification, ensemble classifiers are often considered a practical and effective solution for difficult recognition problems such as in-the-wild FER [28]. The use of ensemble classifiers is supported by the fact that the patterns misclassified by different classifiers are not necessarily the same [14], [26]. However, designing an ensemble of multiple DCNNs is still an open problem [27]. In particular, effectively constructing an ensemble of DCNNs and combining their
decisions can be the primary challenge and an important research direction [27], [54].

In this paper, we propose a new DCNN ensemble classifier for in-the-wild FER. Our method has two key aspects. First, we formulate the process of finding ensemble combination weights as an optimization problem, which can be effectively solved using our novel simulated annealing (SA)-based algorithm. We show that ensemble weights optimized via our SA-based algorithm can significantly improve FER performance. Second, we introduce an effective DCNN ensemble construction that takes advantage of the combined use of different face representations and bagging [16]. This allows maximizing the diversity among DCNN ensemble members in the sense that they do not make coincident errors, as demonstrated by our experimental results in Section VI.C. We evaluate the proposed method on three publicly available wild FER databases (DBs) collected under real-world scenarios: FER2013 [29], static facial expressions in the wild (SFEW2.0) [53], and real-world affective face database (RAF-DB) [67]. The results of our method are better than, or at least competitive with the best-reported results.

The rest of this paper is organized as follows: Section II reviews previous works on deep learning-based FER. Section III describes the motivation behind the proposed DCNN ensemble weight optimization and construction for completeness. Section IV details the proposed DCNN ensemble construction. In Section V, the proposed SA-based DCNN ensemble weight optimization algorithm is presented. Section VI presents extensive and comparative experimental results to demonstrate the effectiveness of the proposed method for FER in the wild. Finally, the discussion and conclusions are presented in Section VII.

II. RELATED WORK

A significant part of FER’s recent progress has been achieved because of the emergence of deep learning models and, more specifically, with DCNNs [12]. In the following paragraphs, we review some recent methods based on deep neural networks that are most relevant to our work and refer the reader to a recent and comprehensive survey [54] on FER using deep learning for further information.

The authors of [55] proposed an action unit-inspired deep network (AUDN) to exploit a psychological theory where expressions can be decomposed into several facial expression action units. Tang [46] proposed a deep neural network architecture using a linear support vector machine (SVM) as the top layer instead of a softmax layer. To train deep neural networks, L2-SVM loss was used for facial expression classification. Devries et al. [47] developed a multi-task CNN that jointly predicts FER and facial landmarks and demonstrated that learning features associated with facial landmark position can improve FER. Shao et al. [19] proposed three different DCNN models, namely shallow Light-CNN, dual-branch CNN, and pre-trained CNN, for FER in the wild. Liu et al. [56] proposed a method for optimizing both (N+M)-tuples cluster loss and softmax loss via two fully connected layer branch configurations. Specifically, during the training, the (N+M)-tuples cluster loss was formalized to alleviate the difficulty of anchor selection and threshold validation in the triplet loss for the identity-invariant FER. Ding et al. [57] proposed the FaceNet2ExpNet, where a novel two-stage training algorithm for FER was used. In the first stage, a probabilistic distribution function was used to regularize the training of the targeted FER net based on the already fine-tuned face net. In the second stage, to further boost the discriminative capability, randomly initialized fully connected layers attached to the trained convolutional blocks were used to train the entire network from scratch with strong facial expression label supervision. Meng et al. [58] proposed a novel identity-aware CNN (IACNN) that contains two identical sub-CNNs with sharing weights. In their method, the expression-sensitive contrastive loss was used to learn expression-related features, while the identity-sensitive contrastive loss was employed to learn identity-related features. Identity-invariant FER was achieved by combining expression-related and identity-related features.

Recent advances in deep learning suggest that the use of an ensemble of DCNNs can improve the performance of image recognition and classification tasks. In [1], multi-column deep neural networks have been suggested, where each column is represented as a DCNN. The outputs of all columns are simply averaged for decision aggregation. Yu et al. [4] proposed an effective optimization framework using the log-likelihood loss and hinge loss to adaptively combine multiple DCNNs to perform expression recognition. Kim et al. [5] used an exponentially weighted fusion based on validation accuracy and constructed a hierarchical architecture of DCNN committees by implementing majority voting or a simple average for FER. Pramerdorfer et al. [59] constructed an ensemble of DCNNs based on different deep architectures such as VGG, Inception, and ResNet; they used a simple ensemble voting of the outputs produced by eight DCNNs as ensemble members. In summary, the aforementioned works show that deep learning models can be employed as good base classifiers (i.e., ensemble members) for typical ensemble classification approaches, such as the average combination rule [28]. However, ensemble combination approaches with deep learning in the field of FER [54] have been limited to simple averaging or majority voting based on the conditional probability vector [28] obtained from each DCNN. Thus, an optimal combination of the outputs of ensemble DCNN classifiers has not yet been explored in the literature.

Unlike other recent DCNN-based ensemble approaches focused on FER [1], [4], [5], [7], [54], [59], the main contributions of our work are summarized as follows:
• The focus of this work is on making the DCNN ensemble combination more adaptable to the recognition problem being addressed. We developed a novel method that optimally determines the ensemble (combination) weights by minimizing the generalized (test) classification error of the whole DCNN ensemble. In our method, finding ensemble weights is formulated as an optimization problem, and it can be effectively solved using our novel simulated annealing (SA) [36], [37]-based algorithm well-suited for improving FER in the wild.
• In addition to ensemble weight optimization, we propose a new ensemble construction approach that takes advantage of the combined use of different face representations
and bagging [16] to create DCNN ensemble members. It allows for maximizing diversity (for facilitating a complementary effect) among DCNN members, thereby boosting FER performance.

III. MOTIVATION

To combine the outputs of DCNN members, we adopt weighted majority voting (WMV) because WMV and its variants are by far the most popular approach [43] for ensemble combination. Assuming that we are given a set of $M$ individual DCNNs, $\{f_1, \ldots, f_M\}$, our task is to combine the outputs of $f_k$ ($k = 1, \ldots, M$) to predict the emotion class label $c_j$ (e.g., happy). Here, the outputs of $f_k$ are in the form of an $L$-dimensional vector $[f_{k1}(x), f_{k2}(x), \ldots, f_{kL}(x)]$ for the input instance (sample) $x$, where $f_{kj}(x) \in [0, 1]$ represents an estimate of the posterior probability $P(c_j|x)$. In WMV, the output of the DCNN ensemble classifier can be defined as

$$F(x) = \sum_{k=1}^{M} w_k f_k(x) = \sum_{k=1}^{M} \left[ w_k f_{k1}(x), w_k f_{k2}(x), \ldots, w_k f_{kL}(x) \right] \tag{1}$$

where $w_k$ is the weight assigned to each DCNN member. In practice, the weights are usually normalized and constrained by $w_k \geq 0$ and $\sum_{k=1}^{M} w_k = 1$.

With adequate weight assignments, the classification performance of WMV can be significantly improved [26], [30]. In particular, Zhou’s work [31] provides insight and motivation into the importance of developing an ensemble weight optimization algorithm. He adopted a Bayesian optimal discriminant analysis to determine ensemble weights for achieving a minimum ensemble classification error. We now briefly review this analysis to adequately explain our proposed method. Let $p_k$ denote the classification accuracy of $f_k$ and let us assume that the outputs of the individual $f_k$ ($k = 1, \ldots, M$) are conditionally independent (i.e., uncorrelated). Then, a Bayesian optimal discriminant function that leads to a minimum classification error for the combined output on the class label $c_j$ can be written as [31]

$$\log P(c_j) + \sum_{k=1}^{M} f_{kj}(x) \log \frac{p_k}{1 - p_k} \tag{2}$$

where $P(c_j)$ is the prior probability of the input $x$ being from the emotion class $c_j$. The second term in (2) suggests that the ensemble weights should be proportional to the classification accuracies of individual DCNNs, that is, $w_k \approx \log \frac{p_k}{1 - p_k}$. This also theoretically supports the argument that different weights should be properly used for individual DCNNs relying on different strengths.

However, the weights chosen using (2) are obtained by assuming independence among the outputs of the individual classifiers. In practical applications, this does not hold because individual classifiers are usually correlated to a large extent. Hence, for the WMV combination to be effective on the DCNN ensemble, weight optimization should be performed by considering an appropriate compromise between the DCNN classification accuracy and their correlation (i.e., diversity [32]). To achieve this, we design a novel stochastic optimization algorithm that finds the optimal weights by minimizing the ensemble generalization error composed of ensemble accuracy and diversity.

Another important issue is that the individual DCNNs must be as diverse (uncorrelated) as possible while being accurate. In light of this fact, Siqueira et al. [7] proposed the so-called Ensemble with Shared Representations (ESRs) based on deep convolutional networks for FER, aiming to achieve high diversity (low redundancy) in the ensemble. This motivates us to develop an effective ensemble construction of diverse DCNNs, which will be explained in the next section.

IV. PROPOSED DCNN ENSEMBLE CONSTRUCTION

To construct an ensemble of DCNNs that are diverse while being accurate, we propose the combined use of different face representations and bagging. For different face representations, facial texture images such as local binary pattern (LBP) [33] or Gabor [33] are used because they have been widely used for face representations in FER [12], [54].

From the entire training set $T$ containing $N$ samples, we create $P$ bootstrap training sets, denoted as $T_p$, $p = 1, \ldots, P$, by forming bootstrap replicates of the original training samples in [16]. Therefore, each $T_p$ will contain, on average, $0.6N$ different samples [16], some of them repeated once or more. The remaining $0.4N$ examples in $\bar{T}_p = T - T_p$ are used for validation purposes in the training phase for each DCNN, that is, tuning the learning rate and determining when to stop network training [41]. Note that different face representations [33] are obtained by transforming an original RGB (or grayscale) image into various facial texture images based on associated parameters, for example, the number of sampling points and radius value for LBP face representations, as shown in Fig. 1. Each of the facial images contained in a particular $T_p$ is transformed into $Q$ different face representations, resulting in a total of $PQ$ bootstrap training sets. These multiple transformed bootstrap training sets, each with a specific face representation, are applied as inputs to train individual DCNNs as ensemble members. This way, we generate a total of $M$ DCNN ensemble members, where $M = PQ$.

Fig. 1 shows the proposed construction of DCNN ensemble members. To our knowledge, we are the first to propose the combined use of various facial texture representations and bagging [16] to learn DCNNs as ensemble members for FER. As demonstrated in our experiment (see Table III), our proposed combination of facial texture representation with bagging is advantageous for increasing diversity (i.e., decreasing correlation) among DCNN members, which is beyond the case of using only RGB (or grayscale) as the input space for learning DCNNs. This makes the DCNN members so different that they contradict each other, thereby boosting ensemble classification performance.

V. DCNN ENSEMBLE WEIGHT OPTIMIZATION WITH SIMULATED ANNEALING

There are two main theoretical insights [32] behind the soundness of using ensemble classifier models: (a) ambiguity decomposition and (b) bias-variance decomposition, both of which
offer a theoretical justification for improving the generalization performances of an ensemble classifier over its base (component) classifiers. Ambiguity decomposition states that the quadratic error of the ensemble classifier is guaranteed to be less than or equal to the average quadratic error of the base classifiers. In contrast, bias-variance decomposition [9] states that the generalization error of an ensemble classifier can be broken down into two components: bias (the average difference between the prediction of the base classifier and the target output) and variance (the average variability of the base classifiers). Brown et al. [10], [11] showed the equivalence between ambiguity and bias-variance decompositions, and hence, there exists a common term that quantifies the accuracy-diversity trade-off for achieving a lower ensemble generalization error. This represents a well-grounded theoretical basis for pursuing the right balance between accuracy and diversity of the DCNN ensemble in the process of ensemble weight optimization. In the proposed ensemble weight optimization using SA, a measure of the balance between DCNN ensemble accuracy and diversity is given by the energy function, which will be explained in the next subsection.

Fig. 1. Visualization for the proposed DCNN ensemble construction. The subscript enclosed in brackets indicates the following format [33]: (No. of sampling points, radius) of a circular neighborhood for LBP and (scale, orientation) of the Gabor kernel.

A. Proposed Energy Function

To find ensemble weights that achieve the lowest generalized classification error, we propose the following energy function as the learning objective:

$$E_w(x) = \mathrm{Acc}_w(x) - \lambda \cdot \mathrm{Div}_w(x) = \left\| \sum_{k} w_k f_k(x) - t(x) \right\|_p - \lambda \sum_{k} w_k \left\| f_k(x) - \frac{1}{M} \sum_{n \neq k} f_n(x) \right\|_p \tag{3}$$

where $f_k(x)$ outputs the estimated class probabilities on sample $x$, $t(x)$ is the true distribution, where the entire probability mass is on the correct emotion class (i.e., if the true class of $x$ is “angry”, $t = [0, \ldots, 1, \ldots, 0]$ contains a single 1 at the position assigned to that class), and $\| \cdot \|_p$ denotes the $p$-norm of a vector; the $L_2$-norm was used (i.e., $p = 2$) in our experiments. In (3), the first term is the ensemble accuracy that computes the classification error of the DCNN ensemble, while the second one characterizes ensemble diversity, which is a measure of how individual DCNNs’ outputs (on a data sample) differ from each other so that it can quantify the variability (i.e., disagreement) among DCNN members. $\mathrm{Div}_w(x)$ is based on root quartic negative correlation (RTQRT-NCL) [34].

The larger the diversity term is, the larger the ensemble generalization error reduction will be [25], [35]. However, as the variability of the individual DCNNs rises, so does the value of the first accuracy term [24]. Therefore, we need to optimize the balance between the two terms to find optimal ensemble weights. To this end, a $\lambda$ parameter is introduced to control the trade-off between the two terms in (3); for $\lambda = 0$, we optimize the ensemble weight $w_k$ for each of the DCNN members by considering only their ensemble accuracy. In contrast, as $\lambda$ increases, higher weights would be imposed on DCNN members with a larger diversity. A good compromise is found by setting $\lambda = 0.3$ (for more details, see Section VI.A). Also note that a proper use of $\mathrm{Div}_w(x)$ is helpful in relieving overfitting to the validation set for finding ensemble weights. The experimental results demonstrating this advantage are presented in Fig. 6.

Note that the energy function $E_w(\cdot)$ is parameterized by the set of ensemble weights $w = \{w_k : k = 1, \ldots, M\}$ for all $M$ DCNNs. As such, minimizing $E_w(\cdot)$ with respect to $w$ is the objective of the proposed ensemble weight optimization. Considering this, we solve the following optimization problem:

$$w_{opt} = \arg\min_{w} E_w(x|V) = \arg\min_{w} \frac{1}{|V|} \sum_{(x,t) \in V} E_w(x) \tag{4}$$

where $V$ denotes a validation set (of the entire dataset) that should be kept unseen by all DCNNs during their ensemble construction explained in Section IV, and $| \cdot |$ is the cardinality of a set.

B. Proposed Simulated Annealing Optimization Algorithm

The optimization problem in (4) has a quadratic form in terms of the variables $w_1, w_2, \ldots, w_M$ because the $p$-norm is used in (3). Hence, finding $w_{opt}$ is a nonlinear multivariate optimization problem where multiple local minima may exist. In this context, gradient-based methods are likely to get stuck in unacceptable local minima [36]. To tackle this problem, we consider
a stochastic optimization algorithm. SA has proven successful in extremely complicated optimization tasks [37] because it can deal with local optima without getting stuck in them while searching for the global optimum.

In the following paragraphs, we describe the proposed SA algorithm for finding the best set of ensemble weights. Assuming that the initial weights are given by $w_0 = [w_{0,1}, \ldots, w_{0,k}, \ldots, w_{0,M}]$ (e.g., $w_{0,k} = 1/M$ for all $k$), we denote the $k$-th ensemble weight at the $b$-th SA iteration by $w_{b,k}$. To generate the candidate weight solution $w_b$ at the $b$-th iteration, one of the $M$ weight variables $w_{b-1,k}$ ($k = 1, \ldots, M$) at the previous iteration is randomly selected and then perturbed in the following way:

$$w_{b-1,k} \rightarrow w_{b-1,k} + \Delta w_k \quad \text{and} \quad \Delta w_k = r\, w_{\max} \tag{5}$$

where $w_{\max}$ is the maximum value of weights (we set $w_{\max} = 1$) and $r$ is a random number in the interval $[-1, 1]$. Note that $\Delta w_k$ represents the state change of a randomly selected weight $w_{b-1,k}$. Using (5), the candidate weight solution $w_b$ is obtained as

$$w_b = w_{b-1} + \Delta w \quad \text{and} \quad \Delta w = [0, \ldots, \Delta w_k, \ldots, 0] \tag{6}$$

where $\Delta w$ contains a single $\Delta w_k$ at the $k$-th position and zero otherwise. The proposed SA algorithm checks whether the energy function in (3) increases or decreases when ensemble DCNNs are combined using the candidate weight solution $w_b$. Our developed perturbation function allows the ensemble weight optimization problem to be well-suited for working with the following SA optimization: select a weight configuration that will result in the lowest energy (i.e., the best ensemble generalization performance).

We accept the candidate solution $w_b$ with a probability of 1 when the energy $E_{w_b}(x|V)$ decreases. On the contrary, when $E_{w_b}(x|V)$ increases, we use the following probability:

$$\exp\left[ \frac{-\left( E_{w_b}(x|V) - E_{w_{b-1}}(x|V) \right)}{T_b} \right] \tag{7}$$

where $T_b$ denotes the annealing temperature at the $b$-th annealing step. If $\exp\left[ -\left( E_{w_b}(x|V) - E_{w_{b-1}}(x|V) \right) / T_b \right] > \mathrm{Rand}[0, 1)$, we accept the candidate solution, $w_b$. Here, the $\mathrm{Rand}[0, 1)$ function returns a random real number in the interval $[0, 1)$. The aforementioned procedure is repeated many times, which is sufficient to poll all weight values $w_k$ (randomly chosen) several times.

The annealing temperature was initially set to a relatively large value, and then decreased according to the annealing schedule [38]:

$$T_b = 0.9\, T_{b-1} \tag{8}$$

where $b$ is the annealing step and $T_0 = 3E_{w_0}(x|V)$. SA runs until the annealing step $b$ reaches the maximum value $b_{\max}$ or when the stopping criterion is satisfied (e.g., setting the stopping criterion to the final temperature or energy that must be low enough).

Our proposed SA-based algorithm is summarized in Algorithm 1. To our knowledge, this algorithm has not been previously explored in the field of deep learning with ensemble classification and constitutes the main contribution of our work. Fig. 2 shows that early in the annealing process when the temperature is high, our proposed SA-based algorithm explores a wide range of ensemble weight configurations. Later, as the temperature is lowered, only energy states close to its minimum are tested, and finally, the energy converges to the minimum. It is observed that 140 iterations are generally sufficient for convergence (energy is less than a certain threshold $E_{final}$). In addition, as shown in Fig. 2(b), decreasing energy values are strongly correlated with a reduction in the DCNN ensemble generalization error. This demonstrates that the SA algorithm equipped with our novel energy and perturbation functions is quite useful for finding optimal DCNN ensemble weights (optimal in the sense that the generalization classification error is minimized).

Algorithm 1: Simulated annealing (SA)-based algorithm for finding optimal DCNN ensemble weights.
1: Input: ensemble weights $w_0 = [w_{0,1}, \ldots, w_{0,k}, \ldots, w_{0,M}]$ and validation set $V$.
2: Initialize ensemble weights $w_{0,k} = 1/M$ for all $k$.
3: Initialize annealing temperature $T_0 = 3E_{w_0}(x|V)$ using Eq. (3) and (4).
4: Initialize maximum number of annealing steps $b_{\max}$.
5: Initialize final temperature and energy $T_{final}$ and $E_{final}$ at which to stop.
6: Initialize maximum number of tries within one temperature $try_{\max}$.
7: while $b < b_{\max}$ and $T_b > T_{final}$ and $E_{w_b}(x|V) > E_{final}$ do
8:     $b = b + 1$.
9:     Repeat iterations:
10:        Randomly select a certain weight $w_{b-1,k}$ from all $M$ previous weights.
11:        Perturb selected $w_{b-1,k}$ using (5), leading to $w_{b-1,k} \rightarrow w_{b-1,k} + \Delta w_k$.
12:        Generate candidate weight solution $w_b$ using Eq. (6).
13:        Evaluate energy $E_{w_b}(x|V)$ using Eq. (3) and (4).
14:        if $(E_{w_b}(x|V) - E_{w_{b-1}}(x|V)) < 0$
15:            then $w_b$ is the new current weight solution.
16:        else if $e^{-(E_{w_b}(x|V) - E_{w_{b-1}}(x|V))/T_b} > \mathrm{Rand}[0, 1)$
17:            then $w_b$ is the new current weight solution.
18:     until $try_{\max}$ is reached (all weight values polled several times).
19:     Reduce temperature $T_b = 0.9\, T_{b-1}$.
20: end while
21: Output: optimal DCNN ensemble weights $w_{opt}$.

VI. EXPERIMENTS

An extensive experimental study was carried out to evaluate the effectiveness of the proposed DCNN ensemble construction and SA-based ensemble weight optimization for FER. The following subsections describe the used benchmark datasets that contain images with either real or acted facial expressions in the wild, as well as our evaluation methodology.
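As a compact recap of the weight optimization in Section V before the experimental details, the following Python sketch combines the energy of Eq. (3), the validation objective of Eq. (4), and the annealing loop of Algorithm 1. It is only an illustration under stated assumptions, not the authors' implementation: the function names (`sample_energy`, `validation_energy`, `anneal_weights`), the default `try_max=20`, and the toy data are hypothetical. The text does not say how perturbed weights are kept non-negative and normalized, so the clip-and-renormalize step is an assumption, as are returning the best weights seen so far and the guard on the initial temperature. The early-stopping thresholds of Algorithm 1 are omitted for brevity; the values 0.3, 0.9, 3, and 140 are taken from the text.

```python
import math
import random


def l2_distance(u, v):
    """Euclidean (p = 2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sample_energy(weights, member_probs, target, lam=0.3):
    """Eq. (3) for one sample: accuracy term minus lam times diversity term.

    member_probs[k][j] is member k's probability for class j;
    target is the one-hot true distribution t(x).
    """
    m = len(member_probs)
    n_classes = len(target)
    # Eq. (1): weighted combination of the member outputs.
    fused = [sum(w * p[j] for w, p in zip(weights, member_probs))
             for j in range(n_classes)]
    acc = l2_distance(fused, target)
    div = 0.0
    for k in range(m):
        # (1/M) times the sum of the other members' outputs.
        others = [sum(member_probs[n][j] for n in range(m) if n != k) / m
                  for j in range(n_classes)]
        div += weights[k] * l2_distance(member_probs[k], others)
    return acc - lam * div


def validation_energy(weights, validation, lam=0.3):
    """Objective of Eq. (4): mean energy over (member_probs, target) pairs."""
    total = sum(sample_energy(weights, p, t, lam) for p, t in validation)
    return total / len(validation)


def anneal_weights(validation, n_members, b_max=140, try_max=20,
                   w_max=1.0, lam=0.3, seed=0):
    """Simulated annealing over ensemble weights, loosely after Algorithm 1."""
    rng = random.Random(seed)
    current = [1.0 / n_members] * n_members          # uniform start
    e_current = validation_energy(current, validation, lam)
    best, e_best = list(current), e_current          # assumption: keep best
    # T0 = 3 * E(w0); abs() and the floor are guards added for this sketch.
    temperature = max(3.0 * abs(e_current), 1e-9)
    for _ in range(b_max):
        for _ in range(try_max):
            candidate = list(current)
            k = rng.randrange(n_members)
            candidate[k] += rng.uniform(-1.0, 1.0) * w_max   # Eq. (5), (6)
            # Assumption: clip to w_k >= 0 and renormalize to sum to 1.
            candidate = [max(c, 0.0) for c in candidate]
            total = sum(candidate)
            if total == 0.0:
                continue
            candidate = [c / total for c in candidate]
            e_candidate = validation_energy(candidate, validation, lam)
            delta = e_candidate - e_current
            # Accept improvements always, deteriorations with Eq. (7).
            if delta < 0 or math.exp(-delta / temperature) > rng.random():
                current, e_current = candidate, e_candidate
                if e_current < e_best:
                    best, e_best = list(current), e_current
        temperature *= 0.9                                   # Eq. (8)
    return best, e_best


# Toy validation set: three members, two classes (values are made up).
toy_validation = [
    ([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]], [1.0, 0.0]),
    ([[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]], [0.0, 1.0]),
]
weights, energy = anneal_weights(toy_validation, n_members=3)
print(weights, energy)
```

Because only one weight is perturbed per try, each energy evaluation could reuse most of the previous computation; the sketch recomputes everything for clarity.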


TABLE I
NUMBER OF IMAGES PER EXPRESSION IN THE DATABASES USED

Fig. 2. (a) Annealing schedule of the temperature $T_b$ as a function of iteration number $b$. (b) The energy $E_{w_b}(x|V)$ versus iteration number $b$, and the associated ensemble generalization (test) error obtained with each candidate weight solution $w_b$. The public test data of the FER2013 DB [29] was used as the validation set to compute both $E_{w_b}(x|V)$ and $w_b$, while the private test data was used for computing the ensemble generalization error.

Fig. 3. Facial examples from FER2013 [29]. The columns show the emotion categories.

A. Experimental Setup and Condition
The FER2013 [29], SFEW2.0 [53], and RAF-DB [67]
datasets collected in the wild were used in our experiments. The
FER2013 dataset was created using the Google image search
engine to search for images of faces that match a set of 184
emotion-related keywords, such as blissful, enraged, etc. These
keywords were combined with words related to gender, age, or
ethnicity to obtain nearly 600 strings that were used as facial
image search queries. The first 1,000 images returned for each
query were kept for the next stage of processing. The images
were resized to 48 × 48 pixels and converted to a grayscale
format. The resulting dataset contains 35,887 grayscale images
acquired in a wild setting. The SFEW2.0 DB [53] was created
by selecting static frames from the acted facial expressions in
the wild (AFEW). The SFEW2.0 DB covers unconstrained facial
expressions, different head poses, large age ranges, varied focus,
occlusions, different resolutions of faces, and close to real-world
illuminations. The SFEW2.0 DB is divided into two sets, which
are created in a strict person-independent manner. The RAF-DB is a widely used wild FER dataset acquired in an unconstrained setting, offering a broad diversity across pose, gender, age, demography, image quality, and illumination. RAF-DB contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In our experiments, only images with six discrete basic emotions were used.
Fig. 4. Examples of facial images (scaled to a size of 224 × 224 × 3 pixels) and their original images from (a) SFEW2.0 [53] and (b) RAF-DB [67].

Table I shows the number of images for the seven basic expressions in FER2013, SFEW2.0, and RAF-DB. In addition, Figs. 3 and 4 show some image examples for the respective FER DBs. As shown in Figs. 3 and 4, the face images of our collected datasets exhibit great variability in head pose under in-the-wild capture conditions, which is particularly difficult to

106 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 25, 2023

tackle for correct FER [54], [70]. We measured the head pose angles of the face images used in our experiments. For this, we employed the popular “Intraface” toolbox [66] to estimate the three angles (yaw, pitch, and roll, in degrees) of the original images (see Fig. 4) of each FER DB. Based on the statistics of our collected data, the range of yaw is [−43.43, 47.01], [−55.79, 52.48], and [−61.32, 58.72] in degrees for FER2013 (test), SFEW2.0 (validation), and RAF-DB (test), respectively. For the same order of FER datasets, the observed pitch ranges are [−32.32, 45.26], [−15.54, 37.89], and [−43.85, 54.84], and the observed roll ranges are [−20.34, 19.86], [−29.8, 30.22], and [−31.86, 28.41]. Moreover, among their original test images, about one-fifth (20%) of FER2013 (test), about half (50%) of SFEW2.0 (validation), and about one-fifth (20%) of RAF-DB (test) have poses larger than 30 degrees (in yaw, pitch, or roll). These observations demonstrate that our method is tested under an in-the-wild FER scenario that covers a wide range of pose variations.

As for DCNN ensemble members, we chose VGG-Face [39] as the pre-trained deep network because it performs well and involves a moderate number of parameters. Note that other recent pre-trained deep CNNs or newly designed deep CNNs [12], [54] can be readily applied to our SA algorithm. The VGG-Face is a deep CNN model successfully trained on 2.6 million facial images collected from the web to recognize 2,622 identities. This network involves 16 convolutional layers, five max-pooling layers, three fully connected (FC) layers, and a final layer with softmax activation. To construct an ensemble of DCNNs, we used grayscale, LBP, and Gabor face representations as inputs to the individual DCNNs. Four different LBP representations were obtained by adjusting the parameter values [33] (no. of sampling points, radius): (8,1), (8,2), (8,3), and (16,2). In addition, referring to [33], 2D Gabor kernels with three different scales (1, 3, and 4 scales) and three orientations (0, 4, and 5 orientations) were used to create nine Gabor representations. Note that we used four bootstrap training datasets [16] and 14 different face representations (one grayscale, four LBP, and nine Gabor), resulting in a total of 4 × 14 = 56 DCNN ensemble members.

We used the training dataset of each FER DB for fine-tuning the VGG-Face model as in [40], [41] and scaled the image data to 224 × 224 × 3 to fit the VGG-Face input requirement. The hyper-parameters of each network are the same as those used in [41]: momentum 0.9; weight decay $5 \times 10^{-4}$; and initial learning rate $10^{-2}$, which is decreased by a factor of 10 when the validation error stops decreasing (specifically, when the error increases for more than three consecutive times). Overall, each DCNN was trained using three decreasing learning rates. In the proposed SA algorithm, we set $b_{\max} = 1{,}000$, $T_{\text{final}} = 10^{-8}$, $E_{\text{final}} = -\infty$, and $\text{try}_{\max} = 300$. The λ parameter in (3) was experimentally chosen by means of an exhaustive tuning process in which λ is varied over the range [0, 1] with a step size of 0.05. The value of λ is determined by selecting the one having the best FER accuracy on the validation set of each FER dataset. Our results show that setting λ in the range of [0.25, 0.35] is adequate for all of the datasets used in our experiments. In addition, there is little performance difference within the range of [0.25, 0.35]. For this reason, we set λ = 0.3 for all the reported results, realizing a good compromise between the accuracy and diversity of the DCNN ensemble.

Three data augmentation strategies were used during the fine-tuning phase: horizontal flips, random rotations (in a degree range of [−60, 60]), and random shifting. To implement data augmentation, we used the Keras ImageDataGenerator API.1 Fig. 5 shows the original facial images and the corresponding randomly augmented sample images.

1 [Online]. Available: https://keras.io/preprocessing/image

Fig. 5. Examples of randomly augmented facial images with three data augmentation techniques.

B. Effectiveness of Proposed SA Algorithm for Finding Optimal DCNN Ensemble Weights

To demonstrate the usefulness of our SA-based algorithm in terms of finding optimal (DCNN) ensemble weights, a comparative experimental study was carried out. For comparison, we employed other popular ensemble weight computation approaches. Specifically, the following four approaches were compared: (a) conventional majority voting [28], (b) performance weighting [26], [31], (c) random search [60], and (d) attention network [65]. In the case of majority voting, all DCNN members have the same combination weights [31] and are treated equally (i.e., this approach is the same as that proposed in conventional majority voting [28]). When using performance weighting [26], [31], we weight each DCNN according to its individual performance on the validation set. The best DCNN can be assigned a weight of one, whereas the weight of the worst DCNN is zero [26], [31]. The DCNN classifiers whose performances are given weights are determined as in [26], [31]. For implementing random search [60], uniformly distributed DCNN ensemble weights are generated, and then a sampling procedure is performed to find the ensemble weights that yield the highest validation set performance. Furthermore, an attention network [65] was used to generate DCNN weights; we first pooled all deep face features, each extracted from the FC layer of an individual DCNN, and then applied them to attention blocks to adaptively compute the weights [65].

From Table II, we observe the following: 1) Ensemble results have been compared with the baseline performance of the best single DCNN classifier. Every DCNN ensemble combination approach has shown a better performance than the best single DCNN. 2) A DCNN ensemble with different combination weights performs better than the DCNN ensemble combined via naive majority voting [28] with uniform weights, which indicates that it is beneficial to have an ensemble of DCNNs with different degrees of contribution to improve performance. 3) The performance improvement is quite convincing when ensemble weights


determined by our SA algorithm are used. In particular, our SA algorithm is superior to the approach of choosing ensemble weights based on the attention network proposed in [65]. The underlying reason for this is that we find and adopt ensemble weights by directly minimizing the estimate of the DCNN ensemble generalization error via our SA algorithm.

TABLE II
COMPARISON OF FER ACCURACIES WITH RESPECT TO THE DIFFERENT DCNN ENSEMBLE WEIGHT COMPUTATION APPROACHES. THE PRIVATE TEST SET OF THE FER2013 DB WAS USED FOR TESTING. (BOLD: BEST, UNDERLINE: SECOND BEST)
∗SA: Simulated annealing

C. Demonstrating Our DCNN Ensemble Construction for Increasing Diversity

To assess the effectiveness of our ensemble construction method in improving ensemble diversity, we used the following three pairwise diversity measures [13]: Q statistic, correlation coefficient, and double-fault measure. The possible ranges of the Q statistic, correlation coefficient, and double-fault measure are [−1, 1], [−1, 1], and [0, 1], respectively, for a pair of individual DCNN members. Note that the lower the values of the aforementioned diversity measures, the more diverse the ensemble members are.

Table III shows the three diversity measures among the DCNN members with respect to three different ensemble construction approaches. Note that each diversity measure was averaged over all distinct pairs of DCNN members. As shown in Table III, for all types of diversity measures, the DCNN ensemble trained by our approach (using different face representations with bagging) was found to be better than its counterparts at producing more diverse ensemble members. In particular, our approach achieved a negative value of the correlation coefficient; this indicates that DCNN members tend to be negatively correlated. In general, combining classifiers is most effective when the errors in the individual classifiers are negatively correlated [44]. Overall, the results validate the advantage of using different face representations with bagging to produce a more diverse (uncorrelated) DCNN ensemble.

TABLE III
COMPARISON OF THE DIVERSITY MEASURES [13] AMONG THE DCNN MEMBERS FOR THREE DIFFERENT ENSEMBLE CONSTRUCTION APPROACHES

D. Comparison With State-of-the-art FER Methods

In this section, we compare the proposed method for in-the-wild facial expression recognition against other state-of-the-art methods. Note that a comparative study done by implementing other methods by ourselves may not guarantee a fair and stable comparison, as we are likely to miss some optimal parameters and incorrectly perform their tuning process. To circumvent this problem, we make direct comparisons with other state-of-the-art results recently reported by other researchers on the FER2013 [29], SFEW2.0 [53], and RAF-DB [67]. Therefore, all the results of the comparison are directly cited from recently published papers.

TABLE IV
FER ACCURACY COMPARISONS WITH OTHER STATE-OF-THE-ART APPROACHES ON FER2013 PRIVATE TEST DATASET. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED. (BOLD: BEST, UNDERLINE: SECOND BEST)

The FER2013 consists of 28,709 training, 3,589 public test, and 3,589 private test images under seven different types of emotions. Note that the training set was used to create DCNN ensemble members, while the 3,589 public test images were used as the validation set for optimizing their ensemble weights, and the 3,589 private test images were used as the testing set for performance evaluation. Table IV lists the comparative results on the FER2013 DB. Compared to other state-of-the-art approaches, our proposed method achieves the highest recognition performance of 76.69%, outperforming the most recent best performance [45], i.e., 75.42%, on the FER2013 DB. In addition, comparative experiments were performed on SFEW2.0, released for a sub-competition in the 3rd Emotion Recognition in the Wild 2015 (EmotiW2015) challenge [53]. The SFEW2.0 version targets the efforts required toward an affect analysis in wild


FER. The SFEW2.0 consists of three sets: train (958 samples), val (436 samples), and test (372 samples). Each facial image is assigned to one of the seven expression categories (angry, disgust, fear, sadness, happiness, surprise, and neutral). The expression labels of the training and validation sets are publicly available, whereas those of the testing set are held back by the challenge organizer. To cope with the small number of training samples (only 958 images) when creating DCNN members in our method, data augmentation techniques were applied to the SFEW2.0 training set. We generated 11 augmentations per facial image, as described in Fig. 5, yielding 11,496 images (958 × 12, counting the originals) as the augmented training dataset. Note that because the labels of the test dataset were not provided, the validation set was used as the test set for the experiment in this work, while 4,598 facial images (around 40%) from our augmented training dataset were randomly selected to become a part of our validation dataset. Table V shows the comparison between the FER performance of our proposal and those of other approaches. Because we do not have access to the test data, we report the results on the validation data. The proposed method outperforms the baseline [53] of SFEW2.0 by a large margin of 22.67%. Our method attains a 58.68% FER accuracy, achieving competitive performance with the state-of-the-art results of 58.20% [63] and 58.29% [64].

TABLE V
FER ACCURACY COMPARISONS ON SFEW2.0. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED. VALIDATION ACCURACY WAS REPORTED

To further validate whether the proposed approach generalizes well to another wild FER dataset acquired in an unconstrained setting, images with six discrete basic emotions were collected from the RAF-DB [67], including 12,271 images as training data and 3,068 images as test data. For constructing a validation dataset, 4,800 images from the full training set (i.e., 12,271 images) were randomly chosen. As shown in Table VI, the FER performance of our method on RAF-DB, i.e., 87.13%, is quite comparable with the most recent best performance, i.e., 87.78%, reported in [71].

TABLE VI
FER ACCURACY COMPARISONS ON RAF-DB. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED

Results in Tables IV, V, and VI show that the FER accuracy of our method is quite comparable to or better than the most recent best performance reported in [45], [64], [71]. This demonstrates that our proposed DCNN ensemble approach has a better discrimination ability for expression characteristics, as well as a better generalization ability in terms of attaining competitive FER performances compared with the state-of-the-art methods on various wild FER datasets.

VII. DISCUSSION AND CONCLUSION

The confusion matrix of our proposed DCNN ensemble is reported in Table VII. We can see that the accuracies for “disgust” and “fear” are much lower than those for other expressions, which is consistent with results previously reported for the SFEW2.0 dataset [4], [61]–[64]. This may be attributed to the fact that these emotion classes are inherently more challenging to classify, or simply to the relatively small number of samples available for each of these emotion classes (see Table I). However, note that the proposed method considerably improves the accuracy on “disgust” or “fear” when compared to the corresponding results reported by other works [4], [61]–[64].

TABLE VII
CONFUSION MATRIX (UNIT OF %) OF THE PROPOSED DCNN ENSEMBLE METHOD EVALUATED ON THE SFEW 2.0 VALIDATION SET
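As a minimal sketch of how per-class accuracies such as those discussed for Table VII can be read from a confusion matrix, the Python below converts raw counts to percentages and extracts the diagonal. The function names, the toy labels, and the row-wise normalization (each true class sums to 100%) are assumptions for illustration; the text states only that the matrix is reported in units of %.

```python
def confusion_matrix_percent(y_true, y_pred, labels):
    """Row-normalized confusion matrix in percent (rows: true class)."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    percent = []
    for row in counts:
        total = sum(row)
        # A class with no samples yields an all-zero row (assumed convention).
        percent.append([100.0 * c / total if total else 0.0 for c in row])
    return percent


def per_class_accuracy(percent_matrix, labels):
    """Per-class accuracy is the diagonal of the row-normalized matrix."""
    return {label: percent_matrix[i][i] for i, label in enumerate(labels)}


# Toy usage with two of the seven expression categories.
labels = ["fear", "happiness"]
y_true = ["fear", "fear", "happiness", "happiness"]
y_pred = ["fear", "happiness", "happiness", "happiness"]
matrix = confusion_matrix_percent(y_true, y_pred, labels)
accuracy = per_class_accuracy(matrix, labels)
```

In this toy case one of the two "fear" samples is misclassified, so its row carries the lower diagonal value, which is the pattern the discussion above describes for the harder classes.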


Fig. 6. Impact of parameter λ on generalization (testing) performance in our SA algorithm. Validation and testing accuracies were computed using the FER2013 public and private data sets, respectively. Note that the gap between validation and generalization performance represents the degree of overfitting.

The runtime of our DCNN ensemble classifier is approximately 85 ms per image on a hardware configuration comprising Intel Xeon E5-2620 v4 CPUs and 256-GB RAM, when accounting for facial image transformation (into LBP and/or Gabor face representations), the feedforward network, and the final combined classification output. This level of runtime performance should be reasonable for many practical FER applications, as almost 12 facial images (1000/85 ≈ 11.8) can be processed per second. In summary, the proposed method can achieve recognition accuracy that is comparable to or better than other state-of-the-art results on three wild FER DBs, with practically efficient runtime performance. This indicates that the proposed DCNN ensemble approach can be used as a competitive solution for expression recognition in the wild.

Note that the parameter λ shown in (3) was devised to incorporate the diversity term into the energy function to be minimized. We demonstrate that properly adjusting λ can be beneficial for improving the generalization (testing) performance. The results in Fig. 6 justify the advantage of using λ (i.e., the ensemble diversity term). As shown in Fig. 6, the validation accuracy (obtained using the validation set used to optimize DCNN ensemble weights) reaches a maximum for λ = 0, while the generalization accuracy at λ = 0 is lower than the generalization accuracy averaged over λ. Note that λ = 0 means that only the ensemble accuracy term $\mathrm{Acc}_w(\cdot)$ in (3) is considered as the energy function in our SA algorithm. For generalization accuracy, the maximum is achieved for λ in the range of [0.25, 0.35]. The rationale behind this observation is that for λ = 0, ensemble weights are adjusted in such a way that the DCNN ensemble classifier is strongly overfitted to the validation set, so that the diversity would be quite small. In contrast, as λ approaches 1, diversity would be facilitated at the cost of ensemble accuracy. This observation is consistent with the commonly held idea that the benefit of an ensemble classifier is maximized when an optimal balance between classification accuracy and diversity is found [13], [32]. Consequently, the use of λ for the proper adjustment of ensemble accuracy and diversity is beneficial for achieving a low generalization error.

Prior to concluding this paper, we present feature activation maps [75] of a test face image to examine the internal representations learned by our DCNN ensemble. For this purpose, feature maps from the VGG-Face networks, each trained using one of the nine Gabor face representations with three scales (1, 3, and 4 scales) and three orientations (0, 4, and 5 orientations), were obtained from the conv2_1 layer of the VGG-Face network [39]; note that the strongest activations for this selected layer are displayed. As can be observed in Fig. 7, our DCNN ensemble captures complementary (diverse) activation patterns from various input (Gabor) face representations, and therefore they are likely to be mutually compensational in terms of extracting different deep face features of an input face image. This allows DCNN ensemble members to be diverse (to a large extent), leading to highly complementary deep face features for correct FER. The observations in Fig. 7 are in line with our argument that applying different face representations for the construction of the DCNN ensemble is useful for inducing diversity among DCNN members.

Fig. 7. Visualization of feature activation maps [75] of a test face image from the VGG-Face networks [39] as ensemble members, each trained using one of the nine different Gabor face representations with three scales and three orientations. The strongest activations for the selected conv2_1 layer are displayed. The associated (scale, orientation) of the Gabor kernel is given below each corresponding feature map.

For improving FER in the wild, we proposed the DCNN ensemble classification approach. In particular, we developed a novel simulated annealing (SA) based algorithm that searches for optimal ensemble weights by minimizing the energy function composed of ensemble accuracy and diversity terms. We have shown that our SA-based algorithm can be used to find optimal DCNN ensemble weights, which can lead to a considerably better generalized recognition performance than other popular ensemble weight computation methods. In addition, our proposed way of combining different face representations and bagging for the creation of the DCNN ensemble is shown to be effective in terms of increasing diversity among the outputs of individual DCNNs. The results show that our DCNN ensemble approach can achieve competitive performance on challenging wild FER datasets in both recognition accuracy and runtime.
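The pairwise diversity measures of [13] referred to in Section VI-C can be sketched as follows. This is a minimal illustration that assumes the standard definitions of the Q statistic, the correlation coefficient, and the double-fault measure, computed from per-sample correctness ("oracle") outputs of two classifiers. The function name, the toy inputs, and the convention of returning 0 when a denominator vanishes are assumptions and are not taken from the paper.

```python
import math


def pairwise_diversity(correct_i, correct_j):
    """Return (Q statistic, correlation coefficient, double-fault measure).

    correct_i, correct_j: equal-length sequences of booleans, True where
    classifier i (or j) labels the sample correctly.
    """
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_i, correct_j):
        if a and b:
            n11 += 1          # both correct
        elif a:
            n10 += 1          # only classifier i correct
        elif b:
            n01 += 1          # only classifier j correct
        else:
            n00 += 1          # both wrong
    n = n11 + n10 + n01 + n00
    ad, bc = n11 * n00, n01 * n10
    q_den = ad + bc
    rho_den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Returning 0.0 for a zero denominator is an assumed convention.
    q = (ad - bc) / q_den if q_den else 0.0
    rho = (ad - bc) / rho_den if rho_den else 0.0
    double_fault = n00 / n if n else 0.0
    return q, rho, double_fault


# Toy usage: two classifiers whose errors are statistically unrelated.
q, rho, df = pairwise_diversity([True, True, False, False],
                                [True, False, True, False])
```

Averaging each returned value over all distinct member pairs gives ensemble-level figures of the kind reported in Table III; lower values indicate a more diverse pair.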


REFERENCES

[1] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-column deep neural network for traffic sign classification,” Neural Netw., vol. 32, pp. 333–338, 2012.
[2] X. S. Wei, B. B. Gao, and J. Wu, “Deep spatial pyramid ensemble for cultural event recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2015, pp. 38–44.
[3] S. E. Kahou et al., “EmoNets: Multimodal deep learning approaches for emotion recognition in video,” J. Multimodal User Interfaces, vol. 10, no. 2, pp. 99–111, 2016.
[4] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in Proc. ACM Int. Conf. Multimodal Interaction, 2015, pp. 435–442.
[5] B. K. Kim, J. Roh, S. Y. Dong, and S. Y. Lee, “Hierarchical committee of deep convolutional neural networks for robust facial expression recognition,” J. Multimodal User Interfaces, vol. 10, no. 2, pp. 173–189, 2016.
[6] R. C. Malli, M. Aygun, and H. K. Ekenel, “Apparent age estimation using ensemble of deep learning models,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2016, pp. 714–721.
[7] H. Siqueira, S. Magg, and S. Wermter, “Efficient facial feature learning with wide ensemble-based convolutional neural networks,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5800–5809.
[8] Y. D. Kim, T. Jang, B. Han, and S. Choi, “Learning to select pre-trained deep representations with Bayesian evidence framework published in computer vision and pattern recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5318–5326.
[9] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Computation, vol. 4, no. 1, pp. 1–58, 1992.
[10] G. Brown, J. L. Wyatt, and P. Tino, “Managing diversity in regression ensembles,” J. Mach. Learn. Res., vol. 6, pp. 1621–1650, 2005.
[11] G. Brown, Diversity in Neural Network Ensembles, Ph.D. dissertation, School Comput. Sci., Birmingham Univ., Birmingham, U.K., 2004.
[12] H. Li, J. Sun, Z. Su, and L. Chen, “Multimodal 2D+3D facial expression recognition with deep fusion convolutional neural network,” IEEE Trans. Multimedia, vol. 19, no. 12, pp. 2816–2831, Dec. 2017.
[13] L. I. Kuncheva and C. J. Whitaker, “Measure of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Mach. Learn., vol. 51, no. 2, pp. 181–207, 2003.
[14] T. G. Dietterich, “Ensemble methods in machine learning,” in Proc. LNCS Int. Workshop Mult. Classifier Syst., 2000, pp. 1–15.
[15] A. Mollahosseini et al., “Facial expression recognition from world wild web,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2016, pp. 58–65.
[16] T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization,” Mach. Learn., vol. 40, no. 2, pp. 139–157, 2000.
[17] R. Ranawana and V. Palade, “Multi-classifier systems—A review and roadmap for developers,” Inf. Sci., vol. 179, no. 2, pp. 1298–1318, 2009.
[18] J. Joo, W. Li, F. F. Steen, and S. C. Zhu, “Visual persuasion: Inferring communicative intents of images,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2014, pp. 216–223.
[19] J. Shao and Y. Qian, “Three convolutional neural network models for facial expression recognition in the wild,” Neurocomputing, vol. 355, pp. 82–92, 2019.
[20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[21] T. Gehrig and H. K. Ekenel, “Why is facial expression analysis in the wild challenging?,” in Proc. ACM Emotion Recognit. Wild Challenge Workshop, 2013, pp. 9–16.
[22] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: Database and results,” Image Vis. Comput., vol. 47, pp. 3–18, 2016.
[23] Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble classification and regression - Recent developments, applications and future directions,” IEEE Comput. Intell. Mag., vol. 11, no. 1, pp. 41–53, Feb. 2016.
[24] D. W. Opitz and J. W. Shavlik, “Generating accurate and diverse members of a neural-network ensemble,” in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 531–541.
[25] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and active learning,” in Proc. Adv. Neural Inf. Process. Syst., 1995, pp. 231–238.
[26] L. Rokach, “Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography,” Comput. Statist. Data Anal., vol. 53, no. 12, pp. 4046–4072, 2009.
[27] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, 2015.
[28] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[29] I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” in Proc. Int. Conf. Neural Inf. Process., 2013, pp. 117–124.
[30] M. P. Perrone and L. N. Cooper, “When networks disagree: Ensemble methods for hybrid neural networks,” in Neural Networks for Speech and Image Processing, R. J. Mammone, Ed. New York, NY, USA: Chapman & Hall, 1993, pp. 126–142.
[31] Z. H. Zhou, Ensemble Methods: Foundation and Algorithms, Boca Raton, FL, USA: CRC, 2012.
[32] G. Brown, J. Wyatt, R. Harris, and X. Yao, “Diversity creation methods: A survey and categorization,” Inf. Fusion, vol. 6, no. 1, pp. 5–20, 2005.
[33] J. Y. Choi, Y. M. Ro, and K. N. Plataniotis, “Color local texture features for color face recognition,” IEEE Trans. Image Process., vol. 21, no. 3, pp. 1366–1380, Mar. 2012.
[34] R. McKay and H. Abbass, “Anticorrelation measures in genetic programming,” in Proc. Australasia-Jpn. Workshop Intell. Evol. Syst., 2001, pp. 45–51.
[35] N. Ueda and R. Nakano, “Generalization error of ensemble estimators,” in Proc. IEEE Int. Conf. Neural Netw., 1996, pp. 90–95.
[36] R. S. Sexton, R. E. Dorsey, and J. D. Johnson, “Beyond backpropagation: Using simulated annealing for training neural networks,” J. Organizational End User Comput., vol. 11, no. 3, pp. 3–10, 1999.
[37] R. A. Rutenbar, “Simulated annealing algorithms: An overview,” IEEE Circuits Devices Mag., vol. 5, no. 1, pp. 19–26, Jan. 1989.
[38] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley, 1973.
[39] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” Brit. Mach. Vis., vol. 1, no. 3, 2015, Art. no. 6.
[40] A. Vedaldi and K. Lenc, “MatConvNet—Convolutional neural networks for MATLAB,” in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, pp. 689–692.
[41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[42] D. Li and G. Wen, “MRMR-based ensemble pruning for facial expression recognition,” Multimed Tools Appl., vol. 77, pp. 15251–15272, 2018.
[43] A. Pentina and R. Urner, “Lifelong learning with weighted majority votes,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3612–3620.
[44] H. Chen and X. Yao, “Multiobjective neural network ensembles based on regularized negative correlation learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 12, pp. 1738–1751, Dec. 2010.
[45] M.-I. Georgescu, R. T. Ionescu, and M. Popescu, “Local learning with deep and handcrafted features for facial expression recognition,” IEEE Access, vol. 7, pp. 64827–64836, May 2019.
[46] Y. Tang, “Deep learning using linear support vector machines,” 2013, arXiv:1306.0239.
[47] T. Connie, M. Al-Shabi, W. P. Cheah, and M. Goh, “Facial expression recognition using a hybrid CNN–SIFT aggregator,” in Proc. Int. Workshop Multi-Disciplinary Trends Artif. Intell., 2017, pp. 139–149.
[48] Y. Gan, J. Chen, and L. Xu, “Facial expression recognition boosted by soft label with a diverse ensemble,” Pattern Recognit. Lett., vol. 125, pp. 105–112, 2019.
[49] G. Ali, M. A. Iqbal, and T. S. Choi, “Boosted NNE collections for multicultural facial expression recognition,” Pattern Recognit., vol. 55, pp. 14–27, 2016.
[50] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, “Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order,” Pattern Recognit., vol. 61, pp. 610–628, 2017.
[51] S. Zafeiriou, A. Papaioannou, I. Kotsia, M. Nicolaou, and G. Zhao, “Facial affect ‘In-The-Wild’: A survey and new database,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2016, pp. 36–47.
[52] M. Ghayoumi, “A quick review of deep learning in facial expression,” J. Commun. Comput., vol. 14, pp. 34–38, 2017.
[53] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2011, pp. 2106–2112.
[54] S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Trans. Affect. Comput., to be published, doi: 10.1109/TAFFC.2020.2981446.


[55] M. Liu, S. Li, S. Shan, and X. Chen, “Au-inspired deep networks for facial expression feature learning,” Neurocomputing, vol. 159, pp. 126–136, 2015.
[56] X. Liu, B. Kumar, J. You, and P. Jia, “Adaptive deep metric learning for identity-aware facial expression recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 20–29.
[57] H. Ding, S. K. Zhou, and R. Chellappa, “FaceNet2expNet: Regularizing a deep face recognition net for expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2017, pp. 118–126.
[58] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convolutional neural network for facial expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2017, pp. 558–565.
[59] C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: State of the art,” 2016, arXiv:1612.02903.
[60] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[61] T. S. Ly, N.T. Do, S.H. Kim, H. J. Yang, and G. S. Lee, “A novel 2D and 3D multimodal approach for in-the-wild facial expression recognition,” Image Vis. Comput., vol. 92, no. 103817, pp. 1–12, 2019.
[73] Z. Wang, F. Zeng, S. Liu, and B. Zeng, “OAENet: Oriented attention ensemble for accurate facial expression recognition,” Pattern Recognit., vol. 112, 2021, Art. no. 107694.
[74] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.

Jae Young Choi (Member, IEEE) received the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2008 and 2011, respectively. In 2008, he was a Visiting Scholar with the University of Toronto, Toronto, ON, Canada, and from 2011 to 2012, he was a Postdoctoral Researcher with the University of Toronto. From 2012 to 2013, he was a Postdoc Fellow with the University of Pennsylvania, Philadelphia, PA, USA.
[62] B. K. Kim, H. Lee, J. Roh, and S. Y. Lee, “Hierarchical committee of
From 2013 to 2014, he was a Senior Engineer with
deep CNNs with exponentially-weighted decision fusion for static facial
Samsung Electronics. He is currently an Associated
expression recognition,” in Proc. ACM Int. Conf. Multimodal Interaction,
Professor with the Division of Computer Engineering, Hankuk University of
2015, pp. 427–434.
Foreign Studies, Seoul, South Korea. He is the author or coauthor of more than
[63] D. Acharya, Z. Huang, D. Paudel, and L. V. Gool, “Covariance pooling
for facial expression recognition,” in Proc. IEEE Int. Conf. Comput. Vis. 100 refereed research publications in his research field, which include deep
learning, ensemble machine learning, pattern recognition, and computer vision.
Pattern Recognit. Workshops, 2018, pp. 480–487.
Especially, he has developed several pioneering algorithms for automatic face
[64] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with in-
recognition using facial color information. Prof. Choi was the recipient of the
consistently annotated datasets,” in Proc. Eur. Conf. Comput. Vis., 2018,
pp. 222–237. Best Paper Award of Korea Multimedia Society in 2021. He was also the recip-
ient of the Samsung HumanTech Thesis Prize in 2010.
[65] J. Yang et al., “Neural aggregation network for video face recogni-
tion,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2017,
pp. 4362–4371.
[66] F. D. L. Torre et al., “IntraFace,” in Proc. IEEE Int. Conf. Autom. Face
Gesture Recognit., 2015, pp. 1–8.
[67] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-
preserving learning for expression recognition in the wild,”in Proc. IEEE
Int. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2584–2593.
[68] Y. Li, J. Zeng, S. Shan, and X. Chen, “Occlusion aware facial expression Bumshik Lee (Member, IEEE) received the B.S. de-
recognition using CNN with attention mechanism,” IEEE Trans. Image gree in electrical engineering from Korea Univer-
Process., vol. 28, no. 5, pp. 2439–2450, May 2019. sity, Seoul, South Korea, and the M.S. and Ph.D.
[69] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties degrees in information and communications engi-
for large-scale facial expression recognition,” in Proc. IEEE Int. Conf. neering from the Korea Advanced Institute of Science
Comput. Vis. Pattern Recognit., 2020, pp. 6896–6905. and Technology (KAIST), Daejeon, South Korea. He
[70] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region attention was a Research Professor with KAIST, in 2014 and a
networks for pose and occlusion robust facial expression recognition,” Postdoctoral Scholar with the University of Califor-
IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020. nia San Diego, San Diego, CA, USA, from 2012 to
[71] A. H. Farzaneh and X. Qi, “Facial expression recognition in the wild via 2013. From 2015 to 2016, he was a Principal Engineer
deep attentive center loss,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. with Advanced Standard R&D Lab., LG Electronics,
Vis., Jan. 2021, pp. 2402–2411. Seoul, South Korea. He is currently an Associated Professor with the Department
[72] R. Momin, A. S. Momin, and K. Rasheed, “Recognizing facial expres- of Information and Communication Engineering, Chosun University, Gwangju,
sions in the wild using multi-architectural representations based ensemble South Korea. His research interests include pattern recognition, video compres-
learning with distillation,” 2021, arXiv:2106.16126. sion and processing, video security, and medical image processing.

