Combining Deep Convolutional Neural Networks With Stochastic Ensemble Weight Optimization For Facial Expression Recognition in The Wild
25, 2023
1520-9210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: HANKUK UNIVERSITY OF FOREIGN STUDIES. Downloaded on January 17,2023 at 18:31:11 UTC from IEEE Xplore. Restrictions apply.
CHOI AND LEE: COMBINING DEEP CONVOLUTIONAL NEURAL NETWORKS WITH STOCHASTIC ENSEMBLE WEIGHT OPTIMIZATION 101
decisions can be the primary challenge and an important research direction [27], [54].

In this paper, we propose a new DCNN ensemble classifier for in-the-wild FER. Our method has two key aspects. First, we formulate the process of finding ensemble combination weights as an optimization problem, which can be effectively solved using our novel simulated annealing (SA)-based algorithm. We show that ensemble weights optimized via our SA-based algorithm can significantly improve FER performance. Second, we introduce an effective DCNN ensemble construction that takes advantage of the combined use of different face representations and bagging [16]. This allows maximizing the diversity among DCNN ensemble members in the sense that they do not make coincident errors, as demonstrated by our experimental results in Section VI.C. We evaluate the proposed method on three publicly available wild FER databases (DBs) collected under real-world scenarios: FER2013 [29], static facial expressions in the wild (SFEW2.0) [53], and real-world affective face database (RAF-DB) [67]. The results of our method are better than, or at least competitive with, the best-reported results.

The rest of this paper is organized as follows: Section II reviews previous works on deep learning-based FER. Section III describes the motivation behind the proposed DCNN ensemble weight optimization and construction for completeness. Section IV details the proposed DCNN ensemble construction. In Section V, the proposed SA-based DCNN ensemble weight optimization algorithm is presented. Section VI presents extensive and comparative experimental results to demonstrate the effectiveness of the proposed method for FER in the wild. Finally, the discussion and conclusions are presented in Section VII.

II. RELATED WORK

A significant part of FER’s recent progress has been achieved because of the emergence of deep learning models and, more specifically, DCNNs [12]. In the following paragraphs, we review some recent methods based on deep neural networks that are most relevant to our work and refer the reader to a recent and comprehensive survey [54] on FER using deep learning for further information.

The authors of [55] proposed an action unit-inspired deep network (AUDN) to exploit a psychological theory where expressions can be decomposed into several facial expression action units. Tang [46] proposed a deep neural network architecture using a linear support vector machine (SVM) as the top layer instead of a softmax layer. To train deep neural networks, L2-SVM loss was used for facial expression classification. Devries et al. [47] developed a multi-task CNN that jointly predicts FER and facial landmarks and demonstrated that learning features associated with facial landmark position can improve FER. Shao et al. [19] proposed three different DCNN models, namely shallow Light-CNN, dual-branch CNN, and pre-trained CNN, for FER in the wild. Liu et al. [56] proposed a method for optimizing both (N+M)-tuples cluster loss and softmax loss via two fully connected layer branch configurations. Specifically, during the training, the (N+M)-tuples cluster loss was formalized to alleviate the difficulty of anchor selection and threshold validation in the triplet loss for the identity-invariant FER. Ding et al. [57] proposed the FaceNet2ExpNet, where a novel two-stage training algorithm for FER was used. In the first stage, a probabilistic distribution function was used to regularize the training of the targeted FER net based on the already fine-tuned face net. In the second stage, to further boost the discriminative capability, randomly initialized fully connected layers attached to the trained convolutional blocks were used to train the entire network from scratch with strong facial expression label supervision. Meng et al. [58] proposed a novel identity-aware CNN (IACNN) that contains two identical sub-CNNs with shared weights. In their method, the expression-sensitive contrastive loss was used to learn expression-related features, while the identity-sensitive contrastive loss was employed to learn identity-related features. Identity-invariant FER was achieved by combining expression-related and identity-related features.

Recent advances in deep learning suggest that the use of an ensemble of DCNNs can improve the performance of image recognition and classification tasks. In [1], multi-column deep neural networks were proposed, where each column is represented as a DCNN. The outputs of all columns are simply averaged for decision aggregation. Yu et al. [4] proposed an effective optimization framework using the log-likelihood loss and hinge loss to adaptively combine multiple DCNNs to perform expression recognition. Kim et al. [5] used an exponentially weighted fusion based on validation accuracy and constructed a hierarchical architecture of DCNN committees by implementing majority voting or a simple average for FER. Pramerdorfer et al. [59] constructed an ensemble of DCNNs based on different deep architectures such as VGG, Inception, and ResNet; they used a simple ensemble voting of the outputs produced by eight DCNNs as ensemble members. In summary, the aforementioned works show that deep learning models can be employed as good base classifiers (i.e., ensemble members) for typical ensemble classification approaches, such as the average combination rule [28]. However, ensemble combination approaches with deep learning in the field of FER [54] have been limited to simple averaging or majority voting based on the conditional probability vector [28] obtained from each DCNN. Thus, an optimal combination of the outputs of ensemble DCNN classifiers has not yet been explored in the literature.

Unlike other recent DCNN-based ensemble approaches focused on FER [1], [4], [5], [7], [54], [59], the main contributions of our work are summarized as follows:
• The focus of this work is on making the DCNN ensemble combination more adaptable to the recognition problem being addressed. We developed a novel method that optimally determines the ensemble (combination) weights by minimizing the generalized (test) classification error of the whole DCNN ensemble. In our method, finding ensemble weights is formulated as an optimization problem, and it can be effectively solved using our novel simulated annealing (SA) [36], [37]-based algorithm well-suited for improving FER in the wild.
• In addition to ensemble weight optimization, we propose a new ensemble construction approach that takes advantage of the combined use of different face representations
102 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 25, 2023
and bagging [16] to create DCNN ensemble members. It allows for maximizing diversity (for facilitating a complementary effect) among DCNN members, thereby boosting FER performance.

III. MOTIVATION

To combine the outputs of DCNN members, we adopt weighted majority voting (WMV) because WMV and its variants are by far the most popular approach [43] for ensemble combination. Assuming that we are given a set of $M$ individual DCNNs, $\{f_1, \ldots, f_M\}$, our task is to combine the outputs of $f_k$ ($k = 1, \ldots, M$) to predict the emotion class label $c_j$ (e.g., happy). Here, the outputs of $f_k$ are in the form of an $L$-dimensional vector $[f_{k1}(x), f_{k2}(x), \ldots, f_{kL}(x)]$ for the input instance (sample) $x$, where $f_{kj}(x) \in [0, 1]$ represents an estimate of the posterior probability $P(c_j | x)$. In WMV, the output of the DCNN ensemble classifier can be defined as

$$F(x) = \sum_{k=1}^{M} w_k f_k(x) = \sum_{k=1}^{M} [w_k f_{k1}(x), w_k f_{k2}(x), \ldots, w_k f_{kL}(x)] \tag{1}$$

where $w_k$ is the weight assigned to each DCNN member. In practice, the weights are usually normalized and constrained by $w_k \geq 0$ and $\sum_{k=1}^{M} w_k = 1$.

With adequate weight assignments, the classification performance of WMV can be significantly improved [26], [30]. In particular, Zhou’s work [31] provides insight and motivation into the importance of developing an ensemble weight optimization algorithm. He adopted a Bayesian optimal discriminant analysis to determine ensemble weights for achieving a minimum ensemble classification error. We now briefly review this analysis to adequately explain our proposed method. Let $p_k$ denote the classification accuracy of $f_k$ and let us assume that the outputs of the individual $f_k$ ($k = 1, \ldots, M$) are conditionally independent (i.e., uncorrelated). Then, a Bayesian optimal discriminant function that leads to a minimum classification error for the combined output on the class label $c_j$ can be written as [31]

$$\log P(c_j) + \sum_{k=1}^{M} f_{kj}(x) \log \frac{p_k}{1 - p_k} \tag{2}$$

where $P(c_j)$ is the prior probability of the input $x$ being from the emotion class $c_j$. The second term in (2) suggests that the ensemble weights should increase with the classification accuracies of individual DCNNs; specifically, $w_k \propto \log \frac{p_k}{1 - p_k}$. This also theoretically supports the argument that different weights should be properly used for individual DCNNs relying on different strengths.

However, the weights chosen using (2) are obtained by assuming independence among the outputs of the individual classifiers. In practical applications, this does not hold because individual classifiers are usually correlated to a large extent. Hence, for the WMV combination to be effective on the DCNN ensemble, weight optimization should be performed by considering an appropriate compromise between the DCNN classification accuracy and their correlation (i.e., diversity [32]). To achieve this, we design a novel stochastic optimization algorithm that finds the optimal weights by minimizing the ensemble generalization error composed of ensemble accuracy and diversity. Another important issue is that the individual DCNNs must be as diverse (uncorrelated) as possible while being accurate. In light of this fact, Siqueira et al. [7] proposed the so-called Ensemble with Shared Representations (ESRs) based on deep convolutional networks for FER, aiming to achieve high diversity (low redundancy) in the ensemble. This motivates us to develop an effective ensemble construction of diverse DCNNs, which will be explained in the next section.

IV. PROPOSED DCNN ENSEMBLE CONSTRUCTION

To construct an ensemble of DCNNs that are diverse while being accurate, we propose the combined use of different face representations and bagging. For different face representations, facial texture images such as local binary pattern (LBP) [33] or Gabor [33] are used because they have been widely used for face representations in FER [12], [54].

From the entire training set $T$ containing $N$ samples, we create $P$ bootstrap training sets, denoted as $T_p$, $p = 1, \ldots, P$, by forming bootstrap replicates of the original training samples as in [16]. Therefore, each $T_p$ will contain, on average, $0.6N$ different samples [16], some of them repeated once or more. The remaining $0.4N$ examples in $\bar{T}_p = T - T_p$ are used for validation purposes in the training phase for each DCNN, that is, tuning the learning rate and determining when to stop network training [41]. Note that different face representations [33] are obtained by transforming an original RGB (or grayscale) image into various facial texture images based on associated parameters, for example, the number of sampling points and radius value for LBP face representations, as shown in Fig. 1. Each of the facial images contained in a particular $T_p$ is transformed into $Q$ different face representations, resulting in a total of $PQ$ bootstrap training sets. These multiple transformed bootstrap training sets, each with a specific face representation, are applied as inputs to train individual DCNNs as ensemble members. This way, we generate a total of $M$ DCNN ensemble members, where $M = PQ$.

Fig. 1 shows the proposed construction of DCNN ensemble members. To our knowledge, we are the first to propose the combined use of various facial texture representations and bagging [16] to learn DCNNs as ensemble members for FER. As demonstrated in our experiment (see Table III), our proposed combination of facial texture representation with bagging is advantageous for increasing diversity (i.e., decreasing correlation) among DCNN members, beyond what is achieved when using only RGB (or grayscale) as the input space for learning DCNNs. This makes the DCNN members different enough that they do not make coincident errors, thereby boosting ensemble classification performance.

V. DCNN ENSEMBLE WEIGHT OPTIMIZATION WITH SIMULATED ANNEALING

There are two main theoretical insights [32] behind the soundness of using ensemble classifier models: (a) ambiguity decomposition and (b) bias-variance decomposition, both of which
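The ensemble construction of Section IV can be illustrated with a short sketch. It only builds the index sets and the list of (bootstrap set, face representation) pairs; it does not train any network. The function names, the representation labels, and the sizes used in the demo are hypothetical and chosen for illustration only. Note that the expected fraction of distinct samples in a bootstrap replicate is about 0.63, which is consistent with the rounded 0.6N figure quoted above.

```python
import numpy as np

def build_bootstrap_sets(n_samples, n_bags, rng):
    """Return a list of (in_bag, out_of_bag) index arrays, one per bootstrap set."""
    all_idx = np.arange(n_samples)
    sets = []
    for _ in range(n_bags):
        # Sample N indices with replacement from the N training samples.
        in_bag = rng.integers(0, n_samples, size=n_samples)
        # Samples never drawn are held out for validation of this member.
        out_of_bag = np.setdiff1d(all_idx, in_bag)
        sets.append((in_bag, out_of_bag))
    return sets

def ensemble_member_specs(n_samples, n_bags, representations, seed=0):
    """Pair every bootstrap set with every face representation (M = P * Q)."""
    rng = np.random.default_rng(seed)
    specs = []
    for p, (in_bag, oob) in enumerate(build_bootstrap_sets(n_samples, n_bags, rng)):
        for rep in representations:
            specs.append({"bag": p, "representation": rep,
                          "train_idx": in_bag, "val_idx": oob})
    return specs

# Hypothetical demo: P = 4 bootstrap sets, Q = 3 representation labels.
representations = ["gray", "lbp_8_1", "lbp_8_2"]
specs = ensemble_member_specs(n_samples=1000, n_bags=4,
                              representations=representations)
unique_fraction = float(np.mean(
    [np.unique(s["train_idx"]).size / 1000 for s in specs]))
print(len(specs), round(unique_fraction, 2))
```

Each entry of `specs` describes one ensemble member: the training indices (with repeats), the held-out validation indices, and the representation that would be applied to the images before training.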
between accuracy and diversity of the DCNN ensemble in the process of ensemble weight optimization. In the proposed ensemble weight optimization using SA, a measure of the balance between DCNN ensemble accuracy and diversity is given by the energy function, which will be explained in the next subsection.

A. Proposed Energy Function

To find ensemble weights that achieve the lowest generalized classification error, we propose the following energy function as the learning objective:

$$E_w(x) = \mathrm{Acc}_w(x) - \lambda \cdot \mathrm{Div}_w(x) = \Big\| \sum_{k} w_k f_k(x) - t(x) \Big\|_p - \lambda \frac{1}{M} \sum_{k} \sum_{k \neq n} \big\| w_k f_k(x) - f_n(x) \big\|_p \tag{3}$$

where $V$ denotes a validation set (of the entire dataset) that should be kept unseen by all DCNNs during their ensemble construction explained in Section IV, and $| \cdot |$ is the cardinality of a set.

B. Proposed Simulated Annealing Optimization Algorithm

The optimization problem in (4) has a quadratic form in terms of the multiple variables $w_1, w_2, \ldots, w_M$ as the $p$-norm is used in (3). Hence, finding $w_{opt}$ is a nonlinear multivariate optimization problem where multiple local minima may exist. In this context, gradient-based methods are likely to get stuck in unacceptable local minima [36]. To tackle this problem, we consider
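A minimal sketch of how an accuracy-minus-diversity energy can be minimized over ensemble weights with simulated annealing is given below. It is a generic textbook SA loop, not the algorithm of this paper. The exact form of the diversity term, the Gaussian perturbation, the geometric cooling schedule, and all parameter names (`t_init`, `cooling`, `tries_per_temp`, `step`) are assumptions made for illustration, and the member outputs are synthetic.

```python
import numpy as np

def energy(w, probs, targets, lam=0.3, p=2):
    """Mean over samples of (accuracy term - lam * diversity term).

    probs:   (M, N, L) member outputs f_k(x) on N validation samples
    targets: (N, L) one-hot ground-truth vectors t(x)
    The diversity term below is an assumed form: the p-norm distance between
    each weighted member output and every other member output, divided by M.
    """
    M = probs.shape[0]
    weighted = w[:, None, None] * probs                    # w_k f_k(x)
    fused = weighted.sum(axis=0)                           # sum_k w_k f_k(x)
    acc = np.linalg.norm(fused - targets, ord=p, axis=1)   # shape (N,)
    diff = weighted[:, None] - probs[None, :]              # (M, M, N, L)
    dist = np.linalg.norm(diff, ord=p, axis=3)             # (M, M, N)
    off_diag = ~np.eye(M, dtype=bool)                      # exclude n == k
    div = dist[off_diag].sum(axis=0) / M                   # shape (N,)
    return float(np.mean(acc - lam * div))

def anneal_weights(probs, targets, lam=0.3, t_init=1.0, t_final=1e-8,
                   cooling=0.9, tries_per_temp=20, step=0.05, seed=0):
    """Generic SA over the weight simplex (w_k >= 0, sum_k w_k = 1)."""
    rng = np.random.default_rng(seed)
    M = probs.shape[0]
    w = np.full(M, 1.0 / M)                                # start from uniform
    e = energy(w, probs, targets, lam)
    best_w, best_e = w.copy(), e
    t = t_init
    while t > t_final:
        for _ in range(tries_per_temp):
            # Perturb, clip to non-negative, and renormalize onto the simplex.
            cand = np.clip(w + rng.normal(0.0, step, size=M), 0.0, None)
            if cand.sum() == 0.0:
                continue
            cand /= cand.sum()
            e_cand = energy(cand, probs, targets, lam)
            delta = e_cand - e
            # Metropolis rule: always accept improvements, sometimes accept
            # worse candidates so the search can leave local minima.
            if delta < 0 or rng.random() < np.exp(-delta / t):
                w, e = cand, e_cand
                if e < best_e:
                    best_w, best_e = w.copy(), e
        t *= cooling                                       # geometric cooling
    return best_w, best_e

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

# Synthetic demo: five members of decreasing quality on a 7-class problem.
rng = np.random.default_rng(1)
N, L, M = 200, 7, 5
labels = rng.integers(0, L, size=N)
targets = np.eye(L)[labels]
probs = np.stack([softmax(3.0 * targets + s * rng.normal(size=(N, L)))
                  for s in (0.5, 1.0, 1.5, 2.0, 3.0)])
e_uniform = energy(np.full(M, 1.0 / M), probs, targets)
w_opt, e_opt = anneal_weights(probs, targets)
print(np.round(w_opt, 3), e_opt <= e_uniform)
```

Because the search starts from uniform weights and keeps the best state visited, the returned energy is never worse than that of plain averaging on the same validation data.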
TABLE I
NUMBER OF IMAGES PER EXPRESSION IN THE DATABASES USED

Fig. 3. Facial examples from FER2013 [29]. The columns show the emotion categories.

A. Experimental Setup and Condition
The FER2013 [29], SFEW2.0 [53], and RAF-DB [67]
datasets collected in the wild were used in our experiments. The
FER2013 dataset was created using the Google image search
engine to search for images of faces that match a set of 184
emotion-related keywords, such as blissful, enraged, etc. These
keywords were combined with words related to gender, age, or
ethnicity to obtain nearly 600 strings that were used as facial
image search queries. The first 1,000 images returned for each
query were kept for the next stage of processing. The images
were resized to 48 × 48 pixels and converted to a grayscale
format. The resulting dataset contains 35,887 grayscale images
acquired in a wild setting. The SFEW2.0 DB [53] was created
by selecting static frames from the acted facial expressions in
the wild (AFEW). The SFEW2.0 DB covers unconstrained facial
expressions, different head poses, large age ranges, varied focus,
occlusions, different resolutions of faces, and close to real-world
illuminations. The SFEW2.0 DB is divided into two sets, which
are created in a strict person-independent manner. The RAF-DB
is a widely used wild FER dataset acquired in an unconstrained
setting, offering a broad diversity across pose, gender, age, de-
mography, image quality, and illumination. RAF-DB contains
30,000 facial images annotated with basic or compounded ex-
pressions by 40 trained human coders. In our experiments, only
images with six discrete basic emotions were used.
Fig. 4. Example of facial images (scaled to a size of 224 × 224 × 3 pixels) and their original images from (a) SFEW2.0 [53] and (b) RAF-DB [67].

Table I shows the number of images for the seven basic expressions in FER2013, SFEW2.0, and RAF-DB. In addition, Figs. 3 and 4 show some image examples for the respective FER DB. As shown in Figs. 3 and 4, the face images in the datasets we used exhibit great variability in head pose because they were captured under in-the-wild conditions, which is particularly difficult to
tackle for correct FER [54], [70]. We measured the head pose angles of the face images used in our experiments. For this, we employed the popular “Intraface” toolbox [66] to estimate the three angles, i.e., yaw, pitch, and roll in degrees, of the original images (see Fig. 4) of each FER DB. Based on the statistics of our collected data, the range of yaw is [−43.43, 47.01], [−55.79, 52.48], and [−61.32, 58.72] in degrees for FER2013 (test), SFEW2.0 (validation), and RAF-DB (test), respectively. In the same order of FER datasets, the pitch ranges are [−32.32, 45.26], [−15.54, 37.89], and [−43.85, 54.84], and the roll ranges are [−20.34, 19.86], [−29.8, 30.22], and [−31.86, 28.41]. Moreover, among their original test images, about one-fifth (20%) of FER2013 (test), about half (50%) of SFEW2.0 (validation), and about one-fifth (20%) of RAF-DB (test) have poses larger than 30 degrees (in yaw, pitch, or roll). These observations demonstrate that our method is tested under an in-the-wild FER scenario that covers a wide range of pose variations.

As for DCNN ensemble members, we chose VGG-Face [39] as the pre-trained deep network because it performs well and involves a moderate number of parameters. Note that other recent pre-trained deep CNNs or newly designed deep CNN networks [12], [54] can be readily applied to our SA algorithm. The VGG-Face is a deep CNN model successfully trained on 2.6 million facial images collected from the web to recognize 2,622 identities. This network involves 16 convolutional layers, five max-pooling layers, three fully connected (FC) layers, and a final layer with softmax activation. To construct an ensemble of DCNNs, we used grayscale, LBP, and Gabor face representations as an input to the individual DCNNs. Four different LBP representations were obtained by adjusting the parameter values [33] (no. of sampling points, radius): (8,1), (8,2), (8,3), and (16,2). In addition, referring to [33], 2D Gabor kernels with three different scales (1, 3, and 4 scales) and three orientations (0, 4, and 5 orientations) were used to create nine Gabor representations. Note that we used four bootstrap training datasets [16] and 14 different face representations, resulting in a total of 56 DCNN ensemble members.

We used the training dataset of each FER DB for fine-tuning the VGG-Face model as in [40], [41] and scaled the size of image data to 224 × 224 × 3 to fit the VGG-Face input requirement. The hyper-parameters of each network are the same as those used in [41]: momentum 0.9; weight decay $5 \times 10^{-4}$; and initial learning rate $10^{-2}$, which is decreased by a factor of 10 when the validation error stops decreasing (specifically, when the error increases for more than three consecutive times). Overall, each DCNN was trained using three decreasing learning rates. In the proposed SA algorithm, we set $b_{max} = 1{,}000$, $T_{final} = 10^{-8}$, $E_{final} = -\infty$, and $try_{max} = 300$. The $\lambda$ parameter in (3) was experimentally chosen by means of an exhaustive tuning process where $\lambda$ is varied over the range [0, 1], using a step size equal to 0.05. The value of $\lambda$ is determined by selecting the one having the best FER accuracy on the validation set of each FER dataset. Our results show that setting $\lambda$ in the range of [0.25, 0.35] is adequate for all of the datasets used in our experiments. In addition, there is little performance difference in the range of [0.25, 0.35]. For this reason, we set $\lambda = 0.3$ for all the reported results, realizing a good compromise between the accuracy and diversity of the DCNN ensemble.

Three data augmentation strategies were used during the fine-tuning phase, which included horizontal flips, random rotations (in a degree range of [−60, 60]), and random shifting. To implement data augmentation, we used the Keras ImageDataGenerator API.¹ Fig. 5 shows the original facial images and the corresponding randomly augmented sample images.

¹ [Online]. Available: https://keras.io/preprocessing/image

Fig. 5. Examples of randomly augmented facial images with three data augmentation techniques.

B. Effectiveness of Proposed SA Algorithm for Finding Optimal DCNN Ensemble Weights

To demonstrate the usefulness of our SA-based algorithm in terms of finding optimal (DCNN) ensemble weights, a comparative experimental study was carried out. For comparison, we employed other popular ensemble weight computation approaches. Specifically, the following four approaches were compared: (a) conventional majority voting [28], (b) performance weighting [26], [31], (c) random search [60], and (d) attention network [65]. In the case of majority voting, all DCNN members have the same combination weights [31] and are treated equally (i.e., this approach is the same as that proposed in conventional majority voting [28]). When using performance weighting [26], [31], we weight each DCNN according to its individual performance on the validation set. The best DCNN can be assigned a weight of one, whereas the weight of the worst DCNN is zero [26], [31]. The DCNN classifiers whose performances are given weights are determined as in [26], [31]. For implementing random search [60], uniform DCNN ensemble weights are generated, and then a sampling procedure is performed to find the ensemble weights that yield the highest validation set performance. Furthermore, an attention network [65] was used to generate DCNN weights; we first pooled all deep face features, each extracted from the FC layer of an individual DCNN, and then applied them to attention blocks to adaptively compute the weights [65].

From Table II, we observe the following: 1) Ensemble results have been compared with the baseline performance of the best single DCNN classifier. Every DCNN ensemble combination approach has shown better performance than the best single DCNN. 2) A DCNN ensemble with different combination weights performs better than the DCNN ensemble combined via naive majority voting [28] with uniform weights, which indicates that it is beneficial to have an ensemble of DCNNs with different degrees of contribution to improve performance. 3) The performance improvement is quite convincing when ensemble weights
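Three of the baseline weighting schemes compared in this subsection (uniform weights, performance weighting, and random search) can be sketched as follows. The details are assumptions: performance weighting is implemented here as min-max scaling of validation accuracies (best member scaled to one, worst to zero) followed by normalization to unit sum, and random search draws candidate weights uniformly from the simplex with a Dirichlet distribution. The cited works may differ in both respects, and the member outputs below are synthetic.

```python
import numpy as np

def fuse(weights, probs):
    """Weighted sum of member outputs; probs has shape (M, N, L)."""
    return np.tensordot(weights, probs, axes=1)            # shape (N, L)

def accuracy(weights, probs, labels):
    return float(np.mean(fuse(weights, probs).argmax(axis=1) == labels))

def uniform_weights(M):
    """Equal weights, i.e., plain averaging of member outputs."""
    return np.full(M, 1.0 / M)

def performance_weights(member_acc):
    """Min-max scale accuracies (best -> 1, worst -> 0), then normalize."""
    a = np.asarray(member_acc, dtype=float)
    span = a.max() - a.min()
    if span == 0.0:                                        # all members tie
        return uniform_weights(a.size)
    scaled = (a - a.min()) / span
    return scaled / scaled.sum()

def random_search_weights(probs, labels, n_trials=500, seed=0):
    """Keep the sampled weight vector with the best validation accuracy."""
    rng = np.random.default_rng(seed)
    M = probs.shape[0]
    best_w = uniform_weights(M)
    best_acc = accuracy(best_w, probs, labels)
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(M))                      # uniform on simplex
        acc = accuracy(w, probs, labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Synthetic validation outputs for four members of different quality.
rng = np.random.default_rng(2)
N, L, M = 300, 7, 4
labels = rng.integers(0, L, size=N)
onehot = np.eye(L)[labels]
probs = np.stack([np.clip(onehot + s * rng.normal(size=(N, L)), 0.0, None)
                  for s in (0.6, 0.9, 1.2, 1.5)])
member_acc = [float(np.mean(p.argmax(axis=1) == labels)) for p in probs]
acc_uniform = accuracy(uniform_weights(M), probs, labels)
w_perf = performance_weights(member_acc)
w_rs, acc_rs = random_search_weights(probs, labels)
print(acc_uniform, accuracy(w_perf, probs, labels), acc_rs)
```

Normalizing the performance weights to unit sum does not change the predicted class, since the argmax of the fused output is invariant to a positive rescaling of all weights.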
TABLE II
COMPARISON OF FER ACCURACIES WITH RESPECT TO THE DIFFERENT DCNN ENSEMBLE WEIGHT COMPUTATION APPROACHES. THE PRIVATE TEST SET OF THE FER2013 DB WAS USED FOR TESTING. (BOLD: BEST, UNDERLINE: SECOND BEST)

TABLE III
COMPARISON OF THE DIVERSITY MEASURES [13] AMONG THE DCNN MEMBERS FOR THREE DIFFERENT ENSEMBLE CONSTRUCTION APPROACHES

TABLE IV
FER ACCURACY COMPARISONS WITH OTHER STATE-OF-THE-ART APPROACHES ON FER2013 PRIVATE TEST DATASET. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED. (BOLD: BEST, UNDERLINE: SECOND BEST)
TABLE V
FER ACCURACY COMPARISONS ON SFEW2.0. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED. VALIDATION ACCURACY WAS REPORTED

TABLE VI
FER ACCURACY COMPARISONS ON RAF-DB. RESULTS FOR THE COMPARISON ARE DIRECTLY CITED FROM PAPERS RECENTLY PUBLISHED

TABLE VII
CONFUSION MATRIX (UNIT OF %) OF THE PROPOSED DCNN ENSEMBLE METHOD EVALUATED ON THE SFEW 2.0 VALIDATION SET
[55] M. Liu, S. Li, S. Shan, and X. Chen, “AU-inspired deep networks for facial expression feature learning,” Neurocomputing, vol. 159, pp. 126–136, 2015.
[56] X. Liu, B. Kumar, J. You, and P. Jia, “Adaptive deep metric learning for identity-aware facial expression recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 20–29.
[57] H. Ding, S. K. Zhou, and R. Chellappa, “FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2017, pp. 118–126.
[58] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convolutional neural network for facial expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2017, pp. 558–565.
[59] C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: State of the art,” 2016, arXiv:1612.02903.
[60] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[61] T. S. Ly, N. T. Do, S. H. Kim, H. J. Yang, and G. S. Lee, “A novel 2D and 3D multimodal approach for in-the-wild facial expression recognition,” Image Vis. Comput., vol. 92, no. 103817, pp. 1–12, 2019.
[62] B. K. Kim, H. Lee, J. Roh, and S. Y. Lee, “Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition,” in Proc. ACM Int. Conf. Multimodal Interaction, 2015, pp. 427–434.
[63] D. Acharya, Z. Huang, D. Paudel, and L. V. Gool, “Covariance pooling for facial expression recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 480–487.
[64] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with inconsistently annotated datasets,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 222–237.
[65] J. Yang et al., “Neural aggregation network for video face recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4362–4371.
[66] F. D. L. Torre et al., “IntraFace,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2015, pp. 1–8.
[67] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2584–2593.
[68] Y. Li, J. Zeng, S. Shan, and X. Chen, “Occlusion aware facial expression recognition using CNN with attention mechanism,” IEEE Trans. Image Process., vol. 28, no. 5, pp. 2439–2450, May 2019.
[69] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 6896–6905.
[70] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region attention networks for pose and occlusion robust facial expression recognition,” IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020.
[71] A. H. Farzaneh and X. Qi, “Facial expression recognition in the wild via deep attentive center loss,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Jan. 2021, pp. 2402–2411.
[72] R. Momin, A. S. Momin, and K. Rasheed, “Recognizing facial expressions in the wild using multi-architectural representations based ensemble learning with distillation,” 2021, arXiv:2106.16126.
[73] Z. Wang, F. Zeng, S. Liu, and B. Zeng, “OAENet: Oriented attention ensemble for accurate facial expression recognition,” Pattern Recognit., vol. 112, 2021, Art. no. 107694.
[74] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.

Jae Young Choi (Member, IEEE) received the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2008 and 2011, respectively. In 2008, he was a Visiting Scholar with the University of Toronto, Toronto, ON, Canada, and from 2011 to 2012, he was a Postdoctoral Researcher with the University of Toronto. From 2012 to 2013, he was a Postdoc Fellow with the University of Pennsylvania, Philadelphia, PA, USA. From 2013 to 2014, he was a Senior Engineer with Samsung Electronics. He is currently an Associate Professor with the Division of Computer Engineering, Hankuk University of Foreign Studies, Seoul, South Korea. He is the author or coauthor of more than 100 refereed research publications in his research field, which include deep learning, ensemble machine learning, pattern recognition, and computer vision. In particular, he has developed several pioneering algorithms for automatic face recognition using facial color information. Prof. Choi was the recipient of the Best Paper Award of Korea Multimedia Society in 2021. He was also the recipient of the Samsung HumanTech Thesis Prize in 2010.

Bumshik Lee (Member, IEEE) received the B.S. degree in electrical engineering from Korea University, Seoul, South Korea, and the M.S. and Ph.D. degrees in information and communications engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. He was a Research Professor with KAIST in 2014 and a Postdoctoral Scholar with the University of California San Diego, San Diego, CA, USA, from 2012 to 2013. From 2015 to 2016, he was a Principal Engineer with Advanced Standard R&D Lab., LG Electronics, Seoul, South Korea. He is currently an Associate Professor with the Department of Information and Communication Engineering, Chosun University, Gwangju, South Korea. His research interests include pattern recognition, video compression and processing, video security, and medical image processing.