Hybrid Facial Expression Recognition (FER2013) Model For Real-Time Emotion Classification and Prediction
Abstract. Facial expression recognition is a vital research topic in fields ranging from artificial intelligence and gaming to human–computer interaction (HCI) and psychology. This paper proposes a hybrid model for facial expression recognition that comprises a deep convolutional neural network (DCNN) and a Haar Cascade classifier. The objective is to classify real-time and digital facial images into one of the seven facial emotion categories considered. The DCNN employed in this research uses additional convolutional layers, ReLU activation functions, and multiple kernels to increase filtering depth and improve facial feature extraction. In addition, a Haar Cascade model is used in tandem to detect faces in real-time images and video frames. Grayscale images from the Kaggle repository (FER-2013) were used, and graphics processing unit (GPU) computation was exploited to expedite training and validation. Pre-processing and data augmentation techniques are applied to improve training efficiency and classification performance. The experimental results show significantly improved classification performance compared to state-of-the-art (SoTA) work. Compared with other conventional models, this paper validates that the proposed architecture is superior in classification performance, with an improvement of up to 6% (totaling up to 70% accuracy) and a shorter execution time of 2,098.8 s.
Keywords: Deep learning, DCNN, facial emotion recognition, human–computer interaction, Haar Cascade, computer vision, FER2013.
expressions from postured (deliberate/volitional/deceptive) expressions, is a crucial yet challenging task in facial emotion recognition. This research focuses on extracting objective facial parameters from real-time and digital images to determine the emotional states of people from their facial expressions, as shown in Figure 1.

Over the last six years, advances in deep learning have propelled the evolution of convolutional networks. Computer vision is an interdisciplinary scientific field that equips the computer with a high-level understanding of images or videos, replicating human visual prowess from a computational perspective. Its aims are to automate tasks, categorize images from a given set of classes, and have the network determine the predominant class present in an image. It can be implied that computer vision is the art of making a computer 'see' and endowing it with the human intelligence needed to process what is 'seen'. More recently, deep learning has become the go-to method for image recognition and detection, far surpassing classical computer vision methods due to the unceasing improvement in the state-of-the-art performance of the models.

A gap identified by this research is that most datasets used in preceding research consist of well-labeled (posed) images obtained from a controlled environment. According to Montesinos López et al., this anomaly increases the likelihood of model overfitting, given insufficient training data availability, ultimately causing relatively lower efficiency when predicting emotions in uncontrolled scenarios [1]. Consequently, this research also identifies the importance of lighting in facial emotion recognition (FER), highlighting that poor lighting conditions can degrade the model's efficiency.

This research uses a convolutional neural network (CNN) to model critical extracted features used for facial detection and to classify the human psychological state into one of six emotions or a seventh neutral emotion. It also employs the Haar Cascade model for real-time facial detection. It is worth noting that, given the hand-engineered nature of the features and the model's dependency on prior knowledge, a comparatively higher model accuracy is required.
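To make the detection step concrete, the following is a minimal sketch of a Haar Cascade face detector feeding cropped faces to an emotion classifier, in the spirit of the hybrid pipeline described above. The cascade file ships with OpenCV; the model file name, label order, and the 48 × 48 grayscale pre-processing are illustrative assumptions rather than the paper's exact implementation.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Assumed artifacts: a trained FER model saved as "fer_dcnn.h5" and the
# standard frontal-face cascade bundled with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = load_model("fer_dcnn.h5")
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]  # assumed order

cap = cv2.VideoCapture(0)  # default webcam for real-time frames
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Haar Cascade face detection on the grayscale frame
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Crop the detected face, resize to 48x48, normalize, and classify
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(roi.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
    cv2.imshow("FER", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()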
EMPIRICAL REVIEW OF RELATED WORK

Over the years, several scholars have embarked on research to tackle this challenge. Ekman & Friesen highlighted seven basic human emotions that are independent of the culture a human being is born into (anger, fear, disgust, happiness, sadness, surprise, and neutral) [2]. Sajid et al., in a study using the Facial Recognition Technology (FERET) dataset, recently established the significance of facial asymmetry as a marker for age estimation. The study highlighted the ease with which the right face's asymmetry may be identified relative to the left face [3]. Below are reviews of works of literature pertinent to this research.

A model was trained on individual frames from movies and still digital images using a mixed CNN-RNN architecture by Kahou et al. [4]. The Acted Facial Expressions in the Wild (AFEW) 5.0 dataset was used for the video clips, and a combination of FER-2013 and the Toronto Face Database [5] was used for the photographs. Long short-term memory (LSTM) units were replaced by IRNNs, which are composed of rectified linear units (ReLUs) [6]. IRNNs were appropriate because they offered a straightforward solution to the vanishing and exploding gradient problem. The overall accuracy of the study was 0.528.

Face detection still has a serious problem with how faces appear in different poses. An explanation for the variation in face pose appearance was offered by Ratyal et al., who used a three-dimensional pose-invariant approach based on subject-specific descriptors [7, 8].

The inter-variability of emotions between individuals and misclassification are two issues with still-image-based FER that Li et al. address by suggesting a neural network model [9]. Their model consists of two convolutional neural networks; the first is trained on facial expression databases, while the second is a DeepID network that learns identity-relevant features. These two networks were integrated into a tandem facial expression (TFE) feature and delivered to the fully connected layers to produce a new model. Arousal and valence emotion annotations employing face, body, and context information were used by Mou et al. to infer group emotion [10]. To identify group-level emotion in photos, Tan et al. combined two different forms of CNNs, namely individual facial emotion CNNs and global image-based CNNs [11]. Different pooling methods such as average and max pooling are used to downsample the inputs and aid generalization [12, 13]. Dropout, regularization, and data augmentation were used to prevent overfitting. Batch normalization was developed to help prevent gradient vanishing and exploding [14, 15].
As can be inferred from the literature highlighted above, much of the innovative research conducted by other scholars has emphasized continuous upscaling of accuracy without simultaneously considering efficiency. This research paper proposes a more efficient and accurate model with improved generalization, as discussed in subsequent sections. Table 1 summarizes previously reported classification accuracies on FER-2013. Most reported methods perform better than the estimated human performance (~65.5%). In this work, a state-of-the-art accuracy of 70.04% was achieved.

Table 1. Summary of previously reported accuracies for the FER-2013 dataset.

Method                              Accuracy
CNN [16]                            62.44%
GoogleNet [19]                      65.20%
VGG + SVM [18]                      66.31%
Conv + Inception Layer [20]         66.40%
Bag of words [17]                   67.40%
CNN + Haar Cascade (this work)      70.04%

To solve the FER problem, the emotions are divided into groups according to surprise, anger, happiness, fear, disgust, sadness, and neutrality [21], and a classification algorithm is built on the extracted features. In the past, researchers have used classifiers for identifying facial expressions, including the support vector machine (SVM), artificial neural network (ANN), hidden Markov model (HMM), and K-nearest neighbor (KNN). This research employs a modified hybrid model comprising two models (CNN and Haar Cascade). The following are explanations of several model components that constitute an architecture for learning various degrees of abstraction and representations. For the output layer with seven emotion classifications, these elements comprise convolutional layers, dropout, the ReLU activation function, categorical cross-entropy loss, the Adam optimizer, and the softmax activation function, as illustrated in Figure 2. This section also emphasizes the modifications made to these components through hyperparameter tuning, which enhanced the proficiency and accuracy of the hybrid model.

Convolutional Layer

The central component of a convolutional neural network, which does the majority of the computational work, is the convolutional layer. The convolution layer applies a filter f_k with a kernel size of n × m to an input x in order to perform a convolution [21]; there are n × m input connections. The result can be computed with the following equation:

    C(x_{u,v}) = \sum_{i=-n/2}^{n/2} \sum_{j=-n/2}^{n/2} f_k(i, j) \, x_{u-i, v-j}    (1)

Max Pooling Layer

By applying the max function, this layer significantly reduces the size of its input (Saravanan et al. [22]). Let m represent the filter's size and x_i represent the input. Figure 2 depicts the application of this layer, and Equation (2) can be used to determine the result.

Rectified Linear Unit (ReLU) Activation Function

This determines the output for a specific value of the network's or a neuron's input, as shown in Equation (3). ReLU was utilized for this research because it requires no exponential function to calculate and has no vanishing gradient error. Figure 2 shows the concatenated convolutional layers processed in parallel through ReLU activation functions to upscale accuracy and extract facial features from images.

    f(x) = \begin{cases} 0, & \text{for } x < 0 \\ x, & \text{for } x \geq 0 \end{cases}    (3)

Fully Connected Layer

This is a multilayer perceptron; it connects every neuron from the previous layer to a neuron in the current layer. The following equation serves as its mathematical representation:

    f(x) = \sigma(p \ast x)    (4)
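The categorical cross-entropy loss and the softmax activation listed among these components are not written out above; in their standard form (generic notation, not taken from the paper), for the seven emotion classes they are

    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{7} e^{z_j}}, \qquad \mathcal{L} = -\sum_{i=1}^{7} y_i \log \hat{y}_i,

where z is the vector of class scores produced by the last fully connected layer, y is the one-hot ground-truth label, and \hat{y} is the softmax output.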
Table 4. Hyperparameters used by the proposed model and their corresponding values.

Hyperparameter            Value    Description
Batch size                120      The number of training images propagated through the network in one pass.
Number of epochs          80       The number of complete passes through the training set.
Optimizer                 Adam     Adam optimization algorithm.
Learning rate (lr)        0.001    Controls the speed at which the weights of the network are updated during training.
FC1 neurons               1024     Total number of neurons in the first fully connected layer.
FC2 neurons               512      Total number of neurons in the second fully connected layer.
Dropout                   0.5      Dropout rate between fully connected layers.
Convolution kernel size   3 × 3    Size of the kernel in each convolution layer.
MaxPooling kernel size    2 × 2    Size of the kernel in each MaxPooling layer.
MaxPooling strides        2        Kernel stride in each MaxPooling layer.
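As an illustration of how the values in Table 4 fit together, the following is a minimal Keras sketch of a DCNN assembled from the components described earlier. The kernel sizes, pooling, dropout rate, fully connected widths, optimizer, learning rate, batch size, and epoch count follow Table 4; the number of convolutional blocks and their filter counts (64/128/256) are assumptions, not the exact architecture of Figure 2.

from tensorflow.keras import layers, models, optimizers

def build_fer_dcnn(num_classes=7, input_shape=(48, 48, 1)):
    """Illustrative DCNN using the hyperparameters of Table 4."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (64, 128, 256):                      # assumed filter progression
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(1024, activation="relu"))    # FC1
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(512, activation="relu"))     # FC2
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_fer_dcnn()
# x_train / y_train are assumed to hold normalized 48x48x1 images and one-hot labels:
# model.fit(x_train, y_train, batch_size=120, epochs=80, validation_split=0.1)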
neural networks for feature extraction and emotion classification, which employs a single classifier for detecting faces from multiple views in both real-time scenarios and digital images or video frames. This research also sought to optimize the computational complexity of the deep convolutional neural network (DCNN) by modifying the architecture through the addition of layers to improve pattern identification in real-time or digital images. The additional layers apply more convolution filters to the image to detect image features. To further enhance the model's predictive efficiency and accuracy, the number of training epochs was increased to 80.

The proposed model for emotion recognition uses three steps, namely face detection (see Figure 3), feature extraction, and emotion classification, and achieves better results than the previous model. In the suggested method, validation accuracy rises as calculation time decreases, and validation loss is significantly reduced. The FER-2013 dataset, which includes the seven main emotions (sad, fear, happiness, angry, neutral, surprised, and disgust), was used to test the proposed DCNN model.

Figure 5 shows the test sample result associated with the happy emotion for a digital test image. The proposed model also predicted the same emotion with reduced computation time compared to the preceding models discussed in the literature review.

Table 5 describes the metrics used in measuring the success of the CNN model developed for this research. Given the necessary modifications made as proposed earlier, it was observed that, on average, the proposed model had
[2] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion," J. Personal Soc. Psychol., 17(2): 12, 1971.
[3] M. Sajid, N. I. Ratyal, N. Ali, B. Zafar, S. H. Dar, M. T. Mahmood, and Y. B. Joo, "The impact of asymmetric left and asymmetric right face images on accurate age estimation," Math. Probl. Eng., 2019: 1–10.
[4] S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pp. 467–474, ACM.
[5] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE Multimedia, 19(3): 34–41, July 2012.
[6] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units," arXiv preprint arXiv:1504.00941, 2015.
[7] N. I. Ratyal, I. A. Taj, M. Sajid, N. Ali, A. Mahmood, and S. Razzaq, "Three-dimensional face recognition using variance-based registration and subject-specific descriptors," Int. J. Adv. Robot. Syst., 16(3), 2019: 1729881419851716.
[8] N. I. Ratyal, I. A. Taj, M. Sajid, N. Ali, A. Mahmood, S. Razzaq, S. H. Dar, M. Usman, M. J. A. Baig, and U. Mussadiq, "Deeply learned pose invariant image analysis with applications in 3D face recognition," Math. Probl. Eng., 2019: 1–21.
[9] M. Li, H. Xu, X. Huang, Z. Song, X. Liu, and X. Li, "Facial expression recognition with identity and emotion joint learning," IEEE Transactions on Affective Computing, 2018, 1–1.
[10] W. Mou, O. Celiktutan, and H. Gunes, "Group-level arousal and valence recognition in static images: Face, body and context," in 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 2015.
[11] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao, "Group emotion recognition with individual facial emotion CNNs and global image-based CNNs," in 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 2017.
[12] A. Giusti, D. C. Cireşan, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Fast image scanning with deep max-pooling convolutional neural networks," in 2013 IEEE International Conference on Image Processing, ICIP 2013 – Proceedings, 2013, DOI: 10.1109/ICIP.2013.6738831.
[13] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017, 2018, vol. 2018-January, DOI: 10.1109/ICEngTechnol.2017.8308186.
[14] B. Han, J. Sim, and H. Adam, "BranchOut: Regularization for online ensemble tracking with convolutional neural networks," in Proceedings – 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017, vol. 2017-January, DOI: 10.1109/CVPR.2017.63.
[15] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in 32nd International Conference on Machine Learning, ICML 2015, 2015, vol. 1.
[16] K. Liu, M. Zhang, and Z. Pan, "Facial expression recognition with CNN ensemble," in Proceedings – 2016 International Conference on Cyberworlds, CW 2016, 2016.
[17] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, 2016, DOI: 10.1109/WACV.2016.7477450.
[18] M. I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and handcrafted features for facial expression recognition," IEEE Access, vol. 7, 2019, DOI: 10.1109/ACCESS.2019.2917266.
[19] R. T. Ionescu, M. Popescu, and C. Grozea, "Local learning to improve bag of visual words model for facial expression recognition," Workshop on Challenges in Representation Learning, ICML, 2013.
[20] P. Giannopoulos, I. Perikos, and I. Hatzilygeroudis, "Deep learning approaches for facial emotion recognition: A case study on FER-2013," in Smart Innovation, Systems and Technologies, 2018, vol. 85, DOI: 10.1007/978-3-319-66790-4_1.
[21] A. Handa, R. Agarwal, and N. Kohli, "Incremental approach for multi-modal face expression recognition system using deep neural networks," International Journal of Computational Vision and Robotics (IJCVR), vol. 11, no. 1, pp. 1–20, 2021.
[22] A. Saravanan, G. Perichetla, and K. Gayathri, "Facial emotion recognition using convolutional neural networks," arXiv, 2019. [Online]. Available: https://arxiv.org/pdf/1910.05602.pdf. [Accessed: 12-Oct-2019].
[23] S. Ingale, A. Kadam, and M. Kadam, "Facial expression recognition using CNN with data augmentation," International Research Journal of Engineering and Technology (IRJET), 2021. [Online]. Available: https://www.irjet.net/archives/V8/i4/IRJET-V8I4685.pdf. [Accessed: 28-Apr-2022].
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012. [Accessed: 2-May-2022].