
BOHR International Journal of Internet of Things, Artificial Intelligence and Machine Learning
2022, Vol. 1, No. 1, pp. 63–71
https://doi.org/10.54646/bijiam.011
www.bohrpub.com

Hybrid Facial Expression Recognition (FER2013) Model for Real-Time Emotion Classification and Prediction

Ozioma Collins Oguine∗, Kanyifeechukwu Jane Oguine, Hashim Ibrahim Bisallah, and Daniel Ofuani

Department of Computer Science, University of Abuja, Abuja, Nigeria

∗Corresponding author: [email protected]

Abstract. Facial expression recognition is a vital research topic in fields ranging from artificial intelligence and gaming to human–computer interaction (HCI) and psychology. This paper proposes a hybrid model for facial expression recognition, which comprises a deep convolutional neural network (DCNN) and a Haar Cascade deep learning architecture. The objective is to classify real-time and digital facial images into one of the seven facial emotion categories considered. The DCNN employed in this research has more convolutional layers, ReLU activation functions, and multiple kernels to enhance filtering depth and facial feature extraction. In addition, a Haar Cascade model was used to detect facial features in real-time images and video frames. Grayscale images from the Kaggle repository (FER-2013) were used, and graphics processing unit (GPU) computation was exploited to expedite the training and validation process. Pre-processing and data augmentation techniques are applied to improve training efficiency and classification performance. The experimental results show significantly improved classification performance compared to state-of-the-art (SoTA) experiments and research. Also, compared to other conventional models, this paper validates that the proposed architecture is superior in classification performance, with an improvement of up to 6%, totaling up to 70% accuracy, and a lower execution time of 2,098.8 s.

Keywords: Deep learning, DCNN, facial emotion recognition, human–computer interaction, Haar Cascade, computer vision, FER2013.

INTRODUCTION

Humans possess a natural ability to understand facial expressions. In real life, humans express emotions on their faces to show their psychological state and disposition at a given time and during their interactions with other people. However, the current trend of transferring cognitive intelligence to machines has stirred up conversations and research in the domain of human–computer interaction (HCI) and computer vision, with a particular interest in facial emotion recognition and its application in human–computer collaboration, data-driven animation, human–robot communication, etc.

Since emotions are physical and instinctive, they immediately cause physical responses to dangers, rewards, and other environmental elements. The objective measurements used to ascertain how people respond to these features include skin conductance (EDA/GSR), voice, body language, locomotion, brain activity (fMRI), heart rate (ECG), and facial expressions. The human ability to interpret emotions is crucial to effective communication; hypothetically, 93% of efficient conversation depends on the emotion of an entity. Hence, for ideal human–computer interaction (HCI), a high-level understanding of human emotion is required by machines.

Emotions are a fundamental part of human communication, driven by the erratic nature of the human mind and the perception of relayed information from the environment. There are varied emotions that inform decision-making and are vital components in individual reactions and psychological states. Contemporary psychological research has observed that facial expressions are predominantly used to understand social interactions rather than the psychological state or personal emotions.


Figure 1. FER-2013 sample training set images.

Consequently, the credibility assessment of facial expressions, which includes the discernment of genuine (natural) expressions from postured (deliberate/volitional/deceptive) expressions, is a crucial yet challenging task in facial emotion recognition. This research will focus on extracting objective facial parameters from real-time and digital images to determine the emotional states of people given their facial expressions, as shown in Figure 1.

Over the last six years, advancements in deep learning have propelled the evolution of convolutional networks. Computer vision is an interdisciplinary scientific field that equips the computer with a high-level understanding of images or videos, replicating human visual prowess within a computer perspective. Its aims are to automate tasks, categorize images from a given set of classes, and have the network determine the predominant class present in the image. It can be implied that computer vision is the art of making a computer 'see' and endowing it with the human intelligence of processing what is 'seen'. More recently, deep learning has become the go-to method for image recognition and detection, far surpassing classical computer vision methods due to the unceasing improvement in the state-of-the-art performance of the models.

A gap identified by this research is that most datasets of preceding research consist of well-labeled (posed) images obtained from a controlled environment. According to Montesinos López et al., this anomaly increased the likelihood of model overfitting, given insufficient training data availability, ultimately causing relatively lower efficiency in predicting emotions in uncontrolled scenarios [1]. Consequently, this research also identified the importance of lighting in facial emotion recognition (FER), highlighting that poor lighting conditions could degrade the model's efficiency.

This research will use a convolutional neural network (CNN) to model some critical extracted features used for facial detection and classify the human psychological state into one of six emotions or a seventh neutral emotion. It will also employ the Haar Cascade model for real-time facial detection. It is worthy of note that, given the hand-engineered nature of the features and the model's dependency on prior knowledge, a comparatively higher model accuracy is required.

EMPIRICAL REVIEW OF RELATED WORK

Over the years, several scholars have embarked on research to tackle this novel challenge. Ekman & Friesen highlighted seven basic human emotions independent of the culture a human being is born into (anger, fear, disgust, happiness, sadness, surprise, and neutral) [2]. Sajid et al., in a study using the facial recognition technology (FERET) dataset, recently established the significance of facial asymmetry as a marker for age estimation. The study highlighted the simplicity with which the right face's asymmetry may be identified relative to the left face [3]. Below are some reviews of the literature pertinent to this research.

A model was trained on individual frames from movies and still digital images using a mixed CNN-RNN architecture by Kahou et al. [4]. The acted facial expressions in the wild (AFEW) 5.0 dataset was used for the video clips, and a combination of FER-2013 and the Toronto Face Database [5] was used for the photographs. Long short-term memory (LSTM) units were replaced by IRNNs, which are composed of rectified linear units (ReLUs) [6]. IRNNs were appropriate because they offered a straightforward solution to the vanishing and exploding gradient problem. The accuracy of the study as a whole was 0.528.

Face detection still has a serious problem with how faces appear in different poses. An explanation for the variation in face position appearance was offered by Ratyal et al., where a three-dimensional pose-invariant approach using subject-specific descriptors was applied [7, 8].

The inter-variability of emotions between individuals and misclassification are two issues with still image-based FERs that Li et al. address [9] by suggesting a neural network model. Their model consists of two convolutional neural networks; the first is trained on facial expression databases, while the second is a DeepID network that learns identity-relevant features. These two neural networks were integrated into a tandem facial expression (TFE) feature and delivered to the fully connected layers to produce a new model. Arousal and valence emotion annotations employing face, body, and context information were used by Mou et al. to infer group emotion [10]. To identify group-level emotion in photos, Tan et al. combined two different forms of CNNs, namely, individual facial emotion CNNs and global image-based CNNs [11]. Different pooling methods such as average and max pooling are used to downsample the inputs and aid in generalization [12, 13]. Dropout, regularization, and data augmentation were used to prevent overfitting. Batch normalization was developed to help prevent gradient vanishing and exploding [14, 15].

As can be inferred from the literature highlighted above, several innovative research efforts conducted by other scholars have emphasized continuous upscaling of accuracy without simultaneously considering efficiency. This research paper proposes a more efficient and accurate model with improved generalization, as discussed in subsequent sections. Table 1 summarizes previously reported classification accuracies on FER-2013. Most reported methods perform better than the estimated human performance (∼65.5%). In this work, a state-of-the-art accuracy of 70.04% was achieved.

Table 1. Summary of previously reported accuracies for the FER-2013 dataset.

Methods                          Accuracy Rating
CNN [16]                         62.44%
GoogleNet [19]                   65.20%
VGG + SVM [18]                   66.31%
Conv + Inception Layer [20]      66.40%
Bag of words [17]                67.40%
CNN + Haar Cascade (This work)   70.04%

THEORETICAL BACKGROUND

To solve the FER problem, the emotions are divided into groups according to surprise, wrath, happiness, fear, disgust, and neutrality [21]. A classification algorithm is created utilizing the features that were extracted. In the past, researchers have used classifiers for identifying facial expressions, including support vector machine (SVM), artificial neural network (ANN), hidden Markov model (HMM), and K-nearest neighbor (KNN). This research employs a modified hybrid model, which comprises two models (CNN and Haar Cascade). Following are explanations of several model components that constitute an architecture for learning various degrees of abstraction and representations. For the seven emotion classifications in the output layer, these elements comprise convolutional layers, dropout, the ReLU activation function, categorical cross-entropy loss, the Adam optimizer, and the softmax activation function, as illustrated in Figure 2. This section will also emphasize the modifications made to these components through hyperparameter tuning that enhanced the proficiency and accuracy of the hybrid model.

Convolutional Layer

The central component of a convolutional neural network, which does the majority of the computational work, is the conv layer. The convolution layer applies a filter f_k with a kernel size of n × m to an input x in order to perform a convolution [21]; there are n × m input connections per output. The result can be computed with the following equation:

C(x_{u,v}) = \sum_{i=-n/2}^{n/2} \sum_{j=-m/2}^{m/2} f_k(i, j) \, x_{u-i, v-j}    (1)

Max Pooling Layer

By using the max function, this layer significantly decreases the size of the input (Saravanan et al.) [22]. Let m represent the filter's size and x_{i,j} represent the input. Figure 2 depicts the application of this layer, and the result can be determined with the following equation:

M(x_{i,j}) = \max \{ x_{i+k,\, j+l} : |k| \le m/2,\ |l| \le m/2,\ k, l \in \mathbb{N} \}    (2)

Rectified Linear Unit (ReLU) Activation Function

It determines the output for a specific value of the network's or a neuron's input x, as shown in Equation (3). ReLU was utilized for this research because it has no exponential function to calculate and no vanishing gradient error. Figure 2 shows the concatenated convolutional layers processed in parallel via ReLU activation functions to upscale accuracy and extract the facial features of images.

f(x) = \begin{cases} 0, & \text{for } x < 0 \\ x, & \text{for } x \ge 0 \end{cases}    (3)
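To make Equations (1)–(3) concrete, the following is a minimal NumPy sketch of one convolution block of the kind chained in Figure 2. It is illustrative only, not the authors' implementation; the function names and the random test data are our own.

```python
import numpy as np

def conv2d(x, f_k):
    """Discrete 2D convolution of input x with kernel f_k (Equation (1)),
    evaluated at valid positions only."""
    n, m = f_k.shape
    H, W = x.shape
    flipped = f_k[::-1, ::-1]  # the x_{u-i, v-j} indexing flips the kernel
    out = np.zeros((H - n + 1, W - m + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            out[u, v] = np.sum(flipped * x[u:u + n, v:v + m])
    return out

def max_pool(x, m=2):
    """Non-overlapping m x m max pooling (Equation (2))."""
    H, W = x.shape
    H, W = H // m * m, W // m * m  # trim to a multiple of the pool size
    return x[:H, :W].reshape(H // m, m, W // m, m).max(axis=(1, 3))

def relu(x):
    """ReLU activation f(x) = max(0, x) (Equation (3))."""
    return np.maximum(0, x)

# One conv block on a 48 x 48 grayscale image, as in the proposed model
img = np.random.rand(48, 48)
kernel = np.random.randn(3, 3)
feature_map = max_pool(relu(conv2d(img, kernel)))  # shape (23, 23)
```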

Figure 2. Proposed CNN model structure.



Fully Connected Layer

This is a multilayer perceptron; it connects every neuron from the previous layer to a neuron in the current layer. The following equation serves as its mathematical representation:

f(x) = \sigma(p \ast x)    (4)

where σ is the activation function, and p is the resultant matrix of size n × k, with k the dimension of x and n the number of neurons in the fully connected layer.

Output Layer

It represents the class of the given input image as a one-hot vector [21]. It is expressed in the following equation:

C(x) = \{ i \mid x_j \le x_i \ \forall j \ne i \}    (5)

SoftMax Layer

Its main functionality is the backpropagation of error [21]. Let the dimension of an input vector be P. Then the layer can be represented as a mapping function:

S(x): \mathbb{R}^P \to [0, 1]^P    (6)

and the output for each component j (1 ≤ j ≤ P) is given by the following equation:

S(x)_j = \frac{e^{x_j}}{\sum_{i=1}^{P} e^{x_i}}    (7)
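A direct NumPy rendering of Equation (7) is sketched below for illustration; subtracting max(x) before exponentiating is a standard numerical-stability step not present in the equation itself, and the sample scores are made up.

```python
import numpy as np

def softmax(x):
    """Equation (7); the max-shift leaves the result unchanged but avoids
    overflow in the exponentials."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Seven raw class scores -> probabilities over the seven emotion classes
scores = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.3])
probs = softmax(scores)
print(probs, probs.sum())  # components lie in [0, 1] and sum to 1
```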
suitable for this research because it stores and allows
PROPOSED METHODOLOGY

With the need for real-time object detection, several object detection architectures have gained wide adoption among researchers. The hybrid architecture put forward by this research uses both Haar Cascade face detection, a popular facial detection algorithm proposed by Paul Viola and Michael Jones in 2001, and the CNN model together. In Figure 2, the CNN architecture initially extracts input pictures of 48 × 48 × 1 (48 wide, 48 high, 1 color channel) from the FER-2013 dataset. The network starts with an input layer matching the input data dimension of 48 × 48. It also consists of seven concatenated convolutional layers processed in parallel via ReLU activation functions to upscale accuracy and extract the facial features of images, as shown in Figure 2. This input is shared, and the kernel size is the same across all submodels for feature extraction. Before a final output layer allows classification, the outputs from these feature extraction sub-models are flattened into vectors, concatenated into a large vector matrix, and transferred to a fully connected layer for assessment. A detailed step-by-step description of the methodology is given in the architecture below.

The suggested CNN model, as shown in Figure 2, consists of a batch normalization layer; seven convolutional layers with independent learnable filters (kernels), each with a size of [3 × 3]; and a local contrast normalization layer to eliminate the average from nearby pixels. A max-pooling layer is also included in order to flatten and create dense layers while also reducing the spatial dimension of the image. It is followed by a layer that is entirely connected and a SoftMax layer, which categorizes the seven emotions. The fully connected layer received a dropout of 0.5 to reduce over-fitting, and the rectified linear unit (ReLU) activation function was applied to all layers. After that, a SoftMax output layer that can classify is connected to the concatenation of the two comparable models.
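As a concrete illustration, the Keras sketch below reproduces the layer stack later summarized in Table 3. It is an assumption-laden reconstruction, not the authors' code: the 0.25 dropout rates before the dense head are guesses (only the 0.5 dropout on the fully connected layer is stated), and Table 3's augmentation sub-model ("sequential") is omitted.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the proposed DCNN; trailing comments give the Table 3 shapes.
model = keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(48, 48, 1)),  # normalize pixels
    layers.Conv2D(32, 3, activation="relu"),    # -> (46, 46, 32)
    layers.Conv2D(64, 3, activation="relu"),    # -> (44, 44, 64)
    layers.MaxPooling2D(),                      # -> (22, 22, 64)
    layers.Dropout(0.25),                       # assumed rate
    layers.Conv2D(64, 3, activation="relu"),    # -> (20, 20, 64)
    layers.Conv2D(64, 3, activation="relu"),    # -> (18, 18, 64)
    layers.Conv2D(128, 3, activation="relu"),   # -> (16, 16, 128)
    layers.MaxPooling2D(),                      # -> (8, 8, 128)
    layers.Conv2D(128, 3, activation="relu"),   # -> (6, 6, 128)
    layers.Conv2D(256, 3, activation="relu"),   # -> (4, 4, 256)
    layers.MaxPooling2D(),                      # -> (2, 2, 256)
    layers.MaxPooling2D(),                      # -> (1, 1, 256)
    layers.Dropout(0.25),                       # assumed rate
    layers.Flatten(),                           # -> (256,)
    layers.Dense(1024, activation="relu"),      # first fully connected layer
    layers.Dropout(0.5),                        # stated dropout of 0.5
    layers.Dense(7, activation="softmax"),      # seven emotion classes
])
model.summary()  # parameter counts match Table 3 (879,623 trainable params)
```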
Real-Time Classification

The output of the DCNN model is saved as a JSON string file. JavaScript Object Notation (JSON) was deemed suitable for this research because it stores and allows faster data exchange, according to Ingale et al. [23]. The model.to_json() function in Python is used to write the output of the trained model into JSON. A pre-trained Haar Cascade XML file for frontal face detection was imported for real-time facial feature classification. A multiscale detection approach was implemented via detectMultiScale (on grayscale inputs), and parameters such as the coordinates of the bounding boxes around detected faces and the scale factor were tuned for better efficiency.

In this research, OpenCV's Haar Cascade detects and extracts the face region from the webcam's video feed through the Flask app. This process follows a video conversion to grayscale, and the detected face is contoured or enclosed within a region surrounding the face (see Figure 3).

Figure 3. Facial detection and conversion to grayscale.
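A minimal sketch of this real-time loop is given below. It assumes the trained model was exported with model.to_json() plus separately saved weights; the file names, the alphabetical emotion-label ordering, and the detectMultiScale parameter values are assumptions, not values stated in the paper.

```python
import cv2
import numpy as np
from tensorflow import keras

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

# Rebuild the trained DCNN from its JSON architecture and saved weights
with open("model.json") as f:
    model = keras.models.model_from_json(f.read())
model.load_weights("model_weights.h5")

# OpenCV ships a pre-trained frontal-face Haar Cascade XML file
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # webcam video feed
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # grayscale conversion
    # Multiscale detection; scaleFactor and minNeighbors are tunable
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(roi.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
    cv2.imshow("FER", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```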
EXPERIMENTAL DETAILS

Dataset

The Kaggle repository's facial emotion recognition (FER-2013) dataset is used in this study. The FER-2013 dataset has 35,887 photos in total, of which 28,709 are tagged training images and the remaining 7,178 form the test set. The photographs in the FER-2013 dataset are categorized into seven universal emotions: happy, sad, angry, fear, surprise, disgust, and neutral. The photos are 48 × 48 pixels in size and grayscale. Table 2 provides a summary of the dataset. The models are trained on an Nvidia Tesla K80 GPU running on Google Cloud (Colab Research), a collaborative cloud-based alternative to an iPython or Jupyter Notebook.

Table 2. Summary of the FER-2013 dataset.

             Surprise  Fear  Angry  Neutral  Sad   Disgust  Happy  Total
train_count  3171      4097  3995   4965     4830  436      7215   28709
test_count   831       1024  958    1233     1247  111      1774   7178

Total count of the dataset: 35,887

Pre-processing

Each image sample underwent a single application of the face detection and registration techniques. These are required in order to correct any pose and illumination variations that may have resulted from the real-time facial detection operation. The Keras library was used in this research for the pre-processing of both the test and train images before passing them to the deep convolutional neural network (DCNN). This process includes cropping the detected faces and scaling the detected images. For data cleaning purposes, all images were resized to a target dimension of 48 × 48 × 1, converted into a grayscale color channel, and pre-processed in a batch size of 120. Square boxes were also used for facial landmark detection to apply illumination correction.
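A sketch of this loading and pre-processing step using a Keras utility is shown below. The directory layout (one sub-folder per emotion class) and the specific helper function are assumptions; the paper states only the 48 × 48 × 1 grayscale target size and the batch size of 120.

```python
import tensorflow as tf

# Load train/test images as 48 x 48 grayscale batches of 120
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fer2013/train",
    color_mode="grayscale",    # single color channel
    image_size=(48, 48),       # target dimension 48 x 48
    batch_size=120,            # pre-processing batch size
    label_mode="categorical",  # one-hot labels for categorical cross-entropy
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "fer2013/test",
    color_mode="grayscale",
    image_size=(48, 48),
    batch_size=120,
    label_mode="categorical",
    shuffle=False,             # keep order so labels align with predictions
)
```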
Training and Validation

This research used the Keras library in Python to accept images for training. To minimize the losses of the neural network during training, this research used a mini-batch gradient descent algorithm. This type of gradient descent algorithm was suitable because of its capability to find the coefficients or weights of neural networks by splitting the training dataset into small batches, i.e., a training batch and a validation batch, using data augmentation techniques. From Keras, a sequential model was implemented to define the flow of the DCNN, after which several other layers were added, as described in the proposed methodology above and shown in Figure 2. With an increase in the convolutional layers, a concurrent increase in the number of kernels was also adopted, used in parallel with the ReLU activation function for training the model. Dropouts were used at several layers of the DCNN to prevent overfitting of the model. A SoftMax activation function and an Adam optimizer were used to improve the classification efficiency of the model. This research also adopted a categorical cross-entropy loss function. A summary of the DCNN structure is shown in Table 3.

Table 3. Summary of the proposed DCNN layers.

Layer (Type)                    Output Shape         Param #
Rescaling (Rescaling)           (None, 48, 48, 1)    0
sequential (Sequential)         (None, 48, 48, 1)    0
conv2d_7 (Conv2D)               (None, 46, 46, 32)   320
conv2d_8 (Conv2D)               (None, 44, 44, 64)   18496
max_pooling2d_4 (MaxPooling2D)  (None, 22, 22, 64)   0
dropout_3 (Dropout)             (None, 22, 22, 64)   0
conv2d_9 (Conv2D)               (None, 20, 20, 64)   36928
conv2d_10 (Conv2D)              (None, 18, 18, 64)   36928
conv2d_11 (Conv2D)              (None, 16, 16, 128)  73856
max_pooling2d_5 (MaxPooling2D)  (None, 8, 8, 128)    0
conv2d_12 (Conv2D)              (None, 6, 6, 128)    147584
conv2d_13 (Conv2D)              (None, 4, 4, 256)    295168
max_pooling2d_6 (MaxPooling2D)  (None, 2, 2, 256)    0
max_pooling2d_7 (MaxPooling2D)  (None, 1, 1, 256)    0
dropout_4 (Dropout)             (None, 1, 1, 256)    0
flatten_1 (Flatten)             (None, 256)          0
dense_2 (Dense)                 (None, 1024)         263168
dropout_5 (Dropout)             (None, 1024)         0
dense_3 (Dense)                 (None, 7)            7175

Total params: 879,623
Trainable params: 879,623
Non-trainable params: 0
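Under the hyperparameters listed later in Table 4, the training configuration described here could look like the following sketch (it continues the model and datasets from the earlier snippets; it is illustrative, not the authors' script).

```python
import tensorflow as tf

# Adam optimizer with lr = 0.001 and categorical cross-entropy loss (Table 4)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(
    train_ds,                 # mini-batches of 120 images each
    validation_data=test_ds,  # hold-out data for validation
    epochs=80,                # number of epochs (Table 4)
)
```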
Image Data Augmentation

For efficient generalization and optimized model performance, augmentation was used to expand the size of the training dataset by creating several modified variations of the images using the ImageDataGenerator class in Keras. Here, a target image (at index 0) was converted into augmented variations and looped through nine instances of different emotions to see what predictability would look like during the test, as illustrated in Figure 4. A sketch of this step follows below.
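The sketch below uses Keras' ImageDataGenerator, as named above; the particular transforms and their ranges are illustrative assumptions, since the paper does not list them, and the 80/20 training/validation split is likewise assumed.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Expand the training set with randomly modified variations of each image
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,       # small random rotations (assumed range)
    width_shift_range=0.1,   # horizontal shifts
    height_shift_range=0.1,  # vertical shifts
    zoom_range=0.1,          # random zoom
    horizontal_flip=True,    # mirror faces left/right
    validation_split=0.2,    # training batch vs. validation batch
)
train_gen = datagen.flow_from_directory(
    "fer2013/train", target_size=(48, 48), color_mode="grayscale",
    batch_size=120, class_mode="categorical", subset="training")
val_gen = datagen.flow_from_directory(
    "fer2013/train", target_size=(48, 48), color_mode="grayscale",
    batch_size=120, class_mode="categorical", subset="validation")
```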

Figure 4. Result of image data augmentation.

Hyperparameter Tuning

A hyperparameter is a parameter whose value is used to control the training process of a model (neural network). Hyperparameter optimization entails choosing a set of optimal hyperparameters for a learning algorithm to improve the generalization or predictability of a model. For this research, the random search technique was used, wherein random combinations of a range of values for each hyperparameter were tried to find the best possible combination. The effectiveness of the random search technique for hyperparameter optimization has been demonstrated in Ref. [24]. Table 4 lists the hyperparameters used by the proposed model and their corresponding values.

Table 4. Hyperparameters used by the proposed model and their corresponding values.

Hyperparameter           Value   Description
Batch size               120     The number of images (from the training set) propagated through the network at a go.
Number of epochs         80      The number of complete passes through the training set.
Optimizer                Adam    Adam optimization algorithm.
Learning rate (lr)       0.001   Controls the speed at which the weights of the network are updated during training.
FC1 neurons              1024    Total number of neurons in the first fully connected layer.
FC2 neurons              512     Total number of neurons in the second fully connected layer.
Dropout                  0.5     Dropout rate between fully connected layers.
Convolution kernel size  3 × 3   Size of the kernel in each convolution layer.
MaxPooling kernel size   2 × 2   Size of the kernel in each MaxPooling layer.
MaxPooling strides       2       Kernel stride in each MaxPooling layer.
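A minimal random-search loop of the kind described is sketched below. The build_model factory, the search ranges, and the number of trials are hypothetical; the paper does not specify them.

```python
import random

# Candidate ranges for each hyperparameter (illustrative assumptions)
space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.3, 0.4, 0.5],
    "fc1_neurons": [512, 1024],
    "batch_size": [64, 120, 128],
}
best_acc, best_params = 0.0, None
for _ in range(10):  # 10 random trials
    params = {k: random.choice(v) for k, v in space.items()}
    model = build_model(**params)  # hypothetical model factory
    hist = model.fit(train_ds, validation_data=test_ds, epochs=5, verbose=0)
    val_acc = max(hist.history["val_accuracy"])
    if val_acc > best_acc:
        best_acc, best_params = val_acc, params
print("best validation accuracy:", best_acc, "with", best_params)
```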

RESULT ANALYSIS AND DISCUSSION

The algorithm and facial emotion recognition model proposed by this research are based on two principal ideas. The first is the utilization of a high-capacity deep convolutional neural network for feature extraction and emotion classification, which employs a single classifier for detecting faces from multiple views over both real-time scenarios and digital images or video frames. This research also sought to optimize the computational complexity of the deep convolutional neural network (DCNN) by modifying the architecture through the addition of layers to improve pattern identification in real-time or digital images. The additional layers apply more convolution filters to the image to detect its features. To further enhance the model's predictive efficiency and accuracy, the number of training epochs was increased to 80.

The proposed model for emotion recognition uses three steps, namely face detection (see Figure 3), feature extraction, and emotion classification, and achieves better results than the previous models. In the suggested method, validation accuracy rises as calculation time decreases, and validation loss is significantly decreased. The FER-2013 dataset, which includes the seven main emotions (sad, fear, happiness, angry, neutral, surprised, and disgust), was used to test the proposed DCNN model.

Figure 5 indicates the test sample result associated with the happy emotion from the digital test image. The proposed model also predicted the identical emotion with decreased computation time compared to preceding models discussed in the literature review.

Table 5 describes the metrics used in measuring the success of the CNN model developed for this research. Given the necessary modifications made as proposed earlier, it was observed that, on average, the proposed model had a predictive accuracy of 70%. The weighted average over the test dataset is also 70%.

Figure 5. Result of real-time test sample associated with the HAPPY emotion.

Table 5. Evaluation metrics for the proposed CNN.

              Precision  Recall  F1-Score  Support
Angry         0.62       0.64    0.63      958
Disgust       0.73       0.34    0.45      111
Fear          0.56       0.42    0.48      1024
Happy         0.91       0.91    0.91      1774
Neutral       0.64       0.71    0.67      1233
Sad           0.56       0.62    0.59      1247
Surprise      0.79       0.81    0.72      831
Accuracy                         0.70      7178
Macro avg     0.69       0.65    0.66      7178
Weighted avg  0.70       0.70    0.70      7178

Figure 6. Plots of the training and validation accuracy.

Figure 7. Plot of the training and validation loss.

Figure 8. Confusion matrix of the model on emotion prediction.

After the training was done, the model was evaluated, and the training and validation accuracy (Figure 6) and loss (Figure 7) were computed. Figures 6 and 7 show the CNN model's learning curves, which plot the model's learning performance over experience or time. After each update, the model was evaluated on the training dataset and a hold-out validation dataset. Learning curves were preferred as a graphic metric for this research because of their wide adoption in machine learning for models that optimize their internal parameters incrementally. From the learning curve in Figure 7, it can be observed that the plots of training loss and validation loss decline to the point of stability with a minimal generalization gap. Also, it can be inferred from Figure 6 that the plots of training accuracy and validation accuracy surge with an increase in training sequence and batch size, with a minimal generalization gap. Thematically, the model can be considered a good fit and is expected to generalize efficiently.

Figure 8 shows the CNN model's confusion matrix on the FER-2013 testing set. It was inferred that the model generalizes best in the classification of the "happiness" and "surprise" emotions. In contrast, it performs reasonably average when classifying the "disgust" and "fear" emotions. This reduction in classification accuracy for "disgust" and "fear" can be ascribed to the reduced number of training set samples for the two classes. The confusion between "fear" and "sadness" may be due to the inter-class similarities of the dataset.
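For illustration, curves like those in Figures 6 and 7 and metrics like those in Table 5 and Figure 8 can be produced from the Keras History object and scikit-learn, as sketched below. This assumes test_ds was built with shuffle=False (as in the earlier snippet) so that labels align with predictions; it is not the authors' evaluation code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

# Learning curves from the History object returned by model.fit
for key in ("accuracy", "loss"):
    plt.figure()
    plt.plot(history.history[key], label=f"train {key}")
    plt.plot(history.history[f"val_{key}"], label=f"validation {key}")
    plt.xlabel("epoch")
    plt.legend()
plt.show()  # cf. Figures 6 and 7

# Per-class precision/recall/F1 (cf. Table 5) and confusion matrix (cf. Figure 8)
y_true = np.concatenate([np.argmax(y, axis=1) for _, y in test_ds])
y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```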

CONCLUSION

This paper proposed a deep CNN model for real-time facial expression recognition based on seven emotional classes ('neutral', 'happy', 'sad', 'angry', 'surprised', 'fear', and 'disgusted'). The structure of the proposed model has good generality and classification performance. First, a variety of well-classified, high-quality databases were acquired. Then, the face region is detected, cropped, and converted into a grayscale image of one channel to remove unnecessary information. Image data augmentation, which increases the number and variations of training images, is applied to address the problem of overfitting. Hyperparameter tuning was employed to achieve a state-of-the-art classification accuracy of 70.04%.

In the proposed hybrid architecture, an optimal structure to reduce execution time and improve the classification performance on real-time and digital images was developed. This was done by adjusting the number of feature maps in the convolutional layers, the number of layers in the neural network model, and the number of training epochs. Cross-validation experiments showed that the proposed convolutional neural network (CNN) architecture has better classification performance and generality than some state-of-the-art (SoTA) models. The Haar Cascade model employed for real-time facial detection showed better classification performance than other implementations. Experimental results confirmed the effectiveness of the data pre-processing and augmentation techniques.

Some shortcomings of this research were a deficiency in predicting the 'disgust' and 'angry' emotions due to an insufficient training dataset for the two class categories. Another issue the proposed model faces is the reduced generality of real-time predictions. A major causative factor stems from the postured nature of the training images and the environmental conditioning (lighting) of real-time test images. A significant impact of this research in human–computer interaction (HCI) will be the upscaling of software and AI systems to deliver an improved experience to humans in various applications, for instance, in home robotic systems for the recommendation of mood-based music. This research will enhance psychological prognosis by assisting psychologists in detecting probable suicides and emotional traumas, and it can also be employed in the marketing, healthcare, and gaming industries.

FUTURE RESEARCH PROPOSAL

Recognizing human facial expressions is an important practice since it helps to determine a subject's mood or emotional state while they are being observed. Numerous situations could benefit from the straightforward idea of a machine being able to recognize a person's emotional state. As such, numerous potentials remain untapped in this area. State-of-the-art real-time algorithms (Faster R-CNN, HOG + Linear SVM, SSDs, YOLO) are encouraged to be employed in future research as well. In addition, advancements to this research may include the utilization of regression in analyzing bio-signals and physiological multimodal features to determine the levels and intensities of the emotion classes discussed herein.

ABBREVIATIONS

DCNN: Deep Convolutional Neural Network
CNN: Convolutional Neural Network
FER: Facial Emotion Recognition
GPU: Graphical Processing Unit
HCI: Human–Computer Interaction
ECG: Electrocardiography
fMRI: Functional Magnetic Resonance Imaging
AFEW: Acted Facial Expressions in the Wild
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
HMM: Hidden Markov Model
ReLU: Rectified Linear Unit
TFE: Tandem Facial Expression
ANN: Artificial Neural Network
SVM: Support Vector Machine
IRNN: Identity Recurrent Neural Network
FERET: Facial Recognition Technology
SoTA: State-of-the-Art

DISCLOSURE

The authors declare that they have no competing interests.

AUTHOR CONTRIBUTIONS

All authors made significant contributions to conception and design, data acquisition, analysis, and interpretation; participated in drafting the article and critically revising it for important intellectual content; agreed to submit it to the current journal; and gave final approval of the version to be published.

DATA AVAILABILITY STATEMENT

This research paper analyzes the Facial Emotion Recognition (FER-2013) dataset housed by the Kaggle repository [https://www.kaggle.com/datasets/msambare/fer2013].

REFERENCES

[1] O. A. Montesinos López, A. Montesinos López, and J. Crossa, "Overfitting, Model Tuning, and Evaluation of Prediction Performance," in Multivariate Statistical Machine Learning Methods for Genomic Prediction. Springer, Cham, 2022. https://doi.org/10.1007/978-3-030-89010-0_4

[2] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion." J Personal Soc Psychol 17(2): 124, 1971.
[3] M. Sajid, N. I. Ratyal, N. Ali, B. Zafar, S. H. Dar, M. T. Mahmood, and Y. B. Joo, "The impact of asymmetric left and asymmetric right face images on accurate age estimation." Math Probl Eng 2019: 1–10.
[4] S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pp. 467–474. ACM.
[5] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting Large, Richly Annotated Facial-Expression Databases from Movies." IEEE Multimedia 19(3): 34–41, July 2012.
[6] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units." arXiv preprint arXiv:1504.00941, 2015.
[7] N. I. Ratyal, I. A. Taj, M. Sajid, N. Ali, A. Mahmood, and S. Razzaq, "Three-dimensional face recognition using variance-based registration and subject-specific descriptors." Int J Adv Robot Syst 16(3), 2019: 1729881419851716.
[8] N. I. Ratyal, I. A. Taj, M. Sajid, N. Ali, A. Mahmood, S. Razzaq, S. H. Dar, M. Usman, M. J. A. Baig, and U. Mussadiq, "Deeply learned pose invariant image analysis with applications in 3D face recognition." Math Probl Eng 2019: 1–21.
[9] M. Li, H. Xu, X. Huang, Z. Song, X. Liu, and X. Li, "Facial Expression Recognition with Identity and Emotion Joint Learning." IEEE Transactions on Affective Computing, 2018, pp. 1–1.
[10] W. Mou, O. Celiktutan, and H. Gunes, "Group-level arousal and valence recognition in static images: Face, body and context," in 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 2015.
[11] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao, "Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image-based CNNs," in 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 2017.
[12] A. Giusti, D. C. Cireşan, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Fast image scanning with deep max-pooling convolutional neural networks," in 2013 IEEE International Conference on Image Processing, ICIP 2013 – Proceedings, 2013. DOI: 10.1109/ICIP.2013.6738831.
[13] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017, 2018, vol. 2018-January. DOI: 10.1109/ICEngTechnol.2017.8308186.
[14] B. Han, J. Sim, and H. Adam, "BranchOut: Regularization for online ensemble tracking with convolutional neural networks," in Proceedings – 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017, vol. 2017-January. DOI: 10.1109/CVPR.2017.63.
[15] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in 32nd International Conference on Machine Learning, ICML 2015, 2015, vol. 1.
[16] K. Liu, M. Zhang, and Z. Pan, "Facial Expression Recognition with CNN Ensemble," in Proceedings – 2016 International Conference on Cyberworlds, CW 2016, 2016.
[17] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, 2016. DOI: 10.1109/WACV.2016.7477450.
[18] M. I. Georgescu, R. T. Ionescu, and M. Popescu, "Local learning with deep and handcrafted features for facial expression recognition," IEEE Access, vol. 7, 2019. DOI: 10.1109/ACCESS.2019.2917266.
[19] R. T. Ionescu, M. Popescu, and C. Grozea, "Local Learning to Improve Bag of Visual Words Model for Facial Expression Recognition," in Workshop on Challenges in Representation Learning, ICML, 2013.
[20] P. Giannopoulos, I. Perikos, and I. Hatzilygeroudis, "Deep learning approaches for facial emotion recognition: A case study on FER-2013," in Smart Innovation, Systems and Technologies, 2018, vol. 85. DOI: 10.1007/978-3-319-66790-4_1.
[21] A. Handa, R. Agarwal, and N. Kohli, "Incremental approach for multi-modal face expression recognition system using deep neural networks," International Journal of Computational Vision and Robotics (IJCVR), vol. 11, no. 1, pp. 1–20, 2021.
[22] A. Saravanan, G. Perichetla, and K. Gayathri, "Facial Emotion Recognition using Convolutional Neural Networks," arXiv.org, 2019. [Online]. Available: https://arxiv.org/pdf/1910.05602.pdf. [Accessed: 12-Oct-2019].
[23] S. Ingale, A. Kadam, and M. Kadam, "Facial Expression Recognition Using CNN with Data Augmentation," International Research Journal of Engineering and Technology (IRJET), 2021. [Online]. Available: https://www.irjet.net/archives/V8/i4/IRJET-V8I4685.pdf. [Accessed: 28-Apr-2022].
[24] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
