Multimodal recognition with deep learning: audio, image, and text
Corresponding Author:
Ravi Gummula
Department of Electronics and Communication Engineering, Dr. M.G.R. Educational and Research Institute
Chennai, Tamil Nadu, India
Email: [email protected]
1. INTRODUCTION
Many intelligent systems could benefit greatly from automated emotion recognition, including those
used in online gaming, digital advertising, healthcare, and consumer feedback collection. For instance, by
adapting the game interface to the user's emotional state, an emotion recognition module could
enhance player engagement in online gaming. Similarly, a live emotion detection
module can give a selling firm immediate emotional feedback when a customer shops online, allowing
it to present the customer with new offers. Healthcare providers can better monitor their patients' physical and
mental health via emotion detection, which in turn helps them provide the most appropriate medication or
therapy [1].
Intelligent conversational systems, smart cities, affect-aware e-health, affect-aware learning, and
travel recommendation systems are just a few of the many applications that use emotionally aware
intelligent systems [2]. Many such systems rely on textual or emoji-based inputs. For example, emotion-aware
e-health systems have been proposed that search for certain keywords in patient input in
order to identify emotions [3]. Trip recommendation algorithms based on context or emotion, as well as
affect-aware learning technologies, have also been developed [4]–[6]. An affect-aware smart city could improve
people's quality of life by recognizing and displaying emotions through
hashtags, keywords, and emojis [7]. Each of these systems primarily relies on text or emojis to detect
emotions [8].
Video, audio, short words, emojis, long texts, short messages, and facial expressions are some of
the inputs that may be utilized to detect emotions, and applications employ a variety of formats for these inputs.
For instance, whereas video is often used in gaming systems, short messages and emojis are more
common on social networks. Additionally, systems that can identify emotions from electroencephalogram
(EEG) data have been introduced recently [9], [10]. However, wearing an EEG cap is invasive and
uncomfortable for the wearer. A literature review found that, when it comes to emotion identification, single-
modal input often fails to meet the necessary accuracy standards [11], [12].
A technique for audio-visual emotion recognition based on deep-network feature extraction and
fusion is presented in this study [13]. Non-linear feature fusion is then performed using support vector
machines (SVMs). Although deep learning (DL) has seen extensive use in image, video, and audio
processing [14], the accuracy of DL models depends on the available data and the structure of the model.
This research makes three contributions: the proposed system takes advantage of a large
emotion dataset for training; it uses a three-dimensional convolutional neural network (CNN) with a dedicated
key-frame selection technique for video signals; and it employs a variety of informative patterns for feature
extraction, including gray-scale key frame images, local binary pattern (LBP) images, and interlaced derivative
pattern (IDP) images.
2. LITERATURE SURVEY
As highlighted by Kudiri et al. [15], most research in this field has utilized asynchronous data and
unimodal or multimodal systems. As a result, incorrect synchronization has become a common issue,
increasing system complexity and decreasing response time. To address this, a method has been
developed to anticipate human emotions from speech and facial expressions. This method employs two
feature vectors: relative bin frequency coefficients (RBFC) for voice data and relative sub-image based (RSB)
coefficients for visual data. The fusion of the two modalities is based on feature-level
categorization and uses an SVM with a radial basis kernel.
According to Wang et al. [16], estimating human emotions with a computer has proven
challenging during conversational breaks. Their study uses a hybrid approach combining speech and facial
expressions to measure fundamental emotions. The approach employs RBFC for audio data and RSB features
for visual data, with classification performed by radial basis kernel SVMs. The findings indicate that the
proposed feature extraction method significantly enhances the emotion recognition system. It demonstrates
that the bimodal emotion recognition system, utilizing both speech and facial expressions, outperforms
unimodal methods.
According to Khalil et al. [17], recognizing emotions in spoken language is a challenging part of
human-computer interaction (HCI). DL approaches are introduced as an alternative to conventional machine
learning (ML) methods. Their article surveys various approaches and examines current research on emotion
identification using DL, covering the databases used, enhancements to speech emotion identification,
and its limitations.
3. PROBLEM STATEMENT
This study proposes a hybrid approach that uses speech and facial expressions to find fundamental
emotions in a speaker during a conversational break [18]. RBFC and RSB features are used for the audio and
visual data, respectively. For classification, an SVM with a radial basis kernel is used. This study's findings
indicate that, in conjunction with the fusion approach, the most significant factor influencing the emotion
recognition system is feature extraction from speech and facial expressions. While some other factors may
affect the system's ability to identify emotions, their impact is comparatively small [19]. With deliberately
posed facial expressions, the unimodal emotion identification system was shown to outperform the bimodal
emotion detection system. A suitable database is employed to address the
problem [20]. The suggested emotion recognition system outperformed the others on the fundamental
emotion classes.
4. PROPOSED MODEL
Emotions are important to human perception and communication and have a substantial influence
on actions and choices. Consequently, emotion detection has been trending recently, with a focus on human
emotion recognition and classification across several media types (e.g., images, text, and audio) [21]. In our
research, we adopt a multi-modal approach to emotion recognition, encompassing text, image, and audio
modalities. Our goal is to identify the seven primary emotions: anger, fear, disgust, neutral, happy, sorrow,
and surprise across different data sources, leveraging CNN architectures for image and audio data, and long
short-term memory (LSTM) networks for textual data. Similarly, for audio-based emotion recognition, we
employ CNN architectures trained on datasets like the RAVDESS audio dataset. These networks analyze
spectrograms or other representations of audio signals to identify emotional cues conveyed through speech.
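As an illustration of this step, the following is a minimal Keras sketch of a 2D CNN for seven-class emotion classification from spectrogram-like inputs; the input shape and layer sizes are assumptions made for illustration, not the exact architecture used in this work.

from tensorflow.keras import layers, models

def build_audio_cnn(input_shape=(128, 128, 1), num_classes=7):
    # Assumed input: a log-Mel spectrogram resized to 128x128 with one channel.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),  # seven emotion classes
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model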
For text-based emotion analysis, we employ LSTM networks due to their ability to capture
sequential dependencies in textual data. We preprocess conversational text and feed it into LSTM networks
for emotion classification. Additionally, we utilize the term frequency-inverse document frequency (TF-IDF)
vectorizer to transform textual conversations into numerical representations, capturing the importance of
individual terms in expressing emotions within the corpus. Through this integrated approach, we aim to
leverage the strengths of each modality to enhance the accuracy and robustness of emotion recognition
systems [22]. By experimenting with various CNN and LSTM architectures tailored for image, audio, and
text modalities, our research strives to advance the state-of-the-art in multi-modal emotion recognition,
offering valuable insights into the complexities of human emotions expressed through language, facial
expressions, and speech [23].
− The proposed multi-modal system integrates text, image, and audio modalities, enhancing accuracy and
robustness in emotion recognition.
− It offers a comprehensive understanding of human emotions, improving versatility and potential for real-world applications.
For the proposed emotion recognition system, we have devised an LSTM architecture tailored for
text data. The LSTM model depicted in Figure 1 consists of multiple layers designed to capture sequential
dependencies in textual inputs effectively. Figure 1 outlines the detailed architecture specifications of the
LSTM model.
Following the LSTM layers, a fully connected neural network with hidden layers is employed to
further process the learned features. A SoftMax function is then applied to the output of the fully connected
layer to generate probability distributions over the target emotion classes. Subsequently, the output of the
SoftMax layer is passed through a classifier for final emotion classification.
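The description above can be made concrete with a minimal Keras sketch; the vocabulary size, sequence length, and layer widths below are assumptions, while the 400-dimensional embedding matches the size reported later in this paper.

from tensorflow.keras import layers, models

def build_text_lstm(vocab_size=20000, embed_dim=400, max_len=100, num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),          # word vectors learned with the network
        layers.LSTM(128, return_sequences=True),          # stacked LSTM layers capture sequence context
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),              # fully connected hidden layer
        layers.Dense(num_classes, activation="softmax"),  # probabilities over the emotion classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model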
However, conventional ML techniques exhibit limitations in accurately recognizing emotions from
text data. Enhancing the accuracy of emotion recognition in text-based inputs remains a significant objective
in the domain of artificial intelligence and DL algorithms. Traditional text-based emotion recognition
datasets often achieve suboptimal accuracy rates. To address these limitations and improve emotion
recognition accuracy, this study proposes the following innovations:
− Integration of a supervised encoder for feature extraction from textual inputs.
− Utilization of LSTM networks to analyze textual inputs and perform emotion recognition.
− Implementation of neural networks for collecting and visualizing emotional cues conveyed through text.
− Deployment of a CNN algorithm for analyzing emotions from text data within intelligent learning environments.
− Classification enhancement algorithms based on ML techniques for improved accuracy in text-based emotion recognition.
Compared to traditional text-based emotion recognition approaches, the LSTM architecture presented in this study demonstrates superior accuracy and yields better recognition results.
4.2.1. Text
The text module has a robust dataset consisting of 32,500 records, each of which has two essential
columns: "text" and "emotion". The "text" column lists all the materials that our model needs for its rigorous
training, and the "emotion" column shows which emotion category each text item belongs to. Despite some
initial setbacks, we eventually located and selected several text sources that met our established criteria,
allowing us to compile a sizable corpus of text data and accomplish our research goals. Despite the challenges
encountered while gathering the data, we took great care to select reliable and authentic sources so that the
dataset remained accurate and applicable. Emotion expressions in real-world data
sources are naturally distributed and variable, making it difficult to achieve a perfect balance for all emotion
classes. Nevertheless, we were able to capture a wide range of patterns by intentionally crafting the dataset
with an emphasis on inclusion and variety. With seven distinct emotions represented by the encoded classes
in the dataset, the records in each category exhibit a wide range of distributions. Our subsequent research and
model development meticulously addressed any potential class imbalances to ensure the text emotion
recognition (TER) model's unbiased performance. The extensive and detailed text data gathered from sources
allows for a full exploration of textual expressions for emotion recognition.
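A brief pandas sketch of how such a corpus can be loaded and its class balance inspected is given below; the file name text_emotions.csv and the integer encoding are hypothetical, while the column names and record count follow the description above.

import pandas as pd

df = pd.read_csv("text_emotions.csv")        # hypothetical file with "text" and "emotion" columns
print(len(df))                               # expected to be about 32,500 records
print(df["emotion"].value_counts())          # distribution across the seven emotion classes

# Simple integer encoding of the emotion labels for model training
labels = sorted(df["emotion"].unique())
label_to_id = {name: i for i, name in enumerate(labels)}
df["label"] = df["emotion"].map(label_to_id)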
4.2.2. Image
We used a dataset found on Kaggle for the image module to construct our study on face-based
emotion recognition. In all, the dataset includes around a thousand images, classified into seven distinct
groups according to emotional state, as shown in Figure 2. To address potential data
imbalances and increase the variety of the image data, we applied data augmentation techniques. Augmentation
improves our model's generalizability by enriching the dataset with synthetic variants of the existing
images.
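A hedged Keras sketch of this kind of augmentation is shown below; the transformation ranges, image size, and directory layout are assumptions, not the exact settings used here.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,         # normalize pixel values
    rotation_range=15,         # small random rotations
    width_shift_range=0.1,     # horizontal shifts
    height_shift_range=0.1,    # vertical shifts
    zoom_range=0.1,
    horizontal_flip=True,
)

train_flow = datagen.flow_from_directory(
    "faces/train",             # hypothetical path: one subfolder per emotion class
    target_size=(48, 48),
    color_mode="grayscale",
    class_mode="categorical",
)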
4.2.3. Audio
The audio dataset covers seven distinct emotions: happy, disgust, scared, surprised, sad, angry, and neutral.
The unprocessed audio samples are first converted into a suitable format (such as WAV files). Rather than
extracting Mel-frequency cepstral coefficients (MFCCs), each sample is represented as a Mel-spectrogram. To
obtain a Mel-spectrogram from a speech signal, follow the steps below; the process is shown in Figure 3, and a
code sketch follows the list.
− Frame the signal: divide the speech signal into 40-millisecond frames with a 50% overlap between
consecutive frames.
− Apply a window function: multiply each frame by a Hamming window to reduce spectral leakage.
− Compute the fast Fourier transform (FFT): perform an FFT on each windowed frame to convert it from
the time domain to the frequency domain.
− Filter the frequencies: apply 25 band-pass filters to the frequency-domain signal. These filters are
designed to match the critical bandwidths of human auditory perception, with their center frequencies
distributed according to the Mel scale.
− Logarithmic compression: apply a logarithm function to the filter outputs to compress the dynamic range.
− Construct the Mel-spectrogram: arrange the logarithmically compressed filter outputs from each frame to
form the Mel-spectrogram of the signal.
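The listed steps can be sketched with librosa as follows; the 16 kHz sampling rate is an assumption, while the 40 ms frames, 50% overlap, Hamming window, 25 Mel bands, and log compression follow the steps above.

import librosa

def mel_spectrogram(path, sr=16000, n_mels=25):
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(0.040 * sr)               # 40 ms frames
    hop_length = n_fft // 2               # 50% overlap between consecutive frames
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        window="hamming", n_mels=n_mels)  # Mel-scaled filter bank
    return librosa.power_to_db(mel)       # logarithmic compression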
5. METHOD
Our goal is to develop a multi-input model capable of processing text, images, and audio, while
simultaneously analyzing existing emotion recognition models in detail. Data collection and preprocessing
were the first steps of the methodology. This research collects datasets from a variety of reputable
sources, including academic journals, repositories, and organizations such as Hugging Face, Inc., in order to
validate the data's authenticity and relevance. Previous researchers' contributions are greatly appreciated, and
all sources used in this work are cited to maintain academic honesty. Following
the data collection phase, meticulous data cleaning and preparation took place. To ensure data integrity and
quality, significant updates were made to the databases. The data preparation techniques utilized are
discussed in further detail in the sections that follow.
The method integrates CNNs for image and audio data and LSTM with TF-IDF vectorization for text
data, aligning with the proposed system's multi-modal approach [24]. For image data, CNN architectures are
employed, leveraging datasets such as the emotion facial expression photographs dataset. These CNN models
are trained to capture spatial features and patterns in facial expressions, facilitating accurate emotion
classification from images.
Similarly, for audio data, CNN architectures trained on datasets such as the RAVDESS audio dataset are
utilized. These models analyze spectrograms or other audio representations to identify emotional cues
conveyed through speech. For text data, the method utilizes LSTM networks coupled with TF-IDF
vectorization. LSTM networks excel at capturing sequential dependencies in textual data, while TF-IDF
vectorization transforms conversational text into numerical representations, capturing term importance. This
combined approach enables the accurate classification of emotions from textual content.
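A minimal scikit-learn sketch of the TF-IDF step is given below; the example sentences and parameter choices are placeholders for the preprocessed conversational corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["i am so happy today", "this is terrifying", "what a pleasant surprise"]
vectorizer = TfidfVectorizer(max_features=5000, lowercase=True)
X = vectorizer.fit_transform(texts)           # sparse document-term matrix of TF-IDF weights
print(X.shape, len(vectorizer.vocabulary_))   # terms weighted by importance in the corpus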
The method ensures meticulous data preprocessing to enhance data quality and consistency across
all modalities. Techniques such as data cleansing and enhancement are employed to address challenges
associated with emotion recognition datasets. In the modeling phase, individual ML models are developed for
each data modality, tailored to leverage the unique characteristics of text, image, and audio data. This
includes designing CNN architectures for image and audio data and implementing LSTM networks with TF-
IDF vectorization for text data.
A key feature of the research is the development of a multi-input model capable of accommodating
text, image, and speech inputs. This model seamlessly directs inputs to appropriate sub-models based on their
modality, enhancing emotion recognition accuracy across different data types. Efficiency and performance
considerations are carefully balanced throughout the research design, ensuring accuracy while maintaining
practicality. Overall, the method aligns closely with the proposed system's objectives and approach,
facilitating a comprehensive study of emotion recognition across text, image, and audio modalities [25]. By
contributing to the development of emotion detection algorithms with a deeper understanding of human
emotions, our work opens the door to more sophisticated applications in affective computing and HCI.
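One way to realize the routing idea described above is sketched below; the sub-model objects and preprocessing helpers are assumed to exist elsewhere and are not part of this paper's code.

def predict_emotion(sample, modality, text_model, image_model, audio_model,
                    preprocess_text, preprocess_image, preprocess_audio):
    # Dispatch the input to the sub-model that matches its modality.
    if modality == "text":
        return text_model.predict(preprocess_text(sample))
    if modality == "image":
        return image_model.predict(preprocess_image(sample))
    if modality == "audio":
        return audio_model.predict(preprocess_audio(sample))
    raise ValueError(f"Unsupported modality: {modality}")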
6. IMPLEMENTATION
Building advanced emotion detection models capable of reliably identifying and classifying
emotions across different modalities was the primary goal of our study. In this section, we detail the steps
involved in creating our models, from selecting appropriate architectures and algorithms to fine-tuning
hyperparameters. The goal of our research was to develop a set of models that can accurately predict how
people feel from text, image, and audio inputs, utilizing current technology and best practices. Achieving
state-of-the-art performance was the major objective of our modeling effort, which would advance emotion
recognition technology and enable transformative applications across several sectors.
The model weights are updated by gradient descent:

\theta_{w,k+1} = \theta_{w,k} - \eta \cdot \frac{\partial L}{\partial \theta_{w,k}} \quad (1)

where \theta_{w,k} denotes the weight vector at iteration k, \eta is the learning rate, and L is the loss function.
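A small NumPy illustration of this update is given below; the weight and gradient values are placeholders, not quantities from the trained model.

import numpy as np

eta = 0.01                        # learning rate
theta = np.array([0.5, -0.3])     # current weights theta_{w,k}
grad = np.array([0.2, -0.1])      # placeholder gradient dL/dtheta_{w,k}
theta_next = theta - eta * grad   # theta_{w,k+1} per Eq. (1)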
Using these word vectors, we can find semantic relationships between words. For example, they capture
the "male-female" relationship between "man" and "woman" as well as between "king" and "queen". In
addition, they can recognize relationships between different tenses of verbs such as "walk", "walks",
"walked", and "walking".
Our LSTM model's word vectors are trained in conjunction with the neural network and are first
constructed at the embedding layer. Since we used an embedding size of 400, each word is represented in a
400-dimensional vector space. To prepare the text, unnecessary characters are removed or replaced with
regular spaces; these include tab characters, newline characters, excessive spaces, and other similar
characters.
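A minimal sketch of this cleaning step is shown below, assuming simple regular-expression replacement is sufficient.

import re

def clean_text(text):
    text = text.replace("\t", " ").replace("\n", " ")  # tabs and newlines become spaces
    text = re.sub(r"\s+", " ", text)                   # collapse runs of whitespace
    return text.strip()

print(clean_text("I am\tso   happy\ntoday"))           # -> "I am so happy today"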
Tokenization is performed using the spaCy library. Since spaCy does not provide a
parallel/multicore tokenizer, we resort to the fast.ai package; compared with its serial counterpart,
the parallel spaCy tokenizer is much faster because it makes use of all CPU cores. The main objective of
tokenization is to separate the text into individual tokens and assign a unique index to each token. For
our model to work, the text must be converted into integer indices. The next step is numericalization,
which turns tokens into numbers. This process comprises the following steps (a code sketch follows the list):
− Compiling an ordered list of all the words appearing in the text.
− Replacing each word with its index in that list.
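The sketch below illustrates tokenization and numericalization with plain spaCy; the paper's parallel fast.ai wrapper is omitted, and the minimum word frequency of two follows the next paragraph.

import spacy
from collections import Counter

nlp = spacy.blank("en")                        # tokenizer-only English pipeline

def tokenize(text):
    return [tok.text for tok in nlp(text)]

texts = ["i am happy", "i am so so happy today"]
tokens = [tokenize(t) for t in texts]

# Ordered vocabulary, keeping only words that appear at least twice
counts = Counter(tok for sent in tokens for tok in sent)
vocab = [w for w, c in counts.most_common() if c >= 2]
word_to_idx = {w: i + 1 for i, w in enumerate(vocab)}   # index 0 reserved for unknown words

numericalized = [[word_to_idx.get(tok, 0) for tok in sent] for sent in tokens]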
Our focus is not on the whole vocabulary list, since words that occur only once carry little information and
are easily missed. A word has to appear two or more times before it is included in the vocabulary; the model
cannot be trained on infrequent words. Stacked architectures in DL, such as LSTM-CNN (see Algorithm 2),
combine the characteristics of LSTM networks and CNNs, leveraging their distinct capabilities for sequential
data handling and spatial pattern processing, respectively. An illustrative sketch of such a stacked architecture
is given below.
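The following Keras sketch shows one common way to stack an LSTM with a 1D convolution for text classification; it is not the paper's Algorithm 2, and all sizes are illustrative assumptions.

from tensorflow.keras import layers, models

def build_lstm_cnn(vocab_size=20000, embed_dim=400, max_len=100, num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.LSTM(128, return_sequences=True),    # sequential dependencies across the sentence
        layers.Conv1D(64, 3, activation="relu"),    # local patterns over the LSTM outputs
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model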
7. MODULE DESCRIPTION
This system is intended to identify emotions in three distinct forms of input: text, speech, and facial
expressions. The main objective is to use CNN, a kind of deep learning model made to find patterns in data,
to train distinct models to correctly identify emotions in these various sources of input. To start, datasets with
text, audio, and facial photos representing a range of emotions including happiness, sadness, rage, and disgust
are prepared. After preprocessing each dataset to guarantee consistency in size and quality, the data are
labeled with the appropriate emotions. When the data is prepared, CNN models are trained on each form of
data: text for written emotions, audio for voice, and photos for facial expressions. Following training, any
new picture, audio file, or text input from the user can be used by the system to anticipate emotions.
− Upload facial emotion dataset: gather images showing different facial expressions such as happy, sad, angry, and disgust.
− Preprocess dataset: ensure all images are of similar size and quality, and attach to each image a label saying which emotion is shown (such as "happy" or "sad").
− Train facial emotion CNN algorithm: the model learns to recognize patterns in the images that match different emotions; accuracy improves with more training.
− Train audio emotion CNN algorithm: similar to the facial emotion training, but using audio recordings; the model learns to find patterns in sound that match different emotions such as happiness and sadness.
− Train text emotion CNN algorithm: the model learns to recognize patterns in the text that match different emotions; accuracy improves with more training.
− Accuracy comparison graph: the accuracies of the models are shown in a graph, making it easy to see which model performs better.
− Predict facial emotion: the user uploads a test image to detect the emotion.
− Predict speech emotion: the user uploads a test audio file to detect the speech emotion.
− Predict text emotion: the user uploads test text data to detect the emotion.
8. RESULTS AND DISCUSSION
Figure 6. Training and validation accuracy graph
Figure 7. Training and validation loss graph
The confusion matrix provides insights into the model's performance across different categories,
presenting precision, recall, F1-score, and support metrics, as shown in Table 1. High precision (93%) and
recall (92%) indicate accurate identification in one category, while high precision (95%) with lower
recall (84%) in another points to some missed instances (false negatives). Similarly, high precision (93%) and
recall (95%) showcase successful identification, whereas lower precision (82%) and recall (82%) in another
category indicate potential challenges. Meanwhile, high precision (96%) and recall (96%) signify excellent
performance, while lower precision (74%) with high recall (91%) implies more false positives but effective
coverage of the class. The image and audio models achieved the highest accuracy at 99%, showcasing
superior performance. The text model achieved slightly lower accuracy at 96%, still demonstrating strong
performance, as shown in Table 2.
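Per-class precision, recall, F1-score, and support of the kind reported in Table 1 can be computed with scikit-learn as sketched below; the label arrays are placeholders, not the actual predictions.

from sklearn.metrics import classification_report, confusion_matrix

emotions = ["anger", "fear", "disgust", "neutral", "happy", "sorrow", "surprise"]
y_true = [0, 1, 2, 2, 3, 4, 5, 6, 6]      # placeholder ground-truth class indices
y_pred = [0, 1, 2, 1, 3, 4, 5, 6, 5]      # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=emotions))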
9. CONCLUSION
We set out to extensively study state-of-the-art emotion detection models and to develop a multi-
input model capable of interpreting text, images, and audio. The primary objective was to identify and
categorize human emotions across several modalities, using CNNs for visual and auditory input and LSTM
networks with TF-IDF vectorization for textual input. The suggested emotion recognition system uses a 2D
CNN architecture to handle speech data and incorporates several fusion strategies, one of which is a novel
CNN-based fusion strategy. Emotion detection systems are made more accurate and resilient by this method,
which uses different ML models for each kind of input and guarantees careful data preparation. Whereas
LSTM networks with TF-IDF vectorization were better at classifying emotions in text, CNN models were
better at capturing the spatial features and patterns of facial expressions in images and audio recordings. The
primary result of our study is a multi-input model that improves emotion recognition accuracy across
different datasets by intelligently routing inputs to the appropriate sub-models based on their modality. The
single model's adaptability and flexibility across input types proved advantageous, despite some challenges
such as potential delays caused by multiple processing pipelines. Our approach is closely aligned with the aims of the
suggested system, which is to conduct comprehensive studies on emotion recognition in text, image, and
audio formats.
In the future, we aim to evaluate the proposed method within an edge-and-cloud computing setup,
employing text datasets and the Emotion in the Wild challenge databases. Additionally, we are keen on
exploring alternative DL architectures to ascertain if they yield improved performance. Taken together, our
research paves the way for future emotion recognition algorithms with a deeper understanding of human
emotions, which in turn will allow for more advanced uses in affective computing and HCI.
REFERENCES
[1] M. Chen, Y. Zhang, M. Qiu, N. Guizani, and Y. Hao, “SPHA: smart personal health advisor based on deep analytics,” IEEE
Communications Magazine, vol. 56, no. 3, pp. 164–169, Mar. 2018, doi: 10.1109/MCOM.2018.1700274.
[2] F. Doctor, C. Karyotis, R. Iqbal, and A. James, “An intelligent framework for emotion aware e-healthcare support systems,” in
2016 IEEE Symposium Series on Computational Intelligence (SSCI), Dec. 2016, pp. 1–8. doi: 10.1109/SSCI.2016.7850044.
[3] K. Lin, F. Xia, W. Wang, D. Tian, and J. Song, “System design for big data application in emotion-aware healthcare,” IEEE
Access, vol. 4, pp. 6901–6909, 2016, doi: 10.1109/ACCESS.2016.2616643.
[4] X. Liu, L. Zhang, J. Yadegar, and N. Kamat, “A robust multi-modal emotion recognition framework for intelligent tutoring
systems,” in 2011 IEEE 11th International Conference on Advanced Learning Technologies, Jul. 2011, pp. 63–65. doi:
10.1109/ICALT.2011.26.
[5] R. A. Calvo and S. D’Mello, “Frontiers of affect-aware learning technologies,” IEEE Intelligent Systems, vol. 27, no. 6, pp. 86–
89, Nov. 2012, doi: 10.1109/MIS.2012.110.
[6] K. Meehan, T. Lunney, K. Curran, and A. McCaughey, “Context-aware intelligent recommendation system for tourism,” in 2013
IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), Mar. 2013, pp.
328–331. doi: 10.1109/PerComW.2013.6529508.
[7] S. Meng et al., “Privacy-aware factorization-based hybrid recommendation method for healthcare services,” IEEE Transactions
on Industrial Informatics, vol. 18, no. 8, pp. 5637–5647, Aug. 2022, doi: 10.1109/TII.2022.3143103.
[8] M. H. Shahbaz, Zain-Ul-Abidin, K. Mahboob, and F. Ali, “Enhancing contextualized GNNs for multimodal emotion recognition:
improving accuracy and robustness,” in 2023 7th International Multi-Topic ICT Conference (IMTIC), May 2023, pp. 1–7. doi:
10.1109/IMTIC58887.2023.10178481.
[9] Y. Li, “Research direction of smart home real-time monitoring,” in 2020 International Conference on Computer Engineering and
Intelligent Control (ICCEIC), Nov. 2020, pp. 220–232. doi: 10.1109/ICCEIC51584.2020.00051.
[10] Y. J. Liu, M. Yu, G. Zhao, J. Song, Y. Ge, and Y. Shi, “Real-time movie-induced discrete emotion recognition from EEG
signals,” IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 550–562, 2018, doi: 10.1109/TAFFC.2017.2660485.
[11] J. Teo and J. T. Chia, “Deep neural classifiers for EEG-based emotion recognition in immersive environments,” in 2018
International Conference on Smart Computing and Electronic Enterprise (ICSCEE), Jul. 2018, pp. 1–6. doi:
10.1109/ICSCEE.2018.8538382.
[12] X. Li, X. Zhang, H. Yang, W. Duan, W. Dai, and L. Yin, “An EEG-based multi-modal emotion database with both posed and
authentic facial actions for emotion analysis,” in 2020 15th IEEE International Conference on Automatic Face and Gesture
Recognition (FG 2020), Nov. 2020, pp. 336–343. doi: 10.1109/FG47880.2020.00050.
[13] P. Lopez-Otero, L. Dacia-Fernandez, and C. Garcia-Mateo, “A study of acoustic features for depression detection,” in 2nd
International Workshop on Biometrics and Forensics, Mar. 2014, pp. 1–6. doi: 10.1109/IWBF.2014.6914245.
[14] S. Gao, Y. Zhong, and W. Li, “Random weighting method for multisensor data fusion,” IEEE Sensors Journal, vol. 11, no. 9, pp.
1955–1961, Sep. 2011, doi: 10.1109/JSEN.2011.2107896.
[15] K. M. Kudiri, A. Md Said, and M. Y. Nayan, “Emotion detection using sub-image based features through human facial
expressions,” in 2012 International Conference on Computer & Information Science (ICCIS), Jun. 2012, vol. 1, pp. 332–335. doi:
10.1109/ICCISci.2012.6297264.
[16] X. Wang, X. Chen, and C. Cao, “Human emotion recognition by optimally fusing facial expression and speech feature,” Signal
Processing: Image Communication, vol. 84, p. 115831, May 2020, doi: 10.1016/j.image.2020.115831.
[17] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, and T. Alhussain, “Speech emotion recognition using deep learning
techniques: a review,” IEEE Access, vol. 7, pp. 117327–117345, 2019, doi: 10.1109/ACCESS.2019.2936124.
[18] Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 1–1, 2014, doi:
10.1109/TASLP.2014.2375558.
[19] E. Mishra, A. K. Sharma, M. Bhalotia, and S. Katiyar, “A novel approach to analyse speech emotion using CNN and multilayer
perceptron,” in 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering
(ICACITE), Apr. 2022, pp. 1157–1161. doi: 10.1109/ICACITE53722.2022.9823781.
[20] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-based feature transfer learning for speech emotion
recognition,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Sep. 2013, pp. 511–
516. doi: 10.1109/ACII.2013.90.
[21] J. Deng and F. Ren, “Multi-label emotion detection via emotion-specified feature extraction and emotion correlation learning,”
IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 475–486, Jan. 2023, doi: 10.1109/TAFFC.2020.3034215.
[22] C. Dalvi, M. Rathod, S. Patil, S. Gite, and K. Kotecha, “A survey of AI-based facial emotion recognition: features, ML & DL
techniques, age-wise datasets and future directions,” IEEE Access, vol. 9, pp. 165806–165840, 2021, doi:
10.1109/ACCESS.2021.3131733.
[23] W. Li et al., “A spontaneous driver emotion facial expression (DEFE) dataset for intelligent vehicles: emotions triggered by
video-audio clips in driving scenarios,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 747–760, Jan. 2023, doi:
10.1109/TAFFC.2021.3063387.
[24] K. Kasiri, P. Fieguth, and D. A. Clausi, “Self-similarity measure for multi-modal image registration,” in 2016 IEEE International
Conference on Image Processing (ICIP), Sep. 2016, pp. 4498–4502. doi: 10.1109/ICIP.2016.7533211.
[25] S. Z. H. Naqvi, S. Aziz, M. U. Khan, N. Asghar, and G. Rasool, “Emotion recognition system using pulse plethysmograph,” in
2020 International Conference on Emerging Trends in Smart Technologies (ICETST), Mar. 2020, pp. 1–6, doi:
10.1109/ICETST49965.2020.9080725.
BIOGRAPHIES OF AUTHORS