Emotion Recognition from Speech via the Use of Different Audio Features,
Machine Learning and Deep Learning Algorithms
ABSTRACT
In this study, different machine learning and neural network methods for emotion analysis from speech are examined and solutions are sought. Audio consists of a large number of attributes, and it is possible to perform emotion analysis from sound using these attributes. Root Mean Square Energy (RMSE), Zero Crossing Rate (ZCR), Chroma, Mel Frequency Cepstral Coefficients (MFCC), Spectral Bandwidth, and Spectral Centroid features were investigated for emotion prediction from speech. The Ravdess, Save, Tess, and Crema-D datasets were used. The datasets were voiced in German and English by 121 different people in total and consist of audio files in wav format covering 7 emotional states: happy, sad, angry, disgusted, scared, surprised, and neutral. Using the Librosa library, features were obtained from the audio files in the datasets. The features were used in various machine learning and neural network models and the results were compared. When the classification results are examined, the F1 scores are 0.68 for Support Vector Machines, 0.63 for Random Forest Classification, 0.71 for LSTM, and 0.74 for Convolutional Neural Networks.
Keywords: Voice analysis, Speech emotion recognition, Audio features, Classifiers, Machine
learning
INTRODUCTION
Communication has been the basis of information exchange since the exi-
stence of human beings. Words and emotions follow each other to make
communication more accurate, clear, and understandable. Depending on the
emotional state of people, there are some physiological changes such as body
movements, blood pressure, pulse, and tone of voice. While changes such as
heart rate and blood pressure are detected with a special device, changes such
as tone of voice and facial expression can be understood without the need
for a device. Machines are often used for emotion prediction (Gökalp and Aydın, 2021). Speech is one of the fastest and most natural communication
methods between people. For this reason, researchers have started to use
speech signals to make human-machine interaction faster and more efficient.
Speech signals have a complex structure that can contain much information
at the same time, such as the speaker’s age, mood, gender, physiology, and
language. Speech emotion recognition studies try to obtain semantic information from the sound signal during speech (Gökalp and Aydın, 2021). This study aims to determine the emotional state of the speaker using speech signals. In academia, Speech Emotion Recognition has become one of the most intriguing and most investigated research areas (Jain et al., 2020). In recent years, various studies have been carried out on the mood analysis of the speaker using machine learning, and thanks to these studies, great developments have been experienced in this field.
field. However, it is a difficult task to analyse the mood from the sound waves
of the speaker, because the sound consists of many parameters and has vari-
ous features that must be taken into account. For these reasons, choosing the
appropriate and correct features for emotion recognition from speech is the critical and perhaps the most important point of this study.
Machine learning basically means that a computer has the ability to auto-
matically perform a task using data and learning methods. The computer uses
statistics, various probability algorithms, and neural networks to learn and
successfully complete these tasks. In the remainder of the study, the parameters of the various datasets and algorithms used to create the machine-learning models are given.
Various approaches have been successfully applied for speech emotion
recognition to date. In this article, various features of sound waves, various machine learning algorithms, and neural networks are used for speech emotion recognition. In order to increase the accuracy and success of the study, 4 different speech databases were combined.
DATA PROCESSING
Sample rate, in music and audio technology, indicates how many times per
second an audio file or signal is measured. A higher sampling rate means
higher sound quality and an audio file with more detail. Sample rate is usually specified as thousands or millions of samples per second. A higher sample rate can capture higher frequency content and provides higher sound quality. The sample rate used in this project is 22.05 kHz.
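As a minimal sketch of how this sample rate might be applied when loading audio with the Librosa library used in this study (the file path below is a placeholder, not a file from the datasets):

```python
import librosa

# Load an audio file and resample it to the project's 22.05 kHz sample rate.
# "speech.wav" is a placeholder path for illustration only.
y, sr = librosa.load("speech.wav", sr=22050)

print(f"Sample rate: {sr} Hz, duration: {len(y) / sr:.2f} s")
```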
Hop length is a term used in music and audio technology when processing an audio file or signal. It specifies the number of samples between the starting points of consecutive analysis frames, i.e. how far the analysis window moves after the signal has been measured once. It is used in conjunction with the sample rate and, together with the frame length, determines the time resolution of the analysis. The hop length value used in this project is 512.
Frame length, in music and audio technology, refers to the number of samples in the window over which an audio file or signal is measured once. Frame length is used along with the sample rate to determine the frequency resolution of the analysis of the audio file or signal. The frame length value used in this project is 2048.
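The following sketch illustrates how the hop length of 512 and the frame length of 2048 could be passed to Librosa's frame-based feature functions; the signal `y` is assumed to be loaded as shown earlier.

```python
import librosa

y, _ = librosa.load("speech.wav", sr=22050)  # placeholder path, as above

# Each frame covers 2048 samples (about 93 ms at 22.05 kHz) and consecutive
# frames start 512 samples (about 23 ms) apart.
rmse = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)

print(rmse.shape, zcr.shape)  # both have shape (1, number_of_frames)
```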
The Fourier transform is a mathematical operation to find the frequency
spectrum of a signal. This process allows temporal patterns of a signal to
be expressed over a frequency spectrum. In this way, the amplitudes and
phases of the frequency components in the signal are determined and the
characteristics of the signal are examined with this information.
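A short sketch of how such a frequency spectrum can be obtained with the short-time Fourier transform, using the frame and hop parameters above (the file path is again a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder path

# Short-time Fourier transform: each column is the spectrum of one 2048-sample frame.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitudes, phases = np.abs(stft), np.angle(stft)  # amplitudes and phases of the frequency components

print(magnitudes.shape)  # (1 + 2048 // 2, number_of_frames) = (1025, number_of_frames)
```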
MFCC (Mel Frequency Cepstral Coefficients) is a feature vector often used in audio processing applications. MFCCs represent audio based on the perception of the human auditory system. In MFCC, the frequency bands are positioned logarithmically (i.e., on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT (Goh and Leon, 2009; Gold et al., 2011). The MFCC feature vector is calculated over the frequency spectrum of the audio signal and contains the mel-frequency cepstral coefficients of the signal. To obtain this cepstrum, the frequency spectrum of the audio signal is mapped onto the mel scale, its logarithm is taken, and a further transform of this logarithmic spectrum is computed. The values obtained as a result of these operations form the MFCC feature vector. MFCC feature vectors help identify words and phrases contained in audio signals, and these features make it easier to classify audio signals. The number of MFCC coefficients used in this project is 128.
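Assuming the coefficients were extracted with Librosa's MFCC helper, the call might look like the sketch below; the path and the time-averaging step are illustrative, not taken from the study's code.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder path

# 128 MFCCs per frame, computed from the log mel spectrogram of each 2048-sample frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=128, n_fft=2048, hop_length=512)

# One common way to obtain a fixed-length vector per file: average over time.
mfcc_vector = np.mean(mfcc, axis=1)
print(mfcc.shape, mfcc_vector.shape)  # (128, number_of_frames), (128,)
```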
MODELING
Unsupervised learning aims either to discover similar sample groups in the
data or to determine the distribution in the data space by identifying hidden
patterns in the data. It uses unlabelled data to identify patterns. Clustering
and Association are types of unsupervised learning.
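As an illustration of the concept only (clustering is not part of this study's supervised pipeline), unlabelled feature vectors could be grouped with scikit-learn's KMeans; the data below are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for per-file feature vectors (e.g., time-averaged MFCCs).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 128))

# Group the unlabelled samples into 7 clusters without using emotion labels.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:10])  # cluster assignments of the first ten samples
```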
A supervised classification model, in contrast, takes the audio signals as input and tries to guess which classes (such as words, phrases, or emotions) are present in the audio. In our study, a 6-layer CNN was used together with the ModelCheckpoint and ReduceLROnPlateau callbacks.
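Since ModelCheckpoint and ReduceLROnPlateau are Keras callbacks, a minimal Keras sketch of such a setup might look as follows; the layer sizes, input shape, and training call are assumptions for illustration, not the study's exact 6-layer architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES, NUM_CLASSES = 128, 7  # assumed: 128 features per sample, 7 emotions

# A small 1-D CNN; the study's exact 6-layer architecture is not reproduced here.
model = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),
    layers.Conv1D(64, 5, padding="same", activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, padding="same", activation="relu"),
    layers.MaxPooling1D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Save the best weights seen so far according to validation accuracy.
    keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_accuracy", save_best_only=True),
    # Lower the learning rate when the validation loss stops improving.
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]

# X_train, y_train, X_val, y_val are assumed to be prepared feature/label arrays.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, batch_size=32, callbacks=callbacks)
```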
RESULTS
The results obtained from the model development stage indicate promising findings. First of all, different feature extraction methods were applied, including Root Mean Square Energy (RMSE), Zero Crossing Rate (ZCR), Chroma, Mel Frequency Cepstral Coefficients (MFCC), Spectral Bandwidth, and Spectral Centroid, for emotion prediction from speech. Different datasets (Ravdess, Save, Tess, Crema-D) were combined for modelling, and the combined dataset contained recordings voiced in German and English by 121 different people in total. Moreover, the datasets consist of audio files in wav format containing 7 emotional states: happy, sad, angry, disgusted, scared, surprised, and neutral.
These emotional states have also been used as labels within this combined dataset. The mentioned features were extracted from the audio files, and classification models were developed to predict the correct labels using the extracted features. The classification results, expressed as F1 scores, were 0.68 for Support Vector Machines (shown in Table 1), 0.63 for Random Forest Classification (shown in Table 2), 0.71 for LSTM (shown in Table 3), and 0.74 for Convolutional Neural Networks (shown in Table 4).
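As a rough sketch of how F1 scores like these could be computed for the classical models with scikit-learn (the feature matrix and labels below are synthetic placeholders, not the study's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the extracted features and emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))
y = rng.integers(0, 7, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("SVM", SVC(kernel="rbf")),
                    ("Random Forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Weighted F1 accounts for class imbalance across the seven emotions.
    print(name, f1_score(y_test, preds, average="weighted"))
```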
DISCUSSION
This modelling study examines intelligent voice emotion recognition systems as an alternative to the conventional interview techniques widely used in human resources. One of the major needs in this domain has been to provide an objective and automatic process that reduces the time and human resources spent on it. Our current proposal fulfils this requirement, since it reduces the analysis and reporting of a whole session to less than a minute.
Within the scope of this study, four common audio features (RMSE, ZCR, chroma, and MFCC) were determined to be distinguishing characteristics in speech emotion analysis. For better modelling outputs, newly extracted features are necessary to provide results with higher scores. So far, the 6-layer CNN model has provided the highest performance among the developed models, with a success rate of 74%.
Lastly, as mentioned within the manuscript, four distinct public datasets were utilized for the research. The analysis of these datasets has shown that the success rate of mood analysis may differ depending on the spoken language. Thus, it seems that other languages should be integrated into this model. We are planning to develop a national database for this purpose that could also be used in research domains such as understanding the effects of neurophysiological signals.
CONCLUSION
In this study, intelligent systems for speech emotion recognition are examined and a fundamental model has been developed. The main contribution of this study has been the development of different models on the combined datasets. The conclusions of this study are as follows. Basically, speech emotion recognition architectures consist of three main parts: feature extraction, feature selection, and classification. It was understood that RMSE, ZCR, chroma, and MFCC are distinctive features in speech emotion analysis. Four different datasets were used in the project. The analysis of these datasets has shown that the success rate of mood analysis may vary according to the spoken language. Thus, regarding the major limitation of this study, new spoken languages should be added to the combination of these datasets to provide a more realistic model for use in human resources interviews; meanwhile, one of the major challenges will be increasing the performance metrics to reach a more acceptable solution.
REFERENCES
Akpınar, B. (2021, November 20). Adaptif Sıralı Minimal Optimizasyon ile Destek Vektör Makinesi.
Bhavan, A., Chauhan, P., & Shah, R. R. (2019). Bagged support vector machines for emotion recognition from speech. Knowledge-Based Systems, 184, 104886.
Chauhan, N. S. (2021, November 22). Naive Bayes.
Çolakoğlu, E., Hızlısoy, S., & Arslan, R. S. Konuşmadan Duygu Tanıma Üzerine Detaylı bir İnceleme: Özellikler ve Sınıflandırma Metotları.
Goh, C., & Leon, K. (2009). Robust computer voice recognition using improved MFCC algorithm. In Proceedings of the 2009 International Conference on New Trends in Information and Service Science, IEEE, pp. 835–840.
Gökalp, S., & Aydın, İ. (2021). Farklı Derin Sinir Ağı Modellerinin Duygu Tanımadaki Performanslarının Karşılaştırılması.
Gold, B., Morgan, N., & Ellis, D. (2011). Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley, New Jersey.
Huang, K. Y., Wu, C. H., & Su, M. H. (2019). Attention-based convolutional neural network and long short-term memory for short-term detection of mood disorders based on elicited speech responses. Pattern Recognition, 88, 668–678.
Langari, S., Marvi, H., & Zahedi, M. (2020). Efficient speech emotion recognition using modified feature extraction. Informatics in Medicine Unlocked, 20, 100424.
Pan, Y., Shen, P., & Shen, L. (2012). Speech emotion recognition using support vector machine. International Journal of Smart Home, 6(2), 101–108.
Qayyum, A. B. A., Arefeen, A., & Shahnaz, C. Convolutional Neural Network (CNN) Based Speech-Emotion Recognition.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Sethu, V., Epps, J., & Ambikairajah, E. (2015, September). Speech Based Emotion Recognition. pp. 197–228.
Wang, K., Su, G., Liu, L., & Wang, S. (2020). Wavelet packet analysis for speaker-independent emotion recognition. Neurocomputing, 398, 257–264.
Wen, X.-C., Liu, K.-H., Zhang, W.-M., & Jiang, K. The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition.
Yao, Z., Wang, Z., Liu, W., Liu, Y., & Pan, J. (2020). Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Communication, 120, 11–19.
Zuber, S., & Vidhya, K. Detection and analysis of emotion recognition from speech signals using Decision Tree and comparing with Support Vector Machine.