Physica A 597 (2022) 127261

Journal homepage: www.elsevier.com/locate/physa

Article history: Received 10 November 2021; Received in revised form 21 January 2022; Available online 23 March 2022

Keywords: Indian Classical Music; Emotions; Instruments; Classification; CNN; MFDFA

Abstract

Music is often considered the language of emotions. The way it stimulates emotional appraisal across people from different communities, cultures and demographics has long been known, and hence categorizing music on the basis of emotions is an intriguing basic research area. Indian Classical Music (ICM) is famous for its ambiguous nature, i.e. its ability to evoke a number of mixed emotions through a single musical narration, and hence classifying the emotions evoked by ICM is an even more challenging task. With the rapid advancements in the field of Deep Learning, the Music Emotion Recognition (MER) task is becoming more relevant and robust, and can therefore be applied to one of the most challenging test cases: classifying emotions elicited by ICM. In this paper we present a new dataset called JUMusEmoDB, which presently has 1600 audio clips (approximately 30 s each), where 400 clips each correspond to the happy, sad, calm and anxiety emotional scales. The initial annotation and emotional classification of the database were based on an emotional rating test (5-point Likert scale) performed by 100 participants. The clips have been taken from different conventional 'raga' renditions played on two Indian stringed instruments, sitar and sarod, by eminent maestros of ICM and digitized at 44.1 kHz. The ragas, which are unique to ICM, are described as musical structures capable of inducing different moods or emotions. For supervised classification purposes, we have used Convolutional Neural Network (CNN) based architectures (ResNet50, MobileNet v2.0, SqueezeNet v1.0 and a proposed ODE-Net) on the spectrograms of the 6400 sub-clips (every clip was segmented into 4 sub-clips), which contain both time- and frequency-domain information. Along with emotion classification, instrument classification was also attempted on the same dataset using the CNN based architectures. In this context, a nonlinear technique, Multifractal Detrended Fluctuation Analysis (MFDFA), was also applied to the musical clips to classify them on the basis of the complexity values extracted by the method. The initial classification accuracies obtained from the applied methods are quite encouraging and have been corroborated with ANOVA results to determine statistical significance. This type of CNN based classification algorithm using a rich corpus of Indian Classical Music is unique even from a global perspective and can be replicated in other modalities of music as well. The link to this newly developed dataset has been provided in the dataset description section of the paper. This dataset is still under development and we plan to include more data covering other emotional as well as instrumental entities.

Corresponding author at: Rekhi Centre of Excellence for the Science of Happiness, IIT Kharagpur, India. E-mail address: [email protected] (A. Banerjee).

https://doi.org/10.1016/j.physa.2022.127261

© 2022 Elsevier B.V. All rights reserved.
1. Introduction
Music is an integral part of our day-to-day lives: from the countless hours spent plugged into our earphones while working or commuting to attending live concerts (especially during the pre-Covid19 era and hopefully once Covid19 ends), music has a profound impact on our mental states. Music imposes emotions and has the power of singularly altering our moods. It is usually considered that songs in major keys or scales induce happiness whilst those in minor keys or scales create a comparatively sad ambience. But this cannot be taken as a rule of thumb, because emotions are subjective: there are instances where people have ranked songs in major scales among the saddest songs of all time. In most cases, this is complicated by the fact that the response to melody and rhythm is ascribed to a biological instinct. A non-musician's ear can feel the pain when Gary Moore plays ''The Loner'', and a baby can respond to different scales and melodies. Thus, ascribing an individual musical piece to a particular emotion can be a challenging task.
Over the years, researchers have tried various ways to determine emotional content in audio clips including music,
speech as well as natural sounds. Along these lines, significant development has been seen in the field of Music Emotion Recognition (MER) [1], an expanding sub-domain of Music Information Retrieval (MIR) dealing with the identification of emotions in musical clips using a combination of signal processing and machine learning techniques. Categorization of music based on disparate emotions is not only a difficult task but also an increasingly promising one, especially with the advancement of Artificial Intelligence and the emergence of personalized music recommender systems like Spotify and Apple Music [2]. The applications of MER are also profound in the areas of music therapy, where
music can be used to reduce stress and other cognitive disorders [3,4]. MER as a problem was first introduced by Barthet et al. [1], which has fostered several advancements in this discipline since then. For any audio (speech, music) based
emotion classification task, the acoustic features are of utmost importance. An arsenal of such acoustic features is usually
composed of (but not restricted to) (a) Rhythmic Features: Tempo, Silence etc., (b) Timbral Features: MFCC, Average
Energy, Spectral Centroid, Spectral Tilt etc., and (c) Chroma Features. These acoustic features quantify the musicality of an
audio clip which essentially contributes to emotion recognition in those audio clips. Such knowledge-driven approaches
when applied on validated and organized datasets, usually lead to efficient model paradigms. Machine Learning techniques
have been profusely applied in the field of MER, encompassing different learning strategies including supervised and unsupervised learning. The aforementioned features constitute the feature matrix [5]. In the case of a supervised classification task, a standard Machine Learning algorithm like SVM [6] or ANN [7] is capable of handling those features as inputs and eventually classifying the emotions induced by several audio clips [8]. On the other hand, in the case of unsupervised learning strategies, clustering algorithms like k-means [9], DBSCAN [10], Fuzzy Clustering [11], etc. suitably assign audio clips to different clusters [12]. In some cases, a dimensionality reduction approach like PCA or UMAP is used to find the best subset of features.
In the last decade, Deep Learning has made significant progress in a multitude of fields including computer vision,
medical imaging, natural language processing, finance, e-commerce and so on [13–28]. Deep Learning methods are usually
advantageous over the traditional Machine Learning methods provided the model is trained on a significantly large
database. Such Deep Learning approaches have been employed in the field of MER as well to detect different emotions
associated with music [29–31]. Recurrent Neural Network (RNN), Long–Short-Term-Memory (LSTM) and Transformers
are the predominant models when it comes to time-series classification or prediction tasks [32–35]. But, taking insights
from the domain of computer vision, Convolutional Neural Networks (CNN) based methods have been found promising
in the field of Music Emotion Recognition where the idea is to convert audio signals to respective spectrograms and
utilize them as inputs. So, essentially it narrows down to an image classification problem. Music emotion identification
tasks have been widely performed with varieties of songs or musical clips and distinct emotions but mostly restricted to
western music which includes datasets like Computer Audition Lab 500-song (CAL500) [36], CAL500exp [37]. On the other
hand, emotions elicited by Indian Classical Music (ICM) are inherently complex, and MER with ICM becomes hugely challenging precisely because of this ambiguous emotional response. Studies involving non-linear techniques have been conducted in the recent past to understand this complex behavior of ICM and its manifestation in the human brain [38–49]. MER with ICM, in particular, is still an uncharted territory for researchers, especially because of the lack of structured datasets comprising ICM clips. In order to fathom the relevance of emotion induction in Indian Classical Music, a proper database is needed. Therefore, in this study we have introduced our own database, JUMusEmoDB, which is an extended version of the one used in our previous study [50]. Our database contains 1600 music clips from the genre of Indian Classical Music, out of which 400 clips have been categorized to each of the four classes of emotions, namely anxiety, calm, happy, and sad. Each clip is of 30 s length, which is long enough to impart an emotional impression [36]. Out of the
1600 clips, 800 clips are parts of different ‘raga’ renditions improvised in sitar by an eminent maestro of ICM, and the
remaining 800 clips are several ‘raga’ renditions improvised in sarod by another eminent maestro of ICM. Each raga in
ICM evokes not only a particular emotion (rasa), but a superposition of different emotional states such as joy, sadness,
anger, disgust, fear and so on. In order to decipher the predominant emotions which were conveyed in the chosen ragas,
emotion annotations were performed by 100 participants based on 5-point Likert scale.
Like MER, another well-studied sub-problem of Music Information Retrieval (MIR) is Musical Instrument Identification
which is associated with the perception and understanding of musical timbre. Timbre of an instrument is related to
the physical characteristics of the instrument, particularly the properties of the material — this makes timbre a unique
identifier of an instrument. Machine Learning and Deep Learning techniques have also been successfully applied to this
domain of musical instrument identification [51–56]. As mentioned previously in the paper, we have two different string
instruments, sarod, and sitar, where the former is fretless, but the latter is not. So, this becomes our second sub-problem
along with the MER (first sub-problem).
In this paper we have introduced the dataset used in our study, named JUMusEmoDB. It is a novel dataset comprising a variety of instrumental (sitar and sarod) clips from the Indian Classical Music (ICM) genre.
This dataset is currently composed of musical clips from four different emotions, namely happy, sad, calm and anxiety.
For the analysis purposes, we have considered two approaches: (i) a Deep Learning based approach using CNNs and
spectrograms, and (ii) a non-linear feature-based analysis technique just for the sake of exploring the dataset. There was
no attempt to compare the performances of these two approaches as inherently they are very different from one another
and the sole intention was to demonstrate the applicability and versatility of our newly created dataset for the enthusiastic
researchers. For each of the aforementioned approaches, we have two sub-problems, namely (a) the music emotion
classification problem, and (b) the instrument identification problem. Our Deep Learning based method is essentially
an image processing-oriented approach to classify the dataset into (a) emotion tags, and (b) different instruments. We
primarily extracted the spectrogram of a clip and fed the processed spectrogram image into existing deep Convolutional
Neural Network (CNN) based architectures which were subsequently trained and validated on respective training and
validation datasets in order to be eventually tested on a test dataset. For this study, we have utilized four different
existing CNN architectures including ResNet50, MobileNet v2.0, SqueezeNet v1.0 and our proposed ODE-Net. In both the
sub-problems, the proposed ODE-Net has outperformed the remaining three models. (ii) Regarding the second approach, the application of nonlinear methodologies for source modeling indicates the importance of non-deterministic/chaotic approaches in understanding the underlying intricacies of speech/music signals [57–63]. In this context, fractal analysis of the signal, which reveals the geometry embedded in acoustic signals, assumes significance. The audio signals were subjected to fractal analysis first
by Voss and Clarke [64], who applied the same on the amplitude spectra of the audio signals to find out a characteristic
frequency fc, which separates white noise (a flat power spectrum) at frequencies much lower than fc from very correlated behavior (∼1/f²). It is well established that naturally evolving geometries and phenomena can never be characterized using a single scaling exponent; different parts of the system scale differently, i.e., the clustering pattern is not uniform over the whole system. Such a system is better
characterized as a ‘multifractal’ [65]. A multifractal can be loosely thought of as an interwoven set constructed from
sub-sets with different local fractal dimensions. Real world systems are mostly multifractal in nature. Music too, has
non-uniform property in its movement [66,67]. Multifractal methods such as Multifractal Detrended Fluctuation Analysis
(MFDFA) [68] have been applied in a number of recent studies to characterize the improvisation of an artist [46,69],
sonification of EEG data using emotional music clips [38], and a number of other related analysis on extracting features
from music signals [43,44,70]. In this study, we applied these nonlinear multifractal features to obtain a classification
algorithm for the four emotional classes in question here and also extend the same for instrumental classification. Robust
statistical tests in the form of one way ANOVA [71] have been performed on the obtained results to determine the
significance of the results.
To the best of our knowledge, our dataset is the first of its kind to contain musical clips from the ICM genre imparting a variety of emotions. Our dataset is still under development, and we plan to include more data covering not only other emotional categories but also different musical instruments, eventually making the dataset comprise the 9 different emotions usually noticed in the ragas. It is also vast and can be employed by the scientific community for any kind
of explorative analysis including other Machine Learning algorithms for classifying different emotions or even musical
instruments, investigation of the impact of Indian Classical Music on human brain and also to conduct cross-cultural
studies combining both Western Classical and Indian Classical musical clips. An interesting outcome of this novel music-emotion dataset would be a comparison of the emotional appraisal elicited by Indian and Western Classical Music, and of how the brain processes involved in these faculties differ from one another.
2. Methods
Fig. 1. Flowchart of the Emotion Recognition task consisting of Preprocessing, Spectrogram conversion and then classification using CNN based
models. For simplicity purposes, only a sample spectrogram (for segment 2) has been shown here, but we have converted each individual segment
to its corresponding spectrogram.
In this work, we have adopted a supervised learning approach using Convolutional Neural Network (CNN) based architectures for two classification problems: (i) music emotion classification (4 classes) and (ii) instrument classification (2 classes). In such an approach the network designs its own set of features and learns the complex structural representations of (i) music that generates various emotions and (ii) music that is generated by various instruments. Furthermore, a successful classification with a high accuracy rate will not only validate the authenticity of the JUMusEmoDB database, but also provide a baseline or benchmark against which other approaches can be compared. The Methods section contains the preprocessing steps and spectrogram extraction alongside the proposed method and the models used for comparison. A flowchart outlining the steps of the emotion recognition task is shown in Fig. 1. For simplicity, only a sample spectrogram (for segment 2) is shown there, but we have converted each of the individual segments to its corresponding spectrogram.
2.1. Preprocessing
In our database we have audio clips, each of which is approximately 30 s long. Since a 30 s duration is long enough for an audio clip to evoke a perceivable emotion, we divide each audio clip into 4 individual segments [72]. This not only increases the size of the database but also makes the task more challenging [73]. We have used CNN based frameworks for the supervised classification task because CNNs have been found to perform exceedingly well in the domain of computer vision for image recognition and classification tasks. The input to our proposed model, along with the competing models, therefore needs to be an image. To this end, we have converted the audio clips to corresponding spectrograms employing the Short-Time-Fourier-Transform (STFT) technique, as sketched below.
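A minimal sketch of this segmentation step is given below, assuming a Python/librosa toolchain (the paper does not specify the preprocessing tools); the file name is hypothetical.

```python
# Illustrative sketch (not the authors' code): splitting a ~30 s clip into 4 equal
# segments, assuming Python with librosa/numpy; the file name is hypothetical.
import librosa

def split_into_segments(path, n_segments=4, sr=44100):
    """Load an audio clip and return n_segments equal-length sub-clips."""
    y, sr = librosa.load(path, sr=sr, mono=True)   # resample to 44.1 kHz, mono
    seg_len = len(y) // n_segments                 # samples per sub-clip
    return [y[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

segments = split_into_segments("raga_clip_001.wav")        # hypothetical file name
print(len(segments), [len(s) / 44100 for s in segments])   # 4 sub-clips of ~7.5 s each
```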
2.2. Spectrograms
Spectral features have been previously used for emotion recognition tasks [74–76]. A spectrogram represents the fluctuation over time of the various frequency components present in a clip. The color or brightness of a spectrogram represents the strength or magnitude of a frequency component at each time frame. Spectrograms therefore provide a rich ensemble of spectral information for audio signal analysis and can be applied directly as inputs to a convolutional neural network based architecture.
To extract the melspectrogram, we have employed the Short-Time-Fourier-Transform (STFT) technique which is used
to compute the sinusoidal frequency and phase content of local segments of a time-varying signal. The idea is to partition
the signal into shorter segments or bins or windows of equal length and then evaluate the Fourier transform on each of
these individual segments, which can be plotted as a function of time eventually obtaining a spectrogram. The difference
between a spectrogram and a melspectrogram is that the latter is a dot product of the former with mel filterbanks. In our
application, we have considered a window size of 2048 and a hop size of 512 (see Fig. 2).
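The following sketch shows how such a mel spectrogram could be computed with the stated window size (2048) and hop size (512); librosa and the choice of 128 mel bands are assumptions, not settings reported by the authors.

```python
# Sketch of the spectrogram step under the stated settings (window 2048, hop 512);
# librosa and n_mels=128 are assumptions, not necessarily the authors' toolchain.
import librosa
import numpy as np

def segment_to_melspectrogram(y, sr=44100, n_fft=2048, hop_length=512, n_mels=128):
    """STFT -> mel filterbank -> log-power mel spectrogram (2-D array)."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)      # dB scale, suitable as CNN input

y = np.random.randn(44100 * 7)                     # stand-in for a ~7.5 s audio segment
mel = segment_to_melspectrogram(y)
print(mel.shape)                                   # (n_mels, n_frames)
```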
2.3. Models
Convolutional Neural Networks (CNN) have been found to work well with images, especially for solving image
classification or segmentation-based problems. We employed a similar technique here, after converting the respective
audio clips to corresponding spectrograms — so we treat both of these emotion recognition and instrument recognition (or
classification) problems as image recognition (or classification) problems. Hence, we have used CNN based architectures
to classify (i) disparate emotions elicited by separate audio clips in our database, JUMusEmoDB and also (ii) distinct
instruments used for recording these audio clips. We have trained our proposed Neural ODE based method which gives
the highest accuracy on the test datasets for both the problems. Apart from this model, for comparison purposes we have
also trained three different models — ResNet-50, SqueezeNet 1.0 and MobileNet v2.0 and tested each of these models on
the test dataset. In this section we will describe each of these models used in the study.
2.3.1. ResNet
Arguably one of the most disruptive architectures in the domain of computer vision and deep learning in the recent past, ResNets [77] are essentially residual learning paradigms that allow substantially deeper networks with lower computational complexity.
A common trend in the machine learning community is to introduce more layers and make the network architecture deeper in order to increase its representational power. However, increasing the depth of a network architecture can lead to the notorious problem of vanishing gradients, which comes with serious ramifications: convergence saturates at a very high training loss and therefore a low accuracy.
To tackle this aforementioned problem of vanishing gradient, Resnets were introduced [77] which give high accuracy
despite having deeper layers in the network architecture. The core implementation includes a short-cut identity connec-
tion that can skip one or more layers (also known as skip-connection). This allows the network to fit the stacked layers to
a residual mapping using a residual block as shown in Fig. 3 (inside the dashed red box) instead of letting them directly
fit the desired underlying mapping. We have used Resnet-50 architecture (Fig. 3) as a comparison model in our case.
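As an illustration of the residual idea described above, a minimal PyTorch sketch of an identity-skip block and of adapting a stock ResNet-50 to the four emotion classes is given below; this is not the authors' exact configuration.

```python
# Minimal sketch of the residual (skip-connection) building block behind ResNet,
# in PyTorch; an illustration, not the authors' exact Fig. 3 implementation.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)       # identity shortcut: the layers fit a residual mapping

# For the comparison model itself, a stock ResNet-50 can be adapted to 4 output classes:
from torchvision import models
resnet = models.resnet50()
resnet.fc = nn.Linear(resnet.fc.in_features, 4)   # happy / sad / calm / anxiety
```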
2.3.2. SqueezeNet
SqueezeNet was introduced as a lighter deep convolutional neural network architecture that achieves accuracy close to AlexNet with 50 times fewer parameters when trained on the ILSVRC dataset [78]. This lighter paradigm offers some advantages over architectures like AlexNet, including the feasibility of deployment on memory-limited hardware such as FPGAs. The authors introduced a fire module (Fig. 4a), which stacks a squeeze layer of 1 x 1 convolutional filters followed by an expand layer comprising both 1 x 1 and 3 x 3 convolutional filters. The squeeze layer has fewer kernels than the expand layer, which limits the number of input channels fed to the 3 x 3 filters. Finally, downsampling is introduced late in the network so that the convolutional layers have large activation maps, which leads to a higher classification accuracy. We have used vanilla SqueezeNet
(Fig. 4b) as a comparison model.
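A small PyTorch sketch of the fire module described above (a squeeze 1 x 1 layer followed by parallel 1 x 1 and 3 x 3 expand filters whose outputs are concatenated); illustrative only.

```python
# Sketch of the SqueezeNet "fire module": squeeze 1x1 convolutions, then parallel
# 1x1 and 3x3 expand convolutions concatenated along the channel dimension.
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)   # few filters: limits 3x3 inputs
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

fire = FireModule(96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
print(fire(torch.randn(1, 96, 32, 32)).shape)   # torch.Size([1, 128, 32, 32])
```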
2.3.3. MobileNet
MobileNetV2 was introduced by Google in 2018 [79]. This network comes with a novel lightweight architecture in which Depthwise Separable Convolutions replace the traditional convolutional layers; this reduces the model size and computational complexity of the network and makes it suitable for low-powered mobile devices (hence the name).
Fig. 3. Resnet-50 architecture (along with the residual block inside the dashed red box) used as a comparison model for our data.
Fig. 4. SqueezeNet Architecture. (a) fire module with a squeeze layer followed by an expand layer. (b) Vanilla SqueezeNet architecture.
Two kinds of blocks are observed in the network architecture: a residual block with stride 1 and a downsizing block with stride 2, as shown in Fig. 5. Each block is composed of three layers: the first layer is a 1 x 1 convolution with ReLU6 activation, the second is a depthwise convolution with another ReLU6 activation, and the third is another 1 x 1 convolution but without any non-linear activation. Using ReLU6 instead of the traditional ReLU activation increases the robustness of the network under low-precision computation. There is also a skip or shortcut connection, as in traditional residual networks, which leads to better training and accuracy.
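The block structure described above can be sketched in PyTorch as follows (1 x 1 expansion with ReLU6, depthwise 3 x 3 convolution with ReLU6, linear 1 x 1 projection, and a skip connection when the stride is 1); the expansion factor of 6 is the MobileNetV2 default and an assumption here.

```python
# Sketch of a MobileNetV2-style inverted residual block; illustrative only,
# not the authors' exact implementation.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # depthwise 3x3 convolution (groups = channels)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 linear projection (no non-linearity)
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

blk = InvertedResidual(32, 32, stride=1)            # stride-1 "residual" block
print(blk(torch.randn(1, 32, 16, 16)).shape)        # torch.Size([1, 32, 16, 16])
```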
2.3.4. Proposed ODE-Net

The intuition behind the proposed ODE-Net comes from the residual block of ResNet, which can be written as

$$ z(t + 1) = z(t) + f(z(t), \theta(t)) \quad (1) $$

where t ∈ {0, 1, . . . , T} indexes the residual block and T is the maximum number of layers, z(t) ∈ R^D is the t-th layer (which can also be represented as the D-dimensional state), f is the function learned by the layers inside the block, and θ(t) is the corresponding set of weight parameters. It is evident that Eq. (1), relating to a residual block, is nothing but a discrete approximation of an ordinary differential equation (ODE) in which the discretization step has been taken as unity [77]. A generalized version of Eq. (1) for any discretization step ∆t can be written as:

$$ \frac{z(t + \Delta t) - z(t)}{\Delta t} = f(z(t), \theta(t)) \quad (2) $$
Fig. 5. MobileNetV2 architecture with stride 1 or residual block and stride 2 or downsizing block.
Fig. 6. The proposed Neural ODE based classification framework. The second residual block has exactly similar architecture as the first residual block
and thus has been avoided in the illustration.
Now, as ∆t becomes infinitesimally small and tends to zero, we can consider f to be continuous in t and globally Lipschitz continuous in z. Accordingly we get:

$$ \lim_{\Delta t \to 0} \frac{z(t + \Delta t) - z(t)}{\Delta t} = \frac{dz(t)}{dt} = f(z(t), \theta(t)) \quad (3) $$

We consider z_in = z(0) and z_out = z(T), where z_in corresponds to the state at t = 0 and z_out is the state at some T, a positive real number. The function f decides the dynamics of the state-space Neural ODE. The value of T ideally tends to infinity, but in practice it is fixed. Thus, the final state or output z_out depends on the input or initial state z_in and on the function f learned by the trainable layers parameterized by θ. So, the Neural ODE can be represented by:

$$ z_{out} = z_{in} + \int_0^T f(z(t), \theta(t))\, dt \quad (4) $$
We have used the adjoint sensitivity method to train Neural ODE as explained in [80]. We have also followed their
implementation of Neural ODE for classifying MNIST images and adapted for our dataset accordingly. The proposed model
we have used in our paper is shown in Fig. 6 where the input is the Mel spectrogram of the audio signal. We have used
two residual blocks in our paradigm and finally we get the output.
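A minimal sketch of such an ODE block, trained with the adjoint method via the torchdiffeq package released with [80], is shown below; the layer widths, tolerances and the choice T = 1 are illustrative assumptions rather than the exact configuration of the proposed ODE-Net.

```python
# Sketch of an ODE block integrated with the adjoint method (torchdiffeq, Chen et al. [80]);
# sizes and tolerances are assumptions, not the proposed ODE-Net's exact settings.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(nn.Module):
    """f(z(t), t): the dynamics learned by the convolutional layers."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, t, z):
        return self.net(z)

class ODEBlock(nn.Module):
    """Integrates dz/dt = f(z, t) from t = 0 to t = 1 (Eq. (4) with T fixed to 1)."""
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.t = torch.tensor([0.0, 1.0])

    def forward(self, z0):
        zT = odeint(self.func, z0, self.t, rtol=1e-3, atol=1e-3)
        return zT[-1]                       # state at the final time

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # downsample spectrogram
    ODEBlock(ODEFunc(64)),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 4),                       # 4 emotion classes
)
print(model(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 4])
```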
2.4. Multifractal Detrended Fluctuation Analysis (MFDFA)

The analysis of the music signals is done using MATLAB [81] in this paper, and for each step an equivalent mathematical representation is given, following the prescription of Kantelhardt et al. [68]. The complete procedure is divided into the following steps:
Step 1: Convert the noise-like structure of the signal into a random-walk-like profile:

$$ Y(i) = \sum_{k=1}^{i} (x_k - \bar{x}), \quad i = 1, \ldots, N \quad (5) $$

where \bar{x} is the mean of the series and N its length.

Step 2: Divide the profile Y(i) into N_s = int(N/s) non-overlapping bins of equal length s. Since N is generally not an integer multiple of the scale s, the same procedure is repeated starting from the opposite end of the profile, giving 2N_s bins in total.

Step 3: Determine the local trend of each bin by a least-squares fit and compute the variance

$$ F^2(s, \nu) = \frac{1}{s} \sum_{i=1}^{s} \left\{ Y[(\nu - 1)s + i] - y_{\nu}(i) \right\}^2 \quad (6) $$

for ν = 1, . . . , N_s, and analogously

$$ F^2(s, \nu) = \frac{1}{s} \sum_{i=1}^{s} \left\{ Y[N - (\nu - N_s)s + i] - y_{\nu}(i) \right\}^2 $$

for ν = N_s + 1, . . . , 2N_s, where y_ν(i) is the least-squares fitted value in the bin ν. In this work, a least-squares linear fit using a first-order polynomial (MF-DFA-1) is performed. The study can also be extended to higher orders by fitting quadratic, cubic, or higher-order polynomials.
Step 4: The q-th order overall RMS variation for various scale sizes is obtained as

$$ F_q(s) = \left\{ \frac{1}{2N_s} \sum_{\nu=1}^{2N_s} \left[ F^2(s, \nu) \right]^{q/2} \right\}^{1/q} \quad (7) $$

where q is an index that can take all possible values except zero, because in that case the factor 1/q diverges. Fig. 7(a) and (b) are representative plots which show the variation of Fq(s) with scale s for the ''original'' and ''randomly shuffled'' time series of a chosen music clip, respectively.
Step 5: The scaling behavior of the fluctuation function is obtained from the log–log plot of Fq(s) versus s for each value of q. If the series is long-range correlated, Fq(s) increases with s as a power law,

$$ F_q(s) \sim s^{h(q)} \quad (8) $$

The exponent h(q) is called the generalized Hurst exponent. The Hurst exponent is a measure of the self-similarity and correlation properties of a time series produced by a fractal process, and the presence or absence of long-range correlation can be determined from it. A mono-fractal time series is characterized by a unique h(q) for all values of q. Fig. 7(c) gives the variation of h(q) with q.
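The numerical sketch below re-implements Steps 1–5 (Eqs. (5)–(8)) in Python/numpy for illustration; the authors used MATLAB [81], so this is an assumed reconstruction of the Kantelhardt et al. [68] prescription with MF-DFA-1 detrending, and the toy scale range differs from the 20,000 to N/5 range used in the paper.

```python
# Assumed Python/numpy re-implementation of MF-DFA-1 (Steps 1-5); not the authors' MATLAB code.
import numpy as np

def mfdfa(x, scales, q_values, order=1):
    """Return h(q) estimated from the log-log slopes of F_q(s) versus s."""
    Y = np.cumsum(x - np.mean(x))                            # Step 1: profile (Eq. 5)
    log_Fq = np.empty((len(q_values), len(scales)))
    for j, s in enumerate(scales):
        Ns = len(Y) // s
        # Step 2: bins taken from both ends of the profile (2*Ns segments in total)
        segs = np.concatenate([Y[:Ns * s].reshape(Ns, s),
                               Y[-Ns * s:].reshape(Ns, s)])
        # Step 3: detrend each bin with a first-order polynomial and take the variance (Eq. 6)
        t = np.arange(s)
        F2 = np.array([np.mean((seg - np.polyval(np.polyfit(t, seg, order), t)) ** 2)
                       for seg in segs])
        # Step 4: q-order fluctuation function (Eq. 7)
        for i, q in enumerate(q_values):
            log_Fq[i, j] = np.log2(np.mean(F2 ** (q / 2.0)) ** (1.0 / q))
    # Step 5: h(q) is the slope of log F_q(s) versus log s (Eq. 8)
    log_s = np.log2(scales)
    return np.array([np.polyfit(log_s, log_Fq[i], 1)[0] for i in range(len(q_values))])

q = np.array([-5, -3, -1, 1, 3, 5], dtype=float)             # q = 0 is excluded
h_q = mfdfa(np.random.randn(100_000),
            scales=np.array([256, 512, 1024, 2048, 4096]), q_values=q)
print(h_q)   # ~0.5 for all q for uncorrelated white noise (mono-fractal behavior)
```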
The generalized Hurst exponent h(q) of MF-DFA is related to the classical scaling exponent τ(q) by the relation

$$ \tau(q) = q\,h(q) - 1 \quad (9) $$

A mono-fractal series with long-range correlation is characterized by a linearly q-dependent exponent τ(q) with a single Hurst exponent H. A multifractal signal, on the other hand, possesses multiple Hurst exponents, and in this case τ(q) depends non-linearly on q [82].
The singularity spectrum f(α) is related to h(q) by

$$ \alpha = h(q) + q\,h'(q), \qquad f(\alpha) = q\,[\alpha - h(q)] + 1 \quad (10) $$
where α is the singularity strength and f(α) specifies the dimension of the subset of the series that is characterized by α. The multifractal spectrum is capable of providing information about the relative importance of the various fractal exponents in the series; e.g., the width of the spectrum denotes the range of exponents. A quantitative characterization of the spectrum may be obtained by least-squares fitting it to a quadratic function around the position of the maximum α0 [83],

$$ f(\alpha) = A\,(\alpha - \alpha_0)^2 + B\,(\alpha - \alpha_0) + C $$

where C is an additive constant, C = f(α0) = 1, and B indicates the asymmetry of the spectrum; it is zero for a symmetric spectrum. We have later quantified the asymmetry of the spectrum using the Asymmetry Index to characterize the nature of the multifractal spectra. A is a constant whose role is to determine the orientation of the parabola. The width of the spectrum can be obtained by extrapolating the fitted curve to zero.
Fig. 7. (a, b): Sample Log2 Fq (s) vs Log2 (scale) plot for original and shuffled series for a specific music clip. (c): Sample h(q) vs (q) plot and (d):
f(α ) vs α plot for original and shuffled series for a specific music clip.
$$ W = \alpha_1 - \alpha_2, \quad \text{with } f(\alpha_1) = f(\alpha_2) = 0 \quad (11) $$

The width of the spectrum gives a measure of the multifractality of the series: the greater the width W, the greater the multifractality. For a mono-fractal time series, the width will be zero, as h(q) is independent of q. Fig. 7(a) and (b) give the log–log regression plots of Fq(s) versus s for the ''original'' and ''randomly shuffled'' time series
of a chosen music clip. Fig. 7(c) shows the same variation for h(q) versus q, and Fig. 7(d) shows the f(α) versus α plot for an original and a shuffled multifractal series. The origin of multifractality in a music signal time series can be verified by randomly shuffling the original time series data [43,44,46]. All long-range correlations that existed in the original data are destroyed by this random shuffling, and what remains is a totally uncorrelated sequence. Hence, if the multifractality of the original data was due to long-range correlation, the shuffled data will show non-fractal scaling, i.e. the self-similar nature of the signal would be absent. On the other hand, if the initial h(q) dependence does not change, i.e., if h(q) = h_shuffled(q), then the multifractality is not due to long-range correlation but is a result of a broad probability density function of the time series. If a series has multifractality both due to long-range correlation as well as the probability density function, then the shuffled series will have a smaller width W than the original series. Since the length of our shuffled time series is of the order of N ∼ 10^5, we have kept our scaling variable between 20,000 and N/5, which falls in the CLT-regime, negating finite-size effects in the multifractal nature of the signal [84].
For a sample multifractal spectrum (Fig. 8), we have calculated the asymmetry index using the following formula, at par with the asymmetry parameter proposed in [85]:

$$ \eta_{ai} = \frac{\Delta\alpha_1 - \Delta\alpha_2}{\Delta\alpha_1 + \Delta\alpha_2} \quad (12) $$

where η_ai is the asymmetry index and ∆α1 and ∆α2 are the distances of α0 from the left-most and right-most α values, respectively. If η_ai < 0, the multifractal spectrum has a longer right tail; if η_ai > 0, it has a longer left tail. If the spectrum is symmetric, the Asymmetry Index equals 0, since ∆α1 and ∆α2 are equal. It is known that the multifractal spectrum has a long left tail when the time series has a multifractal structure that is insensitive to local fluctuations with small magnitudes. In contrast, the multifractal spectrum has a long right tail when the time series has a multifractal structure that is insensitive to local fluctuations with large magnitudes [86].
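For completeness, the spectrum-level descriptors used above (the width W of Eq. (11), obtained here by extrapolating a quadratic fit of f(α) to zero, and the asymmetry index of Eq. (12)) can be computed from h(q) as in the following sketch; the gradient-based evaluation of α and the toy h(q) are assumptions for illustration.

```python
# Sketch: multifractal width W (Eq. 11) and asymmetry index (Eq. 12) from h(q);
# illustrative assumptions, not the authors' MATLAB routine.
import numpy as np

def spectrum_descriptors(q, h_q):
    tau = q * h_q - 1.0                        # Eq. (9)
    alpha = np.gradient(tau, q)                # alpha = d tau / d q (equivalent to Eq. 10)
    f_alpha = q * (alpha - h_q) + 1.0          # Eq. (10)
    # quadratic fit of f(alpha), extrapolated to f = 0 to get the end points alpha_1, alpha_2
    A, B, C = np.polyfit(alpha, f_alpha, 2)
    roots = np.roots([A, B, C])
    W = float(abs(roots[0] - roots[1]))        # multifractal width (Eq. 11)
    alpha0 = alpha[np.argmax(f_alpha)]         # position of the spectrum maximum
    d1, d2 = alpha0 - alpha.min(), alpha.max() - alpha0
    eta_ai = (d1 - d2) / (d1 + d2)             # asymmetry index (Eq. 12)
    return W, eta_ai

q = np.linspace(-5, 5, 21); q = q[q != 0]      # q = 0 excluded
h_q = 1.5 - 0.05 * q                           # toy, q-dependent h(q) (multifractal-like)
print(spectrum_descriptors(q, h_q))
```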
3. Experiments
Dataset
JUMusEmoDB consists of 1600 audio clips of 30 s each. The clips have been taken from different conventional raga
renditions played in sitar and sarod by eminent maestros of ICM and recorded with a sample frequency of 44.1 kHz.
For each of the emotional classes, there were 400 clips, of which 200 belonged to sitar while the other 200 were played
on sarod. Initial annotation and emotional classification of the database were done based on an emotional rating test (5-point Likert scale) performed by 100 participants. For the multifractal analysis, 10 clips for each emotional class have been taken and classification was attempted on them. The dataset can be found here.
Deep Learning Based Approach
The final layer of each of these architectures is a fully connected layer, preceded by all the convolutional layers. The output from the fully connected layer goes to a softmax function as an input vector for extracting (i) the four different classes pertaining to the four types of emotions (Happy, Sad, Calm and Anxiety) evoked by the music clips in our database, and (ii) the two different classes of stringed instruments, sitar and sarod. We have used CUDA implementations of the algorithms to train and test the models. Experiments were run on an NVIDIA P100 machine with 16 GB GPU memory and 12 GB RAM. In this section, we explain our database, JUMusEmoDB, in detail along with the loss function used and the final results obtained.
For comparison purposes, a criterion is required to evaluate the performance of the model. This criterion is usually
described as a loss function which determines the deviation of model prediction from the ground truth or target values. For
comparing images, a usual trend is to use mean squared error (MSE), mean absolute error (MAE) and Structural Similarity
Index Metric (SSIM) as loss functions. For time series analysis, MSE is used frequently. But, for classification purposes the
usual tendency is to use entropy-based losses. One such entropy-based loss is cross entropy loss [87] which is a measure
of uncertainty, and it is given as,
$$ L(X) = \begin{cases} -\int p(x)\,\log p(x)\,dx, & x \text{ is a continuous random variable} \\ -\sum_{x} p(x)\,\log p(x), & x \text{ is a discrete random variable} \end{cases} $$

We have used the categorical cross entropy loss for comparison purposes. Cross entropy loss uses the concept of entropy to measure how closely a predicted output resembles the actual or target (ground truth) output. The magnitude of the cross entropy loss increases as the prediction diverges from the actual output; hence a loss of 0 represents perfect prediction, that is, the model performing with 100% accuracy. The cross entropy loss, or log loss, is calculated using the following equation:

$$ L = -\sum_{i} y_i \log(\hat{y}_i) $$

where y_i is the one-hot encoded target and \hat{y}_i is the predicted probability for class i.
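In practice, this is the standard categorical cross entropy criterion of Deep Learning frameworks; a brief PyTorch illustration (assumed framework) is given below.

```python
# Brief illustration of the categorical cross entropy loss in PyTorch (assumed framework);
# nn.CrossEntropyLoss combines log-softmax with the negative log-likelihood,
# i.e. L = -sum_i y_i log(y_hat_i) for one-hot targets y.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 4)              # batch of 8, four emotion classes
targets = torch.randint(0, 4, (8,))     # integer class labels
loss = criterion(logits, targets)       # 0 would mean perfect prediction
print(loss.item())
```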
4. Results
1. We have divided the entire dataset into 3 parts: a training dataset, a validation dataset, and a test dataset.
2. We have segmented each of the clips into 4 parts, which increases the size of the overall dataset and makes it better suited for training purposes.
3. We have split the training data into mini-batches with a batch size of 64. We have used RGB spectrograms as input images, where we have resized each channel to 32 × 32 in order to save memory and reduce training time. During training, the categorical cross entropy loss between prediction and target is minimized.
4. We have used the Lookahead optimizer wrapped around Adam with a learning rate of 0.001. The reason for using the Lookahead optimizer is that it improves the performance of SGD and Adam and their respective variants [88]. A minimal sketch of this training setup is given below.
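A minimal training-setup sketch follows; the torch_optimizer package is one publicly available Lookahead implementation and, like the stand-in data and tiny model, is an assumption rather than the authors' exact code.

```python
# Assumed training setup matching the stated configuration: batch size 64, 32x32 RGB
# spectrogram inputs, categorical cross entropy, Lookahead wrapped around Adam (lr = 0.001).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import torch_optimizer as extra_optim

# stand-in dataset: 256 random 32x32 RGB "spectrograms" with 4 emotion labels
data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 4, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
criterion = nn.CrossEntropyLoss()
optimizer = extra_optim.Lookahead(torch.optim.Adam(model.parameters(), lr=1e-3),
                                  k=5, alpha=0.5)

for epoch in range(2):                  # the paper trains for 50 epochs
    for x, y in loader:
        model.zero_grad()               # zero gradients via the model, agnostic to the wrapper
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(epoch, loss.item())
```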
During the training phase, the model parameters need to be adjusted suitably. With each epoch we get two losses, namely the training loss and the validation loss. The loss curves are plotted after every epoch, from which it can be seen whether the model is training properly, and whether it is overfitting or underfitting the dataset, which is indicative of model performance. This helps us to tune the hyperparameters suitably. The loss curves for both training and validation with respect to epoch for each model are shown in Figs. 9 and 11. We have considered 50 epochs for our application. With increasing epochs, convergence is achieved for each of the models considered in our study.
The convergence plots help in tuning the parameters of the CNN models, and after adjusting the parameters we obtain the best results for each CNN model. We then test each of the models on the test dataset and obtain 4 different accuracies. We have only considered Top-1 accuracy in our study.
A confusion matrix gives a better assessment of a classifier's performance by providing information about the misclassification rates. Thus, we have presented the confusion matrices for the ODE-Net architecture used in the study for both types of classification tasks in Figs. 10 and 12.
Emotion Recognition:
Table 1 shows the Top-1 accuracy values of the aforementioned models. It can be noticed that the Proposed Neural
ODE based paradigm gives the highest Top-1 accuracy of 92.969% on the test dataset out of the four competing models,
thus revealing that it fits best with the acquired dataset. It can also be seen that the second-best model is ResNet50.
It has been previously mentioned that the intuition behind ODE-Net came from ResNet; ODE-Net effectively works as a ''continuous'' ResNet and has outperformed ResNet on multiple occasions. So, quite expectedly, here as well the proposed Neural ODE based architecture outperforms ResNet as well as SqueezeNet and MobileNet.
Table 1
Top-1 Accuracies on test dataset for 4 different models. Proposed Neural
ODE based architecture gives the highest Top-1 accuracy on the test dataset
out of four competing models.
Model Top-1 Accuracy
SqueezeNetV1 87.969%
MobileNetV2 85.078%
ResNet50 91.250%
Proposed ODE-Net 92.969%
Fig. 9. Training and Validation losses for 50 epochs for (a) SqueezeNetV1, (b) MobileNetV2, (c) ResNet50 and (d) Proposed Neural ODE based
architecture.
From the confusion matrix plot in Fig. 10 for proposed ODE-Net, it can be noted that out of the four emotion types,
predicted sad emotion clips have the least misclassification and have been best predicted whereas predicted calm emotion
clips have the most misclassification.
The proposed architecture performs very well for the four emotional classes chosen, with very low mis-classification
rate.
Fig. 10. Confusion matrix of Proposed Neural ODE based architecture on test dataset. Out of the four emotion types, predicted sad emotion clips
have the least misclassification and have been best predicted whereas predicted calm emotion clips have the most misclassification.
Table 2
Top-1 Accuracies on test dataset for 4 different models. Proposed Neural
ODE based architecture gives the highest Top-1 accuracy on the test dataset
out of four competing models.
Model Top-1 Accuracy
SqueezeNetV1 98.672%
MobileNetV2 99.063%
ResNet50 98.359%
Proposed ODE-Net 99.766%
Instrument Recognition:
While considering the emotional appraisal related to the musical clips used, two types of stringed instruments were used: sitar and sarod. As an extension of the emotion classification work, the same architectures were used to classify the instruments.
Table 2 shows the Top-1 accuracies of the four architectures mentioned before. It can be noticed that the Proposed
Neural ODE based paradigm gives the highest Top-1 accuracy of 99.766% on the test dataset out of the four competing
models, thus revealing that it is able to classify the two instruments almost without any error. It can also be seen that
the second-best model in this case is MobileNetV2. Here again, the proposed ‘‘continuous’’ version of ResNet, i.e., ODENet
has outperformed ResNet.
From the confusion matrix plot in Fig. 12 for proposed ODE-Net, it can be noted that out of the two instrument classes,
predicted sitar clips have zero misclassification and have been predicted accurately whereas 3 predicted sarod clips have
been misclassified as sitar clips.
From the above results we notice that for instrument identification, the test accuracies are higher compared to those
for emotion identification. This indicates the following:
1. Sitar and Sarod can be differentiated very well especially given the fact that sarod is a fretless instrument whilst
sitar is not. So, we can say that CNN based Deep Learning with spectrograms enables excellent source/instrument
recognition especially for fret-based vs fret-less instruments.
2. Emotion Recognition is a more challenging problem compared to Instrument Identification.
Fig. 11. Training and Validation losses for 50 epochs for (a) SqueezeNetV1, (b) MobileNetV2, (c) ResNet50 and (d) Proposed Neural ODE based
architecture.
One of the drawbacks of this work is that we have not considered enough instruments to validate the aforementioned claims. But we are expanding our database, and in future we will consider more combinations of instruments, particularly multiple fret-based and fret-less instruments, to verify whether instrument recognition still works well.
Multi-fractality based approach
To verify the possibility of classification with the help of robust non-linear techniques, 10 clips of 30 s each were selected from each emotional class across the two instruments. The multifractal width was computed by dividing each clip into six overlapping windows of 5 s each. Thus, we have a corpus of about 240 windows of 5 s each belonging to four (4) different classes of emotion and taken from two (2) different instruments. Representative f(α) versus α plots for clips belonging to the four different classes of emotions are shown in Fig. 13.
From the representative plot shown in Fig. 13, it is evident that the multifractal spectral width for the four emotional
classes is significantly different and thus can be classified with considerable significance using MFDFA techniques. The
multifractal width values and the corresponding shuffled widths have been given in Table 3.
From Table 3, the averaged multifractal widths corresponding to each emotional class have been shown in Fig. 14. The
error bars shown in the figure represent the standard deviation values of the multifractal complexities of each class of
emotions.
From the figure, it is visible that the emotional variation, as obtained from the changes in multifractal complexity in
case of the two stringed instruments follows the same trend although the degree of associated complexity is different.
While for sarod, the multifractality of the signal is in general on the lower side compared to sitar instrument, for which the
Fig. 12. Confusion matrix of Proposed Neural ODE based architecture on test dataset. Out of the two instrument classes, predicted sitar clips have
zero misclassification and have been predicted accurately whereas 3 predicted sarod clips have been misclassified as sitar clips.
Fig. 13. A comparative multifractal spectra plot for clips belonging to four emotional classes.
multifractality is in general higher. With regard to emotion categorization, clips representing the happy emotion register lower values of multifractality, while those conveying the sad emotion correspond to higher values of multifractality. The multifractality of the clips corresponding to the emotions anxiety and calm are very close to each other, albeit conventionally they are known to be distinctly different emotions. For sarod, the multifractal complexities corresponding to anxiety and sad are very close to each other, while an interesting observation is that the complexities are significantly different for the two instruments in the case of happy and calm.
To parameterize the shape of the multifractal spectra, we have computed the asymmetry index of the multifractal spectrum (Eq. (12) of the Methodology section) for the 10 clips belonging to the four different emotional classes and the two instruments, as illustrated in Table 4.
From Table 4, it becomes evident that for most of the chosen classes of emotions the multifractal spectrum has a long right tail, due to which the asymmetry index η_ai has a value less than 0, the exception being the emotional class ''calm''. This primarily indicates that, in our experimental cases, the multifractal structure of most of the chosen audio signals is dominated
Table 3
Multifractal and shuffled width values for the 10 clips chosen corresponding to the four target emotions (multifractal width / shuffled width).

Sitar:
Target emotion    Clip 1        Clip 2        Clip 3        Clip 4        Clip 5
Anxiety           1.84 / 0.18   1.82 / 0.35   1.75 / 0.16   1.81 / 0.07   1.73 / 0.08
Calm              1.59 / 0.15   1.66 / 0.22   1.56 / 0.23   1.62 / 0.08   1.72 / 0.25
Happy             1.20 / 0.08   1.09 / 0.14   1.07 / 0.17   1.09 / 0.18   1.16 / 0.18
Sad               2.01 / 0.25   2.01 / 0.13   1.90 / 0.08   2.01 / 0.15   2.08 / 0.15

Sarod:
Target emotion    Clip 1        Clip 2        Clip 3        Clip 4        Clip 5
Anxiety           1.82 / 0.08   1.79 / 0.16   1.77 / 0.09   1.73 / 0.29   1.62 / 0.18
Calm              1.79 / 0.25   1.88 / 0.28   1.72 / 0.05   1.77 / 0.31   1.83 / 0.23
Happy             1.08 / 0.18   0.98 / 0.04   1.25 / 0.07   1.16 / 0.25   1.26 / 0.33
Sad               1.74 / 0.18   1.82 / 0.31   1.81 / 0.09   1.82 / 0.13   1.79 / 0.17
Fig. 14. Variation of averaged multifractal spectral width corresponding to emotional variation in two instruments.
Table 4
Asymmetry Index of the multifractal spectrum corresponding to classes of emotion and instruments (Clips 1–5 for each instrument).

Emotions      Sitar (Clips 1–5)                       Sarod (Clips 1–5)
Anxiety       −0.60  −0.69   0.08  −0.47  −0.87       −0.80  −0.74  −0.68  −0.55  −0.43
Calm          −0.56   0.56   0.64   0.42   0.17        0.29   0.50  −0.46  −0.13  −0.25
Happy         −0.81   0.42  −0.73  −0.82  −0.37       −0.72  −0.64  −0.27  −0.16  −0.23
Sad           −0.64  −0.01  −0.80  −0.80  −0.68       −0.73  −0.09  −0.84  −0.62  −0.78
by small fluctuations. This assumption, however, does not hold true for the clips categorized as ''calm'', as the asymmetry index for these clips mostly has values > 0, so they are mostly dominated by large fluctuations. This is evident from the primarily right-skewed nature of the multifractal spectra corresponding to the ''calm'' clips shown in Fig. 13. The distinction between the two instruments is also very much evident from Table 4 above.
To get more insight, a one-way ANOVA test was conducted to test the statistical significance of the emotional classification obtained from the multifractal analysis. The results are presented in Tables 5–8. Since the obtained results are strongly significant at the 95% confidence level, Tukey's HSD comparison test was performed as a post-hoc analysis to test the significance between the emotional groups in question. The results of Tukey's HSD comparison test for sitar are given in Table 6.
Table 5
One way ANOVA parameters corresponding to emotion classification in sitar.
Source SS df MS F P
Treatment (between groups) 1.251654 3 0.42 23.51 <.0001
Error 0.561339 32 0.02
SS/Bl
Total 1.81946 35
Table 6
Tukey’s HSD test values for the statistical significance of the test groups
in sitar.
Comparison groups p -value
HSD[.05] = 0.17; HSD[.01] = 0.21
Anxiety vs calm non-significant
Anxiety vs Happy <0.01
Anxiety vs Sad <.05
Calm vs Happy <0.01
Calm vs Sad <0.01
Happy vs Sad <0.01
Table 7
One way ANOVA parameters corresponding to emotion classification in sarod.
Source SS df MS F P
Treatment (between groups) 2.609853 3 0.869951 60.49 <.0001
Error 0.460244 32 0.014383
SS/Bl
Total 3.070097 35
Table 8
Tukey’s HSD test values for the statistical significance of the test groups
in sarod.
Comparison groups p -value
HSD[.05] = 0.17; HSD[.01] = 0.21
Anxiety vs calm <0.05
Anxiety vs Happy <0.01
Anxiety vs Sad non-significant
Calm vs Happy <0.01
Calm vs Sad non-significant
Happy vs Sad <0.01
Table 9
One way ANOVA parameters corresponding to instrument classification in sarod and sitar.
Source SS df MS F P
Treatment (between groups) 0.107339 1 0.107339 28.69 <.0001
Error 0.059311 16 0.003707
SS/Bl
Total 0.16665 17
As is evident from the earlier figure, the values for anxiety and calm are too close to each other to be classified significantly; the response from Tukey's HSD test also reveals the same information, while the statistical significance is considerable for all the other test groups in question.
A similar ANOVA test was performed for the sarod clips to classify the emotional clips taken from that instrument. Since the overall ANOVA results come out to be considerably significant in this case also, Tukey's HSD test was performed to compare the values between the emotional test groups.
One-way ANOVA was also performed for the multifractal complexities corresponding to sitar and sarod; the values are given in Table 9. The values obtained statistically corroborate our previous findings from the assessment of the raw multifractal complexity values: the emotions which showed complexity values very near to one another have lower significance ratios compared to the others.
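For reference, the following sketch shows how a one-way ANOVA followed by Tukey's HSD post-hoc test can be run in Python (scipy/statsmodels assumed), using the five sitar clip widths of Table 3 purely as an illustration; the paper's Tables 5–8 are based on a larger set of windowed values, so the numbers will not coincide.

```python
# Sketch of the statistical tests used above (one-way ANOVA + Tukey's HSD);
# scipy/statsmodels are assumed tools, and the data are the Table 3 sitar widths,
# used only as an illustration.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

widths = {
    "anxiety": [1.84, 1.82, 1.75, 1.81, 1.73],
    "calm":    [1.59, 1.66, 1.56, 1.62, 1.72],
    "happy":   [1.20, 1.09, 1.07, 1.09, 1.16],
    "sad":     [2.01, 2.01, 1.90, 2.01, 2.08],
}

F, p = f_oneway(*widths.values())                       # between-group significance
print(f"one-way ANOVA: F = {F:.2f}, p = {p:.2g}")

values = np.concatenate(list(widths.values()))
labels = np.repeat(list(widths.keys()), [len(v) for v in widths.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))    # post-hoc pairwise comparisons
```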
5. Conclusion

In this paper, we have introduced a novel dataset called JUMusEmoDB which presently contains 1600 audio clips, each of approximately 30 s duration. There are four categories of emotions in this dataset, viz. happy, sad, calm and anxiety, with 400 clips per emotion. While categorizing the emotions based on human response data (100 participants), we have only considered the predominant emotion imparted by an audio clip. These musical recordings, or ragas, belong to the genre of Indian Classical Music (ICM), and to the best of our knowledge our dataset is the first of its kind in this domain. Furthermore, out of these 1600 clips, 800 clips were recorded on sitar and the remaining 800 clips were recorded on sarod. This therefore gives an opportunity to perform Musical Instrument Identification based on the recordings alongside Music Emotion Recognition. Hence, we essentially attempted to explore the dataset by investigating these two sub-problems.
For tackling both problems, we adopted a Convolutional Neural Network (CNN) based Deep Learning approach as well as the Multifractal Detrended Fluctuation Analysis (MFDFA) method. Since the recordings were long enough, we segmented
individual clips into 4 components eventually obtaining 6400 clips in total. For both emotion recognition and instrument
identification tasks, we have considered 4096 clips from this dataset for training purposes, 1024 clips for validation
purposes, and the remaining 1280 clips for testing purposes. Thus, we had sufficient training data per class for each
sub-problem to avoid over-fitting. For using audio clips as images to be used for training in a CNN based architecture, we
had converted these segmented audio clips to corresponding spectrograms using STFT methodology. For both the sub-
problems we have trained four network architectures out of which our proposed ODE-Net has successfully outperformed
the remaining three architectures. The percentage accuracy for Musical Instrument Identification is, however, higher than that for Music Emotion Recognition. The reason can be attributed to the lesser complexity of instrument classification compared to emotion classification, since emotions are subjective and can be overlapping in nature, and also to the smaller number of classes in the instrument identification problem. The confusion matrices presented for both sub-problems for the ODE-Net provide some clarity on this fact. For the sake of exploring the dataset further and ratifying the versatility of our newly developed dataset using other methods, we conducted a non-linear MFDFA analysis on the same dataset. The MFDFA algorithm uses a single parameter, the multifractal spectral width, to segregate the different emotional categories. Music clips are in general nonlinear and non-stationary in nature, and the scaling based methods act as mathematical microscopes capturing the inherent complexities and spikes of the music signals very nicely. This may be the reason we see the complexity values of anxiety and calm in a few clips to be very close to each other, while ideally they should be far apart.
the results obtained from the nonlinear classification methodology attempted in this work. Since this is the first of its kind
study which applies Deep Learning and nonlinear based classification algorithm on a newly created dataset, the scientific
community can consider our results as a baseline for further comparison. Our results validate our dataset and demonstrate that it can be used in the future for Deep Learning and nonlinear based applications.
Future works in this direction
It is important to state here that, for the sake of brevity, we did not undertake any feature-based Machine Learning approaches in our study; this can be considered as an interesting future work. We have instead used CNN based architectures for tackling the problems of music emotion recognition and musical instrument identification, since they tend to offer better overall accuracy through better extraction of useful features from the data compared to traditional methods. However, in future, some studies need to be conducted to comprehend the source of emotions in a given musical clip. For CNN based classification strategies, the usage of input images (matrices) of bigger sizes (256 × 256, for example), to explore the possibility of better classification accuracies especially for the emotion recognition task, can be considered as future work. Moreover, other kinds of time-series based Deep Learning architectures like RNNs or LSTMs can be employed for the classification tasks. On the other hand, another future work can include using non-linear features for unsupervised clustering using k-means or Fuzzy C-means based methods. Yet another future work includes investigating the impact of Indian Classical Music on the human brain, alongside conducting cross-cultural studies combining both Western Classical and Indian Classical musical clips. An interesting outcome of this novel music-emotion dataset would be a comparison of the emotional appraisal elicited by Indian and Western Classical Music, and of how the brain processes involved in these faculties differ from one another. To conclude, we would like to reiterate that the dataset presented here is one of its kind, versatile, and still under development. We plan to include more data covering not only other emotional features but also different musical instruments, eventually making the dataset comprise the different emotions conventionally presented in raga renditions.
CRediT authorship contribution statement

Sayan Nag: Methodology, Software, Formal analysis, Data curation, Writing – original draft, Investigation. Medha Basu:
Data acquisition, Corpus development, Musicological formalism, Writing, Visualization, Data curation, Conceptualization.
Shankha Sanyal: Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing –
review & editing. Archi Banerjee: Conceptualization, Methodology, Formal analysis, Resources. Dipak Ghosh: Writing –
review & editing, Supervision.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgments
AB (CSRI/PDF-34/2018) and SS (DST/CSRI/2018/78) acknowledge DST, Govt of India for providing the DST-CSRI
(Cognitive Science Research Initiative) Post Doctoral Fellowship to pursue this research.
References
[1] M. Barthet, G. Fazekas, M. Sandler, Music emotion recognition: From content-to context-based models, in: International Symposium on Computer
Music Modeling and Retrieval, Springer, Berlin, Heidelberg, 2012, pp. 228–252.
[2] J.G. de Quirós, S. Baldassarri, J.R. Beltrán, A. Guiu, P. Álvarez, An automatic emotion recognition system for annotating spotify’s songs, in: OTM
Confederated International Conferences on the Move to Meaningful Internet Systems, Springer, Cham, 2019, pp. 345–362.
[3] M.E. Sachs, A. Damasio, A. Habibi, The pleasures of sad music: a systematic review, Front. Hum. Neurosci. 9 (2015) 404.
[4] D. Leubner, T. Hinterberger, Reviewing the effectiveness of music interventions in treating depression, Front. Psychol. 8 (1109) (2017).
[5] B.K. Baniya, D. Ghimire, J. Lee, Automatic music genre classification using timbral texture and rhythmic content features, in: 2015 17th
International Conference on Advanced Communication Technology, ICACT, IEEE, 2015, pp. 434–443.
[6] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[7] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[8] J.H. Juthi, A. Gomes, T. Bhuiyan, I. Mahmud, Music emotion recognition with the extraction of audio features using machine learning approaches,
in: P. Singh, B. Panigrahi, N. Suryadevara, S. Sharma, A. Singh (Eds.), Proceedings of ICETIT 2019. Lecture Notes in Electrical Engineering, vol.
605, Springer, Cham, 2020, http://dx.doi.org/10.1007/978-3-030-30577-2_27.
[9] S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory 28 (2) (1982) 129–137, http://dx.doi.org/10.1109/TIT.1982.1056489 (originally a 1957 Bell Telephone Laboratories paper).
[10] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Kdd 96 (34)
(1996) 226–231.
[11] J.C. Bezdek, R. Ehrlich, W. Full, FCM: The fuzzy c-means clustering algorithm, Comput. Geosci. 10 (2–3) (1984) 191–203.
[12] B.G. Patra, D. Das, S. Bandyopadhyay, Unsupervised approach to Hindi music mood classification, in: Mining Intelligence and Knowledge
Exploration, Springer, Cham, 2013, pp. 62–69.
[13] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012)
1097–1105.
[14] Y. LeCun, LeNet-5, convolutional neural networks, 2015, URL: http://yann.lecun.com/exdb/lenet.
[15] Yann LeCun, Yoshua Bengio, Geoffrey Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[16] Goodfellow Ian, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014).
[17] Goodfellow Ian, et al., Deep Learning, MIT Press, Cambridge, 2016.
[18] Kelvin Xu, et al., Show, attend and tell: Neural image caption generation with visual attention, in: International Conference on Machine Learning,
2015.
[19] Diederik P. Kingma, Max Welling, Auto-encoding variational bayes, 2013, arXiv preprint arXiv:1312.6114.
[20] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of ICML Workshop on Unsupervised and Transfer Learning,
2012, pp. 37–49.
[21] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical
Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015, pp. 234–241.
[22] Hammernik Kerstin, et al., Learning a variational network for reconstruction of accelerated MRI data, Magn. Reson. Med. 79 (6) (2018)
3055–3071.
[23] Nitski Osvald, et al., CDF-net: Cross-domain fusion network for accelerated MRI reconstruction, in: International Conference on Medical Image
Computing and Computer-Assisted Intervention, Springer, Cham, 2020.
[24] Bhattacharyya Mayukh, Sayan Nag, Hybrid style siamese network: Incorporating style loss in complimentary apparels retrieval, 2019, arXiv
preprint arXiv:1912.05014.
[25] Pu Yunchen, et al., Variational autoencoder for deep learning of images, labels and captions, Adv. Neural Inf. Process. Syst. (2016).
[26] Nag Sayan, Lookahead optimizer improves the performance of convolutional autoencoders for reconstruction of natural images, 2020, arXiv
preprint arXiv:2012.05694.
[27] Vaswani Ashish, et al., Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017) 5998–6008.
[28] Tom B. Brown, et al., Language models are few-shot learners, 2020, arXiv preprint arXiv:2005.14165.
[29] Liu Tong, et al., Audio-based deep music emotion recognition, in: AIP Conference Proceedings, Vol. 1967, no. 1, AIP Publishing LLC, 2018.
[30] Liu Xin, et al., CNN based music emotion classification, 2017, arXiv preprint arXiv:1704.05665.
[31] Yang Yi-Hsuan, Homer H. Chen, Machine recognition of music emotion: A review, ACM Trans. Intell. Syst. Technol. (TIST) 3 (3) (2012) 1–30.
[32] Agrawal Yudhik, Ramaguru Guru Ravi Shanker, Vinoo Alluri, Transformer-based approach towards music emotion recognition from lyrics, 2021,
arXiv preprint arXiv:2101.02051.
[33] S. Rajesh, N.J. Nalini, Musical instrument emotion recognition using deep recurrent neural network, Procedia Comput. Sci. 167 (2020) 16–25.
[34] Liu Huaping, Yong Fang, Qinghua Huang, Music emotion recognition using a variant of recurrent neural network, in: 2018 International
Conference on Mathematics, Modeling, Simulation and Statistics Application, MMSSA 2018, Atlantis Press, 2019.
[35] Malik Miroslav, et al., Stacked convolutional and recurrent neural networks for music emotion recognition, 2017, arXiv preprint arXiv:
1706.02292.
[36] Turnbull Douglas, et al., Towards musical query-by-semantic-description using the CAL500 data set, in: Proceedings of the 30th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
[37] Wang Shuo-Yang, et al., Towards time-varying music auto-tagging based on CAL500 expansion, in: 2014 IEEE International Conference on
Multimedia and Expo, ICME, IEEE, 2014.
[38] Sanyal Shankha, et al., Music of brain and music on brain: a novel EEG sonification approach, Cogn. Neurodyn. 13 (1) (2019) 13–31.
[39] Sengupta Sourya, et al., Emotion specification from musical stimuli: An EEG study with AFA and DFA, in: 2017 4th International Conference
on Signal Processing and Integrated Networks, SPIN, IEEE, 2017.
[40] Nag Sayan, et al., Can musical emotion be quantified with neural jitter or shimmer? A novel EEG based study with Hindustani classical music,
in: 2017 4th International Conference on Signal Processing and Integrated Networks, SPIN, IEEE, 2017.
[41] Sarkar Rajib, et al., Recognition of emotion in music based on deep convolutional neural network, Multimedia Tools Appl. 79 (1) (2020) 765–783.
[42] Nag Sayan, et al., A fractal approach to characterize emotions in audio and visual domain: A study on cross-modal interaction, 2021, arXiv
preprint arXiv:2102.06038.
[43] S. Sanyal, A. Banerjee, S. Nag, U. Sarkar, S. Roy, R. Sengupta, D. Ghosh, Tagore and neuroscience: A non-linear multifractal study to encapsulate
the evolution of Tagore songs over a century, Entertain. Comput. (2020) 100367.
[44] A. Banerjee, S. Sanyal, S. Roy, S. Nag, R. Sengupta, D. Ghosh, A novel study on perception–cognition scenario in music using deterministic and
non-deterministic approach, Physica A 567 (2021) 125682.
[45] He Juan, et al., Non-linear analysis: Music and human emotions, in: 2015 3rd International Conference on Education, Management, Arts,
Economics and Social Science, Atlantis Press, 2015.
[46] S. Sanyal, A. Banerjee, A. Patranabis, K. Banerjee, R. Sengupta, D. Ghosh, A study on improvisation in a musical performance using multifractal
detrended cross correlation analysis, Physica A 462 (2016) 67–83.
[47] Sanyal Shankha, et al., A non linear approach towards automated emotion analysis in Hindustani music, 2016, arXiv preprint arXiv:1612.00172.
[48] Sanyal Shankha, et al., Gestalt phenomenon in music? A neurocognitive physics study with EEG, 2017, arXiv preprint arXiv:1703.06491.
[49] Banerjee Archi, et al., Neural (EEG) response during creation and appreciation: A novel study with Hindustani raga music, 2017, arXiv preprint
arXiv:1704.05687.
[50] Sarkar Uddalok, et al., Neural network architectures to classify emotions in Indian classical music, 2021, arXiv preprint arXiv:2102.00616.
[51] Solanki Arun, Sachin Pandey, Music instrument recognition using deep convolutional neural networks, Int. J. Inform. Technol. (2019) 1–10.
[52] Han Yoonchang, Jaehun Kim, Kyogu Lee, Deep convolutional neural networks for predominant instrument recognition in polyphonic music,
IEEE/ACM Trans. Audio Speech Lang. Process. 25 (1) (2016) 208–221.
[53] Li Peter, Jiyuan Qian, Tian Wang, Automatic instrument recognition in polyphonic music using convolutional neural networks, 2015, arXiv
preprint arXiv:1511.05520.
[54] Lostanlen Vincent, Carmine-Emanuele Cella, Deep convolutional networks on the pitch spiral for musical instrument recognition, 2016, arXiv
preprint arXiv:1605.06644.
[55] Humphrey Eric, Simon Durand, Brian McFee, OpenMIC-2018: An Open Data-set for Multiple Instrument Recognition, ISMIR, 2018.
[56] Pons Jordi, et al., Timbre analysis of music audio signals with convolutional neural networks, in: 2017 25th European Signal Processing
Conference, EUSIPCO, IEEE, 2017.
[57] A. Behrman, Global and local dimensions of vocal dynamics, J. Acoust. Soc. Am. 105 (1999) 432–443.
[58] M. Bigerelle, A. Iost, Fractal dimension and classification of music, Chaos Solitons Fractals 11 (14) (2000) 2179–2192.
[59] A. Kumar, S.K. Mullick, Nonlinear dynamical analysis of speech, J. Acoust. Soc. Am. 100 (1) (1996) 615–629.
[60] R. Sengupta, N. Dey, A.K. Datta, D. Ghosh, Assessment of musical quality of tanpura by fractal-dimensional analysis, Fractals 13 (03) (2005)
245–252.
[61] R. Sengupta, N. Dey, A.K. Datta, D. Ghosh, A. Patranabis, Analysis of the signal complexity in sitar performances, Fractals 18 (02) (2010) 265–270.
[62] R. Sengupta, N. Dey, D. Nag, A.K. Datta, Comparative study of fractal behavior in quasi-random and quasi-periodic speech wave map, Fractals
9 (04) (2001) 403–414.
[63] K.J. Hsü, A.J. Hsü, Fractal geometry of music, Proc. Natl. Acad. Sci. 87 (3) (1990) 938–941.
[64] R.F. Voss, J. Clarke, 1/f noise in speech and music, Nature 258 (1975) 317–318.
[65] R. Lopes, N. Betrouni, Fractal and multifractal analysis: a review, Med. Image Anal. 13 (4) (2009) 634–649.
[66] Z.Y. Su, T. Wu, Multifractal analyses of music sequences, Physica D 221 (2) (2006) 188–194.
[67] L. Telesca, M. Lovallo, Revealing competitive behaviours in music by means of the multifractal detrended fluctuation analysis: application to
Bach's sinfonias, in: Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, The Royal Society, 2011, rspa.2011.0118.
[68] J.W. Kantelhardt, S.A. Zschiegner, E. Koscielny-Bunde, S. Havlin, A. Bunde, H.E. Stanley, Multifractal detrended fluctuation analysis of
nonstationary time series, Physica A 316 (1–4) (2002) 87–114.
[69] S. Roy, A. Banerjee, S. Sanyal, Improvisation in Indian classical music: Probing with MB and BE distributions, Jadavpur J. Lang. Linguist. 4 (Sp.
I) (2020) 130–143.
[70] D. Ghosh, R. Sengupta, S. Sanyal, A. Banerjee, Musicality of Human Brain Through Fractal Analytics, Springer, Singapore, 2018.
[71] H.M. Park, Comparing group means: t-tests and one-way ANOVA using Stata, SAS, R, and SPSS, in: The University Information Technology Services
(UITS) Center for Statistical and Mathematical Computing, Indiana University, IN, USA, 2009.
[72] Liu Tong, et al., Audio-based deep music emotion recognition, in: AIP Conference Proceedings, Vol. 1967, no. 1, AIP Publishing LLC, 2018.
[73] Uddalok Sarkar, et al., A simultaneous EEG and EMG study to quantify emotions from Hindustani classical music, in: Recent Developments in
Acoustics, Springer, Singapore, 2020, pp. 285–299.
[74] Steven A. Rieger, Rajani Muraleedharan, Ravi P. Ramachandran, Speech based emotion recognition using spectral feature extraction and an
ensemble of kNN classifiers, in: The 9th International Symposium on Chinese Spoken Language Processing, IEEE, 2014.
[75] Yang Yi-Hsuan, Homer H. Chen, Machine recognition of music emotion: A review, ACM Trans. Intell. Syst. Technol. (TIST) 3 (3) (2012) 1–30.
[76] N.J. Nalini, S. Palanivel, Music emotion recognition: The combined evidence of MFCC and residual phase, Egypt. Inform. J. 17 (1) (2016) 1–10.
[77] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proc. IEEE Conf. Comput. Vis. Pattern Recog. (2016) 770–778.
[78] Forrest N. Iandola, et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, 2016, arXiv preprint
arXiv:1602.07360.
[79] Sandler Mark, et al., Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018.
[80] R.T. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, Neural ordinary differential equations, 2018, arXiv preprint arXiv:1806.07366.
[81] E.A.F.E. Ihlen, Introduction to multifractal detrended fluctuation analysis in Matlab, Front. Physiol. 3 (141) (2012).
[82] Y. Ashkenazy, D.R. Baker, H. Gildor, S. Havlin, Nonlinearity and multifractality of climate change in the past 420,000 years, Geophys. Res. Lett.
30 (22) (2003).
[83] Y.U. Shimizu, S. Thurner, K. Ehrenberger, Multifractal spectra as a measure of complexity in human posture, Fractals 10 (01) (2002) 103–116.
[84] S. Drozdz, J. Kwapień, P. Oświęcimka, R. Rak, Quantitative features of multifractal subtleties in time series, Europhys. Lett. 88 (6) (2010) 60003.
[85] S. Drozdz, P. Oświęcimka, Detecting and interpreting distortions in hierarchical organization of complex time series, Phys. Rev. E 91 (3) (2015)
030902.
[86] E.A.F.E. Ihlen, Introduction to multifractal detrended fluctuation analysis in Matlab, Front. Physiol. 3 (141) (2012).
[87] Zhilu Zhang, Mert R. Sabuncu, Generalized cross entropy loss for training deep neural networks with noisy labels, 2018, arXiv preprint
arXiv:1805.07836.
[88] Zhang Michael, et al., Lookahead optimizer: k steps forward, 1 step back, Adv. Neural Inf. Process. Syst. (2019).