1 Introduction
by verifying the identity of the infants based on their cries. The second scenario
involves neonatal care, where the system identifies a baby’s crying and enables
appropriate actions by caregivers, nurses or mothers.
In infant cry research, various methods based on machine learning and deep
learning have been employed. Among machine learning methods, researchers have
mainly applied support vector machines (SVM)[2], K-Nearest Neighbors (KNN)[3],
Gaussian Mixture Models (GMM)[4], logistic regression[5], K-means clustering[6],
and Random Forests[7] to pathological cry classification, cry reason
classification, and cry sound detection. More recently, researchers have started
using neural methods such as the multi-layer perceptron[8], the general
regression neural network[9], and the Time Delay Neural Network[10]. Novel
neural network architectures such as Convolutional Neural Networks (CNN)[11],
Recurrent Neural Networks (RNN)[12], and hybrid CNN-RNN networks have opened up
new possibilities in infant cry research.
The main challenge in infant cry analysis is collecting gold-standard datasets.
In the literature, researchers have used both real and synthetic datasets.
Ferretti et al. collected real recordings made in the Neonatal Intensive Care
Unit (NICU) of a hospital, as well as synthetic databases including crying
mixed with speech, "beep" sounds, and more[13]. Feier et al. used the TUT Rare
Sound Events 2017 dataset, which contains crying alongside sounds such as
"glass breaking" or "gunshot", together with self-recorded databases of cries
and non-cries[14]. The proposed work focuses on verifying infants using cry
samples, similar to how speaker verification works with adult speech. However,
there is currently no specific database for infant cry speaker verification.
The Ubenwa CryCeleb dataset, provided by Ubenwa Health, offers a labeled
collection of infant cries for research purposes. This dataset was released as
part of the CryCeleb 2023 speaker verification challenge hosted on Hugging
Face. Ubenwa Health has released over 6 hours of manually segmented cry sounds
from 786 newborns to encourage further research in infant cry analysis[15].
The current study performs infant cry verification across different recording
instances rather than relying on cries from the same recording. This approach
is more representative of real-world scenarios where the system needs to verify an infant's
identity over multiple days. Additionally, it has been observed that verifying
separate parts of a cry from the same recording is easier. This is because in-
fants tend to exhibit consistent characteristics within a single crying session but
not necessarily across different sessions. It suggests that factors other than the
infant’s identity can influence the characteristics of the cry.
2 Related Work
3 Model Description
4 Implementation
The cry sound recordings collected for analysis are predominantly short,
lasting between 0.5 and 1.0 seconds, with less than 0.3% of cry sounds
exceeding 4 seconds in duration. These cry sounds are specifically expirations,
excluding inspirations, as inspirations are typically brief, difficult to
detect, and less informative about the vocal tract. Table 1 below shows how the
data was divided into different recording periods, covering both birth and
discharge times. The train, dev, and test splits contain 348, 40, and 160 cry
sound recordings in the "Both birth and discharge" category, respectively.
There are 183 cry sound recordings in the "Only birth" category and 55 in the
"Only discharge" category, neither of which contributes any recordings to the
dev or test sets.
Table 1. Number of cry sound recordings per recording period and split.

                                  Split
Time(s) of Recording(s)    train   dev   test
Both birth and discharge     348    40    160
Only Birth                   183     0      0
Only Discharge                55     0      0
Total                        586    40    160
Table 2 provides information about the verification pairs in each split: the
dev set contains 40 positive pairs and 1,540 negative pairs, while the test set
contains 160 positive pairs and 25,440 negative pairs.
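In verification, each pair is scored by comparing the embeddings of its two
recordings; a common choice, and the one assumed in the minimal sketch below,
is cosine similarity combined with a decision threshold:

    import torch.nn.functional as F

    def score_pair(emb_a, emb_b):
        """Cosine similarity between two fixed-length cry embeddings."""
        return F.cosine_similarity(emb_a, emb_b, dim=-1).item()

    # A pair is accepted as the same baby when its score exceeds a chosen threshold:
    # same_baby = score_pair(emb_birth, emb_discharge) >= threshold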
We have cries from a total of 934 unique babies, comprising 26,093 utterances,
divided into 348 folders of both birth and discharge recordings, 183 folders of
only birth recordings, and 55 folders of only discharge recordings. Each folder
comprises, on average, 30 birth files and 10 discharge files, each containing
0-0.5 seconds of audio. To process the data, we concatenated the 0.5-second
audio segments from all the birth files into a single birth file and from all
the discharge files into a single discharge file.
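A minimal sketch of this concatenation step, assuming torchaudio and an
illustrative per-baby directory layout (the paths are hypothetical, not the
dataset's actual structure):

    import glob
    import torch
    import torchaudio

    def concatenate_segments(folder):
        """Join every short cry segment in a folder into one waveform."""
        waveforms = []
        sample_rate = None
        for path in sorted(glob.glob(f"{folder}/*.wav")):
            wav, sample_rate = torchaudio.load(path)
            waveforms.append(wav)
        return torch.cat(waveforms, dim=1), sample_rate

    # Hypothetical layout: one directory of birth segments per baby.
    # birth_wav, sr = concatenate_segments("cryceleb/audio/baby_0001/birth")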
As a result, we obtained a total of 934 files from the 934 unique babies, with
an average duration of 30 seconds per file. For training, we take a random 3-5
second segment from each file. We then divided the training set into 586 files,
consisting of 348 files from the "both birth and discharge" category (using
only the birth recordings), 183 files of only births, and 55 files of only
discharges, and the validation set into 348 files containing only the discharge
recordings from the "both birth and discharge" category. The resulting
train-validation split ratio was 62:38.
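The random 3-5 second cropping used for training can be sketched as follows;
the handling of recordings shorter than the crop is an assumption:

    import random
    import torch

    def random_crop(wav, sample_rate, min_s=3.0, max_s=5.0):
        """Take a random 3-5 second window from a concatenated cry waveform."""
        crop_len = int(random.uniform(min_s, max_s) * sample_rate)
        if wav.shape[1] <= crop_len:
            return wav                  # shorter recordings are used as-is
        start = random.randint(0, wav.shape[1] - crop_len)
        return wav[:, start:start + crop_len]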
The ECAPA-TDNN model in this study uses TDNN layers with channel dimensions of
[1024, 1024, 1024, 1024, 3072]. These layers are designed to capture intricate
patterns and characteristics in the speaker's voice. The embedding layer, with
a dimension of 192, then transforms the learned features into a compact and
distinctive representation tailored for speaker identification.
The layers in the ECAPA-TDNN model are constructed with carefully chosen
kernel sizes and dilations. The model employs kernel sizes of [5, 3, 3, 3, 1] and
dilations of [1, 2, 3, 4, 1]. These specific configurations allow the model to capture
temporal dependencies at different scales. The larger kernel size of 5 enables the
model to capture broader contextual information, while the smaller kernel sizes
of 3 and 1 focus on capturing more localized patterns.
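These settings correspond to the configuration exposed by SpeechBrain's
ECAPA-TDNN implementation, on which the CryCeleb baseline is built. The
following is a minimal sketch of how such a model could be instantiated; the
80-dimensional filterbank input is an assumption rather than a value stated
above:

    from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

    # Channels, kernel sizes and dilations follow the configuration described
    # above, with a 192-dimensional embedding; input_size=80 assumes 80-dim
    # log-mel filterbank features.
    model = ECAPA_TDNN(
        input_size=80,
        channels=[1024, 1024, 1024, 1024, 3072],
        kernel_sizes=[5, 3, 3, 3, 1],
        dilations=[1, 2, 3, 4, 1],
        lin_neurons=192,
    )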
The Adam optimizer is chosen for this purpose, as it offers adaptive learning
rate capabilities and incorporates momentum to accelerate convergence. By min-
imizing the training loss through these optimization techniques, the model can
effectively learn discriminative speaker representations and achieve high perfor-
mance in speaker verification. The scale and margin hyperparameters of the
LogSoftmaxWrapper loss are set to 30 and 0.2, respectively.
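A sketch of the corresponding loss and optimizer setup, assuming the SpeechBrain
building blocks that expose these scale and margin parameters and reusing the
model object from the previous sketch; the classifier size and learning rate are
assumptions, not values reported here:

    import torch
    from speechbrain.nnet.losses import LogSoftmaxWrapper, AdditiveAngularMargin
    from speechbrain.lobes.models.ECAPA_TDNN import Classifier

    # Additive angular margin softmax with scale=30 and margin=0.2, wrapped so
    # it is applied to the classifier's outputs during training.
    loss_fn = LogSoftmaxWrapper(AdditiveAngularMargin(margin=0.2, scale=30))

    # Per-baby classification head on top of the 192-dim embeddings; the number
    # of output classes (586 training babies) and the learning rate are assumptions.
    classifier = Classifier(input_size=192, out_neurons=586)
    optimizer = torch.optim.Adam(
        list(model.parameters()) + list(classifier.parameters()), lr=1e-4
    )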
5 Results
The fine-tuned model achieves an EER of 20.45% on the dev set and 27.63% on the
test set. The model is open-sourced along with code that can be used to
reproduce these results. The figure presents histograms of scores for positive
pairs (orange) and negative pairs (blue), with the y-axis normalized separately
for each color; the red vertical line indicates the threshold at which the EER
is achieved.
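For reference, the EER and the corresponding threshold can be derived from the
positive- and negative-pair scores, for example with scikit-learn; this is an
illustration rather than the evaluation script used for the reported numbers:

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(positive_scores, negative_scores):
        """Equal error rate: the operating point where false accepts equal false rejects."""
        labels = np.concatenate([np.ones(len(positive_scores)),
                                 np.zeros(len(negative_scores))])
        scores = np.concatenate([positive_scores, negative_scores])
        fpr, tpr, thresholds = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2, thresholds[idx]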
6 Discussion
7 Limitations
The proposed ECAPA-TDNN model for speaker verification using infant cries shows
promising results, but it has certain limitations. One limitation is the use of
a standard softmax loss function, which may not fully exploit the model's
potential; exploring alternative loss functions such as triplet loss could
enhance discriminative ability, as sketched below. Further fine-tuning by
optimizing hyperparameters, increasing the dataset size, and incorporating
additional augmentation techniques could improve performance and robustness in
real-world scenarios.
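As an example of such an alternative, a triplet objective on cry embeddings
could be set up with PyTorch's built-in loss; this is a hypothetical direction,
not something evaluated in this work:

    import torch.nn as nn

    # Anchor and positive embeddings come from the same baby, the negative from
    # a different baby; the margin value is an assumption.
    triplet_loss = nn.TripletMarginLoss(margin=0.3)

    # loss = triplet_loss(anchor_emb, positive_emb, negative_emb)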
8 Conclusion
This paper described work in infant cry analysis and verification that can help
both researchers and medical professionals. In this work, we utilized a
pre-trained ECAPA-TDNN model and fine-tuned it to the specific requirements of
the CryCeleb dataset. The addition of data augmentation techniques and slight
parameter modifications helped achieve significant improvements in Equal Error
Rate (EER) on both the development and test sets. Our approach surpassed the
performance of the baseline model by 2%. These results indicate the
effectiveness of our proposed modifications and highlight the potential of
leveraging pre-trained models with dataset-specific fine-tuning and
augmentation techniques for improved CryCeleb recognition.
Future work on infant speaker verification aims to determine and predict the
gender of a baby based on audio recordings of their speech or vocalizations.
The ultimate goal is to develop models that can accurately classify the gender
of infants based on their vocal cues, enabling applications such as automated
gender recognition systems or assisting in medical research.
References
1. Juang, B. H., Sondhi, M. M., Rabiner, L. R.: Speaker Verification: Digital
Speech Processing. In: Encyclopedia of Physical Science and Technology (Third
Edition), 2003.