
Enhancing Infant Cry Verification Using ECAPA-TDNN Architecture

First Author1[0000-1111-2222-3333], Second Author2,3[1111-2222-3333-4444], and Third Author3[2222-3333-4444-5555]

1 Princeton University, Princeton NJ 08544, USA
2 Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
[email protected]
http://www.springer.com/gp/computer-science/lncs
3 ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany
{abc,lncs}@uni-heidelberg.de

Abstract. This paper addresses the task of speaker verification on infant cries. Most prior analysis of infant cries targets classification tasks such as pain, hunger, tiredness, discomfort, and eructation, along with diagnostics and audio pattern recognition of baby crying sound events. In real-world scenarios, cry analysis should be able to differentiate between infants accurately. This paper develops a deep learning framework with speaker embeddings for speaker verification applied to infant cries. However, applying speaker verification to infant cries poses challenges such as limited speech samples, developmental variability, and ethical considerations. This work uses the CryCeleb2023 dataset available on Hugging Face. It employs a pre-trained ECAPA-TDNN model and fine-tunes it to the specific requirements of the CryCeleb dataset. Additionally, data augmentation is applied to further enhance performance. As a result, the proposed model achieves significant improvements in equal error rate (EER) on both the development and test sets.

1 Introduction

Speaker verification is the process of confirming an individual's identity based on their voice. The technology finds applications in fields such as voice banking, biometrics, credit card authorization, and fund transfer [1]. Speaker verification can also be applied to infant cry analysis. The goal of this study is to analyze and verify the cry sounds of babies using a speaker verification system. Such a system enables biometric verification from infant cries and can analyze and verify babies effectively in real-world scenarios. This work is particularly relevant in monitoring and healthcare environments, where accurate handling of background noise is crucial for verifying an infant's cry. The proposed system can be used in two specific scenarios. The first aims to prevent the accidental exchange of infants in hospitals and healthcare centers by verifying the identity of infants based on their cries. The second involves neonatal care, where the system identifies a baby's crying and enables appropriate actions by caregivers, nurses, or mothers.

In infant cry research, various machine learning and deep learning methods have been employed. Among machine learning methods, researchers have mainly applied support vector machines (SVM) [2], K-nearest neighbors (KNN) [3], Gaussian mixture models (GMM) [4], logistic regression [5], K-means clustering [6], and random forests [7] to pathological cry classification, cry reason classification, and cry sound detection. Recently, researchers have turned to neural methods such as the multi-layer perceptron [8], the general regression neural network [9], and the time delay neural network [10]. Architectures like the convolutional neural network (CNN) [11], the recurrent neural network (RNN) [12], and hybrid CNN-RNN networks have opened up new possibilities in infant cry research.

The main challenge in infant cry analysis is collecting gold-standard datasets. In the literature, researchers have used both real and synthetic datasets. Ferretti et al. collected real recordings from the Neonatal Intensive Care Unit (NICU) of a hospital, together with synthetic databases including crying mixed with speech, "beep" sounds, and more [13]. Feier et al. used the TUT Rare Sound Events 2017 dataset, which contains crying alongside sounds like "glass breaking" or "gunshot", as well as self-recorded databases of cries and non-cries [14]. The proposed work focuses on verifying infants from cry samples, similar to how speaker verification works with adult speech. However, there is currently no specific database for infant cry speaker verification. The UBENWA CryCeleb dataset, provided by Ubenwa Health, offers a labeled collection of infant cries for research purposes. This dataset was released as part of the CryCeleb 2023 task hosted on Hugging Face for speaker verification. Ubenwa Health has released over 6 hours of manually segmented cry sounds from 786 newborns to encourage further research in infant cry analysis [15].

The current study performs infant cry verification across different recordings rather than relying on cries from the same recording. This approach is more representative of real-world scenarios, where the system needs to verify an infant's identity over multiple days. It has also been observed that verifying separate parts of a cry from the same recording is easier, because infants tend to exhibit consistent characteristics within a single crying session but not necessarily across sessions. This suggests that factors other than the infant's identity can influence the characteristics of a cry.

To achieve optimal results in infant cry speaker verification, a pre-trained ECAPA-TDNN model [16] has been fine-tuned on the CryCeleb dataset. Furthermore, data augmentation techniques have been applied to enlarge the dataset and improve performance [17].

This paper is organized as follows. Section 2 presents related work. Section 3 describes the model. Section 4 details the implementation. Sections 5, 6, and 7 present the results, discussion, and limitations, and finally Section 8 concludes the paper by summarizing its main contributions and describing future work.

2 Related work

Infant cry analysis is an important area of research, as it enables the development of systems for authenticating and identifying crying infants. However, due to the limited availability of labeled infant cry data, there is a lack of research in this specific field. To address this, methods previously used in speaker verification systems can be referenced and adapted for analyzing infant cries.

The literature on speaker verification provides insights into popular automatic speaker recognition techniques and models. These systems have used various speech feature extraction methods, such as spectrograms, linear predictive coding (LPC), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP), as well as feature transformation techniques like linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT), identity vectors (i-vectors), and feature-space maximum likelihood linear regression (fMLLR). Statistical models such as the hidden Markov model combined with Gaussian mixture models (GMM-HMM), along with techniques such as maximum likelihood linear regression (MLLR) and the maximum a posteriori criterion, have been effective in speech recognition tasks.

Deep learning approaches, particularly DNNs, have outperformed traditional


GMM models and gained prominence in HMM state modeling. The integration of
deep learning with traditional methods, such as DNN-HMM acoustic models, has
significantly advanced ASR. CNNs excel in extracting complex speech features,
while RNNs, especially LSTM, effectively model sequential dependencies. End-
to-end speech recognition systems eliminate the need for forced alignment and
optimize sentence sequences holistically.

Deep learning models with multi-layer nonlinear structures have demonstrated superior performance compared to shallow models. RNNs suffer from unstable training, gradient problems, and difficulty modeling long-term dependencies, issues that the commonly used LSTM variant addresses. Bidirectional LSTM (BLSTM) models leverage bidirectional memory to capture information over extended time periods. CNNs, inspired by image processing, have proven valuable in speech signal analysis.

The time delay neural network (TDNN), a one-dimensional CNN, efficiently captures long-term temporal dependencies and learns temporal dynamics from short-term feature representations. Furthermore, data augmentation techniques, such as semi-supervised training, multilingual processing, acoustic data perturbation, and speech synthesis, have been shown to enhance performance in low-resource scenarios. For infant cry analysis, the ECAPA-TDNN architecture, combined with appropriate data augmentation, can be used: it integrates temporal context and parallel processing to capture long-term dependencies in cry signals, and augmentation further improves the accuracy and robustness of the analysis.

In conclusion, by drawing from literature on speaker verification systems and


adapting relevant techniques, it is possible to develop effective methods for an-
alyzing infant cries. The ECAPA-TDNN architecture, combined with data aug-
mentation techniques, holds promise for improving the accuracy and reliability
of infant cry analysis systems.

3 Model Description

Current speaker verification techniques, such as x-vectors extracted by neural networks for speaker representation, have consistently achieved state-of-the-art results in the field. Ongoing research focuses on enhancing the original time delay neural network (TDNN) architecture to further improve performance in speaker verification tasks.

The ECAPA-TDNN architecture extends the original x-vector system with several enhancements. It introduces 1D dilated convolutions and dense layers in the initial frame layers to capture temporal context effectively. Residual connections are added to improve information flow. The architecture includes an attentive statistics pooling layer that calculates the mean and standard deviation of frame-level features, incorporating an attention mechanism that selects frames based on relevance.

In the ECAPA-TDNN architecture, the channel- and context-dependent statistics pooling layer enhances the attentive statistics pooling used in x-vector systems. It introduces channel-dependent self-attention scores, allowing the model to focus on different frames within each channel. This is achieved through a shared projection of the frame features followed by channel-specific linear layers, with softmax normalization applied channel-wise across time. To incorporate global context, the non-weighted mean and standard deviation of the frames are concatenated with the frame activations as input to the attention layer. This adaptation based on global properties of the utterance enables the model to consider speaker characteristics that may vary across channels and time.
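As an illustration of this pooling mechanism, the following PyTorch sketch implements a simplified channel- and context-dependent attentive statistics pooling layer; the tensor shapes and attention hidden size are illustrative assumptions rather than the exact configuration used in this work.

import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Simplified channel- and context-dependent attentive statistics pooling."""

    def __init__(self, channels: int, attn_hidden: int = 128):
        super().__init__()
        # The attention input is the frame features concatenated with the
        # non-weighted global mean and standard deviation (3 * channels).
        self.attention = nn.Sequential(
            nn.Conv1d(3 * channels, attn_hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(attn_hidden, channels, kernel_size=1),  # channel-specific scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        t = x.size(-1)
        # Non-weighted global statistics, repeated over time and concatenated.
        mean = x.mean(dim=-1, keepdim=True).expand(-1, -1, t)
        std = x.std(dim=-1, keepdim=True).expand(-1, -1, t)
        attn_in = torch.cat([x, mean, std], dim=1)
        # Softmax over time, computed separately for each channel.
        alpha = torch.softmax(self.attention(attn_in), dim=-1)
        # Weighted mean and standard deviation, concatenated into one vector.
        mu = (alpha * x).sum(dim=-1)
        sigma = torch.sqrt(((alpha * x * x).sum(dim=-1) - mu ** 2).clamp(min=1e-9))
        return torch.cat([mu, sigma], dim=1)  # (batch, 2 * channels)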

To integrate channel-wise attention, 1D Squeeze-Excitation (SE) Res2Blocks


are employed. Each SE block calculates a channel descriptor through mean pool-
ing and generates channel weights using a sigmoid function. These weights are
used to rescale block activations, allowing the model to re-weight channels based
on global properties.
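A minimal 1D Squeeze-Excitation block corresponding to this description might look as follows; the bottleneck size is an assumed value, not necessarily the one used in the paper's configuration.

import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """1D Squeeze-Excitation: rescale channels using global (per-utterance) context."""

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)
        self.fc2 = nn.Linear(bottleneck, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        s = x.mean(dim=-1)                                     # squeeze: channel descriptor via mean pooling
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))   # excitation: channel weights in (0, 1)
        return x * w.unsqueeze(-1)                             # rescale the block activations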

The ECAPA-TDNN architecture also utilizes multi-scale Res2Net convolutions


in the Res2Blocks, enhancing the model’s ability to capture long-range depen-
dencies and adapt activations based on global context. Feature aggregation and
summation are employed to leverage complementary information from different
hierarchical levels of the network, with Multi-Layer Feature Aggregation (MFA)
concatenating output feature maps of all SE-Res2Blocks and applying statistics
pooling to aggregated multi-scale features. Feature Propagation, where the sum
of outputs from previous blocks is used as input to each SE-Res2Block, enables
the propagation and reuse of features from preceding layers.
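The multi-scale Res2Net convolution splits the channel dimension into groups and processes them hierarchically. The sketch below, with an assumed scale of 8, illustrates the idea rather than reproducing the exact SE-Res2Block used in ECAPA-TDNN.

import torch
import torch.nn as nn

class Res2Conv1d(nn.Module):
    """Hierarchical multi-scale 1D convolution in the style of Res2Net."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1, scale: int = 8):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        pad = dilation * (kernel_size - 1) // 2
        # One convolution per group except the first, which is passed through unchanged.
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, dilation=dilation, padding=pad)
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); split channels into `scale` groups
        chunks = torch.chunk(x, self.scale, dim=1)
        out = [chunks[0]]
        y = None
        for i, conv in enumerate(self.convs):
            # Each group receives the previous group's output, growing the receptive field.
            y = chunks[i + 1] if y is None else chunks[i + 1] + y
            y = conv(y)
            out.append(y)
        return torch.cat(out, dim=1)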

By aggregating features from different levels of semantic complexity, enhancing information flow, and capturing long-range dependencies, the ECAPA-TDNN architecture improves the performance of speaker verification tasks.

Hence, the ECAPA-TDNN architecture integrates various components, includ-


ing attentive statistics pooling, channel-dependent self-attention, 1D Squeeze-
Excitation Res2Blocks, multi-layer feature aggregation and feature propagation,
to achieve enhanced representation learning and performance in speaker verifi-
cation tasks.
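In practice, a VoxCeleb-pretrained ECAPA-TDNN embedding extractor can be obtained directly from SpeechBrain, for example as sketched below; the verification step with cosine similarity mirrors the standard SpeechBrain recipe, and the file names are placeholders.

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the VoxCeleb-pretrained ECAPA-TDNN speaker embedding model.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

def embed(path: str) -> torch.Tensor:
    """Return a speaker (or infant) embedding for one audio file."""
    signal, _sr = torchaudio.load(path)          # assumed 16 kHz mono input
    return encoder.encode_batch(signal).squeeze()

# Verification score between two cry recordings (placeholder file names).
emb_a, emb_b = embed("cry_enroll.wav"), embed("cry_test.wav")
score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {score.item():.3f}")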

Fig. 1. The structure of ECAPA-TDNN



4 Implementation

4.1 Dataset Details

The cry sound recordings collected for analysis are predominantly short, lasting between 0.5 and 1.0 seconds, with less than 0.3% of cry sounds exceeding 4 seconds in duration. These cry sounds are specifically expirations, excluding inspirations, as inspirations are typically brief, difficult to detect, and less informative about the vocal tract. Table 1 below shows how the data is divided by recording period, covering both birth and discharge times. The train, dev, and test splits contain 348, 40, and 160 infants in the "Both birth and discharge" category, respectively. There are 183 infants in the "Only birth" category and 55 in the "Only discharge" category, neither of which contributes any recordings to the dev or test sets.

Table 1. Number of infants by split and recording period.

Time(s) of recording(s)      train   dev   test
Both birth and discharge      348     40    160
Only birth                    183      0      0
Only discharge                 55      0      0
Total                         586     40    160

Table 2 provides information about the evaluation pairs: the dev set contains 40 positive pairs and 1,540 negative pairs, while the test set contains 160 positive pairs and 25,440 negative pairs.

Table 2. Number of pairs in dev and test.

Split   Positive pairs   Negative pairs   Total pairs
Dev            40              1540            1580
Test          160             25440           25600

We have a total of 934 unique baby cries comprising 26,093 utterances, divided into 348 folders with both birth and discharge recordings, 183 folders with only birth recordings, and 55 folders with only discharge recordings. Each folder comprises on average 30 birth files and 10 discharge files, each containing 0-0.5 seconds of audio. To process the data, we concatenated the 0.5-second audio files from all birth files into a single birth file and all discharge files into a single discharge file.

As a result, we obtained a total of 934 files from the 934 unique babies, with an average duration of 30 seconds per file. We take 3-5 seconds of random audio from each file for training. We then divided the training data into 586 files (the 348 "both birth and discharge" infants using only their birth recordings, the 183 "only birth" files, and the 55 "only discharge" files) and the validation data into 348 files consisting of the discharge recordings of the "both birth and discharge" infants. The resulting train/validation split ratio was 62:38.
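The segment concatenation described above could be done, for instance, with torchaudio as in the sketch below; the directory layout and file naming are hypothetical and only illustrate the preprocessing step.

import glob
import torch
import torchaudio

def concatenate_segments(pattern: str, out_path: str, sample_rate: int = 16000) -> None:
    """Concatenate all short cry segments matching `pattern` into one file."""
    pieces = []
    for path in sorted(glob.glob(pattern)):
        signal, sr = torchaudio.load(path)
        if sr != sample_rate:  # resample if a segment deviates from the target rate
            signal = torchaudio.functional.resample(signal, sr, sample_rate)
        pieces.append(signal)
    if pieces:
        torchaudio.save(out_path, torch.cat(pieces, dim=1), sample_rate)

# Hypothetical layout: one folder per infant with birth_*.wav and discharge_*.wav segments.
concatenate_segments("data/infant_0001/birth_*.wav", "data/infant_0001_birth.wav")
concatenate_segments("data/infant_0001/discharge_*.wav", "data/infant_0001_discharge.wav")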

To facilitate the training and validation processes, we used the SpeechBrain library to create dynamic datasets. This involved filtering and sorting the data according to the previously defined splits. Additionally, we added dedicated fields within the datasets to load and process the audio signals. These additions were crucial for maintaining the integrity and coherence of the cry sequences throughout the training and evaluation stages. Leveraging SpeechBrain in this way streamlined data handling and improved the overall efficiency of our cry pattern analysis. A sketch of this dataset setup is shown below.
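The following sketch shows one way to build such a dynamic dataset with SpeechBrain; the CSV path, column names, and random-crop length are assumptions for illustration.

import random
import speechbrain as sb
from speechbrain.dataio.dataset import DynamicItemDataset

# Assumed CSV with columns: ID, wav (path to concatenated recording), baby_id.
train_data = DynamicItemDataset.from_csv(csv_path="manifests/train.csv")

@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav):
    """Load the waveform and take a random 3-5 second crop (16 kHz assumed)."""
    sig = sb.dataio.dataio.read_audio(wav)
    crop = random.randint(3 * 16000, 5 * 16000)
    if sig.shape[0] > crop:
        start = random.randint(0, sig.shape[0] - crop)
        sig = sig[start:start + crop]
    return sig

train_data.add_dynamic_item(audio_pipeline)
train_data.set_output_keys(["id", "sig", "baby_id"])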

4.2 Experimental Setup

The ECAPA-TDNN model in this study uses five TDNN layers: four with 1024 channels and one with 3072 channels. These layers are designed to capture intricate patterns and characteristics in the speaker's voice. The embedding layer, with a dimension of 192, then transforms these learned features into a concise and distinctive representation tailored for speaker identification.

The layers in the ECAPA-TDNN model are constructed with carefully chosen
kernel sizes and dilations. The model employs kernel sizes of [5, 3, 3, 3, 1] and
dilations of [1, 2, 3, 4, 1]. These specific configurations allow the model to capture
temporal dependencies at different scales. The larger kernel size of 5 enables the
model to capture broader contextual information, while the smaller kernel sizes
of 3 and 1 focus on capturing more localized patterns.

The dilations further enhance the model’s ability to aggregate information


across different time steps, allowing it to capture both short-term and long-term
dependencies. By incorporating these kernel sizes and dilations, the ECAPA-
TDNN model can effectively extract high-level representations that encode im-
portant speaker characteristics, contributing to its success in accurate speaker
verification tasks.
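Assuming the SpeechBrain implementation is used, the configuration described above would correspond to something like the following; this is a sketch of the hyperparameters named in the text, not the paper's exact training script.

import torch
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

# Channels, kernel sizes, and dilations as described in the text;
# input_size matches the 80-dimensional Mel filterbank features.
model = ECAPA_TDNN(
    input_size=80,
    channels=[1024, 1024, 1024, 1024, 3072],
    kernel_sizes=[5, 3, 3, 3, 1],
    dilations=[1, 2, 3, 4, 1],
    lin_neurons=192,            # dimensionality of the speaker embedding
)

# Dummy forward pass: (batch, time, features) -> (batch, 1, 192) embedding.
features = torch.randn(4, 300, 80)
embeddings = model(features)
print(embeddings.shape)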

During model training, eighty-dimensional Mel filterbank features are employed. These features capture the acoustic information relevant to speaker verification. To optimize the model effectively, the learning rate is dynamically adjusted within the range of 1e-8 to 1e-3. This adaptive learning rate helps the model converge to a good solution while avoiding pitfalls such as overshooting or getting stuck in local minima.

The Adam optimizer is chosen for this purpose, as it offers adaptive learning rates and incorporates momentum to accelerate convergence. By minimizing the training loss through these optimization techniques, the model can effectively learn discriminative speaker representations and achieve high performance in speaker verification. The scale and margin hyperparameters of the log-softmax wrapper loss are set to 30 and 0.2, respectively.
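A minimal sketch of this training configuration, assuming SpeechBrain's feature extraction and loss modules and a standard PyTorch optimizer, is shown below; the number of classes and the cyclical learning-rate schedule are illustrative assumptions.

import torch
from speechbrain.lobes.features import Fbank
from speechbrain.nnet.losses import LogSoftmaxWrapper, AdditiveAngularMargin
from speechbrain.lobes.models.ECAPA_TDNN import Classifier

# 80-dimensional Mel filterbank features.
compute_features = Fbank(n_mels=80)

# Classification head over the training identities (586 assumed from the train split).
classifier = Classifier(input_size=192, out_neurons=586)

# Log-softmax wrapper around an additive angular margin loss, scale=30, margin=0.2.
loss_fn = LogSoftmaxWrapper(AdditiveAngularMargin(margin=0.2, scale=30))

# Adam optimizer; the learning rate is swept between 1e-8 and 1e-3 during training.
params = list(classifier.parameters())  # plus the ECAPA-TDNN parameters in practice
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-8, max_lr=1e-3, cycle_momentum=False
)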

5 Results
The fine-tuned model achieves an EER of 20.45% on the dev set and 27.63% on the test set. It is open-sourced along with code that can be used to reproduce these results. Figures 2 and 3 present histograms of scores for positive pairs (orange) and negative pairs (blue), with the y-axis normalized separately for each color. The red vertical line indicates the threshold at which the EER is achieved.
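The EER can be computed from the verification scores of the positive and negative pairs, for example as in the following sketch based on the ROC curve; the scores here are placeholders.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(positive_scores, negative_scores):
    """Return the EER and the score threshold at which it occurs."""
    scores = np.concatenate([positive_scores, negative_scores])
    labels = np.concatenate([np.ones_like(positive_scores), np.zeros_like(negative_scores)])
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where miss rate ~= false alarm rate
    eer = (fnr[idx] + fpr[idx]) / 2
    return eer, thresholds[idx]

# Placeholder cosine scores for same-infant and different-infant pairs.
pos = np.array([0.71, 0.65, 0.80, 0.40])
neg = np.array([0.30, 0.55, 0.20, 0.45, 0.10])
eer, thr = equal_error_rate(pos, neg)
print(f"EER = {eer:.2%} at threshold {thr:.2f}")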

Fig. 2. Histogram of scores for the development set.

Fig. 3. Histogram of scores for the test set.

6 Discussion

In a previous study, researchers evaluated two ECAPA-TDNN baseline models on the development and test sets. The first model, the "naive" baseline, was trained only on pre-existing VoxCeleb data; however, its verification accuracy was not sufficient. To improve performance, a second baseline fine-tuned the ECAPA-TDNN model on the CryCeleb dataset. This second model also had limitations: it required more data and a larger number of classes, which slowed down learning. Our third model fine-tunes ECAPA-TDNN on the CryCeleb dataset with data augmentation. Table 3 reports the EER of these models: the first two rows correspond to the baselines and the third row to our experimental model.

Table 3. EER of baseline models

ECAPA-TDNN baseline                              Dev       Test
VoxCeleb pre-trained                            37.92%    38.12%
Fine-tuned on CryCeleb                          22.50%    29.37%
Fine-tuned on CryCeleb with data augmentation   20.45%    27.63%
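The exact augmentation recipe is not detailed here; as one plausible sketch, waveform-level perturbations such as speed perturbation and additive noise could be applied as shown below (the speed factors and signal-to-noise range are assumptions, not the settings used in the experiments).

import torch
import torchaudio

def augment_cry(signal: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Apply simple waveform-level augmentation: speed perturbation + additive noise."""
    # Speed perturbation: resample to a randomly chosen rate, which stretches or
    # compresses the waveform in time (and shifts pitch) when played back as-is.
    factor = float(torch.empty(1).uniform_(0.95, 1.05))
    perturbed = torchaudio.functional.resample(
        signal, orig_freq=sample_rate, new_freq=int(sample_rate * factor)
    )
    # Additive Gaussian noise at a random signal-to-noise ratio between 10 and 20 dB.
    snr_db = float(torch.empty(1).uniform_(10.0, 20.0))
    signal_power = perturbed.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = torch.randn_like(perturbed) * noise_power.sqrt()
    return perturbed + noise

# Example: augment a 3-second random cry crop (placeholder tensor).
crop = torch.randn(1, 3 * 16000)
augmented = augment_cry(crop)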

7 Limitation
The proposed ECAPA-TDNN model for speaker verification using infant cries shows promising results, but it has certain limitations. One limitation is the use of a standard softmax-based loss function, which may not fully exploit the model's potential; exploring alternative loss functions such as triplet loss could enhance discriminative ability (a sketch is given below). Further fine-tuning by optimizing hyperparameters, increasing the dataset size, and incorporating additional augmentation techniques could improve performance and robustness in real-world scenarios.
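As a pointer for this direction, a triplet loss over cry embeddings could be set up roughly as follows; the margin and the placeholder embeddings are assumptions, and this is not part of the reported system.

import torch
import torch.nn as nn

# Triplet loss on L2-normalized embeddings: pull an anchor cry towards another cry
# from the same infant (positive) and away from a different infant (negative).
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

def normalized(x: torch.Tensor) -> torch.Tensor:
    return nn.functional.normalize(x, dim=-1)

# Placeholder 192-dimensional embeddings for a batch of 8 triplets.
anchor, positive, negative = (torch.randn(8, 192, requires_grad=True) for _ in range(3))
loss = triplet_loss(normalized(anchor), normalized(positive), normalized(negative))
loss.backward()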

8 Conclusion
This paper described work in infant cry analysis and verification that can help both researchers and medical professionals. We used a pre-trained ECAPA-TDNN model and fine-tuned it according to the specific requirements of the CryCeleb dataset. The addition of data augmentation techniques and slight parameter modifications helped achieve significant improvements in equal error rate (EER) on both the development and test sets. Our approach surpassed the performance of the baseline model by about 2% absolute EER. These results indicate the effectiveness of the proposed modifications and highlight the potential of pre-trained models with dataset-specific fine-tuning and augmentation for improved infant cry verification.

As future work, infant speaker verification could be extended to determine and predict the gender of a baby from audio recordings of their vocalizations. The ultimate goal is to develop models that can accurately classify the gender of infants based on their vocal cues, enabling applications such as automated gender recognition systems or assisting in medical research.

References
1. Juang, B.H., Sondhi, M.M., Rabiner, L.R.: Digital speech processing. In: Encyclopedia of Physical Science and Technology, 3rd edn. (2003)
2. Cristianini, N., Ricci, E.: Support vector machines (1992; Boser, Guyon, Vapnik). In: Encyclopedia of Algorithms, pp. 928-932
3. Kumar, S.S., Syama, R.: Speaker identification using k-nearest neighbors (k-NN) classifier employing MFCC and formants as features. International Journal of Advanced Scientific Technologies, Engineering and Management Sciences 3(1) (2017)
4. Wildermoth, B.R., Paliwal, K.K.: GMM based speaker recognition on readily available databases. School of Microelectronic Engineering, Griffith University, Brisbane, Australia
5. Hautamäki, V., Lee, K.A., Kinnunen, T., Ma, B., Li, H.: Regularized logistic regression fusion for speaker verification
6. Jadhav, A.N., Dharwadkar, N.V.: A speaker recognition system using Gaussian mixture model, EM algorithm and k-means clustering. I.J. Modern Education and Computer Science 11, 19-28 (2018). https://doi.org/10.5815/ijmecs.2018.11.03
7. Breiman, L.: Random forests. Machine Learning 45, 5-32 (2001). https://doi.org/10.1023/A:1010933404324
8. Murtagh, F.: Multilayer perceptrons for classification and regression. Neurocomputing 2(5-6), 183-197 (1991)
9. Specht, D.F.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568-578 (1991). https://doi.org/10.1109/72.97934
10. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proc. Interspeech (2015)
11. Indolia, S., Goswami, A.K., Mishra, S.P., Asopa, P.: Conceptual understanding of convolutional neural network - a deep learning approach. Procedia Computer Science 132, 679-688 (2018)
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Tech. rep. ICS 8504, Institute for Cognitive Science, University of California, San Diego (1985)
13. Severini, M., Ferretti, D., Principi, E., Squartini, S.: Automatic detection of cry sounds in neonatal intensive care units by using deep learning and acoustic scene simulation. IEEE Access 7, 51982-51993 (2019). https://doi.org/10.1109/ACCESS.2019.2911427
14. Feier, F., Enatescu, I., Ilie, C., Silea, I.: Newborns' cry analysis classification using signal processing and data mining. In: 2014 International Conference on Optimization of Electrical and Electronic Equipment (OPTIM 2014), pp. 880-885 (2014). https://doi.org/10.1109/OPTIM.2014.6850990
15. Budaghyan, D., Gorin, A., Subakan, C., Onu, C.C.: CryCeleb: a speaker verification dataset based on infant cry sounds. arXiv:2305.00969 (2023)
16. Desplanques, B., Thienpondt, J., Demuynck, K.: ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In: Interspeech 2020. https://doi.org/10.21437/Interspeech.2020-2650
17. Rebai, I., BenAyed, Y., Mahdi, W., Lorré, J.P.: Improving speech recognition using data augmentation and acoustic model fusion. Procedia Computer Science 112 (2017)
