Oral Qualification Examination_Kun_Zhou

© Copyright National University of Singapore. All Rights Reserved.
Emotional Voice Conversion with
Non-parallel data
PhD candidate: Kun Zhou
Supervisor: Prof. Li Haizhou
Dept. of Electrical and Computer Engineering, National University of Singapore

Content
• Introduction
• Related Work
• My PhD Research
• Conclusion

Introduction
• Emotional voice conversion (EVC)
o Convert the emotional state from one to another (e.g. from happy to sad)
o Preserve linguistic content, speaker identity …
Figure 1: At run-time, an EVC framework converts the emotional state from one to another.
• Emotion in Speech
o Speech conveys information through:
- Linguistic aspect (what we speak)
- Para-linguistic aspect (how we speak): e.g. emotional state

• Applications
- Conversational Agents
- Social Robots

• Current Challenges in Emotional Voice Conversion:
o Non-parallel training;
o Limited data training;
o Emotional prosody modelling;
o Lack of controllability;
o Lack of generalizability;

Publications During PhD
• Accepted
[1] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice
conversion with non-parallel training data”, Speaker Odyssey 2020;
[2] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting anyone’s emotion: towards
speaker-independent emotional voice conversion”, Interspeech 2020;
[3] Kun Zhou, Berrak Sisman, Haizhou Li, “VAW-GAN for disentanglement and recomposition of emotional
elements in speech”, IEEE Spoken Language Technology Workshop (SLT) 2021;
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice
conversion with a new emotional speech dataset”, IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) 2021;
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited data emotional voice conversion leveraging
text-to-speech: two-stage sequence-to-sequence training”, Interspeech 2021;
• Submitted
[1] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Emotional voice conversion: theory, databases and
ESD”, submitted to Speech Communication;

Related Work
• Training Stage
“Source Analysis → Mapping → Target Analysis”
[1] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,”
IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

Related Work
According to the training data, EVC can be divided into two types:
- EVC with parallel data:
Source and target speech share the same linguistic content;
Expensive and difficult to collect!
- EVC with non-parallel data (our focus!):
Linguistic content is different;
Easy for real-life applications!

Related Work
• Conversion Stage
“Analysis → Mapping → Synthesis”
Example: Waveform generation with different vocoders
(audio samples: Reference audio, Griffin-Lim, WaveRNN, Parallel WaveGAN)
o We care about emotional expression;
o Emotional expression → Feature Mapping (Our focus!);
o Speech quality → Waveform Generation (Not our focus!);
o To get a better speech quality → Train a better vocoder

Related Work
• Speaker voice conversion vs. Emotional voice conversion
o Speaker voice conversion:
- convert speaker identity;
- mainly focus on spectrum conversion;
- consider prosody as speaker-independent;
o Emotional voice conversion:
- convert emotional style;
- focus on both spectrum and prosody conversion;
- prosody (intonation, speech rate, energy, …) plays an
important role!

My PhD Research
• CycleGAN-based EVC;
• Speaker-independent EVC;
• EVC for seen and unseen emotions;
• Limited data EVC;
We always focus on non-parallel & limited data solutions!
Research themes: improve generalizability, prosody modelling, duration modelling

CycleGAN-based EVC [2]
• Motivation
o F0 is an essential part of the intonation;
- Supra-segmental and hierarchical nature;
- Linear transformation is insufficient; F0 is difficult to model!
• Contribution
o A parallel-data-free emotional voice conversion framework;
o Convert spectral and prosodic features with CycleGAN;
o Modelling F0 over multiple time scales with continuous wavelet
transform (CWT);
o Investigate different training strategies: joint vs. separate training;
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel
training data”, Speaker Odyssey 2020
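
For context, the linear transformation that the motivation calls insufficient can be sketched as follows. The slides do not say which linear transform is meant; this sketch assumes the log Gaussian normalized transformation often used as an F0 baseline in voice conversion, and the function name and statistics are illustrative only.

```python
import math

def lgnt_convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Linear transform in the log-F0 domain (assumed baseline).

    mu_* and sigma_* are the mean and standard deviation of log-F0 for the
    source and target emotion. Unvoiced frames (F0 == 0) are passed through.
    """
    converted = []
    for f0 in f0_src:
        if f0 <= 0.0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
            continue
        z = (math.log(f0) - mu_src) / sigma_src       # normalise with source stats
        converted.append(math.exp(z * sigma_tgt + mu_tgt))  # rescale with target stats
    return converted
```

Because every frame is shifted and scaled by the same two numbers, the shape of the contour is unchanged, which is why such a transform cannot capture the supra-segmental, hierarchical nature of F0.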

CycleGAN-based EVC [2]
- Lower scales capture the short-term variations, such as syllables and
phonemes;
- Higher scales capture the long-term variations, such as phrases and
utterances;
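
A minimal sketch of such a multi-scale decomposition is given below. The Mexican hat mother wavelet, the ten scales and the one-octave spacing are assumptions borrowed from common CWT-based prosody modelling and are not stated on the slides; the input is assumed to be a continuous (interpolated), normalised log-F0 contour.

```python
import math

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt_f0(log_f0, num_scales=10, base_scale=1.0):
    """Decompose a log-F0 contour into num_scales components.

    Scales are one octave apart (assumed): base_scale * 2 ** (i + 1).
    Row 0 holds the finest scale, the last row the coarsest.
    """
    n = len(log_f0)
    components = []
    for i in range(num_scales):
        scale = base_scale * 2 ** (i + 1)
        row = []
        for b in range(n):  # position of the wavelet along the contour
            acc = 0.0
            for x in range(n):
                acc += log_f0[x] * mexican_hat((x - b) / scale)
            row.append(acc / math.sqrt(scale))
        components.append(row)
    return components
```

Each row can then be treated as a separate prosodic feature stream, so a conversion model sees short-term and long-term F0 variation separately instead of one entangled contour.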

CycleGAN-based EVC [2]
o Training Stage:
- Spectral CycleGAN: learn the feature mapping of spectral features;
- Prosody CycleGAN: learn the feature mapping of CWT-based F0 features;
o Conversion Stage:
Spectral & Prosody CycleGAN convert the input features from source to target
emotion type;
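
The slides do not list the training losses. As a hedged illustration of why CycleGAN needs no parallel data, the sketch below assumes the standard cycle-consistency term; the two generators are stand-in callables, not the networks used in [2].

```python
def l1(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x_src, y_tgt, g_src2tgt, g_tgt2src):
    """Standard CycleGAN cycle loss (assumed formulation).

    x_src and y_tgt are unpaired feature vectors from the source and target
    emotion. Each is mapped to the other domain and back, and the round trip
    is compared with the original.
    """
    forward = l1(g_tgt2src(g_src2tgt(x_src)), x_src)   # source -> target -> source
    backward = l1(g_src2tgt(g_tgt2src(y_tgt)), y_tgt)  # target -> source -> target
    return forward + backward
```

Since each sample is only ever compared with its own reconstruction, `x_src` and `y_tgt` never need to share linguistic content. The same term would apply to both the spectral and the prosody CycleGAN.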

CycleGAN-based EVC [2]
- Experiments
o From 1st XAB preference test:
CWT analysis of F0 improves the emotion similarity to the target emotion type;
o From 2nd XAB preference test:
Separate training of spectral and prosodic features outperforms the joint training;
Speech samples: Source (Neutral), Converted Angry, Converted Sad, Converted Surprise

Speaker-independent EVC [3]
[3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional
Voice Conversion”, Interspeech 2020
• Motivation
o Emotional expression is believed to share some common cues across
individuals;
For example:
Happy tends to have a higher mean and std of F0 than Sad
o Previous EVC studies assume that emotion is speaker-dependent;
• Contribution
o Study emotion through speaker-independent perspective (The First Study!);
o Study prosody modelling with CWT and F0 conditioning for
emotion-independent encoder training;
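
The F0 cue in the example (Happy tends to have a higher mean and standard deviation of F0 than Sad) can be checked per utterance with a few lines. This is only a sketch: the contours below are made-up values for illustration, not measurements from any dataset, and unvoiced frames are assumed to be marked with F0 = 0.

```python
import statistics

def f0_stats(f0_contour):
    """Mean and standard deviation of F0 (Hz) over voiced frames only."""
    voiced = [f for f in f0_contour if f > 0.0]  # drop unvoiced frames
    return statistics.mean(voiced), statistics.pstdev(voiced)

# Hypothetical contours, for illustration only.
happy_f0 = [0.0, 220.0, 260.0, 300.0, 0.0]
sad_f0 = [0.0, 150.0, 160.0, 170.0]

happy_mean, happy_std = f0_stats(happy_f0)
sad_mean, sad_std = f0_stats(sad_f0)
```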

• Related Work: VAW-GAN [4]
- Conditional VAE + Discriminator
• Disentangle Emotional Elements from Speech:
- Spectral features (SP) carry speaker, phonetic and prosodic information;
- Provide emotion ID and F0 to decoder;
- The encoder learns to discard emotion-related information;
[4] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein
generative adversarial networks,” Interspeech, 2017.
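
The conditioning described above can be sketched as a simple concatenation. The emotion label set, the vector sizes and the helper names are assumptions for illustration; the real encoder and decoder are neural networks, which are omitted here.

```python
EMOTIONS = ["neutral", "angry", "happy", "sad", "surprise"]  # assumed label set

def one_hot(emotion):
    """One-hot emotion ID over the assumed label set."""
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]

def decoder_input(latent, emotion, f0):
    """Concatenate the latent code with the emotion ID and F0 for one frame."""
    return list(latent) + one_hot(emotion) + [f0]

# Stand-in for the encoder output of one neutral frame (hypothetical values).
z = [0.3, -1.2]

# Training: reconstruct the input with its own emotion ID and F0.
reconstruction_in = decoder_input(z, "neutral", 180.0)

# Conversion: keep the latent code, swap the emotion ID and use converted F0
# (the value 230.0 is a hypothetical placeholder).
conversion_in = decoder_input(z, "angry", 230.0)
```

Because the decoder always receives the emotion ID and F0 explicitly, the encoder gains nothing by storing them in the latent code, which is the intuition behind the emotion-independent representation.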

• Proposed Framework: VAW-GAN for Prosody
(Figure: training stage and conversion stage)
- Encoder learns emotion-independent representations from CWT-F0;
- Generator learns to reconstruct prosody features with one-hot emotion ID;
- Discriminator learns to judge whether the reconstructed features are real or not;

• Proposed Framework: VAW-GAN for Spectrum
(Figure: training stage and conversion stage)
- Encoder learns emotion-independent representations from spectrum;
- Generator learns to reconstruct spectral features with one-hot emotion ID
and F0;
- Discriminator learns to judge whether the reconstructed features are real or not;

Speaker-independent EVC [3]
We would like to show the effectiveness of:
- CWT analysis for prosody modelling;
- F0 conditioning for encoder training;
- Performance with seen and unseen speakers;
Experiments:
- XAB preference test for emotion similarity;
- XAB preference test for speaker similarity;
- Both validate the effectiveness of our proposed framework in both
speaker-dependent and speaker-independent settings!

EVC for seen and unseen emotions [4]
• Motivation
o Current EVC frameworks represent each emotion with a one-hot emotion
label:
- learn to remember a fixed set of emotions;
- insufficient to describe emotional styles
(emotional styles present subtle differences even within the same emotion
category)
• Contribution
o A one-to-many emotional style transfer framework;
o Use a pre-trained speech emotion recognition (SER) model to describe emotional styles;
o Non-parallel training;
o Seen and unseen emotional style transfer (The First Study!)
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional
speech dataset”, IEEE ICASSP 2021

EVC for seen and unseen emotions
• Stage I: Emotion Descriptor Training
• Stage II: Encoder-Decoder Training with VAW-GAN
• Stage III: Run-time Conversion

EVC for seen and unseen emotions
- AB preference test for speech
quality;
- XAB preference test for emotion
similarity;
- Both validate the effectiveness of the proposed framework for seen and
unseen emotion conversion.

Limited data EVC [5]
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-
to-Sequence Training”, Interspeech 2021
• Motivation
o Speech duration conversion has been a missing point in frame-based
models;
o Sequence-to-sequence (seq2seq) methods predict the duration with
attention mechanism;
o A seq2seq framework usually needs a large amount of training data!
(Lack of large amount of emotional speech data!)
• Contribution
o A seq2seq EVC framework leveraging TTS:
- Emotional voice conversion & emotional text-to-speech;
- Requires a limited amount of emotional speech data;
- Jointly models spectrum, prosody and duration;
- Non-parallel training;
- Many-to-many conversion;

Limited data EVC
• Two-stage Training:
- Stage I: Style Initialization:
Style Encoder learns speaker style from a large TTS corpus;
- Stage II: Emotion Training:
Style Encoder acts as emotion encoder to learn emotional style;

Limited data EVC
Figure 3: Visualization of emotion embedding derived from (a) style encoder and (b) emotion encoder.
- Emotion embedding derived from emotion encoder can form separate groups;
- A significant separation between two groups: (angry, happy, surprise) and (neutral, sad);

Limited data EVC
• Speech Samples*
Systems compared: Source, CycleGAN [2], StarGAN [3], Proposed, Target
Conversion pairs: Neutral-to-Angry, Neutral-to-Happy, Neutral-to-Sad, Neutral-to-Surprise
* For more speech samples: https://kunzhou9646.github.io/IS21/
[2] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” in Proc.
Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 230–237
[3] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “Stargan for emotional speech conversion: Validated by data augmentation of end-to-end
emotion recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020,
pp. 3502–3506

Limited data EVC
• Speech Samples*
(Emotional Text-to-Speech)
Input text:
“Clear than clear water”
Angry
Happy
Surprise
Sad
Clear than clear water.
Clear than clear water…
Clear than clear water?
Clear than clear water!

Limited data EVC
Proposed framework significantly outperforms the baselines in
emotion similarity evaluation

Conclusion
o Emotional voice conversion: theory and challenges;
o Our work:
- CycleGAN-based EVC [2]; [Speaker Odyssey 2020]
- Speaker-independent EVC [3]; [INTERSPEECH 2020]
- EVC for seen and unseen emotions [4]; [ICASSP 2021]
- Limited data EVC [5]; [INTERSPEECH 2021]
All codes are publicly available!
o Future studies:
- Emotional voice conversion with emotion strength control;
- Emotion interpolations for emotional voice conversion;
- Cross-lingual representations for emotional voice conversion;
[2] Kun Zhou, Berrak Sisman, Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data”, Speaker Odyssey 2020
[3] Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li, “Converting Anyone’s Emotion: Towards Speaker-independent Emotional Voice Conversion”, Interspeech 2020
[4] Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset”, IEEE ICASSP 2021
[5] Kun Zhou, Berrak Sisman, Haizhou Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training”, Interspeech 2021

THANK YOU
