
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 31, 2023

High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks

Reo Yoneyama, Yi-Chiao Wu, and Tomoki Toda, Senior Member, IEEE

Abstract—We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features, which represents the source-filter architecture in a generator network. Unlike previous neural source-filter models, in which parametric signal processing modules are combined with neural networks, our approach enables unified optimization of both the source excitation generation and resonance filtering parts to achieve higher sound quality. In the uSFGAN framework, several specific regularization losses are proposed to enable the source excitation generation part to output reasonable source excitation signals. Both objective and subjective experiments are conducted, and the results demonstrate that the proposed uSFGAN achieves comparable sound quality to HiFi-GAN in the speech reconstruction task and outperforms WORLD in the F0 transformation task. Moreover, we argue that the F0-driven mechanism and the inductive bias obtained by source-filter modeling improve the robustness against F0 values unseen in training, as shown by the results of experimental evaluations. Audio samples are available at our demo site at https://chomeyama.github.io/PitchControllableNeuralVocoder-Demo/.

Index Terms—Speech synthesis, neural vocoder, source-filter model, unified source-filter networks.

Manuscript received 3 January 2023; revised 19 July 2023; accepted 25 August 2023. Date of publication 11 September 2023; date of current version 20 October 2023. This work was supported in part by the JST, CREST under Grant JPMJCR19A3 and in part by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) under Grant JP21H05054. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hung-yi Lee. (Corresponding author: Reo Yoneyama.)
Reo Yoneyama and Yi-Chiao Wu are with the Graduate School of Informatics, Nagoya University, Nagoya 464-8601, Japan (e-mail: [email protected]; [email protected]).
Tomoki Toda is with the Information Technology Center, Nagoya University, Nagoya 464-8601, Japan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TASLP.2023.3313410
© 2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

Speech synthesis is a technology for generating speech waveforms on the basis of text or acoustic features. In particular, models conditioned on acoustic features are called vocoders. Vocoders have been widely adopted in many voice applications, such as text-to-speech (TTS), singing voice synthesis (SVS), and voice conversion (VC). In these applications, the quality of the final generated waveform strongly depends on the performance of the vocoder. Specifically, vocoders are required to generate speech of high sound quality in addition to providing functions for flexibly and independently controlling the generated speech in accordance with given acoustic features (e.g., F0, timbre, and periodicity). Furthermore, vocoders should be robust to unseen data, such as unseen speakers and F0, and continuously generate high-fidelity speech. For example, the vocoder of a multi-speaker TTS system for few-shot voice cloning should be able to tackle a wide range of F0 because of the frequent adaptations to arbitrary speakers: it is impractical to collect a corpus covering all unseen speakers, and even the utterances of the seen speakers may include out-of-range F0 values. Moreover, SVS often requires a significant deviation of F0 to generate a singing voice that transcends physical limitations. Therefore, the vocoder of SVS should have high robustness to unseen F0 over an extensive range.

However, most vocoders do not meet the mentioned requirements. Specifically, conventional vocoders [1], [2] based on source-filter models [3], [4], [5] can flexibly control speech characteristics, but the quality of the generated speech is low because of their over-simplified speech production process. Recent high-fidelity neural vocoders [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] lack robustness to unseen data because of their purely data-driven training manner. For example, the state-of-the-art neural vocoder HiFi-GAN [12] fails to generate high-fidelity speech when the input features include F0 values deviating from the F0 range of the training data. Furthermore, compared with the conventional source-filter models, those neural vocoders have poorer interpretability and less flexible controllability of speech characteristics. One reasonable way for neural vocoders to satisfy the above requirements is to introduce source-filter modeling to obtain sufficient flexibility and an inductive bias for human speech production. Several approaches [16], [17], [18], [19], [20], [21], [22], [23] have been investigated to combine the source-filter architecture with deep neural networks using signal-processing-based modules. However, there are several problems with incorporating parametric (signal-processing-based) modules and strong constraints into neural vocoders. For instance, the partial utilization of signal processing makes the optimization of the entire speech generation process difficult and degrades sound quality and F0 controllability, since the neural networks must compensate for the incomplete output of the parametric modules.

To achieve the flexibility and interpretability of source-filter modeling while maintaining the high sound quality of neural vocoders, we propose a novel framework of source-filter modeling on a single neural network, significantly reducing the effects of ad hoc designs. Unlike previous approaches that model either the source excitation generation part or the resonance filtering part on the basis of signal processing, as described in Section II, our approach enables the simultaneous optimization of these two parts of the entire network, which leads to better sound quality. Our approach described in Section III is based on generative adversarial networks (GANs) [24] and separates the generator network into a source excitation generation network and a resonance filtering network using a regularization loss on the intermediate output of the network. To further obtain better F0 robustness and F0 controllability, we adopt an F0-driven source excitation generation mechanism. Our experimental results shown in Section IV demonstrate that our proposed unified source-filter GAN (uSFGAN) achieves comparable sound quality to HiFi-GAN with much greater robustness against unseen F0. Moreover, uSFGAN achieves better sound quality than other models presented in several previous works, such as WORLD [2], neural source-filter (NSF) [21], and quasi-periodic parallel WaveGAN (QP-PWG) [25], [26], with high F0 controllability. Furthermore, we demonstrate that uSFGAN models output reasonable source excitation signals via visualization.
In this article, the techniques previously proposed in [27], [28] are organized, improved, and evaluated in a unified manner. Additionally, we newly consider another network structure and other input acoustic features, and we thoroughly assess their behaviors with more detailed experimental evaluations than in our previous works. We provide the code of our model at https://github.com/chomeyama/HN-UnifiedSourceFilterGAN.

II. RELATED WORK

In this section, we systematically introduce previous studies on vocoders based on the framework of the source-filter architecture [5]. The advantages and disadvantages of each approach are discussed. Fig. 1 shows the architectures of the previous studies and of our proposed method.

Fig. 1. Comparison of the architectures of conventional and neural vocoders in terms of the source-filter modeling.

A. Conventional Source-Filter Model

The source-filter architecture [5] is based on the idea that the human speech production process can be approximated by the modulation of a source excitation generated by vocal fold vibrations and a spectral filter modeling vocal tract resonances. The assumption of independence between the two parts provides us with high interpretability and flexible controllability of speech characteristics such as F0, timbre, and aperiodicity. Conventional source-filter-based vocoders such as STRAIGHT [1] and WORLD [2] achieve flexible controllability of speech characteristics while maintaining reasonable sound quality. Both models involve several assumptions to simplify the mathematical source-filter modeling of the speech production process. For example, these vocoders make assumptions based on prior knowledge, such as a time-invariant linear filter and a stationary Gaussian process. The source signals are modeled by a mixed excitation source that switches between pulse trains and white noise as the signal switches between voiced and unvoiced intervals, representing periodicity as a binary series of voiced or unvoiced parts. However, the mixed excitation source modeling loses the detailed temporal and phase information of the original speech, which often deteriorates the sound quality of the synthesized speech. Because of this simplified and ad hoc mathematical modeling, the conventional source-filter models achieve only low sound quality.

B. Neural Vocoders Based on Generative Models

Because of the powerful modeling capacity of current neural networks, neural vocoders have markedly improved the naturalness of synthesized speech. WaveNet [6], which recursively predicts samples at the sample level with a dilated convolution neural network (DCNN), has shown impressively high sound quality. WaveNet, originally designed for text-to-speech (TTS) applications, was initially conditioned on linguistic features. It is also possible to condition WaveNet on acoustic features, similar to conventional vocoders. To replace traditional vocoders with WaveNet, the WaveNet vocoder [29], which is conditioned on acoustically derived features, was proposed. Taking WaveNet as a vocoder significantly improves the synthetic speech quality and greatly reduces the required training data, making the WaveNet vocoder feasible for practical TTS systems.

However, WaveNet has a low speech generation speed owing to its autoregressive generation mechanism. WaveRNN [30] adopts a lightweight recurrent neural network (RNN) structure with acoustic feature conditioning and hardware-friendly designs to achieve real-time generation. These autoregressive models often use teacher forcing, a technique that provides the correct values instead of the output from previous steps. Although it is very effective in stabilizing training, the mismatch between the training and inference stages causes the exposure bias [31] problem, resulting in quality degradation.

Non-autoregressive models using inverse autoregressive flows (IAF) have been investigated as an alternative real-time waveform generation approach. These IAF-based models [32], [33], [34] achieve higher inference speed through parallel waveform generation. Distillation techniques are adopted to alleviate the low training efficiency of IAF models due to their autoregressive training manner. However, distillation requires complex two-stage training, and the connected teacher and student networks necessitate a large amount of memory for training.
To address these issues, GAN-based vocoders have been widely explored to take advantage of a compact generator size, because the discriminator greatly helps the compact generator achieve high-fidelity speech generation. Parallel WaveGAN (PWG) [35] and MelGAN [36] are the most popular recent GAN-based vocoders, and many subsequent GAN-based vocoders are based on them [11], [12], [13], [14], [25], [26], [27], [37], [38], [39], [40], [41]. Non-autoregressive models without GAN have also been proposed. WaveGlow [7] and WaveFlow [8] are normalizing-flow-based neural vocoders. These flow-based neural vocoders achieve high speech generation speed with convincing sound quality. Moreover, denoising diffusion probabilistic neural vocoders such as DiffWave [9] and WaveGrad [10] have been proposed, which iteratively refine Gaussian noise into speech via a Markov chain. Although the above-mentioned vocoders achieve impressive high-fidelity speech generation, independently controlling speech characteristics as the conventional source-filter vocoders do is challenging for these vocoders because of their purely data-driven training.

To tackle speech controllability, the WaveNet-based QP-Net [42], [43] and the PWG-based QP-PWG [25], [26] vocoders with pitch-dependent dilated convolution neural networks (PDCNNs) have been proposed. PDCNNs effectively improve F0 controllability by dynamically changing the CNN dilation size in accordance with the input F0. Although QP-PWG outperforms PWG in F0 controllability, there is still room for improvement.
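To make the mechanism concrete, the following is a minimal sketch of how a pitch-dependent dilation factor can be derived from the input F0. The formula E_t = round(Fs / (F0_t * a)) with a dense factor a follows our reading of the QP-PWG papers [25], [26]; the function name, default values, and the handling of unvoiced frames are illustrative assumptions rather than the exact implementation.

```python
# Hypothetical sketch of a pitch-dependent dilation factor, assuming the
# QP-PWG-style formulation E_t = round(Fs / (F0_t * a)) with dense factor a.
import numpy as np

def pitch_dependent_dilation(f0, fs=24000, dense_factor=4):
    """Return an integer dilation factor per frame from F0 values in Hz."""
    f0 = np.asarray(f0, dtype=np.float64)
    dilation = np.ones_like(f0)          # unvoiced frames (F0 = 0) keep dilation 1 (assumption)
    voiced = f0 > 0
    dilation[voiced] = np.round(fs / (f0[voiced] * dense_factor))
    return np.maximum(dilation, 1).astype(int)

print(pitch_dependent_dilation([0.0, 100.0, 200.0, 340.0]))  # -> [ 1 60 30 18]
```

A lower F0 thus yields a larger dilation, so the effective receptive field spans roughly the same number of pitch cycles regardless of the pitch.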
C. Neural Vocoders Based on Source-Filter Modeling

Many neural vocoders based on the source-filter architecture have been proposed to combine the high modeling capacity of deep neural networks with the merits of conventional source-filter modeling. For example, NSF [20], [21], [44] models the source excitation generation part by adding multiple sinusoidal signals and the resonance filtering part through multistage dilated convolution neural networks. The neural homomorphic vocoder (NHV) [22] generates waveforms on the basis of partly trainable digital signal processing modules with adversarial training. As another example, LPCNet [16] adopts linear filtering based on linear predictive coding [45], [46], [47], [48], and the neural network predicts the residual signal in an autoregressive manner, whereas GlotGAN [17], [18] and GELP [19] generate it in a non-autoregressive manner.

Despite their practical approach to introducing the source-filter architecture into deep neural networks, using signal-processing-based modules under ad hoc assumptions usually results in sound quality and F0 controllability degradations. We hypothesize that the reason for the degradation is the massive burden on the neural networks and the lack of ability to control F0. For instance, in NSF, the insufficient capacity of the source excitation generation part, which outputs the source excitation signal by adding a fixed number of sinusoidal signals, forces the spectral filtering part based on multistage dilated convolution neural networks to compensate for the missing information of the source excitation signal. As other examples, LPCNet, GlotGAN, and GELP, whose neural-network-based source excitation generation parts are trained in a data-driven manner, suffer from significant degradation of their performance when there are unseen acoustic features, such as F0 values that are outside the F0 range of the training data.

Fig. 2. Overall architecture of proposed uSFGAN.

III. UNIFIED SOURCE-FILTER GAN

To develop a high-fidelity and F0-controllable neural vocoder, we propose uSFGAN, which represents the source-filter architecture with a single neural network based on GAN. The generator network is factorized into a source excitation generation network (source network) and a resonance filtering network (filter network) using a regularization loss to make the source network output reasonable source excitation signals. To further improve the F0 controllability, we introduce F0-driven mechanisms designed on the basis of QP-PWG and NSF into the source network. Moreover, inspired by the recent successes of neural vocoders that adopt harmonic-plus-noise (HN) speech modeling [13], [14], [21], [22], we introduce HN source excitation generation to obtain better sound quality. The overall architecture of uSFGAN is shown in Fig. 2, and the generator architectures are shown in Fig. 3.

Fig. 3. Details of generator architectures. (a) Primary uSFGAN generator. (b) Harmonic-plus-noise uSFGAN generator. (c) QP-PWG macroblock. (d) Periodicity estimator. The red lines and blocks are used only for cascade harmonic-plus-noise source excitation generation. The output layers consist of two pairs of ReLU activation and one-by-one (1 × 1) convolution layers.

A. Factorization of Generator Network

To make the proposed generator function like a source-filter model for achieving high acoustic controllability, two novel regularization losses are applied to the output of the source network to achieve reasonable source excitation signal generation.

1) Spectral Envelope Flattening Regularization Loss: The first regularization loss is designed on the assumption that the spectral envelopes of the source excitation signal are flat and their amplitude is constant. To match these constraints for the output signal of the source network, we take the L1 norm of the log amplitude spectral envelopes of the output source excitation signal calculated using a simplified version [27] of CheapTrick [49]. The spectral envelope flattening regularization loss is formulated as

$L_{\mathrm{reg}}(G) = \mathbb{E}_{z}\left[\frac{1}{N}\,\big\|\log \hat{E}_{z}\big\|_{1}\right],$  (1)

where $\|\cdot\|_{1}$, $\hat{E}_{z}$, $N$, and $z$ denote the L1 norm, the magnitude of the source spectral envelopes of the output source excitation signal, the number of elements in the magnitude, and the Gaussian noise signal, respectively. Note that when this loss reaches zero, the linear amplitude values $\hat{E}$ are one over all frequency bins and time frames. The initial uSFGAN paper [27] employed the L2 norm for this loss function, but we use the L1 norm because of the improvement in several objective evaluation indices, such as F0 reconstruction accuracy and the voiced or unvoiced decision error rate.
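Below is a minimal PyTorch sketch of (1). It assumes the amplitude spectral envelope of the generated excitation has already been estimated (e.g., with the simplified CheapTrick mentioned above); the envelope estimator itself is omitted, and the epsilon term is an assumption added for numerical stability.

```python
# Sketch of the flattening loss in (1); `env_hat` is assumed to be the magnitude
# spectral envelope of the generated excitation, shape (batch, frames, bins).
import torch

def flattening_loss(env_hat: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Mean absolute log envelope (the mean realizes the 1/N factor); the loss is
    # zero exactly when every linear amplitude equals one, i.e., a flat envelope.
    return torch.mean(torch.abs(torch.log(env_hat + eps)))
```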
2) Residual Spectra Targeting Regularization Loss: The second loss is designed to utilize the residual spectra calculated from the target speech and the spectral envelopes extracted in the same way as above. However, to minimize the effects of the estimation error of F0 and phases between the generated and ground-truth speeches, we apply the mel-filter-bank to the amplitude spectrogram. The residual spectra regularization loss is formulated as

$L_{\mathrm{reg}}(G) = \mathbb{E}_{x,z}\left[\frac{1}{N}\,\big\|\log \psi(S_{x}) - \log \psi(\hat{S}_{z})\big\|_{1}\right],$  (2)

where $x$, $\psi$, and $N$ denote the ground-truth speech, the function that transforms a spectral magnitude into the corresponding mel-spectrogram, and the number of elements in the mel-spectrogram, respectively; $S_{x}$ denotes the magnitude of the residual spectra that have the same frame-wise average power as that of the ground-truth speech, and $\hat{S}_{z}$ denotes the spectral magnitude of the output source excitation signal. Unlike the spectral envelope flattening regularization loss, this loss leaves the power estimation to the source network, similarly to the actual human speech production process, where the power is controlled during sound generation.
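The following sketch illustrates (2) under the assumption that the residual spectra S_x and the excitation magnitudes S_hat are precomputed and that psi is realized as a mel filter-bank matrix; how S_x is derived and power-normalized is only summarized in the text above, so this should be read as an illustration rather than the authors' exact implementation.

```python
# Sketch of the residual-spectra targeting loss in (2); S_x and S_hat are assumed
# precomputed magnitudes of shape (frames, bins), mel_basis of shape (n_mels, bins).
import torch

def residual_targeting_loss(S_x, S_hat, mel_basis, eps=1e-7):
    mel_x = torch.log(torch.matmul(S_x, mel_basis.t()) + eps)      # log psi(S_x)
    mel_hat = torch.log(torch.matmul(S_hat, mel_basis.t()) + eps)  # log psi(S_hat)
    return torch.mean(torch.abs(mel_x - mel_hat))                  # L1 with the 1/N factor
```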
ing PeriodNet. Moreover, the periodicity estimation is crucial for
the naturalness of generated speech. Regarding NHV [22] and
B. F0 -Driven Source Excitation Generation HN parallel waveGAN (HN-PWG) [14], we prepare a network
Source excitation signals have high periodicity owing to to estimate periodicity-related weights from acoustic features
their generation process that is based on vocal folds vibrations. and mix periodic and aperiodic source components on the basis
Inspired by NSF [20], [21], [44], we input a sinusoidal-based of the weights.
YONEYAMA et al.: HIGH-FIDELITY AND PITCH-CONTROLLABLE NEURAL VOCODER BASED ON UNIFIED SOURCE-FILTER NETWORKS 3721
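A sketch of an NSF-style sinusoidal excitation generated from frame-level F0 is shown below; the amplitude and noise constants and the simple repeat-based upsampling are illustrative assumptions, not the exact formulation used in uSFGAN.

```python
# Sketch of an NSF-style sine excitation from frame-level F0 (values in Hz, 0 = unvoiced).
import numpy as np

def sine_excitation(f0_frames, fs=24000, hop=120, amp=0.1, noise_std=0.003):
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)     # frame rate -> sample rate
    phase = 2.0 * np.pi * np.cumsum(f0 / fs)                         # instantaneous phase
    sine = amp * np.sin(phase)                                       # input F0 kept as the fundamental
    return sine + np.random.normal(0.0, noise_std, size=sine.shape)  # additional random noise

# Example: 100 frames of a constant 140 Hz pitch -> a 12000-sample excitation.
excitation = sine_excitation(np.full(100, 140.0))
```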

To improve the source excitation signal modeling, especially for the unvoiced parts, we introduce a harmonic-plus-noise excitation generation mechanism inspired by the current successful works [13], [14], [21], [22] based on [50]. To explicitly model the periodic and aperiodic components, previous works [13], [14], [21], [22], [50] prepared two networks for generating each component and devised the architecture and input features for each. We adopt two harmonic-plus-noise modeling schemes, the cascade and parallel model structures, referring to PeriodNet [13]. Hono et al. represent the dependence of the periodic and aperiodic speech signals with the model structure. The cascade model structure combines the periodic and aperiodic speech generators in series so that the latter generator can predict the aperiodic component taking into account its dependence on the periodic component. On the other hand, the parallel model structure assumes their independence. To ascertain whether the cascade or the parallel structure is superior in modeling the source excitation signal, we propose the two approaches following PeriodNet. Moreover, the periodicity estimation is crucial for the naturalness of the generated speech. Following NHV [22] and HN parallel WaveGAN (HN-PWG) [14], we prepare a network to estimate periodicity-related weights from acoustic features and mix the periodic and aperiodic source components on the basis of these weights.

The HN source excitation generation module consists of three networks: the harmonic network, the noise network, and the periodicity estimator, as shown in Fig. 3(b). Note that the red lines and red blocks are used only in the cascade approach. The harmonic network outputs latent features $l^{(h)}$ that correspond to the periodic components of the source excitation signal from a sinusoidal signal and auxiliary features. On the other hand, the noise network outputs latent features $l^{(n)}$ that correspond to the aperiodic components of the source excitation signal from a random noise signal and auxiliary features. In the cascade approach, the noise network also receives the output of the harmonic network. We use the QP-PWG macroblock shown in Fig. 3(c) in the harmonic network, while the PWG macroblock is used in the noise network. We adopt the harmonicity estimator of HN-PWG as the periodicity estimator shown in Fig. 3(d). Conditioned on the auxiliary features, the periodicity estimator outputs channel-wise and sample-wise weights $a$ within [0, 1] corresponding to the speech periodicity. The two generated representations are summed element-wise using the estimated weights. The source excitation latent feature $l$ is formulated as

$l_{t,i} = a_{t,i} \cdot l^{(h)}_{t,i} + (1 - a_{t,i}) \cdot l^{(n)}_{t,i},$  (3)

where the subscripts indicate the $i$th channel of the $t$th sample of each latent feature or weight. Since the periodicity is estimated from the auxiliary features, the input sinusoidal signal is generated using the continuous F0 values obtained by interpolating the discontinuous F0 values.
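The mixing in (3) reduces to an element-wise interpolation between the two latent representations, sketched below; the tensor shapes and the assumption that the weights are already bounded in [0, 1] (e.g., by a sigmoid output of the periodicity estimator) are ours.

```python
# Sketch of the periodicity-weighted mixing in (3); all tensors are assumed to be
# broadcastable to (batch, channels, samples), with weights `a` in [0, 1].
import torch

def mix_source_latents(l_h: torch.Tensor, l_n: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    # Periodic latents dominate where the estimated periodicity is high,
    # aperiodic latents where it is low.
    return a * l_h + (1.0 - a) * l_n
```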
The cascade approach comprises three steps, as shown in Fig. 3(b). First, the harmonic network outputs the periodic source excitation representation, which is modulated using the channel-wise weights predicted by the periodicity estimator. Second, a random noise signal is mapped to a latent representation and mixed with the periodic source representation using a 1 × 1 convolution layer and the noise network. Finally, the output latent feature of the noise network is modulated using the weights and summed with the modulated periodic source excitation representation to output the final source excitation representation. On the other hand, in the parallel approach, the aperiodic source representation is generated without the output periodic source representation of the harmonic network.

C. Adversarial Training

The training procedure of uSFGAN follows common GAN-based training with the auxiliary regularization losses added. The discriminator D is trained to identify natural samples as real and generated samples as fake by minimizing the following optimization criterion:

$L_{D}(G, D) = \mathbb{E}_{x}\left[(1 - D(x))^{2}\right] + \mathbb{E}_{z}\left[D(G(z))^{2}\right],$  (4)

where $x$ denotes the natural samples drawn from the data distribution of natural samples, and $z$ is random noise drawn from the Gaussian distribution $\mathcal{N}(0, I)$. On the other hand, the generator G is trained to deceive the discriminator by minimizing the following adversarial loss:

$L_{\mathrm{adv}}(G, D) = \mathbb{E}_{z}\left[(1 - D(G(z)))^{2}\right].$  (5)

The final loss function of the generator can be written as the sum of the regularization loss $L_{\mathrm{reg}}$, the auxiliary spectral loss $L_{\mathrm{spc}}$, and the adversarial loss $L_{\mathrm{adv}}$:

$L_{G}(G, D) = L_{\mathrm{reg}}(G) + \lambda_{\mathrm{spc}} L_{\mathrm{spc}}(G) + \lambda_{\mathrm{adv}} L_{\mathrm{adv}}(G, D),$  (6)

where $\lambda_{\mathrm{spc}}$ and $\lambda_{\mathrm{adv}}$ are loss-balancing hyperparameters.
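The least-squares objectives in (4) and (5) and the generator total in (6) can be sketched as follows; `d_real` and `d_fake` stand for discriminator outputs on natural and generated waveforms, and the default weights shown are the C-/P-uSFGAN values reported in Section IV-B, used here only as an example.

```python
# Sketch of the LSGAN-style losses (4)-(5) and the weighted generator total (6).
import torch

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    return torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)   # (4)

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    return torch.mean((1.0 - d_fake) ** 2)                             # (5)

def generator_total_loss(l_reg, l_spc, l_adv, lambda_spc=45.0, lambda_adv=1.0):
    return l_reg + lambda_spc * l_spc + lambda_adv * l_adv             # (6)
```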
GAN-based adversarial training is effective for neural vocoders to implicitly learn perceptual aspects, such as phases, required for generating high-quality samples. GAN-based vocoders usually adopt auxiliary losses in the spectral domain and the feature matching loss to avoid mode collapse and improve the training stability. PWG [35] adopts the multi-resolution short-time Fourier transform (STFT) loss, which can partly capture information about the distance in phases in addition to the spectral structure between the natural speech $x$ and the generated speech $\hat{x} = G(z)$. It is formulated as the sum of the spectral convergence losses ($L_{\mathrm{sc}}$) and the log STFT magnitude losses ($L_{\mathrm{mag}}$) as follows:

$L_{\mathrm{spc}}(G) = \frac{1}{M} \sum_{m=1}^{M} L_{s}^{(m)}(G),$  (7)

$L_{s}(G) = \mathbb{E}_{x,z}\left[L_{\mathrm{sc}}(x, G(z)) + L_{\mathrm{mag}}(x, G(z))\right],$  (8)

$L_{\mathrm{sc}}(x, \hat{x}) = \frac{\big\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \big\|_{F}}{\big\| \, |\mathrm{STFT}(x)| \, \big\|_{F}},$  (9)

$L_{\mathrm{mag}}(x, \hat{x}) = \frac{1}{N} \big\| \log |\mathrm{STFT}(x)| - \log |\mathrm{STFT}(\hat{x})| \big\|_{1},$  (10)

where $M$ and $L_{s}^{(m)}$ denote the number of sets of analysis parameters for the STFT and the spectral loss defined as (8) calculated with the $m$th set, respectively. Moreover, $\|\cdot\|_{F}$, $|\mathrm{STFT}(\cdot)|$, and $N$ denote the Frobenius norm, the STFT magnitudes, and the number of elements in the magnitude, respectively.
ing, such as normalization or low-cut filtering, was applied to the set to 1.0, 2.0, and 45.0, respectively. HN-NSF was conditioned
audio. We divided the dataset following a specific rule to evaluate using discrete F0 , the mel-generalized cepstrum (MGC), and
robustness against unseen F0 values. The minimum and maxi- mel-cepstral aperiodicity (MAP). We trained it for 600 k steps
mum F0 values of the VCTK corpus were respectively found to with the batch size set to 1 as the original model, and the batch
be about 50 Hz and 400 Hz through careful investigation of each length was set to 24000 (1.0 [s]) using the original setting of the
speaker. We limited the F0 range of the training data from 70 Hz Adam optimizer. This model was trained using only the L2 loss
to 340 Hz and excluded two speakers (p271 and p300) from on the log power spectrogram. QP-PWG was conditioned using
the training data to evaluate the robustness of unseen speakers. almost the same features as those for HN-NSF, but continuous F0
Thanks to this limitation, we can evaluate the methods using and a binary sequence representing voiced or unvoiced (V/UV)
various conditions of seen or unseen speakers and F0 ranges. segments were used instead of the discrete F0 . We trained it for
600 k steps with the batch size set to 5 and the batch length set
B. Model Details to 18000 (0.75 [s]) using the original setting of the RAdam [53]
1) Baseline Models: As the baselines, we used the following optimizer. The loss weights followed the original setting. The
four models. weights of the adversarial loss and the multi-resolution STFT
r HiFi-GAN: A high-fidelity GAN-based neural vocoder loss were set to 4.0 and 1.0, respectively.
with four multi-period discriminators and four multi-scale We extracted F0 using the Harvest algorithm [54] with care-
discriminators. HiFi-GAN has no clue for controlling F0 , fully set F0 search range for each speaker. Then we extracted the
so we used it as the baseline for the evaluation of speech log power spectral envelope using the CheapTrick algorithm [49]
reconstruction. To train the HiFi-GAN model, we adopted and coded it into the corresponding 41-dimensional MGC with
the HiFi-GAN V1 [12] configuration and used an unofficial the all-pass-constant set to 0.466. Also, we extracted aperiodicity
open-source implementation2 for training the model. using D4C algorithm [55] and coded them into the correspond-
r WORLD: A conventional source-filter model. This model ing 21-dimensional MAP. These features were calculated with
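A sketch of (11) using torchaudio's mel-spectrogram as the mapping phi is shown below; the analysis parameters mirror those listed later in Section IV-B but should be treated as assumptions at this point.

```python
# Sketch of the mel-spectral L1 loss in (11); x and x_hat are (batch, samples) waveforms.
import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, win_length=1024, hop_length=120, n_mels=80)

def mel_l1_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    return torch.mean(torch.abs(mel_fn(x) - mel_fn(x_hat)))   # (11): mean realizes the 1/N factor
```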
We aim to develop a vocoder capable of synthesizing speech that faithfully reflects the input acoustic features. Since uSFGAN is conditioned on vocoder features, such as F0, spectral envelopes, and aperiodicity, estimation error is inevitable. Therefore, we argue that a looser constraint in the auxiliary spectral loss eases the mismatch between the input and real features, especially F0 and phases. Although the multi-resolution STFT loss and the feature matching loss facilitate fine matches between the ground-truth and generated speeches, it is difficult for uSFGAN to satisfy them fully. In fact, for our best-proposed model, the mel-spectral L1 loss is used with the exact formulation of HiFi-GAN. The application of the mel-filter-bank eases the effect of the F0 and phase mismatch, making the optimization more straightforward and reasonable. Moreover, with adversarial training using sufficiently strong, sophisticated discriminators, the generator can learn reasonable phases from the adversarial loss. Furthermore, although HiFi-GAN and MelGAN [36] adopt the feature matching loss to obtain the deep classification information provided by the discriminator, uSFGAN does not adopt the feature matching loss because of the mismatch problem of phase and F0 between the generated and ground-truth speeches.

IV. EXPERIMENTAL EVALUATIONS

A. Data Preparation

We used the VCTK corpus [51], which contains 109 English speakers. We used only the mic2 samples, and p315 was unavailable owing to a technical problem. The sampling rate was set to 24 kHz using the sox¹ downsampling function. No preprocessing, such as normalization or low-cut filtering, was applied to the audio. We divided the dataset following a specific rule to evaluate robustness against unseen F0 values. The minimum and maximum F0 values of the VCTK corpus were found to be about 50 Hz and 400 Hz, respectively, through careful investigation of each speaker. We limited the F0 range of the training data to 70 Hz to 340 Hz and excluded two speakers (p271 and p300) from the training data to evaluate the robustness to unseen speakers. Thanks to this limitation, we can evaluate the methods under various conditions of seen or unseen speakers and F0 ranges.

B. Model Details

1) Baseline Models: As the baselines, we used the following four models.
- HiFi-GAN: A high-fidelity GAN-based neural vocoder with four multi-period discriminators and four multi-scale discriminators. HiFi-GAN has no explicit mechanism for controlling F0, so we used it as the baseline for the evaluation of speech reconstruction. To train the HiFi-GAN model, we adopted the HiFi-GAN V1 [12] configuration and used an unofficial open-source implementation² for training the model.
- WORLD: A conventional source-filter model. This model achieves flexible controllability of acoustic features with reasonable sound quality. We used a Python wrapper³ of the original WORLD implementation⁴.
- HN-NSF: A harmonic-plus-noise neural source-filter model with time-variant and trainable sinc filters whose cut-off frequencies are predicted from the input acoustic features. We reimplemented the model on the basis of the official open-source code⁵ without changing the model configuration except for increasing the training iterations.
- QP-PWG: An F0-controllable neural vocoder based on GAN without the source-filter separation. It controls F0 via the PDCNNs and the input auxiliary F0. We increased the number of residual blocks from the original configuration: PDCNNs 10 → 30 and DCNNs 10 → 30. The capacities of the QP-PWG model and the basic uSFGAN model detailed below are the same regarding the number of residual blocks.

¹ sox: http://sox.sourceforge.net/
² An unofficial implementation of HiFi-GAN: https://github.com/kan-bayashi/ParallelWaveGAN
³ A Python wrapper of the WORLD vocoder: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder
⁴ WORLD official implementation: https://github.com/mmorise/World
⁵ NSF official PyTorch implementation: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts

We conditioned the HiFi-GAN model on the mel-spectrogram, as in the original model, with 80 mel-filter-banks, 1024 fast Fourier transform (FFT) points, 1024 points of the Hanning window, and the hop size set to 120 (5 [ms]). We trained it for 2500 k iterations, as in the original recipe, with the batch size set to 16 and the batch length set to 18000 (0.75 [s]), using the original setting of the Adam [52] optimizer. The loss weights followed the original setting: the weights of the adversarial loss, the feature matching loss, and the mel-spectral loss were set to 1.0, 2.0, and 45.0, respectively. HN-NSF was conditioned on discrete F0, the mel-generalized cepstrum (MGC), and mel-cepstral aperiodicity (MAP). We trained it for 600 k steps with the batch size set to 1, as in the original model, and the batch length set to 24000 (1.0 [s]), using the original setting of the Adam optimizer. This model was trained using only the L2 loss on the log power spectrogram. QP-PWG was conditioned on almost the same features as those for HN-NSF, but continuous F0 and a binary sequence representing voiced or unvoiced (V/UV) segments were used instead of the discrete F0. We trained it for 600 k steps with the batch size set to 5 and the batch length set to 18000 (0.75 [s]), using the original setting of the RAdam [53] optimizer. The loss weights followed the original setting: the weights of the adversarial loss and the multi-resolution STFT loss were set to 4.0 and 1.0, respectively.

We extracted F0 using the Harvest algorithm [54] with a carefully set F0 search range for each speaker. We then extracted the log power spectral envelope using the CheapTrick algorithm [49] and coded it into the corresponding 41-dimensional MGC with the all-pass constant set to 0.466. We also extracted aperiodicity using the D4C algorithm [55] and coded it into the corresponding 21-dimensional MAP. These features were calculated with a shift period of 5 ms. The mel-spectrogram was calculated using the librosa [56] function with the FFT size and window length set to 1024 and the hop length set to 120 (5 [ms]) with a Hanning window.
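The WORLD-based part of this pipeline can be sketched with the pyworld and pysptk packages as follows; the function names are those of these third-party wrappers, the MGC order and all-pass constant follow the text, the per-speaker F0 search range is reduced here to fixed placeholder bounds, and the coding of aperiodicity into 21-dimensional MAP is omitted.

```python
# Sketch of the vocoder-feature extraction described above (Harvest, CheapTrick, D4C).
import numpy as np
import pyworld as pw
import pysptk

def extract_features(x, fs=24000, frame_period=5.0, f0_floor=70.0, f0_ceil=340.0):
    x = x.astype(np.float64)
    f0, t = pw.harvest(x, fs, f0_floor=f0_floor, f0_ceil=f0_ceil, frame_period=frame_period)
    sp = pw.cheaptrick(x, f0, t, fs)                                       # power spectral envelope
    ap = pw.d4c(x, f0, t, fs)                                              # aperiodicity
    mgc = np.apply_along_axis(pysptk.sp2mc, 1, sp, order=40, alpha=0.466)  # 41-dim MGC per frame
    return f0, mgc, ap
```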
2) Proposed Models: We used the following three uSFGAN-based models in the comparison experiments.
- uSFGAN: This model is based on our method proposed in [27]. The source network comprises 30 PDCNN blocks with six cycles, the filter network comprises 30 DCNN blocks with three cycles, and the PWG discriminator and the PWG-based training procedure were used. The modifications are that the regularization loss became the L1 norm, and the input signal became a one-channel sinusoidal-based signal generated by the formula of NSF instead of a two-channel signal (a random noise signal and a sinusoidal-based signal without randomness). The updated loss leads to better performance on the objective metrics, and the input signal change was made to simplify the comparison.
- C-uSFGAN (Cascade HN-uSFGAN): The first proposed model, with the cascade harmonic-plus-noise excitation generation, the residual spectra targeting loss, the mel-spectral loss, and the HiFi-GAN discriminator. The harmonic network had 20 PDCNN blocks with four cycles, the noise network was composed of five CNN blocks without cycles, and the filter network was the same as that of the basic uSFGAN.
- P-uSFGAN (Parallel HN-uSFGAN): The second proposed model. The network architecture was the same as that of C-uSFGAN except for the parallel rather than cascade structure. This model is based on that in [28], but we made several improvements to it. Specifically, continuous F0 was removed from the auxiliary features, and the two-dimensional BAP was changed to the corresponding 21-dimensional MAP. More details about the feature choices are described in Appendix A. Moreover, empirically better loss-weighting hyperparameters were used in this article.

To enable the model to access the F0 information only from the input sine waves, the auxiliary features included only MGC and MAP in all models. According to our preliminary experiments, this information restriction mechanism is essential for the proposed models to deal with excessively deviated F0, such as 2.0 × F0 for female speakers with higher average F0. The extraction of these acoustic features followed the same process as for the baselines. The batch size and batch length of all the proposed models were set to 5 and 18000 (750 [ms]), respectively, as in QP-PWG. The uSFGAN model was trained with only the auxiliary losses for the first 100 k iterations and with the discriminator for the remaining 500 k steps, using the RAdam optimizer with the same setting as that in QP-PWG. The loss weights were set on the basis of those of QP-PWG: λadv = 4.0, λspc = 1.0, λreg = 1.0. On the other hand, C-uSFGAN and P-uSFGAN followed the HiFi-GAN training procedure of simultaneously training the generator and the discriminators from scratch for 600 k iterations, using the Adam optimizer with the same setting as that in HiFi-GAN. The loss weights were set on the basis of those of HiFi-GAN: λadv = 1.0, λspc = 45.0, λreg = 1.0.

TABLE I: Number of model parameters and real-time factors (RTF) calculated on a single GPU (TITAN RTX 3090) and a CPU with four threads (AMD EPYC 7302).

The model sizes of the baselines and the proposed models are shown in Table I. Their inference speeds are also detailed with the real-time factor (RTF) in the same table. As shown in the table, the proposed models are much smaller than HiFi-GAN with the V1 configuration, whereas HiFi-GAN achieves a much higher inference speed on a single GPU and on a CPU than the proposed models. HiFi-GAN adopts a configuration based on upsampling, where the preceding layers have lower temporal resolutions, resulting in higher computational efficiency and enabling fast waveform generation. On the other hand, the other models operate at a fixed temporal resolution consistent with the output waveform from the input. Since the computational complexity is proportional to the temporal resolution, these models tend to have slower speeds than the upsampling-based approach.

3) Ablation Models: To investigate the effectiveness of each component in our best-proposed P-uSFGAN described above, we prepared the following four ablation models for the comparison experiments. The input features and the training procedure of the ablation models followed those of P-uSFGAN.
- Reg-Loss: P-uSFGAN trained with the spectral envelope flattening loss instead of the residual spectra targeting loss.
- HN-SN: P-uSFGAN without the parallel harmonic-plus-noise source network but with the generator of the basic uSFGAN (30 layers of PDCNNs).
- HiFi-D: P-uSFGAN without the multi-period or multi-scale discriminators of HiFi-GAN but with the discriminator of PWG. We set λadv = 8.0 to match the reduced number of discriminators.
- Mel-Loss: P-uSFGAN trained with the multi-resolution STFT loss of PWG instead of the mel-spectral L1 loss. We set λspc = 20.0 so that the loss values before and after the change have roughly the same magnitude.

C. Evaluation of Speech Reconstruction

To evaluate the robustness of the proposed models to unseen acoustic features, both objective and subjective tests were conducted on the speech reconstruction performance. That is, three evaluation sets, including natural acoustic features within, beyond, and below the F0 training range, were adopted.

1) Objective Evaluation: As the objective evaluation measures, the root mean square error of log F0 [Hz] (RMSE), the voiced or unvoiced decision error [%] (V/UV), and the mel-cepstral distortion [dB] (MCD) were used. The results are shown in Table II, where the results are divided on the basis of F0 range. Each group included 200 utterances containing equal numbers of utterances by seen and unseen speakers. Since the primary purpose of our experiment was to investigate the F0 robustness of the neural vocoders, and we confirmed that the proposed method did not cause significant degradation for unknown speakers [28], we only report the evaluation results for all speakers together.
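For reference, the three measures can be sketched as follows; the V/UV decision is assumed to be derived from the extracted F0 (zero meaning unvoiced), and the MCD constant 10·sqrt(2)/ln 10 with the 0th cepstral coefficient excluded is the common convention rather than a detail stated in the text.

```python
# Sketch of the objective measures: log-F0 RMSE, V/UV decision error, and MCD.
import numpy as np

def log_f0_rmse(f0_ref, f0_gen):
    voiced = (f0_ref > 0) & (f0_gen > 0)           # compare only frames voiced in both
    return np.sqrt(np.mean((np.log(f0_ref[voiced]) - np.log(f0_gen[voiced])) ** 2))

def vuv_error_rate(f0_ref, f0_gen):
    return 100.0 * np.mean((f0_ref > 0) != (f0_gen > 0))   # [%]

def mcd(mgc_ref, mgc_gen):
    diff = mgc_ref[:, 1:] - mgc_gen[:, 1:]                 # drop the 0th (energy) coefficient
    return np.mean(10.0 * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)) / np.log(10.0))  # [dB]
```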
TABLE II: Results of the objective evaluations of speech reconstruction. The best scores are in bold.

Conventional parametric vocoders such as WORLD usually achieve higher objective acoustic controllability than neural vocoders [26], and the results of the objective evaluation demonstrate the same tendency. Specifically, the baseline neural vocoders suffer from degradation when unseen F0 is given, even though they partly outperform WORLD in the case of the seen F0 range. In particular, QP-PWG shows large degradation in the V/UV error rate for the F0 range below the training range. On the other hand, uSFGAN, whose differences from QP-PWG are the explicit decomposition of the source and filter networks and the input sinusoidal-based signal, does not show significant degradation in any case. This implies the benefit provided by the source-filter modeling, that is, an inductive bias for the speech production process leading to robustness to unseen F0. Note that C-uSFGAN and P-uSFGAN show the best results in V/UV error rate, greatly outperforming WORLD, indicating the effectiveness of the harmonic-plus-noise architecture and of updating the loss functions. In conclusion, the proposed methods attain acoustic controllability similar to or better than that of conventional parametric vocoders.

2) Subjective Evaluation: For the subjective evaluation, we conducted an opinion test on sound quality using seven models and natural speech with ten subjects. Each subject evaluated 20 utterances per method. We recruited English-speaking evaluators through Amazon Mechanical Turk and instructed them to listen to the audio in a quiet room with headphones or earphones. We also filtered out scores from evaluators with unreasonable answers, such as cases where almost all scores were the same or the score of natural speech was lower than that of every system.

Fig. 4. Evaluation results of the MOS test. The average scores over all ranges are natural: 4.02, HiFi-GAN: 3.88, WORLD: 3.70, QP-PWG: 3.82, uSFGAN: 3.86, C-uSFGAN: 3.97, P-uSFGAN: 3.99, and P-uSFGAN - HiFi-D: 3.92.

The results are shown in Fig. 4, where the results are divided on the basis of F0 range. HN-NSF was clearly inferior to the other models in sound quality, so we excluded it from the subjective evaluation experiment because of the possibility of an undesired bias in which the other samples would be rated highly. HN-NSF is a very basic baseline, and we speculate that the degradation was due to the simplicity of the model architecture and its low capacity to adapt to the large number of speakers in the VCTK corpus. However, we did not conduct any hyperparameter tuning on HN-NSF and note that there is a possibility that its performance could be improved by increasing the number of layers or introducing adversarial training.

We can see that all models except WORLD achieve scores comparable to natural speech. Interestingly, QP-PWG, which uses the discriminator of PWG, achieves the best score, outperforming HiFi-GAN. The reason for the improvement of QP-PWG over the original model would be the increase in the number of generator layers (20 → 60 residual blocks). However, for the unseen F0 ranges, the proposed C-uSFGAN and P-uSFGAN achieve the best results, whereas QP-PWG is considerably degraded. Moreover, the differences between HiFi-GAN and natural speech become more prominent than in the case within the training F0 range. On the other hand, there are no significant differences between C-uSFGAN and P-uSFGAN and natural speech in all cases. These results indicate that HiFi-GAN is data-driven and QP-PWG is highly data-driven, whereas our proposed C-uSFGAN and P-uSFGAN complement the shortcomings of a data-driven approach.
D. Evaluation of F0 Transformation

Next, we evaluated the performance of F0 transformation with factors within $[2^{-1.0}, 2^{1.0}]$. The magnifications were taken equally on the logarithmic axis with base 2. The ground-truth F0 was determined by multiplying the F0 extracted from natural speech by the scale factors, and these values were also adopted as the input F0 of the models.

1) Objective Evaluation Settings: We extracted F0 using the WORLD analyzer by the following procedure. When F0 was multiplied by a scale factor greater than one, only the upper bound of the F0 search range was multiplied and transformed; otherwise, only the lower bound of the range was multiplied and transformed. MCD was calculated using the CheapTrick [49] algorithm provided by the WORLD analyzer, and the extracted F0 was used for the calculation. However, we downsampled the audio signals to 16000 [Hz] before estimating the spectral envelopes because the CheapTrick algorithm sometimes fails in the estimation when the F0-adaptive window size is larger than the FFT size. We made the available fixed FFT size sufficiently large by reducing the size of the F0-adaptive window through downsampling and thus calculated MCD more accurately. The evaluations were conducted using the evaluation data whose F0 range was within the training F0 range (i.e., 70 − 340 [Hz]).

Fig. 5. Objective evaluation results of F0 transformation for the comparison with the baseline models. The MCD values of HN-NSF are excluded because they deviate from the range of the y-axis in which the results of the other models are gathered.

2) Objective Evaluation: The objective evaluation results of the comparison with the baseline models are shown in Fig. 5. The result of log F0 RMSE shows that although the other models suffer from degradation in the extreme cases (F0 × {2^{-1.0}, 2^{1.0}}), the proposed C-uSFGAN and P-uSFGAN models achieve stable values close to that of WORLD. However, the two models achieve much lower V/UV error rates than all the baseline models, which we found to have more impact on sound quality in our preliminary experiments. We can also see that all the proposed models achieve better MCDs than WORLD. Again, the V/UV error rate and the RMSE of log F0 of QP-PWG degrade as the scale factor increases or decreases, respectively. In contrast, uSFGAN does not significantly degrade for any factor, indicating the benefit of the source-filter decomposition.

Fig. 6. Objective evaluation results of F0 transformation for the ablation study.

3) Ablation Study: The objective evaluation results of the ablation study are shown in Fig. 6. From the results for P-uSFGAN and P-uSFGAN - HN-SN, we can see that the harmonic-plus-noise source network is very effective in improving the V/UV error rate and the RMSE of log F0. Moreover, the residual spectra targeting loss (P-uSFGAN vs. P-uSFGAN - Reg-Loss) and the mel-spectral loss (P-uSFGAN vs. P-uSFGAN - Mel-Loss) effectively improve the V/UV error rate. P-uSFGAN - HiFi-D shows relatively good results on these objective metrics, but it is inferior to P-uSFGAN in sound quality, at least in speech reconstruction.

4) Subjective Evaluation: For the subjective evaluation, we conducted preference tests on sound quality using WORLD, C-uSFGAN, and P-uSFGAN for four F0 scaling factors {2^{-1.0}, 2^{-0.5}, 2^{0.5}, 2^{1.0}}. Twenty subjects participated, and each subject evaluated ten pairs per F0 scaling factor per method pair. The results are shown in Fig. 7. From the figures, both C-uSFGAN and P-uSFGAN outperform WORLD for all given F0 scale factors, and P-uSFGAN is superior to C-uSFGAN in three of the four items.

Fig. 7. Evaluation results of the preference test for F0 transformation with the baseline WORLD and the proposed C-uSFGAN and P-uSFGAN.

E. Visualization of Output Source Excitation Signals

To investigate the behavior of the cascade and parallel HN-uSFGAN models (C-uSFGAN and P-uSFGAN), we visualized their output periodic and aperiodic source excitation signals in Fig. 8 together with the spectrograms. These signals were obtained from the output latent representations $l$, $l^{(h)}$, and $l^{(n)}$ using the output layers of the filter network and normalization of the signal power.

Fig. 8. Plots of output source excitation signals and spectrograms of C-uSFGAN (upper row) and P-uSFGAN (lower row) for 500 [ms]. The left column indicates the final source excitation signal, the middle column the periodic source excitation signal, and the right column the aperiodic source excitation signal.

In Fig. 8, the output source excitation signals of C-uSFGAN seem to include fewer aperiodic components than those of P-uSFGAN. Moreover, whereas P-uSFGAN models the periodic and aperiodic components well with the corresponding networks, C-uSFGAN does not seem to be able to disentangle these components. This indicates that the input aperiodic components are ignored as they pass through some of the networks. C-uSFGAN and P-uSFGAN achieve almost the same performance in the speech reconstruction evaluation, as shown in Section IV-C. However, P-uSFGAN significantly outperforms C-uSFGAN in the evaluation of F0 transformation, as shown in Section IV-D4. From these results, we can conclude that the disentanglement of the periodic and aperiodic components has a good effect on the sound quality in F0 transformation scenarios. Thus, we choose P-uSFGAN as our best-proposed model in this work.
Fig. 9. Plots of output source excitation signals and spectrograms of uSFGAN, C-uSFGAN, and P-uSFGAN (from top to bottom row) with three F0 scaling factors: 0.5, 1.0, and 2.0 (left to right column), for 50 [ms]. All of them were clipped from the same segment of the same utterance. The original F0 values in this segment were around 140 [Hz].

Furthermore, the source excitation signals of uSFGAN, C-uSFGAN, and P-uSFGAN for several F0 scaling factors are plotted in Fig. 9. The figure shows that all the proposed models can generate reasonable source excitation signals in accordance with the input F0.

V. CONCLUSION

In this article, we proposed a novel source-filter modeling strategy that decomposes a single neural network using a regularization loss on its intermediate output. Thanks to the unified optimization of the source excitation and resonance filtering networks, our best-proposed method has been demonstrated to achieve equal or higher sound quality than a high-fidelity neural vocoder while attaining similar or better F0 controllability than a conventional parametric vocoder in the analysis-synthesis scenario. More experiments on the practical applications of the proposed neural vocoder and on controllability over other acoustic features, such as spectral envelopes and aperiodicity, are left for future research.
WORLD results are provided as references. We found that
lability compared with a conventional parametric vocoder in the
differences between the models become apparent when F0 is
analysis-synthesis scenario. More experiments on the practical
significantly high, so the study was conducted with F0 increased
applications of the proposed neural vocoder and controllability
by a factor of five. First, we can see that the MEL model degrades
over other acoustic features, such as spectral envelopes and
even with a small F0 change. Since the mel-spectrogram already
aperiodicity, are left to future research.
contains the F0 information, we speculate that it is difficult for
the model to manipulate F0 by merely changing the sinusoidal
inputs. The AuxF0 model shows significant degradation with F0
APPENDIX A increased by a factor of two or more in its V/UV error rate, which
INVESTIGATION OF INPUT ACOUSTIC FEATURES is more critical for sound quality than the RMSE of log F0 . We
To further investigate the impact of different conditional confirmed that the generated speech is hardly voiced, resulting
acoustic features, we evaluated several models with different in significant degradation. We assume that this tendency is due to
types of conditioning features with the same model architecture. the fact that the inductive bias for speech production provided by
In our proposed methods used in the experimental evaluations the source-filter modeling is not obtained owing to the leakage
(Section IV), we chose the set of {MGC, MAP} as the best of F0 information to the filter network. The total degradation in
combination for the auxiliary features whose total number of the MEL model can be considered to have the same cause. From
dimensions is 62. Here, we compare P-uSFGAN with the fol- these experiences, we concluded that disentanglement of the
lowing three models with different auxiliary features. input acoustic features and the restriction of F0 information leak-
r MEL: This model adopts a full-band 80-dimensional log age to the filter network is essential to gaining the benefit from
mel-spectrogram calculated in the setting described in source-filter modeling. The BAP model, which gives periodicity
Section IV-B instead of the vocoder features. information with fewer dimensions, shows minimal degradation
3728 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 31, 2023

Fig. 10. Objective evaluation results of F0 transformation for the ablation study on auxiliary features.

in both V/UV error rate and RMSE of log F0 . We assume that [15] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and
the degradation is because the neural network can ignore a fewer Y. Bengio, “Chunked autoregressive GAN for conditional waveform
synthesis,” in Proc. Int. Conf. Learn. Representations, 2022. [Online].
dimensional input feature (i.e., BAP) when the network can Available: https://openreview.net/forum?id=v3aeIsY_vVX
reconstruct the target waveform from the other input features [16] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis
in training. Moreover, this result suggests the importance of through linear prediction,” in Proc. Int. Conf. Acoust., Speech, Signal
Process., 2019, pp. 5891–5895.
information about periodicity information in neural vocoders [17] B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-
based on periodic and aperiodic component decomposition. based glottal waveform model for statistical parametric speech synthesis,”
in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 3394–3398.
REFERENCES

[1] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, no. 3/4, pp. 187-207, 1999.
[2] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877-1884, 2016.
[3] H. W. Dudley, "Remaking speech," J. Acoustical Soc. Amer., vol. 11, no. 2, pp. 169-177, 1939.
[4] M. R. Schroeder, "Vocoders: Analysis and synthesis of speech," Proc. IEEE, vol. 54, no. 5, pp. 720-734, May 1966.
[5] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 4, pp. 744-754, Aug. 1986.
[6] A. van den Oord et al., "WaveNet: A generative model for raw audio," in Proc. 9th ISCA Speech Synth. Workshop, 2016, Art. no. 125.
[7] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 3617-3621.
[8] W. Ping, K. Peng, K. Zhao, and Z. Song, "WaveFlow: A compact flow-based model for raw audio," in Proc. Int. Conf. Mach. Learn., 2020, pp. 7706-7716.
[9] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, "DiffWave: A versatile diffusion model for audio synthesis," in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=a-xFK8Ymz5J
[10] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, "WaveGrad: Estimating gradients for waveform generation," in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=NsMLjcFaO8O
[11] A. Mustafa, N. Pia, and G. Fuchs, "StyleMelGAN: An efficient high-fidelity adversarial vocoder with temporal adaptive normalization."
[15] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, "Chunked autoregressive GAN for conditional waveform synthesis," in Proc. Int. Conf. Learn. Representations, 2022. [Online]. Available: https://openreview.net/forum?id=v3aeIsY_vVX
[16] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5891-5895.
[17] B. Bollepalli, L. Juvela, and P. Alku, "Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 3394-3398.
[18] L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, "Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6915-6919.
[19] L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, "GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 694-698, doi: 10.21437/Interspeech.2019-2008.
[20] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5916-5920.
[21] X. Wang and J. Yamagishi, "Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis," in Proc. Speech Synth. Workshop, 2019, pp. 1-6.
[22] Z. Liu, K. Chen, and K. Yu, "Neural homomorphic vocoder," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 240-244.
[23] M. Morrison, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, "Neural pitch-shifting and time-stretching with controllable LPCNet," 2021, arXiv:2110.02360.
[24] I. Goodfellow et al., "Generative adversarial nets," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672-2680.
[25] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, and T. Toda, "Quasi-periodic parallel WaveGAN vocoder: A non-autoregressive pitch dependent dilated convolution model for parametric speech generation," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 3535-3539.
[26] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, and T. Toda, "Quasi-periodic parallel WaveGAN: A non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 792-806, 2021.
[27] R. Yoneyama, Y.-C. Wu, and T. Toda, "Unified source-filter GAN: Unified source-filter network based on factorization of quasi-periodic parallel WaveGAN," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2187-2191.
[28] R. Yoneyama, Y.-C. Wu, and T. Toda, "Unified source-filter GAN with harmonic-plus-noise source excitation generation," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2022, pp. 848-852.
Proc. Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 6034–6038. [29] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-
[12] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks dependent WaveNet vocoder,” in Proc. Annu. Conf. Int. Speech Commun.
for efficient and high fidelity speech synthesis,” in Proc. Int. Conf. Neural Assoc., 2017, pp. 1118–1122.
Inf. Process. Syst., 2020, pp. 17022–17033. [30] N. Kalchbrenner et al., “Efficient neural audio synthesis,” in Proc. Int.
[13] Y. Hono, S. Takaki, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, Conf. Mach. Learn., 2018, pp. 2415–2424.
“PeriodNet: A non-autoregressive waveform generation model with a [31] K. Arora, L. El Asri, H. Bahuleyan, and J. Cheung, “Why exposure
structure separating periodic and aperiodic components,” in Proc. Int. bias matters: An imitation learning perspective of error accumulation in
Conf. Acoust., Speech, Signal Process., 2021, pp. 6049–6053. language generation,” in Proc. Conf. Assoc. Comput. Linguistics, 2022,
[14] M.-J. Hwang, R. Yamamoto, E. Song, and J.-M. Kim, “High-fidelity pp. 700–710.
parallel WaveGAN with multi-band harmonic-plus-noise model,” in Proc. [32] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech
Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2227–2231. synthesis,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 3918–3926.
YONEYAMA et al.: HIGH-FIDELITY AND PITCH-CONTROLLABLE NEURAL VOCODER BASED ON UNIFIED SOURCE-FILTER NETWORKS 3729

[33] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end- [44] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform
to-end text-to-speech,” in Proc. Int. Conf. Learn. Representations, 2019. models for statistical parametric speech synthesis,” IEEE/ACM Trans.
[Online]. Available: https://openreview.net/forum?id=HklY120cYm Audio, Speech, Lang. Process., vol. 28, pp. 402–415, 2020.
[34] R. Yamamoto, E. Song, and J.-M. Kim, “Probability density distillation [45] F. Itakura and S. Saito, “Analysis synthesis telephony based on the maxi-
with generative adversarial networks for high-quality parallel waveform mum likelihood method,” in Proc. 6th Int. Congr. Acoust., 1968, pp. C17–
generation,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2019, C20. [Online]. Available: https://cir.nii.ac.jp/crid/1573950400351247616
pp. 699–703. [46] B. S. Atal and M. R. Schroeder, “Predictive coding of speech signals,” in
[35] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGan: A. fast Proc. Speech Commun. Process., 1967, pp. 360–361.
waveform generation model based on generative adversarial networks with [47] D. Wong, B.-H. Juang, and A. Gray, “An 800 bit/s vector quantization
multi-resolution spectrogram,” in Proc. IEEE Int. Conf. Acoust., Speech, LPC vocoder,” IEEE Trans. Acoust., Speech, Signal Process., vol. 30,
Signal Process., 2020, pp. 6199–6203. no. 5, pp. 770–780, Oct. 1982.
[36] K. Kumar et al., “Melgan: Generative adversarial networks for conditional [48] A. V. McCree and T. P. Barnwell, “A mixed excitation LPC vocoder model
waveform synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, for low bit rate speech coding,” IEEE Speech Audio Process., vol. 3, no. 4,
pp. 14910–1421. pp. 242–250, Jul. 1995.
[37] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, “Parallel waveform [49] M. Morise, “CheapTrick, a spectral envelope estimator for high-quality
synthesis based on generative adversarial networks with voicing-aware speech synthesis,” Speech Commun., vol. 67, pp. 1–7, 2015.
conditional discriminators,” in Proc. IEEE Int. Conf. Acoust., Speech, [50] Y. Stylianou, “Applying the harmonic plus noise model in concatenative
Signal Process., 2021, pp. 6039–6043. speech synthesis,” IEEE Speech Audio Process., vol. 9, no. 1, pp. 21–29,
[38] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band Jan. 2001.
MelGAN: Faster waveform generation for high-quality text-to-speech,” in [51] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK corpus:
Proc. Spoken Lang. Technol. Workshop, 2021, pp. 492–498. English multi-speaker corpus for CSTR voice cloning toolkit (version
[39] J. Yang, J. Lee, Y. Kim, H. Cho, and I. Kim, “VocGAN: A high-fidelity 0.92),” [sound]. Univ. Edinburgh. The Centre for Speech Technol. Res.
real-time vocoder with a hierarchically-nested adversarial network,” in (CSTR), 2019, doi: 10.7488/ds/2645.
Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 200–204. [52] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in
[40] M. Bińkowski et al., “High fidelity speech synthesis with adversarial Proc. Int. Conf. Learn. Representations, 2015. [Online]. Available: https:
networks,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. //arxiv.org/abs/1412.6980
Available: https://openreview.net/forum?id=r1gfQgSFDr [53] L. Liu et al., “On the variance of the adaptive learning rate and beyond,” in
[41] J. You et al., “GAN vocoder: Multi-resolution discriminator is all Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https:
you need,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, //openreview.net/forum?id=rkgz2aEKDr
pp. 2177–2181. [54] M. Morise, “Harvest: A high-performance fundamental frequency esti-
[42] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi- mator from speech signals,” in Proc. Annu. Conf. Int. Speech Commun.
periodic WaveNet vocoder: A pitch dependent dilated convolution model Assoc., 2017, pp. 2321–2325.
for parametric speech generation,” in Proc. Annu. Conf. Int. Speech Com- [55] M. Morise, “D4C, a band-aperiodicity estimator for high-quality speech
mun. Assoc., 2019, pp. 196–200. synthesis,” Speech Commun., vol. 84, pp. 57–65, 2016.
[43] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi- [56] B. McFee et al., “librosa/librosa: 0.9.2,” 2022, doi: 10.5281/zen-
periodic WaveNet: An autoregressive raw waveform generative model with odo.6759664.
pitch-dependent dilated convolution neural network,” IEEE/ACM Trans.
Audio, Speech, Lang. Process., vol. 29, pp. 1134–1148, 2021.

You might also like
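As a point of reference for the two objective measures discussed above, the following minimal sketch computes the V/UV decision error rate and the RMSE of log F0 from frame-aligned F0 contours. It assumes that unvoiced frames are marked with 0 Hz; the function names and the toy contours are ours for illustration and are not taken from the paper or any specific toolkit.

```python
import numpy as np

def vuv_error_rate(f0_ref: np.ndarray, f0_gen: np.ndarray) -> float:
    """Fraction of frames whose voiced/unvoiced decision disagrees.
    Assumes frame-aligned contours with 0 Hz marking unvoiced frames."""
    vuv_ref = f0_ref > 0
    vuv_gen = f0_gen > 0
    return float(np.mean(vuv_ref != vuv_gen))

def log_f0_rmse(f0_ref: np.ndarray, f0_gen: np.ndarray) -> float:
    """RMSE of log F0 computed over frames voiced in both contours."""
    both_voiced = (f0_ref > 0) & (f0_gen > 0)
    diff = np.log(f0_ref[both_voiced]) - np.log(f0_gen[both_voiced])
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy usage with made-up contours (Hz; 0 = unvoiced).
ref = np.array([0.0, 120.0, 125.0, 130.0, 0.0])
gen = np.array([0.0, 118.0, 0.0, 132.0, 110.0])
print(vuv_error_rate(ref, gen), log_f0_rmse(ref, gen))
```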