High-Fidelity and Pitch-Controllable Neural Vocoder Based On Unified Source-Filter Networks
Abstract—We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features, which represents the source-filter architecture in a generator network. Unlike the previous neural-based source-filter models in which parametric signal processing modules are combined with neural networks, our approach enables unified optimization of both the source excitation generation and resonance filtering parts to achieve higher sound quality. In the uSFGAN framework, several specific regularization losses are proposed to enable the source excitation generation part to output reasonable source excitation signals. Both objective and subjective experiments are conducted, and the results demonstrate that the proposed uSFGAN achieves comparable sound quality to HiFi-GAN in the speech reconstruction task and outperforms WORLD in the F0 transformation task. Moreover, we argue that the F0-driven mechanism and the inductive bias obtained by source-filter modeling improve the robustness against unseen F0 in training, as shown by the results of experimental evaluations. Audio samples are available at our demo site at https://chomeyama.github.io/PitchControllableNeuralVocoder-Demo/.

Index Terms—Speech synthesis, neural vocoder, source-filter model, unified source-filter networks.

Manuscript received 3 January 2023; revised 19 July 2023; accepted 25 August 2023. Date of publication 11 September 2023; date of current version 20 October 2023. This work was supported in part by JST CREST under Grant JPMJCR19A3 and in part by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) under Grant JP21H05054. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hung-yi Lee. (Corresponding author: Reo Yoneyama.)
Reo Yoneyama and Yi-Chiao Wu are with the Graduate School of Informatics, Nagoya University, Nagoya 464-8601, Japan (e-mail: [email protected]; [email protected]).
Tomoki Toda is with the Information Technology Center, Nagoya University, Nagoya 464-8601, Japan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TASLP.2023.3313410

I. INTRODUCTION

Speech synthesis is a technology of generating speech waveforms on the basis of text or acoustic features. In particular, models conditioned on acoustic features are called vocoders. Vocoders have been widely adopted in many voice applications, such as text-to-speech (TTS), singing voice synthesis (SVS), and voice conversion (VC). In these applications, the quality of the final generated waveform strongly depends on the performance of the vocoder. Specifically, vocoders are required to generate speech of high sound quality in addition to providing functions for flexibly and independently controlling the generated speech in accordance with given acoustic features (e.g., F0, timbre, and periodicity). Furthermore, vocoders should be robust to unseen data, such as unseen speakers and F0, and continuously generate high-fidelity speech. For example, the vocoder of a multi-speaker TTS for few-shot voice cloning should be able to tackle a wide range of F0 because of the frequent adaptations to arbitrary speakers, since it is impractical to collect a corpus covering all unseen speakers, and even the utterances of the seen speakers may include out-of-range F0 values. Moreover, SVS often requires a significant deviation of F0 to generate a singing voice that transcends physical limitations. Therefore, the vocoder of SVS should have high robustness to unseen F0 over an extensive range.

However, most vocoders do not meet the mentioned requirements. Specifically, conventional vocoders [1], [2] based on source-filter models [3], [4], [5] can flexibly control speech characteristics, but the quality of the generated speech is low because of their over-simplified speech production process. Recent high-fidelity neural vocoders [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] lack robustness to unseen data because of their purely data-driven training manner. For example, the state-of-the-art neural vocoder, HiFi-GAN [12], fails to generate high-fidelity speech when the input features include F0 values deviating from the F0 range of the training data. Furthermore, compared with the conventional source-filter models, those neural vocoders have poorer interpretability and less flexible controllability of speech characteristics. One reasonable way for neural vocoders to satisfy the above requirements is to introduce source-filter modeling to obtain sufficient flexibility and inductive bias for human speech production. Several approaches [16], [17], [18], [19], [20], [21], [22], [23] have been investigated to combine the source-filter architecture with deep neural networks using signal-processing-based modules. However, there are several problems with incorporating the parametric (signal-processing-based) modules and strong constraints into neural vocoders. For instance, the partial utilization of signal processing makes the optimization of the entire speech generation process difficult and degrades sound quality and F0 controllability, since the neural networks must compensate for the incomplete output of the parametric modules.

To achieve the flexibility and interpretability of source-filter modeling while maintaining the high sound quality of neural vocoders, we propose a novel framework of source-filter modeling on a single neural network, significantly reducing the effects of the ad hoc designs. Unlike previous approaches that model either the source excitation generation part or the resonance filtering part on the basis of signal processing as described in Section II, our approach enables the simultaneous optimization
Fig. 3. Details of generator architectures. (a) Primary uSFGAN generator. (b) Harmonic-plus-noise uSFGAN generator. (c) QP-PWG macroblock. (d) Periodicity
estimator. The red lines and blocks are used only for cascade harmonic-plus-noise source excitation generation. The output layers consist of two pairs of ReLU
activation and one-by-one (1 × 1) convolution layers.
signal, the number of elements in the magnitude, and Gaussian noise signal, respectively. Note that when this loss reaches zero, the linear amplitude values Ê are one over all frequency and time frames. In the initial uSFGAN paper [27], the L2 norm was employed for this loss function, but we use the L1 norm because of the improvement in several objective evaluation indices, such as F0 reconstruction accuracy and voiced or unvoiced decision error rate.

2) Residual Spectra Targeting Regularization Loss: The second loss is designed to utilize the residual spectra calculated from the target speech and the spectral envelopes extracted in the same way as above. However, to minimize the effects of the estimation errors of F0 and phases between the generated and ground-truth speeches, we apply the mel-filter-bank to the amplitude spectrogram. The residual spectra regularization loss is formulated as

    L_reg(G) = E_{x,z}[ (1/N) || log ψ(S_x) − log ψ(Ŝ_z) ||_1 ],    (2)

where x, ψ, and N denote the ground-truth speech, the function that transforms a spectral magnitude into the corresponding mel-spectrogram, and the number of elements in the mel-spectrogram, respectively; S_x denotes the magnitude of the residual spectra that have the same frame-wise average power as that of the ground-truth speech, and Ŝ_z denotes the spectral magnitude of the output source excitation signal. Unlike the spectral envelope flattening regularization loss, this loss leaves the power estimation to the source network, similarly to the actual human speech production process, where the power is controlled during the sound generation.
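To make the formulation concrete, the following is a minimal NumPy sketch of Eq. (2), assuming the residual magnitude spectrogram S_x and the generated excitation magnitude Ŝ_z have already been computed; the librosa mel filterbank stands in for ψ, and the function and argument names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np
import librosa

def residual_spectra_reg_loss(residual_mag, excitation_mag, sr=24000, n_fft=1024, n_mels=80, eps=1e-7):
    """L1 distance between log mel-compressed residual and excitation magnitudes (cf. Eq. (2))."""
    # Mel filterbank plays the role of psi(.); shape (n_mels, n_fft // 2 + 1).
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Inputs are linear magnitude spectrograms of shape (n_fft // 2 + 1, n_frames).
    log_mel_residual = np.log(np.maximum(mel_fb @ residual_mag, eps))
    log_mel_excitation = np.log(np.maximum(mel_fb @ excitation_mag, eps))
    # Mean absolute error over all mel bins and frames (the 1/N factor of Eq. (2)).
    return float(np.mean(np.abs(log_mel_residual - log_mel_excitation)))
```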
B. F0-Driven Source Excitation Generation

Source excitation signals have high periodicity owing to their generation process, which is based on vocal fold vibrations. Inspired by NSF [20], [21], [44], we input to the generator a sinusoidal-based signal generated by the same formula as that of NSF. The signal retains the input F0 as the fundamental frequency but with an additional random noise signal. Moreover, we apply PDCNNs, which effectively enlarge the receptive fields in accordance with the input F0 by dynamically changing the DCNN dilation factors. We found that using both the sinusoidal input and the PDCNNs significantly improves F0 controllability. However, the PDCNNs also tend to introduce undesired periodic components into the unvoiced segments. This tendency prevents the proper generation of other aperiodic source components, such as frication, aspiration, and transient sources, which adversely affects sound quality and naturalness.
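The sinusoidal input can be sketched as follows in NumPy, under the usual NSF-style formulation (a sine whose phase is the cumulative sum of the instantaneous F0, plus a small Gaussian noise component); the frame-to-sample upsampling by repetition and the amplitude and noise scales are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def sine_excitation(f0_frames, hop_size=120, sr=24000, amp=0.1, noise_std=0.003):
    """Sinusoidal-based excitation that retains the input F0, with additive noise (NSF-style sketch)."""
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_size)  # frame-level F0 -> sample level
    voiced = f0 > 0.0
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)   # instantaneous phase from the cumulative normalized F0
    sine = amp * np.sin(phase)
    noise = noise_std * np.random.randn(f0.shape[0])
    # Voiced samples: sine plus small noise; unvoiced samples: noise only, scaled up.
    return np.where(voiced, sine + noise, (amp / 3.0) * np.random.randn(f0.shape[0]))
```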
To improve the source excitation signal modeling, especially for the unvoiced parts, we introduce a harmonic-plus-noise excitation generation mechanism inspired by the current successful works [13], [14], [21], [22] based on [50]. To explicitly model the periodic and aperiodic components, previous works [13], [14], [21], [22], [50] prepared two networks for generating each component and devised the architecture and input features for each. We adopt two harmonic-plus-noise modeling schemes, the cascade and parallel model structures, referring to PeriodNet [13]. Hono et al. represent the dependence of the periodic and aperiodic speech signals with the model structure. The cascade model structure combines the periodic and aperiodic speech generators in series so that the latter generator can predict the aperiodic component while taking into account the dependence on the periodic component. On the other hand, the parallel model structure assumes their independence. To ascertain whether the cascade or the parallel structure is superior in modeling the source excitation signal, we propose the two approaches following PeriodNet. Moreover, the periodicity estimation is crucial for the naturalness of generated speech. Referring to NHV [22] and HN parallel WaveGAN (HN-PWG) [14], we prepare a network to estimate periodicity-related weights from acoustic features and mix periodic and aperiodic source components on the basis of the weights.
The HN source excitation generation module consists of three networks: the harmonic network, noise network, and periodicity estimator, as shown in Fig. 3(b). Note that the red lines and red blocks are used only in the cascade approach. The harmonic network outputs latent features l^(h) that correspond to the periodic components of the source excitation signal from a sinusoidal signal and auxiliary features. On the other hand, the noise network outputs latent features l^(n) that correspond to the aperiodic components of the source excitation signal from a random noise signal and auxiliary features. In the cascade approach, the noise network also receives the output of the harmonic network. We use the QP-PWG macroblock shown in Fig. 3(c) in the harmonic network, while the PWG macroblock is used in the noise network. We adopt the harmonicity estimator of HN-PWG as the periodicity estimator shown in Fig. 3(d). Conditioned on the auxiliary features, the periodicity estimator outputs the channel-wise and sample-wise weights a within [0, 1] corresponding to the speech periodicity. The two generated representations are summed element-wise using the estimated weights. The source excitation latent feature l is formulated as

    l_{t,i} = a_{t,i} · l^{(h)}_{t,i} + (1 − a_{t,i}) · l^{(n)}_{t,i},    (3)

where the subscripts indicate the ith channel of the tth sample of each latent feature or weight. Since periodicity is estimated from auxiliary features, the input sinusoidal signal is generated using the continuous F0 values obtained by interpolating the discontinuous F0 values.
The cascade approach comprises three steps, as shown in Fig. 3(b). First, the harmonic network outputs the periodic source excitation representation, which is modulated using the channel-wise weights predicted by the periodicity estimator. Second, a random noise signal is mapped to a latent representation and mixed with the periodic source representation using a 1 × 1 convolution layer and the noise network. Finally, the output latent feature of the noise network is modulated using the weights and summed with the modulated periodic source excitation representation to output the final source excitation representation. On the other hand, in the parallel approach, the aperiodic source representation is generated without the output periodic source representation of the harmonic network.

C. Adversarial Training

The training procedure of uSFGAN follows common GAN-based training plus auxiliary regularization losses. The discriminator D is trained to identify natural samples as real and generated samples as fake by minimizing the following optimization criterion:

    L_D(G, D) = E_x[(1 − D(x))^2] + E_z[D(G(z))^2],    (4)

where x denotes the natural samples drawn from the data distribution of the natural samples, and z is random noise drawn from the Gaussian distribution N(0, I). On the other hand, the generator G is trained to deceive the discriminator by minimizing the following adversarial loss:

    L_adv(G, D) = E_z[(1 − D(G(z)))^2].    (5)
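The least-squares objectives of Eqs. (4) and (5) reduce to a few lines; this PyTorch sketch assumes d_real and d_fake are the discriminator outputs for natural and generated waveforms, respectively.

```python
import torch

def discriminator_loss(d_real, d_fake):
    """Least-squares discriminator criterion of Eq. (4): natural -> 1, generated -> 0."""
    return torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)

def adversarial_loss(d_fake):
    """Least-squares adversarial loss of Eq. (5): push generated samples toward the real decision."""
    return torch.mean((1.0 - d_fake) ** 2)
```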
The final loss function of the generator can be written as the sum of the regularization loss L_reg, the auxiliary spectral loss L_spc, and the adversarial loss L_adv:

    L_G(G, D) = L_reg(G) + λ_spc L_spc(G) + λ_adv L_adv(G, D),    (6)

where λ_spc and λ_adv are loss-balancing hyperparameters.

GAN-based adversarial training is effective for neural vocoders to implicitly learn perceptual aspects, such as phases, required for generating high-quality samples. GAN-based vocoders usually adopt auxiliary losses in the spectral domain and the feature matching loss to avoid mode collapse and improve the training stability. PWG [35] adopts the multi-resolution short-time Fourier transform (STFT) loss, which can partly capture information about the distance in phases in addition to the spectral structure between the natural speech x and the generated speech x̂ = G(z). It is formulated as the sum of the spectral convergence loss (L_sc) and the log STFT magnitude loss (L_mag) as follows:

    L_spc(G) = (1/M) Σ_{m=1}^{M} L_s^{(m)}(G),    (7)
    L_s(G) = E_{x,z}[L_sc(x, G(z)) + L_mag(x, G(z))],    (8)
    L_sc(x, x̂) = || |STFT(x)| − |STFT(x̂)| ||_F / || |STFT(x)| ||_F,    (9)
    L_mag(x, x̂) = (1/N) || log |STFT(x)| − log |STFT(x̂)| ||_1,    (10)

where M and L_s^{(m)} denote the number of sets of analysis parameters for the STFT and the spectral loss defined as (8) calculated with the mth set, respectively. Moreover, ||·||_F, |STFT(·)|, and N denote the Frobenius norm, the STFT magnitudes, and the number of elements in the magnitude, respectively.
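A compact PyTorch sketch of Eqs. (7)-(10) follows, assuming x and x_hat are batched waveforms; the three analysis settings are illustrative and not necessarily the configuration used in the paper.

```python
import torch

def stft_mag(x, fft_size, hop_size, win_size, eps=1e-7):
    """STFT magnitude with a Hann window, clamped away from zero for the log term."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size, window=window, return_complex=True)
    return torch.clamp(spec.abs(), min=eps)

def multi_resolution_stft_loss(x, x_hat,
                               resolutions=((1024, 120, 600), (2048, 240, 1200), (512, 50, 240))):
    """Average of spectral convergence (Eq. (9)) and log STFT magnitude (Eq. (10)) losses over M settings."""
    total = 0.0
    for fft_size, hop_size, win_size in resolutions:
        mag = stft_mag(x, fft_size, hop_size, win_size)
        mag_hat = stft_mag(x_hat, fft_size, hop_size, win_size)
        sc = torch.norm(mag - mag_hat) / torch.norm(mag)                       # Frobenius-norm ratio
        log_mag = torch.mean(torch.abs(torch.log(mag) - torch.log(mag_hat)))   # mean absolute log difference
        total = total + sc + log_mag
    return total / len(resolutions)
```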
On the other hand, HiFi-GAN [12] adopts the L1 loss in the mel-spectrogram domain because the mel scale is more closely related to human perception. It is formulated as follows:

    L_spc(G) = E_{x,z}[ (1/N) || φ(x) − φ(G(z)) ||_1 ],    (11)

where φ and N denote the function converting a speech signal into the corresponding mel-spectrogram and the number of elements in the mel-spectrogram, respectively.

We aim to develop a vocoder capable of synthesizing speech that faithfully reflects the input acoustic features. Since uSFGAN is conditioned on vocoder features, such as F0, spectral envelopes, and aperiodicity, estimation error is inevitable. Therefore, we argue that a looser constraint in the auxiliary spectral loss eases the mismatch between the input and real features, especially F0 and phases. Although the multi-resolution STFT loss and the feature matching loss facilitate fine matches between the ground-truth and generated speeches, it is difficult for uSFGAN to satisfy them fully. In fact, for our best-proposed model, the mel-spectral L1 loss is used with the exact formulation as that of HiFi-GAN. The application of the mel-filter-bank eases the effect of F0 and phase mismatch, making the optimization more straightforward and reasonable.
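The mel-spectral L1 loss of Eq. (11) can be sketched with torchaudio as below; whether φ includes a log compression is an implementation choice (applied here, as is common in HiFi-GAN-style recipes), and the analysis parameters simply mirror those listed in Section IV-B.

```python
import torch
import torchaudio

# phi(.) of Eq. (11): waveform -> mel-spectrogram (80 mels, 1024-point FFT/window, 5 ms hop at 24 kHz).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, win_length=1024, hop_length=120, n_mels=80)

def mel_spectral_l1_loss(x, x_hat, eps=1e-7):
    """Mean absolute error between (log-)mel-spectrograms of natural and generated speech (cf. Eq. (11))."""
    mel_x = torch.log(torch.clamp(mel_transform(x), min=eps))
    mel_x_hat = torch.log(torch.clamp(mel_transform(x_hat), min=eps))
    return torch.mean(torch.abs(mel_x - mel_x_hat))
```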
Moreover, with adversarial training with sufficiently strong, sophisticated discriminators, the generator can learn reasonable phases from the adversarial loss. Furthermore, although HiFi-GAN and MelGAN [36] adopt the feature matching loss to obtain the deep classification information provided by the discriminator, uSFGAN does not adopt the feature matching loss because of the mismatching problem of phase and F0 between the generated and ground-truth speeches.

IV. EXPERIMENTAL EVALUATIONS

A. Data Preparation

We used the VCTK corpus [51], which contains 109 English speakers. We used only the mic2 samples, and p315 was unavailable owing to a technical problem. The sampling rate was set to 24 kHz using the sox downsampling function (http://sox.sourceforge.net/). No preprocessing, such as normalization or low-cut filtering, was applied to the audio. We divided the dataset following a specific rule to evaluate robustness against unseen F0 values. The minimum and maximum F0 values of the VCTK corpus were respectively found to be about 50 Hz and 400 Hz through careful investigation of each speaker. We limited the F0 range of the training data from 70 Hz to 340 Hz and excluded two speakers (p271 and p300) from the training data to evaluate the robustness to unseen speakers. Thanks to this limitation, we can evaluate the methods under various conditions of seen or unseen speakers and F0 ranges.
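As a rough illustration of such a splitting rule, the sketch below checks whether an utterance's voiced F0 stays inside the training range using pyworld's Harvest; the per-utterance criterion, search range, and file handling are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
import pyworld
import soundfile as sf

EXCLUDED_SPEAKERS = {"p271", "p300"}   # held out entirely as unseen speakers
F0_FLOOR, F0_CEIL = 70.0, 340.0        # training F0 range [Hz]

def in_training_f0_range(wav_path):
    """Return True if all voiced F0 values of the utterance fall inside the training range."""
    x, fs = sf.read(wav_path)
    f0, _ = pyworld.harvest(x.astype(np.float64), fs, f0_floor=50.0, f0_ceil=400.0, frame_period=5.0)
    voiced = f0[f0 > 0.0]
    return voiced.size == 0 or (voiced.min() >= F0_FLOOR and voiced.max() <= F0_CEIL)
```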
B. Model Details

1) Baseline Models: As the baselines, we used the following four models.
• HiFi-GAN: A high-fidelity GAN-based neural vocoder with four multi-period discriminators and four multi-scale discriminators. HiFi-GAN has no clue for controlling F0, so we used it as the baseline for the evaluation of speech reconstruction. To train the HiFi-GAN model, we adopted the HiFi-GAN V1 [12] configuration and used an unofficial open-source implementation (https://github.com/kan-bayashi/ParallelWaveGAN) for training the model.
• WORLD: A conventional source-filter model. This model achieves flexible controllability of acoustic features with reasonable sound quality. We used a Python wrapper (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder) of the original WORLD implementation (https://github.com/mmorise/World).
• HN-NSF: A harmonic-plus-noise neural source-filter model with time-variant and trainable sinc filters that predict their cut-off frequency from the input acoustic features. We reimplemented the model on the basis of the official open-source code (https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts) without changing the model configuration except for increasing the training iterations.
• QP-PWG: An F0-controllable neural vocoder based on GAN without the source-filter separation. It controls F0 via the PDCNNs and the input auxiliary F0. We increased the number of residual blocks from the original configuration: PDCNNs 10 → 30 and DCNNs 10 → 30. The capacities of the QP-PWG model and the basic uSFGAN model detailed below are the same regarding the number of residual blocks.

We conditioned the HiFi-GAN model using the mel-spectrogram, as in the original model, with 80 mel-filter-banks, 1024 fast Fourier transform (FFT) points, a 1024-point Hanning window, and the hop size set to 120 (5 ms). We trained it for 2500 k iterations as in the original model, with the batch size set to 16 and the batch length set to 18000 (0.75 s), using the original setting of the Adam [52] optimizer. The loss weights followed the original setting: the weights of the adversarial loss, the feature matching loss, and the mel-spectral loss were set to 1.0, 2.0, and 45.0, respectively. HN-NSF was conditioned using discrete F0, the mel-generalized cepstrum (MGC), and mel-cepstral aperiodicity (MAP). We trained it for 600 k steps with the batch size set to 1, as in the original model, and the batch length set to 24000 (1.0 s), using the original setting of the Adam optimizer. This model was trained using only the L2 loss on the log power spectrogram. QP-PWG was conditioned using almost the same features as those for HN-NSF, but continuous F0 and a binary sequence representing voiced or unvoiced (V/UV) segments were used instead of the discrete F0. We trained it for 600 k steps with the batch size set to 5 and the batch length set to 18000 (0.75 s), using the original setting of the RAdam [53] optimizer. The loss weights followed the original setting: the weights of the adversarial loss and the multi-resolution STFT loss were set to 4.0 and 1.0, respectively.

We extracted F0 using the Harvest algorithm [54] with a carefully set F0 search range for each speaker. Then we extracted the log power spectral envelope using the CheapTrick algorithm [49] and coded it into the corresponding 41-dimensional MGC with the all-pass constant set to 0.466. Also, we extracted aperiodicity using the D4C algorithm [55] and coded it into the corresponding 21-dimensional MAP. These features were calculated with a shift period set to 5 ms. The mel-spectrogram was calculated using the librosa [56] function with the FFT size and window length set to 1024 and the hop length set to 120 (5 ms), with a Hanning window.
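The WORLD-based analysis described above can be approximated with pyworld and pysptk as in the sketch below; it is only a rough equivalent, and in particular pyworld.code_aperiodicity (band aperiodicity) is used as a stand-in for the paper's 21-dimensional mel-cepstral aperiodicity (MAP), with the dimensions and defaults being assumptions.

```python
import numpy as np
import pyworld
import pysptk

def extract_vocoder_features(x, fs=24000, shiftms=5.0, mgc_dim=41, alpha=0.466):
    """WORLD-style analysis roughly matching Section IV-B: F0 (Harvest), MGC (CheapTrick), coded aperiodicity (D4C)."""
    x = np.asarray(x, dtype=np.float64)
    f0, t = pyworld.harvest(x, fs, frame_period=shiftms)    # F0 contour (0 for unvoiced frames)
    sp = pyworld.cheaptrick(x, f0, t, fs)                   # power spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)                          # aperiodicity
    mgc = pysptk.sp2mc(sp, order=mgc_dim - 1, alpha=alpha)  # 41-dimensional mel-generalized cepstrum
    cap = pyworld.code_aperiodicity(ap, fs)                 # coded (band) aperiodicity; stand-in for MAP
    return f0, mgc, cap
```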
2) Proposed Models: We used the following three uSFGAN-based models in the comparison experiments.
• uSFGAN: This model was based on our method proposed in [27]. The source network comprises 30 PDCNN blocks with six cycles, the filter network comprises 30 DCNN blocks with three cycles, and the PWG discriminator and PWG-based training procedure were used. The modifications are that the regularization loss became the L1 norm, and the input signal became a one-channel sinusoidal-based signal generated by the formula of NSF instead of a two-channel signal (a random noise signal and a sinusoidal-based signal without randomness). The updated loss leads to better performance in the objective metrics, and the input signal change was for simplification of the comparison.
• C-uSFGAN (Cascade HN-uSFGAN): The first proposed model with the cascade harmonic-plus-noise excitation
Fig. 4. Evaluation results of the MOS test. The average scores of all ranges are natural: 4.02, HiFi-GAN: 3.88, WORLD: 3.70, QP-PWG: 3.82, uSFGAN: 3.86,
C-uSFGAN: 3.97, P-uSFGAN: 3.99, and P-uSFGAN - HiFi-D: 3.92.
TABLE II: Results of objective evaluations of speech reconstruction. The best scores are in bold.

speech production process leading to robustness to unseen F0. Note that C-uSFGAN and P-uSFGAN show the best results in V/UV error rate, greatly outperforming WORLD, indicating the effectiveness of the harmonic-plus-noise architecture and of updating the loss functions. In conclusion, the proposed methods attain acoustic controllability similar to or better than that of conventional parametric vocoders.

2) Subjective Evaluation: For the subjective evaluation, we conducted an opinion test on sound quality using seven models and natural speech with ten subjects. Each subject evaluated 20 utterances per method. We recruited English-speaking evaluators through Amazon Mechanical Turk and instructed them to listen to the audio in a quiet room with headphones or earphones. Also, we filtered out scores from evaluators with unreasonable answers, such as where almost all scores were the same or the score of natural speech was lower than that of any system.

The results are shown in Fig. 4, where the results are divided on the basis of F0 range. HN-NSF was clearly inferior to the other models in sound quality, so we excluded it from the subjective evaluation experiment because of the possibility of an undesired bias that the other samples would be evaluated highly. HN-NSF is a very basic baseline, and we speculate that the degradation was due to the simplicity of the model architecture and its low capacity to adapt to the large number of speakers in the VCTK corpus. However, we did not conduct any hyperparameter tuning on HN-NSF and note that there is a possibility that its performance can be improved by increasing the number of layers or introducing adversarial training.

We can see that all models except for WORLD achieve scores comparable to that of natural speech. Interestingly, QP-PWG, which uses the discriminator of PWG, achieves the best score, outperforming HiFi-GAN. The reason for the improvement of QP-PWG over the original model would be the increase in the number of the generator layers (20 → 60 residual blocks). However, for the unseen F0 ranges, the proposed C-uSFGAN and P-uSFGAN achieve the best results, whereas QP-PWG is considerably degraded. Moreover, the differences between HiFi-GAN and natural speech become more prominent than in the case within the training F0 range. On the other hand, there are no significant differences between C-uSFGAN and P-uSFGAN and natural speech in all cases. These results indicate that HiFi-GAN is data-driven and QP-PWG is highly data-driven. However, our proposed C-uSFGAN and P-uSFGAN complement the shortcomings of a data-driven approach.

D. Evaluation of F0 Transformation

Next, we evaluated the performance of F0 transformation with factors within [2^{-1.0}, 2^{1.0}]. The magnifications were taken equally on the logarithmic axis with the base at 2. The ground-truth F0 was determined by multiplying the F0 extracted from natural speech by the scale factors, and it was also adopted as the input F0 of the models.
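A small sketch of the scaling scheme: factors are spaced uniformly on the log2 axis within [2^-1.0, 2^1.0], and only the voiced frames of the extracted F0 contour are scaled; the number of steps chosen here is illustrative.

```python
import numpy as np

# Scale factors taken equally on the logarithmic (base-2) axis within [2^-1.0, 2^1.0].
scale_factors = 2.0 ** np.linspace(-1.0, 1.0, 9)

def transform_f0(f0, factor):
    """Scale only voiced F0 values; unvoiced frames (F0 = 0) remain zero."""
    f0 = np.asarray(f0, dtype=np.float64)
    return np.where(f0 > 0.0, f0 * factor, 0.0)
```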
Fig. 5. Objective evaluation results of F0 transformation for the comparison with baseline models. The MCD values of HN-NSF are excluded because it deviates
from the range of the y-axis where the results of the other models are gathered.
1) Objective Evaluation Settings: We extracted F0 using the WORLD analyzer by the following procedure. When F0 was multiplied by a scale factor greater than one, only the upper bound of the F0 search range was multiplied and transformed; otherwise, only the lower bound of the range was multiplied and transformed. MCD was calculated using the CheapTrick [49] algorithm provided by the WORLD analyzer, and the extracted F0 was used for the calculation. However, we downsampled audio signals to 16000 Hz before estimating spectral envelopes because the CheapTrick algorithm sometimes fails in the estimation when the F0-adaptive window size is larger than the FFT size. We made the available fixed FFT size sufficiently large by reducing the size of the F0-adaptive window through downsampling and calculated MCD more accurately. The evaluations were conducted using the evaluation data whose F0 range was within the training F0 range (i.e., 70-340 Hz).
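For reference, a common mel-cepstral distortion formulation between time-aligned MGC sequences (0th coefficient excluded) looks as follows; this is a generic definition rather than the paper's exact evaluation script.

```python
import numpy as np

def mel_cepstral_distortion(mgc_ref, mgc_gen):
    """Frame-averaged MCD in dB between aligned MGC sequences of shape (frames, dims), 0th coefficient excluded."""
    diff = mgc_ref[:, 1:] - mgc_gen[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```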
2) Objective Evaluation: The objective evaluation results of comparison with baseline models are shown in Fig. 5. The result of log F0 RMSE shows that although other models suffer from degradation in extreme cases (F0 × {2^{-1.0}, 2^{1.0}}), the proposed C-uSFGAN and P-uSFGAN models achieve stable values close to that of WORLD. However, the two models achieve much lower V/UV error rates than all baseline models, which we found to have more impact on sound quality in our preliminary experiments. Moreover, we can see that all proposed models achieve better MCDs than WORLD. Again, the V/UV error rate and the RMSE of log F0 in QP-PWG degrade as the scale factor increases or decreases, respectively. In contrast, uSFGAN does not significantly degrade for any factor, indicating the benefit of the source-filter decomposition.

3) Ablation Study: The objective evaluation results of the ablation study are shown in Fig. 6. From the results for P-uSFGAN and P-uSFGAN - HN-SN, we can see that the harmonic-plus-noise source network is very effective in improving the V/UV error rate and RMSE of log F0. Moreover, the residual spectra targeting loss (P-uSFGAN vs P-uSFGAN - Reg-Loss) and mel-spectral loss (P-uSFGAN vs P-uSFGAN - Mel-Loss) effectively improve the V/UV error rate. P-uSFGAN - HiFi-D shows relatively good results in these objective metrics, but it is inferior to P-uSFGAN in sound quality, at least in speech reconstruction.

4) Subjective Evaluation: For the subjective evaluation, we conducted preference tests on sound quality using WORLD, C-uSFGAN, and P-uSFGAN for four F0 scaling factors {2^{-1.0}, 2^{-0.5}, 2^{0.5}, 2^{1.0}}. Twenty subjects participated, and each subject evaluated ten pairs per F0 scaling factor per method pair. The results are shown in Fig. 7. From the figures, both C-uSFGAN and P-uSFGAN outperform WORLD for all given F0 scale factors, and P-uSFGAN is superior to C-uSFGAN in 3/4 of the items.

E. Visualization of Output Source Excitation Signals

To investigate the behavior of cascade and parallel HN-uSFGAN models (C-uSFGAN and P-uSFGAN), we visualized their output periodic and aperiodic source excitation signals in Fig. 8 with the spectrograms.
Fig. 7. Evaluation results of the preference test for F0 transformation with the baseline WORLD and proposed C-uSFGAN and P-uSFGAN.
Fig. 8. Plots of output source excitation signals and spectrograms of C-uSFGAN (upper row) and P-uSFGAN (lower row) for 500 [ms]. The left column indicates
the final source excitation signal, the middle column indicates the periodic source excitation signal, and the right column indicates the aperiodic source excitation
signal.
These signals were obtained from the output latent representations of l, l^(h), and l^(n) using the output layers of the filter network and normalization of the signal power.

In Fig. 8, the output source excitation signals of C-uSFGAN seem to include fewer aperiodic components than in P-uSFGAN. Moreover, whereas P-uSFGAN well models the periodic and aperiodic components by the corresponding networks, C-uSFGAN does not seem to be able to disentangle these components. This indicates that the input aperiodic components are ignored as they pass through some networks. C-uSFGAN and P-uSFGAN achieve almost the same performance in speech reconstruction evaluation, as shown in Section IV-C. However, P-uSFGAN significantly outperforms C-uSFGAN in the evaluation of F0 transformation, as shown in Section IV-D4. From the results, we can conclude that the disentanglement of periodic and aperiodic components has a good effect on the sound quality in F0 transformation scenarios. Thus, we choose P-uSFGAN as our best-proposed model in this work.
Fig. 9. Plots of output source excitation signals and spectrograms of uSFGAN, C-uSFGAN, and P-uSFGAN (from top to bottom row) with three F0 scaling
factors: 0.5, 1.0, and 2.0 (left to right column), for 50 [ms]. All of them were clipped from the same segment of the same utterance. The original F0 values in this
segment were around 140 [Hz].
Furthermore, source excitation signals of uSFGAN, C-uSFGAN, and P-uSFGAN for several F0 scaling factors are plotted in Fig. 9. The figure shows that all proposed models can generate reasonable source excitation signals in accordance with the input F0.

V. CONCLUSION

In this article, we proposed a novel source-filter modeling strategy that decomposes a single neural network using the regularization loss on the intermediate output. Thanks to the unified optimization of the source excitation and resonance filtering networks, our best-proposed method has been demonstrated to achieve equal or higher sound quality than the high-fidelity neural vocoder while attaining a similar or advanced F0 controllability compared with a conventional parametric vocoder in the analysis-synthesis scenario. More experiments on the practical applications of the proposed neural vocoder and controllability over other acoustic features, such as spectral envelopes and aperiodicity, are left to future research.

APPENDIX A
INVESTIGATION OF INPUT ACOUSTIC FEATURES

To further investigate the impact of different conditional acoustic features, we evaluated several models with different types of conditioning features with the same model architecture. In our proposed methods used in the experimental evaluations (Section IV), we chose the set of {MGC, MAP} as the best combination for the auxiliary features, whose total number of dimensions is 62. Here, we compare P-uSFGAN with the following three models with different auxiliary features.
• MEL: This model adopts a full-band 80-dimensional log mel-spectrogram calculated in the setting described in Section IV-B instead of the vocoder features.
• AuxF0: This model includes one-dimensional continuous F0 in the default set of the auxiliary feature. The total number of dimensions of the auxiliary feature is 63.
• BAP: This model adopts the three-dimensional band-aperiodicity extracted using WORLD instead of the 21-dimensional MAP. Coding is performed by one-dimensional interpolation on the frequency axis, which compresses the half-FFT size to three. The total number of dimensions of the auxiliary feature is 44.
All the ablation models were trained in the same setting as that in the P-uSFGAN model except for their auxiliary features. Note that all subnetworks (i.e., harmonic network, noise network, filter network, and periodicity estimator) are conditioned using the same auxiliary features.

The objective evaluation results are shown in Fig. 10. The WORLD results are provided as references. We found that differences between the models become apparent when F0 is significantly high, so the study was conducted with F0 increased by a factor of five. First, we can see that the MEL model degrades even with a small F0 change. Since the mel-spectrogram already contains the F0 information, we speculate that it is difficult for the model to manipulate F0 by merely changing the sinusoidal inputs. The AuxF0 model shows significant degradation with F0 increased by a factor of two or more in its V/UV error rate, which is more critical for sound quality than the RMSE of log F0. We confirmed that the generated speech is hardly voiced, resulting in significant degradation. We assume that this tendency is due to the fact that the inductive bias for speech production provided by the source-filter modeling is not obtained owing to the leakage of F0 information to the filter network. The total degradation in the MEL model can be considered to have the same cause. From these experiences, we concluded that disentanglement of the input acoustic features and the restriction of F0 information leakage to the filter network is essential to gaining the benefit from source-filter modeling.
Fig. 10. Objective evaluation results of F0 transformation for the ablation study on auxiliary features.
The BAP model, which gives periodicity information with fewer dimensions, shows minimal degradation in both V/UV error rate and RMSE of log F0. We assume that the degradation is because the neural network can ignore a lower-dimensional input feature (i.e., BAP) when the network can reconstruct the target waveform from the other input features in training. Moreover, this result suggests the importance of periodicity information in neural vocoders based on periodic and aperiodic component decomposition.

REFERENCES

[1] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun., vol. 27, no. 3/4, pp. 187–207, 1999.
[2] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[3] H. W. Dudley, “Remaking speech,” J. Acoustical Soc. Amer., vol. 11, no. 2, pp. 169–177, 1939.
[4] M. R. Schroeder, “Vocoders: Analysis and synthesis of speech,” Proc. IEEE, vol. 54, no. 5, pp. 720–734, May 1966.
[5] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 4, pp. 744–754, Aug. 1986.
[6] A. van den Oord et al., “WaveNet: A generative model for raw audio,” in Proc. 9th ISCA Speech Synth. Workshop, 2016, Art. no. 125.
[7] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 3617–3621.
[8] W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A compact flow-based model for raw audio,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 7706–7716.
[9] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=a-xFK8Ymz5J
[10] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=NsMLjcFaO8O
[11] A. Mustafa, N. Pia, and G. Fuchs, “StyleMelGAN: An efficient high-fidelity adversarial vocoder with temporal adaptive normalization,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 6034–6038.
[12] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 17022–17033.
[13] Y. Hono, S. Takaki, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 6049–6053.
[14] M.-J. Hwang, R. Yamamoto, E. Song, and J.-M. Kim, “High-fidelity parallel WaveGAN with multi-band harmonic-plus-noise model,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2227–2231.
[15] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, “Chunked autoregressive GAN for conditional waveform synthesis,” in Proc. Int. Conf. Learn. Representations, 2022. [Online]. Available: https://openreview.net/forum?id=v3aeIsY_vVX
[16] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5891–5895.
[17] B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 3394–3398.
[18] L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, “Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6915–6919.
[19] L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, “GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 694–698, doi: 10.21437/Interspeech.2019-2008.
[20] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5916–5920.
[21] X. Wang and J. Yamagishi, “Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis,” in Proc. Speech Synth. Workshop, 2019, pp. 1–6.
[22] Z. Liu, K. Chen, and K. Yu, “Neural homomorphic vocoder,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 240–244.
[23] M. Morrison, Z. Jin, N. J. Bryan, J.-P. Caceres, and B. Pardo, “Neural pitch-shifting and time-stretching with controllable LPCNet,” 2021, arXiv:2110.02360.
[24] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[25] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, and T. Toda, “Quasi-periodic parallel WaveGAN vocoder: A non-autoregressive pitch dependent dilated convolution model for parametric speech generation,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 3535–3539.
[26] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, and T. Toda, “Quasi-periodic parallel WaveGAN: A non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 792–806, 2021.
[27] R. Yoneyama, Y.-C. Wu, and T. Toda, “Unified source-filter GAN: Unified source-filter network based on factorization of quasi-periodic parallel WaveGAN,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2187–2191.
[28] R. Yoneyama, Y.-C. Wu, and T. Toda, “Unified source-filter GAN with harmonic-plus-noise source excitation generation,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2022, pp. 848–852.
[29] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 1118–1122.
[30] N. Kalchbrenner et al., “Efficient neural audio synthesis,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 2415–2424.
[31] K. Arora, L. El Asri, H. Bahuleyan, and J. Cheung, “Why exposure bias matters: An imitation learning perspective of error accumulation in language generation,” in Proc. Conf. Assoc. Comput. Linguistics, 2022, pp. 700–710.
[32] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 3918–3926.
[33] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. Int. Conf. Learn. Representations, 2019. [Online]. Available: https://openreview.net/forum?id=HklY120cYm
[34] R. Yamamoto, E. Song, and J.-M. Kim, “Probability density distillation with generative adversarial networks for high-quality parallel waveform generation,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 699–703.
[35] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6199–6203.
[36] K. Kumar et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 14910–14921.
[37] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, “Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 6039–6043.
[38] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in Proc. Spoken Lang. Technol. Workshop, 2021, pp. 492–498.
[39] J. Yang, J. Lee, Y. Kim, H. Cho, and I. Kim, “VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2020, pp. 200–204.
[40] M. Bińkowski et al., “High fidelity speech synthesis with adversarial networks,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https://openreview.net/forum?id=r1gfQgSFDr
[41] J. You et al., “GAN vocoder: Multi-resolution discriminator is all you need,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2177–2181.
[42] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 196–200.
[43] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi-periodic WaveNet: An autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1134–1148, 2021.
[44] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform models for statistical parametric speech synthesis,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 402–415, 2020.
[45] F. Itakura and S. Saito, “Analysis synthesis telephony based on the maximum likelihood method,” in Proc. 6th Int. Congr. Acoust., 1968, pp. C17–C20. [Online]. Available: https://cir.nii.ac.jp/crid/1573950400351247616
[46] B. S. Atal and M. R. Schroeder, “Predictive coding of speech signals,” in Proc. Speech Commun. Process., 1967, pp. 360–361.
[47] D. Wong, B.-H. Juang, and A. Gray, “An 800 bit/s vector quantization LPC vocoder,” IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 5, pp. 770–780, Oct. 1982.
[48] A. V. McCree and T. P. Barnwell, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 242–250, Jul. 1995.
[49] M. Morise, “CheapTrick, a spectral envelope estimator for high-quality speech synthesis,” Speech Commun., vol. 67, pp. 1–7, 2015.
[50] Y. Stylianou, “Applying the harmonic plus noise model in concatenative speech synthesis,” IEEE Trans. Speech Audio Process., vol. 9, no. 1, pp. 21–29, Jan. 2001.
[51] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” [sound], Univ. Edinburgh, The Centre for Speech Technol. Res. (CSTR), 2019, doi: 10.7488/ds/2645.
[52] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations, 2015. [Online]. Available: https://arxiv.org/abs/1412.6980
[53] L. Liu et al., “On the variance of the adaptive learning rate and beyond,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https://openreview.net/forum?id=rkgz2aEKDr
[54] M. Morise, “Harvest: A high-performance fundamental frequency estimator from speech signals,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 2321–2325.
[55] M. Morise, “D4C, a band-aperiodicity estimator for high-quality speech synthesis,” Speech Commun., vol. 84, pp. 57–65, 2016.
[56] B. McFee et al., “librosa/librosa: 0.9.2,” 2022, doi: 10.5281/zenodo.6759664.