Vincent Lostanlen¹, Daniel Haider²,³, Han Han¹, Mathieu Lagrange¹, Peter Balazs², and Martin Ehler³

¹ Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France.
² Acoustics Research Institute, Austrian Academy of Sciences, A-1040 Vienna, Austria.
³ University of Vienna, Department of Mathematics, A-1090 Vienna, Austria.
arXiv:2307.13821v1 [cs.SD] 25 Jul 2023
Furthermore, it contrasts with other domains of deep learning (e.g., image processing) in which all convnet layers are simply initialized with i.i.d. Gaussian weights [15].

Prior work on this problem has focused on advancing the state of the art on a given task, sometimes to no avail [16]. In this article, we take a step back and formulate a different question: before we try to outperform an auditory filterbank, can we replicate its responses with a neural audio model? To answer this question, we compare different "student" models in terms of their ability to learn from a black-box function or "teacher" by knowledge distillation (KD).

Given an auditory filterbank Λ and a discrete-time signal x of length T, let us denote the squared magnitude of the filter response at frequency bin f by Y[f, t] = |Λx|²[f, 2^J t], where 2^J is the chosen hop size or "stride". Then, given a model Φ_W with weights W, we evaluate the dissimilarity between teacher Λ and student Φ_W as their spectrogram-based cosine distance L_x(W). This distance can be computed as a squared Euclidean distance after normalizing across frequency bins f, independently for each time t. Let |Φ̃_W x|² and Ỹ denote these normalized versions of student and teacher; then

    L_x(W) = cosdist(|Φ_W x|², Y) = (1/2) ∑_{t=1}^{T/2^J} ∑_{f=1}^{F} (|Φ̃_W x|²[f, t] − Ỹ[f, t])²,    (1)

where F is the number of filters. We seek to minimize the quantity above by gradient-based optimization on W, on a real-world dataset of audio signals {x₁, …, x_N}, and with no prior knowledge on Λ.

2. NEURAL AUDIO MODELS

2.1. Learnable time-domain filterbanks (Conv1D)

As a baseline, we train a 1-D convnet Φ_W with F kernels of the same length 2L. With a constant stride of 2^J, Φ_W x writes as

    Φ_W x[f, t] = (x ∗ ϕ_f)[2^J t] = ∑_{τ=−L}^{L−1} x[2^J t − τ] ϕ_f[τ],    (2)

where x is padded by L samples at both ends. Under this setting, the trainable weights W are the finite impulse responses of ϕ_f for all f, thus amounting to 2LF parameters. We initialize W with i.i.d. Gaussian entries of null mean and variance 1/√F.

2.2. Gabor 1-D convolutions (Gabor1D)

As a representative of the state of the art (i.e., LEAF [14]), we train a Gabor filtering layer, or Gabor1D for short. For this purpose, we parametrize each FIR filter ϕ_f as a Gabor filter; i.e., an exponential sine wave of amplitude a_f and frequency η_f, modulated by a Gaussian envelope of width σ_f. Hence a new definition:

    ϕ_f[τ] = (a_f / (√(2π) σ_f)) exp(−τ² / (2σ_f²)) exp(2πiη_f τ).    (3)

Under this setting, the trainable weights W amount to only 3F parameters: W = {a₁, σ₁, η₁, …, a_F, σ_F, η_F}. Following LEAF, we initialize center frequencies η_f and bandwidths σ_f so as to form a mel-frequency filterbank [17] and set all amplitudes a_f to one. We use the implementation of Gabor1D from SpeechBrain v0.5.14 [18].

2.3. Multiresolution neural network (MuReNN)

As our original contribution, we train a multiresolution neural network, or MuReNN for short. MuReNN comprises two stages, multiresolution approximation (MRA) and convnet, of which only the latter is learned from data. We implement the MRA with a dual-tree complex wavelet transform (DTCWT) [19]. The DTCWT relies on a multirate filterbank in which each wavelet ψ_j has a null average and a bandwidth of one octave. Denoting by ξ the sampling rate of x, the wavelet ψ_j has a bandwidth with cutoff frequencies 2^{−(j+1)}π and 2^{−j}π. Hence, we may subsample the result of the convolution (x ∗ ψ_j) by a factor of 2^j, yielding:

    ∀j ∈ {0, …, J − 1},  x_j[t] = (x ∗ ψ_j)[2^j t],    (4)

where J is the number of multiresolution levels. We take J = 9 in this paper, which roughly coincides with the number of octaves in the hearing range of humans. The second stage in MuReNN consists in defining convnet filters ϕ_f. Unlike in the Conv1D setting, those filters do not operate over the full-resolution input x but over one of its MRA levels x_j. More precisely, let us denote by j[f] the decomposition level assigned to filter f, and by 2L_j the kernel size for that decomposition level. We convolve x_{j[f]} with ϕ_f and apply a subsampling factor of 2^{J−j[f]}, hence:

    Φ_W x[f, t] = (x_{j[f]} ∗ ϕ_f)[2^{J−j[f]} t] = ∑_{τ=−L_j}^{L_j−1} x_{j[f]}[2^{J−j[f]} t − τ] ϕ_f[τ].    (5)

The two stages of subsampling in Equations 4 and 5 result in a uniform downsampling factor of 2^J for Φ_W x. Each learned FIR filter ϕ_f has an effective receptive field size of 2^{j[f]+1} L_{j[f]}, thanks to the subsampling operation in Equation 4. This resembles a dilated convolution [20] with a dilation factor of 2^{j[f]}, except that the DTCWT guarantees the absence of aliasing artifacts.

Besides this gain in frugality, as measured by parameter count per unit of time, the resort to an MRA offers the opportunity to introduce desirable mathematical properties in the non-learned part of the transform (namely, the ψ_j) and have the MuReNN operator Φ_W inherit them, without the need for a non-random initialization or for regularization during training. In particular, Φ_W has at least as many vanishing moments as ψ_{j[f]}. Furthermore, the DTCWT yields quasi-analytic coefficients: for each j, x_j = x_j^R + i x_j^I with x_j^I ≈ H(x_j^R), where the superscript R (resp. I) denotes the real part (resp. imaginary part) and H denotes the Hilbert transform. Since ϕ_f is real-valued, the same property holds for MuReNN: Φ^I x = H(Φ^R x).

We implement MuReNN on GPU via a custom implementation of the DTCWT in PyTorch¹. Following [19], we use a biorthogonal wavelet for j = 0 and quarter-shift wavelets for j ≥ 1. We set L_j = 8M_j, where M_j is the number of filters f at resolution j. We refer to [21] for an introduction to deep learning in the wavelet domain, with applications to image classification.

¹ https://github.com/kymatio/murenn

3. KNOWLEDGE DISTILLATION

3.1. Target auditory filterbanks

For each of the three different domains (speech, music, and urban environmental sounds), we use an auditory filterbank Λ that is tailored to its respective spectrotemporal characteristics.
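To make Equations 4 and 5 concrete, here is a minimal PyTorch sketch of the two MuReNN stages. It is a hypothetical re-implementation, not the authors' code: the fixed wavelets ψ_j are stood in by zero-mean random FIRs rather than a DTCWT, and J = 4 levels, M = 3 filters per level, and half kernel size L = 8 are toy sizes chosen for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
J, M, L = 4, 3, 8      # levels, filters per level, half kernel size
T = 2 ** 12            # input length

# Stage 1 (Equation 4): fixed octave-band decomposition. Stand-in
# "wavelets": zero-mean random FIRs of dyadic lengths, not a DTCWT.
psis = [torch.randn(1, 1, 2 ** (j + 2)) for j in range(J)]
psis = [p - p.mean() for p in psis]   # null average, like a wavelet

def mra(x):
    # x: (batch, 1, T) -> [x_0, ..., x_{J-1}], x_j subsampled by 2**j
    xs = []
    for j, psi in enumerate(psis):
        xj = F.conv1d(x, psi, padding=psi.shape[-1] // 2)
        xs.append(xj[..., :: 2 ** j])
    return xs

# Stage 2 (Equation 5): learned filters phi_f of length 2L on level j,
# with stride 2**(J - j), so every channel is subsampled by 2**J overall.
convs = torch.nn.ModuleList(
    torch.nn.Conv1d(1, M, 2 * L, stride=2 ** (J - j),
                    padding=L, bias=False)
    for j in range(J)
)

def murenn_forward(x):
    return torch.cat([conv(xj) for conv, xj in zip(convs, mra(x))], dim=1)

y = murenn_forward(torch.randn(1, 1, T))
```

Each learned kernel at level j covers 2L samples at a rate divided by 2^j, i.e., an effective receptive field of 2^{j+1}L samples, which mirrors the dilated-convolution analogy of Section 2.3.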
2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 22-25, 2023, New Paltz, NY
Table 1: Mean and standard deviation of the test loss after knowledge distillation, over five independent trials. Each column corresponds to a different neural audio model Φ_W, while each row corresponds to a different auditory filterbank and audio domain. See Section 4.2 for details.
Figure 2: Left to right: evolution of validation losses on different domains with Conv1D (green), Gabor1D (blue), and MuReNN (orange), as a
function of training epochs. The shaded area denotes the standard deviation across five independent trials. See Section 4.2 for details.
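The validation losses plotted in Figure 2 are the spectrogram-based cosine distances of Equation 1. A minimal sketch in PyTorch, assuming squared-magnitude spectrograms of hypothetical shape (F, T/2^J) = (40, 128):

```python
import torch

# Equation 1: half the squared Euclidean distance between
# squared-magnitude spectrograms, after L2-normalizing each
# time frame across frequency bins.
def cosine_distance(student_sq, teacher_sq, eps=1e-12):
    # Normalize across frequency bins f, independently for each t.
    s = student_sq / (student_sq.norm(dim=0, keepdim=True) + eps)
    y = teacher_sq / (teacher_sq.norm(dim=0, keepdim=True) + eps)
    return 0.5 * (s - y).pow(2).sum()

torch.manual_seed(0)
student = torch.rand(40, 128)   # |Phi_W x|^2, hypothetical F = 40 bins
teacher = torch.rand(40, 128)   # Y = |Lambda x|^2, 128 hops
loss = cosine_distance(student, teacher)
```

With unit-norm frames, 0.5‖s − y‖² equals 1 − cos(s, y) summed over frames, hence the name "cosine distance".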
Synth A constant-Q filterbank with Q = 8 filters per octave, covering eight octaves with Hann-modulated sine waves.

Speech A filterbank with fourth-order Gammatone filters tuned to the ERB scale, a frequency scale which is adapted to the equivalent rectangular bandwidths of the human cochlea [22]. In psychoacoustics, Gammatone filters provide a good approximation to measured responses of the filters of the human basilar membrane [3]. Unlike Gabor filters, Gammatone filters are asymmetric, both in the time domain and in the frequency domain. We refer to [23] for implementation details.

Music A variable-Q transform (VQT) with M_j = 12 frequency bins per octave at every level. The VQT is a variant of the constant-Q transform (CQT) in which Q is decreased gradually towards lower frequencies [24], hence an improved temporal resolution at the expense of frequency resolution.

Urban A third-octave filterbank inspired by the ANSI S1.11-2004 standard for environmental noise monitoring [25]. In this filterbank, center frequencies are not exactly in a geometric progression. Rather, they are aligned with integer Hertz values: 40, 50, 60; 80, 100, 120; 160, 200, 240; and so forth.

We construct the Synth teacher via nnAudio [26], a PyTorch port of librosa [27]; and the Speech, Music, and Urban teachers using the Large Time–Frequency Analysis Toolbox (LTFAT) for MATLAB [28].

3.2. Gradient-based optimization

For all four "student" models, we initialize the vector W at random and update it iteratively by empirical risk minimization over the training set. We rely on the Adam algorithm for stochastic optimization with default momentum parameters. Given the definition of the spectrogram-based cosine distance in Equation 1, we perform reverse-mode automatic differentiation in PyTorch to obtain

    ∇L_x(W)[i] = ∑_{t=1}^{T/2^J} ∑_{f=1}^{F} (∂|Φ̃_W x|²[f, t] / ∂W[i])(W) × (|Φ̃_W x|²[f, t] − Ỹ[f, t])    (6)

for each entry W[i]. Note that the gradient above does not involve the phases of the teacher filterbank Λ, only its normalized magnitude response Ỹ given the input x. Consequently, even though our models Φ_W contain a single linear layer, the associated knowledge distillation procedure is nonconvex, and thus resembles the training of a deep neural network.

4. RESULTS AND DISCUSSION

4.1. Datasets

Synth As a proof of concept, we construct sine waves in a geometric progression over the frequency range of the target filterbank.

Speech The North Texas vowel database (NTVOW) [29] contains utterances of 12 English vowels from 50 American speakers, including children aged three to seven as well as male and female adults. In total, it consists of 3190 recordings, each lasting between one and three seconds.

Music The TinySOL dataset [30] contains isolated musical notes played by eight instruments: accordion, alto saxophone, bassoon, flute, harp, trumpet in C, and cello. For each of these instruments, we take all available pitches in the tessitura (min = B0, median = E4, max = C♯8) in three levels of intensity dynamics: pp, mf, and ff. This results in a total of 1212 audio recordings.

Urban The SONYC Urban Sound Tagging dataset (SONYC-UST) [31] contains 2803 acoustic scenes from a network of autonomous sensors in New York City. Each of these ten-second scenes contains one or several sources of urban noise pollution, such as engines, machinery and non-machinery impacts, powered saws, alert signals, and dog barks.
4.2. Benchmarks

For each audio domain, we randomly split its corresponding dataset into training, testing, and validation subsets with an 8:1:1 ratio. During training, we select 2^12 time samples from the middle part of each signal, i.e., the FIR length of the filters in the teacher filterbank. We train each model for 100 epochs with an epoch size of 8000.

Table 1 summarizes our findings. On all three benchmarks, we observe that MuReNN reaches state-of-the-art performance, as measured in terms of cosine distance with respect to the teacher filterbank after 100 epochs. The improvement with respect to Conv1D is most noticeable in the Synth benchmark and least noticeable in the Speech benchmark. Furthermore, Figure 2 indicates that Gabor1D barely trains at all: this observation is consistent with the sensitivity of LEAF with respect to initialization, as reported in [32]. We also notice that MuReNN trains faster than Conv1D on all benchmarks except for Urban, a phenomenon deserving further inquiry.

4.3. Error analysis

The mel-scale initialization of Gabor1D filters and the inductive bias of MuReNN enabled by octave localization give a starting advantage when learning filterbanks on logarithmic frequency scales, as used for the Gammatone and VQT filterbanks. Expectedly, this advantage is absent with a teacher filterbank that does not follow a geometric progression of center frequencies, as is the case for the ANSI scale. Figure 2 reflects these observations.

To examine the individual filters of each model, we take the speech domain as an example and inspect their learned impulse responses. Figure 3 visualizes chosen examples at different frequencies learned by each model, together with the corresponding teacher Gammatone filters. In general, all models are able to fit the filter responses well. However, it is noticeable that the prescribed envelope of Gabor1D impedes it from learning the asymmetric target Gammatone filters. This becomes especially prominent at high frequencies. From the strong envelope mismatches at coinciding frequencies, we may deduce that center frequencies and bandwidths did not play well together during training. On the contrary, MuReNN and Conv1D are flexible enough to learn asymmetric temporal envelopes without compromising their regularity in time. Although the learned filters of Conv1D are capable of fitting the frequencies well, they suffer from noisy artifacts, especially outside their essential supports. Indeed, by limiting the scale and support of the learned filters, MuReNN restrains the introduction of high-frequency noise that a learned filter of longer length would allow. The phase misalignment at low frequencies is a natural consequence of the fact that the gradients are computed from the magnitudes of the filterbank responses.

Finally, we measure the time–frequency localization of all filters by computing the associated Heisenberg time–frequency ratios [33]. From theory, we know that Gaussian windows are optimal in this sense [34]. Therefore, it is not surprising that Gabor1D yields the best-localized filters, even outperforming the teacher; see Figure 4. Expectedly, the localization of the filters from Conv1D is poor and appears independent of the teacher. MuReNN roughly resembles the localization of the teachers but has some poorly localized outliers at higher frequencies, deserving further inquiry.

5. CONCLUSION

Multiresolution neural networks (MuReNN) have the potential to advance waveform-based deep learning. They offer a flexible and data-driven procedure for learning filters which are "wavelet-like"; i.e., narrowband with compact support, vanishing moments, and quasi-Hilbert analyticity. Our experiments based on knowledge distillation from three domains (speech, music, and urban sounds) illustrate the suitability of MuReNN for real-world applications. The main limitation of MuReNN lies in the need to specify a number of filters per octave M_j, together with a kernel size L_j. Still, a promising finding of our paper is that prior knowledge on M_j and L_j suffices to finely approximate non-Gabor auditory filterbanks, such as Gammatones on an ERB scale, from a random i.i.d. Gaussian initialization. Future work will evaluate MuReNN in conjunction with a deep neural network for sample-efficient audio classification.

6. ACKNOWLEDGMENT

V.L. thanks Fergal Cotter and Nick Kingsbury for maintaining the dtcwt and pytorch_wavelets libraries; LS2N and ÖAW staff for arranging research visits; and Neil Zeghidour for helpful discussions. D.H. thanks Clara Hollomey for helping with the implementation of the filterbanks. V.L. and M.L. are supported by ANR MuReNN; D.H., by a DOC Fellowship of the Austrian Academy of Sciences (A 26355); P.B., by FWF projects LoFT (P 34624) and NoMASP (P 34922); and M.E., by WWTF project CHARMED (VRG12-009).