
2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 22-25, 2023, New Paltz, NY

FITTING AUDITORY FILTERBANKS WITH MULTIRESOLUTION NEURAL NETWORKS

Vincent Lostanlen¹, Daniel Haider²,³, Han Han¹, Mathieu Lagrange¹, Peter Balazs², and Martin Ehler³

¹ Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France.
² Acoustics Research Institute, Austrian Academy of Sciences, A-1040 Vienna, Austria.
³ University of Vienna, Department of Mathematics, A-1090 Vienna, Austria.

arXiv:2307.13821v1 [cs.SD] 25 Jul 2023

ABSTRACT

Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time–frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank "teacher" is engineered by domain knowledge while the neural network "student" is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time–frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.

Index Terms— Convolutional neural network, digital filters, filterbanks, multiresolution analysis, psychoacoustics.

[Figure 1: schematic showing natural sounds x fed to the teacher Λ, yielding Y = |Λx|², and to the student Φ_W, yielding Φ_W x; the two are compared by L_x(W) = cosdist(|Φ_W x|², Y).]

Figure 1: Graphical outline of the proposed method. We train a neural network "student" Φ_W to regress the squared magnitudes Y of an auditory filterbank "teacher" Λ in terms of spectrogram-based cosine distance L_x, on average over a dataset of natural sounds x.

1. INTRODUCTION

Auditory filterbanks are time-invariant systems whose design takes inspiration from domain-specific knowledge in hearing science [1]. For example, the critical bands of the human cochlea inspire frequency scales such as mel, bark, and ERB [2]. The phenomenon of temporal masking calls for asymmetric impulse responses, motivating the design of Gammatone filters [3]. Lastly, the constant-Q transform (CQT), in which the number of filters per octave is fixed, reflects the principle of octave equivalence in music [4].

In recent years, the growing interest in deep learning for signal processing has prompted proposals to learn filterbanks from data rather than design them a priori [5]. Such a replacement of feature engineering by feature learning is motivated by the diverse application scope of audio content analysis: i.e., conservation biology [6], urban science [7], industry [8], and healthcare [9]. Since these applications differ greatly in terms of acoustical content, the domain knowledge which prevails in speech and music processing is likely to yield suboptimal performance. Instead, gradient-based optimization has the potential to reflect the spectrotemporal characteristics of the data at hand.

Enabling this potential is particularly important in applications where psychoacoustic knowledge is lacking; e.g., animals outside of the mammalian taxon [10, 11]. Beyond its perspectives in applied science, the study of learnable filterbanks has value for fundamental research on machine listening with AI. This is because it represents the last stage of progress towards general-purpose "end-to-end" learning, from the raw audio waveform to the latent space of interest.

Yet, success stories in waveform-based deep learning for audio classification have been, up to date, surprisingly few, and even fewer beyond the realm of speech and music [12]. The core hypothesis of our paper is that this shortcoming is due to an inadequate choice of neural network architecture. Specifically, we identify a dilemma between nonparametric and parametric approaches, where the former are represented by convolutional neural networks (convnets) and the latter by architectures used in SincNet [13] or LEAF [14]. In theory, convnets may approximate any finite impulse response (FIR), given a receptive field that is wide enough; but in practice, gradient-based optimization on nonconvex objectives yields suboptimal solutions [12]. On the other hand, the parametric approaches enforce good time–frequency localization, yet at the cost of imposing a rigid shape on the learned filters: cardinal sine (inverse-square envelope) for SincNet and Gabor (Gaussian envelope) for LEAF.

Our goal is to overcome this dilemma by developing a neural audio model which is capable of learning temporal envelopes from data while guaranteeing near-optimal time–frequency localization. In doing so, we aim to bypass the explicit incorporation of psychoacoustic knowledge as much as possible. This is unlike state-of-the-art convnets for filterbank learning such as SincNet or LEAF, whose parametric kernels are initialized according to a mel-frequency scale. Arguably, such careful initialization procedures defeat the purpose of deep learning; i.e., to spare the human effort of feature engineering.

Companion website: https://github.com/lostanlen/lostanlen2023waspaa

Furthermore, it contrasts with other domains of deep learning (e.g., image processing) in which all convnet layers are simply initialized with i.i.d. Gaussian weights [15].

Prior work on this problem has focused on advancing the state of the art on a given task, sometimes to no avail [16]. In this article, we take a step back and formulate a different question: before we try to outperform an auditory filterbank, can we replicate its responses with a neural audio model? To answer this question, we compare different "student" models in terms of their ability to learn from a black-box function or "teacher" by knowledge distillation (KD).

Given an auditory filterbank Λ and a discrete-time signal x of length T, let us denote the squared magnitude of the filter response at frequency bin f by Y[f, t] = |Λx|²[f, 2^J t], where 2^J is the chosen hop size or "stride". Then, given a model Φ_W with weights W, we evaluate the dissimilarity between teacher Λ and student Φ_W as their (squared) spectrogram-based cosine distance L_x(W). This distance can be computed as the squared L² distance between student and teacher after normalizing across frequency bins f, independently for each time frame t. Let |Φ̃_W x|² and Ỹ denote these normalized versions of student and teacher; then

    L_x(W) = cosdist(|Φ_W x|², Y) = (1/2) Σ_{t=1}^{T/2^J} Σ_{f=1}^{F} ( |Φ̃_W x|²[f, t] − Ỹ[f, t] )²,   (1)

where F is the number of filters. We seek to minimize the quantity above by gradient-based optimization on W, on a real-world dataset of audio signals {x_1, ..., x_N}, and with no prior knowledge on Λ.
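To make Equation (1) concrete, here is a minimal PyTorch sketch of the loss, assuming the student and teacher squared magnitudes are already available as tensors whose second-to-last axis indexes frequency bins; the function name and the eps regularizer are ours, not part of the paper.

```python
import torch

def spectrogram_cosine_distance(student_sq, teacher_sq, eps=1e-8):
    """Squared spectrogram-based cosine distance of Eq. (1): L2-normalize
    the squared magnitudes across frequency bins (dim -2), independently
    for each time frame, then take half the squared Euclidean distance."""
    # student_sq, teacher_sq: non-negative tensors of shape (..., F, T // 2**J)
    student_n = student_sq / (student_sq.norm(dim=-2, keepdim=True) + eps)
    teacher_n = teacher_sq / (teacher_sq.norm(dim=-2, keepdim=True) + eps)
    return 0.5 * (student_n - teacher_n).pow(2).sum()
```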
2. NEURAL AUDIO MODELS

2.1. Learnable time-domain filterbanks (Conv1D)

As a baseline, we train a 1-D convnet Φ_W with F kernels of the same length 2L. With a constant stride of 2^J, Φ_W x writes as

    Φ_W x[f, t] = (x ∗ φ_f)[2^J t] = Σ_{τ=−L}^{L−1} x[2^J t − τ] φ_f[τ],   (2)

where x is padded by L samples at both ends. Under this setting, the trainable weights W are the finite impulse responses of φ_f for all f, thus amounting to 2LF parameters. We initialize W as Gaussian i.i.d. entries with null mean and variance 1/√F.
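As an illustration of Section 2.1, the following is a minimal sketch of such a Conv1D baseline in PyTorch; the values F = 40, L = 512, and J = 9 are placeholders rather than the paper's configuration.

```python
import torch
from torch import nn

class Conv1DFilterbank(nn.Module):
    """Sketch of the Conv1D baseline: F learned FIR filters of length 2L,
    applied as a single linear layer with stride 2**J and no bias."""
    def __init__(self, F=40, L=512, J=9):
        super().__init__()
        self.conv = nn.Conv1d(1, F, kernel_size=2 * L, stride=2 ** J,
                              padding=L, bias=False)
        # i.i.d. Gaussian init with zero mean and variance 1/sqrt(F)
        nn.init.normal_(self.conv.weight, mean=0.0, std=F ** -0.25)

    def forward(self, x):          # x: (batch, 1, T)
        return self.conv(x)        # (batch, F, T // 2**J)
```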
2.2. Gabor 1-D convolutions (Gabor1D)

As a representative of the state of the art (i.e., LEAF [14]), we train a Gabor filtering layer, or Gabor1D for short. For this purpose, we parametrize each FIR filter φ_f as a Gabor filter; i.e., an exponential sine wave of amplitude a_f and frequency η_f which is modulated by a Gaussian envelope of width σ_f. Hence a new definition:

    φ_f[τ] = (a_f / (√(2π) σ_f)) exp(−τ² / (2σ_f²)) exp(2πi η_f τ).   (3)
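For illustration, here is a sketch of how a single Gabor kernel of Equation (3) can be synthesized from its three parameters. This is not the SpeechBrain implementation; the center frequency eta is assumed to be expressed in cycles per sample, and the support of 2L taps mirrors Equation (2).

```python
import math
import torch

def gabor_kernel(a, sigma, eta, L):
    """Complex Gabor filter of Eq. (3): Gaussian envelope of width sigma,
    amplitude a, center frequency eta (cycles per sample), support 2L taps."""
    tau = torch.arange(-L, L, dtype=torch.float32)
    envelope = a / (math.sqrt(2 * math.pi) * sigma) * torch.exp(-tau ** 2 / (2 * sigma ** 2))
    carrier = torch.polar(torch.ones_like(tau), 2 * math.pi * eta * tau)  # exp(2*pi*i*eta*tau)
    # Only (a, sigma, eta) would be trainable: 3 scalars per filter, 3F in total.
    return envelope * carrier
```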
Under this setting, the trainable weights W amount to only 3F parameters: W = {a_1, σ_1, η_1, ..., a_F, σ_F, η_F}. Following LEAF, we initialize center frequencies η_f and bandwidths σ_f so as to form a mel-frequency filterbank [17] and set amplitudes a_f to one. We use the implementation of Gabor1D from SpeechBrain v0.5.14 [18].

2.3. Multiresolution neural network (MuReNN)

As our original contribution, we train a multiresolution neural network, or MuReNN for short. MuReNN comprises two stages, multiresolution approximation (MRA) and convnet, of which only the latter is learned from data. We implement the MRA with a dual-tree complex wavelet transform (DTCWT) [19]. The DTCWT relies on a multirate filterbank in which each wavelet ψ_j has a null average and a bandwidth of one octave. Denoting by ξ the sampling rate of x, the wavelet ψ_j has a bandwidth with cutoff frequencies 2^{−(j+1)}π and 2^{−j}π. Hence, we may subsample the result of the convolution (x ∗ ψ_j) by a factor of 2^j, yielding:

    ∀j ∈ {0, ..., J − 1},  x_j[t] = (x ∗ ψ_j)[2^j t],   (4)

where J is the number of multiresolution levels. We take J = 9 in this paper, which roughly coincides with the number of octaves in the hearing range of humans. The second stage in MuReNN consists in defining convnet filters φ_f. Unlike in the Conv1D setting, those filters do not operate over the full-resolution input x but over one of its MRA levels x_j. More precisely, let us denote by j[f] the decomposition level assigned to filter f, and by 2L_j the kernel size for that decomposition level. We convolve x_{j[f]} with φ_f and apply a subsampling factor of 2^{J−j[f]}, hence:

    Φ_W x[f, t] = (x_{j[f]} ∗ φ_f)[2^{J−j[f]} t] = Σ_{τ=−L_j}^{L_j−1} x_{j[f]}[2^{J−j[f]} t − τ] φ_f[τ].   (5)

The two stages of subsampling in Equations 4 and 5 result in a uniform downsampling factor of 2^J for Φ_W x. Each learned FIR filter φ_f has an effective receptive field size of 2^{j[f]+1} L_{j[f]}, thanks to the subsampling operation in Equation 4. This resembles a dilated convolution [20] with a dilation factor of 2^{j[f]}, except that the DTCWT guarantees the absence of aliasing artifacts.

Besides this gain in frugality, as measured by parameter count per unit of time, the resort to an MRA offers the opportunity to introduce desirable mathematical properties in the non-learned part of the transform (namely, ψ_j) and have the MuReNN operator Φ_W inherit them, without need for a non-random initialization nor regularization during training. In particular, Φ_W has at least as many vanishing moments as ψ_j. Furthermore, the DTCWT yields quasi-analytic coefficients: for each j, x_j = x_j^R + i x_j^I with x_j^I ≈ H(x_j^R), where the superscript R (resp. I) denotes the real part (resp. imaginary part) and H denotes the Hilbert transform. Since φ_f is real-valued, the same property holds for MuReNN: Φ^I x = H(Φ^R x).

We implement MuReNN on GPU via a custom implementation of the DTCWT in PyTorch (https://github.com/kymatio/murenn). Following [19], we use a biorthogonal wavelet for j = 0 and quarter-shift wavelets for j ≥ 1. We set L_j = 8M_j, where M_j is the number of filters f at resolution j. We refer to [21] for an introduction to deep learning in the wavelet domain, with applications to image classification.
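The sketch below illustrates the learned stage of MuReNN as we read Equations (4)–(5): one Conv1d per octave, operating on the corresponding DTCWT subband and strided so that all outputs share the hop size 2^J. The dtcwt argument is a stand-in for the fixed wavelet stage (e.g., as provided by the murenn package); for simplicity, subbands are treated here as real-valued tensors, whereas the actual DTCWT coefficients are quasi-analytic, and the default kernel_half of 8 mirrors the paper's choice L_j = 8M_j with M_j = 1.

```python
import torch
from torch import nn

class MuReNNSketch(nn.Module):
    """Sketch of MuReNN's learned stage: one Conv1d per octave subband.
    `dtcwt` stands in for the fixed (non-learned) dual-tree complex wavelet
    transform, assumed to return J subbands x_0, ..., x_{J-1}, where x_j is
    already subsampled by 2**j as in Eq. (4), with shape (batch, 1, T // 2**j)."""

    def __init__(self, dtcwt, J=9, filters_per_octave=(1,) * 9, kernel_half=(8,) * 9):
        super().__init__()
        self.dtcwt, self.J = dtcwt, J
        self.convs = nn.ModuleList([
            # kernel size 2*L_j, stride 2**(J - j): all outputs share hop size 2**J
            nn.Conv1d(1, M_j, kernel_size=2 * L_j, stride=2 ** (J - j),
                      padding=L_j, bias=False)
            for j, (M_j, L_j) in enumerate(zip(filters_per_octave, kernel_half))
        ])

    def forward(self, x):                                  # x: (batch, 1, T)
        subbands = self.dtcwt(x)                           # list of J tensors
        outs = [conv(x_j) for conv, x_j in zip(self.convs, subbands)]
        t_min = min(o.shape[-1] for o in outs)             # align frame counts
        return torch.cat([o[..., :t_min] for o in outs], dim=1)
```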
3. KNOWLEDGE DISTILLATION

3.1. Target auditory filterbanks

For each of the three different domains, speech, music and urban environmental sounds, we use an auditory filterbank Λ that is tailored to its respective spectrotemporal characteristics.

Domain | Dataset    | Teacher    | Conv1D       | Gabor1D       | MuReNN
Speech | NTVOW      | Gammatone  | 2.12 ± 0.05  | 10.14 ± 0.09  | 2.00 ± 0.02
Music  | TinySOL    | VQT        | 8.76 ± 0.2   | 16.87 ± 0.06  | 5.28 ± 0.03
Urban  | SONYC-UST  | ANSI S1.11 | 3.26 ± 0.1   | 13.51 ± 0.2   | 2.57 ± 0.2
Synth  | Sine waves | CQT        | 11.54 ± 0.5  | 22.26 ± 0.9   | 9.75 ± 0.4

Table 1: Mean and standard deviation of test loss after knowledge distillation over five independent trials. Each column corresponds to a
different neural audio model ΦW while each row corresponds to a different auditory filterbank and audio domain. See Section 4.2 for details.

[Figure 2: four panels, left to right: Synth, Speech, Music, Urban; y-axis: validation loss (0.0 to 0.3); x-axis: training epochs (0 to 100); legend: Gabor1D, Conv1D, MuReNN.]

Figure 2: Left to right: evolution of validation losses on different domains with Conv1D (green), Gabor1D (blue), and MuReNN (orange), as a
function of training epochs. The shaded area denotes the standard deviation across five independent trials. See Section 4.2 for details.

Synth  A constant-Q filterbank with Q = 8 filters per octave, covering eight octaves with Hann-modulated sine waves.

Speech  A filterbank with 4th-order Gammatone filters tuned to the ERB scale, a frequency scale which is adapted to the equivalent rectangular bandwidths of the human cochlea [22]. In psychoacoustics, Gammatone filters provide a good approximation to measured responses of the filters of the human basilar membrane [3]. Unlike Gabor filters, Gammatone filters are asymmetric, both in the time domain and in the frequency domain. We refer to [23] for implementation details.

Music  A variable-Q transform (VQT) with M_j = 12 frequency bins per octave at every level. The VQT is a variant of the constant-Q transform (CQT) in which Q is decreased gradually towards lower frequencies [24], hence an improved temporal resolution at the expense of frequency resolution.

Urban  A third-octave filterbank inspired by the ANSI S1.11-2004 standard for environmental noise monitoring [25]. In this filterbank, center frequencies are not exactly in a geometric progression. Rather, they are aligned with integer Hertz values: 40, 50, 60; 80, 100, 120; 160, 200, 240; and so forth.

We construct the Synth teacher via nnAudio [26], a PyTorch port of librosa [27]; and the Speech, Music, and Urban teachers using the Large Time–Frequency Analysis Toolbox (LTFAT) for MATLAB [28].
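As a rough sketch of how such a teacher target can be produced with nnAudio, the snippet below builds a constant-Q magnitude spectrogram and squares it. The exact module path and arguments may vary across nnAudio versions, and the sampling rate, fmin, hop length, and input data are placeholders of ours, not the paper's settings.

```python
import torch
from nnAudio.Spectrogram import CQT  # in newer versions: from nnAudio.features import CQT

# Placeholder configuration loosely mirroring the Synth teacher
# (8 bins per octave over 8 octaves); sr, fmin and hop_length are assumptions.
cqt = CQT(sr=22050, fmin=32.7, n_bins=64, bins_per_octave=8,
          hop_length=512, output_format="Magnitude")

x = torch.randn(4, 2 ** 16)   # a batch of waveform excerpts (placeholder data)
Y = cqt(x) ** 2               # squared magnitudes Y[f, t], used as the teacher target
```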
3.2. Gradient-based optimization

For all four "student" models, we initialize the vector W at random and update it iteratively by empirical risk minimization over the training set. We rely on the Adam algorithm for stochastic optimization with default momentum parameters. Given the definition of the spectrogram-based cosine distance in Equation 1, we perform reverse-mode automatic differentiation in PyTorch to obtain

    ∇L_x(W)[i] = Σ_{t=1}^{T/2^J} Σ_{f=1}^{F} (∂|Φ̃_W x|²[f, t] / ∂W[i])(W) × ( |Φ̃_W x|²[f, t] − Ỹ[f, t] )   (6)

for each entry W[i]. Note that the gradient above does not involve the phases of the teacher filterbank Λ, only its normalized magnitude response Y given the input x. Consequently, even though our models Φ_W contain a single linear layer, the associated knowledge distillation procedure is nonconvex, and thus resembles the training of a deep neural network.
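The following is a minimal sketch of such a distillation loop, assuming a student module, a fixed teacher function returning squared magnitudes, and a data loader of waveform excerpts; all three names, the learning rate, and the batch layout are placeholders of ours.

```python
import torch

def distill(student, teacher_sq, loader, n_epochs=100, lr=1e-3):
    """Sketch of the knowledge-distillation loop: Adam with default momentum
    parameters, minimizing the spectrogram-based cosine distance of Eq. (1)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for epoch in range(n_epochs):
        for x in loader:                       # x: (batch, 1, T) waveform excerpts
            with torch.no_grad():
                Y = teacher_sq(x)              # fixed teacher target |Λx|², no gradient
            S = student(x).abs() ** 2          # student squared magnitudes
            # L2-normalize across frequency bins, independently per frame (Eq. 1)
            S = S / (S.norm(dim=-2, keepdim=True) + 1e-8)
            Y = Y / (Y.norm(dim=-2, keepdim=True) + 1e-8)
            loss = 0.5 * (S - Y).pow(2).sum()
            opt.zero_grad()
            loss.backward()                    # reverse-mode autodiff, cf. Eq. (6)
            opt.step()
```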
4. RESULTS AND DISCUSSION

4.1. Datasets

Synth  As a proof of concept, we construct sine waves in a geometric progression over the frequency range of the target filterbank.

Speech  The North Texas vowel database (NTVOW) [29] contains utterances of 12 English vowels from 50 American speakers, including children aged three to seven as well as male and female adults. In total, it consists of 3190 recordings, each lasting between one and three seconds.

Music  The TinySOL dataset [30] contains isolated musical notes played by eight instruments: accordion, alto saxophone, bassoon, flute, harp, trumpet in C, and cello. For each of these instruments, we take all available pitches in the tessitura (min = B0, median = E4, max = C♯8) in three levels of intensity dynamics: pp, mf, and ff. This results in a total of 1212 audio recordings.

Urban  The SONYC Urban Sound Tagging dataset (SONYC-UST) [31] contains 2803 acoustic scenes from a network of autonomous sensors in New York City. Each of these ten-second scenes contains one or several sources of urban noise pollution, such as engines, machinery and non-machinery impacts, powered saws, alert signals, and dog barks.

Figure 3: Compared impulse responses of Conv1D (left), Gabor1D (center), and MuReNN (right) with different center frequencies after convergence, with a Gammatone filterbank as target. Solid blue (resp. dashed red) lines denote the real part of the impulse responses of the learned filters (resp. target). See Section 4.3 for details.

Figure 4: Distribution of Heisenberg time–frequency ratios for each teacher–student pair (lower is better). See Section 4.3 for details.

4.2. Benchmarks

For each audio domain, we randomly split its corresponding dataset into training, testing and validation subsets with an 8:1:1 ratio. During training, we select 2^12 time samples from the middle part of each signal, i.e., the FIR length of the filters in the teacher filterbank. We train each model for 100 epochs with an epoch size of 8000.

Table 1 summarizes our findings. On all three benchmarks, we observe that MuReNN reaches state-of-the-art performance, as measured in terms of cosine distance with respect to the teacher filterbank after 100 epochs. The improvement with respect to Conv1D is most noticeable in the Synth benchmark and least noticeable in the Speech benchmark. Furthermore, Figure 2 indicates that Gabor1D barely trains at all: this observation is consistent with the sensitivity of LEAF with respect to initialization, as reported in [32]. We also notice that MuReNN trains faster than Conv1D on all benchmarks except for Urban, a phenomenon deserving further inquiry.

4.3. Error analysis

The mel-scale initialization of Gabor1D filters and the inductive bias of MuReNN enabled by octave localization give a starting advantage when learning filterbanks on logarithmic frequency scales, as used for the Gammatone and VQT filterbanks. Expectedly, this advantage is absent with a teacher filterbank that does not follow a geometric progression of center frequencies, as is the case for the ANSI scale. Figure 2 reflects these observations.

To examine the individual filters of each model, we take the speech domain as an example and obtain their learned impulse responses. Figure 3 visualizes chosen examples at different frequencies learned by each model, together with the corresponding teacher Gammatone filters. In general, all models are able to fit the filter responses well. However, it is noticeable that the prescribed envelope of Gabor1D impedes it from learning the asymmetric target Gammatone filters. This becomes especially prominent at high frequencies. From the strong envelope mismatches at coinciding frequencies, we may deduce that center frequencies and bandwidths did not play well together during training. On the contrary, MuReNN and Conv1D are flexible enough to learn asymmetric temporal envelopes without compromising their regularity in time. Although the learned filters of Conv1D are capable of fitting the frequencies well, they suffer from noisy artifacts, especially outside their essential supports. Indeed, by limiting the scale and support of the learned filters, MuReNN restrains the introduction of the high-frequency noise that a learned filter of longer length is prone to. The phase misalignment at low frequencies is a natural consequence of the fact that the gradients are computed from the magnitudes of the filterbank responses.

Finally, we measure the time–frequency localization of all filters by computing the associated Heisenberg time–frequency ratios [33]. From theory, we know that Gaussian windows are optimal in this sense [34]. Therefore, it is not surprising that Gabor1D yields the best-localized filters, even outperforming the teacher; see Figure 4. Expectedly, the localization of the filters from Conv1D is poor and appears independent of the teacher. MuReNN roughly resembles the localization of the teachers but has some poorly localized outliers at higher frequencies, deserving further inquiry.
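The paper defers to [33] for the exact definition of this ratio. As a rough illustration only, the sketch below estimates a time–frequency spread product for a discrete filter; this is our own formulation, with frequencies in normalized units and no unwrapping of the spectrum, so it is only indicative for filters whose passband lies well below Nyquist.

```python
import torch

def heisenberg_ratio(h, eps=1e-12):
    """Indicative time-frequency localization of a 1-D filter h: product of
    its temporal spread and spectral spread. Gaussian envelopes minimize it."""
    p_t = h.abs() ** 2
    p_t = p_t / (p_t.sum() + eps)                          # energy distribution in time
    t = torch.arange(h.numel(), dtype=torch.float32)
    sigma_t = torch.sqrt((p_t * (t - (p_t * t).sum()) ** 2).sum())

    p_f = torch.fft.fft(h).abs() ** 2
    p_f = p_f / (p_f.sum() + eps)                          # energy distribution in frequency
    f = torch.arange(h.numel(), dtype=torch.float32) / h.numel()
    sigma_f = torch.sqrt((p_f * (f - (p_f * f).sum()) ** 2).sum())
    return sigma_t * sigma_f
```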
5. CONCLUSION

Multiresolution neural networks (MuReNN) have the potential to advance waveform-based deep learning. They offer a flexible and data-driven procedure for learning filters which are "wavelet-like": i.e., narrowband with compact support, vanishing moments, and quasi-Hilbert analyticity. Our experiments, based on knowledge distillation from three domains (speech, music, and urban sounds), illustrate the suitability of MuReNN for real-world applications. The main limitation of MuReNN lies in the need to specify a number of filters per octave M_j, together with a kernel size L_j. Still, a promising finding of our paper is that prior knowledge on M_j and L_j suffices to finely approximate non-Gabor auditory filterbanks, such as Gammatones on an ERB scale, from a random i.i.d. Gaussian initialization. Future work will evaluate MuReNN in conjunction with a deep neural network for sample-efficient audio classification.

6. ACKNOWLEDGMENT

V.L. thanks Fergal Cotter and Nick Kingsbury for maintaining the dtcwt and pytorch_wavelets libraries; LS2N and ÖAW staff for arranging research visits; and Neil Zeghidour for helpful discussions. D.H. thanks Clara Hollomey for helping with the implementation of the filterbanks. V.L. and M.L. are supported by ANR MuReNN; D.H., by a DOC Fellowship of the Austrian Academy of Sciences (A 26355); P.B., by FWF projects LoFT (P 34624) and NoMASP (P 34922); and M.E., by WWTF project CHARMED (VRG12-009).
7. REFERENCES

[1] R. F. Lyon, Human and Machine Hearing: Extracting Meaning from Sound. Cambridge University Press, 2017.
[2] R.-A. Knight and J. Setter, The Cambridge Handbook of Phonetics. Cambridge University Press, 2021.
[3] B. R. Glasberg and B. C. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1, pp. 103–138, 1990.
[4] J. C. Brown, "Calculation of a constant-Q spectral transform," J. Acoust. Soc. Am., vol. 89, no. 1, pp. 425–434, 1991.
[5] M. Dörfler, T. Grill, R. Bammer, and A. Flexer, "Basic filters for convolutional neural networks applied to music: Training or design?" Neural Comput. Appl., vol. 32, pp. 941–954, 2020.
[6] D. Stowell, "Computational bioacoustics with deep learning: A review and roadmap," PeerJ, vol. 10, p. e13152, 2022.
[7] J. P. Bello, C. Silva, O. Nov, R. L. Dubois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, "SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution," Communications of the ACM, vol. 62, no. 2, pp. 68–77, 2019.
[8] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao, "Deep learning and its applications to machine health monitoring," Mechanical Systems and Signal Processing, vol. 115, pp. 213–237, 2019.
[9] P. Bizopoulos and D. Koutsouris, "Deep learning in cardiology," IEEE Rev. Biomed. Eng., vol. 12, pp. 168–193, 2018.
[10] F. J. Bravo Sanchez, M. R. Hossain, N. B. English, and S. T. Moore, "Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture," Scientific Reports, vol. 11, no. 1, pp. 1–12, 2021.
[11] M. Faiß, "Adaptive representations of sound for automatic insect recognition," Master's thesis, Naturalis Biodiversity Center, 2022.
[12] F. Lluís, J. Pons, and X. Serra, "End-to-end music source separation: Is it possible in the waveform domain?" arXiv preprint arXiv:1810.12187, 2018.
[13] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. IEEE SLT, 2018.
[14] N. Zeghidour, O. Teboul, F. de Chaumont Quitry, and M. Tagliasacchi, "LEAF: A learnable frontend for audio classification," in Proc. ICML, 2021.
[15] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proc. ICML, 2013, pp. 1139–1147.
[16] J. Schlüter and G. Gutenbrunner, "EfficientLEAF: A faster learnable audio frontend of questionable use," in Proc. EUSIPCO, 2022, pp. 205–208.
[17] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux, "Learning filterbanks from raw speech for phone recognition," in Proc. IEEE ICASSP, 2018.
[18] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[19] I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury, "The dual-tree complex wavelet transform," IEEE Signal Process. Mag., vol. 22, no. 6, pp. 123–151, 2005.
[20] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018, pp. 3918–3926.
[21] F. Cotter, "Uses of complex wavelets in deep convolutional neural networks," Ph.D. dissertation, University of Cambridge, 2020.
[22] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, no. 3, pp. 750–753, 1983.
[23] T. Necciari, N. Holighaus, P. Balazs, Z. Průša, P. Majdak, and O. Derrien, "Audlet filter banks: A versatile analysis/synthesis framework using auditory frequency scales," Applied Sciences, vol. 8, no. 1, 2018.
[24] C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, "A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution," in Proc. AES, 2014.
[25] J. Antoni, "Orthogonal-like fractional-octave-band filters," J. Acoust. Soc. Am., vol. 127, no. 2, pp. 884–895, 2010.
[26] K. W. Cheuk, H. Anderson, K. Agres, and D. Herremans, "nnAudio: An on-the-fly GPU audio to spectrogram conversion toolbox using 1D convolutional neural networks," IEEE Access, vol. 8, pp. 161981–162003, 2020.
[27] B. McFee, M. McVicar, D. Faronbi, I. Roman, M. Gover, S. Balke, S. Seyfarth, A. Malek, C. Raffel, V. Lostanlen, et al., "librosa/librosa: 0.10.0.post2," Mar. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7746972
[28] Z. Průša, P. Søndergaard, P. Balazs, and N. Holighaus, "LTFAT: A Matlab/Octave toolbox for sound processing," in Proc. CMMR, 2013, pp. 299–314.
[29] P. F. Assmann and W. F. Katz, "Time-varying spectral change in the vowels of children and adults," J. Acoust. Soc. Am., vol. 108, no. 4, pp. 1856–1866, 2000.
[30] C. E. Cella, D. Ghisi, V. Lostanlen, F. Lévy, J. Fineberg, and Y. Maresz, "OrchideaSOL: A dataset of extended instrumental techniques for computer-aided orchestration," in Proc. ICMC, 2020.
[31] M. Cartwright, A. E. Mendez Mendez, G. Dove, J. Cramer, V. Lostanlen, H.-H. Wu, J. Salamon, O. Nov, and J. P. Bello, "SONYC Urban Sound Tagging (SONYC-UST): A multilabel dataset from an urban acoustic sensor network," in Proc. DCASE, 2019.
[32] M. Anderson, T. Kinnunen, and N. Harte, "Learnable frontends that do not learn: Quantifying sensitivity to filterbank initialisation," in Proc. IEEE ICASSP, 2023.
[33] S. Mallat, A Wavelet Tour of Signal Processing. Elsevier, 1999.
[34] K. Gröchenig, Foundations of Time-Frequency Analysis, ser. Appl. Numer. Harmon. Anal. Boston, MA: Birkhäuser, 2001.
