Vincent Lostanlen¹, Daniel Haider²,³, Han Han¹, Mathieu Lagrange¹, Peter Balazs², and Martin Ehler³

¹ Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France.
² Acoustics Research Institute, Austrian Academy of Sciences, A-1040 Vienna, Austria.
³ University of Vienna, Department of Mathematics, A-1090 Vienna, Austria.
arXiv:2307.13821v1 [cs.SD] 25 Jul 2023
Furthermore, it contrasts with other domains of deep learning (e.g., image processing) in which all convnet layers are simply initialized with i.i.d. Gaussian weights [15].

Prior work on this problem has focused on advancing the state of the art on a given task, sometimes to no avail [16]. In this article, we take a step back and formulate a different question: before we try to outperform an auditory filterbank, can we replicate its responses with a neural audio model? To answer this question, we compare different "student" models in terms of their ability to learn from a black-box function or "teacher" by knowledge distillation (KD).

Given an auditory filterbank Λ and a discrete-time signal x of length T, let us denote the squared magnitude of the filter response at frequency bin f by Y[f, t] = |Λx|²[f, 2^J t], where 2^J is the chosen hop size or "stride". Then, given a model Φ_W with weights W, we evaluate the dissimilarity between teacher Λ and student Φ_W as their spectrogram-based cosine distance L_x(W). This distance can be computed as a squared Euclidean distance after normalizing across frequency bins f, independently for each time t. Let |Φ̃_W x|² and Ỹ denote these normalized versions of student and teacher; then

    L_x(W) = cosdist(|Φ_W x|², Y) = (1/2) ∑_{t=1}^{T/2^J} ∑_{f=1}^{F} (|Φ̃_W x|²[f, t] − Ỹ[f, t])²,    (1)

where F is the number of filters. We seek to minimize the quantity above by gradient-based optimization on W, on a real-world dataset of audio signals {x₁, …, x_N}, and with no prior knowledge on Λ.

2. NEURAL AUDIO MODELS

2.1. Learnable time-domain filterbanks (Conv1D)

As a baseline, we train a 1-D convnet Φ_W with F kernels of the same length 2L. With a constant stride of 2^J, Φ_W x writes as

    Φ_W x[f, t] = (x ∗ ϕ_f)[2^J t] = ∑_{τ=−L}^{L−1} x[2^J t − τ] ϕ_f[τ],    (2)

where x is padded by L samples at both ends. Under this setting, the trainable weights W are the finite impulse responses of ϕ_f for all f, thus amounting to 2LF parameters. We initialize W with i.i.d. Gaussian entries of null mean and variance 1/√F.

2.2. Gabor 1-D convolutions (Gabor1D)

As a representative of the state of the art (i.e., LEAF [14]), we train a Gabor filtering layer, or Gabor1D for short. For this purpose, we parametrize each FIR filter ϕ_f as a Gabor filter; i.e., an exponential sine wave of amplitude a_f and frequency η_f, modulated by a Gaussian envelope of width σ_f. Hence a new definition:

    ϕ_f[τ] = (a_f / (√(2π) σ_f)) exp(−τ² / (2σ_f²)) exp(2πiη_f τ).    (3)

Under this setting, the trainable weights W amount to only 3F parameters: W = {a₁, σ₁, η₁, …, a_F, σ_F, η_F}. Following LEAF, we initialize center frequencies η_f and bandwidths σ_f so as to form a mel-frequency filterbank [17] and set all amplitudes a_f to one. We use the implementation of Gabor1D from SpeechBrain v0.5.14 [18].

2.3. Multiresolution neural network (MuReNN)

As our original contribution, we train a multiresolution neural network, or MuReNN for short. MuReNN comprises two stages, multiresolution approximation (MRA) and convnet, of which only the latter is learned from data. We implement the MRA with a dual-tree complex wavelet transform (DTCWT) [19]. The DTCWT relies on a multirate filterbank in which each wavelet ψ_j has a null average and a bandwidth of one octave. Denoting by ξ the sampling rate of x, the wavelet ψ_j has a bandwidth with cutoff frequencies 2^{−(j+1)}π and 2^{−j}π. Hence, we may subsample the result of the convolution (x ∗ ψ_j) by a factor of 2^j, yielding:

    ∀j ∈ {0, …, J − 1},  x_j[t] = (x ∗ ψ_j)[2^j t],    (4)

where J is the number of multiresolution levels. We take J = 9 in this paper, which roughly coincides with the number of octaves in the hearing range of humans. The second stage in MuReNN consists in defining convnet filters ϕ_f. Unlike in the Conv1D setting, those filters do not operate over the full-resolution input x but over one of its MRA levels x_j. More precisely, let us denote by j[f] the decomposition level assigned to filter f, and by 2L_j the kernel size for that decomposition level. We convolve x_{j[f]} with ϕ_f and apply a subsampling factor of 2^{J−j[f]}, hence:

    Φ_W x[f, t] = (x_{j[f]} ∗ ϕ_f)[2^{J−j[f]} t] = ∑_{τ=−L_j}^{L_j−1} x_{j[f]}[2^{J−j[f]} t − τ] ϕ_f[τ].    (5)

The two stages of subsampling in Equations 4 and 5 result in a uniform downsampling factor of 2^J for Φ_W x. Each learned FIR filter ϕ_f has an effective receptive field size of 2^{j[f]+1} L_{j[f]}, thanks to the subsampling operation in Equation 4. This resembles a dilated convolution [20] with a dilation factor of 2^{j[f]}, except that the DTCWT guarantees the absence of aliasing artifacts.

Besides this gain in frugality, as measured by parameter count per unit of time, the resort to an MRA offers the opportunity to introduce desirable mathematical properties in the non-learned part of the transform (namely, the ψ_j) and have the MuReNN operator Φ_W inherit them, without the need for a non-random initialization or for regularization during training. In particular, Φ_W has at least as many vanishing moments as ψ_{j[f]}. Furthermore, the DTCWT yields quasi-analytic coefficients: for each j, x_j = x_j^R + i x_j^I with x_j^I ≈ H(x_j^R), where the superscript R (resp. I) denotes the real part (resp. imaginary part) and H denotes the Hilbert transform. Since ϕ_f is real-valued, the same property holds for MuReNN: Φ^I x = H(Φ^R x).

We implement MuReNN on GPU via a custom implementation of the DTCWT in PyTorch¹. Following [19], we use a biorthogonal wavelet for j = 0 and quarter-shift wavelets for j ≥ 1. We set L_j = 8M_j, where M_j is the number of filters f at resolution j. We refer to [21] for an introduction to deep learning in the wavelet domain, with applications to image classification.

¹ https://github.com/kymatio/murenn

3. KNOWLEDGE DISTILLATION

3.1. Target auditory filterbanks

For each of the three different domains (speech, music, and urban environmental sounds), we use an auditory filterbank Λ that is tailored to its respective spectrotemporal characteristics.
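To make Equations 4 and 5 concrete, here is a minimal PyTorch sketch of the two MuReNN stages. It is a hypothetical re-implementation, not the authors' code: the fixed wavelets ψ_j are stood in by zero-mean random FIRs rather than a DTCWT, and J = 4 levels, M = 3 filters per level, and half kernel size L = 8 are toy sizes chosen for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
J, M, L = 4, 3, 8      # levels, filters per level, half kernel size
T = 2 ** 12            # input length

# Stage 1 (Equation 4): fixed octave-band decomposition. Stand-in
# "wavelets": zero-mean random FIRs of dyadic lengths, not a DTCWT.
psis = [torch.randn(1, 1, 2 ** (j + 2)) for j in range(J)]
psis = [p - p.mean() for p in psis]   # null average, like a wavelet

def mra(x):
    # x: (batch, 1, T) -> [x_0, ..., x_{J-1}], x_j subsampled by 2**j
    xs = []
    for j, psi in enumerate(psis):
        xj = F.conv1d(x, psi, padding=psi.shape[-1] // 2)
        xs.append(xj[..., :: 2 ** j])
    return xs

# Stage 2 (Equation 5): learned filters phi_f of length 2L on level j,
# with stride 2**(J - j), so every channel is subsampled by 2**J overall.
convs = torch.nn.ModuleList(
    torch.nn.Conv1d(1, M, 2 * L, stride=2 ** (J - j),
                    padding=L, bias=False)
    for j in range(J)
)

def murenn_forward(x):
    return torch.cat([conv(xj) for conv, xj in zip(convs, mra(x))], dim=1)

y = murenn_forward(torch.randn(1, 1, T))
```

Each learned kernel at level j covers 2L samples at a rate divided by 2^j, i.e., an effective receptive field of 2^{j+1}L samples, which mirrors the dilated-convolution analogy of Section 2.3.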
2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 22-25, 2023, New Paltz, NY
Table 1: Mean and standard deviation of the test loss after knowledge distillation, over five independent trials. Each column corresponds to a different neural audio model Φ_W, while each row corresponds to a different auditory filterbank and audio domain. See Section 4.2 for details.
Figure 2: Left to right: evolution of validation losses on different domains with Conv1D (green), Gabor1D (blue), and MuReNN (orange), as a
function of training epochs. The shaded area denotes the standard deviation across five independent trials. See Section 4.2 for details.
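The validation losses plotted in Figure 2 are the spectrogram-based cosine distances of Equation 1. A minimal sketch in PyTorch, assuming squared-magnitude spectrograms of hypothetical shape (F, T/2^J) = (40, 128):

```python
import torch

# Equation 1: half the squared Euclidean distance between
# squared-magnitude spectrograms, after L2-normalizing each
# time frame across frequency bins.
def cosine_distance(student_sq, teacher_sq, eps=1e-12):
    # Normalize across frequency bins f, independently for each t.
    s = student_sq / (student_sq.norm(dim=0, keepdim=True) + eps)
    y = teacher_sq / (teacher_sq.norm(dim=0, keepdim=True) + eps)
    return 0.5 * (s - y).pow(2).sum()

torch.manual_seed(0)
student = torch.rand(40, 128)   # |Phi_W x|^2, hypothetical F = 40 bins
teacher = torch.rand(40, 128)   # Y = |Lambda x|^2, 128 hops
loss = cosine_distance(student, teacher)
```

With unit-norm frames, 0.5‖s − y‖² equals 1 − cos(s, y) summed over frames, hence the name "cosine distance".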
Synth A constant-Q filterbank with Q = 8 filters per octave, covering eight octaves with Hann-modulated sine waves.

Speech A filterbank with fourth-order Gammatone filters tuned to the ERB scale, a frequency scale which is adapted to the equivalent rectangular bandwidths of the human cochlea [22]. In psychoacoustics, Gammatone filters provide a good approximation to measured responses of the filters of the human basilar membrane [3]. Unlike Gabor filters, Gammatone filters are asymmetric, both in the time domain and in the frequency domain. We refer to [23] for implementation details.

Music A variable-Q transform (VQT) with M_j = 12 frequency bins per octave at every level. The VQT is a variant of the constant-Q transform (CQT) in which Q is decreased gradually towards lower frequencies [24], hence an improved temporal resolution at the expense of frequency resolution.

Urban A third-octave filterbank inspired by the ANSI S1.11-2004 standard for environmental noise monitoring [25]. In this filterbank, center frequencies are not exactly in a geometric progression. Rather, they are aligned with integer Hertz values: 40, 50, 60; 80, 100, 120; 160, 200, 240; and so forth.

We construct the Synth teacher via nnAudio [26], a PyTorch port of librosa [27]; and the Speech, Music, and Urban teachers using the Large Time–Frequency Analysis Toolbox (LTFAT) for MATLAB [28].

3.2. Gradient-based optimization

For all four "student" models, we initialize the vector W at random and update it iteratively by empirical risk minimization over the training set. We rely on the Adam algorithm for stochastic optimization with default momentum parameters. Given the definition of the spectrogram-based cosine distance in Equation 1, we perform reverse-mode automatic differentiation in PyTorch to obtain

    ∇L_x(W)[i] = ∑_{t=1}^{T/2^J} ∑_{f=1}^{F} (∂|Φ̃_W x|²[f, t] / ∂W[i])(W) × (|Φ̃_W x|²[f, t] − Ỹ[f, t])    (6)

for each entry W[i]. Note that the gradient above does not involve the phases of the teacher filterbank Λ, only its normalized magnitude response Ỹ given the input x. Consequently, even though our models Φ_W contain a single linear layer, the associated knowledge distillation procedure is nonconvex, and thus resembles the training of a deep neural network.

4. RESULTS AND DISCUSSION

4.1. Datasets

Synth As a proof of concept, we construct sine waves in a geometric progression over the frequency range of the target filterbank.

Speech The North Texas vowel database (NTVOW) [29] contains utterances of 12 English vowels from 50 American speakers, including children aged three to seven as well as male and female adults. In total, it consists of 3190 recordings, each lasting between one and three seconds.

Music The TinySOL dataset [30] contains isolated musical notes played by eight instruments: accordion, alto saxophone, bassoon, flute, harp, trumpet in C, and cello. For each of these instruments, we take all available pitches in the tessitura (min = B0, median = E4, max = C♯8) in three levels of intensity dynamics: pp, mf, and ff. This results in a total of 1212 audio recordings.

Urban The SONYC Urban Sound Tagging dataset (SONYC-UST) [31] contains 2803 acoustic scenes from a network of autonomous sensors in New York City. Each of these ten-second scenes contains one or several sources of urban noise pollution, such as engines, machinery and non-machinery impacts, powered saws, alert signals, and dog barks.
4.2. Benchmarks

For each audio domain, we randomly split its corresponding dataset into training, testing, and validation subsets with an 8:1:1 ratio. During training, we select 2^12 time samples from the middle part of each signal, i.e., the FIR length of the filters in the teacher filterbank. We train each model for 100 epochs with an epoch size of 8000.

Table 1 summarizes our findings. On all three benchmarks, we observe that MuReNN reaches state-of-the-art performance, as measured in terms of cosine distance with respect to the teacher filterbank after 100 epochs. The improvement with respect to Conv1D is most noticeable in the Synth benchmark and least noticeable in the Speech benchmark. Furthermore, Figure 2 indicates that Gabor1D barely trains at all: this observation is consistent with the sensitivity of LEAF with respect to initialization, as reported in [32]. We also notice that MuReNN trains faster than Conv1D on all benchmarks except for Urban, a phenomenon deserving further inquiry.

4.3. Error analysis

The mel-scale initialization of Gabor1D filters and the inductive bias of MuReNN enabled by octave localization give a starting advantage when learning filterbanks on logarithmic frequency scales, as used for the Gammatone and VQT filterbanks. Expectedly, this advantage is absent with a teacher filterbank that does not follow a geometric progression of center frequencies, as is the case for the ANSI scale. Figure 2 reflects these observations.

To examine the individual filters of each model, we take the speech domain as an example and inspect their learned impulse responses. Figure 3 visualizes chosen examples at different frequencies learned by each model, together with the corresponding teacher Gammatone filters. In general, all models are able to fit the filter responses well. However, it is noticeable that the prescribed envelope of Gabor1D impedes it from learning the asymmetric target Gammatone filters. This becomes especially prominent at high frequencies. From the strong envelope mismatches at coinciding frequencies, we may deduce that center frequencies and bandwidths did not play well together during training. On the contrary, MuReNN and Conv1D are flexible enough to learn asymmetric temporal envelopes without compromising their regularity in time. Although the learned filters of Conv1D are capable of fitting the frequencies well, they suffer from noisy artifacts, especially outside their essential supports. Indeed, by limiting the scale and support of the learned filters, MuReNN restrains the introduction of high-frequency noise that a learned filter of longer length would allow. The phase misalignment at low frequencies is a natural consequence of the fact that the gradients are computed from the magnitudes of the filterbank responses.

Finally, we measure the time–frequency localization of all filters by computing the associated Heisenberg time–frequency ratios [33]. From theory, we know that Gaussian windows are optimal in this sense [34]. Therefore, it is not surprising that Gabor1D yields the best-localized filters, even outperforming the teacher; see Figure 4. Expectedly, the localization of the filters from Conv1D is poor and appears independent of the teacher. MuReNN roughly resembles the localization of the teachers but has some poorly localized outliers at higher frequencies, deserving further inquiry.

5. CONCLUSION

Multiresolution neural networks (MuReNN) have the potential to advance waveform-based deep learning. They offer a flexible and data-driven procedure for learning filters which are "wavelet-like"; i.e., narrowband with compact support, vanishing moments, and quasi-Hilbert analyticity. Our experiments based on knowledge distillation from three domains (speech, music, and urban sounds) illustrate the suitability of MuReNN for real-world applications. The main limitation of MuReNN lies in the need to specify a number of filters per octave M_j, together with a kernel size L_j. Still, a promising finding of our paper is that prior knowledge on M_j and L_j suffices to finely approximate non-Gabor auditory filterbanks, such as Gammatones on an ERB scale, from a random i.i.d. Gaussian initialization. Future work will evaluate MuReNN in conjunction with a deep neural network for sample-efficient audio classification.

6. ACKNOWLEDGMENT

V.L. thanks Fergal Cotter and Nick Kingsbury for maintaining the dtcwt and pytorch_wavelets libraries; LS2N and ÖAW staff for arranging research visits; and Neil Zeghidour for helpful discussions. D.H. thanks Clara Hollomey for helping with the implementation of the filterbanks. V.L. and M.L. are supported by ANR MuReNN; D.H., by a DOC Fellowship of the Austrian Academy of Sciences (A 26355); P.B., by FWF projects LoFT (P 34624) and NoMASP (P 34922); and M.E., by WWTF project CHARMED (VRG12-009).