
TWO-STEP KNOWLEDGE DISTILLATION FOR TINY SPEECH ENHANCEMENT

Rayan Daod Nathoo∗, Mikolaj Kegler∗, Marko Stamenovic

Bose Corporation, USA

ABSTRACT

Tiny, causal models are crucial for embedded audio machine learning applications. Model compression can be achieved via distilling knowledge from a large teacher into a smaller student model. In this work, we propose a novel two-step approach for tiny speech enhancement model distillation. In contrast to the standard approach of a weighted mixture of distillation and supervised losses, we first pre-train the student using only the knowledge distillation (KD) objective, after which we switch to a fully supervised training regime. We also propose a novel fine-grained similarity-preserving KD loss, which aims to match the student's intra-activation Gram matrices to those of the teacher. Our method demonstrates broad improvements, but particularly shines in adverse conditions including high compression and low signal-to-noise ratios (SNR), yielding signal-to-distortion ratio gains of 0.9 dB and 1.1 dB, respectively, at -5 dB input SNR and 63× compression compared to baseline.

Index Terms— speech enhancement, knowledge distillation, tinyML, model compression

1. INTRODUCTION

In recent years, deep neural network (DNN) models have become a common approach to the speech enhancement (SE) problem, due to their performance [1, 2, 3]. However, large, powerful models are often unsuitable for resource-constrained platforms, like hearing aids or wearables, because of their memory footprint, computational latency, and power consumption [2, 4, 5, 6]. To meet these constraints, audio TinyML research tends to focus on designing model architectures with small numbers of parameters, on using model compression techniques to reduce the size of large models, or on both [4, 5, 6, 7].

Pruning is a popular method for reducing the size of DNN models for SE [4, 5, 6, 8]. The goal of pruning is to remove the weights that contribute least to model performance. In its simplest form, this can be performed post-training by removing the weights with the lowest magnitudes. Online pruning, where the model is trained and pruned concurrently, builds on post-training pruning by exposing the model to pruning errors during training, allowing it to adapt to this form of compression noise [4]. Unstructured pruning of individual weights can yield impressive model size reduction with little performance sacrifice, but corresponding savings in computational throughput are not possible without hardware support for sparse inference, which is unusual in embedded hardware. Structured pruning of blocks of weights and/or neurons is often designed with broader hardware compatibility in mind, but the performance drop tends to be larger than in the unstructured case [6].

In contrast to pruning, knowledge distillation (KD) adopts a different framework. The goal of KD is to utilize a strong pre-trained teacher model to guide the training of the smaller student [9, 10, 11]. The underlying assumption is that the pre-trained teacher network offers additional useful context compared to the ground truth data by itself. Unlike pruning, this process does not involve modifying the student network from its original dense form, which reduces the complexity of the deployment process. In this work, we focus on KD due to its above-outlined benefits over pruning.

KD methods have been applied to various classification tasks in the audio domain [12, 13]. However, KD has not been extensively explored for causal low-latency SE, which often requires tiny networks (sub-100k parameters) optimized for low-resource wearable devices, such as hearing aids [5, 6]. So-called response-based KD approaches use the pre-trained teacher model's outputs to train a student network [14, 15]. However, distillation can be further facilitated by taking advantage of the intermediate representations of the two models, not just their outputs [10]. One common obstacle in such feature-based KD is the dimensionality mismatch between teacher and student activations due to the model size difference. To alleviate this issue, [16] proposed aligning intermediate features, while [17] used attention maps to do so. The latter was applied in the context of SE in [18], using considerably large, non-causal student models intended for offline applications. In [19], the authors addressed the dimensionality mismatch problem for causal SE models by using frame-level Similarity-Preserving KD (SPKD) [20]. SPKD captures the similarity patterns between network activations for different training examples and aims to match those patterns between the student and the frozen pre-trained teacher models. The authors of [19] also introduced fusion blocks, analogous to [21], to distill relationships between consecutive layers.

Here, we show that the efficacy of conventional KD methods is limited for tiny, causal SE models. To improve distillation efficacy, we first extend the method from [19] by computing SPKD for each bin of the latent representations, corresponding to the time frame (as in [19]) but also the frequency bin of the input, thus providing more resolution for KD loss optimization. The proposed extension outperforms the other similarity-based KD methods that we also explore. Second, we hypothesize that matching a large teacher model might be challenging for small student models and may thus lead to sub-optimal performance. Inspired by [22], we propose a novel two-step framework for distilling tiny SE models. In the first step, the student is pre-trained using only the KD criterion to match the activation patterns of the teacher, with no additional ground-truth supervision. The goal of this unsupervised KD pre-training is to make the student similar to the teacher prior to the main training. Then, the pre-trained student model is further optimized in a supervised fashion and/or using KD routines. We find that pre-training with the proposed SPKD method at the level of individual bins of the latent representation, followed by fully supervised training, yields superior performance compared to other distillation approaches utilizing weighted mixtures of KD and supervised losses. We report the performance of our method across various student model sizes and input mixture signal-to-noise ratios (SNRs), and finally assess the similarity between the activation patterns of the teacher and the distilled student.

∗ These authors contributed equally to this work.
Fig. 1: (a) Distillation process overview. (b) Self-Similarity Gram matrices computation. (c) Flow matrices computation; the operator shown in the figure denotes matrix multiplication. Note that, for clarity, transpositions and matrix multiplications are applied only to the last two dimensions of each tensor.

2. METHODS

2.1. Model architecture

Our backbone architecture for the exploration of tiny SE KD is the Convolutional Recurrent U-Net for SE (CRUSE) topology [7]. However, note that the methodology developed here can, in principle, be applied to any other architecture. The CRUSE model operates in the time-frequency domain and takes power-law compressed log-mel spectrograms (LMS) as input. The LMS is obtained by taking the magnitude of the complex short-time Fourier transform (STFT, 512/256 samples frame/hop size), processing it through a Mel-space filterbank (80 bins, covering the 50-8k Hz range), and finally compressing the result by raising it to the power of 0.3. The model output is a real-valued time-frequency mask bounded within the range (0, 1) through the sigmoid activation of the final block. The mask is applied to the noisy model input and reconstituted into the time domain using the inverse STFT and the noisy input phase.

The model comprises four encoder/decoder blocks and a bottleneck with grouped GRU units (4 groups), reducing the computational complexity compared to a conventional GRU layer with the same number of units [23]. The encoder/decoder blocks are composed of 2D convolution/transpose-convolution layers with (2, 3) kernels (time, frequency) and (1, 2) strides, followed by cumulative layer normalization [24] and leaky ReLU activation (α = 0.2). To further reduce the model complexity, the skip connections between the encoder and decoder used in the classic U-Net are replaced with 1x1 convolutions, whose outputs are summed into the decoder inputs. We enforce the model's frame-level causality by using causal convolutions and causal cumulative layer norms. The total algorithmic latency of the model is 32 ms (single STFT frame) [2].

In our experiments, both teacher and student are CRUSE models, and their sizes are adjusted by changing the number of units in each block. In particular, the teacher uses {32, 64, 128, 192} encoder/decoder channels and 960 bottleneck GRU units, resulting in 1.9M parameters and 13.34 MOps/frame (i.e. the number of operations required to process a single STFT frame). Our default student uses {8, 16, 32, 32} encoder/decoder channels and 160 bottleneck GRU units, resulting in 62k parameters (3.3% of the teacher) and 0.84 MOps/frame (6.3% of the teacher).
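For concreteness, the following is a minimal PyTorch/torchaudio sketch of the feature front-end and the mask-based reconstruction described above. The function names, the Hann window choice, and the assumption that the mask is applied at STFT resolution are ours, not details taken from the paper; treat this as an illustration rather than the authors' implementation.

```python
import torch
import torchaudio

# Front-end constants from Section 2.1: 512/256 frame/hop, 80 mel bins covering
# 50-8000 Hz, power-law compression exponent 0.3 (structure and names assumed).
N_FFT, HOP, N_MELS, SAMPLE_RATE = 512, 256, 80, 16000

mel_scale = torchaudio.transforms.MelScale(
    n_mels=N_MELS, sample_rate=SAMPLE_RATE, f_min=50.0, f_max=8000.0,
    n_stft=N_FFT // 2 + 1)
window = torch.hann_window(N_FFT)  # window type is an assumption

def lms_features(noisy_wav):
    """Noisy waveform -> power-law compressed mel-magnitude features + complex STFT."""
    spec = torch.stft(noisy_wav, N_FFT, hop_length=HOP, window=window,
                      return_complex=True)      # [..., freq, frames]
    features = mel_scale(spec.abs()).pow(0.3)   # [..., mel, frames]
    return features, spec

def enhance(mask, noisy_spec):
    """Apply a (0, 1)-bounded mask and reconstruct with the noisy phase via iSTFT.
    For simplicity, the mask is assumed here to match the STFT resolution."""
    return torch.istft(mask * noisy_spec, N_FFT, hop_length=HOP, window=window)
```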
2.2. Self-similarity local knowledge distillation

Inspired by previous work [19, 20], we address the issue of dimensionality mismatch between teacher and student models by computing similarity-based distillation losses. The method captures and compares the relationship between batch items at each layer output, between teacher and student (Fig. 1a, L_KD^local). We refer to this relationship as the self-similarity Gram matrix G_x.

Self-similarity matrices (Fig. 1b) can be computed for an example network latent activation X of shape [b, c, t, f], where b is the batch size, c the channel dimension, t the activation width (corresponding to the input time dimension), and f the activation height (corresponding to the input frequency dimension), as shown in Fig. 1b. The original implementation from [20] involves reshaping X to [b, ctf] and matrix-multiplying it by its transpose X^T to obtain the [b, b] symmetric self-similarity matrix G. Analogously, this operation can be performed for each t or f dimension independently, yielding G_t or G_f matrices of size [t, b, b] or [f, b, b], respectively. Such an increase in granularity improved the KD performance in [19]. Here, we obtain even more detailed intra-activation Gram matrices by considering each (t, f) bin separately, resulting in the G_tf self-similarity matrix with shape [t, f, b, b].

Finally, the local KD loss is computed using self-similarity matrices G_x of any kind x, obtained from the teacher T and the student S, as:

\mathcal{L}_{KD}^{local} = \frac{1}{b^2} \sum_i \left\| G_{x_i}^{T} - G_{x_i}^{S} \right\|_F^2 ,    (1)

where i is the layer index and ‖·‖_F^2 is the squared Frobenius (ℓ2) norm.
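To make the self-similarity bookkeeping concrete, below is a small PyTorch sketch of the G, G_t, and G_tf matrices and the local KD loss of Eq. (1), assuming activations laid out as [b, c, t, f]. The helper names and the omission of the row normalization used in [20] are our simplifications; the sketch follows the textual description above rather than any released code.

```python
import torch

def gram_global(x):
    """G from [20]: whole-activation self-similarity, shape [b, b]."""
    b = x.shape[0]
    xf = x.reshape(b, -1)
    return xf @ xf.T  # SPKD [20] additionally row-normalizes G; omitted here.

def gram_t(x):
    """G_t: per-time-frame self-similarity, shape [t, b, b]."""
    b, c, t, f = x.shape
    xt = x.permute(2, 0, 1, 3).reshape(t, b, c * f)
    return xt @ xt.transpose(-1, -2)

def gram_tf(x):
    """G_tf: per-(t, f) bin self-similarity, shape [t, f, b, b]."""
    xt = x.permute(2, 3, 0, 1)            # [t, f, b, c]
    return xt @ xt.transpose(-1, -2)      # [t, f, b, b]

def local_kd_loss(teacher_acts, student_acts, gram_fn=gram_tf):
    """Eq. (1) taken literally: sum of squared differences between teacher and
    student Gram matrices over all entries and layers i, scaled by 1 / b^2."""
    b = teacher_acts[0].shape[0]
    loss = 0.0
    for xt, xs in zip(teacher_acts, student_acts):   # iterate over layers i
        loss = loss + ((gram_fn(xt) - gram_fn(xs)) ** 2).sum()
    return loss / b**2
```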
2.3. Information flow knowledge distillation

The above-outlined local similarity losses can be extended to capture relationships between activations of subsequent layers of the teacher and student models (Fig. 1a, L_KD^flow). The method is inspired by the Flow of Solution Procedure (FSP) matrices introduced in [22] and aims to match not only the local similarity between the teacher and student in the corresponding layers, but also global inter-layer relations.

We propose two versions of flow matrices between layers i and j in our model (Fig. 1c). The first one, G_t^{i→j}, leverages the G_t self-similarity matrices. Each self-similarity block shares the t-dimension, and thus the interaction between the layers' self-similarities can be captured by performing matrix multiplication of G_t^i and the transposed G_t^j (both of size [t, b, b]) for each time frame t.

The second version leverages the G_tf self-similarity matrices. However, the f dimension in our model changes for each block due to the strided convolutions. To quantify the relationship between layers i and j of different dimensions, we reshape G_tf to the size [t, b, f_{i/j}, b]. Then, for each time-batch-item pair (t, b), we obtain an [f_{i/j}, b] sub-matrix, which can be matrix-multiplied with its transpose to obtain the flow matrix G_tf^{i→j} of size [t, b, f_i, f_j].

We define the loss similarly to Eq. 1 by comparing the teacher flow matrix G_x^{T, i→j} with the student flow matrix G_x^{S, i→j} of the same kind x, for every 2-layer combination (i, j):

\mathcal{L}_{KD}^{flow} = \frac{1}{b^2} \sum_i \sum_{j>i} \left\| G_{x}^{T,\, i \to j} - G_{x}^{S,\, i \to j} \right\|_F^2    (2)
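A sketch of the two flow-matrix variants under the same layout assumptions as before. The exact multiplication order in G_t^{i→j}, and which batch axis is moved forward when reshaping G_tf to [t, b, f, b], are inferred from the text and should be treated as assumptions.

```python
import torch

def flow_t(g_t_i, g_t_j):
    """G_t^{i->j}: combine per-frame self-similarities of layers i and j.
    g_t_i, g_t_j: [t, b, b] matrices sharing the time dimension."""
    return g_t_i @ g_t_j.transpose(-1, -2)           # [t, b, b]

def flow_tf(g_tf_i, g_tf_j):
    """G_tf^{i->j}: flow between layers whose f dimensions may differ.
    g_tf_i: [t, f_i, b, b], g_tf_j: [t, f_j, b, b] (shared t and batch)."""
    a_i = g_tf_i.permute(0, 2, 1, 3)                 # [t, b, f_i, b]
    a_j = g_tf_j.permute(0, 2, 1, 3)                 # [t, b, f_j, b]
    return a_i @ a_j.transpose(-1, -2)               # [t, b, f_i, f_j]

def flow_kd_loss(teacher_grams, student_grams, flow_fn=flow_tf):
    """Eq. (2): squared Frobenius distance between teacher and student flow
    matrices for every layer pair (i, j) with j > i, scaled by 1 / b^2."""
    b = teacher_grams[0].shape[-1]
    n = len(teacher_grams)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            ft = flow_fn(teacher_grams[i], teacher_grams[j])
            fs = flow_fn(student_grams[i], student_grams[j])
            loss = loss + ((ft - fs) ** 2).sum()
    return loss / b**2
```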

2.4. Training objective and two-step KD

We use phase-sensitive spectrum approximation (PSA) [25], with clean speech as the target, as the supervised portion L_PSA of the total loss. The KD portion L_KD of the total loss does not use the ground-truth objective but instead features obtained from the pre-trained, frozen teacher model. In particular, L_KD can match model outputs (i.e. response distillation, analogous to [15]), the G_x matrices (L_KD^local), or the G_x^{i→j} matrices (L_KD^flow) introduced in Sections 2.2 and 2.3, respectively. L_PSA and L_KD are mixed using the coefficient γ to form the total loss:

\mathcal{L} = \gamma \mathcal{L}_{KD} + (1 - \gamma)\, \mathcal{L}_{PSA}    (3)

Inspired by [22], we propose a two-step KD approach by separating the student distillation process into two distinct parts. Step 1: In the first step, γ = 1 for a fixed set of epochs to solely minimize L_KD. While excluding the supervised L_PSA does not contribute to the optimal objective performance, this step provides strong initial weights for further student model training. Step 2: After this pre-training step, the student is further optimized to maximize its objective performance using a fully supervised loss by setting γ = 0 (L_PSA only), or a weighted L_KD/L_PSA loss obtained by setting γ = 0.5. For one-step KD using a weighted L_KD/L_PSA loss, we set γ = 0.5.
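The mixed objective of Eq. (3) and the two-step γ switch reduce to a few lines. In this sketch, kd_loss and psa_loss are assumed to be computed elsewhere (Eqs. 1-2 and [25]); only the combination of the terms is shown.

```python
def total_loss(kd_loss, psa_loss, gamma):
    """Eq. (3): L = gamma * L_KD + (1 - gamma) * L_PSA."""
    return gamma * kd_loss + (1.0 - gamma) * psa_loss

def gamma_schedule(epoch, pretrain_epochs, step2_gamma=0.0):
    """Two-step KD: gamma = 1 during Step 1 (KD-only pre-training), then a fixed
    gamma for Step 2 (0.0 = fully supervised, 0.5 = weighted KD/PSA mix)."""
    return 1.0 if epoch < pretrain_epochs else step2_gamma
```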
2.5. Training setup

During training, each epoch consists of 5,000 training steps, with each step being a 32-item batch of 2-second-long audio clips. We use the Adam optimizer with a 6·10⁻⁵ learning rate. The teacher model is trained until convergence to ensure the best performance for subsequent distillations. We train each student model for a total of 400 epochs (2M steps, 35.5k+ hours of audio). For student model pre-training (Step 1, Section 2.4), we use 100 epochs with γ = 1, thus excluding the supervised term L_PSA (Eq. 3).
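For reference, the hyper-parameters above collected into a single configuration sketch; the dictionary structure and the helper are our own framing, not the authors' training code.

```python
import torch

# Values from Section 2.5; structure assumed.
TRAIN_CFG = {
    "steps_per_epoch": 5000,
    "batch_size": 32,
    "clip_seconds": 2.0,
    "learning_rate": 6e-5,
    "total_epochs": 400,        # 2M steps overall
    "kd_pretrain_epochs": 100,  # Step 1: gamma = 1 (KD objective only)
}

def make_optimizer(model):
    return torch.optim.Adam(model.parameters(), lr=TRAIN_CFG["learning_rate"])
```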
3. RESULTS

We use the dataset from the Interspeech 2020 Deep Noise Suppression (MS-DNS) Challenge [1] for experimentation, which consists of 500+ hours of clean speech and 100+ hours of noise, all mono clips sampled at 16 kHz. For model training, we mix the speech and noise at various SNR levels, sampled from a uniform distribution between -5 and 15 dB. We employ a LUFS-based SNR calculation for more perceptually relevant mixtures and to de-emphasize the effects of impulsive noises [26]. To evaluate the trained models, we use the non-reverberant evaluation set consisting of 150 clips of noisy speech samples and their respective clean references.

We quantify the performance of each model via Signal-to-Distortion Ratio (SDR) [27], wide-band Perceptual Evaluation of Speech Quality (PESQ) [28], and extended Short-Time Objective Intelligibility (eSTOI) [29]. We also report scores obtained from DNS-MOS P.835 [30], a DNN mean opinion score (MOS) estimator showing a good correlation with subjective ratings. All of our results are reported as improvements over unprocessed noisy inputs (∆).

3.1. Self-similarity local KD

Table 1 shows the efficacy of local similarity-based one-step KD approaches when training student models from scratch. Using the teacher output as L_KD in Eq. 3, or the G similarity [20], does not improve, or even deteriorates, the student performance. The G_t similarity proposed in [19] provides a 0.16 dB SDR improvement over the student alone, alongside the best PESQ score. Our proposed time-frequency similarity calculation method G_tf outperforms G_t by doubling its SDR improvement (+0.34 dB w.r.t. the student alone) and increasing all other scores. This suggests that increasing the granularity of the similarity matrix in the L_KD^local calculation facilitates the KD process and overall improves the performance of the distilled student model.

Table 1: One-step KD for tiny SE. Output: L_KD comparing teacher and student outputs (similar to [15]). G_x: feature-based L_KD using a self-similarity matrix of type x (Fig. 1b). All models are initialized with the same random weights and use γ = 0.5 (Eq. 3).

Model        | ∆SDR (dB) | ∆PESQ (MOS) | ∆eSTOI (%) | ∆DNS-MOS BAK | ∆DNS-MOS OVRL | ∆DNS-MOS SIG
Teacher      | 8.65      | 1.25        | 10.07      | 1.44         | 0.69          | 0.06
Student      | 6.34      | 0.75        | 5.82       | 1.27         | 0.55          | -0.02
Distillation |           |             |            |              |               |
Output [15]  | 6.35      | 0.75        | 5.59       | 1.33         | 0.56          | -0.03
G [20]       | 6.32      | 0.75        | 5.70       | 1.29         | 0.56          | -0.02
G_t [19]     | 6.50      | 0.77        | 5.95       | 1.33         | 0.55          | -0.04
G_f          | 6.47      | 0.74        | 6.03       | 1.29         | 0.56          | -0.02
G_tf (ours)  | 6.68      | 0.77        | 5.99       | 1.36         | 0.57          | -0.04
3.2. Two-step KD

Table 2 presents the results of the proposed two-step distillation process described in Section 2.4. We find that using the time-preserving flow matrices G_t^{i→j} as the L_KD pre-training objective (Step 1) yields comparable or worse performance than using the local similarity G_tf with no pre-training. However, changing the pre-training objective to G_tf^{i→j}, which captures interactions between latent representations in greater detail, yields improvement across nearly all the metrics when paired with G_tf-based KD as Step 2. Most interestingly, pre-training the student with the G_tf criterion and continuing with only the supervised loss L_PSA provides substantial improvements across all the metrics, especially SDR (+0.44 dB w.r.t. the student alone) and eSTOI (+0.56%), suggesting improved intelligibility.

Table 2: Two-step KD. Rows are labeled Step 1 → Step 2. Step 1 - student pre-training using only L_KD (γ = 1) or no pre-training (None). Step 2 - L_PSA: student training with only the PSA loss (γ = 0; supervised); G_tf: loss from Eq. 3 using G_tf-based L_KD and γ = 0.5 (best from Table 1).

Model (Step 1 → Step 2) | ∆SDR (dB) | ∆PESQ (MOS) | ∆eSTOI (%) | ∆DNS-MOS BAK | ∆DNS-MOS OVRL | ∆DNS-MOS SIG
Teacher                 | 8.65      | 1.25        | 10.07      | 1.44         | 0.69          | 0.06
Student                 | 6.34      | 0.75        | 5.82       | 1.27         | 0.55          | -0.02
None → G_tf             | 6.68      | 0.77        | 5.99       | 1.36         | 0.57          | -0.04
G_t^{i→j} → L_PSA       | 6.46      | 0.78        | 6.07       | 1.29         | 0.56          | -0.02
G_t^{i→j} → G_tf        | 6.54      | 0.78        | 5.88       | 1.33         | 0.56          | -0.04
G_tf^{i→j} → L_PSA      | 6.54      | 0.79        | 5.87       | 1.33         | 0.57          | -0.02
G_tf^{i→j} → G_tf       | 6.76      | 0.80        | 6.06       | 1.33         | 0.57          | -0.03
G_tf → L_PSA            | 6.77      | 0.81        | 6.38       | 1.34         | 0.59          | -0.01
G_tf → G_tf             | 6.75      | 0.80        | 6.34       | 1.32         | 0.57          | -0.02
We further investigate our best two-step KD approach by performing Centered Kernel Alignment (CKA) [31] analysis. In principle, CKA allows us to compare the similarity between activation patterns across different models in response to a set of inputs. We use the entire evaluation dataset to probe the models and compute CKA similarities for each pair of layers (averaged over all the audio clips). Fig. 2 (left) presents the CKA similarity between the teacher and the student trained independently. Fig. 2 (middle) compares the teacher to the student pre-trained with the G_tf criterion (only Step 1). As expected, the first step alone increases the similarity between the corresponding teacher and student layers (diagonal). Finally, Fig. 2 (right) shows the best student from Table 2, namely Step 1: G_tf L_KD-only pre-training, and Step 2: fully supervised L_PSA. The overall similarity to the teacher decreases but remains much higher than for the student trained independently. This suggests that a brief pre-training distillation (γ = 1, no L_PSA) allows the student to develop its unique solution starting from strong prior knowledge inherited from the teacher.

Fig. 2: Block-wise CKA similarity between student and teacher networks, averaged over the MS-DNS test set, for three students: trained independently (Mean(diag) = 0.46, Mean(all) = 0.33), after G_tf pre-training only (Mean(diag) = 0.73, Mean(all) = 0.41), and after G_tf pre-training followed by PSA training (Mean(diag) = 0.65, Mean(all) = 0.40). Mean(diag) and Mean(all) denote the average similarity for the corresponding blocks (diagonal) or all the block combinations, respectively.
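For completeness, a compact sketch of linear CKA as defined in [31], which underlies the analysis above; how activations are flattened into [examples, features] matrices before the comparison is not specified in the paper, so that pre-processing is left to the caller here.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape [n_examples, features],
    following Kornblith et al. [31]. Columns (features) are mean-centered first."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x) ** 2   # ||Y^T X||_F^2
    norm_x = torch.linalg.norm(x.T @ x)       # ||X^T X||_F
    norm_y = torch.linalg.norm(y.T @ y)       # ||Y^T Y||_F
    return cross / (norm_x * norm_y)
```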
3.3. Impact of the student model size and mixture SNR

The MS-DNS evaluation dataset consists of relatively high SNRs, between 0 and 19 dB (mean 9.07 dB). To assess the SNR-dependent benefit of the proposed two-step KD approach, we remix the entire evaluation set to obtain mixtures of the same speech and noise clips but at fixed SNRs of -5, 0 and 5 dB. In Table 3, we observe an inverse relationship between the benefit of our approach and the SNR of the noisy mixtures. In particular, for -5 dB SNR mixtures, our KD approach improves student performance by approximately 1 dB SDR, 1.5% eSTOI, and 0.1 DNS-MOS BAK. This is an important observation, as tiny SE models (here, 3.3% of the teacher size) tend to exhibit the most significant performance drop in the low-SNR cases compared to their larger counterparts [5].

Table 3: Evaluation of the two-step KD approach on the MS-DNS dataset remixed at fixed SNRs. Proposed: two-step KD using G_tf pre-training followed by fully-supervised training (best in Table 2).

SNR (dB) | Model    | ∆SDR (dB) | ∆PESQ (MOS) | ∆eSTOI (%) | ∆DNS-MOS BAK | ∆DNS-MOS OVRL | ∆DNS-MOS SIG
-5       | Teacher  | 14.05     | 0.62        | 19.12      | 2.16         | 1.02          | 0.64
-5       | Student  | 10.82     | 0.30        | 10.07      | 1.86         | 0.79          | 0.51
-5       | Proposed | 11.73     | 0.35        | 11.61      | 1.98         | 0.81          | 0.47
0        | Teacher  | 12.30     | 0.92        | 17.83      | 1.99         | 0.98          | 0.40
0        | Student  | 9.65      | 0.49        | 10.56      | 1.75         | 0.75          | 0.26
0        | Proposed | 10.23     | 0.56        | 11.51      | 1.84         | 0.79          | 0.25
5        | Teacher  | 10.27     | 1.21        | 13.98      | 1.65         | 0.78          | 0.02
5        | Student  | 7.97      | 0.69        | 8.58       | 1.44         | 0.59          | -0.10
5        | Proposed | 8.43      | 0.76        | 9.32       | 1.51         | 0.62          | -0.09

In Table 4 we showcase the efficacy of the proposed KD framework for various student sizes using the same teacher. We observe that the smaller the downstream model, the larger the benefit our KD method provides over the student trained alone. In particular, for the 30k-parameter student (∼1.5% of the teacher size), the improvements are the largest, with over 1 dB SDR, 0.1 PESQ, and nearly 2% eSTOI. For model sizes above 200k parameters (∼15% of the teacher size), the improvements start to plateau. These findings indicate that our method provides the largest performance boost for the most resource-constrained cases, usually deemed the most challenging [4, 5, 6].

Table 4: Impact of the student model size on the two-step KD performance. OPS: number of operations per frame at inference time. Proposed: two-step KD using G_tf pre-training followed by fully-supervised training (best in Table 2).

Model    | Params / OPS (M) | ∆SDR (dB) | ∆PESQ (MOS) | ∆eSTOI (%) | ∆DNS-MOS BAK | ∆DNS-MOS OVRL | ∆DNS-MOS SIG
Teacher  | 1.9 / 13.34      | 8.65      | 1.25        | 10.07      | 1.44         | 0.69          | 0.06
Student  | 0.03 / 0.42      | 4.42      | 0.50        | 2.59       | 1.21         | 0.47          | -0.07
Proposed | 0.03 / 0.42      | 5.52      | 0.61        | 4.55       | 1.18         | 0.47          | -0.05
Student  | 0.06 / 0.84      | 6.34      | 0.75        | 5.82       | 1.27         | 0.55          | -0.02
Proposed | 0.06 / 0.84      | 6.77      | 0.81        | 6.38       | 1.34         | 0.59          | -0.01
Student  | 0.24 / 2.48      | 7.24      | 0.93        | 7.53       | 1.38         | 0.62          | 0.00
Proposed | 0.24 / 2.48      | 7.60      | 0.97        | 7.71       | 1.41         | 0.64          | 0.01
Student  | 0.35 / 3.08      | 7.51      | 0.99        | 7.97       | 1.39         | 0.63          | 0.01
Proposed | 0.35 / 3.08      | 7.54      | 1.01        | 8.22       | 1.38         | 0.64          | 0.02

4. CONCLUSIONS

This work proposes a novel two-step KD protocol for distilling tiny, causal SE models; no previous KD work has investigated this class of embedded-scale SE. Our framework consists of two distinct steps: 1. distilling the student model using only the KD objective, with our proposed fine-grained self-similarity matrix G_tf computing the distillation loss; 2. training the model obtained in Step 1 via a supervised loss. Our results show that tiny SE models distilled in this fashion perform better than KD methods utilizing a weighted loss between supervised and distillation objectives. Our experimental evaluation shows that the proposed two-step KD provides the largest benefits for low-SNR mixtures and smaller student models. Future work should explore integrating the proposed two-step KD with pruning and/or quantization to achieve SE models of even lower complexity, and applying the method to other audio-to-audio problems such as source separation, bandwidth extension, or signal improvement.
5. REFERENCES

[1] Chandan K. A. Reddy et al., "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," in Proc. Interspeech. ISCA, 2020, pp. 2492–2496.
[2] Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe, and Jonathan Le Roux, "STFT-domain neural speech enhancement with very low algorithmic latency," IEEE TASLP, vol. 31, pp. 397–410, 2022.
[3] Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, and Li-Chia Yang, "Self-supervised learning for speech enhancement through synthesis," in ICASSP. IEEE, 2023, pp. 1–5.
[4] Jiasi Chen and Xukan Ran, "Deep learning with edge computing: A review," Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
[5] Igor Fedorov et al., "TinyLSTMs: Efficient neural speech enhancement for hearing aids," in Proc. Interspeech. ISCA, 2020, pp. 4054–4058.
[6] Marko Stamenovic, Nils Westhausen, Li-Chia Yang, Carl Jensen, and Alex Pawlicki, "Weight, block or unit? Exploring sparsity tradeoffs for speech enhancement on tiny neural accelerators," in NeurIPS ENLSP, 2021.
[7] Sebastian Braun, Hannes Gamper, Chandan K. A. Reddy, and Ivan Tashev, "Towards efficient models for real-time deep noise suppression," in ICASSP. IEEE, 2021, pp. 656–660.
[8] Zeyuan Wei, Li Hao, and Xueliang Zhang, "Model compression by iterative pruning with knowledge distillation and its application to speech enhancement," in Proc. Interspeech. ISCA, 2022, pp. 941–945.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015.
[10] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao, "Knowledge distillation: A survey," International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[11] Chuanguang Yang, Xinqiang Yu, Zhulin An, and Yongjun Xu, "Categories of response-based, feature-based, and relation-based knowledge distillation," in Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems, pp. 1–32. Springer, 2023.
[12] Florian Schmid, Shahed Masoudian, Khaled Koutini, and Gerhard Widmer, "Knowledge distillation from transformers for low-complexity acoustic scene classification," in DCASE Workshop, 2022.
[13] Ji Won Yoon, Beom Jun Woo, Sunghwan Ahn, Hyeonseung Lee, and Nam Soo Kim, "Inter-KD: Intermediate knowledge distillation for CTC-based automatic speech recognition," in SLT Workshop. IEEE, 2023, pp. 280–286.
[14] Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, and Xiaofei Li, "Sub-band knowledge distillation framework for speech enhancement," in Proc. Interspeech. ISCA, 2020, pp. 2687–2691.
[15] Sotaro Nakaoka, Li Li, Shota Inoue, and Shoji Makino, "Teacher-student learning for low-latency online speech enhancement using Wave-U-Net," in ICASSP. IEEE, 2021, pp. 661–665.
[16] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio, "FitNets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[17] Sergey Zagoruyko and Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," in ICLR, 2017.
[18] Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Byung Hoon Lee, and Sung Won Han, "Multi-view attention transfer for efficient speech enhancement," in Proc. Interspeech. ISCA, 2022.
[19] Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn Schuller, Jie Jia, and Yiyuan Peng, "Cross-layer similarity knowledge distillation for speech enhancement," in Proc. Interspeech. ISCA, 2022, pp. 926–930.
[20] Frederick Tung and Greg Mori, "Similarity-preserving knowledge distillation," in CVPR. IEEE, 2019, pp. 1365–1374.
[21] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia, "Distilling knowledge via knowledge review," in CVPR. IEEE, 2021, pp. 5008–5017.
[22] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in CVPR. IEEE, 2017, pp. 7130–7138.
[23] Ke Tan and DeLiang Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE TASLP, vol. 28, pp. 380–390, 2020.
[24] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE TASLP, vol. 27, no. 8, pp. 1256–1266, 2019.
[25] Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in ICASSP. IEEE, 2015, pp. 708–712.
[26] ITU-R, "ITU-R BS.1770-4: Algorithms to measure audio programme loudness and true-peak audio level," 2015.
[27] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey, "SDR: Half-baked or well done?," in ICASSP. IEEE, 2019, pp. 626–630.
[28] ITU-R, "P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," 2015.
[29] Jesper Jensen and Cees H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE TASLP, vol. 24, no. 11, pp. 2009–2022, 2016.
[30] Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in ICASSP. IEEE, 2022, pp. 886–890.
[31] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton, "Similarity of neural network representations revisited," in ICML, 2019, pp. 3519–3529.
