2503.18579v1
Abstract: We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. Specifically designed for audio applications, we introduce a convolutional-recurrent variational autoencoder optimized for efficient time-frequency processing. Our experimental results considering a spoken digits dataset demonstrate a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model's enhanced ability to capture complex audio patterns.

Index Terms: Unsupervised clustering, variational autoencoder, Gaussian mixture model, spoken digits

This work was supported by the Robust AI for SafE (radar) signal processing (RAISE) collaboration framework between Eindhoven University of Technology and NXP Semiconductors, including a Privaat-Publieke Samenwerkingen-toeslag (PPS) supplement from the Dutch Ministry of Economic Affairs and Climate Policy.

I. INTRODUCTION

Unsupervised clustering is crucial in audio applications [1], particularly for hardware-constrained devices like hearing aids [2], where different processing is applied per detected acoustic scene [3]. Traditional methods struggle to model the complex, high-dimensional nature of audio signals, resulting in suboptimal clustering [4]. Variational autoencoders (VAEs) are a promising tool for the task since they are capable of learning more efficient, low-dimensional representations of data [5].

Variational autoencoders are frequently employed in unsupervised learning tasks due to their ability to learn from data without the need for labels or ground truth [5]. Naturally, the approach was modified towards clustering, where the prior distribution of latent variables, commonly a multivariate Gaussian distribution, was changed to a multivariate Gaussian mixture model, allowing for clustering behavior [6]–[8]. Variational clustering was successfully applied to image applications, more specifically using the MNIST dataset [9].

In audio, particularly in speech processing, variational autoencoders have been used for speech enhancement applications [10], where latent representations for the clean speech are obtained based on noisy samples, such that the model can reconstruct clean data. Given the importance of temporal dependencies in audio processing, alternative architectures to the VAE such as the stochastic temporal convolutional network [11] have been proposed for more effective speech processing. Moreover, to the best of our knowledge, generative models have not previously been applied to the unsupervised clustering of audio signals.

In this work, we propose an unsupervised variational acoustic clustering (UVAC) model, building on [6] and [5] towards the unsupervised clustering of audio data. We consider a convolutional-recurrent neural network (NN) model, and facilitate temporal processing by feeding the NN a window of time frames [12]. In comparison to traditional approaches, UVAC substantially enhances unsupervised accuracy, normalized mutual information, and other clustering metrics.

II. VARIATIONAL INFERENCE

Consider a dataset $X = \{x^{(i)}\}_{i=1}^{N}$ with $N$ independent and identically distributed samples. From Bayes' theorem, we can perform (statistical) inference by obtaining the posterior distribution

$$p_\theta(z|x^{(i)}) = \frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{p_\theta(x^{(i)})}, \tag{1}$$

where $z$ is a latent variable that meaningfully represents the underlying distribution of the data, and $\theta$ are model parameters. However, for most practical cases, $p_\theta(x^{(i)})$ is unknown. As a workaround, the probability distribution $p_\theta(x^{(i)})$ can be marginalized as

$$p_\theta(x^{(i)}) = \sum_{z} p_\theta(x^{(i)}|z)\,p_\theta(z), \tag{2}$$

which is intractable because it requires summing over all possible values of $z$, often with high order or complex relations. To make (1) and (2) tractable, we introduce a variational distribution with parameters $\phi$ to approximate the intractable posterior, $q_\phi(z|x^{(i)}) \approx p_\theta(z|x^{(i)})$, which we apply to (2),

$$p_\theta(x^{(i)}) = \sum_{z} q_\phi(z|x^{(i)})\,\frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}, \tag{3}$$

and define $\sum_{z} q_\phi(z|x^{(i)})\,(\cdot)$ as the expectation over $q_\phi(z|x^{(i)})$, represented by $\mathbb{E}_{q_\phi(z|x^{(i)})}[\cdot]$, allowing us to rewrite (3) as an expectation.
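For reference, the rewriting step mentioned here can be spelled out as follows. This is the standard textbook route from (3) to an evidence lower bound via Jensen's inequality, given as a sketch under that assumption; the exact form and numbering used in the rest of the paper may differ.

```latex
\begin{align*}
p_\theta(x^{(i)})
  &= \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[
       \frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}\right], \\
\log p_\theta(x^{(i)})
  &\geq \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[
       \log \frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}\right] \\
  &= \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[\log p_\theta(x^{(i)}|z)\right]
     - D_{\mathrm{KL}}\!\left(q_\phi(z|x^{(i)})\,\|\,p_\theta(z)\right).
\end{align*}
```

The first term rewards accurate reconstruction, while the Kullback-Leibler term keeps the approximate posterior close to the prior.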
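[Fig. 2 diagram not reproduced. Output channels as labeled, encoder side: 16, 32, 64, 128, 128, 128; latent linear layer (L) with output 2·dz, followed by reparameterization with ϵ to obtain z; second linear layer (L) and decoder side: 128, 64, 32, 16, 1; input x, output x̂.]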
Fig. 2: Schematic of the proposed convolutional-recurrent variational autoencoder. The number in each layer indicates output
channels. The C encoder layers consist of Conv2D with BatchNorm2D and ReLU functions in all layers. The G layers are
gated recurrent units (GRUs). The layers of the latent space are linear (L) without activation. The decoder layers are Conv2D.T,
with ReLU activation in all layers but the last, with sigmoid. All C kernels are (8,8) with stride (2,2) and padding (3,3).
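The kernel, stride, and padding stated in the caption fix how much each convolutional layer shrinks its input. The sketch below applies the usual convolution output-size formula (no dilation) to an input of 128 frequency bins by 99 time frames, the size given in the Data subsection. The number of strided layers (four, matching the channel counts 16, 32, 64, 128) is an assumption for illustration, and the helper names are hypothetical.

```python
def conv_out(size, kernel=8, stride=2, padding=3):
    """Output length along one axis for a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def stack_shapes(freq_bins, time_bins, n_layers):
    """Track the (frequency, time) size through n_layers identical conv layers."""
    shapes = [(freq_bins, time_bins)]
    for _ in range(n_layers):
        f, t = shapes[-1]
        shapes.append((conv_out(f), conv_out(t)))
    return shapes


# Assumption: four strided Conv2D layers, as suggested by channels 16, 32, 64, 128.
shapes = stack_shapes(freq_bins=128, time_bins=99, n_layers=4)
for layer, (f, t) in enumerate(shapes):
    print(f"after {layer} conv layers: {f} x {t}")
```

Under these assumptions each layer roughly halves both axes, so the encoder compresses the spectrogram considerably before the recurrent layers.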
Moreover, we define the cost function to be optimized as the average of (9) over all samples of the dataset $X$, as

$$J(\theta, \phi, \varphi) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}^{(i)}(\theta, \phi, \varphi), \tag{12}$$

resulting in the optimization problem

$$\max_{\theta,\phi,\varphi} J(\theta, \phi, \varphi) \;\Leftrightarrow\; \min_{\theta,\phi,\varphi} -J(\theta, \phi, \varphi), \tag{13}$$

whose right-hand side can be minimized via backpropagation.

IV. ARCHITECTURE

Unsupervised clustering using a variational autoencoder framework was previously applied to image applications [6]–[8], specifically for the classification of handwritten digits with the MNIST dataset [9]. For such a case, where the input dimension is relatively small, the authors successfully employed fully connected NNs as encoder and decoder.

In our case, we consider the unsupervised clustering of audio signals, validating the proposal through the AudioMNIST dataset [14]. Nevertheless, the input size is unwieldy for a fully connected NN, which becomes inefficient given the unnecessarily large number of parameters. Envisioning a more efficient implementation, we define the autoencoder as a convolutional-recurrent network. Additionally, as audio strongly depends on time correlations, we employ an explicit time-context at the input [12] to enhance temporal processing.

A. Data

We consider the problem of unsupervised clustering of spoken digits with the AudioMNIST dataset [14]. The dataset contains 30000 audio samples (24000 randomly selected for training, 3000 for validation, and 3000 for testing), where each file contains the audio recording, at 16 kHz, of a spoken digit. The speakers are of different gender and age. The raw audio data are pre-processed as follows. First, we pad zeros to each audio sample until the desired duration of 1 second is achieved. The padded audio is then passed through a short-time Fourier transform (STFT) with a length of 960 samples, a Hann window of the same size, and a hop of 480 samples. Moreover, we take the magnitude of the STFT output and limit the frequency range to 128 frequency bins (approximately 6 kHz). Preliminary tests showed this frequency range to be sufficient for classification tasks on AudioMNIST. Finally, the spectrograms are normalized by their mean and variance, with ranges limited from 0 to 1 by a min-max adjustment.

The input to the neural network is the padded and pre-processed full second of audio, containing 128 frequency bins and 99 time bins. The zero-frequency bin is removed from all samples. By feeding the network with the entire duration of a file, we leverage time dependencies, as observed in [12]. Importantly, we generate an activity detection vector containing zeros at the zero-padded time indexes, and ones otherwise. This vector multiplies both target and prediction during the calculation of the reconstruction loss. If this is not done, the model might learn to classify the duration of the zero-padding.

B. Autoencoder

Generative models tend to require a massive number of parameters to achieve desirable performance [15]. To form a model with a reduced number of parameters, we define a convolutional-recurrent variational autoencoder composed of a convolutional-recurrent encoder $q_\phi(z|x^{(i)})$, a latent space with prior $p_\varphi(z)$, and a convolutional decoder $p_\theta(x^{(i)}|z)$, as shown in Fig. 2. We refer to the proposed model, for simplicity, as the unsupervised variational acoustic clustering (UVAC) model. Notice that the recurrent layers were necessary when defining the architecture, as removing them made the accuracy insufficient.

We set the latent space dimension to $d_z = 10$ because early experiments indicated that this value is effective for unsupervised clustering on the AudioMNIST dataset. Linear layers are used in the latent space to convert the number of channels to the chosen latent dimension, a necessary modification in the proposed model when compared to fully connected VAEs in the literature [6]. The first linear layer converts the number of channels from 128, at the end of the encoder, to $2 \cdot d_z$, where the first $d_z$ data points are taken as values for the (multivariate) mean, and the last $d_z$ for the variance. The second linear layer converts the latent dimension to 128 channels. Moreover, as previously described, the prior distribution for the latent space $p_\varphi(z)$ is a multivariate Gaussian mixture model. For our experiments, we choose $C = 10$ GMM components, which is the same as the number of classes in the dataset.

We initialize all convolutional, recurrent, and linear layers with the Kaiming uniform initialization for the weights and
zeros for the biases. The GMM prior $p_\varphi(z)$, defined in (8), is initialized as follows. The weighting vector $\pi_c$ is drawn from a uniform distribution with lower and upper bounds 0.0 and 1.0, respectively. The vector of means $\mu_c$ is initialized with the Xavier uniform initialization, and the variances $\Sigma_c$ are all set as zero. Importantly, the clusters are chosen based on the probability of data point $x^{(i)}$ belonging to the $c$-th cluster [6]:

$$q_\phi(c|x^{(i)}) = p_\theta(c|z) = \frac{p_\varphi(c)\,p_\varphi(z|c)}{p_\varphi(z)} = \frac{\pi_c\,\mathcal{N}(z; \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'}\,\mathcal{N}(z; \mu_{c'}, \Sigma_{c'})}. \tag{14}$$

During training, the complete model is used. For applying the NN model in inference mode, however, only the encoder and the latent space are necessary, because we perform clustering based on the GMM prior. Therefore, the inference UVAC model has far fewer parameters than the version used in training. In numbers, the complete model has 2M parameters, while roughly 1.3M are used for inference. Although this number of parameters is low when compared to other generative approaches [15], further studies in model size reduction are of interest, but out of scope for this paper.

V. EXPERIMENTAL EVALUATION

In this section we present the results obtained with the UVAC model described in Section IV-B. We trained the model for 500 epochs using the Adam optimizer, with an initial learning rate of 0.005, exponentially decaying to a final value of 0.0005 at the last epoch. The batch size was 64 data points. Unlike [6], we keep both terms in (9) with equal weight. During tests, we noticed that the performance is very sensitive to the weighting of the KL divergence term, and for the AudioMNIST dataset, a weighting of 1.0 for each term achieved the best results. Additionally, we keep $M = 1$ in (10) for efficiency.

A. Metrics

In the following we describe the considered metrics for unsupervised clustering.

1) Unsupervised accuracy: in unsupervised clustering tasks, the numeric labels may not correspond directly to the ground truth labels. We therefore consider an unsupervised approach for calculating accuracy, which consists of finding the matching truth labels for the clusters via the Hungarian algorithm [16]. Unsupervised accuracy ranges from 0 to 100%.

2) Normalized mutual information: the normalized mutual information (NMI) is an information-theoretic approach that evaluates the clustering quality by measuring the amount of shared information between clustering assignments and truth labels [17]. Its range is from 0 to 1.

3) Silhouette score: the Silhouette score [18] measures how similar a data point is to its own cluster in comparison to other clusters. It combines cohesion (how close data points within a cluster are) and separation (how distinct a cluster is from another). The range is from -1 to +1: -1 indicates misclassification; 0 tells us that clusters overlap; and +1 indicates optimal clustering.

4) Davies-Bouldin index: the Davies-Bouldin index (DBI) [19] is defined as the average similarity ratio of each cluster with the cluster that is most similar to it. A lower DBI indicates better clustering in terms of compactness and separation. Its range can vary from 0 to infinity.

B. Other methods

For the sake of comparison, we perform unsupervised clustering using two traditional methods, namely K-means [20] and the optimization of a Gaussian mixture model using the expectation-maximization (EM) algorithm (GMM-EM) [21]. Other classical approaches and derivations are assumed to achieve similar performance as K-means and GMM-EM. We also tried to apply dimensionality reduction techniques on the pre-processed data with the traditional methods, but no performance improvement was obtained.

To the best of our knowledge, this is the first study to apply variational autoencoders for the unsupervised clustering of audio data. While other generative models are capable of clustering, they have not been specifically employed for audio data clustering as we tackle in this paper.

C. Results

The obtained results are shown in Table I. For the labels, K-means, and GMM-EM, we calculate the metrics with the STFT data, while for UVAC we use the latent variable $z$. First, notice how the classes provided with the AudioMNIST dataset, referring to the spoken digits, do not form good clusters in terms of the Silhouette score and DBI. Improvements are clear when K-means and GMM-EM are applied, raising the Silhouette score and lowering the DBI, but dramatically reducing unsupervised accuracy and normalized mutual information. Such methods, therefore, show themselves insufficient for an accurate unsupervised clustering when data are of high dimensionality or contain complex relations. On the other hand, when UVAC is applied, accuracy and NMI increase to approximately 71% and 0.71, respectively, and the unsupervised clustering metrics are enhanced compared to the other approaches.

TABLE I: Unsupervised clustering metrics on the test set of AudioMNIST, either considering labels as clusters or by applying K-means, GMM optimized by EM, and UVAC as clustering methods, averaged over 10 independent runs.

Method          Accuracy (%)   NMI    Silhouette   DBI
None (labels)   100.00         1.00   -0.04        5.56
K-means          18.40         0.10    0.13        2.04
GMM-EM           17.62         0.09    0.13        1.95
UVAC             70.78         0.71    0.21        1.61

Based on these results, for audio applications in general, we expect the accuracy to be limited when unsupervised clustering objectives are involved, almost as an abstract form of regularization. A perfect match with the truth labels does not provide good clusters, as we can see from the metrics obtained with the labels. Furthermore, the sufficiently high accuracy (around 71%) obtained by the proposed model shows that the digit being spoken, alone, is a major part of the data features, but accompanied by other features present in
[Fig. 3 plot not reproduced: three scatter panels, (a) K-means, (b) GMM-EM, (c) UVAC, with both axes spanning -60 to 60 and a color bar for labels 0 to 9.]
Fig. 3: Clusters obtained for the AudioMNIST test set. The data size is reduced for plotting using t-distributed stochastic
neighbor embedding. Each color represents a cluster (labels shown in bar plot), and each circle in the plot is a data point.
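The cluster of each UVAC point is derived from the mixture responsibilities in (14). The sketch below evaluates that expression for a single latent vector and picks the most probable component. It assumes diagonal covariances and uses a log-sum-exp style normalization for numerical stability; the function names and the toy mixture parameters are hypothetical and are not values from the paper.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log-density of a Gaussian with diagonal covariance, evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
        for zi, m, v in zip(z, mean, var)
    )


def responsibilities(z, weights, means, variances):
    """Posterior probability of each mixture component given z, as in (14)."""
    log_terms = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_terms)  # subtract the maximum before exponentiating
    exp_terms = [math.exp(t - top) for t in log_terms]
    total = sum(exp_terms)
    return [e / total for e in exp_terms]


# Toy two-component mixture in a 2-D latent space (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

probs = responsibilities([0.5, 0.2], weights, means, variances)
cluster = max(range(len(probs)), key=probs.__getitem__)
print(probs, cluster)
```

Taking the argmax of the responsibilities is one natural reading of "chosen based on the probability"; a soft assignment could be kept instead if downstream processing benefits from the uncertainty.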
the STFTs, e.g., voice frequency, microphone noise, time to pronounce a number, etc. This “regularizing” effect is mainly caused by the complexity of the audio data, which is not observed for image processing [6]–[8]. This also makes the inclusion of annealing strategies in the loss function nontrivial.

Additionally, in Fig. 3 we show an example of the clusters obtained for AudioMNIST's test set. Visually, K-means and GMM-EM clusters are similar, being insufficient in terms of compactness and separation. On the other hand, UVAC's clusters are better separated and more compact for most classes. Clearly, the traditional methods are limited in their ability to capture the (complex) underlying data structure. The incorporation of advanced clustering strategies, like UVAC, when dealing with audio data, might drastically improve the performance of a system that depends on unsupervised clustering.

VI. CONCLUSION

We proposed a variational autoencoder model for unsupervised acoustic clustering of audio data. The proposed model is a convolutional-recurrent variational autoencoder with linear layers in the latent space, which has a multivariate GMM prior. The clustering results obtained with the UVAC model on the AudioMNIST dataset show substantial improvements in relation to other approaches. UVAC is capable of maintaining high accuracy and normalized mutual information while increasing clustering quality. For future work, we suggest the development of a robust annealing strategy of the loss terms aimed at audio applications.

REFERENCES

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
[2] L. E. Humes, S. E. Rogers, A. K. Main, and D. L. Kinney, “The acoustic environments in which older adults wear their hearing aids: Insights from datalogging sound environment classification,” American Journal of Audiology, vol. 27, no. 4, pp. 594–603, 2018.
[3] G. Park, W. Cho, K.-S. Kim, and S. Lee, “Speech enhancement for hearing aids with deep learning on environmental noises,” Applied Sciences, vol. 10, no. 17, 2020.
[4] J. Foote, “An overview of audio information retrieval,” Multimedia Syst., vol. 7, no. 1, pp. 2–10, Jan. 1999.
[5] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, 2014.
[6] Y. Uğur, G. Arvanitakis, and A. Zaidi, “Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding,” Entropy, vol. 22, no. 2, 2020.
[7] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, “Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders,” 2017.
[8] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep embedding: an unsupervised and generative approach to clustering,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, 2017, pp. 1965–1972.
[9] L. Deng, “The MNIST Database of Handwritten Digit Images for Machine Learning Research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[10] H. Fang, G. Carbajal, S. Wermter, and T. Gerkmann, “Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 676–680.
[11] J. Richter, G. Carbajal, and T. Gerkmann, “Speech Enhancement with Stochastic Temporal Convolutional Networks,” in Proc. Interspeech 2020, 2020, pp. 4516–4520.
[12] L. V. Fiorio, B. Karanov, B. Defraene, J. David, F. Widdershoven, W. Van Houtum, and R. M. Aarts, “Spectral Masking With Explicit Time-Context Windowing for Neural Network-Based Monaural Speech Enhancement,” IEEE Access, vol. 12, pp. 154 843–154 852, 2024.
[13] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. IV–317–IV–320.
[14] S. Becker, J. Vielhaben, M. Ackermann, K.-R. Müller, S. Lapuschkin, and W. Samek, “AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark,” Journal of the Franklin Institute, 2023.
[15] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,” 2020.
[16] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[17] N. X. Vinh, J. Epps, and J. Bailey, “Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Dec. 2010.
[18] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[19] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979.
[20] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data Via the EM Algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.