2503.18579v1

The document presents an unsupervised variational acoustic clustering (UVAC) model designed for clustering audio data using a convolutional-recurrent variational autoencoder and a Gaussian mixture model as a prior. The model demonstrates improved clustering accuracy and performance on spoken digits datasets compared to traditional methods, leveraging variational inference to capture complex audio patterns. The architecture is optimized for efficient time-frequency processing, making it suitable for applications in hardware-constrained environments like hearing aids.

Uploaded by

吴京城


Unsupervised Variational Acoustic Clustering

Luan Vinícius Fiorio, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands, [email protected]
Bruno Defraene, NXP Semiconductors, Leuven, Belgium, [email protected]
Johan David, NXP Semiconductors, Leuven, Belgium, [email protected]
Frans Widdershoven, NXP Semiconductors, Eindhoven, The Netherlands, [email protected]
Wim van Houtum, NXP Semiconductors, Eindhoven, The Netherlands, [email protected]
Ronald M. Aarts, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands, [email protected]

arXiv:2503.18579v1 [eess.AS] 24 Mar 2025

Abstract: We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. Specifically designed for audio applications, we introduce a convolutional-recurrent variational autoencoder optimized for efficient time-frequency processing. Our experimental results considering a spoken digits dataset demonstrate a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model's enhanced ability to capture complex audio patterns.

Index Terms: Unsupervised clustering, variational autoencoder, Gaussian mixture model, spoken digits

This work was supported by the Robust AI for SafE (radar) signal processing (RAISE) collaboration framework between Eindhoven University of Technology and NXP Semiconductors, including a Privaat-Publieke Samenwerkingen-toeslag (PPS) supplement from the Dutch Ministry of Economic Affairs and Climate Policy.

I. INTRODUCTION

Unsupervised clustering is crucial in audio applications [1], particularly for hardware-constrained devices like hearing aids [2], where different processing is applied per detected acoustic scene [3]. Traditional methods struggle to model the complex, high-dimensional nature of audio signals, resulting in suboptimal clustering [4]. Variational autoencoders (VAEs) are a promising tool for the task since they are capable of learning more efficient, low-dimensional representations of data [5].

Variational autoencoders are frequently employed in unsupervised learning tasks due to their ability to learn from data without the need for labels or ground truth [5]. Naturally, the approach was modified towards clustering, where the prior distribution of latent variables, commonly a multivariate Gaussian distribution, was changed to a multivariate Gaussian mixture model, allowing for clustering behavior [6]–[8]. Variational clustering was successfully applied to image applications, more specifically using the MNIST dataset [9].

In audio, particularly in speech processing, variational autoencoders have been used for speech enhancement applications [10], where latent representations for the clean speech are obtained based on noisy samples, such that the model can reconstruct clean data. Given the importance of temporal dependencies in audio processing, alternative architectures to the VAE such as the stochastic temporal convolutional network [11] have been proposed for more effective speech processing. Moreover, to the best of our knowledge, unsupervised clustering of audio signals using generative models has not been considered before.

In this work, we propose an unsupervised variational acoustic clustering (UVAC) model, building on [6] and [5] towards the unsupervised clustering of audio data. We consider a convolutional-recurrent neural network (NN) model, and facilitate temporal processing by feeding the NN a window of time frames [12]. In comparison to traditional approaches, UVAC substantially enhances unsupervised accuracy, normalized mutual information, and other clustering metrics.

II. VARIATIONAL INFERENCE

Consider a dataset $X = \{x^{(i)}\}_{i=1}^{N}$ with $N$ independent and identically distributed samples. From Bayes' theorem, we can perform (statistical) inference by obtaining the posterior distribution

$$p_\theta(z|x^{(i)}) = \frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{p_\theta(x^{(i)})}, \tag{1}$$

where $z$ is a latent variable that meaningfully represents the underlying distribution of the data, and $\theta$ are model parameters. However, for most practical cases, $p_\theta(x^{(i)})$ is unknown. As a workaround, the probability distribution $p_\theta(x^{(i)})$ can be marginalized as

$$p_\theta(x^{(i)}) = \sum_z p_\theta(x^{(i)}|z)\,p_\theta(z), \tag{2}$$

which is intractable, as it requires summing over all possible values of $z$, often with high order or complex relations. To make (2) and (1) tractable, we introduce a variational distribution with parameters $\phi$ to approximate the intractable posterior, $q_\phi(z|x^{(i)}) \approx p_\theta(z|x^{(i)})$, which we apply to (2),

$$p_\theta(x^{(i)}) = \sum_z q_\phi(z|x^{(i)})\,\frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}, \tag{3}$$
and define $\sum_z q_\phi(z|x^{(i)})$ as the expectation over $q_\phi(z|x^{(i)})$, represented by $\mathbb{E}_{q_\phi(z|x^{(i)})}[\cdot]$, allowing us to rewrite (3) as

$$p_\theta(x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}\right]. \tag{4}$$

Based on (2), we know that the model's ability to represent data $x$ can be measured by its log-likelihood. Its maximization therefore becomes a cost function for obtaining the model parameters $\theta$, defined as follows:

$$\max_\theta \log p_\theta(X) = \max_\theta \sum_{i=1}^{N} \log p_\theta(x^{(i)}). \tag{5}$$

Nevertheless, we want to modify (4) in such a way that we arrive at an expression similar to (5), which can be used as a lower bound for optimization. Applying the log operator to (4) gives

$$\log p_\theta(x^{(i)}) = \log \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\frac{p_\theta(x^{(i)}|z)\,p_\theta(z)}{q_\phi(z|x^{(i)})}\right], \tag{6}$$

which allows us to use Jensen's inequality, since log is a concave function, defining the variational lower bound $\mathcal{L}$:

$$\begin{aligned}
\log p_\theta(x^{(i)}) &\geq \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z) + \log\frac{p_\theta(z)}{q_\phi(z|x^{(i)})}\right] \\
&= \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\right) \\
&= \mathcal{L}^{(i)}(\theta, \phi).
\end{aligned} \tag{7}$$

In (7), $\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]$ represents the reconstruction error of $x$ from $z$, i.e., how well the latent variables $z$ explain the data $x$. The term $D_{\mathrm{KL}}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z))$ stands for the Kullback-Leibler (KL) divergence, which quantifies the difference between the prior $p_\theta(z)$ and the variational distribution $q_\phi(z|x^{(i)})$. The objective of variational inference is to maximize $\mathcal{L}$ in terms of $\theta$ and $\phi$, corresponding to finding a good approximation $q_\phi(z|x^{(i)})$ for the posterior $p_\theta(z|x^{(i)})$.

III. UNSUPERVISED VARIATIONAL CLUSTERING

Given the complexity of the problem, we consider a variational autoencoder [5] framework. The VAE is composed of a neural network encoder, which takes the data $x$ to a latent representation $z$, and a NN decoder that generates data based on the latent variables.

To allow clustering behavior, we choose a multivariate Gaussian mixture model (GMM) prior $p_\varphi(z)$, replacing $p_\theta(z)$ in (7) for the latent space [6], given by

$$p_\varphi(z) = \sum_c \pi_c\, \mathcal{N}(z; \mu_c, \Sigma_c), \tag{8}$$

where $c$ is one component in the mixture, i.e., a cluster. The multivariate GMM defined in (8) is based on multivariate Gaussians $\mathcal{N}$ with means $\mu_c$ and variances $\Sigma_c$, whose dimension is the same as that of $z$ ($d_z$), linearly combined by a weighting vector $\pi_c$. The GMM parameters are condensed in a variable $\varphi := \{\pi_c, \mu_c, \Sigma_c\}_{c=1}^{C}$, for $C$ components.

Fig. 1: Inference and generative models for unsupervised clustering. (a) Inference model: $c \to x \to z$, with $q_\phi(x|c)$ and $q_\phi(z|x)$. (b) Generative model: $c \to z \to x$, with $p_\varphi(z|c)$ and $p_\theta(x|z)$.

The considered inference and generative models are presented in Fig. 1, where we consider two assumptions: i) for inference, the data originate from component $c$ of an unknown GMM $q_\phi(x^{(i)}|c)$; and ii) the latent variables have a GMM prior $p_\varphi(z)$, as described in (8), for the generation of data. The inference model shown in Fig. 1a is given by the encoder $q_\phi(z|x^{(i)}) = \mathcal{N}(z; \mu_\phi, \Sigma_\phi)$, where a neural network $f$ with parameters $\phi$ and input $x^{(i)}$ generates the multivariate output $f_\phi(x^{(i)}) = [\mu_\phi, \Sigma_\phi]$. On the other hand, the generative model from Fig. 1b is composed of a NN decoder $g$ with parameters $\theta$ that takes $z$ as input and is described as $p_\theta(x^{(i)}|z) = [\hat{x}^{(i)}] = g_\theta(z)$, with $\hat{x}^{(i)}$ being the generated data.

All parameters of the model ($\theta$, $\phi$, and $\varphi$) are optimized simultaneously to maximize the variational lower bound, modified from (7) to meet the GMM prior:

$$\mathcal{L}^{(i)}(\theta, \phi, \varphi) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x^{(i)}) \,\|\, p_\varphi(z)\right). \tag{9}$$

In practice, to allow gradient flow, the reconstruction error in (9) can be obtained with the reparametrization trick [5] and Monte Carlo sampling [6]:

$$\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] \approx \frac{1}{M} \sum_{m} \log p_\theta(x^{(i)}|z_m), \tag{10}$$

with $m$ being a Monte Carlo sample out of $M$, and $z_m = \mu_\phi + \Sigma_\phi^{1/2} \epsilon_m$ (reparametrization trick), with an auxiliary random variable $\epsilon_m \sim \mathcal{N}(0, I)$. The KL divergence term of (9), i.e., the divergence between a single-component multivariate Gaussian and a multivariate GMM with $C$ components, cannot be calculated in closed form, but it can be approximated [6], [13]. Thus, we approximate the KL term of (9), considering that $\Sigma_\phi = \mathrm{diag}(\{\Sigma_{\phi,j}\}_{j=1}^{d_z})$ and $\Sigma_c = \mathrm{diag}(\{\Sigma_{c,j}\}_{j=1}^{d_z})$, as

$$D_{\mathrm{KL}}\left(q_\phi(z|x^{(i)}) \,\|\, p_\varphi(z)\right) \approx -\log \sum_c \pi_c \exp\left(-\frac{1}{2} \sum_j \left[\frac{(\mu_{\phi,j} - \mu_{c,j})^2}{\Sigma_{c,j}} + \log\frac{\Sigma_{c,j}}{\Sigma_{\phi,j}} - 1 + \frac{\Sigma_{\phi,j}}{\Sigma_{c,j}}\right]\right). \tag{11}$$
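The reparametrized sample used in (10) and the KL approximation in (11) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal covariances stored as lists of per-dimension variances, the function names are hypothetical, and the log-sum-exp rearrangement is a standard numerical-stability step added here rather than something stated in the paper.

```python
import math
import random


def reparameterize(mu, var, rng=random):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I), one value per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def kl_diag_gaussians(mu_q, var_q, mu_c, var_c):
    """Closed-form KL between two diagonal Gaussians (the bracketed sum in (11))."""
    total = 0.0
    for mq, vq, mc, vc in zip(mu_q, var_q, mu_c, var_c):
        total += (mq - mc) ** 2 / vc + math.log(vc / vq) - 1.0 + vq / vc
    return 0.5 * total


def kl_gaussian_to_gmm(mu_q, var_q, weights, means, variances):
    """Approximate KL(q || GMM) as -log sum_c pi_c * exp(-KL(q || N_c))."""
    # Work in the log domain: log(pi_c) - KL(q || N_c) for every component.
    log_terms = [
        math.log(w) - kl_diag_gaussians(mu_q, var_q, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp with the maximum factored out to avoid underflow.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))
```

With a single mixture component of weight 1, the approximation reduces to the exact KL divergence between two Gaussians, which gives a quick sanity check before plugging the term into (9).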
Fig. 2: Schematic of the proposed convolutional-recurrent variational autoencoder, with three blocks: the encoder $q_\phi(z|x)$, the latent space (linear layers producing $\mu_\phi$ and $\Sigma_\phi$, followed by the reparametrization with $\epsilon$ to obtain $z$), and the decoder $p_\theta(x|z)$ that outputs $\hat{x}$. The number in each layer indicates output channels. The C encoder layers consist of Conv2D with BatchNorm2D and ReLU functions in all layers. The G layers are gated recurrent units (GRUs). The layers of the latent space are linear (L) without activation. The decoder layers are Conv2D.T, with ReLU activation in all layers but the last, which uses a sigmoid. All C kernels are (8,8) with stride (2,2) and padding (3,3).
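To see what the kernel, stride, and padding values in the caption imply, the short sketch below applies the usual output-size rule for a strided convolution without dilation, floor((n + 2p - k) / s) + 1, to the 128 frequency bins and 99 time bins stated in Section IV-A. The number of stacked layers is not recoverable from the text, so `num_layers=3` is purely a hypothetical value for illustration, and both function names are made up for this example.

```python
def conv_out(n, kernel=8, stride=2, padding=3):
    """Output length of one convolution along one axis (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def shapes(freq_bins=128, time_bins=99, num_layers=3):
    """(frequency, time) sizes at the input and after each encoder layer."""
    out = [(freq_bins, time_bins)]
    for _ in range(num_layers):
        f, t = out[-1]
        out.append((conv_out(f), conv_out(t)))
    return out


print(shapes())  # [(128, 99), (64, 49), (32, 24), (16, 12)]
```

Each layer roughly halves both axes, which is why a few such layers are enough to shrink the spectrogram to a size a recurrent layer can handle cheaply.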
Moreover, we define the cost function to be optimized as the average of (9) over all samples of the dataset $X$, as

$$\mathcal{J}(\theta, \phi, \varphi) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}^{(i)}(\theta, \phi, \varphi), \tag{12}$$

resulting in the optimization problem

$$\max_{\theta,\phi,\varphi} \mathcal{J}(\theta, \phi, \varphi) \Leftrightarrow \min_{\theta,\phi,\varphi} -\mathcal{J}(\theta, \phi, \varphi), \tag{13}$$

whose right-hand side can be minimized via backpropagation.

IV. ARCHITECTURE

Unsupervised clustering using a variational autoencoder framework was previously applied to image applications [6]–[8], specifically for the classification of handwritten digits with the MNIST dataset [9]. For such a case, where the input dimension is relatively small, the authors successfully employed fully connected NNs as encoder and decoder.

In our case, we consider the unsupervised clustering of audio signals, validating the proposal through the AudioMNIST dataset [14]. Nevertheless, the input size is unwieldy for a fully connected NN, which becomes inefficient given the unnecessary number of parameters. Envisioning a more efficient implementation, we define the autoencoder as a convolutional-recurrent network. Additionally, as audio strongly depends on time correlations, we employ an explicit time-context at the input [12] to enhance temporal processing.

A. Data

We consider the problem of unsupervised clustering of spoken digits with the AudioMNIST dataset [14]. The dataset contains 30000 audio samples, of which 24000 are randomly selected for training, 3000 for validation, and 3000 for testing, where each file contains the audio recording, at 16 kHz, of a spoken digit. The speakers are of different gender and age. The raw audio data are pre-processed as follows. First, we pad zeros to each audio sample until the desired duration of 1 second is achieved. A short-time Fourier transform (STFT) is then applied to the padded audio, with a length of 960 samples, a Hann window of the same size, and a hop of 480 samples. Moreover, we take the magnitude of the output of the STFT and limit the frequency range to 128 frequency bins, approximately 6 kHz. Such a frequency range has proven to be sufficient for classification tasks on AudioMNIST in preliminary tests. Finally, the spectrograms are normalized by their mean and variance, with ranges limited from 0 to 1 by a min-max adjustment.

The input to the neural network is the padded and pre-processed full second of audio, containing 128 frequency bins and 99 time bins. The zero-frequency bin is removed from all samples. By feeding the network with the entire duration of a file, we leverage time dependencies, as observed in [12]. Importantly, we generate an activity detection vector containing zeros in the zero-padded time indexes, and ones otherwise. This vector multiplies both target and prediction during the calculation of the reconstruction loss. If this is not done, the model might learn to classify zero-padding duration.

B. Autoencoder

Generative models tend to require a massive number of parameters to achieve desirable performance [15]. To form a model with a reduced number of parameters, we define a convolutional-recurrent variational autoencoder composed of a convolutional-recurrent encoder $q_\phi(z|x^{(i)})$, a latent space with prior $p_\varphi(z)$, and a convolutional decoder $p_\theta(x^{(i)}|z)$, as shown in Fig. 2. We refer to the proposed model, for simplicity, as the unsupervised variational acoustic clustering (UVAC) model. Notice that when defining the architecture, recurrent layers were necessary, as removing them would make the performance insufficient in terms of accuracy.

We set the latent space dimension as $d_z = 10$ because early experiments indicated that this value is effective for unsupervised clustering on the AudioMNIST dataset. Linear layers are used in the latent space to convert the number of channels to the chosen latent dimension, a necessary modification in the proposed model when compared to fully connected VAEs in the literature [6]. The first linear layer converts the number of channels from 128, at the end of the encoder, to $2 \cdot d_z$, where the first $d_z$ data points are taken as values for the (multivariate) mean, and the last $d_z$ for the variance. The second linear layer converts the latent dimension to 128 channels. Moreover, as previously described, the prior distribution for the latent space $p_\varphi(z)$ is a multivariate Gaussian mixture model. For our experiments, we choose $C = 10$ GMM components, which is the same as the number of classes in the dataset.

We initialize all convolutional, recurrent, and linear layers with the Kaiming uniform initialization for the weights and
zeros for the biases. The GMM prior $p_\varphi(z)$, defined in (8), is initialized as follows. The weighting vector $\pi_c$ takes a uniform distribution, with lower and upper bound 0.0 and 1.0, respectively. The vector of means $\mu_c$ is initialized with the Xavier uniform initialization, and the variances $\Sigma_c$ are all set as zero. Importantly, the clusters are chosen based on the probability of data point $x^{(i)}$ belonging to the $c$th cluster [6]:

$$q_\phi(c|x^{(i)}) = p_\theta(c|z) = \frac{p_\varphi(c)\,p_\varphi(z|c)}{p_\varphi(z)} = \frac{\pi_c\, \mathcal{N}(z; \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'}\, \mathcal{N}(z; \mu_{c'}, \Sigma_{c'})}. \tag{14}$$

During training, the complete model is used. For applying the NN model in inference mode, however, only the encoder and the latent space are necessary, because we perform clustering based on the GMM prior. Therefore, the inference UVAC model has far fewer parameters than the version used in training. In numbers, the complete model has 2M parameters, while roughly 1.3M are used for inference. Although this number of parameters is low when compared to other generative approaches [15], further studies in model size reduction are of interest, but out of scope for this paper.

V. EXPERIMENTAL EVALUATION

In this section we present the results obtained with the UVAC model described in Section IV-B. We trained the model for 500 epochs using the Adam optimizer, with an initial learning rate of 0.005, exponentially decaying until the final value of 0.0005 at the last epoch. The batch size considered for training was 64 data points. Unlike [6], we keep both terms in (9) with equal weight. During tests, we noticed that the performance is very sensitive to the weighting of the KL divergence term, and for the AudioMNIST dataset, a weighting of 1.0 for each term achieved the best results. Additionally, we keep $M = 1$ in (10) for efficiency.

A. Metrics

In the following we describe the considered metrics for unsupervised clustering.

1) Unsupervised accuracy: in unsupervised clustering tasks, the numeric labels may not correspond directly to the ground truth labels. We then consider an unsupervised approach for calculating accuracy, which consists of finding the matching truth labels for the clusters via the Hungarian algorithm [16]. Unsupervised accuracy ranges from 0 to 100%.

2) Normalized mutual information: the normalized mutual information (NMI) is an information theoretic approach that evaluates the clustering quality by measuring the amount of shared information between clustering assignments and truth labels [17]. Its range is from 0 to 1.

3) Silhouette score: the Silhouette score [18] measures how similar a data point is to its own cluster in comparison to other clusters. It combines cohesion (how close data points within a cluster are) and separation (how distinct a cluster is from another). The range is from -1 to +1: -1 indicates misclassification; 0 tells us that clusters overlap; and +1 indicates optimal clustering.

4) Davies-Bouldin index: the Davies-Bouldin index (DBI) [19] is defined as the average similarity ratio of each cluster with the cluster that is most similar to it. A lower DBI indicates better clustering in terms of compactness and separation. Its range can vary from 0 to infinity.

B. Other methods

For the sake of comparison, we perform unsupervised clustering using two traditional methods, namely K-means [20] and the optimization of a Gaussian mixture model using the expectation-maximization (EM) algorithm (GMM-EM) [21]. Other classical approaches and derivations are assumed to achieve similar performance as K-means and GMM-EM. We also tried to apply dimensionality reduction techniques on the pre-processed data with the traditional methods, but no performance improvement was obtained.

To the best of our knowledge, this is the first study to apply variational autoencoders for the unsupervised clustering of audio data. While other generative models are capable of clustering, they have not been specifically employed for audio data clustering as we tackle in this paper.

TABLE I: Unsupervised clustering metrics on the test set of AudioMNIST, either considering labels as clusters or by applying K-means, GMM optimized by EM, and UVAC as clustering methods, averaged over 10 independent runs.

Method          Accuracy (%)   NMI    Silhouette   DBI
None (labels)   100.00         1.00   -0.04        5.56
K-means         18.40          0.10   0.13         2.04
GMM-EM          17.62          0.09   0.13         1.95
UVAC            70.78          0.71   0.21         1.61

C. Results

The obtained results are shown in Table I. For the labels, K-means, and GMM-EM, we calculate the metrics with the STFT data, while for UVAC we use the latent variable $z$. First, notice how the classes provided with the AudioMNIST dataset, corresponding to the spoken digits, do not form good clusters in terms of the Silhouette score and DBI. Improvements are clear when K-means and GMM-EM are applied, raising the Silhouette score and lowering the DBI, but dramatically reducing unsupervised accuracy and normalized mutual information. Such methods, therefore, prove insufficient for accurate unsupervised clustering when data are of high dimensionality or contain complex relations. On the other hand, when UVAC is applied, accuracy increases to approximately 71% and NMI to 0.71, and the Silhouette score and DBI improve compared to the other approaches.

Based on these results, for audio applications in general, we expect the accuracy to be limited when unsupervised clustering objectives are involved, almost as an abstract form of regularization. A perfect match with the truth labels does not provide good clusters, as we can see from the metrics obtained with the labels. Furthermore, the sufficiently high accuracy (around 71%) obtained by the proposed model shows that the digit being spoken, alone, is a major part of the data features, but accompanied by other features present in
the STFTs, e.g., voice frequency, microphone noise, time to pronounce a number, etc. This "regularizing" effect is mainly caused by the complexity of the audio data, which is not observed for image processing [6]–[8]. This also makes the inclusion of annealing strategies in the loss function nontrivial.

Fig. 3: Clusters obtained for the AudioMNIST test set, in three panels: (a) K-means, (b) GMM-EM, (c) UVAC. The data size is reduced for plotting using t-distributed stochastic neighbor embedding. Each color represents a cluster (labels 0 to 9 shown in bar plot), and each circle in the plot is a data point.

Additionally, in Fig. 3 we show an example of the clusters obtained for AudioMNIST's test set. Visually, K-means and GMM-EM clusters are similar, being insufficient in terms of compactness and separation. On the other hand, UVAC's clusters are better separated and more compact for most classes. Clearly, the traditional methods are limited in their ability to capture the (complex) underlying data structure. The incorporation of advanced clustering strategies, like UVAC, when dealing with audio data, might drastically improve the performance of a system that depends on unsupervised clustering.

VI. CONCLUSION

We proposed a variational autoencoder model for unsupervised acoustic clustering of audio data. The proposed model is a convolutional-recurrent variational autoencoder with linear layers in the latent space, which has a multivariate GMM prior. The clustering results obtained with the UVAC model on the AudioMNIST dataset show substantial improvements in relation to other approaches. UVAC is capable of maintaining high accuracy and normalized mutual information while increasing clustering quality. For future work, we suggest the development of a robust annealing strategy of the loss terms aimed at audio applications.

REFERENCES

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
[2] L. E. Humes, S. E. Rogers, A. K. Main, and D. L. Kinney, "The acoustic environments in which older adults wear their hearing aids: Insights from datalogging sound environment classification," American Journal of Audiology, vol. 27, no. 4, pp. 594–603, 2018.
[3] G. Park, W. Cho, K.-S. Kim, and S. Lee, "Speech enhancement for hearing aids with deep learning on environmental noises," Applied Sciences, vol. 10, no. 17, 2020.
[4] J. Foote, "An overview of audio information retrieval," Multimedia Syst., vol. 7, no. 1, pp. 2–10, Jan. 1999.
[5] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in 2nd International Conference on Learning Representations, ICLR 2014, 2014.
[6] Y. Uğur, G. Arvanitakis, and A. Zaidi, "Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding," Entropy, vol. 22, no. 2, 2020.
[7] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, "Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders," 2017.
[8] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, "Variational deep embedding: an unsupervised and generative approach to clustering," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, 2017, pp. 1965–1972.
[9] L. Deng, "The MNIST Database of Handwritten Digit Images for Machine Learning Research," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[10] H. Fang, G. Carbajal, S. Wermter, and T. Gerkmann, "Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 676–680.
[11] J. Richter, G. Carbajal, and T. Gerkmann, "Speech Enhancement with Stochastic Temporal Convolutional Networks," in Proc. Interspeech 2020, 2020, pp. 4516–4520.
[12] L. V. Fiorio, B. Karanov, B. Defraene, J. David, F. Widdershoven, W. Van Houtum, and R. M. Aarts, "Spectral Masking With Explicit Time-Context Windowing for Neural Network-Based Monaural Speech Enhancement," IEEE Access, vol. 12, pp. 154843–154852, 2024.
[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. IV-317–IV-320.
[14] S. Becker, J. Vielhaben, M. Ackermann, K.-R. Müller, S. Lapuschkin, and W. Samek, "AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark," Journal of the Franklin Institute, 2023.
[15] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling Laws for Neural Language Models," 2020.
[16] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[17] N. X. Vinh, J. Epps, and J. Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance," J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Dec. 2010.
[18] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[19] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979.
[20] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.

Forbes 12 Best Stocks To Buy For 2024
No ratings yet
Forbes 12 Best Stocks To Buy For 2024
29 pages
D U C G M V A: EEP Nsupervised Lustering With Aussian Ixture Ariational Utoencoders
No ratings yet
D U C G M V A: EEP Nsupervised Lustering With Aussian Ixture Ariational Utoencoders
12 pages
Intro To Vae
No ratings yet
Intro To Vae
89 pages
An Introduction To Variational Autoencoders: Foundations and Trends in Machine Learning
No ratings yet
An Introduction To Variational Autoencoders: Foundations and Trends in Machine Learning
89 pages
Is Simple Better?: Revisiting Simple Generative Models For Unsupervised Clustering
No ratings yet
Is Simple Better?: Revisiting Simple Generative Models For Unsupervised Clustering
6 pages
Wikipedia VAE
No ratings yet
Wikipedia VAE
9 pages
8.auto-Encoding Variational Bayes
No ratings yet
8.auto-Encoding Variational Bayes
14 pages
Bayesian NN
No ratings yet
Bayesian NN
82 pages
Reparametrization Trick
No ratings yet
Reparametrization Trick
8 pages
L11 - UCLxDeepMind DL2020
No ratings yet
L11 - UCLxDeepMind DL2020
68 pages
1312.6114v1
No ratings yet
1312.6114v1
9 pages
Density Estimation Using Real NVP
No ratings yet
Density Estimation Using Real NVP
32 pages
Auto Encoding Variational Bayes
No ratings yet
Auto Encoding Variational Bayes
14 pages
IAF Kingma Et Al 2016
No ratings yet
IAF Kingma Et Al 2016
16 pages
Tung Kieu - Probabilistic - Graphical - Model - Report
No ratings yet
Tung Kieu - Probabilistic - Graphical - Model - Report
9 pages
Tutorial on diffusion models
No ratings yet
Tutorial on diffusion models
4 pages
P - Improving latent variable discriptiveness by modelling rather than ad-hoc factors
No ratings yet
P - Improving latent variable discriptiveness by modelling rather than ad-hoc factors
11 pages
1804.00891v3
No ratings yet
1804.00891v3
19 pages
08_VariationalInference
No ratings yet
08_VariationalInference
31 pages
Auto-Encoding_Variational_Bayes
No ratings yet
Auto-Encoding_Variational_Bayes
8 pages
Mixture Models
No ratings yet
Mixture Models
16 pages
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan March 28, 2024
No ratings yet
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan March 28, 2024
51 pages
Autoencoder For Neuroimage: Abstract. Variational Autoencoder (Vae) As A Class of Neural Networks
No ratings yet
Autoencoder For Neuroimage: Abstract. Variational Autoencoder (Vae) As A Class of Neural Networks
7 pages
Time Grad
No ratings yet
Time Grad
11 pages
Introduction To VAE
No ratings yet
Introduction To VAE
5 pages
ACV - Notes - Final
No ratings yet
ACV - Notes - Final
7 pages