Audio Intent Detection: A Classification Problem
Ali Yassine
Politecnico di Torino
s312920
[email protected]
I. PROBLEM OVERVIEW
The proposed competition is a classification task on an audio intent dataset: a collection of audio recordings of English-language commands spoken by a variety of speakers from different nations and with different levels of proficiency. The goal of the competition is to correctly identify the command (action - object) stated in each recording.
The dataset consists of two parts:
• a development set of 9,854 recordings, for which labels are provided;
• an evaluation set of 1,455 recordings.
Based on the development set, several observations can be made. First, the data is heavily unbalanced with respect to speaker origin: almost 95% of the speakers identified themselves as native English (United States) speakers. Second, the recordings were sampled at two different rates, 16,000 Hz and 22,050 Hz, with 96.95% of them sampled at 16,000 Hz. Finally, the recordings' lengths are not consistent: there are 124 distinct durations, with a mean of 2.639 s, and some outliers lasting 20 s that upon manual inspection turned out to be silence. Because the submitted recordings have variable sampling rate and duration, I propose a method that brings all recordings to a common ground and extracts a fixed number of features to feed to the classification models.
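The report does not spell out this normalization step, so the following is only a minimal sketch of one way to realize it, assuming librosa: each recording is resampled to the dominant 16,000 Hz rate and zero-padded or truncated to a fixed length (the 3 s target below is an illustrative choice, not a value taken from the paper).

import numpy as np
import librosa

TARGET_SR = 16000           # rate used by 96.95% of the recordings
TARGET_LEN = 3 * TARGET_SR  # assumed fixed duration of 3 s (mean length is ~2.64 s)

def load_normalized(path: str) -> np.ndarray:
    # librosa resamples to TARGET_SR while loading.
    y, _ = librosa.load(path, sr=TARGET_SR)
    # Zero-pad short recordings, truncate long ones.
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))
    return y[:TARGET_LEN]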
The most interesting aspect of the evaluation set, however, is that all of its recordings were made by individuals who self-identified as native English (United States) speakers and who regularly use the language at work or school.
In order to get a better understanding of the data, we can visually inspect several signals in both the time and the frequency domain, since each domain exposes different features of a signal. Figure 1 depicts one signal in the time domain; in addition to the voice instruction, a recording may contain silence, which can make up a sizeable portion of it.

Fig. 1: Representation of a recording in the time domain.

Regarding the frequency domain, in order to illustrate the variation and uniqueness of the data at hand, Figure 2 shows the magnitudes, measured in decibels, of two distinct recordings of the same command given by a user.

Fig. 2: Magnitude of two recordings of the same class in dB.
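For reference, a dB-scaled magnitude view like the one in Fig. 2 can be computed from the short-time Fourier transform; the snippet below is only an illustrative plotting sketch with librosa and matplotlib, and the filename is a placeholder.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=16000)  # placeholder path
# Magnitude spectrogram converted to decibels relative to the peak.
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()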
II. PROPOSED APPROACH
A. Data preprocessing
As previously mentioned, the data at hand is not clean; it contains unnecessary content in the form of noise and silence. The following steps were taken to tackle this issue:
• Audio trimming: deleting unneeded information (in this case, silence) from the beginning and end of each audio signal.
• Noise reduction: suppressing unwanted audio sources, such as ancillary fan noise.
A good illustration of the same recording before and after trimming and noise reduction is shown in Figure 3 below. The cleaned signals are then described through Mel-frequency cepstral coefficients (MFCCs), whose extraction algorithm is outlined in Figure 4.

Fig. 4: MFCCs extraction algorithm [2]
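The exact implementation is not reproduced in the paper, so the following is only a minimal sketch of the preprocessing and MFCC-extraction steps, assuming librosa (loading, trimming, MFCCs) and the noisereduce package (spectral-gating noise reduction); the parameter values (top_db, n_mfcc) and the mean/std pooling are illustrative assumptions, not the paper's reported settings.

import librosa
import noisereduce as nr
import numpy as np

def extract_features(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    # Load and resample to a common rate.
    y, sr = librosa.load(path, sr=sr)
    # Suppress stationary background noise (e.g., fan hum) via spectral gating.
    y = nr.reduce_noise(y=y, sr=sr)
    # Drop leading/trailing silence; top_db=30 is an illustrative threshold.
    y, _ = librosa.effects.trim(y, top_db=30)
    # MFCC matrix of shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pool over time so every recording yields a fixed-length feature vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

Pooling over time gives every recording the same feature dimensionality regardless of its duration, which is what fixed-input classifiers such as SVMs and random forests require.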
Table I lists the hyperparameter values considered for each model during the grid search (a scikit-learn sketch of this search is given after the concluding remarks below), and Figure 6 reports the accuracy of the models both before and after hyperparameter tuning.

TABLE I: Hyperparameters considered

Model           Parameter          Values
SVM             C                  0.1, 1, 4, 8, 10, 50
                gamma              0.01, 0.1, 1
                kernel             rbf
Random Forest   n_estimators       10, 100, 1000
                max_depth          3, 5, 7
                min_samples_leaf   1, 2, 3
                criterion          gini, entropy

Fig. 6: Accuracy of both models before and after grid search

… recordings that are available.
• Apply convolutional neural networks as a model for classification [6].
• Use sophisticated filters to reduce the noise in the data.
• Implement advanced feature selection algorithms.

The outcomes are encouraging, but they can still be improved. It should be noted, however, that the lack of further experiments was due to limited resources, particularly computational ones.
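As referenced above, here is a minimal sketch of how the grid search over the values in Table I could be set up with scikit-learn; X_train and y_train are assumed to be the pooled feature matrix and the intent labels, and the cross-validation settings are illustrative, not the paper's.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameter grids taken from Table I.
svm_grid = {"C": [0.1, 1, 4, 8, 10, 50], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
rf_grid = {"n_estimators": [10, 100, 1000], "max_depth": [3, 5, 7],
           "min_samples_leaf": [1, 2, 3], "criterion": ["gini", "entropy"]}

searches = {
    "SVM": GridSearchCV(SVC(), svm_grid, cv=5, scoring="accuracy"),
    "Random Forest": GridSearchCV(RandomForestClassifier(), rf_grid, cv=5, scoring="accuracy"),
}
for name, search in searches.items():
    search.fit(X_train, y_train)  # X_train, y_train: assumed features and labels
    print(name, search.best_params_, search.best_score_)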
REFERENCES
[1] E. Salomons and P. Havinga, "A Survey on the Feasibility of Sound Classification on Wireless Sensor Nodes," Sensors, vol. 15, no. 4, pp. 7462-7498, 2015, doi: 10.3390/s150407462.
[2] M. A. Hossan, S. Memon and M. A. Gregory, "A novel approach for MFCC feature extraction," 2010 4th International Conference on Signal Processing and Communication Systems, Gold Coast, QLD, Australia, 2010, pp. 1-5, doi: 10.1109/ICSPCS.2010.5709752.
[3] I. T. Jolliffe and J. Cadima, "Principal component analysis: a review and recent developments," Phil. Trans. R. Soc. A, vol. 374, 20150202, 2016, doi: 10.1098/rsta.2015.0202.
[4] L. Lu, H.-J. Zhang and S. Li, "Content-based audio classification and segmentation by using support vector machines," Multimedia Systems, vol. 8, pp. 482-492, 2003, doi: 10.1007/s00530-002-0065-0.
[5] L. Grama and C. Rusu, "Audio signal classification using Linear Predictive Coding and Random Forests," 2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, Romania, 2017, pp. 1-9, doi: 10.1109/SPED.2017.7990431.
[6] S. Hershey et al., "CNN architectures for large-scale audio classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 131-135, doi: 10.1109/ICASSP.2017.7952132.