Automatic Classification of Singing Voice Quality
Bozena Kostek
Gdansk University of Technology
All content following this page was uploaded by Bozena Kostek on 26 October 2019.
ASE23    20.61      ASE24    15.85      PLPmax   8.34
SFM16    19.81      K4max    15.57      SFM17    8.29
K4max    16.32      ASEv16   14.71      ASEv23   8.26
md2      15.91      PLPmax   14.46      br       7.91
ASE22    15.45      ASCv     13.41      ASEv5    7.85

[Figure 1 Mean correlation values obtained for best parameters according to the Fisher statistic analysis]
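The rankings above rest on a Fisher statistic computed per parameter and per pair of classes. The formula is not restated in this part of the paper, so the following sketch assumes a common two-sample form, $|\bar{x} - \bar{y}| / \sqrt{s_x^2/n + s_y^2/m}$. The function names and the toy values are hypothetical and only illustrate how parameters could be ranked for one class pair.

```python
import math
from statistics import mean, variance


def fisher_statistic(x, y):
    """Assumed two-sample form: |mean(x) - mean(y)| / sqrt(var(x)/n + var(y)/m)."""
    n, m = len(x), len(y)
    spread = math.sqrt(variance(x) / n + variance(y) / m)
    return abs(mean(x) - mean(y)) / spread


def rank_parameters(class_a, class_b):
    """Rank parameter names by decreasing statistic.

    Both arguments map a parameter name to the list of its values
    measured on the sounds of one class.
    """
    scores = {
        name: fisher_statistic(class_a[name], class_b[name])
        for name in class_a
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values, not data from the paper.
group_a = {"p1": [1.0, 1.2, 0.9, 1.1], "p2": [0.50, 0.70, 0.60, 0.40]}
group_b = {"p1": [2.0, 2.1, 1.9, 2.2], "p2": [0.55, 0.65, 0.50, 0.60]}

ranking = rank_parameters(group_a, group_b)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these toy values the first parameter separates the two groups far better than the second, so it is ranked first.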
The same parameters win for the professional - MA (Music Academy) student pairs, but the corresponding Fisher statistic values are already smaller. For MA student - amateur pairs the values of the Fisher statistic are the lowest, and some additional parameters (e.g. Brightness) seem to be important in the recognition process.
In order to check whether the parameters are correlated, a correlation analysis was performed and a mean correlation value for each parameter and each sound was calculated (Fig. 1) according to Eq. 6. Parameter K4max had the lowest correlation value, whereas the spectral parameters are strongly correlated with the others. So K4max, although it does not have the highest Fisher statistic value, seems to be very important because it carries additional non-spectral information.

$$K_l = \sum_{k=1}^{K} \sum_{j=1,\, j \neq l}^{N_{param}-1} \mathrm{corr}(p_{kl}, p_{kj}) \qquad (6)$$

where $K$ is the number of sounds, $N_{param}$ is the number of parameters, and $\mathrm{corr}(x, y)$ is the cross-correlation function.
Similarly, the same analysis procedure was performed for voice type classes. The results obtained are presented in Table 3. As may be seen, the best parameters, in terms of class separation, differ from those found for the quality classes. The importance of harmonic parameters (md1, md2, m1) and time parameters [...]

Table 3 Best parameters and the corresponding Fisher statistic values for all pairs of recognized singing voice type classes

Soprano - tenor       Alto - tenor          Soprano - alto
Param.    FS          Param.    FS          Param.    FS
ASE23     20.16       ASE23     27.48       ASC       7.83
K4Pr      14.97       ASE22     20.21       ASE21     7.72
SFMv20    14.47       ASEv24    18.17       SC        6.67
md1       14.16       SFMv16    18.03       SFM12     6.07
ASE32     14.15       SFMv15    15.89       ASE27     5.4
m1        13.77       ASSv      15.66       SFMv10    5.23
md2       13.69       br        15.81       SFM15     5.1
br        13.32       SFMv14    15.01       SFMv6     5.09

3.2 ANN Classification

Feature vectors were divided into two equal sets. The first one was used for the ANN training purposes. The other feature vectors were used to test the generalization performance of the network and to calculate the recognition effectiveness. The ANN was a feed-forward type network. In the first layer the number of neurons was equal to the number of parameters; the hidden layer was experimentally set to 20 neurons. In the output layer the number of neurons was equal to the number of recognized classes (3 in both cases, i.e. for quality and for voice type recognition). During the learning process a validation test was used, and the training was stopped if the validation error was rising
for 100 cycles. The training of the ANN went quite smoothly; around 200 to 300 cycles were sufficient for this phase. The activation function was sigmoid.
In Tables 4 - 7 results for singing voice quality and type classification are presented. In the case of the vocal quality classification all 1060 sounds were used. In the case of the vocal type recognition the number of derived feature vectors was lower, because sounds of the poorest singers were not used for the vocal type recognition (in some cases of amateurs it was hard to judge whether a vocalist was a soprano or an alto singer). Output classes were denoted ‘professionals’, ‘MA students’, ‘amateurs’ for quality, and ‘soprano’, ‘alto’ and ‘tenor’ for the voice type cases.

Table 4 Results obtained for the singing voice type recognition (number of recognized sounds) by the ANN

In \ out     Soprano    Alto    Tenor
Soprano      76         1       0
Alto         5          54      0
Tenor        0          0       67

Table 5 Singing voice type recognition (ANN)

             No. of sounds    Errors    Accuracy [%]
Soprano      77               1         98.70
Alto         59               5         91.53
Tenor        67               0         100
Total        203              6         97.04

Table 6 Results obtained for singing voice quality classification (number of recognized sounds) by the ANN

In \ out         Professionals    MA students    Amateurs
Professionals    157              11             3
MA students      19               152            21
Amateurs         3                28             137

Table 7 Singing voice quality classification (ANN)

                 No. of sounds    Errors    Accuracy [%]
Professional     171              14        91.81
MA student       192              40        79.17
Amateur          168              31        81.55
Total            531              85        84.00

Recognition of the voice type seemed to be an easier task. For over 200 testing sounds only 6 singing voices were badly recognized, resulting in a total effectiveness of 97% (100% accuracy in the tenor recognition was obtained). Soprano and alto sounds were also well recognized (only 6 errors), although low Fisher statistic values were obtained for those classes. An explanation is that the parameters contained in the feature vector were not correlated and carried supplementary information, which in sum gave a good separation of the recognized classes.
In the case of the vocal quality classification a lower effectiveness was obtained (on average 84%), but most of the errors occurred between professionals and Music Academy students. Some of the sounds sung by them were of good quality, as also happened for an amateur singer. It is important to observe that there occurred only 3 errors in recognition between professional and amateur singers (2% of the total number of sounds).

3.3 Rough-Set-Based Classification

In order to check the recognition effectiveness, RSES, the rough set decision system, was used [11]. Feature vectors were divided into training and testing sets. Parameters were quantized according to the RSES system principles. The local discretization was used. In Tables 8 and 9 the RSES-based decision system recognition results are presented. In the case of quality classification the number of rules extracted was 7550, the minimum length of a rule was 2, and the maximum length was equal to 10. An example of the rules obtained is given below:

If ASSv=(-Inf,-0.701) and SFM16=(-Inf,-0.080) and F2=(0.153,Inf) and K4max=(-Inf,-0.747) then Quality=GOOD

On the other hand, in the case of the singing voice type recognition, the number of rules derived was 6900, the minimum length was 2 and the maximum was equal to 12.

Table 8 Results obtained for singing voice quality classification (number of recognized sounds) by the RSES system

In \ out         Professionals    MA students    Amateurs
Professionals    157              15             2
MA students      35               110            58
Amateurs         8                34             114

Table 9 Results obtained for the singing voice type classification (number of recognized sounds) by the RSES system

In \ out     Soprano    Alto    Tenor
Soprano      73         6       0
Alto         7          51      3
Tenor        2          6       57
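The effectiveness figures quoted for both classifiers can be recomputed from the confusion matrices. The sketch below assumes that effectiveness means the share of sounds lying on the matrix diagonal (rows are input classes, columns are output classes). The function names are illustrative, and the counts are those of Tables 6 and 8.

```python
def per_class_accuracy(matrix):
    """Percentage of correctly recognized sounds for each input class (row)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Percentage of all sounds that lie on the diagonal of the matrix."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows and columns: professionals, MA students, amateurs.
table6_ann = [
    [157, 11, 3],
    [19, 152, 21],
    [3, 28, 137],
]
table8_rses = [
    [157, 15, 2],
    [35, 110, 58],
    [8, 34, 114],
]

for name, matrix in (("ANN", table6_ann), ("RSES", table8_rses)):
    classes = ", ".join(f"{value:.2f}" for value in per_class_accuracy(matrix))
    print(f"{name}: per class [{classes}], overall {overall_accuracy(matrix):.2f}")
```

Under this assumption the overall values come out at roughly 84% for the ANN (Table 6) and 71% for RSES (Table 8), and the per-class values for Table 6 reproduce the accuracies listed in Table 7.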
The recognition effectiveness obtained is lower than in the case of ANNs. The average recognition effectiveness for the rough set decision system is 71% for quality and 87% for the voice type recognition. The ANN recognition results were 84% and 97%, respectively.

4. Conclusions

The results derived from the statistical analysis show that high values of the Fisher statistic may not be sufficient for a good separation between classes, and conversely low values may indicate better quality of parameters if the correlation analysis performed on the same data indicates this. The best example is the presented K4max parameter, which has lower Fisher statistic values in comparison to the spectral parameters, but is much less correlated with other parameters.
The experiments carried out show good effectiveness of the ANN-based singing voice recognition system (especially in the case of the recognition of singing voice type, 97%). The accuracy of 84% attained by the system in the case of singing voice quality recognition can be explained by the fact that it is sometimes difficult to classify a singer into one class. Amateur singers sometimes manage to sing a vowel well (rather casually), and conversely some vowels sung by professionals are not perfect. In the experiments carried out there were only 3 errors in the recognition between professional and amateur singers (2% of the total number of sounds). On the other hand, the effectiveness of the system could be drastically improved if more than one sung vowel were analyzed, because casual recognition errors between neighboring classes would then not influence the total quality judgment.
The results obtained with the RSES decision system are less optimistic. This can be explained by the complexity of the problem related to the choice of parameters and their quantization. A non-linear neural system managed to analyze the parametrized data properly. In the experiments performed the proposed methodology utilized the Fisher statistic and the correlation analyses, aimed at diminishing the number of parameters used in the classification process. Although such procedures are easily justified with neural networks, this can be a drawback while using a rough set system. The rough set-based analysis has the ability to search for significant attributes, reducts and core; thus in future experiments all 259 parameters should be contained in decision tables rather than diminishing their number before the analysis starts.
The great advantage of the rough-set system is that decision rules can be reviewed by the experimenter conducting tests and their content can be easily analyzed. At this stage of experiments it is also worth mentioning that rules derived from the rough set-based analysis may be compared to the subjective test results. The subjective evaluation is very important in quality judgment. Thus what is further proposed is an attempt to find a correlation between attributes named by the experts and parameters extracted from singing voices. This can be done by analyzing the rough set-based knowledge induction obtained from the decision system. Moreover, the quantization procedure results can also be compared with the ones obtained by experts in a more subjective way. In such a way the whole classification process would get a subjective justification. Such a scheme of experiments was already tested on musical samples and seemed well adapted to music-related studies. This is one of the advantages of the rough set analysis over the neural network-based classification, which does not allow for carrying out such a quality comparison.

5. References

[1] BLOOTHOOFT G., “The sound level of the singer's formant in professional singing”, J. Acoust. Soc. Am., 79 (6), pp. 2028-2032, 1986.
[2] HERMANSKY H., “Perceptual Linear Predictive (PLP) Analysis of Speech”, J. Acoust. Soc. Am., pp. 1738-1752, April 1990.
[3] KOSTEK B., CZYZEWSKI A., “Representing Musical Instrument Sounds for Their Automatic Classification”, J. Audio Eng. Soc., 49, 9, pp. 768-785, 2001.
[4] KOSTEK B., “Soft computing in acoustics”, Physica Verlag, New York, Heidelberg, 1999.
[5] KOSTEK B., SZCZUKO P., ŻWAN P., “Processing of Musical Data Employing Rough Sets and Artificial Neural Networks”, in Rough Sets and Current Trends in Computing, RSCTC, Uppsala, Sweden, Lecture Notes in Artificial Intelligence, LNAI 3066, Springer Verlag, Berlin, Heidelberg, New York, pp. 539-548, 2004.
[6] MENDES A., “Acoustic effect of vocal training”, 17th ICA Proceedings, vol. VIII, pp. 106-107, Rome, 2001.
[7] ROTHMAN H.B., “Why we don’t like these singers”, 17th ICA Proceedings, vol. VIII, pp. 114-115, Rome, 2001.
[8] SUNDBERG J., “The science of the singing voice”, Northern Illinois University Press, Dekalb, Illinois, 1987.
[9] SZCZUKO P., DALKA P., DABROWSKI P., KOSTEK B., “MPEG-7-based Low-Level Descriptor Effectiveness in the Automatic Musical Sound Classification”, 116th Audio Eng. Conv., Preprint No. 6105, Berlin, 2004.
[10] ZWAN P., “Glottal source parametrization”, Proc. of X Symposium on New Trends in Audio Video Technology, Wroclaw, pp. 64-72 (in Polish), 2004.
[11] Rough-set Exploration System ver. 2.2 user manual: logic.mimuw.edu.pl/~rses/RSES_doc_eng.pdf