ECMA-418-2, 2nd Edition, December 2022
Psychoacoustic metrics
for ITT equipment —
Part 2 (models based on
human perception)
1 Scope
2 Conformance
3 Normative references
4 Terms and definitions
5 A hearing model approach to calculate psychoacoustic parameters
5.1 Psychoacoustic hearing model
5.1.1 Overview
5.1.2 Pre-processing of input data
5.1.3 Outer and middle/inner ear filtering
5.1.4 Auditory filtering bank
5.1.5 Segmentation
5.1.6 Rectification
5.1.7 Calculation of root-mean-square values
5.1.8 Nonlinearity to transform sound pressure into specific loudness
5.1.9 Consideration of threshold in quiet
6 Identification and evaluation of prominent tonalities using a psychoacoustic tonality calculation method
6.1 Determination of tonality
6.1.1 Tonalities and their relationships to the threshold of hearing
6.1.2 Multiple tones in a critical band, and time-variation of tonality due to their interaction
6.2 Psychoacoustic tonality calculation method
6.2.1 Overview
6.2.2 Autocorrelation function
6.2.3 Averaging of ACFs
6.2.4 Application of ACF window
6.2.5 Estimation of tonal loudness
6.2.6 Resampling to common time basis
6.2.7 Noise reduction
6.2.8 Calculation of time-dependent specific tonality
6.2.9 Calculation of averaged specific tonality
6.2.10 Calculation of time-dependent tonality
6.2.11 Calculation of representative values
6.3 Information to be recorded for prominent tonalities
7 Identification and evaluation of prominent roughness using a psychoacoustic roughness calculation method
7.1 Psychoacoustic roughness calculation method
7.1.1 Overview
7.1.2 Envelope calculation and downsampling
7.1.3 Calculation of scaled power spectrum
7.1.4 Noise reduction of the envelopes
7.1.5 Spectral weighting
7.1.6 Optional entropy weighting based on randomness of modulation rate
7.1.7 Calculation of time-dependent specific roughness
7.1.8 Calculation of representative values
7.1.9 Calculation of time-dependent roughness
7.1.10 Calculation of representative values
7.1.11 Calculation of roughness for binaural signals
7.2 Information to be recorded for prominent roughness
ECMA-418 Parts 1 and 2 are psychoacoustic standards and as such prescribe methods that represent the
perception of noise emitted by ITT equipment. Sound signals recorded by the procedures of ECMA-74 are
analysed using the psychoacoustic methods of ECMA-418 Parts 1 and 2. While intended for ITT equipment,
the methods may be useful for other applications as well.
The psychoacoustic methods in this standard, ECMA-418 Part 2, are based on a human hearing model of Sottek
that expresses specific loudness, which describes level- and frequency-dependent masking and threshold of
hearing. The model approximates the well-established Zwicker specific loudness method, but was extended by
using a modified Bark scale covering the entire audible frequency range and an improved nonlinear matching
of loudness at higher levels, which leads to a significant improvement of the prediction quality for several
loudness matching experiments using synthetic and technical sounds.
Additional models described in this standard use the specific loudness to express the strength of perceived
tonality and roughness. The models of this standard, Part 2, are more intricate than those of Part 1, which
considers sound pressure in narrow and critical bands and hearing threshold.
− The hearing model, tonality, and roughness procedures of Clauses 5, 6, and 7 were refined, and the
descriptions of these procedures improved to assist implementation.
− In Clause 5, a figure showing auditory filter bank response of the hearing model of Sottek was added
to assist implementation.
− An entropy-weighted roughness based on the randomness of the modulation rate was added to Clause 7.1 for
applications in which measured rotational speed is available.
− Clause 8 was added to describe loudness of sounds with subcritical or larger bandwidths.
ECMA-418 series consists of the following parts, under the general title “Psychoacoustic metrics for ITT
equipment”:
This Ecma Standard was developed by Technical Committee 26 and was adopted by the General
Assembly of December 2022.
1 Scope
This standard describes the hearing model and psychoacoustic metrics dependent on the hearing model. The
input to the hearing model consists of sound signals recorded using the procedures of ECMA-74. The hearing model
expresses specific loudness [1]. Psychoacoustic models use the specific loudness to express the strength of any
tonalities or roughness in the sound generated by Information Technology and Telecommunications (ITT)
equipment. While developed for ITT equipment, the psychoacoustic methods of this standard may be relevant
to other applications like automobiles, consumer appliances, etc.
The tonality metric of this standard uses the auto-correlation function to describe causes of perceived tonality
such as individual or multiple steady or time-varying discrete tones, individual or multiple spectrally elevated
bands or slopes of noise, and combinations of these phenomena. A similar approach was published in 1998 to
determine “pitch salience” [2].
The roughness metric presented in this standard uses a spectrum of the sound signal envelope, refined by a
quadratic fit estimator, to describe roughness arising from sound signal envelope variations within a critical band
at modulation rates between 20 and around 300 Hz. For steady sounds, roughness perception peaks at
modulation rates of 70 Hz.
The loudness metric of this standard uses a nonlinear combination of tonal and noise loudness calculated as
intermediate results of the tonality algorithm to achieve a very good match of perceived loudness, especially for
sounds with a subcritical bandwidth (sounds containing tonal and noise components).
2 Conformance
Measurements are in conformity with this Standard if they meet the following requirements:
3 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references, the
latest edition of the referenced document (including any amendments) applies.
For the purposes of this document, the following terms and definitions apply.
NOTE If a definition is identical to that in another standard, that standard and definition number is given in brackets.
4.1
loudness
𝑁
perceived magnitude of a sound, which depends on the acoustic properties of the sound and the specific
listening conditions, as estimated by the average human listener with normal hearing.
NOTE 2 Loudness depends primarily upon the sound pressure level, although it also depends upon the frequency,
bandwidth, and duration of the sound.
NOTE 3 A sound that is twice as loud as another sound is characterized by doubling the number of sones.
4.2
specific loudness
𝑁′
perceived magnitude or volume of sound in a critical band.
NOTE 1 The unit of specific loudness is expressed in terms of sone per Bark.
4.3
equal loudness contour
the sound pressure level¹ for which the average human listener with normal hearing perceives constant loudness
when presented with a single frequency (pure) tone.
NOTE 1 Equal loudness contour is parameterized by the sound pressure level and frequency of the presented tone.
See ISO 226:2003.
4.4
threshold of hearing
level of a sound at which, under specified conditions, a person gives 50 % of correct detection responses on
repeated trials.
4.5
critical band
filter within the human cochlea describing the frequency resolution of the auditory system with characteristics
that are usually estimated from the results of masking experiments.
4.6
critical bandwidth
bandwidth of a critical band.
NOTE 1 Each critical bandwidth has a width of one unit on the critical band rate scale.
1 The definition of sound pressure level is given in the terms and definitions of ECMA-74.
NOTE 1 Frequencies on the critical band rate scale are expressed in Bark.
4.8
tonality
a characteristic of sound containing a single-frequency component or narrow-band components that emerge
audibly from the total sound.
NOTE 1 Tonality can arise from individual or multiple steady or time-varying discrete tones, individual or multiple
spectrally elevated bands or slopes of noise, and combinations of these phenomena.
4.9
envelope
the instantaneous amplitude of a signal.
NOTE 1 The instantaneous amplitude describes the low-frequency variations of the amplitude. It has a significantly
lower frequency than the carrier frequency of the signal.
4.10
roughness
a characteristic of sound with the quality of being uneven yet steady.
NOTE 1 Roughness can arise if the envelope of a sound signal within a critical band has temporal variation.
4.11
modulation
fluctuation of the envelope of a signal over time.
NOTE 1 Modulation is expressed in terms of its strength (modulation index) and the speed at which it changes
(modulation rate).
4.12
modulation rate
frequency of changes of the envelope of a signal.
NOTE 2 The word “rate” is used to avoid confusion with the sound frequency.
This clause describes a perception-model-based procedure for determining the specific loudness of a sound,
the hearing model of Sottek. There are different loudness calculation procedures, such as the German standard
DIN 45631/A1 [3] and the international standard ISO 532-1 [4] (both based on Zwicker’s loudness model) as well
as the Dynamic Loudness Model (by Chalupper and Fastl) [5], the Time Varying Loudness model (by Glasberg
and Moore) [6], and the loudness calculation algorithm based on the hearing model of Sottek, allowing for the
prediction of the perceived loudness of time-varying sounds in many cases (ISO 532-2[7] only applies to
stationary sounds). However, previous studies of Rennies et al. [8], [9] showed that the predictions for some time-
varying sounds do not match the loudness ratings of normal-hearing listeners. To address this issue, the
influence of specific signal properties of the sounds on the assessment of loudness was examined in
Reference [1] focusing on impulsive sounds. On the basis of these experiments, it was studied how far the
hearing model approach to time-varying loudness according to Sottek can account for the specific signal
properties of these sounds. It could be shown that the hearing model approach to time-varying loudness
performs better than other existing loudness models: The hearing model, characterized especially by the
application of an improved nonlinearity and the steeper curve progression at higher levels, leads to a significant
improvement of the prediction quality for several loudness matching experiments using synthetic and technical
sounds. In addition, the auditory filter bank used is based on an extended Bark scale covering the entire audible
frequency range while matching the experimental results related to critical bandwidth better than other models.
Further, the hearing model is able to predict the nonlinear behaviour with respect to just-noticeable amplitude
differences and variations. [1]
The hearing model described in this clause transforms sound pressure to loudness, where the unit of loudness
is soneHMS, where HMS stands for “according to the Hearing Model of Sottek” and denotes that the loudness
differs from other definitions. The result of the hearing model can be used as the basis for further psychoacoustic
analyses.
5.1.1 Overview
Figure 1 displays the basic hearing model structure for calculating specific loudness as the basis for determining
other psychoacoustic sensations. Subsequently, the different signal processing blocks of the hearing model are
briefly explained.
Figure 1 — Basic hearing model structure, including the auditory filter bank, where CBF is the number
of critical band filters in the filter bank.
Initially, the first 5 ms of the input signal (corresponding to 𝑛fade in = 0,005 ⋅ 48000 = 240 samples) are multiplied
with a trigonometric weighting function
$$ w_{\text{fade in}}(n) = 0{,}5 - 0{,}5 \cdot \cos\left(\frac{\pi n}{n_{\text{fade in}}}\right) \tag{1} $$
with 𝑛 = 0, 1, … , 𝑛fade in − 1 in order to reduce artifacts due to filter oscillations in case of signals starting with
non-zero values.
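The fade-in of Formula (1) can be sketched as follows. This is a minimal illustration, assuming the signal is a plain Python list of samples already at 48 kHz; the function names `fade_in_window` and `apply_fade_in` are hypothetical and not part of the standard.

```python
import math

SAMPLE_RATE_HZ = 48000                       # sampling rate assumed by the hearing model
N_FADE_IN = round(0.005 * SAMPLE_RATE_HZ)    # 5 ms correspond to 240 samples


def fade_in_window(n_fade_in=N_FADE_IN):
    """Weights of Formula (1): 0.5 - 0.5*cos(pi*n/n_fade_in) for n = 0..n_fade_in-1."""
    return [0.5 - 0.5 * math.cos(math.pi * n / n_fade_in) for n in range(n_fade_in)]


def apply_fade_in(signal, n_fade_in=N_FADE_IN):
    """Return a copy of `signal` whose first n_fade_in samples are weighted."""
    window = fade_in_window(n_fade_in)
    faded = list(signal)
    for n in range(min(n_fade_in, len(faded))):
        faded[n] *= window[n]
    return faded
```

The window starts at exactly zero and approaches (but does not reach) one at the last weighted sample, so the sample following the fade-in is passed unchanged.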
Second, zero-padding on both ends of the signal shall be performed to facilitate later processing steps. The
number of zeros at the end 𝑛zeros,end is calculated as:
where 𝑛samples is the number of samples of the signal and 𝑛new equals to:
where the ceil(𝑥) operator gives the smallest integer value higher than or equal to the number 𝑥. The band-
dependent block size 𝑠b (𝑧) and the hop size3 𝑠h (𝑧) are defined in detail in Clause 5.1.5 and 𝑠b,max and 𝑠h,max
are the largest band-dependent block size and hop size of all used filter stages, which are defined in Clause
6.2.2 for the tonality and in Clause 7.1.1 for the roughness. The number of zeros at the start 𝑛zeros,start shall be
equal to 𝑠b,max . The zero-padded sound pressure signal is named 𝑝(𝑛).
5.1.3.1 Theory
The pre-processing consists of filtering the input signal 𝑝(𝑛) with transfer functions of the outer and of the
middle/inner ear. The transfer function of the outer ear was modelled based on measured head related transfer
functions (HRTFs). The transfer function of the middle/inner ear was chosen such that the filtering together with
the loudness threshold LTQ(𝑧) (as explained in Clause 5.1.9) leads to a loudness estimation emulating the
equal-loudness contours from 20 to 90 phon (with a step size of 10 phon) and the lower threshold of hearing.4
The middle/inner ear filter is optimized on the equal-loudness contours of ISO 226:2003.
2 If the input data is sampled at a different sampling rate than 48 kHz, a resampling to 48 kHz needs to be performed.
3 The hop size is the time shift to the next calculation block, smaller than block size if overlapping is used. It is related to the
percent overlap ov by 𝑠h (𝑧) = 𝑠b (𝑧) ∙ (100 − ov)/100.
4 In Zwicker’s loudness model [3] the influence of the outer and middle ear transfer functions is considered by the ear’s
transmission characteristic 𝑎0 .
5.1.3.2 Implementation
The transfer function of the resulting filter is shown in Figure 3. The overall filter is composed of a filter modelling
the influence of the outer ear and a filter modelling the influence of the middle/inner ear. Those filters are also
shown in Figure 3.
Each second-order filter 𝐻𝑘 (𝑓) can be implemented using the recursive Formula (5)
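$$ y(n) = \sum_{m=0}^{2} b_{mk}\, x(n-m) - \sum_{m=1}^{2} a_{mk}\, y(n-m) \tag{5} $$
where 𝑥(𝑛) is the input and 𝑦(𝑛) the output of the second-order filter 𝑘.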
Filter coefficients
k   𝑏0𝑘        𝑏1𝑘         𝑏2𝑘         𝑎1𝑘         𝑎2𝑘
1   1,015896   -1,925299    0,922118   -1,925299    0,938014
2   0,958943   -1,806088    0,876439   -1,806088    0,835382
3   0,961372   -1,763632    0,821788   -1,763632    0,783160
4   2,225804   -1,434650   -0,498204   -1,434650    0,727599
5   0,471735   -0,366092    0,244145   -0,366092   -0,284120
6   0,115267    0,000000   -0,115267   -1,796003    0,805838
7   0,988029   -1,912434    0,926132   -1,912434    0,914161
8   1,952238    0,162320   -0,667994    0,162320    0,284244
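A minimal sketch of how the tabulated coefficients might be applied is given below. It rests on two assumptions that this text does not state explicitly: each section follows the usual direct-form difference equation y(n) = b0·x(n) + b1·x(n−1) + b2·x(n−2) − a1·y(n−1) − a2·y(n−2), and the eight sections are applied in series. The names `SECTIONS`, `biquad` and `ear_filter` are hypothetical.

```python
# Rows of the coefficient table: (b0, b1, b2, a1, a2) for k = 1..8.
SECTIONS = [
    (1.015896, -1.925299, 0.922118, -1.925299, 0.938014),
    (0.958943, -1.806088, 0.876439, -1.806088, 0.835382),
    (0.961372, -1.763632, 0.821788, -1.763632, 0.783160),
    (2.225804, -1.434650, -0.498204, -1.434650, 0.727599),
    (0.471735, -0.366092, 0.244145, -0.366092, -0.284120),
    (0.115267, 0.000000, -0.115267, -1.796003, 0.805838),
    (0.988029, -1.912434, 0.926132, -1.912434, 0.914161),
    (1.952238, 0.162320, -0.667994, 0.162320, 0.284244),
]


def biquad(x, b0, b1, b2, a1, a2):
    """One second-order section (assumed direct-form recursion, zero initial state)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def ear_filter(p):
    """Apply the eight sections one after another (series connection is an assumption)."""
    out = list(p)
    for coefficients in SECTIONS:
        out = biquad(out, *coefficients)
    return out
```

With this sign convention all eight sections have their poles inside the unit circle, which is one reason to prefer it over the opposite convention. Section 6 has b0 + b1 + b2 = 0, so the cascade blocks a constant (0 Hz) input.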
5.1.4.1 Theory
An auditory filter bank consisting of overlapping asymmetric filters models the frequency-dependent critical
bandwidths and the tuning curves of the frequency-to-place transform of the inner ear, which mediates the firing
of the auditory hair cells as the traveling wave from an incoming sound event progresses along the basilar
membrane. The shape of the auditory filters matches the gammatone filters [11]. The amplitude is chosen such
that the filter has a gain of 0 dB at the centre frequency 𝐹(𝑧), with 𝑧 denoting the critical band rate scale. This
0 dB gain varies slightly for the first critical bands due to influence of the negative frequencies as seen in
Figure 4. The critical bandwidth ∆𝑓(𝑧) is chosen such that it corresponds to the equivalent rectangular
bandwidth (implementation details are given in Formulae (9) and (10)). The inconstant ratio of bandwidth versus
frequency of the auditory filter bank conveys a high frequency resolution at low frequencies and a high time
resolution at high frequencies, with a very small product of time and frequency resolution at all frequencies,
which empowers, for example, human hearing’s recognition of short-duration low-frequency events. The
impulse responses of the auditory filters are chosen as modulated low-pass filters (𝑗 is the imaginary unit):
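$$ h_{\mathrm{LP},z}(t) = \varepsilon(t) \cdot \frac{1}{(k-1)!} \cdot \frac{1}{\tau(z)} \cdot \left(\frac{t}{\tau(z)}\right)^{k-1} \exp\left(-\frac{t}{\tau(z)}\right) \tag{7} $$
where 𝑘 is the filter order⁵, 𝜀(𝑡) is the unit step function and the exclamation mark denotes the factorial operation.
𝜏(𝑧) is a time constant, related to ∆𝑓(𝑧) by
$$ \tau(z) = \frac{1}{2^{2k-1}} \cdot \binom{2k-2}{k-1} \cdot \frac{1}{\Delta f(z)}. \tag{8} $$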
The centre frequencies 𝐹(𝑧) and corresponding bandwidths ∆𝑓(𝑧) of the filter bank are calculated as
$$ F(z) = \frac{\Delta f(f=0)}{c} \sinh(cz) \tag{9} $$
$$ \Delta f(z) = \sqrt{\left(\Delta f(f=0)\right)^2 + \left(c F(z)\right)^2}. \tag{10} $$
Values for 𝑧 are chosen from 0,5 to 26,5 with a step size of ∆𝑧 = 0,5. ∆𝑓(𝑓 = 0) = 81,9289 Hz and 𝑐 = 0,1618.
These functions and settings lead to a better matching to the Bark table by Zwicker [12] than other existing
formulae, as documented in detail in Reference [1]. The unit of the critical band rate scale of this auditory filter
bank is Bark HMS , where HMS stands for “according to the Hearing Model of Sottek” and denotes that the critical
bands differ from other definitions.
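Formulae (8) to (10) can be evaluated directly. The sketch below assumes the constants quoted above and the filter order 𝑘 = 5 given in the implementation subclause; the function names are hypothetical.

```python
import math

DELTA_F0_HZ = 81.9289   # bandwidth at f = 0, from the text
C = 0.1618              # constant c, from the text
DELTA_Z = 0.5           # step size of the critical band rate scale
FILTER_ORDER = 5        # k, as stated in the implementation subclause


def band_rates():
    """Critical band rates z = 0.5, 1.0, ..., 26.5 (53 values)."""
    return [DELTA_Z * i for i in range(1, 54)]


def centre_frequency(z):
    """Formula (9): F(z) = (delta_f0 / c) * sinh(c * z), in Hz."""
    return DELTA_F0_HZ / C * math.sinh(C * z)


def bandwidth(z):
    """Formula (10): sqrt(delta_f0^2 + (c * F(z))^2), in Hz."""
    return math.sqrt(DELTA_F0_HZ ** 2 + (C * centre_frequency(z)) ** 2)


def time_constant(z, k=FILTER_ORDER):
    """Formula (8): binomial(2k-2, k-1) / 2^(2k-1) / delta_f(z), in seconds."""
    return math.comb(2 * k - 2, k - 1) / 2 ** (2 * k - 1) / bandwidth(z)
```

Because 𝑐𝐹(𝑧) = ∆𝑓(𝑓 = 0) ∙ sinh(𝑐𝑧), Formula (10) is equivalent to ∆𝑓(𝑧) = ∆𝑓(𝑓 = 0) ∙ cosh(𝑐𝑧), i.e. the bandwidth is the derivative of 𝐹(𝑧) with respect to 𝑧. For 𝑘 = 5 the product 𝜏(𝑧) ∙ ∆𝑓(𝑧) equals 70/512.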
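$$ h_{\mathrm{LP},z}(n) = \varepsilon(n) \cdot \frac{(1-d)^k}{\sum_{i=1}^{k-1} e_i d^i} \cdot n^{k-1} d^n, \tag{11} $$
with time index 𝑛 and $d = \exp\left(-\frac{1}{r_{\mathrm{s}} \tau(z)}\right)$ is used⁶; 𝑒𝑖 depends on the filter order 𝑘 and is given below for a specific
value of 𝑘. The band-pass filtering using ℎ𝑧 (𝑡) can be implemented using the discrete approximation of the
band-pass filter
$$ h_z(n) = 2 \cdot \mathrm{Re}\left(h_{\mathrm{LP},z}(n) \cdot \exp\left(\frac{j 2\pi F(z) n}{r_{\mathrm{s}}}\right)\right) = 2 \cdot h_{\mathrm{LP},z}(n) \cdot \cos\left(\frac{2\pi F(z) n}{r_{\mathrm{s}}}\right). \tag{12} $$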
5.1.4.2 Implementation
In the following, instructions for the implementation of the auditory filters are given: Digital filtering can be
implemented using the recursive Formula (13):
$$ y(n) = \sum_{m=0}^{k-1} b_m\, x(n-m) - \sum_{m=1}^{k} a_m\, y(n-m). \tag{13} $$
For the discrete low-pass filter ℎLP,𝑧 (𝑛) as described in Formula (11), the real-valued filter coefficients are
$$ a_m = (-d)^m \binom{k}{m}, \tag{14} $$
and
$$ b_m = \frac{(1-d)^k}{\sum_{i=1}^{k-1} e_i d^i} \cdot d^m e_m. \tag{15} $$
With a used filter order of 𝑘 = 5 the coefficients 𝑒𝑖 in Formula (11) and in Formula (15) are given as
𝑒0 = 0, 𝑒1 = 1, 𝑒2 = 11, 𝑒3 = 11, and 𝑒4 = 1. As explained above, $d = \exp\left(-\frac{1}{r_{\mathrm{s}} \tau(z)}\right)$ with 𝜏(𝑧) as defined in
Formula (8).
$$ a'_m = a_m \exp\left(\frac{j 2\pi F(z) m}{r_{\mathrm{s}}}\right) \tag{16} $$
and
$$ b'_m = b_m \exp\left(\frac{j 2\pi F(z) m}{r_{\mathrm{s}}}\right), \tag{17} $$
with a sampling rate of 𝑟s = 48 kHz. Using these modified filter coefficients in the recursive Formula (13) results
in a discrete implementation of the auditory filters. The filter results in a complex-valued band-pass signal with
a single-sided spectrum. Two times the even part of the spectrum of this signal corresponds to the real-valued
band-pass signal. Thus, the real-valued band-pass signal can be determined as the double real part of the
complex result.
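The coefficient formulae (14) to (17) and the final "double real part" step can be sketched as follows. The recursion itself is an assumption: the usual direct-form difference equation with feed-forward terms 𝑏′0 … 𝑏′𝑘−1 and feedback terms 𝑎′1 … 𝑎′𝑘 is used. The names `band_pass_coefficients` and `auditory_filter` are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 48000.0             # r_s
K = 5                                # filter order k
E = [0.0, 1.0, 11.0, 11.0, 1.0]      # e_0 .. e_4 for k = 5
DELTA_F0_HZ, C = 81.9289, 0.1618     # constants of Formulae (9) and (10)


def band_pass_coefficients(z):
    """Return the complex coefficient lists (a', b') for critical band rate z."""
    f_centre = DELTA_F0_HZ / C * math.sinh(C * z)                     # Formula (9)
    delta_f = math.hypot(DELTA_F0_HZ, C * f_centre)                   # Formula (10)
    tau = math.comb(2 * K - 2, K - 1) / 2 ** (2 * K - 1) / delta_f    # Formula (8)
    d = math.exp(-1.0 / (SAMPLE_RATE_HZ * tau))
    norm = (1.0 - d) ** K / sum(E[i] * d ** i for i in range(1, K))
    a = [(-d) ** m * math.comb(K, m) for m in range(K + 1)]           # Formula (14), a_0 = 1
    b = [norm * d ** m * E[m] for m in range(K)]                      # Formula (15)
    rot = [cmath.exp(2j * math.pi * f_centre * m / SAMPLE_RATE_HZ) for m in range(K + 1)]
    a_mod = [a[m] * rot[m] for m in range(K + 1)]                     # Formula (16)
    b_mod = [b[m] * rot[m] for m in range(K)]                         # Formula (17)
    return a_mod, b_mod


def auditory_filter(x, z):
    """Filter real samples x in band z and return the real-valued band-pass signal."""
    a, b = band_pass_coefficients(z)
    y = []
    for n in range(len(x)):
        acc = sum(b[m] * x[n - m] for m in range(K) if n - m >= 0)
        acc -= sum(a[m] * y[n - m] for m in range(1, K + 1) if n - m >= 0)
        y.append(acc)
    return [2.0 * value.real for value in y]   # double real part of the complex result
```

The values 𝑒1 … 𝑒4 = 1, 11, 11, 1 make the sum over 𝑛 of 𝑛⁴𝑑ⁿ equal to the normalisation denominator divided by (1 − 𝑑)⁵, so the low-pass prototype has unit gain at 0 Hz and the band-pass filter has a gain close to 0 dB at 𝐹(𝑧).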
Figure 4 shows the magnitude of the transfer functions of the auditory filter bank, calculated by filtering a digital
Dirac pulse (sampling rate: 48000 Hz, duration 1 s) using the filter coefficients⁷ defined in Formulae (16) and
(17) with a subsequent Fourier transform of the real-valued band-pass signal.
5.1.5 Segmentation
For further processing, segmentation into blocks needs to be performed and blockwise root-mean-square (RMS)
values need to be calculated. For the segmentation, the band-dependent block size 𝑠b (𝑧) and the hop size 𝑠h (𝑧)
can be chosen depending on the application. Values for 𝑠b (𝑧) and 𝑠h (𝑧) for the calculation of the psychoacoustic
tonality are given in Clause 6.2.2 and for the calculation of the psychoacoustic roughness in Clause 7.1.1.
with 0 ≤ 𝑛′ ≤ 𝑠b (𝑧) − 1, where the time index 𝑙 describes the block number of each block, starting with 𝑙 = 0
(corresponding to a time of 0 ms). 𝑖start (𝑧) is an index offset that guarantees that the first block of all stages
corresponds to the same time reference. It is defined as:
Thus, each block 𝑝𝑙,𝑧 (𝑛′ ) ranges from 𝑛 = 𝑙 ∙ 𝑠h (𝑧) + 𝑖start (𝑧) to 𝑛 = 𝑙 ∙ 𝑠h (𝑧) + 𝑖start (𝑧) + 𝑠b (𝑧) − 1. The last
value of 𝑙, 𝑙last (𝑧), is dependent on the filter band and the value 𝑛new defined in Formula (3):
$$ l_{\text{last}}(z) = \mathrm{ceil}\left(\frac{n_{\text{new}} + s_{\mathrm{h}}(z)}{s_{\mathrm{h}}(z)}\right) - 1. \tag{20} $$
5.1.6 Rectification
Subsequent half-wave rectification accounts for the fact that the auditory nerves fire only when the basilar
membrane vibrates in a specific direction [13]. The resulting band-pass signals are calculated as:
With the segmented and rectified blocks 𝑝rect,𝑙,𝑧 (𝑛′ ), the RMS-values are calculated for each block as:
$$ \tilde{p}(l,z) = \sqrt{\frac{2}{s_{\mathrm{b}}(z)} \sum_{n'=0}^{s_{\mathrm{b}}(z)-1} p_{\mathrm{rect},l,z}^{2}(n')}. \tag{22} $$
The factor of 2 is necessary to compensate for the signal energy which was lost due to the half-wave rectification.
The dependency on the time index 𝑙 is dropped in the following, since the further processing steps are applied
to each time block in the same way.
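The rectification and the block-wise RMS of Formula (22) can be sketched as below. The rectifier is assumed to keep positive samples and set negative samples to zero, which is the common meaning of half-wave rectification; the function names are hypothetical.

```python
import math


def half_wave_rectify(block):
    """Keep positive samples and zero the negative ones (assumed form of the rectifier)."""
    return [max(sample, 0.0) for sample in block]


def block_rms(rectified_block):
    """Formula (22): RMS with the factor 2 that compensates the rectification loss."""
    s_b = len(rectified_block)
    return math.sqrt(2.0 / s_b * sum(value * value for value in rectified_block))
```

For a sine wave of amplitude 1 covering a whole number of periods, half of the signal energy survives the rectification, so the factor of 2 restores the familiar RMS value of 1/√2.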
The compressive nonlinearity of the auditory system is significant for the loudness perception. The specific
loudness distribution, resulting from the application of this nonlinearity to the excitation pattern, also forms the
basis for calculating other psychoacoustic parameters such as tonality, roughness or fluctuation. Such a
nonlinearity function has proven applicable to predict many phenomena like ratio loudness, just-noticeable
amplitude differences and modulation thresholds as well as the level dependence of roughness.
The nonlinearity between specific loudness and sound pressure was reconsidered in the hearing model
according to results of many listening tests [14]. Further improvements for higher levels above approximately
80 dB were achieved by introducing a nonlinearity function according to Formula (23):
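$$ A'(\tilde{p}) = c_{\mathrm{N}} \cdot \left(\frac{\tilde{p}}{\tilde{p}_0}\right) \cdot \prod_{i=1}^{M} \left(1 + \left(\frac{\tilde{p}}{\tilde{p}_{ti}}\right)^{\alpha}\right)^{\frac{v_i - v_{i-1}}{\alpha}} \tag{23} $$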
with root-mean-square values of sound pressure 𝑝̃ and thresholds 𝑝̃𝑡𝑖 in Pa, 𝑝̃0 = 20 µPa. The 𝑀 thresholds 𝑝̃𝑡𝑖
can be derived from Table 2; 𝛼 is set to 1,5; 𝑐N = 0,0211668 is a calibration factor with the
Table 2 — 𝑴 = 𝟖 thresholds and exponents for the nonlinearity function for Formula (23)
𝑖 1 2 3 4 5 6 7 8
$$ \tilde{N}'(z) = A'(\tilde{p}(z)) \tag{24} $$
can be interpreted as the specific loudness of the signal without consideration of the threshold in quiet.
The function according to Formula (23) results from an optimization procedure to fit the experimental data with
the lowest root-mean-square error [14]. It has a steep slope at high levels, which agrees with results of
experiments from Buus et al. [15] and Epstein et al. [16].
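The structure of Formula (23) can be sketched as follows. The thresholds and exponents of Table 2 are not reproduced in this text, so the function takes them as arguments; the values used in the usage lines are made-up placeholders for illustration only and are NOT the values of the standard. The exponent list is assumed to start with 𝑣0, whose value is likewise not given here.

```python
C_N = 0.0211668     # calibration factor c_N, from the text
ALPHA = 1.5         # alpha, from the text
P0_PA = 20e-6       # reference sound pressure of 20 µPa


def nonlinearity(p_rms, thresholds_pa, exponents):
    """Formula (23) for one RMS sound pressure value in Pa.

    thresholds_pa: [p_t1, ..., p_tM] in Pa.
    exponents:     [v_0, v_1, ..., v_M], one entry more than thresholds_pa.
    """
    if len(exponents) != len(thresholds_pa) + 1:
        raise ValueError("exponents must hold v_0 plus one value per threshold")
    result = C_N * p_rms / P0_PA
    for i, p_t in enumerate(thresholds_pa, start=1):
        factor = 1.0 + (p_rms / p_t) ** ALPHA
        result *= factor ** ((exponents[i] - exponents[i - 1]) / ALPHA)
    return result


# Placeholder parameters (hypothetical, for illustration only).
EXAMPLE_THRESHOLDS_PA = [2e-4, 2e-3, 2e-2]
EXAMPLE_EXPONENTS = [1.0, 0.6, 0.3, 0.5]
example_value = nonlinearity(2e-3, EXAMPLE_THRESHOLDS_PA, EXAMPLE_EXPONENTS)
```

Far below the lowest threshold every factor of the product is close to 1 and the function is nearly proportional to the sound pressure; each threshold then changes the local slope on a double-logarithmic scale by 𝑣𝑖 − 𝑣𝑖−1.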
The specific loudness in each band 𝑧 is zero if it is at or below a critical-band-dependent specific loudness
threshold LTQ(𝑧). The band-specific loudness threshold LTQ(𝑧) is given for each used band number 𝑧 from 0,5
to 26,5 in Table 3. Figure 5 shows the loudness threshold LTQ(𝑧) as a function of the centre frequency of the
bands.
8 HMS stands for “according to the Hearing Model of Sottek” and denotes that the calculated loudness and the critical bands
differ from other definitions.
9 The calibration factor 𝑐N can be adjusted within a tolerance of 0,25 % to account for the effects of different implementations.
$$ N'_{\text{basis}}(z) = \begin{cases} \tilde{N}'(z) - \mathrm{LTQ}(z), & \tilde{N}'(z) \ge \mathrm{LTQ}(z) \\ 0, & \tilde{N}'(z) < \mathrm{LTQ}(z) \end{cases}. \tag{25} $$
The result 𝑁′basis (𝑧) is the specific basis loudness of the signal. The specific basis loudness can be used as basis
for other psychoacoustic parameters such as tonality (see Clause 6) and roughness (see Clause 7).
A signal is considered to be audible when its total loudness value exceeds 0,01 soneHMS , where total basis
loudness is calculated summing all specific basis loudness values, using ∆𝑧 = 0,5 as
$$ N_{\text{basis}} = \sum_{i=1}^{\mathrm{CBF}} N'_{\text{basis}}\left(\frac{i}{2}\right) \cdot \Delta z. \tag{26} $$
Consideration of both total and specific basis loudness has the benefit of allowing loudness summation of
sounds consisting of multiple components near threshold.
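Formulae (25) and (26) and the audibility criterion can be sketched as below. The per-band thresholds LTQ(𝑧) of Table 3 are not reproduced here, so they are passed in as a list; the function names are hypothetical.

```python
DELTA_Z = 0.5            # band spacing used in Formula (26)
AUDIBILITY_LIMIT = 0.01  # total basis loudness limit from the text, in sone_HMS


def specific_basis_loudness(n_tilde, ltq):
    """Formula (25): subtract the threshold per band, clamping at zero."""
    return [n - t if n >= t else 0.0 for n, t in zip(n_tilde, ltq)]


def total_basis_loudness(n_basis):
    """Formula (26): sum of the specific basis loudness values times delta z."""
    return sum(n_basis) * DELTA_Z


def is_audible(n_basis):
    """A signal counts as audible when the total exceeds 0.01 sone_HMS."""
    return total_basis_loudness(n_basis) > AUDIBILITY_LIMIT
```

Several bands that each lie only slightly above their threshold can together exceed the audibility limit, which is the loudness summation near threshold mentioned above.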
Recent investigations showed that existing loudness procedures underestimate the loudness of tonal signals [17].
Clause 8.1 describes a new loudness algorithm based on a nonlinear weighting of the partial loudness of tonal
and non-tonal components derived in Clause 6.2.
Prominent perceived tonalities arise from a variety of causes including but not limited to prominent discrete
tones: discrete tones, non-pure tones, narrow elevated noise bands, combinations of tones and narrow elevated
noise bands, band-edges of various slopes terminating elevated noise bands of various bandwidths, and
combinations of these. This clause defines a procedure for identifying and ranking tonalities from any causes.
Discrete tones or other tonalities should only be classified as prominent if they are, in fact, audible in the noise
emissions of the equipment under test. For the tonality calculation methods as described in ECMA-418 – Part 1:
Dominant discrete tones, a pre-calculation screening test is recommended concerning audibility of the tonality.
When calibrated acoustical measurement time data are available, this step is not required for the psychoacoustic
tonality calculation method, regardless of proximity to the threshold of hearing, because the method inherently
considers the threshold of hearing and the psychoacoustic loudness of tonal and non-tonal components.
6.1.2 Multiple tones in a critical band, and time-variation of tonality due to their interaction
The noise emitted by a machine may contain multiple tones or narrowband tonalities, several of which may fall
within a single critical band. Besides the likelihood of increased overall tonality strength due to a plurality of
tones within one critical bandwidth, there is a strong likelihood of beating interference between or among the
plural tonalities causing time structure (amplitude modulation): periodic additions and cancellations affecting the
strength of the perceived tonality within that critical band. In this case the sound is often perceived as “rough”,
leading to the psychoacoustic sensation of “roughness”. A method for the identification of prominent roughness
is described in Clause 7.
6.2.1 Overview
Tonality perceptions arising from spectrally-elevated noise bands of various widths and slopes and from non-
pure tones as well as from discrete (pure) tones, and from combinations of these, can be mis-measured or
escape measure in “hybrid” sound pressure based tools and tools sensitive only to discrete tones. To address
such issues, a new psychoacoustically-based tonality calculation method based on the hearing model in
Clause 5 [18] was developed. The applicability of the model was investigated for technical sounds and compared
to established methods of tonality calculation [19], [20], [21]. The method automatically considers the threshold of
hearing because the hearing threshold is built into the hearing model [21].
Recent research results show a strong correlation between tonality perception and the partial loudness of tonal
sound components [22], [23], [24]. Therefore, the new hearing model approach to tonality on the basis of the
perceived loudness of tonal content has been developed. The new model evaluates the nonlinear and time-
dependent specific loudness of both tonal and broadband components, which are separated using the
autocorrelation function. This model has been validated by many sound situations and listening tests [19].
In early publications, Licklider assumed that human pitch perception is based on both spectral and temporal
cues [25]. According to Licklider, the neuronal processing in human hearing applies a running autocorrelation
analysis of the critical band signals. Under this assumption, psychoacoustic tonality phenomena like difference-
tone perception or the missing-fundamental phenomenon (“virtual pitch”) can be explained.
The further processing for tonality calculation is performed similarly as published in References [19], [20], and
[21] as shown in Figure 6 and described in detail as follows:
Figure 6 — Calculation of tonality based on the scaled ACFs as described in Reference [19], but with
frequency-dependent analysis window borders
Recently, it was proposed to use the autocorrelation function of the band-pass signals to separate tonal content
from noise [1]. The autocorrelation function of white Gaussian noise is characterized by a Dirac impulse. Any
broadband noise signal has at least a non-periodic autocorrelation function with high values at low lags, whereas
the autocorrelation function (ACF) of periodic signals shows also a periodic structure [28]. Thus, the loudness of
the tonal component can be estimated by analyzing the ACF at a certain range with respect to the lag 𝑚, and
also the loudness of the remaining (noisy) part. The calculation of the sliding ACF is time-consuming. Therefore,
the sliding ACF is calculated block-wise using the discrete Fourier Transform (DFT) to shorten computing time.
An overlap of 75% is used for neighbouring blocks. There is a low-pass effect due to averaging over the block
length. The ACF is performed on the same rectified blocks 𝑝rect,𝑙,𝑧 (𝑛′ ) (see Formula (21)) of the overlapping
critical band signals, from which the root-mean-square values were calculated in Clause 5.1.7.
For slowly varying low-frequency band-pass signals, a greater block length 𝑠b (𝑧) is necessary than for higher-
frequency bands. Thus, different block lengths are used, depending on the frequency band. The block length is
Table 4 — Block length 𝒔𝐛 (𝒛) and hop size 𝒔𝐡 (𝒛) for the calculation of the autocorrelation function
∆𝑓(𝑧) 0 − 85 Hz 85 − 170 Hz 170 − 340 Hz > 340 Hz
𝑧 0,5 − 1,5 2−8 8,5 − 12,5 ≥ 13
𝑠b (𝑧) 8192 4096 2048 1024
𝑠h (𝑧) 2048 1024 512 256
For each block of length 𝑠b (𝑧), an unscaled autocorrelation function 𝜑𝑙,𝑧 (𝑚) is calculated in two steps: first, a
2𝑠b-point DFT¹⁰ of 𝑝rect,𝑙,𝑧 (𝑛′) is performed by zero padding, where 𝑠b (𝑧) is the block size given in Table 4, with
a subsequent calculation of the squared magnitude:
$$P_{\mathrm{rect},l,z}(k) = \left| \mathrm{DFT}_{2 s_\mathrm{b}}\left( p_{\mathrm{rect},l,z}(n') \right) \right|^{2}, \quad 0 \le k < 2 s_\mathrm{b}(z), \tag{27}$$
and second, the Inverse Discrete Fourier Transform (IDFT¹¹) of 𝑃rect,𝑙,𝑧 (𝑘) is calculated¹²:
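The next step is to compute a new estimate of an unbiased normalized autocorrelation function that
compensates for lower overlaps at higher lag 𝑚 (windowed, only values for $0 \le m < \tfrac{3}{4} s_\mathrm{b}(z)$ needed):¹³
$$\varphi_{l,z}(m) = \begin{cases} \dfrac{\varphi_{\mathrm{unscaled},l,z}(m)}{\sqrt{\sum_{n'=0}^{s_\mathrm{b}(z)-m-1} p_{\mathrm{rect},l,z}^{2}(n') \cdot \sum_{n'=0}^{s_\mathrm{b}(z)-m-1} p_{\mathrm{rect},l,z}^{2}(n'+m)} + \varepsilon}, & 0 \le m < \tfrac{3}{4} s_\mathrm{b}(z) \\ 0, & \tfrac{3}{4} s_\mathrm{b}(z) \le m < 2 s_\mathrm{b}(z) \end{cases} \tag{29}$$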
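The two-step calculation of Formulae (27) and (29) can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes NumPy, the function name `normalized_acf` is hypothetical, the additive constant 𝜀 is placed outside the square root, and the full two-sided spectrum is used as required by footnote 12.

```python
import numpy as np

EPS = 1e-12  # additive constant epsilon used throughout the document


def normalized_acf(block):
    """Normalized ACF of one rectified block, returned for lags 0 <= m < 2*s_b."""
    block = np.asarray(block, dtype=float)
    s_b = block.size
    # Formula (27): squared magnitude of the zero-padded 2*s_b-point DFT.
    power = np.abs(np.fft.fft(block, n=2 * s_b)) ** 2
    # The inverse DFT of the power spectrum is the unscaled (linear) ACF.
    acf_unscaled = np.fft.ifft(power).real
    # Energies of the two overlapping block parts for every lag m.
    energy = block ** 2
    head = np.cumsum(energy)[::-1]        # sum over n' of p^2(n')
    tail = np.cumsum(energy[::-1])[::-1]  # sum over n' of p^2(n' + m)
    # Formula (29): normalize for m < 3/4 * s_b, zero for all higher lags.
    m_max = (3 * s_b) // 4
    acf = np.zeros(2 * s_b)
    acf[:m_max] = acf_unscaled[:m_max] / (np.sqrt(head[:m_max] * tail[:m_max]) + EPS)
    return acf
```

For a perfectly periodic block, the normalization keeps the ACF at approximately 1 at every multiple of the period, which is the property motivating this estimate in footnote 13.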
10 The N-point DFT is defined as $X(k) = \mathrm{DFT}_N(x(n)) = \sum_{n=0}^{N-1} x(n) \cdot \mathrm{e}^{-j 2 \pi k n / N}$.
11 The K-point IDFT is defined as $x(n) = \mathrm{IDFT}_K(X(k)) = \frac{1}{K} \sum_{k=0}^{K-1} X(k) \cdot \mathrm{e}^{+j 2 \pi k n / K}$.
12 The presented calculations use two-sided spectra. This must be considered in an implementation since some signal
processing libraries also use symmetry properties in their function calls to speed up the calculation and thus expect adjusted
call parameters.
13 A common problem in estimating a blockwise autocorrelation function is the decreasing overlap of the blocks with
increasing lag 𝑚. The unscaled autocorrelation
$$\varphi_z(m) = \sum_{n'=0}^{s_\mathrm{b}(z)-m-1} p_z(n')\, p_z(n'+m)$$
does not consider this problem and thus leads to decreasing values for higher lag values, even if the signal is perfectly
periodic. The commonly used approach for the unbiased autocorrelation, which aims to compensate for this problem, is
$$\varphi_z(m) = \frac{1}{s_\mathrm{b}(z) - |m|} \sum_{n'=0}^{s_\mathrm{b}(z)-m-1} p_z(n')\, p_z(n'+m).$$
However, this approach may lead to unwanted effects, since the result does not necessarily satisfy the condition
$\varphi_z(m) \le \varphi_z(0)$, which is an essential property of the ACF. The new approach for the unbiased autocorrelation solves this
problem by considering the energies of the overlapping parts of the blocks [29]. A drawback of this approach is the
overestimation of the ACF of noise signals for higher lag values, but these values are neglected in further processing.
The autocorrelation function has to be calculated with two different block lengths for some frequency bands
to allow averaging over neighbouring bands in later processing steps, as explained in the following Clause 6.2.3.
The entire ACF is multiplied with the specific basis loudness of the signal¹⁵:
$$\varphi'_z(m) = N'_{\mathrm{basis}}(z) \cdot \varphi_z(m), \tag{30}$$
resulting in scaled¹⁶ ACFs 𝜑𝑧′ (𝑚) which can be used for further analysis of the tonality.
First, ACFs of neighbouring bands are averaged in order to reduce noise. Averaging is performed over 2𝑁𝐵 + 1
bands, i.e., each band is averaged with the neighbouring 𝑁𝐵 lower and 𝑁𝐵 higher frequency bands. The value
𝑁𝐵 is chosen depending on the block size as described in Table 5. Since averaging needs to be performed with
identical block size, it needs to be ensured that the autocorrelation function of neighbouring bands is available
in the same block size. Thus, for frequency bands close to block size changes, the autocorrelation function
needs to be calculated with two different block sizes. If not enough neighbouring frequency bands exist (for the
lower frequency bands), 𝑁𝐵 is reduced such that averaging is still performed symmetrically centred around the
particular frequency band. An exception is made for the lowest frequency band, which is averaged only with the
second-lowest frequency band. This is necessary because symmetric averaging is not possible without a lower
band, while omitting the averaging entirely would result in strong noise artefacts.
𝑁𝐵 2 2 1 0
In a next step, the ACFs are averaged over neighbouring blocks in time for further reduction of noise. This block
averaging is performed only for the block sizes 𝑠b = 8192 and 𝑠b = 4096, in which case the ACF in a given block
is averaged with the ACFs in the preceding and the subsequent blocks. The averaging is not performed for the
first and the last block because they have no preceding or no subsequent block, respectively.
The outcome of the two averaging steps is a modified, noise reduced scaled ACF 𝜑̅𝑧 ′(𝑚).
A lag window with frequency-dependent limits (𝜏start (𝑧) and 𝜏end (𝑧)) according to Formulae (31) and (32) is
applied to the ACF 𝜑̅𝑧 ′(𝑚) to separate tonal from noisy content:
$$\tau_{\mathrm{start}}(z) = \max\left( \frac{0{,}5}{\Delta f(z)},\ \tau_{\min} \right), \tag{31}$$
$$\tau_{\mathrm{end}}(z) = \max\left( \frac{4}{\Delta f(z)},\ \tau_{\mathrm{start}}(z) + 1\ \mathrm{ms} \right). \tag{32}$$
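As a small worked sketch of Formulae (31) and (32), the window limits can be computed from the bandwidth ∆𝑓(𝑧). The function name is hypothetical, and the value of 𝜏min is not stated in this part of the text, so it is passed in as a parameter; the 2 ms used in the usage lines is only an assumed example value.

```python
def lag_window_limits(delta_f_hz, tau_min_s):
    """Return (tau_start, tau_end) in seconds for a band of width delta_f_hz."""
    # Formula (31): lower border, never below tau_min.
    tau_start = max(0.5 / delta_f_hz, tau_min_s)
    # Formula (32): upper border, at least 1 ms above the lower border.
    tau_end = max(4.0 / delta_f_hz, tau_start + 1e-3)
    return tau_start, tau_end


# Narrow band: both limits follow the bandwidth terms.
narrow = lag_window_limits(100.0, tau_min_s=0.002)   # assumed tau_min
# Wide band: both limits are set by tau_min and the 1 ms minimum width.
wide = lag_window_limits(8000.0, tau_min_s=0.002)    # assumed tau_min
```

The example shows the trade-off discussed in footnote 17: a narrower band moves the whole window to higher lags.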
14 The additive constant $\varepsilon = 10^{-12}$ is used throughout the complete document to avoid division by zero in several formulae.
16 The ACF is scaled such that $\varphi'_z(0)$ represents the specific loudness 𝑁′(𝑧).
It can be shown that the autocorrelation function of a periodic signal is itself periodic [28]. In the case of a pure
tone, the period of the ACF equals the period of the tone. Consequently, the signal energy of a pure tone can
be identified at multiples of the signal period. For white Gaussian noise, the autocorrelation function is a Dirac
impulse, weighted by the power spectral density of the noise [28]. In case of broadband white noise, the
autocorrelation function converges towards a Dirac impulse.
[Figure 7: plot of the ACF over the lag t in ms (0 ms to 80 ms), with t0 and the ACF window between tstart and tend marked]
Figure 7 — Positioning of the ACF window for tonal content separation. This example shows the
autocorrelation function of a tone in pink background noise
Figure 7 visualizes the placement of the ACF window for the autocorrelation function of a tone in pink
background noise.¹⁷
where the ceil(𝑥) operator gives the smallest integer value higher than or equal to the number 𝑥 and the floor(𝑥)
operator gives the greatest integer value smaller than or equal to the number 𝑥. The window is applied by setting
all elements of 𝜑̅𝑧 ′(𝑚′) except the ones from index 𝑚start (𝑧) to index 𝑚end (𝑧) to zero and subtracting the mean
of the windowed part of the ACF:
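$$\varphi'_{z,\tau}(m) = \begin{cases} \bar{\varphi}'_z(m) - \dfrac{\sum_{m=m_{\mathrm{start}}(z)}^{m_{\mathrm{end}}(z)} \bar{\varphi}'_z(m)}{M}, & m_{\mathrm{start}}(z) \le m \le m_{\mathrm{end}}(z) \\ 0, & \text{else} \end{cases} \tag{35}$$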
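Formula (35) can be illustrated with the following sketch. It assumes NumPy, a hypothetical function name, and that 𝑀 is the number of samples inside the window, i.e. 𝑀 = 𝑚end − 𝑚start + 1, which this part of the text does not state explicitly.

```python
import numpy as np


def apply_lag_window(acf, m_start, m_end):
    """Keep the ACF for m_start <= m <= m_end, remove its mean, zero the rest."""
    acf = np.asarray(acf, dtype=float)
    windowed = np.zeros_like(acf)
    segment = acf[m_start:m_end + 1]
    # Subtracting the mean of the M windowed samples removes the offset
    # that would otherwise dominate the spectrum at k = 0.
    windowed[m_start:m_end + 1] = segment - segment.mean()
    return windowed


example = apply_lag_window([5.0, 1.0, 2.0, 3.0, 9.0], m_start=1, m_end=3)
```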
17 The motivation for the limits given in Formulae (31) and (32) is as follows: In Figure 7, the energy distribution at small
lags results from the noisy background and is disregarded by appropriately choosing the lower window border. Nevertheless,
narrow-band noise also causes a perception of tonality when the bandwidth is comparatively small (i.e., few critical bands).
This effect leads to a trade-off in the placement of the window borders: For a smaller bandwidth, the effect of the low-pass
filtered noise on the ACF reaches higher lags than for a larger bandwidth. Thus, the window needs to be moved to higher
lags for a lower bandwidth. On the other hand, higher lags are less reliable because they are calculated from a smaller
number of samples. Therefore, the upper limit of the window should not be chosen too large.
The specific loudness of the tonal component is estimated by evaluating the spectrum of the ACF inside the lag
window, $\varphi'_{z,\tau}(m')$. A 16384-point DFT¹⁸ of the 𝑀 samples is performed by zero-padding, where the number
16384 is chosen as two times the largest block size 𝑠b (𝑧) given in Table 4:
$$\Phi'_{z,\tau}(k) = \mathrm{DFT}_{16384}\left( \varphi'_{z,\tau}(m') \right). \tag{36}$$
The maximum magnitude of the spectrum is searched, meaning that the largest tonal content is extracted¹⁹:
$\hat{N}'_{\mathrm{tonal}}(z)$ is a first estimation of the specific loudness of the tonal component. The frequency 𝑓ton (𝑧) of this
component in the critical band centred around 𝑧 can be estimated by first finding the DFT index 𝑘max
corresponding to the maximum of $\Phi'_{z,\tau}(k)$.
While this approach is capable of analysing tonalities with a rather high frequency resolution, it might
underestimate tonal content when the corresponding frequency changes quickly inside of one block. This should
be considered, even though the adaptive block size with smaller blocks for high frequencies aims at reducing
this problem, since quickly varying frequencies usually occur at high frequencies.
For the further processing, the dependency of the time of each processed block becomes important. Thus, the
time index 𝑙 (which was dropped in Clause 6.2.2) needs to be considered. Since the results of different bands
are in a different time basis at this stage of the processing due to a different block length, the bands with a
higher block size are resampled to correspond to the time basis of the blocks calculated with the smallest block
size of 1024. The resampling is done by linear interpolation. In Table 6, the interpolation factors 𝑖 for each critical
band 𝑧 are given.
Table 6 — Interpolation factors for critical bands with different block size
𝑧 0,5 − 1,5 2−8 8,5 − 12,5 ≥ 13
𝑠b (𝑧) 8192 4096 2048 1024
𝑖 8 4 2 1
For all time-dependent variables (𝑖 − 1) new samples are inserted between two given adjacent samples by
simple linear interpolation.
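The resampling described above can be sketched in a few lines. The function name is hypothetical, and the treatment of the last sample (it is simply kept, with nothing appended after it) is an assumption, since the text does not describe the behaviour at the end of the series.

```python
def upsample_linear(values, i):
    """Insert (i - 1) linearly interpolated samples between adjacent samples."""
    if i == 1 or len(values) < 2:
        return list(values)
    out = []
    for a, b in zip(values, values[1:]):
        for step in range(i):
            # step = 0 reproduces the original sample a.
            out.append(a + (b - a) * step / i)
    out.append(values[-1])  # assumption: the final sample is kept unchanged
    return out


# A band computed with block size 4096 uses the interpolation factor i = 4.
resampled = upsample_linear([0.0, 4.0, 8.0], 4)
```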
18 The N-point DFT is defined as $X(k) = \mathrm{DFT}_N(x(n)) = \sum_{n=0}^{N-1} x(n) \cdot \mathrm{e}^{-j 2 \pi k n / N}$.
19 The normalization by $M/2$ is necessary to calculate the energy of the windowed ACF from the DFT result. The scaling factor
2 is necessary because of the half-wave rectified signal.
$$l_{\mathrm{end}} = \operatorname{ceil}\left( \frac{n_{\mathrm{samples}}}{r_\mathrm{s}} \cdot r_{\mathrm{sd}} \right). \tag{40}$$
$\hat{N}'_{\mathrm{tonal}}(l,z)$ is a first estimation of the specific loudness of the tonal component. However, the specific loudness
of the tonal component is usually overestimated at this stage of the estimation process due to the tonal character
of noise in the narrow-band filtered bands. Thus, further noise reduction is necessary. This is done by application
of nonlinear sigmoid weighting of tonal vs. noise components. $\hat{N}'_{\mathrm{tonal}}(l,z)$ is the tonal part of the specific loudness
of the complete band-pass signal. The corresponding specific loudness of the complete band-pass signal is
given by the autocorrelation function at zero lag:
$$N'_{\mathrm{signal}}(l,z) = \bar{\varphi}'_{l,z}(m = 0). \tag{41}$$
A first approximation of the signal-to-noise ratio in the band of interest can be derived as
$$\widehat{\mathrm{SNR}}(l,z) = \frac{\hat{N}'_{\mathrm{tonal}}(l,z)}{N'_{\mathrm{signal}}(l,z) - \hat{N}'_{\mathrm{tonal}}(l,z) + \varepsilon}. \tag{42}$$
Since the estimation of the tonal component might contain unsteady parts, low-pass filtering is performed over
the temporal dimension of $\hat{N}'_{\mathrm{tonal}}(l,z)$ and $\widehat{\mathrm{SNR}}(l,z)$. A cutoff frequency of 3,5 Hz is used.²⁰ Low-pass filters with
the same filter coefficients are used for all critical bands. The filter defined in Formula (11) is used with order
𝑘 = 3. The filter coefficients of the low-pass filter ℎLP (𝑙) can be calculated according to Formulae (14) and
(15).²¹ The filtered signals are then
$$\tilde{N}'_{\mathrm{tonal}}(l,z) = \hat{N}'_{\mathrm{tonal}}(l,z) * h_{\mathrm{LP}}(l) \tag{43}$$
and
$$\widetilde{\mathrm{SNR}}(l,z) = \widehat{\mathrm{SNR}}(l,z) * h_{\mathrm{LP}}(l), \tag{44}$$
where ∗ denotes the convolution. These filtered signals are used for further processing in Formulae (45) and
(47).
Band-dependent noise reduction is achieved by weighting the filtered specific loudness $\tilde{N}'_{\mathrm{tonal}}(l,z)$ of the tonal
component by a sigmoid function
$$\mathrm{nr}(l,z) = \begin{cases} 1 - \mathrm{e}^{-\alpha \cdot \left( \frac{\widetilde{\mathrm{SNR}}(l,z)}{g(z)} - \beta \right)}, & \mathrm{e}^{-\alpha \cdot \left( \frac{\widetilde{\mathrm{SNR}}(l,z)}{g(z)} - \beta \right)} < 1 \\ 0, & \mathrm{e}^{-\alpha \cdot \left( \frac{\widetilde{\mathrm{SNR}}(l,z)}{g(z)} - \beta \right)} \ge 1 \end{cases} \tag{45}$$
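The signal-to-noise estimate of Formula (42) and the sigmoid weighting of Formula (45) can be sketched as below. The function names are hypothetical, the low-pass filtering of Formulae (43) and (44) is left out, and the values of 𝛼 and 𝛽 are not given in this part of the text, so they are plain parameters; the numbers in the usage lines are assumed example values only.

```python
import math

EPS = 1e-12  # additive constant epsilon used throughout the document


def snr_estimate(n_tonal, n_signal):
    """Formula (42): tonal loudness relative to the remaining loudness."""
    return n_tonal / (n_signal - n_tonal + EPS)


def noise_reduction_weight(snr_filtered, g, alpha, beta):
    """Formula (45): sigmoid weight nr(l, z) between 0 and 1."""
    exponent = -alpha * (snr_filtered / g - beta)
    # exp(exponent) >= 1 is equivalent to exponent >= 0; testing the exponent
    # directly avoids overflow for strongly negative SNR arguments.
    if exponent >= 0.0:
        return 0.0
    return 1.0 - math.exp(exponent)


# Assumed example parameters, for illustration only.
weak = noise_reduction_weight(0.0, g=1.0, alpha=20.0, beta=0.07)
strong = noise_reduction_weight(1.0, g=1.0, alpha=20.0, beta=0.07)
```

With these example parameters, a band without tonal content is suppressed completely, while a band whose scaled signal-to-noise ratio is well above 𝛽 passes almost unchanged.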
20 Please note that the bandwidth of the low-pass filter is twice as large as the cut-off frequency! Therefore, the variable 𝑑 in
Formulae (14) and (15) should be calculated using $\tau(z) = \frac{1}{32} \cdot 6 \cdot \frac{1}{7\ \mathrm{Hz}} = 0{,}0268\ \mathrm{s}$ for all critical bands according to
Formula (8).
21 For Formula (15) the following factors $e_i$ have to be used for a filter order of 𝑘 = 3: $e_0 = 0$, $e_1 = 1$, $e_2 = 1$.
Sigmoidal weighting significantly reduces wrongly-detected specific loudness of tonal components for
broadband signals. The frequency dependent factor 𝑔(𝑧) is calculated as
$$g(z) = \frac{c(s_\mathrm{b}(z))}{F(z)^{d(s_\mathrm{b}(z))}}, \tag{46}$$
where the parameters 𝑐 and 𝑑 are given in Table 8 depending on the block size 𝑠b (𝑧) (see Table 4). This
function mitigates frequency-dependent overestimations of the tonality estimation (due to the different block
sizes) such that SNR(𝑙, 𝑧)/𝑔(𝑧) is approximately constant over 𝑧 for pink noise signals.
Table 8 — Parameters for the frequency dependent factor 𝒈(𝒛) (Formula (46))
𝑠b (𝑧) 8192 4096 2048 1024
𝑐(𝑠b (𝑧)) 18,21 12,14 417,54 962,68
𝑑(𝑠b (𝑧)) 0,36 0,36 0,71 0,69
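Formula (46) with the parameters of Table 8 amounts to a small lookup, sketched below. The names are hypothetical, and the sketch assumes that the centre frequency 𝐹(𝑧) enters as a plain number in hertz.

```python
# Table 8: block size -> (c, d)
G_PARAMS = {
    8192: (18.21, 0.36),
    4096: (12.14, 0.36),
    2048: (417.54, 0.71),
    1024: (962.68, 0.69),
}


def g_factor(centre_frequency_hz, block_size):
    """Formula (46): g(z) = c / F(z)**d for the block size used in band z."""
    c, d = G_PARAMS[block_size]
    return c / centre_frequency_hz ** d


g_1khz = g_factor(1000.0, 1024)  # a band around 1 kHz uses the block size 1024
```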
The specific loudness of the tonal component, $N'_{\mathrm{tonal}}(l,z)$, is then modelled as
$$N'_{\mathrm{tonal}}(l,z) = \mathrm{nr}(l,z) \cdot \tilde{N}'_{\mathrm{tonal}}(l,z). \tag{47}$$
The perceived tonality is not only dependent on the tonal content in each band, but also on the signal-to-noise
ratio over all bands at each time instance 𝑙. Thus, to finally model the tonality of the signal, the overall loudness
signal-to-noise ratio is evaluated across all bands. First, a new estimation of the specific loudness of the noise
component is calculated, using the final estimation of the specific loudness of the tonal component:
$$N'_{\mathrm{noise}}(l,z) = N'_{\mathrm{signal}}(l,z) * h_{\mathrm{LP}}(l) - N'_{\mathrm{tonal}}(l,z). \tag{48}$$
A scaling factor
where 𝑐T = 2,8785151 is a calibration factor. The time index 𝑙 can be mapped to the time 𝑡 in seconds as:
$$t = \frac{l}{r_{\mathrm{sd}}} = \frac{l}{187{,}5}\ \mathrm{s}. \tag{52}$$
The unit of the tonality calculated by the psychoacoustic tonality method is tuHMS (tonality units “according to
the Hearing Model of Sottek” described in Clause 5). The psychoacoustic tonality method is
calibrated using a 1 kHz tone with a sound pressure level of 40 dB. For this signal, the tonality value shall be
1 tuHMS²².
The specific tonality 𝑇′(𝑧) is taken by averaging the time-dependent specific tonality 𝑇′(𝑙, 𝑧). The averaging is
performed as follows:
1. The first tonality values 𝑇′(𝑙, 𝑧) for 0 ≤ 𝑙 ≤ 56 (approximately corresponding to the first 300 ms of the
input signal) are discarded due to the transient responses of the digital filters.
2. Only values that exceed a specific tonality value of 0,02 tuHMS are used for averaging. This step
ensures that the single value is independent of parts of the signal without noticeable tonal
components.
$$T'(z) = \frac{1}{\#(l'(z)) + \varepsilon} \sum_{l'} T'(l'(z), z), \tag{53}$$
with
using set notation²³. The frequencies 𝑓ton,z (𝑧) are calculated by accordingly averaging the frequency 𝑓ton (𝑙, 𝑧)
(see Formula (39)²⁴) over corresponding time indices:
$$f_{\mathrm{ton},z}(z) = \frac{1}{\#(l'(z)) + \varepsilon} \sum_{l'} f_{\mathrm{ton}}(l'(z), z). \tag{55}$$
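The averaging of Formulae (53) and (55) can be sketched for one critical band as follows. The function name is hypothetical, and the selection of the time indices 𝑙′ is inferred from steps 1 and 2 above (indices above 56 whose specific tonality exceeds 0,02 tuHMS), because the set definition itself is not reproduced in this part of the text.

```python
EPS = 1e-12  # additive constant epsilon used throughout the document


def average_band(t_spec, f_ton, l_first=57, threshold=0.02):
    """Return (T'(z), f_ton,z(z)) for one band from its time series."""
    # Assumed index set l': skip the first 57 values, keep those above threshold.
    selected = [l for l, t in enumerate(t_spec) if l >= l_first and t > threshold]
    count = len(selected)
    t_avg = sum(t_spec[l] for l in selected) / (count + EPS)
    f_avg = sum(f_ton[l] for l in selected) / (count + EPS)
    return t_avg, f_avg


# 57 transient values, then one value below and two values above the threshold.
tonality = [1.0] * 57 + [0.01, 0.5, 0.3]
frequency = [0.0] * 57 + [100.0, 1000.0, 1010.0]
t_mean, f_mean = average_band(tonality, frequency)
```

Because of the additive constant 𝜀 in the denominator, a band without any value above the threshold yields 0 instead of a division by zero.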
The time-dependent tonality 𝑇(𝑙) is taken as the maximum of the time-dependent specific tonalities 𝑇 ′ (𝑙, 𝑧) over
all bands 𝑧. If the user is only interested in one specific tonal event, a user defined frequency range [𝑓L , 𝑓H ] can
be specified. In this case, only critical bands with the critical band number 𝑧 are considered that fulfill the
following requirements:
22 The calibration factor $c_\mathrm{T}$ can be adjusted within a tolerance of 0,25 % to account for the effects of different
implementations.
23 In set notation, {𝑥 | Φ(𝑥)} denotes all elements 𝑥 with the property Φ(𝑥). #(𝐴) denotes the cardinality (i.e. the number
of elements) of a set 𝐴.
24 Note that $f_{\mathrm{ton}}(l,z)$ is denoted $f_{\mathrm{ton}}(z)$ in Formula (39), since the time index 𝑙 was neglected in this computation step.
and
leading to a range of critical bands between 𝑧L and 𝑧H . With this calculation procedure, the actually considered
frequency range is [𝑓L′ , 𝑓H′ ] with
and
$$R(z) = \left[ F(z) - \frac{\Delta f(z)}{2},\ F(z) + \frac{\Delta f(z)}{2} \right]. \tag{60}$$
All frequency bands between 𝑧L and 𝑧H are used for the maximum search:
where 𝑧max (𝑙) is the band in which the maximum of the time-dependent specific tonality 𝑇 ′ (𝑙, 𝑧) was found for a
given time instance 𝑙.
The single value 𝑇 of the tonality of the signal is taken by averaging the time-dependent overall tonality 𝑇(𝑙).
The averaging is performed in the same way as described in Formula (53)
$$T = \frac{1}{\#(l')} \sum_{l'} T(l'), \tag{63}$$
with
For stationary sounds, a tonal component in the critical band 𝑧tonal is identified as prominent, if the specific
tonality 𝑇′(𝑧tonal ) exceeds a value of 0,4 tuHMS and the specific tonality has a local maximum in 𝑧tonal .
Additionally, the frequency 𝑓ton,𝑧 (𝑧tonal ) needs to be in the range [𝐹(𝑧tonal − 1), 𝐹(𝑧tonal + 1)] for the component
to be identified as prominent. If the user is only interested in one specific tonal event, a user defined frequency
range [𝑓L , 𝑓H ] can be specified. Then, only tonalities that are in the frequency range [𝑓L′ , 𝑓H′ ]²⁵ are considered:
For each tonal component that has been identified as prominent according to this standard, the following
information shall be recorded:
a) if a frequency range was defined, the resulting frequency range [𝑓L′ , 𝑓H′ ] for searching prominent tonalities
(Formulae (56) and (57));
b) the frequency, 𝑓ton,z (𝑧tonal ) , in hertz, of the tonality in the corresponding critical band 𝑧tonal (see
Formula (55));
c) details of the method used to evaluate the tonality (ECMA 418 – Part 2: Psychoacoustic metrics based on
the hearing model – Clause 6.2 Psychoacoustic tonality calculation method), together with a reference to this
Standard;
For non-stationary sounds, a signal is considered to contain prominent tonalities, if the time-independent single
value 𝑇 of the time-dependent tonality 𝑇(𝑙)²⁶ exceeds a value of 0,4 tuHMS (see Formula (63)). If the signal has
been identified to contain prominent tonalities according to this clause, the following information shall be
recorded:
a) if a frequency range was defined, the resulting frequency range [𝑓L′ , 𝑓H′ ] for searching prominent tonalities
(Formulae (56) and (57));
b) the time-dependent frequency, 𝑓ton,l (𝑙), in hertz (see Formula (62)) of the time-dependent tonality 𝑇(𝑙);
c) details of the method used to evaluate the tonality (ECMA 418 – Part 2: Psychoacoustic metrics based on
the hearing model – Clause 6.2 Psychoacoustic tonality calculation method), together with a reference to this
Standard;
NOTE The criterion for prominence of tonalities for the psychoacoustic tonality calculation method (Clause 6.2) is a
frequency-independent value of 0,4 tuHMS (HMS stands for tonality units “according to the Hearing Model of Sottek” described in
Clause 5).
26 The time index 𝑙 can be mapped to a time in seconds according to Formula (52).
The auditory sensation roughness describes, together with the auditory sensation fluctuation strength, the
perception of temporal variations of sounds. While fluctuation strength covers slow variations (typically below
20 Hz), roughness is produced by faster variations up to around 500 Hz. The maximum of the auditory sensation
is located at around 4 Hz modulation rate for fluctuation strength and 70 Hz modulation rate for roughness. Both
auditory sensations can be produced either by amplitude modulation or by frequency modulation. Generally,
periodic modulations produce higher values of fluctuation strength and roughness than stochastic variations.
Roughness is used for the subjective evaluation of sound characteristics as well as for sound design. With
increasing roughness, sounds attract more attention and are perceived as increasingly aggressive
and annoying, without showing a difference in loudness or A-weighted sound pressure level.
The impression of roughness arises if a time-variant envelope is present in one critical band, for example tones
with a temporal structure because of a change in amplitude or frequency. If these variations are rather slow (for
example lower than 10 Hz), the auditory system is capable of following the changes and a perception of fluctuation
arises. With increasing modulation rates, sensations like R-roughness (around 20 Hz) arise and turn into actual
roughness, where the auditory system is not capable of resolving the temporal variations. Variations of the
envelope with modulation rates between 20 Hz and 300 Hz are perceived as “rough”. Roughness depends on
the center frequency, the modulation rate 𝑓mod , the degree of modulation 𝑚 and the sound pressure level.
Frequency modulated sounds produce a similar roughness as amplitude modulated sounds. The unit of
roughness is “asper”. As reference signal with 𝑅 = 1 asper, an amplitude modulated sinusoid of 1 kHz center
frequency, 𝑚 = 1, 𝑓mod = 70 Hz and a sound pressure level of 60 dB was chosen.
Roughness originates for example from a multiplicative combination of two vibrations – such as for example the
gear mesh frequency and the rotational speed in a gear wheel – or from superposition of two or more tonal or
narrowband sounds with a similar frequency. In practice, roughness often occurs in rotating components
(engines, gearboxes, fans).
7.1.1 Overview
The psychoacoustic roughness calculation is based on scaled envelope power spectra ΦE,𝑙,𝑧 (𝑘), which are
calculated using the specific basis loudness $N'_{\mathrm{basis}}(l,z)$ (see Formula (25)) and the envelope of the CBF = 53
segmented band-pass signals 𝑝𝑙,𝑧 (𝑛′) (see Clause 5.1.5) as described in Clause 5. For the calculation of these
values, a block size of 𝑠b = 16384 and a hop size of 𝑠h = 4096 for the segmentation in Clause 5.1.5 shall be
used.
Figure 8 — Calculation of roughness based on band pass signals and the specific basis loudness
calculated as described in Clause 5.
The low-frequency envelopes are calculated from the segmented band-pass filtered sound pressure signals
𝑝𝑙,𝑧 (𝑛′) (see Clause 5.1.5) using the Hilbert transform. The envelopes 𝑝E,𝑙,𝑧 (𝑛′) are taken as magnitude of the
analytical signals
with ℋ(∙) denoting the Hilbert transform. Since the envelopes only contain low modulation rates, downsampling
with a factor of 32 is performed. The resulting downsampled envelopes of the band-pass signals are denoted
𝑝E,𝑙,𝑧 (𝑛̃)²⁷. With this step, the sampling rate changes from 𝑟s = 48 kHz to 𝑟̃s = 1500 Hz. The block size 𝑠̃b = 512
and the hop size 𝑠̃h = 128 correspond to the block size 𝑠b = 16384 and the hop size
𝑠h = 4096 used for the segmentation in Clause 5.1.5.
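The envelopes 𝑝E,𝑙,𝑧 (𝑛̃) are windowed with a von-Hann window²⁸, $w_{\mathrm{Hann}}(\tilde{n})$, and a scaled power spectrum²⁹
ΦE,𝑙,𝑧 (𝑘) is generated by using
$$\Phi_{\mathrm{E},l,z}(k) = \begin{cases} 0, & N'_{\max}(l) \cdot \varphi_{\mathrm{E},l,z}(0) = 0 \\ \dfrac{\left( N'_{\mathrm{basis}}(l,z) \right)^{2} \cdot \left| \mathrm{DFT}_{\tilde{s}_\mathrm{b}}\left( p_{\mathrm{E},l,z}(\tilde{n}) \cdot w_{\mathrm{Hann}}(\tilde{n}) \right) \right|^{2}}{N'_{\max}(l) \cdot \varphi_{\mathrm{E},l,z}(0)} \cdot \left( \dfrac{\mathrm{Bark}_{\mathrm{HMS}}}{\mathrm{sone}_{\mathrm{HMS}}} \right), & \text{else} \end{cases} \tag{66}$$
where $\mathrm{DFT}_{\tilde{s}_\mathrm{b}}$ denotes the $\tilde{s}_\mathrm{b}$-point Discrete Fourier Transform³⁰, 𝑘 is the index corresponding to a modulation
frequency of $k \cdot \frac{\tilde{r}_\mathrm{s}}{\tilde{s}_\mathrm{b}}$ Hz, $N'_{\mathrm{basis}}(l,z)$ is the specific basis loudness, $N'_{\max}(l) = \max_z \left( N'_{\mathrm{basis}}(l,z) \right)$ and
$$\varphi_{\mathrm{E},l,z}(0) = \sum_{\tilde{n}=0}^{\tilde{s}_\mathrm{b}-1} \left( p_{\mathrm{E},l,z}(\tilde{n}) \cdot w_{\mathrm{Hann}}(\tilde{n}) \right)^{2}. \tag{67}$$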
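Formulae (66) and (67) can be sketched for a single block and band as follows. This is an illustration under assumptions: it uses NumPy, the function names are hypothetical, the window follows the scaled von-Hann definition of footnote 28, and the unit factor of Formula (66) is dropped because the sketch works with plain numbers.

```python
import numpy as np


def scaled_hann(size=512):
    """Scaled von-Hann window as defined in footnote 28."""
    n = np.arange(size)
    return (0.5 - 0.5 * np.cos(2.0 * np.pi * n / size)) / np.sqrt(0.375)


def scaled_envelope_power_spectrum(envelope, n_basis, n_max):
    """Formula (66) for one block: envelope spectrum scaled by the loudness."""
    envelope = np.asarray(envelope, dtype=float)
    windowed = envelope * scaled_hann(envelope.size)
    phi0 = np.sum(windowed ** 2)                    # Formula (67)
    denominator = n_max * phi0
    if denominator == 0.0:
        return np.zeros(envelope.size)              # first case of Formula (66)
    return n_basis ** 2 * np.abs(np.fft.fft(windowed)) ** 2 / denominator


# Envelope with a 70 Hz modulation at the downsampled rate of 1500 Hz.
n = np.arange(512)
env = 1.0 + 0.5 * np.cos(2.0 * np.pi * 70.0 * n / 1500.0)
spectrum = scaled_envelope_power_spectrum(env, n_basis=2.0, n_max=4.0)
peak_index = int(np.argmax(spectrum[2:256])) + 2
```

In this example the largest value for 𝑘 = 2, …, 255 falls on the index closest to 70 Hz divided by the resolution of about 2,93 Hz.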
This step considers the fact that the sensation of roughness changes nonlinearly with loudness. The results
are scaled envelope power spectra ΦE,𝑙,𝑧 (𝑘) which are used for further analysis of the roughness.
Noise reduction of the envelopes is performed in two steps: First, the scaled power spectra of neighbouring
bands are averaged to reduce noise effects. Averaging is performed over 3 bands. Each band is averaged with
one higher and one lower band. This step results in averaged scaled power spectra $\bar{\Phi}_{\mathrm{E},l,z}(k)$.
is calculated, showing an overview of all the modulation patterns over time. Each band may contain fluctuations
even in the case of unmodulated noise due to the bandpass-filtering, but in this case the correlation between
neighbouring bands is very low, while for modulated noise, the correlation is very high. The summation of the
averaged scaled power spectra amplifies the correlated components (peaks) more strongly than the uncorrelated
ones. As a result, constant and/or time-varying peaks of the modulation spectrum become clearly visible. Now,
the averaged scaled power spectra are weighted with a noise suppression weighting factor 𝑤(𝑙, 𝑘) depending
on 𝑠(𝑙, 𝑘), which is applied to each individual critical band 𝑧, in order to distinguish between peaks related to the
roughness perception and the background noise of the envelope:
$$\hat{\Phi}_{\mathrm{E},l,z}(k) = \bar{\Phi}_{\mathrm{E},l,z}(k) \cdot w(l,k) \tag{69}$$
28 Here, the scaled von-Hann window is defined as $w_{\mathrm{Hann}}(\tilde{n}) = \frac{0{,}5 - 0{,}5 \cos\left( \frac{2 \pi \tilde{n}}{512} \right)}{\sqrt{0{,}375}}$. The scaling factor in the denominator ensures
a correct estimation of the magnitude of the power spectrum.
29 In the original version of the algorithm (22), the spectrum of the autocorrelation function $\varphi_{\mathrm{E},l,z}(m)$ of the envelope 𝑝E,𝑙,𝑧 (𝑛)
was evaluated (𝑚: lag time), corresponding to the power spectrum of the envelope. It should be noted that ΦE,𝑙,𝑧 (𝑘) is not
the Fourier transform of 𝜑E,𝑙,𝑧 (𝑚) since the scaling of ΦE,𝑙,𝑧 (𝑘) is not part of the autocorrelation function 𝜑E,𝑙,𝑧 (𝑚).
30 The DFT of length N is defined as $X(k) = \mathrm{DFT}_N(x(n)) = \sum_{n=0}^{N-1} x(n) \cdot \mathrm{e}^{-j 2 \pi k n / N}$ with 𝑘 = 0,1, … , 𝑁 − 1.
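$$w(l,k) = \begin{cases} \operatorname{clip}\left( \tilde{w}(l,k) - 0{,}1407,\ 0,\ 1 \right), & \tilde{w}(l,k) \ge 0{,}05 \cdot \max\limits_{k=2,\dots,255} \left( \tilde{w}(l,k) \right) \\ 0, & \text{else} \end{cases} \tag{70}$$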
where clip(𝑥, 𝑥min , 𝑥max ) returns clipped values of 𝑥 between 𝑥min and 𝑥max . $\tilde{w}(l,k)$ is calculated as
$$\tilde{w}(l,k) = 0{,}0856 \cdot \frac{s(l,k)}{\tilde{s}(l) + \delta} \cdot \operatorname{clip}\left( 0{,}1891 \cdot \mathrm{e}^{0{,}0120 \cdot k},\ 0,\ 1 \right) \tag{71}$$
with the median 𝑠̃ (𝑙) of 𝑠(𝑙, 𝑘) over 𝑘 = 2, … ,255, and an additional exponential weighting depending on the
modulation rate. The constant $\delta = 10^{-10}$ ensures a defined value of $\tilde{w}(l,k)$ if 𝑠̃ (𝑙) = 0.
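The weighting of Formulae (70) and (71) can be sketched for one block 𝑙 as below. The function names are hypothetical, 𝑠(𝑙, 𝑘) is taken as a given input sequence for 𝑘 = 0, …, 255, and setting the weights outside 𝑘 = 2, …, 255 to zero is an assumption of this sketch.

```python
import math
from statistics import median

DELTA = 1e-10  # constant delta of Formula (71)


def clip(x, x_min, x_max):
    """Limit x to the range [x_min, x_max]."""
    return max(x_min, min(x_max, x))


def noise_suppression_weights(s):
    """Return w(l, k) for one block from s(l, k), k = 0..255."""
    ks = range(2, 256)
    s_median = median(s[k] for k in ks)
    # Formula (71): ratio to the median with exponential rate weighting.
    w_tilde = {
        k: 0.0856 * s[k] / (s_median + DELTA)
        * clip(0.1891 * math.exp(0.0120 * k), 0.0, 1.0)
        for k in ks
    }
    w_max = max(w_tilde.values())
    # Formula (70): offset, clip, and discard values far below the maximum.
    w = [0.0] * len(s)
    for k in ks:
        if w_tilde[k] >= 0.05 * w_max:
            w[k] = clip(w_tilde[k] - 0.1407, 0.0, 1.0)
    return w


flat = noise_suppression_weights([1.0] * 256)              # no modulation peak
peaked = noise_suppression_weights([1.0] * 24 + [100.0] + [1.0] * 231)
```

A flat input, for which every value equals the median, yields zero weights everywhere, whereas a single strong peak receives the weight 1.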
Note that for modulated signals the median value 𝑠̃ (𝑙) is small compared to the peaks, whereas for unmodulated
signals, 𝑠̃ (𝑙) and the random peaks have almost the same magnitude. The ratios 𝑠(𝑙, 𝑘)/𝑠̃ (𝑙) are therefore large
for modulated signals, so that 𝑤(𝑙, 𝑘) tends to 1, whereas for unmodulated signals 𝑤(𝑙, 𝑘) becomes 0. The
parameters in Formulae (70) and (71) were chosen such that for unmodulated white Gaussian noise with a
level of 80 dB all the weighting values 𝑤(𝑙, 𝑘) become 0, consequently leading to a roughness value of 0 asper.
In this step, the amplitudes of the averaged scaled power spectra are weighted according to the perception of
roughness, which depends on the modulation rate. The spectral weighting is divided into four steps: First,
spectral peaks are identified, and the modulation rate of those peaks is estimated with high precision. The
amplitudes of peaks with a high modulation rate are weighted corresponding to the estimated modulation rate
in the second step. Since usually, more than one peak is found, a third step is performed to analyse the relation
of the different peaks. It is assumed that there is one dominant harmonic complex (a fundamental modulation
rate with harmonics at multiples of the fundamental modulation rate) which is the dominant cause for roughness
perception. The fundamental modulation rate of such a harmonic complex is estimated in the third step. In the
fourth step, the amplitudes of peaks with a low modulation rate are weighted corresponding to the estimated
fundamental modulation rate and summed to result in a first, uncalibrated estimation of the specific roughness.
In the peak picking steps, maxima of the averaged scaled power spectra are searched. To obtain a very precise
estimation of the modulation rates corresponding to these maxima, a quadratic fit of the envelope spectrum is
performed. Since the use of the von-Hann window in the calculation of the DFT does not lead to an exact
quadratic shape in the spectrum, an additional refinement step is performed to reduce this bias.
First, local maxima of the averaged scaled power spectra $\hat{\Phi}_{\mathrm{E},l,z}(k)$ for 𝑘 = 2, … ,255 are searched. For each
maximum, a corresponding prominence is calculated as the difference between the amplitude of the maximum
and the surrounding values. To measure the prominence of a peak, a horizontal line is first extended from the
peak to the left and right of the peak. The points where the line intersects the data on the left and right (this is
either another peak or the end of the data) are marked as the outer endpoints of the left and right intervals. Next,
the lowest valley is searched in both intervals. The larger of these two valleys is taken, and the vertical distance
from that valley to the peak is measured. This distance is the prominence. Only the ten maxima with the highest
prominence are considered. The maxima are numbered with 𝑖, where 𝑖 = 1 is the maximum corresponding to
the lowest modulation rate.
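The prominence measure described above can be sketched in plain Python. The function names are hypothetical; the sketch treats the passed sequence as the complete data, uses strict local maxima, and lets the horizontal line pass values equal to the peak, all of which are assumptions where the text gives no detail.

```python
def prominence(x, k):
    """Prominence of the peak at index k of the sequence x."""
    # Walk left until a higher value or the end of the data is reached,
    # remembering the lowest valley on the way.
    left_min = x[k]
    j = k - 1
    while j >= 0 and x[j] <= x[k]:
        left_min = min(left_min, x[j])
        j -= 1
    # Same search to the right.
    right_min = x[k]
    j = k + 1
    while j < len(x) and x[j] <= x[k]:
        right_min = min(right_min, x[j])
        j += 1
    # The larger (higher) of the two valleys is the reference level.
    return x[k] - max(left_min, right_min)


def most_prominent_maxima(x, n_keep=10):
    """Indices of the n_keep most prominent local maxima, in ascending order."""
    peaks = [k for k in range(1, len(x) - 1) if x[k - 1] < x[k] > x[k + 1]]
    ranked = sorted(peaks, key=lambda k: prominence(x, k), reverse=True)
    return sorted(ranked[:n_keep])


data = [0.0, 1.0, 0.0, 3.0, 1.0, 2.0, 0.0]
```

For `data`, the peak at index 5 has a prominence of only 1 although its amplitude is 2, because the valley between it and the higher peak at index 3 lies at 1.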
are considered, where 𝑘p,𝑖 (𝑙, 𝑧) describes the modulation rate index 𝑘 of the 𝑖th maximum.
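Since the modulation rate index 𝑘 only provides a limited resolution of the modulation rate, a refinement step is
performed, which improves the spectral resolution of the estimated modulation rate and the corresponding
$$\hat{\boldsymbol{\Phi}}_{\mathrm{E},l,z} = \mathbf{K} \cdot \mathbf{C} \tag{73}$$
with
$$\mathbf{K} = \begin{pmatrix} \left( k_{\mathrm{p},i}(l,z) - 1 \right)^{2} & k_{\mathrm{p},i}(l,z) - 1 & 1 \\ \left( k_{\mathrm{p},i}(l,z) \right)^{2} & k_{\mathrm{p},i}(l,z) & 1 \\ \left( k_{\mathrm{p},i}(l,z) + 1 \right)^{2} & k_{\mathrm{p},i}(l,z) + 1 & 1 \end{pmatrix}. \tag{75}$$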
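$$\tilde{f}_{\mathrm{p},i}(l,z) = -\frac{c_1}{2 c_0} \cdot \Delta f \tag{76}$$
is calculated with the DFT resolution $\Delta f = \tilde{r}_\mathrm{s} / \tilde{s}_\mathrm{b}$ = 1500 Hz / 512 = 2,9297 Hz. The estimated modulation rate is
refined by applying a bias correction term $\rho(\tilde{f}_{\mathrm{p},i}(l,z))$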
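The quadratic fit of Formulae (73), (75) and (76) can be sketched as follows, before any bias correction. The sketch assumes NumPy, a hypothetical function name, and that the left-hand side of Formula (73) holds the three spectral values at the indices 𝑘p,𝑖 − 1, 𝑘p,𝑖 and 𝑘p,𝑖 + 1, with 𝐂 = (𝑐0, 𝑐1, 𝑐2) as the coefficients of the parabola.

```python
import numpy as np

DELTA_F = 1500.0 / 512  # DFT resolution in Hz


def refine_modulation_rate(spectrum, k_p):
    """Vertex of the parabola through the peak and its two neighbours, in Hz."""
    # Formula (75): one row (k^2, k, 1) per index.
    K = np.array(
        [
            [(k_p - 1) ** 2, k_p - 1, 1.0],
            [k_p ** 2, k_p, 1.0],
            [(k_p + 1) ** 2, k_p + 1, 1.0],
        ],
        dtype=float,
    )
    phi = np.array([spectrum[k_p - 1], spectrum[k_p], spectrum[k_p + 1]], dtype=float)
    # Formula (73) solved for the coefficient vector C.
    c0, c1, _ = np.linalg.solve(K, phi)
    # Formula (76): position of the vertex, scaled to hertz.
    return -c1 / (2.0 * c0) * DELTA_F


# Samples of an exact parabola with its maximum at k = 24.3.
parabola = [10.0 - (k - 24.3) ** 2 for k in range(64)]
estimate = refine_modulation_rate(parabola, 24)
```

For an exactly parabolic peak the vertex is recovered without error; the bias correction term addresses the deviation of the von-Hann window spectrum from this ideal shape.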
The bias comes from approximating the spectrum of the von-Hann window with a quadratic function, when
estimating the true modulation rate from the peaks in the sampled spectrum. The bias adjustment term depends
almost only on the difference between the peak index and the corresponding exact modulation rate. This term
𝐸(𝜃) is calculated for 32 steps, covering a range of ∆𝑓 , using integer steps 𝜃 = 0, … , 32 to indicate the
corresponding sub-interval. A higher resolution of the modulation rate could be achieved by using more sub-
intervals. Another option is the linear interpolation of 𝐸(𝜃) as a function of 𝛽(𝜃), the theoretical error after
applying a correction, and 𝜃corr , the argument leading to the smallest error 𝛽(𝜃), as shown in the following:
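$$\rho\left( \tilde{f}_{\mathrm{p},i}(l,z) \right) = E(\theta_{\mathrm{corr}}) - \left( E(\theta_{\mathrm{corr}}) - E(\theta_{\mathrm{corr}} - 1) \right) \cdot \frac{\beta(\theta_{\mathrm{corr}} - 1)}{\beta(\theta_{\mathrm{corr}}) - \beta(\theta_{\mathrm{corr}} - 1)} \tag{78}$$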
𝜃corr is determined from the set of possible integer 𝜃 values that lie between 0 and 32 (the value of 𝜃 = 33 in
Table 10 is given only to simplify the implementation, to avoid the use of additional conditions in Formula (81)).
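For each possible value of 𝜃, 𝛽(𝜃) is calculated from:
$$\beta(\theta) = \left( \operatorname{floor}\left( \frac{\tilde{f}_{\mathrm{p},i}(l,z)}{\Delta f} \right) + \frac{\theta}{32} \right) \cdot \Delta f - \left( \tilde{f}_{\mathrm{p},i}(l,z) + E(\theta) \right) \tag{79}$$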
∆𝑓 32
where floor(𝑥) gives the greatest integer value smaller than or equal to the number 𝑥. 𝜃min is the 𝜃 value that
produces the smallest beta value magnitude:
Table 10 and Formula (79) are used to calculate the parameters needed to calculate the bias term given in
Formula (78).
where it is assumed that the energy of a peak is mainly distributed over the index of the maximum and the two
neighbouring indices due to the use of the von-Hann window in the DFT calculation.
In a next step, these amplitudes are weighted with a modulation-rate-dependent factor 𝐺𝑙,𝑧,𝑖 (𝑓p,𝑖 (𝑙, 𝑧)) and a
scaling factor 𝑟max (𝑧). This weighting (together with the weighting of low modulation rates described in 7.1.5.4)
accounts for the dependency of the perceived roughness on the modulation rate. The weighting parameters were
obtained by an optimization procedure, fitting the results of the roughness algorithm to the results of listening
tests for sinusoids of different carrier frequencies with different modulation rates from Reference [12]. Those
results are shown in the evaluation of the roughness algorithm in Annex C, Figure C.1 and also in Reference [30].
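with

$$r_\mathrm{max}(z) = \frac{1}{1 + r_1 \left|\log_2\left(\frac{F(z)}{1\ \mathrm{kHz}}\right)\right|^{r_2}} \tag{84}$$

$$G_{l,z,i}\left(f_{\mathrm{p},i}(l,z)\right) = \frac{1}{\left(1 + \left(\left(\frac{f_{\mathrm{p},i}(l,z)}{f_\mathrm{max}(z)} - \frac{f_\mathrm{max}(z)}{f_{\mathrm{p},i}(l,z)}\right) \cdot q_1\right)^2\right)^{q_2(z)}} \tag{85}$$

where

$$f_\mathrm{max}(z) = 72{,}6937 \cdot \left(1 - 1{,}1739 \cdot e^{-5{,}4583 \cdot \frac{F(z)}{1\ \mathrm{kHz}}}\right)\ \mathrm{Hz} \tag{86}$$

is the modulation rate at which the weighting factor reaches the maximum of one. 𝐹(𝑧) is the center frequency
of the auditory filter bank as described in Clause 5. The parameter 𝑞1 = 1,2822 and 𝑞2 (𝑧) is calculated as

$$q_2(z) = \begin{cases} 0{,}2471, & \frac{F(z)}{1\ \mathrm{kHz}} < 2^{-3{,}4253} \\ 0{,}2471 + 0{,}0129 \cdot \left(\log_2\left(\frac{F(z)}{1\ \mathrm{kHz}}\right) + 3{,}4253\right)^2, & \frac{F(z)}{1\ \mathrm{kHz}} \geq 2^{-3{,}4253} \end{cases} \tag{87}$$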
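The modulation-rate weighting of Formulae (85) to (87) can be sketched as follows. The constants are those given in the text; the function names are hypothetical, and the scaling factor of Formula (84) is left out because the values of 𝑟1 and 𝑟2 are not part of this excerpt.

```python
import math


def f_max(center_hz):
    """Modulation rate in Hz at which the weighting peaks, Formula (86)."""
    return 72.6937 * (1.0 - 1.1739 * math.exp(-5.4583 * center_hz / 1000.0))


def q2(center_hz):
    """Band-dependent exponent of Formula (87)."""
    x = math.log2(center_hz / 1000.0)
    if x < -3.4253:  # same condition as F(z) / 1 kHz < 2^-3.4253
        return 0.2471
    return 0.2471 + 0.0129 * (x + 3.4253) ** 2


def weight_g(f_p, center_hz, q1=1.2822, q2_value=None):
    """Weighting factor G of Formula (85) for modulation rate f_p in Hz."""
    fm = f_max(center_hz)
    if q2_value is None:
        q2_value = q2(center_hz)
    detune = (f_p / fm - fm / f_p) * q1
    return 1.0 / (1.0 + detune ** 2) ** q2_value


peak_rate = f_max(1000.0)
print(peak_rate, weight_g(peak_rate, 1000.0), weight_g(20.0, 1000.0))
```

For a 1 kHz band the weighting equals one at roughly 72 Hz and falls off on both sides, which matches the band-pass character of roughness perception described in the text.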
In this step, the maxima of the averaged scaled power spectra found in 7.1.5.1 are analysed further.
It is assumed that there is one dominant harmonic complex (a fundamental modulation rate with harmonics at
multiples of the fundamental modulation rate) which is the main cause of the roughness perception. The
fundamental modulation rate of such a harmonic complex is estimated in this step.
For each block 𝑙 and band 𝑧 , the fundamental modulation rate of the envelope is estimated in the next
processing step considering the modulation rate 𝑓p,𝑖 (𝑙, 𝑧) and the amplitude 𝐴̃𝑖 (𝑙, 𝑧) of the block. Since the
dependencies on 𝑙 and 𝑧 are not relevant for this processing step, the variables are written in the following as
functions of the index 𝑖 of the corresponding maximum only, i.e. 𝑓p (𝑖) and 𝐴̃(𝑖), to simplify the notation.
For each maximum with index 𝑖, it is tested whether the corresponding modulation rate 𝑓p (𝑖) is the best estimate
for the fundamental modulation rate of the envelope, by assuming that the sum over the harmonic complex
corresponding to the best estimate will result in the highest value. The exact procedure is described in the
following, where 𝑖0 describes the index of the currently tested maximum.
First, integer ratios of the modulation rates 𝑓p (𝑖) of all found maxima to the modulation rate 𝑓p (𝑖0 ) are calculated

$$R_{i_0}(i) = \mathrm{round}\left(\frac{f_\mathrm{p}(i)}{f_\mathrm{p}(i_0)}\right), \tag{88}$$
by rounding to the nearest integer. If several 𝑖 result in the same integer ratio 𝑅𝑖0 (𝑖), it needs to be decided
which of the maxima is used further. In this case, the maximum with the index

$$i = \underset{i}{\mathrm{argmin}} \left|\frac{f_\mathrm{p}(i)}{R_{i_0}(i) \cdot f_\mathrm{p}(i_0)} - 1\right| \tag{89}$$
is used, while the other maxima are discarded. From all remaining maxima, a set 𝐼𝑖0 of indices of all maxima,
which belong to a harmonic complex with fundamental modulation rate 𝑓p (𝑖0 ) is defined (using a tolerance of
4 %):

$$I_{i_0} = \left\{ i \;\middle|\; \left|\frac{f_\mathrm{p}(i)}{R_{i_0}(i) \cdot f_\mathrm{p}(i_0)} - 1\right| < 0{,}04 \right\}. \tag{90}$$
The index 𝑖0 leading to the highest energy is denoted 𝑖max in the following, the corresponding set of indices 𝐼𝑖0
is denoted 𝐼max . The fundamental modulation rate of the envelope is 𝑓p (𝑖max ).
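The search for the harmonic complex in Formulae (88) to (90) can be sketched as below. Three points are assumptions rather than statements of the standard: ties in the rounding are resolved upwards, maxima whose ratio rounds to zero are skipped (they would otherwise divide by zero in Formula (89)), and the "energy" of a candidate is taken as the plain sum of the amplitudes in its set. The function names are hypothetical.

```python
import math


def harmonic_set(rates, i0, tol=0.04):
    """Indices of maxima forming a harmonic complex on rates[i0]."""
    best = {}  # integer ratio -> (relative deviation, index)
    for i, rate in enumerate(rates):
        ratio = math.floor(rate / rates[i0] + 0.5)  # Formula (88)
        if ratio < 1:
            continue  # assumption: sub-harmonic candidates are ignored
        deviation = abs(rate / (ratio * rates[i0]) - 1.0)
        # Formula (89): keep only the closest maximum per integer ratio.
        if ratio not in best or deviation < best[ratio][0]:
            best[ratio] = (deviation, i)
    # Formula (90): apply the 4 % tolerance.
    return sorted(i for deviation, i in best.values() if deviation < tol)


def fundamental(rates, amplitudes):
    """Return (i_max, I_max): the candidate whose complex has the largest sum."""
    def energy(i0):
        return sum(amplitudes[i] for i in harmonic_set(rates, i0))
    i_max = max(range(len(rates)), key=energy)
    return i_max, harmonic_set(rates, i_max)


rates = [50.0, 100.5, 149.0, 77.0]   # modulation rates of the maxima in Hz
amplitudes = [1.0, 0.5, 0.3, 0.4]
print(fundamental(rates, amplitudes))  # (0, [0, 1, 2])
```

In this example the 77 Hz maximum competes with 100.5 Hz for the ratio 2 relative to 50 Hz and loses, so it is not part of the selected complex.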
In the following, only peaks corresponding to the indices in 𝐼max are considered as part of the envelope. The
amplitudes of these peaks are weighted depending on the distance between the center of gravity of these peaks
and the modulation rate of the peak with the highest amplitude:
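$$w_\mathrm{peak} = 1 + 0{,}1 \cdot \left| \frac{\sum_{i \in I_\mathrm{max}} \left(\frac{f_\mathrm{p}(i)}{\mathrm{Hz}} \cdot \tilde{A}(i)\right)}{\sum_{i \in I_\mathrm{max}} \tilde{A}(i)} - \frac{f_\mathrm{p}(i_\mathrm{peak})}{\mathrm{Hz}} \right|^{0{,}749} \tag{93}$$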
and
In this next step, another weighting based on the fundamental modulation rate and a summation of amplitudes
is performed. The block index 𝑙 and the band index 𝑧 are reintroduced for this step. Thus, the weighted
amplitudes are denoted 𝐴̂𝑖 (𝑙, 𝑧), the corresponding fundamental modulation rates 𝑓p,𝑖max (𝑙, 𝑧) and the set of
relevant maxima 𝐼max (𝑙, 𝑧).
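$$A(l,z) = \begin{cases} \sum_{i \in I_\mathrm{max}(l,z)} G_{l,z,i}\left(f_{\mathrm{p},i_\mathrm{max}}(l,z)\right) \cdot \hat{A}_i(l,z), & f_{\mathrm{p},i_\mathrm{max}}(l,z) < f_\mathrm{max}(z) \\ \sum_{i \in I_\mathrm{max}(l,z)} \hat{A}_i(l,z), & f_{\mathrm{p},i_\mathrm{max}}(l,z) \geq f_\mathrm{max}(z) \end{cases} \tag{95}$$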
where 𝐺𝑙,𝑧,𝑖 (𝑓p,𝑖max (𝑙, 𝑧)) is calculated as described in Formula (85) but with parameters
𝑞1 = 0,7066 and

$$q_2(z) = 1{,}0967 - 0{,}0640 \cdot \log_2\left(\frac{F(z)}{1\ \mathrm{kHz}}\right). \tag{96}$$
The parameter 𝑓max (𝑧) in Formula (95) is calculated according to Formula (86).
Values of 𝐴(𝑙, 𝑧) that fall below a threshold of 0,074376 are set to zero.
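The summation of Formula (95), with the parameters of Formula (96) and the threshold stated above, can be sketched as follows. The function names are hypothetical, and the already weighted amplitudes 𝐴̂𝑖 are taken as given inputs.

```python
import math

THRESHOLD = 0.074376


def f_max(center_hz):
    """Formula (86): modulation rate in Hz with maximum weighting."""
    return 72.6937 * (1.0 - 1.1739 * math.exp(-5.4583 * center_hz / 1000.0))


def g_fundamental(f_fund, center_hz):
    """Formula (85) evaluated with the parameters of Formula (96)."""
    q1 = 0.7066
    q2 = 1.0967 - 0.0640 * math.log2(center_hz / 1000.0)
    fm = f_max(center_hz)
    return 1.0 / (1.0 + ((f_fund / fm - fm / f_fund) * q1) ** 2) ** q2


def summed_amplitude(weighted_amps, f_fund, center_hz):
    """A(l, z) of Formula (95), with small values set to zero."""
    if f_fund < f_max(center_hz):
        # Low fundamental rates: each amplitude is attenuated by G.
        g = g_fundamental(f_fund, center_hz)
        total = sum(g * a for a in weighted_amps)
    else:
        total = sum(weighted_amps)
    return total if total >= THRESHOLD else 0.0


print(summed_amplitude([1.0, 2.0], 20.0, 1000.0))   # attenuated sum
print(summed_amplitude([0.1, 0.2], 20.0, 1000.0))   # below threshold -> 0.0
print(summed_amplitude([0.1, 0.2], 200.0, 1000.0))  # plain sum
```

Because G depends only on the fundamental modulation rate, it is the same factor for every peak of the complex in a given block and band.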
In an optional processing step, 𝐴(𝑙, 𝑧) is weighted depending on the randomness (measured using the entropy)
of the estimated modulation rates. This method has been shown to improve the estimation of the roughness [31][32].
First, the rotational speed signal is segmented in the same way as the sound pressure signal (see Clause 5.1.5,
with 𝑠b and 𝑠h as given in Clause 7.1.1). The result is a segmented rotational speed signal 𝑑S (𝑛′, 𝑙). In each time
block 𝑙, the median of 𝑑S (𝑛′, 𝑙) over 𝑛′ is calculated. The result 𝑑̃S (𝑙) is an estimation of one rotational speed
value for each block. This estimation is transformed to an estimation of the frequency of the rotational speed in
Hertz:

$$f_\mathrm{D}(l) = \frac{\tilde{d}_\mathrm{S}(l)}{60\ \frac{\mathrm{R}}{\mathrm{min}}}\ \mathrm{Hz} \tag{97}$$
Now the maxima of the modulation rate, which were found in Clause 7.1.5.1, are used to calculate a weighting factor
based on the entropy of these maxima. First, a set
is defined. This set contains all indices of maxima, which were not identified as corresponding to the harmonic
complex of the estimated fundamental frequency in Clause 7.1.5.3, and the index corresponding to the
fundamental frequency (but not the ones of the harmonics). For all 𝑖 ∈ 𝐼f (𝑙, 𝑧) an estimation of the order is
calculated as the ratio between the frequency of the maximum 𝑓p,𝑖 (𝑙, 𝑧) (see Clause 7.1.5.1) and the frequency
of the rotational speed:

$$o_i(l,z) = \begin{cases} 0, & f_\mathrm{D}(l) = 0 \\ \frac{f_{\mathrm{p},i}(l,z)}{f_\mathrm{D}(l)}, & \text{else} \end{cases}. \tag{99}$$
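The conversion of Formula (97) and the order estimate of Formula (99) can be sketched as follows; the function names and the sample values are hypothetical.

```python
import statistics


def rotation_frequency(rpm_block):
    """Formula (97): median rotational speed of a block (in R/min) as Hz."""
    return statistics.median(rpm_block) / 60.0


def estimate_orders(peak_rates, f_d):
    """Formula (99): order of each maximum; zero if the speed is zero."""
    if f_d == 0:
        return [0.0 for _ in peak_rates]
    return [rate / f_d for rate in peak_rates]


f_d = rotation_frequency([2990.0, 3000.0, 3010.0])  # 50.0 Hz
print(estimate_orders([25.0, 50.0, 100.0], f_d))    # [0.5, 1.0, 2.0]
```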
Now a histogram of all estimated orders is calculated for each time index 𝑙 and frequency band 𝑧 from all
maxima of the current time block and the three preceding and subsequent blocks³¹. In these histograms, 160
classes of constant width are used between the values 0,0625 and 20,625. The result is the histogram 𝐻(𝑏, 𝑙, 𝑧),
where 𝑏 is the class number and 𝐻(𝑏, 𝑙, 𝑧) contains the number of elements in the respective class. For
calculation of the entropy, probabilities of occurrence

$$P(b,l,z) = \begin{cases} 0, & \sum_b H(b,l,z) = 0 \\ \frac{H(b,l,z)}{\sum_b H(b,l,z)}, & \text{else} \end{cases} \tag{100}$$
are calculated from the histogram for all classes. From this probability, the Shannon entropy

$$E(l,z) = \begin{cases} 0, & P(b,l,z) = 0 \\ -\sum_b \left(P(b,l,z) \cdot \log_2 P(b,l,z)\right), & \text{else} \end{cases}, \tag{101}$$

is calculated³². Finally, 𝐴(𝑙, 𝑧) is weighted with the entropy, if 𝐸(𝑙, 𝑧) > 1:

$$A_\mathrm{E}(l,z) = \frac{A(l,z)}{\max(E(l,z); 1)}. \tag{102}$$

32 In the case of a probability of zero, a result of $0 \cdot \log_2 0 = 0$ is used according to the limit $\lim_{x \to 0} (x \log_2 x) = 0$.
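The entropy weighting of Formulae (100) to (102) can be sketched as below. The handling of the class edges (lower edge inclusive, values outside the range dropped) is an assumption, since the text only fixes the number of classes and the range; the function names are hypothetical.

```python
import math


def order_histogram(orders, n_classes=160, low=0.0625, high=20.625):
    """Histogram H(b) with classes of constant width between low and high."""
    width = (high - low) / n_classes
    hist = [0] * n_classes
    for order in orders:
        if low <= order < high:  # assumption: out-of-range orders are dropped
            index = min(int((order - low) / width), n_classes - 1)
            hist[index] += 1
    return hist


def shannon_entropy(hist):
    """Formulae (100) and (101); empty classes contribute 0 * log2(0) = 0."""
    total = sum(hist)
    if total == 0:
        return 0.0
    return -sum((h / total) * math.log2(h / total) for h in hist if h > 0)


def entropy_weighted(amplitude, entropy):
    """Formula (102): attenuate A(l, z) only if the entropy exceeds one."""
    return amplitude / max(entropy, 1.0)


hist = order_histogram([1.0, 2.0, 3.0, 4.0])
entropy = shannon_entropy(hist)
print(entropy, entropy_weighted(0.8, entropy))  # 2.0 0.4
```

Orders that cluster in a few classes (stable, engine-related modulation) give a low entropy and leave 𝐴(𝑙, 𝑧) unchanged, while widely scattered orders reduce it.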
𝐴(𝑙, 𝑧) is interpolated to a sampling rate of 𝑟s50 = 50 Hz using a piecewise cubic Hermitian function (temporal
resolution of 20 ms). The new time index is designated 𝑙50 . Subsequently, negative values resulting from the
interpolation are set to zero, resulting in a first, uncalibrated estimate of the specific roughness $R'_\mathrm{est}(l_{50}, z)$.
Here, the results belonging to the zero-padding done at the start of the processing need to be removed. Thus
the last evaluated block shall be:

$$l_{50,\mathrm{end}} = \mathrm{ceil}\left(\frac{n_\mathrm{samples}}{r_\mathrm{s}} \cdot r_\mathrm{s50}\right). \tag{103}$$
The next step in calculating the specific roughness is a nonlinear transform, depending on the distribution of
$R'_\mathrm{est}(l_{50}, z)$ over the critical bands 𝑧. This step is necessary to take into account that the roughness perception
differs for broad-band signals (i.e., signals with a broader distribution of $R'_\mathrm{est}(l_{50}, z)$ over the critical bands)
compared to narrow-band signals such as modulated sinusoids (i.e., signals with a narrow distribution of
$R'_\mathrm{est}(l_{50}, z)$ over the critical bands). With this step it is possible to model the roughness for very different kinds of
synthetic and technical sounds as described in Reference [30].
Together with the nonlinear transform, a calibration is performed, which ensures that the calibration signal
(amplitude modulated sinusoid, 60 dB SPL, 1 kHz carrier frequency, 70 Hz modulation rate) results in a
roughness of 1 asper³³.

$$\hat{R}'(l_{50}, z) = c_\mathrm{R} \cdot \left(R'_\mathrm{est}(l_{50}, z)\right)^{E(l_{50})} \tag{104}$$

with the calibration factor $c_\mathrm{R} = 0{,}0180909\ \frac{\mathrm{asper}}{\mathrm{Bark}_\mathrm{HMS}}$,
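and

$$B(l_{50}) = \begin{cases} \frac{\tilde{R}'_\mathrm{est}(l_{50})}{\bar{R}'_\mathrm{est}(l_{50})}, & \bar{R}'_\mathrm{est}(l_{50}) \neq 0 \\ 0, & \bar{R}'_\mathrm{est}(l_{50}) = 0 \end{cases} \tag{106}$$

$$\tilde{R}'_\mathrm{est}(l_{50}) = \sqrt{\frac{\sum_z \left(R'_\mathrm{est}(l_{50}, z)\right)^2}{\mathrm{CBF}}}, \tag{107}$$

and

$$\bar{R}'_\mathrm{est}(l_{50}) = \frac{\sum_z R'_\mathrm{est}(l_{50}, z)}{\mathrm{CBF}} \tag{108}$$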
where CBF = 53 is the number of critical bands. The resulting estimate of the time-dependent specific
roughness, 𝑅̂ ′(𝑙50 , 𝑧), is smoothed by using a lowpass filter of order one with different time constants for rising
and falling slopes. This filtering accounts for the fact that the perception of sound events rises quickly with the
33 The calibration factor $c_\mathrm{R}$ can be adjusted within a tolerance of 0,25 % to account for the effects of different
implementations.
with the different time constants for rising and falling slopes
resulting in the final estimate of the time-dependent specific roughness 𝑅′(𝑙50 , 𝑧).
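The band-distribution measure of Formulae (106) to (108) and the transform of Formula (104) can be sketched as follows. The exponent 𝐸(𝑙50) is treated as an input because its defining formula is not part of this excerpt; the function names are hypothetical.

```python
import math

CBF = 53                 # number of critical bands
C_R = 0.0180909          # calibration factor in asper / Bark_HMS


def spread_ratio(r_est):
    """B(l50) of Formula (106): RMS over bands divided by mean over bands."""
    rms = math.sqrt(sum(r * r for r in r_est) / CBF)  # Formula (107)
    mean = sum(r_est) / CBF                           # Formula (108)
    return rms / mean if mean != 0 else 0.0


def transform(r_est_value, exponent, c_r=C_R):
    """Formula (104): calibrated nonlinear transform of one band value."""
    return c_r * r_est_value ** exponent


broadband = [2.0] * CBF                # same value in every band
narrowband = [0.0] * (CBF - 1) + [1.0]  # one active band only
print(spread_ratio(broadband))    # 1.0
print(spread_ratio(narrowband))   # sqrt(53), about 7.28
```

The two extreme cases show the range of 𝐵(𝑙50): it is one when all bands contribute equally and grows to √CBF when a single band carries the whole estimate.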
The specific roughness 𝑅′(𝑧) is taken by averaging the time-dependent specific roughness 𝑅′(𝑙50 , 𝑧). For the
averaging, the first roughness values 𝑅′(𝑙50 , 𝑧) for 0 ≤ 𝑙50 ≤ 15 (approximately corresponding to the first 300 ms
of the input signal) are discarded due to the transient responses of the digital filters.
The time-dependent roughness 𝑅(𝑙50 ) is the integral of 𝑅′(𝑙50 , 𝑧) over 𝑧, approximated by summing over all
bands 𝑧 while considering the overlap ∆𝑧:
The single value 𝑅 is calculated by taking the 90th percentile of the time-dependent roughness 𝑅(𝑙50 ), discarding
again the first roughness values 𝑅(𝑙50 ) for 0 ≤ 𝑙50 ≤ 15.
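The aggregation over bands and time can be sketched as below. Two points are assumptions: the band overlap is taken as ∆𝑧 = 0,5 (the value stated for the loudness in this document), and the 90th percentile is computed with linear interpolation between order statistics, since the text does not prescribe a percentile method. The function names are hypothetical.

```python
def time_dependent_roughness(specific, delta_z=0.5):
    """Sum R'(l50, z) over all bands z for every time index l50."""
    return [sum(bands) * delta_z for bands in specific]


def percentile(values, p):
    """p-th percentile with linear interpolation (assumed method)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def single_value(roughness_over_time, skip=16):
    """90th percentile of R(l50), discarding the indices 0..15."""
    return percentile(roughness_over_time[skip:], 90.0)


# 16 transient values (discarded) followed by the values 0, 1, ..., 10.
r_t = [99.0] * 16 + [float(v) for v in range(11)]
print(single_value(r_t))  # 9.0
```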
For binaural signals, monaural time-dependent specific roughness values 𝑅L′ (𝑙50 , 𝑧) and 𝑅R′ (𝑙50 , 𝑧) of the left and
right channel shall be calculated separately for each channel (assuming diotic signals).
A combined binaural time-dependent specific roughness 𝑅B′ (𝑙50 , 𝑧) is calculated using the quadratic mean:

$$R'_\mathrm{B}(l_{50}, z) = \sqrt{\frac{\left(R'_\mathrm{L}(l_{50}, z)\right)^2 + \left(R'_\mathrm{R}(l_{50}, z)\right)^2}{2}} \tag{112}$$

Formula (112) approximately corresponds to the formula for binaural inhibition from the binaural loudness model
by Moore/Glasberg (ISO 532-2 [7], see also Reference [33]). In the case that the roughness value of one channel
is negligible, Formula (112) results in a roughness that is lower by a factor of √0,5 than that of the diotic presentation.
For binaural signals, the binaural time-dependent specific roughness 𝑅B′ (𝑙50 , 𝑧) shall be used as basis for the
calculation of the specific roughness 𝑅′(𝑧), the time-dependent roughness 𝑅(𝑙50 ) and the single value 𝑅 instead
of 𝑅′(𝑙50 , 𝑧) in Clauses 7.1.8, 7.1.9 and 7.1.10.
A signal is considered to contain prominent roughness, if the time-independent single value 𝑅 of the time-
dependent roughness 𝑅(𝑙50 ) exceeds a value of 0,2 asper. If the signal has been identified to contain prominent
roughness according to this standard, the following information shall be recorded:
a) details of the method used to evaluate the roughness (ECMA 418 – Part 2: Psychoacoustic metrics based
on the hearing model – Clause 7.1 Psychoacoustic roughness calculation method), together with a
reference to this Standard;
The calculation process is simpler compared to the last sections, since most of the calculations were already
described in Clauses 5 and 6. An overview of the determination of the specific loudness is shown in Figure 8.
Figure 8 — Calculation of loudness based on specific tonal and noise loudness (see Clause 6).
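$$N'(l,z) = \left(\left(N'_\mathrm{tonal}(l,z)\right)^{e(z)} + \left(w_\mathrm{n} \cdot N'_\mathrm{noise}(l,z)\right)^{e(z)}\right)^{1/e(z)} \tag{113}$$

Here 𝑤n = 0,5331 and the exponent 𝑒(𝑧) is a function of the maximal specific basis loudness:

$$e(z) = \frac{a}{\max\limits_{z}\left(\frac{N'_\mathrm{tonal}(l,z) + N'_\mathrm{noise}(l,z)}{\mathrm{sone}_\mathrm{HMS} / \mathrm{Bark}_\mathrm{HMS}}\right) + \epsilon} + b \tag{114}$$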
Table 12 – Parameters to define the exponent for the loudness power average (Formula (114))

Parameter   𝑎        𝑏        𝜖
Value       0,2918   0,5459   10⁻¹²
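The combination of tonal and noise loudness in Formulae (113) and (114) can be sketched as follows, using the parameters of Table 12. The sketch assumes that the maximum in Formula (114) is taken over all bands of one time block, so that one exponent applies to the whole block; the function names are hypothetical.

```python
A, B, EPS = 0.2918, 0.5459, 1e-12  # Table 12
W_N = 0.5331                        # noise weighting factor


def exponent(n_tonal, n_noise):
    """Formula (114) for one time block; inputs are lists over the bands."""
    peak = max(t + n for t, n in zip(n_tonal, n_noise))
    return A / (peak + EPS) + B


def combine_loudness(n_tonal, n_noise):
    """Formula (113): power sum of tonal and weighted noise loudness."""
    e = exponent(n_tonal, n_noise)
    return [(t ** e + (W_N * n) ** e) ** (1.0 / e)
            for t, n in zip(n_tonal, n_noise)]


print(combine_loudness([2.0], [0.0]))  # pure tone: stays close to 2.0
print(combine_loudness([1.0], [1.0]))  # tone plus noise: larger than 1.0
```

The small constant 𝜖 only prevents a division by zero for silent blocks; for audible signals it has no practical effect on the exponent.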
The specific loudness 𝑁 ′ (𝑧) is taken by averaging the time-dependent specific loudness 𝑁 ′ (𝑙, 𝑧) . For the
averaging, the first loudness values 𝑁 ′ (𝑙, 𝑧) for 0 ≤ 𝑙 ≤ 56 (approximately corresponding to the first 300 ms of
the input signal) are discarded due to the transient responses of the digital filters.
$$N'(z) = \left(\frac{1}{l_\mathrm{end} - 56} \sum_{l} N'(l,z)^{e}\right)^{1/e}, \tag{115}$$

where $e = \frac{1}{\log_{10}(2)}$ and 57 ≤ 𝑙 ≤ 𝑙end . A power average is used here because it gives more weight to stronger
components and correlates better with human loudness perception [39].
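The power average of Formula (115) can be sketched as follows; the function name is hypothetical. The same helper would also apply to the time-dependent loudness 𝑁(𝑙), since only the input sequence changes.

```python
import math

E_EXP = 1.0 / math.log10(2.0)  # about 3.32, so that 2 ** E_EXP equals 10


def power_average(values, skip=57):
    """Power average over l = 57 .. l_end; values[l] is indexed from l = 0."""
    kept = values[skip:]
    mean_power = sum(v ** E_EXP for v in kept) / len(kept)
    return mean_power ** (1.0 / E_EXP)


steady = [100.0] * 57 + [3.0] * 50      # transient part is ignored
varying = [0.0] * 57 + [1.0, 5.0]
print(power_average(steady))   # close to 3.0
print(power_average(varying))  # above the arithmetic mean of 3.0
```

The second example illustrates the statement in the text: with an exponent above one, the louder of the two values dominates the average.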
The time-dependent loudness 𝑁(𝑙) is calculated by integrating all specific loudness values, like Formula (26)
with ∆𝑧 = 0,5:
$$N(l) = \sum_{i=1}^{\mathrm{CBF}} N'\left(l, \frac{i}{2}\right) \cdot \Delta z. \tag{116}$$
The unit of the result is soneHMS /Bark HMS and no additional calibration is needed since the specific results were
already calibrated in Formula (23).
The single value 𝑁 of the loudness of the signal is taken again by a power average of the time-dependent
loudness 𝑁(𝑙). Like the specific loudness, the values of 𝑁(𝑙) for 0 ≤ 𝑙 ≤ 56 (approximately corresponding to the
first 300 ms of the input signal) are discarded due to the transient responses of the digital filters.
$$N = \left(\frac{1}{l_\mathrm{end} - 56} \sum_{l} N(l)^{e}\right)^{1/e}, \tag{117}$$

where 𝑙 > 56. The unit of the result is soneHMS . While this process does not significantly modify the loudness of
pure tonal signals in comparison to the result in Formula (26), it improves the result of noise-like signals and
mixtures of tones and noise for which the loudness of the noise components is overestimated [39].
For binaural signals, monaural time-dependent specific loudness values 𝑁L′ (𝑙, 𝑧) and 𝑁R′ (𝑙, 𝑧) of the left and right
channel shall be calculated separately for each channel (assuming diotic signals).
A combined binaural time-dependent specific loudness 𝑁B′ (𝑙, 𝑧) is calculated using the quadratic mean:
$$N'_\mathrm{B}(l,z) = \sqrt{\frac{\left(N'_\mathrm{L}(l,z)\right)^2 + \left(N'_\mathrm{R}(l,z)\right)^2}{2}}. \tag{118}$$
Formula (118) approximately corresponds to the formula for binaural inhibition from the binaural loudness model
by Moore/Glasberg (ISO 532-2 [7], see also Reference [33]). In the case that the loudness value of one channel is
negligible, Formula (118) results in a loudness that is lower by a factor of √0,5 than that of the diotic presentation.
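The quadratic mean of Formula (118) and the √0,5 relation for a negligible channel can be checked with a short sketch; the function name is hypothetical.

```python
import math


def binaural_combine(left, right):
    """Quadratic mean of the left and right specific values, Formula (118)."""
    return math.sqrt((left ** 2 + right ** 2) / 2.0)


diotic = binaural_combine(2.0, 2.0)     # identical channels
one_sided = binaural_combine(2.0, 0.0)  # right channel negligible
print(diotic, one_sided, one_sided / diotic)  # ratio is sqrt(0.5)
```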
For binaural signals, the binaural time-dependent specific loudness 𝑁B′ (𝑙, 𝑧) shall be used as basis for the
calculation of the specific loudness 𝑁′(𝑧), the time-dependent loudness 𝑁(𝑙) and the single value 𝑁 instead of
𝑁′(𝑙, 𝑧) in Clauses 8.1.2, 8.1.3 and 8.1.4.
a) details of the method used to evaluate the loudness (ECMA 418 – Part 2: Psychoacoustic metrics based
on the hearing model – Clause 8.1 Psychoacoustic loudness calculation method), together with a reference
to this Standard;
The psychoacoustic loudness calculation is evaluated by comparison with the target equal-loudness contours
as shown in Figure 2. The loudness was calculated for sinusoidal signals with a frequency of 1000 Hz and a
sound pressure level of 20 to 80 dB with a step size of 20 dB. For other frequencies, the level was varied to
match the loudness calculated for the 1000 Hz tone. The same procedure was performed for the lower threshold
of hearing. The results are shown in Figure A.1. The target equal-loudness contours are emulated well by the
results of the hearing model.
Figure A.1— Results for the equal-loudness contours. The dotted lines show the target equal-loudness
contours, the solid lines are the equal-loudness contours obtained with the hearing model
1. the spectrum (FFT size 65536, sampling rate 48 kHz), a smoothed spectrum (1/24th octave smoothed
FFT: the “background noise”, useful to show general shapes while not resolving pure tones), and a
1-critical-bandwidth peak-hold spectrum as “critical bandwidth ruler”;
2. the tone-to-noise ratio³⁴ (TNR) results along with the TNR tolerance line;
3. the prominence ratio³⁵ (PR) calculated as a full spectrum for each frequency of interest (specific
prominence ratio, SPR), both with and without recognition only of pure tones, along with the
PR tolerance line.
TNR and PR fail since the corresponding tolerance lines are not exceeded. Only SPR shows a marginal value
for a signal with a clearly prominent tonality (even though at a very low sound pressure level).
Figure B.2— Specific psychoacoustic tonality analysis of the same sound used as source for the results
of the analyses shown in Figure B.1
The psychoacoustic tonality is evaluated by comparison with listening test results. As a reference, PR is also
added to the comparison. TNR values were also calculated. However, since they were very similar to the results
of the PR, they are not displayed in the results for reasons of clarity.
For the listening tests, mixtures of a sinusoidal tone with a frequency of 1000 Hz with different levels and pink
noise with different levels were used. Thus, the effect of different signal-to-noise-ratios can be evaluated for
different levels. Five different tests were performed. In all five tests, the level of the pink noise was varied from
40 dB SPL to 80 dB SPL with a step size of 5 dB SPL. The tests differed in the level of the sinusoidal tone,
which was chosen from 55 dB SPL to 75 dB SPL with a step size of 5 dB SPL.
The tests were performed with 16 test subjects. The test subjects were asked to rate the tonality of each sound
on a 13-point categorical scale (ranging from “0 - not tonal” to “12 - extremely tonal”). To compare the results of
the listening tests with the results of the psychoacoustic model, a linear scaling factor was used for the results
of the listening tests. Another scaling factor was used to map the results of the listening tests to the results of
the PR. The scaling factors were derived by minimizing the root-mean-square error between the mean ratings
of all participants and the calculated psychoacoustic tonality (or the PR, respectively) of all five experiments.
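The derivation of such a scaling factor can be sketched as a one-parameter least-squares fit. This is an illustration under assumptions: the scaling is taken to be purely multiplicative and applied to the mean ratings, the data values are made up for the example, and the function names are hypothetical.

```python
import math


def best_scale(mean_ratings, model_values):
    """Factor s minimizing the RMS error between s * rating and the model."""
    numerator = sum(r * m for r, m in zip(mean_ratings, model_values))
    denominator = sum(r * r for r in mean_ratings)
    return numerator / denominator


def rms_error(mean_ratings, model_values, scale):
    """Root-mean-square error after applying the scaling factor."""
    squared = [(scale * r - m) ** 2 for r, m in zip(mean_ratings, model_values)]
    return math.sqrt(sum(squared) / len(squared))


ratings = [2.0, 4.0, 6.0]   # illustrative mean ratings, not measured data
model = [1.0, 2.0, 3.0]     # illustrative model results
scale = best_scale(ratings, model)
print(scale, rms_error(ratings, model, scale))  # 0.5 0.0
```

Setting the derivative of the summed squared error with respect to s to zero gives the closed form used in `best_scale`, so no iterative optimizer is needed for a single factor.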
The results of the evaluation are shown in Figure B.3. The results illustrate one problem of the PR: it decreases
linearly for decreasing SNR. The tonality perception however does not decrease linearly according to the
experimental results. The results of the psychoacoustic hearing model fit much better to the perceived tonality.
Figure B.3 — Psychoacoustic tonality and prominence ratio compared to results of listening tests
The better performance of the psychoacoustic hearing model is also reflected in this error measure. For the
psychoacoustic tonality, the error measure over all five experiments (related to the 13-pt categorical scale) was
0,21, for the PR it was 0,70, for the TNR (not shown in the figures) it was 0,74.
The psychoacoustic roughness is evaluated by comparison with listening test results and data from
Reference [12]. Figure C.1 shows results for amplitude modulated sinusoids with seven different carrier
frequencies (125 Hz, 250 Hz, 500 Hz, 1000 Hz, 2000 Hz, 4000 Hz, 8000 Hz) and different modulation rates.
The results from Reference [12] are idealized, smoothed curves that were fit to the results of jury tests. The
results of the model are close to these idealized curves and never exceed a tolerance of ±0,1 asper.
[Figure C.1 plot. Horizontal axis: modulation rate in Hz (10 to 350). Legend: Roughness algorithm; Data from Fastl/Zwicker; 0.1 asper tolerance.]
Figure C.1 — Results for modulated sinusoids with different carrier frequencies and modulation rates.
All sounds were modulated with 100% degree of modulation and a sound pressure level of 60 dB.
In Figure C.2, the results of the psychoacoustic roughness model are compared with listening test results (mean
values and 95% confidence intervals) for the seven technical sounds and a reference sound (SINUS), which
was used as anchor. It can be seen that the calculated results are all within the 95% confidence intervals of the
listening test data, thus proving that the algorithm performs well for technical sounds. More results can be found
in Reference [30].
[Figure C.2 plot. Vertical axis: roughness in asper. Legend: Roughness algorithm; Jury test.]
Figure C.2 — Results of several technical sounds. The results of the listening tests are displayed: mean
values with 95% confidence intervals.
[1] R. Sottek: A Hearing Model Approach to Time-Varying Loudness, Acta Acustica united with Acustica,
vol. 102, no. 4, pp. 725-744, 2016.
[2] M. Slaney: Auditory Toolbox, Interval Research Corporation, Tech. Rep. 10, 1998.
[3] ISO 532-1: Acoustics – Methods for calculating loudness, Part 1: Zwicker method
[4] DIN 45631/A1:2010: Calculation of loudness level and loudness from the sound spectrum - Zwicker
method - Amendment 1: Calculation of the loudness of time-variant sound, Beuth Verlag, 2010.
[5] J. Chalupper, H. Fastl: Dynamic loudness model (DLM) for normal and hearing-impaired listeners, Acta
Acustica united with Acustica 88(3), pp. 378-386, 2002.
[6] B.R. Glasberg, B.C.J. Moore: A model of loudness applicable to time-varying sounds, Journal of the
Audio Engineering Society 50, pp. 331-341, 2002.
[7] ISO 532-2, Acoustics — Methods for calculating loudness — Part 2: Moore-Glasberg method
[8] J. Rennies, J.L. Verhey, J.E. Appell, B. Kollmeier: Loudness of complex time-varying sounds? A
challenge for current loudness models. Proceedings of Meetings on Acoustics, vol. 19, 050189, 2013.
[9] J. Rennies, M. Wächtler, J. Hots, J.L. Verhey: Spectro-temporal characteristics affecting the loudness
of technical sounds: data and model predictions, Acta Acustica united with Acustica, vol. 101(6), pp.
1145–1156, 2015.
[10] ISO 389-7, Acoustics — Reference zero for the calibration of audiometric equipment — Part 7:
Reference threshold of hearing under free-field and diffuse-field listening conditions
[12] H. Fastl, E. Zwicker: Psychoacoustics. Facts and Models, Springer, Berlin, Heidelberg, New York, 2006.
[13] B. C. Moore: Basic auditory processes involved in the analysis of speech sounds. Philosophical
Transactions of the Royal Society of London B: Biological Sciences, vol. 363, no. 1493, pp. 947-963,
2008.
[14] T. Bierbaums, R. Sottek: Modellierung der zeitvarianten Lautheit mit einem Gehörmodel, Proc. DAGA
2012, Darmstadt, pp. 591-592, 2012.
[15] S. Buus, M. Florentine: Modifications to the power function for loudness. In: E. Summerfield, R. Kompass,
T. Lachmann (eds), Fechner Day 2001. Proceedings of the 17th Annual Meeting of the International
Society for Psychophysics. Berlin: Pabst, pp. 236-241, 2001.
[16] M. Epstein, M. Florentine: A test of the Equal-Loudness-Ratio hypothesis using cross-modality matching
functions, J. Acoust. Soc. Am., vol. 118(2), pp. 907-913, 2005.
[17] R. Sottek: Improvements in calculating the loudness of time varying sounds. Proc. Inter-Noise 2014,
Melbourne, 2014.
[18] R. Sottek: Modelle zur Signalverarbeitung im menschlichen Gehör, dissertation, RWTH Aachen, 1993.
[20] R. Sottek: Progress in calculating tonality of technical sounds, Proc. Inter-Noise 2014, Melbourne, 2014.
[21] R. Sottek: Calculating tonality of IT product sounds using a psychoacoustically-based model, Proc. Inter-
Noise 2015, San Francisco, 2015.
[22] H. Hansen, J.L. Verhey, R. Weber: The Magnitude of Tonal Content. A Review, Acta Acustica united
with Acustica, 97(3), pp. 355-363, 2011.
[23] H. Hansen, R. Weber: Zum Verhältnis von Tonhaltigkeit und der partiellen Lautheit der tonalen
Komponenten in Rauschen, Proc. DAGA 2010, Berlin, pp. 597-598, 2010.
[24] J.L. Verhey, S. Stefanowicz: Binaurale Tonhaltigkeit, Proc. DAGA 2011, Düsseldorf, pp. 827-828, 2011.
[25] J.C.R. Licklider: A Duplex Theory of Pitch Perception, Cellular and Molecular Life Sciences, vol. 7(4),
pp. 128-134, 1951.
[26] R. Sottek: Gehörgerechte Rauhigkeitsberechnung, Proc. DAGA 1994, Dresden, pp. 1197-1200, 1994.
[27] R. Sottek, P. Vranken, H.-J. Kaiser: Anwendung der gehörgerechten Rauhigkeitsberechnung, Proc.
DAGA 1994, Dresden, pp. 1201-1204, 1994.
[28] R. N. Bracewell: The Fourier transform and its applications. McGraw-Hill, New York,1986.
[29] J. Becker, R. Sottek: Psychoacoustic Tonality Analysis, Proc. Inter-Noise 2018, Chicago, 2018.
[30] R. Sottek, J. Becker, T. Lobato: Progress in Roughness Calculation, Proc. Inter-Noise 2020, Seoul, 2020.
[31] A. Oetjen: Threshold and Suprathreshold Phenomena in Auditory Modulation Perception, PhD thesis,
Oldenburg, 2018.
[32] A. Oetjen, U. Letens, S. van de Par, J. Verhey, R. Weber: Roughness calculation for randomly
modulated sounds, Proc. DAGA, Meran, 2013.
[34] HEAD acoustics GmbH: Using the new psychoacoustic tonality analyses Tonality (Hearing Model),
Application Note, 2018.
[35] Zwicker, E., Loudness and excitation patterns of strongly frequency modulated tones, in Sensation and
Measurement, papers in honor of S.S. Stevens, edited by H.R. Moskowitz, B. Scharf, and J.C. Stevens
(D. Reidel, Dordrecht, Netherlands), pp. 325–335, 1974.
[36] R. Sottek, Loudness models applied to technical sounds, Noise-Con 2010, 2010.
[37] Hots, J. et al., Loudness of sounds with a subcritical bandwidth: A challenge to current loudness models?
J. Acoust. Soc. Am. 134(4), EL334–EL339, 2013.
[38] Hots, J. et al., Loudness of subcritical sounds as a function of bandwidth, center frequency, and level.
J. Acoust. Soc. Am., 135(3), pp. 1313-1320, 2014.
[39] R. Sottek, T. Lobato, J. Becker: Loudness of sounds with a subcritical bandwidth: improved prediction
with the concept of tonal loudness, DAGA 2022, Stuttgart, 2022.