Speech and Audio Processing: Lecture-2
By using a system identification technique called linear prediction, it is possible to estimate the parameters of the time-varying filter from the observed signal. The model assumes that the energy distribution of the speech signal in the frequency domain is due entirely to the time-varying filter, with the lungs producing a flat-spectrum white-noise excitation. This model is quite efficient, and many analytical tools have been developed around the concept. The underlying idea is the well-known autoregressive model.
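As an illustrative sketch (not the text's own code) of how such an autoregressive model is fitted in practice, the following estimates the prediction coefficients of one frame with the autocorrelation method and the Levinson-Durbin recursion; the function name `lpc_autocorrelation` and the Hamming window choice are assumptions made here.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Estimate linear prediction (AR) coefficients for one frame
    using the autocorrelation method with the Levinson-Durbin recursion.
    Returns (a, err) with a[0] = 1, so the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    w = frame * np.hamming(len(frame))
    # Autocorrelation of the windowed frame at lags 0..order
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Order update of the coefficient vector
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a, err
```

For a frame drawn from a known autoregressive process, the recursion recovers coefficients close to the true ones, and `err` gives the prediction-error energy used later to scale the excitation.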
Figure 1.8 Example of a speech waveform uttered by a male subject pronouncing the word "problems." Expanded views of a voiced frame and an unvoiced frame are shown. Each frame is 256 samples in length.
White noise samples are created using a unit-variance Gaussian random number generator; passing these samples (with appropriate scaling) through the filter yields the output signal. Figure 1.10 compares the original speech frame with two realizations of filtered white noise. As we can see, there is no time-domain correspondence among the three cases. However, when these three signal frames are played back to a human listener (converted to sound waves), the perception is almost the same!
How could this be? After all, they look so different in the time domain. The secret lies in the fact that they all have a similar magnitude spectrum, as plotted in Figure 1.11. As we can see, the frequency contents are similar, and since the human auditory system is not very sensitive toward phase differences, all three frames sound almost identical (more on this in the next section). The original frequency spectrum is captured by the filter, with all its coefficients. Thus, the flat-spectrum white noise is shaped by the filter so as to produce signals having a spectrum similar to the original speech. Hence, linear prediction analysis is also known as a spectrum estimation technique.
This simple speech coding procedure is summarized below.

Encoding
1. Derive the filter coefficients from the speech frame.
2. Derive the scale factor from the speech frame.
3. Transmit the filter coefficients and scale factor to the decoder.

Decoding
1. Generate a white noise sequence.
2. Multiply the white noise samples by the scale factor.
3. Construct the filter using the coefficients received from the encoder and filter the scaled white noise sequence. The output of the filter is the output speech.
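The encoding and decoding steps above can be sketched as follows; this is a minimal illustration, not a production coder. The coefficients are derived here by solving the autocorrelation normal equations, and the scale factor is taken as the rms value of the prediction residual (both reasonable but assumed choices).

```python
import numpy as np

def encode(frame, order=10):
    """Encoder: derive filter coefficients and a scale factor from one frame."""
    # Step 1: filter coefficients from the autocorrelation normal equations
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    # Step 2: scale factor = rms value of the prediction residual
    err = r[0] + np.dot(a, r[1:order + 1])
    g = np.sqrt(max(err, 0.0) / len(frame))
    # Step 3: (a, g) is what would be transmitted to the decoder
    return a, g

def decode(a, g, n, seed=0):
    """Decoder: scaled white noise through the all-pole filter 1/A(z)."""
    rng = np.random.default_rng(seed)
    e = g * rng.standard_normal(n)            # steps 1 and 2
    x = np.zeros(n)
    for i in range(n):                        # step 3: synthesis filtering
        x[i] = e[i] - sum(a[k] * x[i - k - 1] for k in range(min(len(a), i)))
    return x
```

The decoded waveform differs from the original sample by sample, but its spectral envelope and power are similar, exactly the effect described above.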
Due to this arrangement, the human auditory system behaves very much like a frequency analyzer.
Absolute Threshold
The absolute threshold of a sound is the minimum detectable level of that sound in the absence of any other external sounds.
The horizontal axis is frequency measured in hertz (Hz), while the vertical axis is the absolute threshold in decibels (dB), relative to a reference intensity of 10^-12 watts per square meter, a standard quantity for sound intensity measurement.
As we can see, human beings tend to be most sensitive toward frequencies in the range of 1 to 4 kHz, while thresholds increase rapidly at very high and very low frequencies. It is commonly accepted that, below 20 Hz and above 20 kHz, the auditory system is essentially nonfunctional. These characteristics are due to the structures of the human auditory system.
We can take advantage of the absolute threshold curve in speech coder design. Some approaches are the following:
1. Any signal with an intensity below the absolute threshold need not be considered, since it has no impact on the final quality of the coder.
2. More resources should be allocated to the representation of signals within the most sensitive frequency range, roughly 1 to 4 kHz, since distortions in this range are more noticeable.
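The absolute threshold curve has a widely used closed-form approximation, due to Terhardt, which perceptual coders evaluate per frequency bin to decide which components fall below audibility; a sketch, with the function name chosen here:

```python
import math

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing,
    in dB SPL (i.e., relative to 10^-12 W/m^2), at frequency f_hz in hertz."""
    f = f_hz / 1000.0  # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

Evaluating the formula confirms the behavior described above: the threshold dips to its lowest values in the 1 to 4 kHz region and rises steeply toward very low and very high frequencies.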
Masking
Masking refers to the phenomenon where one sound is rendered inaudible because of the presence of other sounds.
The presence of a single tone, for instance, can mask neighboring signals, with the masking capability inversely proportional to the absolute difference in frequency. Masking capability increases with the intensity of the reference signal, or the single tone in this case.
Phase Perception
There is abundant evidence of phase deafness; for instance, a single tone and its time-shifted version produce essentially the same sensation, and noise perception is chiefly determined by the magnitude spectrum. Even though phase plays only a minor role in perception, some degree of phase preservation in the coding process is still desirable, since it normally increases naturalness. The code-excited linear prediction (CELP) algorithm, for instance, has a mechanism to retain the phase information of the signal.
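The phase-deafness observation can be made concrete with a short experiment: replace the phase spectrum of a frame with random values while keeping the magnitude spectrum. The waveform changes completely, yet the magnitude spectrum, and hence, to a first approximation, the percept, is untouched. The 256-sample random frame below merely stands in for a real speech frame.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)        # stand-in for a 256-sample speech frame

X = np.fft.rfft(frame)
mag = np.abs(X)

# Replace the phase with random values but keep the magnitude
phase = rng.uniform(-np.pi, np.pi, size=X.shape)
phase[0] = 0.0       # keep the DC bin real
phase[-1] = 0.0      # keep the Nyquist bin real (even-length frame)
y = np.fft.irfft(mag * np.exp(1j * phase), n=len(frame))

waveform_change = np.max(np.abs(frame - y))                     # large
spectrum_change = np.max(np.abs(np.abs(np.fft.rfft(y)) - mag))  # essentially zero
```

This mirrors Figures 1.10 and 1.11: very different time-domain waveforms can share the same magnitude spectrum.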