Speech and Audio Processing: Lecture-2

The document discusses modeling the human speech production system and the human auditory system. It describes how the speech production system can be modeled as a time-varying filter excited by a white noise source representing the lungs. Linear prediction is used to estimate the filter parameters from observed speech signals. The model aims to replicate the frequency spectrum of speech using filtered white noise. It also describes key aspects of the human auditory system including the pinna, ossicles, cochlea, and properties like absolute threshold and masking.

Uploaded by Randeep Singh
Copyright: © Attribution Non-Commercial (BY-NC)


Speech and Audio Processing

Lecture-2 By: Mohit Goel

Modeling the Speech Production System


A model is a simplified representation of the real world. It is designed to help us better understand the world in which we live and, ultimately, to duplicate many of the behaviors and characteristics of real-life phenomena. For a model to be successful, it must be able to replicate, partially or completely, the behavior of the particular object or process it intends to capture or simulate. The model may be a physical one (e.g., a model airplane) or a mathematical one, such as a formula.

Modeling the Speech Production System


The human speech production system can be modeled using a rather simple structure: the lungs, which generate the air flow or energy that excites the vocal tract, are represented by a white noise source. The acoustic path inside the body, with all its components, is associated with a time-varying filter.

By using a system identification technique called linear prediction, it is possible to estimate the parameters of the time-varying filter from the observed signal. The model assumes that the energy distribution of the speech signal in the frequency domain is due entirely to the time-varying filter, with the lungs producing a flat-spectrum excitation, i.e., white noise. This model is rather efficient, and many analytical tools have already been developed around it. The underlying idea is the well-known autoregressive (AR) model.
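The autoregressive idea can be made concrete with a short sketch. The function names below are my own, not from the lecture; the sketch estimates the predictor coefficients of a frame using the autocorrelation method and the Levinson-Durbin recursion, the classical way of solving the linear prediction equations:

```python
def autocorr(x, order):
    """Biased autocorrelation lags r[0..order] of a frame."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r):
    """Solve the LP normal equations for predictor coefficients a[1..p]
    from autocorrelation lags r[0..p] via the Levinson-Durbin recursion.
    Returns (coefficients, final prediction-error power)."""
    p = len(r) - 1
    a = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # prediction error shrinks each order
    return a[1:], err
```

For a synthetic AR(1) signal x[n] = 0.9 x[n-1] + e[n] with unit-variance white noise e[n], the recursion recovers a coefficient close to 0.9, illustrating how the filter parameters are identified from the observed signal alone.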

Parametric Speech Coding


Consider the speech frame corresponding to an unvoiced segment with 256 samples in Figure 1.8. Applying the samples of the frame to a linear prediction analysis procedure, the coefficients of an associated filter are found. This filter has an all-pole system function whose coefficients are the linear prediction coefficients.

Figure 1.8 Example of a speech waveform uttered by a male subject saying the word "problems". Expanded views of a voiced frame and an unvoiced frame are shown. Each frame is 256 samples in length.

White noise samples are created using a unit-variance Gaussian random number generator; when these samples (with appropriate scaling) are passed through the filter, the output signal is obtained. Figure 1.10 compares the original speech frame with two realizations of filtered white noise. As we can see, there is no time-domain correspondence between the three cases. However, when these three signal frames are played back to a human listener (converted to sound waves), the perception is almost the same!

How could this be? After all, they look so different in the time domain. The secret lies in the fact that they all have a similar magnitude spectrum, as plotted in Figure 1.11. As we can see, the frequency contents are similar, and since the human auditory system is not very sensitive to phase differences, all three frames sound almost identical (more on this in the next section). The original frequency spectrum is captured by the filter, with all its coefficients. Thus, the flat-spectrum white noise is shaped by the filter so as to produce signals having a spectrum similar to the original speech. Hence, linear prediction analysis is also known as a spectrum estimation technique.

This simple speech coding procedure is summarized below.

Encoding:
1. Derive the filter coefficients from the speech frame.
2. Derive the scale factor from the speech frame.
3. Transmit the filter coefficients and scale factor to the decoder.

Decoding:
1. Generate a white noise sequence.
2. Multiply the white noise samples by the scale factor.
3. Construct the filter from the transmitted coefficients and filter the scaled white noise sequence. The output speech is the output of the filter.
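The decoding steps collapse into a single all-pole synthesis loop. The sketch below is illustrative (`decode_frame` is my own name, not from the lecture); it implements the recursion y[n] = scale·e[n] + Σ a_j·y[n−j] implied by the model:

```python
import random

def decode_frame(coeffs, scale, n_samples, seed=0):
    """Decoder sketch: shape scaled unit-variance white noise with the
    all-pole synthesis filter defined by the transmitted coefficients."""
    rng = random.Random(seed)              # fixed seed for repeatability
    p = len(coeffs)
    y = []
    for _ in range(n_samples):
        e = rng.gauss(0.0, 1.0)            # step 1: white noise sample
        past = sum(coeffs[j] * y[-1 - j] for j in range(min(p, len(y))))
        y.append(scale * e + past)         # steps 2-3: scale and filter
    return y
```

Each call reproduces a 256-sample frame of "speech-like" noise from only p coefficients and one scale factor, which is exactly the bit-rate advantage of the parametric approach.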

General Structure of a Speech Coder

Human Auditory System


The pinna (or, informally, the ear) is the surface surrounding the canal into which sound is funneled. Sound waves are guided by the canal toward the eardrum, a membrane that acts as an acoustic-to-mechanical transducer. The sound waves are then translated into mechanical vibrations that are passed to the cochlea through a series of bones known as the ossicles.

Human Auditory System


The presence of the ossicles improves sound propagation by reducing the amount of reflection; this is accomplished through the principle of impedance matching. The cochlea is a rigid, snail-shaped organ filled with fluid. Mechanical oscillations impinging on the ossicles cause an internal membrane, known as the basilar membrane, to vibrate at various frequencies. The basilar membrane is characterized by a set of frequency responses at different points along the membrane.

Human Auditory System


Motion along the basilar membrane is sensed by the inner hair cells and causes neural activities that are transmitted to the brain through the auditory nerve.

Due to this arrangement, the human auditory system behaves very much like a frequency analyzer.

Absolute Threshold
The absolute threshold of a sound is the minimum detectable level of that sound in the absence of any other external sounds.

The horizontal axis is frequency measured in hertz (Hz), while the vertical axis is the absolute threshold in decibels (dB), referenced to an intensity of 10^-12 watts per square meter, a standard reference quantity for sound intensity measurement.
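The dB values on the vertical axis follow the standard sound-intensity-level definition, L = 10·log10(I/I0); a one-line sketch:

```python
import math

I0 = 1e-12  # reference intensity in watts per square meter

def intensity_to_db(intensity):
    """Sound intensity level in dB relative to the reference I0."""
    return 10.0 * math.log10(intensity / I0)
```

An intensity equal to the reference gives 0 dB, and an intensity of 1 W/m^2 corresponds to 120 dB.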

Absolute Threshold
As we can see, human beings tend to be more sensitive toward frequencies in the range of 1 to 4 kHz, while thresholds increase rapidly at very high and very low frequencies. It is commonly accepted that below 20 Hz and above 20 kHz, the auditory system is essentially dysfunctional. These characteristics are due to the structures of the human auditory system.

Absolute Threshold
We can take advantage of the absolute threshold curve in speech coder design. Some approaches are the following:

- Any signal with an intensity below the absolute threshold need not be considered, since it has no impact on the final quality of the coder.
- More resources should be allocated to the representation of signals within the most sensitive frequency range, roughly 1 to 4 kHz, since distortions in this range are more noticeable.
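For a concrete threshold curve, a widely used analytical approximation (Terhardt's fit, which is standard in audio coding but not taken from this lecture's figure) can be evaluated directly:

```python
import math

def threshold_quiet_db(f_hz):
    """Terhardt's approximation to the absolute threshold of hearing
    in dB SPL as a function of frequency in Hz -- an analytical fit,
    used here only to illustrate the shape of the curve."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The fit reproduces the qualitative behavior described above: the threshold dips (even below 0 dB) near 3-4 kHz, where hearing is most sensitive, and rises steeply toward very low and very high frequencies.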

Masking
Masking refers to the phenomenon where one sound is rendered inaudible because of the presence of other sounds.

Masking
The presence of a single tone, for instance, can mask neighboring signals, with the masking capability inversely proportional to the absolute difference in frequency. Masking capability also increases with the intensity of the reference signal, the single tone in this case.
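These two trends can be captured in a toy sketch (the function name, slope, and offset are purely illustrative, not a real psychoacoustic model): the masking threshold falls off with frequency distance from the masker and rises with masker intensity.

```python
def tone_masking_threshold_db(masker_db, delta_f_hz,
                              slope_db_per_khz=10.0, offset_db=15.0):
    """Toy illustration only: threshold below which a neighboring signal
    is masked, falling linearly with distance from the masking tone."""
    return masker_db - offset_db - slope_db_per_khz * abs(delta_f_hz) / 1000.0
```

A signal 100 Hz away from an 80 dB tone is masked up to a higher level than one 500 Hz away, and raising the masker level raises the masked threshold everywhere.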

Phase Perception
There is abundant evidence of phase deafness; for instance, a single tone and its time-shifted version produce essentially the same sensation; similarly, noise perception is chiefly determined by the magnitude spectrum. Even though phase plays a minor role in perception, some level of phase preservation in the coding process is still desirable, since it normally increases naturalness. The code-excited linear prediction (CELP) algorithm, for instance, has a mechanism to retain the phase information of the signal.
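The tone example can be checked numerically: a tone and its phase-shifted version differ sample by sample in the time domain yet share exactly the same magnitude spectrum. A minimal sketch using a direct DFT (fine for a short frame):

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitude spectrum via a direct DFT (O(n^2), fine for short frames)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

n = 64
# A tone with 4 cycles per frame, and the same tone shifted by 1 radian.
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
shifted = [math.sin(2 * math.pi * 4 * t / n + 1.0) for t in range(n)]

mags_a = dft_magnitudes(tone)
mags_b = dft_magnitudes(shifted)
# The two magnitude spectra agree bin by bin, even though the waveforms differ.
```

Since the ear is largely driven by the magnitude spectrum, the two waveforms sound the same despite their visible time-domain difference, which is exactly the phase-deafness observation above.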
