Hands-On Lab On Speech Processing - Time-domain Processing - 2021
In this lab, you will get acquainted with speech signals and their short-time processing. You will
explore the time-domain structure of the most basic speech elements, such as vowels and consonants.
You will learn about time-domain properties of speech signals, and you will apply basic measures to
speech, such as energy and zero-crossings.
The final goal of this lab is to implement a simple, fully automated voiced/unvoiced/silence (VUS)
discriminator. Such an algorithm is very practical in real-life systems such as mobile communications.
A typical Voice Activity Detector (VAD), which is a simpler variant of a VUS discriminator, is used in
the Global System for Mobile Communications (GSM), the European standard for cellular communications.
The algorithm you will implement is based on energy and zero-crossing measures of the speech
signal; such computations are fast enough for real-time implementations. However, in this lab, we will
consider an off-line approach. This means that the whole signal is available to us from the start.
1 Theoretical Background
The algorithm is based on energy and zero-crossing measures of the speech waveform. Let us start
with a short introduction to these measures.

The short-time energy of a signal x[n] at analysis time n is defined as
$$E_n = \sum_{m=-\infty}^{\infty} \big( x[m]\, w[n-m] \big)^2 \qquad (1)$$
which can be rewritten as
$$E_n = \sum_{m=-\infty}^{\infty} x^2[m]\, h[n-m] \qquad (2)$$
where h[n] = w^2[n] is the squared analysis window applied to a speech segment. We can safely as-
sume that the analysis window is supported in [-N, N]. The choice of the impulse response, h[n], or
equivalently the analysis window, determines the nature of the short-time energy representation. To
see how the choice of window affects the short-time energy, observe that if h[n] in equation (2)
were very long and of constant amplitude, En would change very little with time. Such a
window would be the equivalent of a very narrowband lowpass filter. Clearly what is desired is some
lowpass filtering but not so much that the output is constant; i.e., we want the short-time energy to
reflect the amplitude variations of the speech signal. Thus, we encounter for the first time a conflict
that will repeatedly arise in the study of short-time representations of speech signals. That is, we
wish to have a short duration window (impulse response) to be responsive to rapid amplitude changes,
but a window that is too short will not provide sufficient averaging to produce a smooth energy function.
The effect of the window on the time-dependent energy representation can be illustrated by discussing
the properties of two representative windows: the rectangular window
$$h[n] = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
and the Hamming window
$$h[n] = \begin{cases} 0.54 - 0.46\,\cos\big( 2\pi n/(N-1) \big), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
where N is the window length in samples. The rectangular window corresponds to applying equal
weight to all the samples in the interval (n − N + 1) to n, whereas the Hamming window gives more
weight to the center of the window, which is preferable in many applications. If the window size N is
too small, i.e., on the order of a pitch period or less, En will fluctuate very rapidly, depending on the
exact details of the waveform. If N is too large, i.e., on the order of several (3-4) pitch periods, En will
change very slowly and thus will not adequately reflect the time-varying properties of the speech signal.
Unfortunately, this implies that no single value of N is entirely satisfactory. With these shortcomings
in mind, a suitable practical choice for N is on the order of 300 − 500 samples for a 16 kHz sampling
rate (i.e., 20 − 30 ms duration).
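To make this concrete, here is a minimal MATLAB sketch of equation (2), assuming a speech vector s and its sampling rate fs are already loaded, and that the hamming window function (Signal Processing Toolbox) is available:
% Sketch: short-time energy as lowpass filtering of the squared signal (Eq. (2))
N = round(0.025*fs);            % 25 ms window, within the suggested 20-30 ms range
w = hamming(N);                 % analysis window
h = w.^2;                       % h[n] = w^2[n]
En = conv(s(:).^2, h, 'same');  % En = sum_m x^2[m] h[n-m]
figure; plot((0:length(En)-1)/fs, En);
xlabel('Time (s)'); ylabel('Short-Time Energy');
Try a longer and a shorter N to see the smoothing trade-off discussed above.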
Speech signals are broadband signals, and the interpretation of the average zero-crossing rate is
therefore much less precise. However, rough estimates of spectral properties can be obtained using a
representation based on the short-time average zero-crossing rate. Before discussing the interpretation
of the zero-crossing rate for speech, let us first define it and discuss the theory behind it. An appropriate
definition is
$$Z_n = \sum_{m=-\infty}^{\infty} \big|\, \mathrm{sgn}(x[m]) - \mathrm{sgn}(x[m-1]) \,\big|\, w[n-m] \qquad (6)$$
where
$$\mathrm{sgn}(x[n]) = \begin{cases} 1, & x[n] \ge 0 \\ -1, & \text{otherwise} \end{cases} \qquad (7)$$
and
$$w[n] = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
This representation shows that the short-time average zero-crossing rate has the same general proper-
ties as the short-time energy. However, the defining equation (6) makes the computation of Zn appear
more complex than it really is. All that is required is to check consecutive samples in pairs to determine
where the zero-crossings occur; the count is then averaged over N consecutive samples (the scaling by
1/(2N) can even be omitted in practice).
Now let us see how the short-time average zero-crossing rate applies to speech signals. The model
for speech production suggests that the energy of voiced speech is concentrated below 4 kHz, whereas
for unvoiced speech, most of the energy is found at higher frequencies. Since high frequencies imply
high zero-crossing rates, and low frequencies imply low zero-crossing rates, there is a strong correlation
between the zero-crossing rate and the distribution of energy over frequency. A reasonable generalization is that
if the zero-crossing rate is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the
speech signal is voiced. This, however, is a very imprecise statement because we have not said what is
high and what is low, and, of course, it really is not possible to be precise. Despite this imprecision,
zero-crossing rate is definitely a simple and convenient measure for speech discrimination.
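For intuition: a pure sinusoid of frequency f0 crosses zero 2*f0 times per second, so sampled at fs it
produces on average 2*f0/fs crossings per sample. For example, at fs = 16 kHz, a 100 Hz voiced
fundamental gives about 0.0125 crossings per sample, while energy concentrated around 4 kHz gives
about 0.5 — a difference of more than an order of magnitude.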
Let us now move from theory to practice. If you cut a frame from a voiced region of a speech
waveform and plot it, you can see that the signal has a periodicity of some form. It is not strictly
periodic; it is quasi-periodic. It has been observed that voiced speech can be represented using periodic
or quasi-periodic waveforms. So, speech signals like this one, which have a strong periodicity of some
kind, can be thought of as voiced speech signals, like /a/, /e/, /o/, etc. You can also see that it is a
low-frequency signal: it does not change very quickly over time. Let's see what happens with the energy
of a signal like this one.
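Here frame1 denotes the voiced frame under discussion; a sketch of the extraction step, with purely illustrative sample indices (pick any clearly voiced region of your own waveform):
% Sketch: cut a voiced frame out of the waveform s (indices are illustrative)
frame1 = s(2000:2800);
N1 = length(frame1);
figure; plot(0:1/fs:N1/fs-1/fs, frame1); grid;
title('A voiced frame'); xlabel('time');
% Listen if you want
%soundsc(frame1, fs);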
E1 = (1/N1)*sum(frame1.^2); % average energy of the voiced frame
We'll keep this result; we'll need it later. Let's now cut another piece of speech from our waveform.
Let’s try this one.
[Figure: panel (a) 'A voiced frame' and panel (b) 'An unvoiced frame', amplitude vs. time; panel (c) 'Energy Value' of the two frames vs. index.]
Figure 1: Voiced and Unvoiced speech sounds, along with their energies.
frame2 = s(4800:5500);  % cut an unvoiced frame from the waveform
N2 = length(frame2);
figure; plot(0:1/fs:N2/fs-1/fs, frame2); grid;
title('An unvoiced frame'); xlabel('time');
% Listen if you want
%soundsc(frame2, fs);
You can see in Figure 1b that this one is very peaky! It has no periodicity at all, and it changes
very quickly in time. The latter means that it is a high-frequency signal. Speech signals like this
can be thought of as unvoiced speech signals, like /s/, /f/, /sh/, etc. Let's take a look at the energy of
this kind of speech signal and compare it with the first one.
E2 = (1/N2)*sum(frame2.^2); % average energy of the unvoiced frame
You can see in Figure 1c that the energy of the voiced speech signal is considerably higher than that
of the unvoiced one. If you repeat this with several other voiced/unvoiced examples, you will see that
this observation holds in general. If we do the same for a full speech waveform, we get the result
depicted in Figure 2.
[Figure 2: a full speech waveform, amplitude vs. time (s), on top; its short-time energy vs. time (s) below.]
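A frame-based computation along the following lines produces such an energy contour; a sketch, where the 30 ms window and 10 ms step are our own choices:
% Sketch: frame-based short-time energy contour
Lw = round(0.030*fs);                 % 30 ms analysis window
hop = round(0.010*fs);                % 10 ms step
nFrames = floor((length(s) - Lw)/hop) + 1;
E = zeros(1, nFrames);
for k = 1:nFrames
    frm = s((k-1)*hop + (1:Lw));      % k-th analysis frame
    E(k) = (1/Lw)*sum(frm.^2);        % average energy, as for E1 and E2 above
end
tE = ((0:nFrames-1)*hop + Lw/2)/fs;   % energies localized at the window centers
figure; plot(tE, E); xlabel('Time (s)'); ylabel('Short-Time Energy');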
You can clearly see that voiced parts of speech have higher energy than unvoiced or silent ones. So,
the energy of a speech frame is a good indicator of whether a frame is a voiced or unvoiced one.
How can we exploit the number of zero-crossings in a speech frame to discriminate between voiced
and unvoiced speech? Let's take another look at our two extracted speech frames:
Take a look at Figure 3a. Can you guess what is going on with the zero-crossings? :-) You can
see that the number of zero-crossings is really high in the unvoiced speech frame, much higher than in
the voiced speech frame! So, this can be our second indicator of whether a speech frame is voiced or
unvoiced. There is a straightforward way to find out how many zero-crossings there are in a speech
frame. Let's see it:
ZCr1 = 0.5*sum(abs(sign(frame1(2:end))-sign(frame1(1:end-1)))); % zero-crossings of the voiced frame
ZCr2 = 0.5*sum(abs(sign(frame2(2:end))-sign(frame2(1:end-1)))); % zero-crossings of the unvoiced frame
[Figure: panels (a) 'Voiced frame' and 'Unvoiced frame' vs. sample index, with their zero-crossings marked; panel (b) 'Zero-Crossings comparison' of the voiced and unvoiced frames.]
Figure 3: Voiced and Unvoiced speech sounds, along with their zero-crossings.
The above expression counts the number of zero-crossings in a speech frame. The first term returns
the sign of each sample value, starting from sample 2, and the second term returns the sign of each
sample value, starting from sample 1 and ending one sample before the last. Their difference is non-zero
(namely, ±2) exactly when the sign of the waveform changes between two consecutive samples. So,
summing the absolute values of these differences gives us twice the number of zero-crossings, which is
why we multiply the sum by 0.5. Try it on paper to see why this is valid. A for loop and an if statement
would do the same thing, but we prefer to avoid loops in MATLAB, especially long ones, since vectorized
code is usually much faster.
You can see the result in Figure 3b. We can generalize to a full waveform, and what we get is
shown in Figure 4. You can see that the number of zero-crossings is quite high in unvoiced parts of
speech, whereas it is low in voiced parts. So, the zero-crossing count is a quite convenient way to
discriminate voiced from unvoiced speech.
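The contour shown in Figure 4 can be computed with the same framing as in the energy sketch above (Lw, hop, nFrames, and tE as defined there):
% Sketch: frame-based short-time zero-crossings contour
Z = zeros(1, nFrames);
for k = 1:nFrames
    frm = s((k-1)*hop + (1:Lw));
    Z(k) = 0.5*sum(abs(sign(frm(2:end)) - sign(frm(1:end-1))));
end
figure; plot(tE, Z); xlabel('Time (s)'); ylabel('Short-Time Zero Crossings Rate');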
2.3 Limitations
A combination of the methods we have described seems reasonable and powerful enough for our
purpose. And indeed, it is, up to a certain level. :-) However, there are some problems in our simple
approach. Some frames are transient frames (something between voiced and unvoiced), which cannot
be easily detected and categorized. Others are neither purely voiced nor purely unvoiced, such as
fricatives, plosives, or nasals (/p/, /g/, /d/, /b/, /m/, /n/, etc.). Also, and more importantly, there
might be some voiced frames with low energy and some unvoiced with high energy. This depends on
the speaker and the speaking style. Even the zero-crossing rate may differ among different speakers or
speaking styles.
[Figure 4: the waveform of the utterance 'The fish twisted and turned on the bent hook' (top) and its short-time zero crossings rate (bottom), both vs. time (s).]
So, there are limitations to our approach, as you may have guessed. However, it is not
far from the one used in real systems, like the GSM standard for European cellular communications,
and the purpose of this lab is not to build a robust VUS discriminator, but
to use simple tools and familiarize yourselves with time-domain properties of speech.
• What about silence? We haven't mentioned anything about it in our analysis. Well, ideally, a
silence frame would be a frame with all samples equal to zero, right? This means energy equal
to zero and no zero-crossings at all. In practice, however, because of microphone noise or noise
from the environment (breath, room reflections, etc.), silence frames show some small sample
variation. This variation is so small that the energy should be lower than in both voiced and
unvoiced frames, and the zero-crossings should also be fewer than in the other two cases. Usually,
this is what happens in the real world. :-)
• As we mentioned at the beginning, our algorithm has an adaptivity of some form, which means
that its results depend on some statistics of the input waveform. This adaptivity is reflected
in the thresholds we introduce in order to make our discrimination into voiced/unvoiced/silence
frames. You will see that there are two thresholds, one for the energy and one for the zero-crossings.
These thresholds depend on the input waveform, so they are not fixed numbers.
• Your algorithm should work on a frame-by-frame basis. That means you have to estimate your
energy and zero-crossings at a set of analysis time instants of your choice, with a step size of
your choice. Note, however, that your analysis window should be long enough for your statistics
to be meaningful, but also short enough to give good time localization. :-) It is suggested that
your analysis window and your frame rate be 20-30 ms and 5-10 ms, respectively. Your energy
and zero-crossings estimates are considered to be localized at the center of the analysis window.
• So, for each frame, you will have to calculate the above measures (energy and zero-crossings) and
find out if the frame is voiced/unvoiced/silence.
• Because of its simplicity, this VUS discriminator should perform adequately but not perfectly
(actually, a very accurate VUS discriminator is still a subject of research in the speech community,
although several robust and highly accurate ones have been proposed over the years). However,
it should at least correctly detect the voiced parts of speech.
• Also, since your VUS gives estimates every 5-10 ms, you will have to interpolate your results
over the whole speech waveform in order to have a continuous estimate (that is, one for every
time sample). To do this, you can use the interp1 function that is readily available in MATLAB;
see the sketch right after this list. Try different interpolation methods, like spline, cubic, and
linear interpolation schemes. You can visually inspect the results and check which method
performs better. Justify your results and make a short comment.
• Moreover, you can try different analysis window sizes, different analysis window types, or different
frame rates to see how the results change.
• Finally, in the speech examples that we provide, there are two speech files ending in -sin.wav
and -swn.wav. These are two utterances with the same content (the speaker says the same
thing) but in a different speaking style. In the -sin.wav file, the speaking style is more "stressed",
more "intense" in a way. This is called Lombard speech: the speaker changes his/her speaking
style in order to produce speech that is more intelligible in a noisy environment. It is like
speaking in a very quiet place (-swn.wav file) versus in a cafeteria, an airport, or any other
crowded place (-sin.wav). Can you see any significant energy or zero-crossing rate changes
between these two waveforms? If yes, comment on your results.
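As referenced in the interpolation item above, here is a minimal sketch of that step, assuming VUS holds one decision per frame and tc holds the corresponding analysis-window center times in seconds (tc is our own name):
% Sketch: interpolate frame-level VUS decisions to every time sample
t = 0:1/fs:length(s)/fs-1/fs;              % one time instant per sample
VUSi_lin = interp1(tc, VUS, t, 'linear');  % piecewise-linear interpolation
VUSi_spl = interp1(tc, VUS, t, 'spline');  % cubic spline interpolation
VUSi_pch = interp1(tc, VUS, t, 'pchip');   % shape-preserving cubic
% Query points outside [tc(1), tc(end)] return NaN unless you request extrapolation,
% e.g. interp1(tc, VUS, t, 'linear', 'extrap').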
% Signal length
D = length(s);
% Frame length (30 ms, how many samples? )
L = %INSERT CODE HERE
% Number of frames
Nfr = %INSERT CODE HERE
% Classification
for i = 1:1:Nfr
if % INSERT CONDITION HERE
% VOICED
VUS(i) = 1.0;
elseif % INSERT CONDITION HERE
% SILENCE
VUS(i) = 0.0;
elseif % INSERT CONDITION HERE
% UNVOICED
VUS(i) = 0.5;
end
end
% Visualize
figure;
t = 0:1/fs:length(s)/fs-1/fs;
plot(t, VUSi); % VUSi: frame-level VUS decisions interpolated to every sample (via interp1)
hold on; plot(t, s/max(abs(s)), 'r'); hold off;
xlabel('Time (s)');
title('Energy & Zero-Crossings Rate-based VUS discrimination');
grid;
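One possible way to fill in the blanks above — a sketch only: the framing, threshold rules, and threshold values below are our own illustrative choices, not the prescribed solution:
% Sketch: per-frame measures and one possible adaptive threshold rule (illustrative)
step = round(0.010*fs);                     % 10 ms frame step
L = round(0.030*fs);                        % 30 ms frame length
Nfr = floor((D - L)/step) + 1;
E = zeros(1, Nfr); Z = zeros(1, Nfr);
for i = 1:Nfr
    frm = s((i-1)*step + (1:L));
    E(i) = (1/L)*sum(frm.^2);                                   % average energy
    Z(i) = 0.5*sum(abs(sign(frm(2:end)) - sign(frm(1:end-1)))); % zero-crossings
end
% Adaptive thresholds derived from the waveform's own statistics (illustrative values)
eThr = 0.1*mean(E);   % low-energy cutoff
zThr = mean(Z);       % high zero-crossings cutoff
VUS = zeros(1, Nfr);
for i = 1:Nfr
    if E(i) > eThr && Z(i) < zThr
        VUS(i) = 1.0;   % VOICED: high energy, few zero-crossings
    elseif E(i) <= eThr && Z(i) < zThr
        VUS(i) = 0.0;   % SILENCE: low energy, few zero-crossings
    else
        VUS(i) = 0.5;   % UNVOICED: many zero-crossings
    end
end
You will almost certainly need to tune such rules per waveform; the 0.1 factor and the use of the mean are only starting points.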
If everything went well, you should see something like Figure 5 below. You can see that it is not
perfect but it works. :-)
You can try your own .wav file or you can use the ones we provide.
[Figure 5: the VUS decision, with levels labeled VOICED, UNVOICED, and SILENCE, overlaid on the normalized speech waveform.]
• Comment on the results of your VUSD, in all provided speech waveforms. Does it work well? If
not, where does it have problems? Comment and include figures for all waveforms in your report.
• What happens if you increase the frame rate to 20 or 30 ms? What happens if you decrease the
frame rate down to 2.5 or 5 ms? What happens if you change the window size (make it shorter
or larger)? Comment (figures are NOT necessary).
• Record your voice with a microphone and save it into a .wav file. Use a sampling frequency of
Fs = 16 kHz for your recording and a 16-bit precision. Load it into MATLAB (use audioread.m)
and apply your algorithm on the waveform. Include a figure of your voice and the corresponding
result of the algorithm.
If you have ANY questions on this lab, please send an e-mail to: [email protected]