EC39201_Expt4_Lab Report_Grp-24

The experiment investigates the significance of low-frequency temporal cues in speech recognition, demonstrating that speech can be recognized with minimal spectral information. By manipulating audio signals with band-limited noise and bandpass filters, it was found that increasing the number of frequency bands enhances voice clarity. The results indicate that temporal cues play a crucial role in speech perception, especially in the low-frequency range.


Digital Signal Processing Lab

Experiment IV: Speech Recognition with Primarily Temporal Cues

12 October 2022

Group-24
KR Rahul(20EC30021)
Rahul Singh(20EC30037)
AIM

The aim of this experiment is to understand the relative importance of the low-frequency
temporal structure of speech versus its spectral content in speech perception.

THEORY

Nearly perfect speech recognition has been observed under conditions of greatly
reduced spectral information. A temporal envelope was extracted from each of several
wide frequency bands of the speech signal and used to modulate noise of the same
bandwidth. This operation preserved the temporal envelope cues of each band but gave
the listener only severely degraded information about the spectral energy distribution.
Recognition of consonants, vowels, and words in simple sentences improved
significantly as the number of bands increased, and high recognition performance was
achieved with as few as three bands of modulated noise. Representing dynamic temporal
patterns in just a few broad spectral regions is therefore sufficient to recognize speech.

Speech recognition was long thought to require frequency-specific (spectral) cues. For
example, spectral energy peaks in speech reflect the resonance properties of the vocal
tract and carry acoustic information about how the speech was produced. However,
attempts to identify acoustic cues that reliably convey phoneme identity across different
listening conditions and different speakers have met with limited success. Studies using
amplitude compression and spectral reduction demonstrate the robustness of speech
recognition under those manipulations, but the resulting stimuli still had very complex
spectro-temporal properties. Even removing spectral cues from speech entirely yields
stimuli that contain a surprising amount of information about consonant identity. In this
experiment, amplitude and timing cues were retained while the amount of spectral
information was systematically varied. This combination allowed us not only to
parametrically assess the role of spectral detail in speech recognition, independent of
temporal cues, but also to simulate cochlear implant stimulation patterns.

Spectral information was removed from the speech by replacing frequency-specific
information with band-limited noise over a wide range of frequencies. The acoustic
signal was divided into several frequency bands, and the amplitude envelope was
extracted from each band.

Random noise was modulated by the envelope signal and spectrally limited by the
same bandpass filter used for the original analysis band. Thus, timing and amplitude
cues were retained in each spectral band, but spectral detail within each band was
removed. All bands were then summed and presented to the listener.
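The report's implementation is in MATLAB (see CODE below); as an illustrative cross-check, the same noise-vocoding pipeline can be sketched in Python/NumPy. The band-edge formula f_k = 90·64^(k/N), the 240 Hz envelope smoothing cutoff, and the second-order Butterworth filters are taken from the report; SciPy's `butter`, `lfilter`, and `hilbert` stand in for their MATLAB counterparts.

```python
import numpy as np
from scipy.signal import butter, lfilter, hilbert

def noise_vocode(x, fs, n_bands):
    """Replace spectral detail with band-limited noise, keeping envelopes."""
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(x))          # wideband noise carrier
    out = np.zeros(len(x))
    for k in range(n_bands):
        f1 = 90.0 * 64.0 ** (k / n_bands)        # lower band edge (Hz)
        f2 = 90.0 * 64.0 ** ((k + 1) / n_bands)  # upper band edge (Hz)
        b, a = butter(2, [f1, f2], btype="bandpass", fs=fs)
        band = lfilter(b, a, x)                  # analysis band of the speech
        env = np.abs(hilbert(band))              # amplitude envelope (Hilbert)
        bl, al = butter(2, 240.0, btype="low", fs=fs)
        env = lfilter(bl, al, env)               # smooth the envelope (<240 Hz)
        out += lfilter(b, a, env * noise)        # modulated noise, re-band-limited
    return out
```

The sample rate must exceed twice the top band edge (2 × 5760 Hz) for the bandpass design to be valid; e.g. `y = noise_vocode(x, 16000, 8)` for a 16 kHz signal.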
CODE:
clc;
clear;

info = audioinfo('fivewo.wav');
[x, Fs] = audioread('fivewo.wav');
t = 0:seconds(1/Fs):seconds(info.Duration);
t = t(1:end-1);
subplot(3,1,1);
plot(t, x);
xlabel('Time');
ylabel('Audio Data');
title('Plot of the Given Audio');
disp(info);

n = 100;                              % noise attenuation factor
noise = (1/n)*wgn(length(x), 1, 1);   % white Gaussian noise carrier
subplot(3,1,2);
plot(t, noise);
xlabel('Time');
ylabel('Noise');
title('Plot of the White Gaussian Noise');

z = 0;
N = 2;                                % number of bands
for i = 1:N
    % Logarithmically spaced band edges: f_k = 90*64^(k/N)
    f1 = 90*64^((i-1)/N);
    f2 = 90*64^(i/N);
    % Bandpass analysis filter (cutoffs normalized by the Nyquist rate Fs/2)
    [B, A] = butter(2, [f1 f2]/(Fs/2), 'bandpass');
    y = filter(B, A, x);
    % Amplitude envelope of the band via the Hilbert transform,
    % smoothed with a 240 Hz low-pass filter
    y_hilb = hilbert(y);
    [B_, A_] = butter(2, 240/(Fs/2), 'low');
    y_env = filter(B_, A_, abs(y_hilb));
    % Modulate the noise with the envelope, then re-limit it to the same band
    mult = filter(B, A, y_env.*noise);
    z = z + mult;
end

subplot(3,1,3);
plot(t, z);
xlabel('Time');
ylabel('Final Audio');
title('Plot of Final Audio');

sound(n*z, Fs);
audiowrite('final.wav', z, Fs);
● For N=2 Bandpass Filters:

● For N=8 Bandpass Filters:


● For N=16 Bandpass Filters:

DISCUSSION

1. For N = 1 or 2, where N is the number of bands, the voice in the provided audio
file could not be recognized, although the rhythmic pattern could still be followed.
The voice became recognizable from 3 bands onwards.
2. The voice is clearly recognizable with 8 and 16 bands; the clarity of the voice
with 16 bands is similar to that with 8 bands.
3. As bands are added, the time-domain representation comes increasingly close
to the original signal, and the frequency-domain representation begins to
resemble the actual spectrum of the audio.
4. The spacing between bands is crucial for the decoding of the audio signal.
5. Most of the energy of the human voice lies in the low-frequency range. As we
apply narrower-bandwidth filters with increasingly closely spaced bands, we
recover more of the information carried by the speech. Thus, increasing the
number of bands improves the clarity of the output voice.
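On point 4 above: the band edges in this experiment follow f_k = 90·64^(k/N), so they are logarithmically spaced between 90 Hz and 90·64 = 5760 Hz regardless of N; increasing N only narrows each band. A quick sketch (Python for illustration; the report's code is MATLAB):

```python
def band_edges(n_bands):
    # Logarithmic edges: constant ratio 64**(1/n_bands) between neighbours
    return [90.0 * 64.0 ** (k / n_bands) for k in range(n_bands + 1)]

print(band_edges(2))  # [90.0, 720.0, 5760.0]
```

For N = 8 the ratio between consecutive edges is 64^(1/8) ≈ 1.68, so every band spans the same fraction of an octave, matching the roughly logarithmic frequency resolution of the ear.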
