NLP Project Report
INSTITUTE OF TECHNOLOGY
INFORMATION TECHNOLOGY DEPARTMENT
Project Title: Develop a Speaker- and Text-Dependent Isolated Speech Recognizer System
Names of Scholars and IDs
Speech recognition is the process in which spoken words of a particular speaker are automatically recognized based on the information contained in the individual speech waves. The definition of speech recognition according to the Macmillan Dictionary is "a system where you speak to a computer to make it do things, for example instead of using a keyboard". While the definition is true, as the area of artificial intelligence moves forward, the applications for speech recognition have multiplied. To be able to communicate with devices in a natural way, we need speech recognition. This, of course, makes it necessary to have high accuracy, fast speed and the ability to recognize many different speakers. The use of speech recognition is increasing rapidly and it is now available in smart TVs, desktop computers, every new smartphone, etc., allowing us to talk to computers naturally. With its use in home appliances, education and even surgical procedures, accuracy and speed become very important. A speech recognition (SR) system can basically be either speaker-dependent or speaker-independent. A speaker-dependent system is intended to be used by a single speaker and is therefore trained to understand one particular speech pattern. A speaker-independent system is intended for use by any speaker and is naturally more difficult to achieve. These systems tend to have 3 to 5 times higher error rates than speaker-dependent systems.
To understand SR, one should understand the components of human speech. A phoneme is defined as the smallest unit of speech that distinguishes meaning. Every language has a set number of phonemes, which will sound different depending on accents, dialects and physiology. When phonemes are considered in SR, they can be considered in their acoustic context, which makes them sound different, i.e. by also considering the phoneme to the left or right of the phoneme we're recognizing.
A speaker-dependent system, depending on training and speaker, is usually more accurate than a speaker-independent system. There are also multi-speaker systems that are intended to be used by a small group of people, and speaker-adaptive systems that learn to understand any speaker given a small amount of speech data for training.
Isolated speech, meaning single words, and discontinuous speech, meaning full sentences with words artificially separated by silence, are the easiest to recognize since the word boundaries are detectable. Continuous speech is the most difficult to recognize because of co-articulation and unclear boundaries, but it is the most interesting since it allows us to speak naturally.
The constraints can be task-dependent, accepting only sentences relevant to the task, e.g. a ticket purchase service rejecting "The car is blue". Others can be semantic, rejecting "The car is sad", or syntactic, rejecting "Car sad the is". Constraints are represented by a grammar, filtering out word sequences that violate them.
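As a toy illustration of such a constraint (not taken from this project; the allowed word pairs below are invented), a minimal MATLAB sketch of a bigram grammar filter might look like this:

    % Minimal grammar-constraint sketch: accept a sentence only if every
    % adjacent word pair appears in a hand-written list of allowed bigrams.
    % (Illustrative only; the allowed pairs are assumptions.)
    allowed  = {'the car', 'car is', 'is blue'};
    sentence = 'the car is blue';

    words = strsplit(lower(sentence));
    ok = true;
    for k = 1:numel(words)-1
        bigram = [words{k} ' ' words{k+1}];
        if ~any(strcmp(bigram, allowed))
            ok = false;   % bigram not licensed by the grammar
            break;
        end
    end
    % ok is true for 'the car is blue', false for 'car sad the is'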
The common method used in automatic speech recognition systems is the probabilistic approach, computing a score for matching spoken words with a speech signal. A speech signal corresponds to any word or sequence of words in the vocabulary with a probability value. The score is calculated from phonemes in the acoustic model combined with linguistic knowledge of which words can follow other words. The word sequence with the highest score is chosen as the recognition result. The SR process can be divided into four consecutive steps: pre-processing, feature extraction, decoding and post-processing.
Different SR systems implement each of these steps differently; the following is the basic pipeline selected for this project.
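To make the scoring concrete, here is a toy MATLAB sketch of combining acoustic and language-model scores in the log domain and picking the best hypothesis (the hypotheses and all numbers are invented for illustration, not taken from the project):

    % Toy probabilistic decoding: each hypothesis gets an acoustic
    % log-likelihood plus a language-model log-probability; the highest
    % total score wins. All values below are made up.
    hyps      = {'recognize speech', 'wreck a nice beach'};
    logAcoust = [-210.4, -208.9];   % log P(signal | words), assumed
    logLM     = [-4.2,   -11.7];    % log P(words), assumed

    total = logAcoust + logLM;      % combine scores in the log domain
    [~, best] = max(total);
    fprintf('Recognized: %s\n', hyps{best});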
A wake-up-word (WUW) recognition system follows these generic functions: the speech signal captured by the microphone is converted into an electrical signal that is digitized prior to being processed by the WUW recognition system. The system can also read a digitized raw waveform stored in a file. In either case, raw waveform samples are converted into feature vectors by the front-end at a rate of 100 feature vectors per second, defining the frame rate of the system. Those feature vectors are used by the Voice Activity Detector (VAD) to classify each frame (i.e., feature vector) as containing speech or no speech, defining the VAD state. The state of the VAD is useful to reduce the computational load of the recognition engine contained in the back-end. The back-end reports a recognition score for each token (e.g., word) matched against a WUW model.
Pre Processing
Pre-processing is the recording of speech with a sampling frequency of, for example, 16 kHz. According to the Shannon sampling theorem, a band-limited signal can be reconstructed if the sampling frequency is more than double the maximum frequency, meaning that frequencies up to almost 8 kHz are reconstructed correctly.
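For concreteness, a small sketch of the assumed signal parameters: 16 kHz sampling and 25 ms frames advanced by 10 ms, which yields the 100 feature vectors per second mentioned above (the window and hop values are common defaults assumed here, not stated in the report):

    % Assumed front-end parameters: 16 kHz sampling; 25 ms analysis
    % frames advanced by 10 ms, giving 100 feature vectors per second.
    fs       = 16000;               % sampling frequency in Hz
    nyquist  = fs / 2;              % 8 kHz: highest reconstructable frequency
    frameLen = round(0.025 * fs);   % 400 samples per frame
    hopLen   = round(0.010 * fs);   % 160 samples per hop
    frameRate = fs / hopLen;        % = 100 frames per second
    fprintf('Nyquist: %d Hz, frame rate: %d fps\n', nyquist, frameRate);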
Input to the system can be done via a microphone (live input) or through a pre-digitized sound file. In either case, the resulting input to the feature extraction unit, depicted as the Front-End, is digital sound. Feature extraction is a procedure that concentrates the information in the voice signal that is unique to every speaker. Feature extraction is accomplished using the standard algorithm for Mel-Frequency Cepstral Coefficients (MFCC). Features are used for recognition only when the VAD state is on. The result of feature extraction is a small number of coefficients that are passed on to the pattern-matching stage.
The decoding process is where calculations are made to find the sequence of words that is the most probable match to the feature vectors. For this step to work, three things have to be present: an acoustic model with a hidden Markov model (HMM) for each unit (phoneme or word), a dictionary containing possible words and their phoneme sequences, and a language model with likelihoods of words or word sequences. The purpose of the voice activity detector (VAD) is to reliably detect the presence or absence of speech. This tells the front-end, and thus correspondingly the back-end, when and when not to process speech. This is typically done by measuring the signal energy at any given moment. When the signal energy is very low, it suggests that no word is being spoken. If the signal energy spikes and stays at a high level for a considerable period of time, a word is most likely being spoken. Therefore, the VAD searches for extreme changes in the signal energy, and if the signal energy stays high for a certain amount of time, the Voice Activity Detector goes back and marks the point at which the energy changed dramatically.
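A minimal energy-based VAD sketch along these lines (the input file name, the 20% quietest-frame noise-floor estimate and the threshold factor of 10 are assumptions for illustration, not the project's actual values):

    % Minimal energy-based VAD sketch: mark a frame as speech when its
    % short-time energy exceeds a threshold derived from the quietest
    % frames. File name, frame sizes and threshold rule are assumptions.
    [x, fs] = audioread('utterance.wav');   % hypothetical input file
    x = x(:, 1);                            % use first channel
    frameLen = round(0.025 * fs);
    hopLen   = round(0.010 * fs);
    nFrames  = floor((length(x) - frameLen) / hopLen) + 1;

    energy = zeros(1, nFrames);
    for k = 1:nFrames
        seg = x((k-1)*hopLen + (1:frameLen));
        energy(k) = sum(seg .^ 2);          % short-time energy
    end

    % Estimate the noise floor from the quietest 20% of frames.
    sortedE = sort(energy);
    noiseFloor = mean(sortedE(1:max(1, round(0.2 * nFrames))));
    isSpeech = energy > 10 * noiseFloor;    % assumed threshold factor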
Post Processing
In the post-processing step, SR systems usually attempt to re-score the list of candidate hypotheses, e.g. by using a higher-order language model and/or pronunciation models. The simplest way to recognize a delineated word token is to compare it against a number of stored word templates and determine which model gives the "best match". This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations. This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the utterance (e.g., word); in other words, the optimal alignment between a template (model) and the speech sample may be nonlinear. The Dynamic Time Warping (DTW) algorithm makes a single pass through a matrix of frame scores while computing locally optimized segments of the global alignment path.
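A compact DTW sketch in MATLAB, assuming per-frame Euclidean distances between MFCC vectors (this is the standard formulation, written here for illustration rather than copied from the project's code):

    % Minimal DTW sketch: accumulate frame-to-frame distances over a
    % matrix and return the cost of the best nonlinear alignment between
    % a template and a test utterance (both are frames-by-coefficients).
    function cost = dtw_cost(T, S)
        n = size(T, 1);  m = size(S, 1);
        D = inf(n + 1, m + 1);        % accumulated-cost matrix
        D(1, 1) = 0;
        for i = 1:n
            for j = 1:m
                d = norm(T(i, :) - S(j, :));          % local frame distance
                D(i+1, j+1) = d + min([D(i, j), ...   % diagonal step
                                       D(i, j+1), ... % vertical step
                                       D(i+1, j)]);   % horizontal step
            end
        end
        cost = D(n+1, m+1) / (n + m); % length-normalized alignment cost
    end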
Feature extraction is the most important part of the whole system. The aim of feature extraction is to decrease the data size of the speech signal prior to pattern classification or recognition. The steps of Mel-Frequency Cepstral Coefficient (MFCC) calculation are: framing, windowing, Discrete Fourier Transform (DFT), Mel-frequency filtering, a logarithmic function, and Discrete Cosine Transform (DCT).
The Discrete Fourier Transform (DFT) is computed using the Fast Fourier Transform (FFT) algorithm. The FFT converts each frame of N samples from the time domain into the frequency domain, where the calculation is more precise than in the time domain.
Mel frequency filtering: the voice signal does not follow a linear scale, and the frequency range of the FFT is very wide. The Mel scale is a perceptual scale that helps to simulate the way the human ear works; it gives better resolution at low frequencies and less at high frequencies. Logarithmic function: the logarithm compresses the dynamic range of the Mel filter-bank energies, approximating the way humans perceive loudness.
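Putting the steps together, a condensed MFCC sketch in base MATLAB (the frame sizes, 512-point FFT, 26-filter Mel bank and 13 output coefficients are common defaults assumed for illustration, not values stated in the report):

    % Condensed MFCC sketch (base MATLAB, no toolboxes).
    % x: speech samples (column vector); fs: sampling frequency.
    function mfcc = simple_mfcc(x, fs)
        frameLen = round(0.025 * fs);  hopLen = round(0.010 * fs);
        nfft = 512;  nFilt = 26;  nCoef = 13;
        win = 0.54 - 0.46 * cos(2*pi*(0:frameLen-1)' / (frameLen-1)); % Hamming

        % Mel filter bank: triangular filters spaced evenly on the Mel scale.
        mel  = @(f) 2595 * log10(1 + f / 700);
        imel = @(m) 700 * (10.^(m / 2595) - 1);
        edges = imel(linspace(mel(0), mel(fs/2), nFilt + 2));
        bins  = floor((nfft + 1) * edges / fs) + 1;
        fb = zeros(nFilt, nfft/2 + 1);
        for j = 1:nFilt
            fb(j, bins(j):bins(j+1))   = linspace(0, 1, bins(j+1)-bins(j)+1);
            fb(j, bins(j+1):bins(j+2)) = linspace(1, 0, bins(j+2)-bins(j+1)+1);
        end

        nFrames = floor((length(x) - frameLen) / hopLen) + 1;
        mfcc = zeros(nFrames, nCoef);
        for k = 1:nFrames
            seg  = x((k-1)*hopLen + (1:frameLen)) .* win;  % framing + window
            spec = abs(fft(seg, nfft)).^2;                 % DFT (power spectrum)
            e = log(fb * spec(1:nfft/2 + 1) + eps);        % Mel filtering + log
            % DCT-II of the log filter-bank energies; keep first nCoef terms.
            c = cos(pi/nFilt * ((1:nFilt) - 0.5)' * (0:nCoef-1));
            mfcc(k, :) = e' * c;
        end
    end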
For recognition or classification of the speech signal, there are many approaches to recognizing the test audio file. The methodologies of speech recognition include ANN, GMM, DTW, HMM, fuzzy logic and various other methods. Among them, HMM techniques are more widely used in applications than any other. There are four types of HMM model used in speech processing.
For this project, MATLAB is selected to integrate all of the system's functional components of interest into a unified testing environment. MATLAB is chosen due to its ability to quickly
implement complex mathematical and algorithmic functions, as well as its unique ability to
visually display results through the use of image plots and other such graphs. Also, we were
able to develop a GUI in MATLAB to use as the command and control interface for all of our
test components. At the core of our testing environment is the backend pattern matching
algorithm. One of the goals of the presented testing environment is to research the effectiveness
of the back-end algorithm; more specifically, an implementation of Dynamic Time Warping
(DTW). The algorithm is used to perform speech recognition of a series of words against a
speech model.
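Tying the pieces together, a hypothetical recognition driver along these lines (the file names and the helper functions simple_mfcc and dtw_cost come from the sketches above, not from the project's actual code):

    % Hypothetical driver: compare a test word against stored templates
    % using the MFCC and DTW sketches above. File names are placeholders.
    templates = {'yes.wav', 'no.wav', 'stop.wav'};
    [x, fs] = audioread('test_word.wav');
    testFeat = simple_mfcc(x(:, 1), fs);

    costs = zeros(1, numel(templates));
    for i = 1:numel(templates)
        [t, tfs] = audioread(templates{i});
        costs(i) = dtw_cost(simple_mfcc(t(:, 1), tfs), testFeat);
    end
    [~, best] = min(costs);               % lowest alignment cost wins
    fprintf('Recognized word: %s\n', templates{best});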