A speech recognition system performs three primary tasks:
Preprocessing - Converts the spoken input into a form that the recognizer can process.
Recognition - The recognizer incorporates an acoustic model, a lexical model, and a language
model, which analyze the data at the acoustic, articulatory-phonetic, and linguistic levels and
identify what is being said.
Communication - Sends the recognized input to the software/hardware systems that need
it.
In order to understand what these three tasks entail, the Technology Focus begins with a
description of the data that speech recognition systems must handle. It describes how speech
is produced (called articulation), examines the stream of speech itself (called acoustics), and
then characterizes the ability of the human ear to handle spoken input (called auditory
perception).
In figure (1), communication is displayed as a bi-directional arrow. This represents the two-
way communication that exists in applications where the speech interface is closely bound to
the rest of the application. In those applications, software components that are external to the
speech recognition system may guide recognition by specifying the words and structures that
the recognition system can use at any point in the application. Other uses of speech involve
one-way communication from the recognition system to the other components of the application.
Figure (1): Components of speech recognition system.
The information needed to perform speech recognition is contained in the stream of speech.
For humans, that flow of sounds and silences can be partitioned into discourses, sentences,
words, and sounds. Speech recognition systems focus on words and the sounds that
distinguish one word from another in a language. Those sounds are called phonemes. The
ability to differentiate words with distinct phonemes is as critical for speech recognition as it
is for human beings.
There are a number of ways speech can be described and analyzed. The most commonly used
approaches are articulation, acoustics, and auditory perception.
These three approaches offer insights into the nature of speech and provide tools to make
recognition more accurate and efficient.
PREPROCESSING SPEECH
Like all sounds, speech is an analog waveform. In order for a recognition system to utilize
speech data, all formants, noise patterns, silences, and co-articulation effects must be
captured and converted to a digital format. This conversion process is accomplished through
digital signal processing techniques. Some speech recognition products include hardware to
perform this conversion. Other systems rely on the signal processing capabilities of other
products, such as digital audio sound cards.
In order for speech recognition to function at an acceptable speed, the amount of data must be
reduced. Fortunately, some data in the speech signal are redundant, some are irrelevant to the
recognition process, and some need to be removed from the signal because they interfere with
accurate recognition. The challenge is to eliminate these detrimental components from the
signal without losing or distorting critical information contained in the data.
One method of reducing the quantity of data is to use filters to screen out frequencies above
3100 Hz and below 100 Hz. Such bandwidth narrowing is similar to using a zoom lens on a
video recorder. Another data reduction technique, sampling, reduces speech input to slices
(called samples) of the speech signal. Most speech recognition systems take 8,000 to 10,000
samples per second.
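As a rough illustration of this kind of band-limiting and sampling, the sketch below uses NumPy and SciPy (assumed to be available); the file name, filter order, and the choice of 8,000 samples per second are illustrative assumptions, not values prescribed by any particular product.

```python
# Minimal sketch: band-limit a speech signal to roughly 100-3100 Hz and
# reduce its sampling rate. File name and parameters are illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, resample

rate, signal = wavfile.read("utterance.wav")   # original sampling rate in Hz
signal = signal.astype(np.float64)

# 4th-order Butterworth band-pass filter: keep ~100 Hz to ~3100 Hz.
nyquist = rate / 2.0
b, a = butter(4, [100 / nyquist, 3100 / nyquist], btype="band")
filtered = filtfilt(b, a, signal)

# Reduce the data rate: keep 8,000 samples per second of speech.
target_rate = 8000
num_samples = int(len(filtered) * target_rate / rate)
downsampled = resample(filtered, num_samples)
```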
The samples extracted from the analog signal must be converted into a digital form. The
process of converting the analog waveform representation into a digital code is called analog
to digital conversion or coding. To achieve high recognition accuracy at an acceptable speed
the conversion process must
The preprocessor extracts acoustic patterns contained in each frame and captures the changes
that occur as the signal shifts from one frame to the next. This approach is called spectral
analysis because it focuses on individual elements of the frequency spectrum. The two most
commonly used spectral analysis approaches are the fast Fourier transform (FFT) and linear
predictive coding (LPC).
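The sketch below shows one way the framing and FFT-based spectral analysis described above might look in code. It assumes NumPy; the 25 ms frame length and 10 ms frame shift are common choices assumed here rather than figures taken from the text.

```python
# Minimal sketch of frame-based spectral analysis: split the digitized
# signal into short overlapping frames and take an FFT of each frame.
import numpy as np

def spectral_frames(samples, rate, frame_ms=25, shift_ms=10):
    frame_len = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(samples) - frame_len, shift):
        frame = samples[start:start + frame_len] * window
        # Magnitude spectrum of this frame (one column of the spectrogram).
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)
```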
RECOGNITION
Once the preprocessing of a user’s input is complete, the recognizer is ready to perform its
primary function: to identify what the user has said. The competing recognition technologies
found in commercial speech recognition systems are:
Template Matching
Acoustic – Phonetic Recognition
Stochastic Processing
Neural networks
Template Matching
Template matching compares the digitized input with stored templates of each word in the
application vocabulary. The comparison is not expected to produce an identical match. Individual utterances of the
same word, even by the same person, often differ in length. This variation can be due to a
number of factors, including difference in the rate at which the person is speaking, emphasis,
or emotion. Whatever the cause, there must be a way to minimize temporal differences
between patterns so that fast and slow utterances of the same word will not be identified as
different words. The process of minimizing temporal/word length differences is called
temporal alignment. The approach most commonly used to perform temporal alignment in
template matching is a pattern – matching technique called dynamic time warping (DTW).
DTW establishes the optimum alignment of one set of vectors (template) with another.
Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed.
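Below is a minimal sketch of DTW, assuming the template and the input have already been converted to sequences of feature vectors (NumPy arrays). It illustrates the basic dynamic-programming recurrence, not the exact formulation used by any commercial system.

```python
# Minimal sketch of dynamic time warping between two feature-vector
# sequences (e.g., an input utterance and a stored template).
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Return the cost of the optimum alignment of seq_a with seq_b."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Each cell extends the cheapest of the three allowed moves,
            # which stretches or compresses time to align the sequences.
            cost[i, j] = dist + min(cost[i - 1, j],      # stretch seq_a
                                    cost[i, j - 1],      # stretch seq_b
                                    cost[i - 1, j - 1])  # advance both
    return cost[n, m]
```

In a template-matching recognizer, a distance like this would be computed against every stored template, and the best-scoring template would be accepted only if its score clears the threshold of acceptability discussed below.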
Most template matching systems have a predetermined threshold of acceptability. Its function
is to prevent noise and words not in the application vocabulary from being incorrectly
identified as acceptable speech input. If no template match exceeds the threshold of
acceptability no recognition is recorded. Applications and systems differ on how such non-
recognition events are handled. Many systems ask the user to repeat the word or utterance.
Template matching performs very well with small vocabularies of phonetically distinct items
but has difficulty making the fine distinctions required for larger vocabulary recognition and
recognition of vocabularies containing similar-sounding words (called confusable words).
Since it operates at the word level there must be at least one stored template for each word in
the application vocabulary. If, for example, there are five thousand words in an application,
there would need to be at least five thousand templates.
Acoustic-Phonetic Recognition
Acoustic-phonetic recognition supplanted template matching in the early 1970s. Unlike
template matching, acoustic-phonetic recognition functions at the phoneme level.
Theoretically, it is an attractive approach to speech recognition because it limits the number
of representations that must be stored to the number of phonemes needed for a language.
Acoustic-phonetic recognition involves three stages:
Feature extraction
Segmentation and labelling
Word-level recognition
During feature extraction the system examines the input for spectral patterns, such as formant
frequencies, needed to distinguish phonemes from each other. The collection of extracted
features is interpreted using acoustic-phonetic rules. These rules identify phonemes
(labelling) and determine where one phoneme ends and the next begins (segmentation).
The high degree of acoustic similarity among phonemes, combined with phoneme variability
resulting from co-articulation effects and other sources, creates uncertainty with regard to
potential phone labels. As a result, the output of the segmentation and labelling stage is a set
of phoneme hypotheses. The hypotheses can be organized into a phoneme lattice (figure 3),
decision tree, or similar structure. Figure (3) displays more than one phoneme hypothesis for
a single point in the input.
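To make the idea of competing hypotheses concrete, here is a toy phoneme lattice for an utterance of "five", written as plain Python data; the segment boundaries, phoneme labels, and scores are invented for illustration.

```python
# Minimal sketch of a phoneme lattice: for each time segment the labelling
# stage may propose several competing phoneme hypotheses with scores.
# Boundaries and scores below are made up for illustration.
phoneme_lattice = [
    # (start_ms, end_ms, [(phoneme, score), ...])
    (0,   80,  [("f", 0.7), ("th", 0.3)]),
    (80,  220, [("ay", 0.6), ("ah", 0.4)]),
    (220, 300, [("v", 0.8), ("b", 0.2)]),
]

# Word-level recognition then searches the lattice for the phoneme
# sequence that best matches a word in the lexicon, e.g. /f ay v/ "five".
```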
Stochastic Processing
Like template matching, stochastic processing requires the creation and storage of models of
each of the terms that will be recognized. At that point the two approaches diverge.
Stochastic processing involves no direct matching between stored models and input. Instead,
it is based upon complex statistical and probabilistic analyses which are best understood by
examining the network-like structure in which those statistics are stored: Hidden Markov
Model (HMM).
Researchers began investigating the use of HMMs for speech recognition in the early 1970s.
One of the earliest proponents of the technology was James Baker, who used it to develop
CMU’s DRAGON system. Another early proponent of HMMs was Frederick Jelinek, whose
research group at IBM was instrumental in advancing HMM technology. In 1982, James and
Janet Baker founded Dragon Systems, and soon developed the DragonScribe system, one of
the first commercial products using HMM technology. HMM technology did not gain
widespread acceptance for commercial systems until the late 1980s, but by 1990 HMMs had
become the dominant approach to recognition.
An HMM, such as the one displayed in figure (4), consists of a sequence of states connected
by transitions. The states represent the alternatives of the stochastic process and the
transitions contain probabilistic and other data used to determine which state should be
selected next. The states of the HMM in figure (4) are displayed as circles and its transitions
are represented by arrows. Transitions from the first state of the HMM go to the first state
(called a recursive transition), to the next state, or to the third state of the HMM. If the HMM
in figure (4) is a stored model of the word “five”, it would be called a reference model for
“five” and would contain statistics about all the spoken samples of the word used to create the
reference model. Each state of the HMM holds statistics for a segment of the word. Those
statistics describe the parameter values and parameter variation that were found in samples
of the word. A recognition system may have numerous HMMs like the one in figure (4) or
may consolidate them into a network of states and transitions.
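As a concrete illustration, the toy model below sketches how a reference model for "five" like the one described above might be represented in Python. The states, transition probabilities, and emission statistics are invented for illustration; real systems store distributions over acoustic feature vectors rather than discrete symbols.

```python
# Minimal sketch of a left-to-right HMM with three states.
states = ["s1", "s2", "s3"]

# Transition probabilities: each state may loop back on itself (a recursive
# transition), move to the next state, or skip ahead one state.
transitions = {
    "s1": {"s1": 0.5, "s2": 0.4, "s3": 0.1},
    "s2": {"s2": 0.6, "s3": 0.4},
    "s3": {"s3": 1.0},
}

# Each state holds statistics (here, a probability for each observed
# acoustic symbol) describing the segment of the word it models.
emissions = {
    "s1": {"f": 0.8, "v": 0.2},
    "s2": {"ay": 0.7, "ah": 0.3},
    "s3": {"v": 0.9, "f": 0.1},
}
```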
The recognition system proceeds through the input comparing it with stored models. These
comparisons produce a probability score indicating the likelihood that a particular stored
HMM reference model is the best for the input. This approach is called the Baum-Welch
Maximum-likelihood algorithm. Another common method used for stochastic recognition is
the Viterbi algorithm. The Viterbi algorithm looks through a network of nodes for a sequence
of HMM states that correspond most closely to the input. This is called the best path.
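The sketch below shows a bare-bones Viterbi search over the toy model from the previous sketch (it reuses the transitions and emissions dictionaries defined there). It illustrates the idea of tracking the best path of states for the input, not the optimized decoders used in real recognizers.

```python
# Minimal sketch of Viterbi decoding: find the single best path of HMM
# states for an observation sequence, using the toy HMM defined above.
def viterbi(observations, transitions, emissions, start_state="s1"):
    # best[s] = (probability of best path ending in state s, that path)
    best = {start_state: (emissions[start_state].get(observations[0], 1e-9),
                          [start_state])}
    for obs in observations[1:]:
        new_best = {}
        for prev, (prob, path) in best.items():
            for nxt, trans_p in transitions[prev].items():
                p = prob * trans_p * emissions[nxt].get(obs, 1e-9)
                # Keep only the most probable path into each state.
                if nxt not in new_best or p > new_best[nxt][0]:
                    new_best[nxt] = (p, path + [nxt])
        best = new_best
    return max(best.values())  # (probability, best path) for the input

print(viterbi(["f", "ay", "v"], transitions, emissions))
```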
Stochastic processing using HMM’s accurate, flexible, and capable of being fully automated.
It can be applied to units smaller than phonemes or as large as sequences of words. Stochastic
processing is most often used to represent and recognize speech at the word level (sometimes
called whole word recognition) and for a variant of the phoneme level called sub-words.
Neural Networks
Neural networks are computer programs used in speech recognizers that can learn
important speech knowledge automatically and represent this knowledge in a parallel,
distributed fashion for rapid evaluation. Such a system would mimic the function of the human
brain, which consists of several billion simple, inaccurate, and slow processors that perform
reliable speech recognition (Alex Waibel & John Hampshire II, 1989). They are sometimes
called artificial neural networks to distinguish neural network programs from biological
neural structures.
Neural networks are excellent classification systems. They specialize in classifying noisy,
patterned, variable data streams containing multiple, overlapping, interacting, and incomplete
cues. Speech recognition is a classification task that has all of these characteristics, making
neural networks an attractive alternative to the approaches described above.
Unlike most other technologies, neural networks do not require that a complete specification
of a problem be created prior to developing a network – based solution. Instead, networks
learn patterns solely through exposure to large numbers of examples, making it possible to
construct neural networks for auditory models and other poorly understood areas. The fact
that networks accomplish all of these feats using parallel processing is of special interest
because increases in complexity do not entail significant reductions in speed.
The concept of artificial neural networks has its roots in the structure and behavior of the
human brain. The brain is composed of a network of specialized cells called neurons that
operate in parallel to learn and process a wide range of complex information. Like the human
brain, neural networks are constructed from interconnected neurons (also called nodes or
processing elements) and learn new patterns by experiencing examples of those patterns.
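As a loose illustration of learning patterns purely from examples, the sketch below trains a tiny two-layer network on synthetic feature vectors using NumPy. The data, network size, and learning rate are arbitrary assumptions, and this is not a speech recognizer.

```python
# Minimal sketch: a tiny feed-forward network trained on toy feature
# vectors by gradient descent; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                # 100 toy "acoustic" feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy two-class labels

W1 = rng.normal(scale=0.1, size=(8, 16))     # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(16, 1))     # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    hidden = np.tanh(X @ W1)                 # forward pass
    out = sigmoid(hidden @ W2)[:, 0]
    error = out - y
    # Backpropagate the error to both weight matrices.
    grad_out = (error * out * (1 - out))[:, None]
    grad_hidden = (grad_out @ W2.T) * (1 - hidden ** 2)
    W2 -= 0.1 * hidden.T @ grad_out / len(X)
    W1 -= 0.1 * X.T @ grad_hidden / len(X)
```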
Speech recognition systems can be separated into several different classes by describing what
types of utterances they have the ability to recognize. These classes are based on the fact that
one of the difficulties of ASR is determining when a speaker starts and finishes an
utterance. Most systems can fit into more than one class, depending on which mode they are using.
Isolated words
Isolated word recognizers usually require each utterance to have silence (a lack of audio
signal) on both sides of the sample window.
This means that the system accepts a single utterance at a time.
Isolated utterances might be a better name for this class.
Here each vocabulary word must be spoken in isolation, with distinct pauses between
words.
Connected words
Connected word systems (or, more correctly, connected utterance systems) are similar to
isolated word systems but allow separate words to be ‘run together’ with a minimal pause between
them.
Continuous speech
Recognizers with continuous speech capabilities are some of the most difficult to
create because they must utilize special methods to determine utterance boundaries.
This allows users to speak almost naturally, while the computer determines the
content. Basically, it's computer dictation.
These systems do not require pauses between words, typically have medium-sized
vocabularies, and permit input of complete sentences.
Spontaneous speech
At a basic level, it can be thought of as speech that is natural sounding and not
rehearsed.
An ASR system with spontaneous speech ability should be able to handle a variety of
natural speech features such as words being run together, “ums” and “ahs”, and even
slight stutters.
Speech recognition is used in a wide range of applications, including:
o Health care
o Telephony and other domains (Mobile telephony, mobile email)
o Disabled people
o Automatic translation
o Automotive speech recognition
o Court reporting (Real-time Voice Writing)
o Speech Biometric Recognition
o Hands-free computing: voice command recognition computer user interface
o Home automation
o Interactive voice response
o Multimodal interaction
o Pronunciation evaluation in computer-aided language learning applications
o Robotics
o Transcription (digital speech-to-text).
o Voice to Text: Visual Voicemail services for mobile phones.
o Dictation
o Military
ASR systems that are designed to perform functions and actions on the
system are defined as command and control systems,
e.g. utterances like “open netscape” and “start a new xterm”.
o Embedded applications