
SPEECH RECOGNITION SYSTEM

SUBMITTED TO: Ms. Litna

SUBMITTED BY: Ms. Ashithanandan


SPEECH RECOGNITION
Speech recognition (also known as automatic speech recognition or
computer speech recognition) converts spoken words to machine-readable input. A speech
recognizer is a device that converts an acoustic signal into another form, such as writing, so
that it can be stored or used in some way. A speech recognition device accepts acoustic signals
as input and produces a sequence of words as output. A speech recognition system performs
three primary tasks:

 Preprocessing - Converts the spoken input into a form that the recognizer can process.
 Recognition - The recognizer incorporates an acoustic model, a lexical model, and a
language model, which analyze the data at the acoustic, articulatory-phonetic, and
linguistic levels and identify what is being said.
 Communication - Sends the recognized input to the software/hardware systems that need
it.
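As a rough illustration of how the three tasks just listed fit together, the following toy Python sketch wires them into a pipeline. Every name and number in it is hypothetical; each stage is only a stand-in for the real processing described in the sections that follow.

# A toy, runnable sketch of the three-task pipeline described above.
# Everything here is illustrative; real systems replace each stage
# with DSP, statistical models, and application plumbing.

def preprocess(samples):
    # Stand-in for filtering/sampling/coding: group samples into frames.
    frame_size = 4
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]

def recognize(frames, vocabulary):
    # Stand-in for acoustic/lexical/language analysis: pick the vocabulary
    # word whose stored "template" (a single average value) is nearest the input.
    avg = sum(sum(f) for f in frames) / max(1, sum(len(f) for f in frames))
    return min(vocabulary, key=lambda w: abs(vocabulary[w] - avg))

def communicate(word, consumers):
    # Stand-in for delivering the result to the rest of the application.
    for consumer in consumers:
        consumer(word)

vocabulary = {"yes": 0.8, "no": 0.2}           # hypothetical stored models
samples = [0.7, 0.9, 0.8, 0.75, 0.85, 0.8]     # hypothetical spoken input
communicate(recognize(preprocess(samples), vocabulary), [print])  # prints "yes"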

In order to understand what these three tasks entail, the Technology Focus begins with a
description of the data that speech recognition systems must handle. It describes how speech is
produced (called articulation), examines the stream of speech itself (called acoustics), and
then characterizes the ability of the human ear to handle spoken input (called auditory
perception).

In figure (1), communication is displayed as a bi-directional arrow. This represents the two-
way communication that exists in applications where the speech interface is closely bound to
the rest of the application. In those applications, software components that are external to the
speech recognition system may guide recognition by specifying the words and structures that
the recognition system can use at any point in the application. Other uses of speech involve
one-way communication from the recognition system to other components of the application.
Figure (1): Components of speech recognition system.

Preprocessing, recognition, and communication should be invisible to the users of a speech
recognition interface. End users see them only indirectly, as the accuracy and speed of the
system. Accuracy and speed are the criteria users rely on to evaluate a speech recognition
interface.

THE DATA OF SPEECH RECOGNITION

The information needed to perform speech recognition is contained in the stream of speech.
For humans, that flow of sounds and silences can be partitioned into discourses, sentences,
words, and sounds. Speech recognition systems focus on words and the sounds that
distinguish one word from another in a language. Those sounds are called phonemes. The
ability to differentiate words with distinct phonemes is as critical for speech recognition as it
is for human beings.

There are a number of ways speech can be described and analyzed. The most commonly used
approaches are:

 Articulation - Analysis of how speech sounds are produced by speakers
 Acoustics - Analysis of the speech signal as a stream of sounds
 Auditory Perception - Analysis of how speech is processed by the human listener

These three approaches offer insights into the nature of speech and provide tools to make
recognition more accurate and efficient.

PREPROCESSING SPEECH

Like all sounds, speech is an analog waveform. In order for a recognition system to utilize
speech data, all formants, noise patterns, silences, and co-articulation effects must be
captured and converted to a digital format. This conversion process is accomplished through
digital signal processing techniques. Some speech recognition products include hardware to
perform this conversion. Other systems rely on the signal processing capabilities of other
products, such as digital audio sound cards.

Capturing the Speech Signal

In order for speech recognition to function at an acceptable speed, the amount of data must be
reduced. Fortunately, some data in the speech signal are redundant, some are irrelevant to the
recognition process, and some need to be removed from the signal because they interfere with
accurate recognition. The challenge is to eliminate these detrimental components from the
signal without losing or distorting critical information contained in the data.
One method of reducing the quantity of data is to use filters to screen out frequencies above
3100 Hz and below 100 Hz. Such bandwidth narrowing is similar to using a zoom lens on a
video recorder. Another data reduction technique, sampling, reduces speech input to slices
(called samples) of the speech signal. Most speech recognition systems take 8,000 to 10,000
samples per second.
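A sketch of these two data-reduction steps (band-limiting and sampling) using NumPy and SciPy is shown below. The original sampling rate, the filter order, and the synthetic test signal are assumptions made purely for illustration.

# Band-limit the signal to roughly 100-3100 Hz, then resample it to
# 8000 samples per second. Rates and filter order are illustrative.

import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def reduce_speech_data(signal, original_rate=44100, target_rate=8000):
    # Band-pass filter: screen out frequencies below 100 Hz and above 3100 Hz.
    sos = butter(6, [100, 3100], btype="bandpass", fs=original_rate, output="sos")
    filtered = sosfiltfilt(sos, signal)
    # Sampling/decimation: reduce the input to target_rate samples per second.
    return resample_poly(filtered, target_rate, original_rate)

# Example: one second of synthetic "speech" (a 500 Hz tone plus noise).
t = np.linspace(0, 1, 44100, endpoint=False)
speech = np.sin(2 * np.pi * 500 * t) + 0.1 * np.random.randn(44100)
reduced = reduce_speech_data(speech)
print(len(speech), "->", len(reduced))   # 44100 -> 8000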

Digitizing the Waveform

The samples extracted from the analog signal must be converted into digital form. The
process of converting the analog waveform representation into a digital code is called analog-
to-digital conversion, or coding. To achieve high recognition accuracy at an acceptable speed,
the conversion process must:

 Include all critical data
 Remove redundancies
 Remove noise and distortion
 Avoid introducing new distortions

The preprocessor extracts acoustic patterns contained in each frame and captures the changes
that occur as the signal shifts from one frame to the next. This approach is called spectral
analysis because it focuses on individual elements of the frequency spectrum. The two most
commonly used spectral analysis approaches are:

 Bank-of-filters approach (uses the FFT)
 Linear Predictive Coding (LPC)
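The first of these approaches can be sketched in a few lines of Python: frame the signal, take an FFT of each frame, and collapse the spectrum into a small number of band energies. The frame length, hop size, and number of bands below are illustrative choices, not values from any particular product.

# A minimal bank-of-filters sketch: split the signal into short frames,
# take the FFT of each, and sum the magnitudes into a few frequency bands.

import numpy as np

def filter_bank_features(signal, rate=8000, frame_len=200, hop=80, n_bands=8):
    window = np.hamming(frame_len)
    band_edges = np.linspace(0, frame_len // 2, n_bands + 1, dtype=int)
    features = []
    for start in range(0, len(signal) - frame_len, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        # One energy value per band: a crude "filter bank" built on the FFT.
        features.append([spectrum[a:b].sum()
                         for a, b in zip(band_edges, band_edges[1:])])
    return np.array(features)

frames = filter_bank_features(np.random.randn(8000))   # one second of noise
print(frames.shape)   # (98, 8): ~98 frames, 8 band energies each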

RECOGNITION

Once the preprocessing of a user's input is complete, the recognizer is ready to perform its
primary function: to identify what the user has said. The competing recognition technologies
found in commercial speech recognition systems are:

 Template Matching
 Acoustic-Phonetic Recognition
 Stochastic Processing
 Neural Networks

These approaches differ in speed, accuracy, and storage requirements.

Template Matching

Template matching is a form of pattern recognition and was the dominant recognition
methodology in the 1950s and 1960s. It represents speech data as sets of feature/parameter
vectors called templates. Each word or phrase in an application is stored as a separate
template. Spoken input from end users is organized into templates before the recognition
process is performed. The input is then compared with the stored templates and, as figure (2)
indicates, the stored template most closely matching the incoming speech pattern is identified
as the input word or phrase. The selected template is called the best match for the input.
Template matching is performed at the word level and makes no reference to the phonemes
within a word. The matching process entails a frame-by-frame comparison of spectral patterns
and generates an overall similarity assessment (usually called the distance score) for each
template.

Figure (2): Recognition using template matching.

The comparison is not expected to produce an identical match. Individual utterances of the
same word, even by the same person, often differ in length. This variation can be due to a
number of factors, including differences in the rate at which the person is speaking, emphasis,
or emotion. Whatever the cause, there must be a way to minimize temporal differences
between patterns so that fast and slow utterances of the same word will not be identified as
different words. The process of minimizing temporal/word-length differences is called
temporal alignment. The approach most commonly used to perform temporal alignment in
template matching is a pattern-matching technique called dynamic time warping (DTW).
DTW establishes the optimum alignment of one set of vectors (a template) with another: it is
an algorithm for measuring similarity between two sequences which may vary in time or
speed, as the sketch below illustrates.
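The following NumPy sketch shows the core DTW recurrence. The feature vectors here are one-dimensional toys; a production system would add path constraints and length normalization.

# A compact dynamic time warping sketch. It aligns two sequences of
# feature vectors and returns an overall distance: the smaller the
# distance, the better the template matches the input.

import numpy as np

def dtw_distance(template, utterance):
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between one template frame and one input frame.
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            # Optimum alignment: extend the cheapest of the three paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two "utterances" of the same pattern, one spoken twice as slowly.
fast = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
slow = np.repeat(fast, 2, axis=0)
print(dtw_distance(fast, slow))   # 0.0: the warp absorbs the length difference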

Most template-matching systems have a predetermined threshold of acceptability. Its function
is to prevent noise and words not in the application vocabulary from being incorrectly
identified as acceptable speech input. If no template match meets the threshold of
acceptability, no recognition is recorded. Applications and systems differ in how such non-
recognition events are handled; many systems simply ask the user to repeat the word or
utterance. A sketch of this rejection logic follows below.
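This sketch shows how such a threshold might be applied. It assumes the dtw_distance function from the previous sketch is in scope; the threshold value and the stored templates are hypothetical.

# How a threshold of acceptability might be applied, reusing the
# dtw_distance function sketched above.

import numpy as np

REJECT_THRESHOLD = 1.5   # distances above this count as non-recognition

def best_match(utterance, templates):
    scored = {word: dtw_distance(t, utterance) for word, t in templates.items()}
    word, distance = min(scored.items(), key=lambda kv: kv[1])
    if distance > REJECT_THRESHOLD:
        return None          # no recognition recorded; ask the user to repeat
    return word

templates = {"yes": np.array([[0.0], [1.0], [0.0]]),
             "no":  np.array([[1.0], [0.0], [1.0]])}
print(best_match(np.array([[0.1], [0.9], [0.1]]), templates))   # "yes"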
Template matching performs very well with small vocabularies of phonetically distinct items
but has difficulty making the fine distinctions required for larger-vocabulary recognition and
for vocabularies containing similar-sounding words (called confusable words). Since it
operates at the word level, there must be at least one stored template for each word in the
application vocabulary. If, for example, there are five thousand words in an application, there
must be at least five thousand templates.

Although template matching is currently on the decline as a basic approach to recognition, it
has been adapted for use in word-spotting applications. It also remains the primary
technology applied to speaker verification.

Acoustic-Phonetic Recognition

Acoustic-phonetic recognition supplanted template matching in the early 1970s. Unlike
template matching, acoustic-phonetic recognition functions at the phoneme level.
Theoretically, it is an attractive approach to speech recognition because it limits the number
of representations that must be stored to the number of phonemes needed for the language.

Acoustic-phonetic recognition generally involves three steps:

 Feature extraction
 Segmentation and labeling
 Word-level recognition

During feature extraction, the system examines the input for spectral patterns, such as formant
frequencies, needed to distinguish phonemes from each other. The collection of extracted
features is interpreted using acoustic-phonetic rules. These rules identify phonemes
(labeling) and determine where one phoneme ends and the next begins (segmentation).

The high degree of acoustic similarity among phonemes, combined with phoneme variability
resulting from co-articulation effects and other sources, creates uncertainty with regard to
potential phoneme labels. As a result, the output of the segmentation and labeling stage is a set
of phoneme hypotheses. The hypotheses can be organized into a phoneme lattice (figure 3), a
decision tree, or a similar structure. Figure (3) displays more than one phoneme hypothesis for
a single point in the input.

Figure (3): Phoneme Lattice.


Once the segmentation and labeling process has been completed, the system searches
through the application vocabulary for words matching the phoneme hypotheses. The word
best matching a sequence of hypotheses is identified as the input item, as the sketch below
illustrates.
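A toy version of this lattice search in Python might look like the following. The phoneme symbols, scores, and three-word vocabulary are invented for illustration; real systems use far more sophisticated segmentation and scoring.

# A toy phoneme lattice: each input segment carries one or more phoneme
# hypotheses with scores, and the vocabulary is searched for the word
# whose pronunciation best fits a path through the lattice.

lattice = [                      # one list of (phoneme, score) per segment
    [("f", 0.9), ("th", 0.4)],
    [("ay", 0.8), ("eh", 0.5)],
    [("v", 0.7), ("f", 0.3)],
]

vocabulary = {"five": ["f", "ay", "v"], "fife": ["f", "ay", "f"],
              "thev": ["th", "eh", "v"]}

def word_score(pronunciation, lattice):
    # A word only qualifies if every phoneme appears as a hypothesis in the
    # corresponding segment; its score is the sum of those hypothesis scores.
    total = 0.0
    for phoneme, hypotheses in zip(pronunciation, lattice):
        scores = dict(hypotheses)
        if phoneme not in scores:
            return float("-inf")
        total += scores[phoneme]
    return total

best = max(vocabulary, key=lambda w: word_score(vocabulary[w], lattice))
print(best)   # "five": 0.9 + 0.8 + 0.7 beats "fife" and "thev"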

Stochastic Processing

The term stochastic refers to the process of making a sequence of non-deterministic
selections from among sets of alternatives. The selections are non-deterministic because the
choices made during the recognition process are governed by the characteristics of the input
rather than specified in advance. The use of stochastic models and processing permeates
speech recognition; stochastic processing dominates current approaches to word
construction/recognition and grammar modeling.

Like template matching, stochastic processing requires the creation and storage of models of
each of the terms that will be recognized. At that point the two approaches diverge.
Stochastic processing involves no direct matching between stored models and input. Instead,
it is based upon complex statistical and probabilistic analyses which are best understood by
examining the network-like structure in which those statistics are stored: the Hidden Markov
Model (HMM).

In 1913, A. A. Markov described a network model capable of generating Russian letter
sequences or predicting letter sequences using probabilities acquired through exposure to
Russian texts. In the 1960s and early 1970s, Markov modeling was applied to multi-layer,
hierarchical structures by Baum and other researchers. Since the probabilistic calculations of
the underlying layers are not observed as part of the higher-level sequences, these models
were called Hidden Markov Models (HMMs).

Researchers began investigating the use of HMMs for speech recognition in the early 1970s.
One of the earliest proponents of the technology was James Baker, who used it to develop
CMU's DRAGON system. Another early proponent of HMMs was Frederick Jelinek, whose
research group at IBM was instrumental in advancing HMM technology. In 1982, James and
Janet Baker founded Dragon Systems, which soon developed the DragonScribe system, one of
the first commercial products using HMM technology. HMM technology did not gain
widespread acceptance in commercial systems until the late 1980s, but by 1990 HMMs had
become the dominant approach to recognition.

An HMM, such as the one displayed in figure (4), consists of a sequence of states connected
by transitions. The states represent the alternatives of the stochastic process, and the
transitions contain probabilistic and other data used to determine which state should be
selected next. The states of the HMM in figure (4) are displayed as circles and its transitions
are represented by arrows. Transitions from the first state of the HMM go back to the first
state (called a recursive transition), to the next state, or to the third state of the HMM. If the
HMM in figure (4) is a stored model of the word "five", it is called a reference model for
"five" and contains statistics about all the spoken samples of the word used to create the
reference model. Each state of the HMM holds statistics for a segment of the word. Those
statistics describe the parameter values and parameter variation that were found in samples of
the word. A recognition system may have numerous HMMs like the one in figure (4) or may
consolidate them into a network of states and transitions.

Figure (4): Typical HMM Structure.
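One way to sketch an HMM like the one in figure (4) in Python is shown below: a transition-probability matrix encoding the recursive, next-state, and skip transitions, plus per-state statistics about the corresponding segment of the word. All of the numbers are invented for illustration.

# Three states with recursive (self), next-state, and skip transitions.

import numpy as np

n_states = 3
# transitions[i][j]: probability of moving from state i to state j.
transitions = np.array([
    [0.6, 0.3, 0.1],   # state 0: stay, go to state 1, or skip to state 2
    [0.0, 0.7, 0.3],   # state 1: stay or go to state 2
    [0.0, 0.0, 1.0],   # state 2: final state
])
# Each state also holds statistics about its segment of the word, e.g. a
# mean feature value and its variation across the training samples.
state_stats = [{"mean": 0.2, "std": 0.05},
               {"mean": 0.8, "std": 0.10},
               {"mean": 0.4, "std": 0.07}]
assert np.allclose(transitions.sum(axis=1), 1.0)   # each row is a distribution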

The recognition system proceeds through the input, comparing it with the stored models.
These comparisons produce a probability score indicating the likelihood that a particular
stored HMM reference model is the best match for the input. This approach is called the
Baum-Welch maximum-likelihood algorithm. Another common method used for stochastic
recognition is the Viterbi algorithm. The Viterbi algorithm searches a network of nodes for the
sequence of HMM states that corresponds most closely to the input; this sequence is called the
best path.
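A compact version of the Viterbi algorithm in NumPy is sketched below. It works in log probabilities to avoid numerical underflow; the two-state model and three-frame input are invented for illustration.

# Find the single most likely state sequence (the "best path") through
# an HMM for an observed input.

import numpy as np

def viterbi(log_trans, log_emit, log_start):
    n_obs, n_states = log_emit.shape
    best = np.full((n_obs, n_states), -np.inf)   # best score ending in each state
    back = np.zeros((n_obs, n_states), dtype=int)
    best[0] = log_start + log_emit[0]
    for t in range(1, n_obs):
        for s in range(n_states):
            scores = best[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            best[t, s] = scores[back[t, s]] + log_emit[t, s]
    # Trace the best path backwards from the most likely final state.
    path = [int(np.argmax(best[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_trans = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
log_emit  = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))  # 3 frames
log_start = np.log(np.array([0.6, 0.4]))
print(viterbi(log_trans, log_emit, log_start))   # [0, 0, 1]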

Stochastic processing using HMMs is accurate, flexible, and capable of being fully automated.
It can be applied to units smaller than phonemes or as large as sequences of words. Stochastic
processing is most often used to represent and recognize speech at the word level (sometimes
called whole-word recognition) and at a variant of the phoneme level called sub-words.

Neural Networks

Neural networks are computer programs, used in some speech recognizers, that can learn
important speech knowledge automatically and represent this knowledge in a parallel,
distributed fashion for rapid evaluation. Such systems mimic the function of the human brain,
which consists of several billion simple, inaccurate, and slow processors that nevertheless
perform reliable speech recognition (Alex Waibel & John Hampshire II, 1989). They are
sometimes called artificial neural networks to distinguish neural network programs from
biological neural structures.

Neural networks are excellent classification systems. They specialize in classifying noisy,
patterned, variable data streams containing multiple, overlapping, interacting, and incomplete
cues. Speech recognition is a classification task with all of these characteristics, making
neural networks an attractive alternative to the approaches described above.

Unlike most other technologies, neural networks do not require that a complete specification
of a problem be created before a network-based solution is developed. Instead, networks
learn patterns solely through exposure to large numbers of examples, making it possible to
construct neural networks for auditory models and other poorly understood areas. The fact
that networks accomplish all of this using parallel processing is of special interest, because
increases in complexity do not entail significant reductions in speed.

The concept of artificial neural networks has its roots in the structure and behavior of the
human brain. The brain is composed of a network of specialized cells called neurons that
operate in parallel to learn and process a wide range of complex information. Like the human
brain, neural networks are constructed from interconnected neurons (also called nodes or
processing elements) and learn new patterns by experiencing examples of those patterns.

Figure (5): Typical Neural Network Architecture.
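To make the learn-from-examples idea concrete, the following NumPy sketch trains a tiny two-layer network to classify toy two-dimensional "feature" points into two classes. Real speech networks are vastly larger, but the principle of adjusting weights through exposure to examples is the same; all data and sizes here are invented.

# A minimal two-layer neural network trained with plain gradient descent.

import numpy as np

rng = np.random.default_rng(0)
# Toy training data: two noisy clusters standing in for two speech classes.
X = np.vstack([rng.normal(0.2, 0.1, (50, 2)), rng.normal(0.8, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                # hidden layer
    p = sigmoid(h @ W2 + b2).ravel()        # predicted class probability
    grad_out = (p - y).reshape(-1, 1) / len(X)   # cross-entropy gradient
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)    # backpropagate through tanh
    W2 -= 0.5 * (h.T @ grad_out); b2 -= 0.5 * grad_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ grad_h);   b1 -= 0.5 * grad_h.sum(axis=0)

print("training accuracy:", ((p > 0.5) == y).mean())   # close to 1.0 here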

TYPES OF SPEECH RECOGNITION

Speech recognition systems can be separated into several different classes according to the
types of utterances they are able to recognize. These classes are based on the fact that one of
the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most
systems fit into more than one class, depending on which mode they are using.
Isolated words

 Isolated-word recognizers usually require each utterance to have quiet (a lack of audio
signal) on both sides of the sample window.
 This means that they accept a single utterance at a time.
 "Isolated utterances" might be a better name for this class.
 Here each vocabulary word must be spoken in isolation, with distinct pauses between
words.

Connected words

 Connected-word systems (or, more correctly, connected-utterance systems) are similar
to isolated-word systems but allow separate words to be "run together" with a minimal
pause between them.

Continuous speech

 Recognizers with continuous speech capabilities are some of the most difficult to
create, because they must utilize special methods to determine utterance boundaries.
 This allows users to speak almost naturally while the computer determines the
content; essentially, it is computer dictation.
 These systems do not require pauses between words, typically have medium-sized
vocabularies, and permit input of complete sentences.

Spontaneous speech

 At a basic level, spontaneous speech can be thought of as speech that is natural
sounding and not rehearsed.
 An ASR system with spontaneous-speech ability should be able to handle a variety of
natural speech features, such as words being run together, "ums" and "ahs", and even
slight stutters.

APPLICATIONS OF SPEECH RECOGNITION SYSTEMS

o Health care
o Telephony and other domains (Mobile telephony, mobile email)
o Disabled people
o Automatic translation
o Automotive speech recognition
o Court reporting (Real-time Voice Writing)
o Speech Biometric Recognition
o Hands-free computing: voice command recognition computer user interface
o Home automation
o Interactive voice response
o Multimodal interaction
o Pronunciation evaluation in computer-aided language learning applications
o Robotics
o Transcription (digital speech-to-text).
o Voice to Text: Visual Voicemail services for mobile phones.

o Dictation

 This is the most common use for ASR systems today.
 Uses include medical transcription, legal and business dictation, as well as
general word processing.

o Military

 High-performance fighter aircraft
 Helicopters
 Battle management
 Training air traffic controllers

o Command and control

 ASR systems that are designed to perform functions and actions on the
system are defined as command and control systems.
 E.g., utterances like "open netscape" and "start a new xterm".
o Embedded applications

 The main application area for speech recognition is voice input to
computers for such tasks as document creation (word processing),
database information retrieval, and financial transaction processing.
 Other uses include those mentioned above, together with automated
baggage handling, parcel sorting, quality control, and computer-aided
design and manufacture.

PROBLEMS IN AUTOMATIC SPEECH RECOGNITION

 Acoustic variations in individual speech production.
 There are no identifiable boundaries between sounds or even words.
 Variability due to dialect, which often includes leaving out certain sounds or replacing
one sound with another.
 Prosodic features such as intonation, rhythm, and stress also cause variability in
speech.
