Body
Being the fastest, most common, and most natural method of communication for human
beings, speech has been an active research topic for years. Speech recognition, a method of
interaction between human and machine, is one of the most promising discoveries in this field.
Since the 1950s, computer scientists have been trying to devise ways to record, interpret, and
understand human speech. However, the earliest such system, the "Audrey" system of Bell
Laboratories, could only recognize spoken numbers, not words. Later decades brought further
advancements. Carnegie Mellon's "Harpy" speech system in the 1970s could understand over
1,000 words, and in the 1980s the Hidden Markov Model helped recognize speech even in noisy
environments. Eventually, speech recognition technology achieved an estimated 80 percent
accuracy by the year 2001 [7].
In the 21st century, Google Voice Search became one of the prominent advancements,
putting speech recognition into the hands of millions of people alongside Apple's Siri and
Amazon's Alexa. According to Zwass [8], speech recognition is "the ability of devices to respond
to spoken commands." Using algorithms, it converts speech signals into a sequence of words. It
offers hands-free control of devices and equipment and is commonly used for querying databases,
dictation, and giving commands to computer-based systems. This type of technology has greatly
affected the way people live and has become one of the key features people enjoy in their
devices (e.g., Siri, Google Now, and Cortana) [3].
There are three significant approaches to speech recognition systems: the acoustic-phonetic
approach, the pattern recognition approach, and the artificial intelligence approach [4].
Phonetics studies speech sounds alongside their acoustic qualities and physiological
production, and can be subdivided into articulatory phonetics, acoustic phonetics, and
linguistic phonetics [5]. Acoustic phonetics deals with the wave-like properties of speech
sounds, analyzing sound wave signals through variations in amplitude, duration, and
frequency [6]. The phonetic units represented by these acoustic properties vary with time,
since they depend on both the neighboring sounds and the speaker; this is the coarticulation
effect. Despite this variance, the approach is assumed to be easy for a machine to learn and
adapt because of its straightforwardness. The acoustic-phonetic approach proceeds in three
main steps. First, spectral analysis of the speech signal is performed together with a feature
detection technique, converting the measurements into values that describe the relevant
acoustic properties. Second, segmentation and labeling: the acquired speech signal is
subdivided into stable acoustic regions, which are labeled to form a lattice characterization
of the sample. Lastly, attempts are made to determine a valid word from the acquired phonetic
label sequences. However, this type of approach has not been used commercially to a large
extent [4].
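The first two steps above can be sketched in toy form. Everything here (the synthetic two-tone signal, the Hann-windowed FFT features, the 0.5 distance threshold) is an illustrative assumption, not part of the cited systems:

```python
import numpy as np

def stft_features(signal, frame_len=256, hop=128):
    """Step 1: spectral analysis -- one magnitude spectrum per frame."""
    window = np.hanning(frame_len)
    frames = [np.abs(np.fft.rfft(signal[s:s + frame_len] * window))
              for s in range(0, len(signal) - frame_len + 1, hop)]
    return np.array(frames)

def segment_stable_regions(features, threshold=0.5):
    """Step 2: segmentation -- cut where adjacent spectral shapes differ sharply."""
    # Normalize each frame so the distance reflects spectral shape, not loudness.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    shapes = features / np.maximum(norms, 1e-12)
    dists = np.linalg.norm(np.diff(shapes, axis=0), axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(dists) if d > threshold] + [len(features)]
    return list(zip(cuts[:-1], cuts[1:]))

# Synthetic "speech": a 200 Hz tone followed by a 1500 Hz tone.
sr = 8000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.sin(2 * np.pi * 200 * t),
                         np.sin(2 * np.pi * 1500 * t)])
feats = stft_features(signal)
segments = segment_stable_regions(feats)
# Step 3 (not shown) would map each labelled region to phonetic units and look up words.
print(segments)
```

The boundary between segments falls near the frame where the tone changes; on real speech the regions would then be labelled with phonetic symbols to build the lattice described above.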
The pattern recognition approach, on the other hand, uses larger units, such as whole
patterns, to avoid the difficulties of working with phonemes, the tiny units of speech sound
used in the acoustic-phonetic approach. According to Bridle [2], the pattern recognition
approach offers three promising features: consistent speech recognition, a well-formulated
mathematical framework, and reliable pattern comparison. It follows four main steps, namely
feature extraction, pattern training, pattern matching, and decision logic, to extract,
classify, and separate patterns. In the feature extraction step, a reference pattern is
obtained from one or more test patterns of the input signal. This pattern can take the form of
a speech template or a speech model, such as an HMM, and can represent a phrase, word, or
sound. In the pattern training stage, each possible pattern is learned and adapted by the
machine. In pattern matching, the speech to be recognized is directly compared to the patterns
acquired during training. Lastly, in the decision logic stage, the degree of match between
patterns determines the identity of the sample. For the last six decades, this has been the
prime method for speech recognition [4].
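The four stages can be sketched as a minimal template matcher. The two-word "vocabulary", the band-averaged FFT features, and the Euclidean decision rule are all illustrative assumptions, far simpler than any production recognizer:

```python
import numpy as np

def extract_features(signal, n_coeffs=8):
    """Feature extraction: a fixed-length spectral summary of the utterance."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_coeffs)
    feats = np.array([band.mean() for band in bands])
    return feats / np.linalg.norm(feats)

def train_patterns(labelled_signals):
    """Pattern training: store one reference template per word."""
    return {word: extract_features(sig) for word, sig in labelled_signals}

def recognize(signal, templates):
    """Pattern matching + decision logic: pick the closest reference pattern."""
    query = extract_features(signal)
    scores = {w: np.linalg.norm(query - ref) for w, ref in templates.items()}
    return min(scores, key=scores.get)

# Toy "vocabulary": two words distinguished only by pitch.
sr = 8000
t = np.arange(sr) / sr
templates = train_patterns([("yes", np.sin(2 * np.pi * 300 * t)),
                            ("no", np.sin(2 * np.pi * 2000 * t))])
print(recognize(np.sin(2 * np.pi * 310 * t), templates))  # "yes"
```

A slightly detuned 310 Hz input still lands closest to the 300 Hz "yes" template, which is exactly the decision-logic step: the best match percentage determines the identity of the sample.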
Pattern recognition can be further classified into four categories: the template-based
approach, the stochastic approach, dynamic time warping, and vector quantization. In the
template-based approach, a dictionary of candidate words is built from a collection of
prototypical speech patterns, which are stored as reference patterns; an unknown sample is
then compared against the dictionary to find the best-matching pattern. This approach must
cope with uncertain and incomplete information, whose most common sources include contextual
effects, speaker variability, homophones, and confusable sounds. The stochastic approach, on
the other hand, uses probabilistic models designed to handle exactly this kind of information;
among researchers, the most popular stochastic model is the Hidden Markov Model (HMM), which
rests on a stronger mathematical foundation than the template-based approach. Dynamic time
warping compares two sequences that vary in time or speed, providing a time-aligned
representation of the acquired data, whether it is graphics, audio, or video. Lastly, vector
quantization uses reference models in the form of codebooks and replaces costly evaluation
methods with codebook searches; it offers great data reduction efficiency and is widely used
in speech coders [4].
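Dynamic time warping is compact enough to show in full. This is the standard textbook recurrence over a cumulative cost matrix; the scalar sequences below stand in for real feature vectors:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping: align two sequences that vary in speed."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three allowed moves.
            cost[i, j] = d + min(cost[i - 1, j],      # stretch a
                                 cost[i, j - 1],      # stretch b
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]

slow = [0, 0, 1, 1, 2, 2, 3, 3]   # the same contour, spoken slowly
fast = [0, 1, 2, 3]
other = [3, 2, 1, 0]
print(dtw_distance(slow, fast))    # 0.0 -- warping absorbs the speed difference
print(dtw_distance(slow, other))   # large -- a genuinely different contour
```

The warped distance between the slow and fast renditions is zero, which is why DTW was historically paired with the template-based approach: it lets one stored template match utterances spoken at different rates.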
The last main type of speech recognition approach is the artificial intelligence
approach. According to John McCarthy, artificial intelligence can be described as "human
intelligence exhibited by machines" [1]. It can be considered a hybrid of the acoustic-phonetic
and pattern recognition approaches, as it incorporates concepts from both. It can be further
divided into three categories: the knowledge-based approach, the connectionist approach, and
the support vector machine. The knowledge-based approach uses phonetic, spectrogram, and
linguistic information, which plays an important role in the design of recognition algorithms,
the definition of speech units, and the selection of suitable input. However, the need to
incorporate expert human knowledge such as pragmatics, semantics, and syntax remains a
disadvantage. The connectionist approach, or artificial neural network, applies intelligence
to visualizing, analyzing, and characterizing speech based on a set of features, in a
mechanism similar to that of biological neurons. This approach is commendable for its
uniformity and simplicity. Lastly, support vector machines use a discriminative model that
provides decision boundaries, and can therefore only classify fixed-length data. For the
classification of data points, they use both linear and nonlinear hyperplanes [4].
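The connectionist idea can be illustrated with the smallest possible "network": a single artificial neuron trained with the classic perceptron rule to separate two classes of fixed-length feature vectors. The two synthetic clusters below are an assumption standing in for real acoustic features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fixed-length "acoustic feature" vectors for two sound classes,
# drawn around different centres (synthetic data, not real speech).
class_a = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(50, 2))
class_b = rng.normal(loc=[-1.0, -1.0], scale=0.3, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([1] * 50 + [-1] * 50)

# One artificial neuron: weighted sum plus sign threshold. A misclassified
# point nudges the decision boundary toward the correct side.
w, b = np.zeros(2), 0.0
for _ in range(20):                      # training epochs
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:
            w += yi * xi
            b += yi

predictions = np.sign(X @ w + b)
accuracy = (predictions == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The learned weights `w` and bias `b` define a linear decision boundary, the same notion a support vector machine refines by choosing the hyperplane with maximum margin (and, via kernels, nonlinear boundaries) over fixed-length inputs.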