Speech Recognition Using Artificial Neural Network: - A Review

This document provides a review of using artificial neural networks for speech recognition. It discusses the speech recognition process, which involves speech preprocessing, feature extraction from the speech signal, and classification. Common feature extraction methods mentioned are MFCC and LPC. Classification algorithms discussed include HMM, DTW, and VQ. The document focuses on how various neural network architectures can be applied to speech recognition, comparing their advantages and disadvantages. It concludes that neural networks are well-suited for speech recognition tasks.

Uploaded by

Peter Wagih Zaki

Int'l Journal of Computing, Communications & Instrumentation Engg. (IJCCIE) Vol. 3, Issue 1 (2016) ISSN 2349-1469 EISSN 2349-1477

Speech Recognition Using Artificial Neural Network – A Review
Bhushan C. Kamble¹

Abstract--Speech is the most efficient mode of communication between people. Being the most natural way to communicate, it could also serve as a useful interface to machines, and the popularity of automatic speech recognition systems has therefore increased greatly. There are different approaches to speech recognition, such as the Hidden Markov Model (HMM), Dynamic Time Warping (DTW) and Vector Quantization (VQ). This paper provides a comprehensive study of the use of Artificial Neural Networks (ANN) in speech recognition. It focuses on the different neural-network-related methods that can be used for speech recognition, compares their advantages and disadvantages, and concludes with the most suitable method.

Keywords--Neural Networks, Training Algorithm, Speech Recognition, Artificial Intelligence, Feature Extraction, Pattern Recognition, LPC, MFCC, Perceptron, Feedforward Neural Networks.

I. INTRODUCTION
Speech is probably the most efficient and natural way for humans to communicate with each other. Humans learn all the relevant skills during early childhood, without any instruction, and they continue to rely on speech communication throughout their lives. They also want a similarly natural, easy and efficient mode of communication with machines, and therefore prefer speech as an interface over more cumbersome ones such as the mouse and keyboard. Speech, however, is a complex phenomenon, because the human vocal tract and articulators, being biological organs, are not under our conscious control. Speech is greatly affected by accent, articulation, pronunciation, roughness, emotional state, gender, pitch, speed, volume, background noise and echoes [1].
Speech Recognition, or Automatic Speech Recognition (ASR), plays an important role in human-computer interaction. It uses a process and the relevant technology to convert speech signals into a sequence of words by means of an algorithm implemented as a computer program. Theoretically, it should be possible to recognize speech directly from the digitized waveform [2]. At present, speech recognition systems are capable of understanding thousands of words in practical operating environments.
The speech signal provides two important types of information: (a) the content of the speech and (b) the identity of the speaker. Speaker recognition deals with extracting the identity of the speaker [3].
Speech recognition technology can be a useful tool for various applications. It is already used for live subtitling on television, as a dictation tool in the medical and legal professions, and for off-line speech-to-text conversion or note-taking systems [4]. It also has many other applications, such as telephone directory assistance, automatic voice translation into foreign languages, spoken database querying for new and inexperienced users, and handy applications in field work, robotics and voice-based commands [5].

II. SPEECH RECOGNITION PROCESS
Speech recognition is a complex and cumbersome process. Fig. 1 shows the steps involved.

2.1 Speech
Speech is the vocalized form of human interaction. In this step, the speech of the speaker is received as a waveform. Many software tools are available for recording human speech. The acoustic environment and the transduction equipment may have a great effect on the recorded speech: background noise or room reverberation may accompany the speech signal, which is completely undesirable.

2.2 Speech Pre-processing
Speech pre-processing is intended to solve such problems. It plays an important role in eliminating irrelevant sources of variation and ultimately improves the accuracy of speech recognition. Speech pre-processing generally involves noise filtering, smoothing, end point detection, framing, windowing, reverberation cancelling and echo removal [6].
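As an illustration of the framing and windowing steps of speech pre-processing, the following minimal Python sketch splits a digitized signal into overlapping frames and applies a Hamming window to each frame. The function names, the choice of the Hamming window and the default frame length and hop (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16 kHz sampling rate) are illustrative assumptions; they are not taken from the paper or from [6].

```python
import math


def frame_signal(samples, frame_len, hop_len):
    """Split a list of samples into overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def hamming_window(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def preprocess(samples, frame_len=400, hop_len=160):
    """Frame the signal and taper every frame with a Hamming window."""
    window = hamming_window(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(samples, frame_len, hop_len)]
```

Each windowed frame returned by the hypothetical `preprocess` function would then be passed on to feature extraction. Noise filtering and end point detection are not covered by this sketch.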

¹ Student, Dept. of Mechanical Engineering, JDIET, Yavatmal, India

http://dx.doi.org/10.15242/IJCCIE.U0116002
Fig. 1: Process of Speech Recognition (Speech → Speech Pre-processing → Feature Extraction → Speech Classification → Recognition)

2.3 Feature Extraction
Speech varies from person to person, because every speaker embeds different characteristics in an utterance. In theory it should be possible to recognize speech directly from the digitized waveform, but because of the large variation in the speech signal, some form of feature extraction is needed to reduce that variation. The following summarizes some of the feature extraction techniques in use today. These techniques are also useful in other areas of speech processing [7].
MFCC – Mel Frequency Cepstrum Coefficients (MFCC) is the most prominent method used for feature extraction in speech recognition. It operates in the frequency domain using the Mel scale, which is modelled on the frequency response of the human ear. MFCCs, being frequency-domain features, are more accurate than time-domain features [8]. They represent the real cepstrum of a windowed short-time signal derived from its Fast Fourier Transform (FFT). These coefficients are robust and reliable under variations of speaker and operating environment.
LPC – Linear Predictive Coding (LPC) is a tool most widely used in medium or low bit rate coders. The digital signal is compressed for efficient transmission and storage. Computing a parametric model based on least mean squared error theory is known as linear prediction (LP): the signal is expressed as a linear combination of previous samples. Formant frequencies are the frequencies at which resonance peaks occur [9].

2.4 Speech Classification
The most common techniques used for speech classification are discussed briefly below. These systems involve complex mathematical functions and extract hidden information from the processed input signal.
HMM – Hidden Markov Modelling (HMM) is the most successfully used pattern recognition technique for speech recognition. It is a mathematical model built on a Markov model and a set of output distributions. This technique is more general and has a more secure mathematical foundation than the knowledge-based and template-based approaches. In this method, speech is split into smaller audible entities, each of which represents a state in the Markov model. Transitions from one state to another occur according to the transition probabilities [10].
DTW – The Dynamic Time Warping (DTW) technique compares words with reference words. It is an algorithm for measuring the similarity between two sequences that can vary in time or speed [11]. In this technique, the time dimension of the unknown word is warped until it matches that of the reference word.
VQ – Vector Quantization (VQ) is a technique that maps vectors from a large vector space to a finite number of regions in that space. It is based on the block coding principle. Each region is called a cluster and can be represented by its centre, known as a code-word. The code book is the collection of all code-words [12].

III. ARTIFICIAL NEURAL NETWORK FROM THE VIEWPOINT OF SPEECH RECOGNITION

3.1 What is Artificial Neural Network?
Artificial Neural Networks (ANN) are crude electronic models based on the neural structure of the brain. The human brain basically learns from experience. Some problems that are beyond the scope of current computers can be solved easily by energy-efficient packages such as the brain. This type of brain modelling also provides a less technical path for the development of machine solutions.
ANNs are computing systems whose architecture is modelled after the brain. They typically consist of hundreds of simple processing units wired together in a complex communication network. Each simple processing unit represents a real neuron, which sends off a new signal, or fires, if it receives a sufficiently strong signal from the other connected units [13].

3.2 Artificial Neuron
Artificial neurons are the basic units of an Artificial Neural Network and simulate the four basic functions of a biological neuron. An artificial neuron is a mathematical function conceived as a model of a natural neuron. Fig. 2 shows the basic artificial neuron.

Fig. 2: Basic Artificial Neuron

In this figure, the inputs are denoted by the mathematical symbol i(n). Each of these inputs is multiplied by a connecting weight w(n).
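As a minimal sketch of the basic artificial neuron of Fig. 2, the Python function below multiplies each input i(n) by its weight w(n), sums the products and passes the sum through a transfer function. The sigmoid transfer function, the optional bias term and all function names are assumptions made for illustration; the paper does not prescribe them.

```python
import math


def sigmoid(x):
    """Logistic transfer function (assumed here; other choices are possible)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias=0.0, transfer=sigmoid):
    """Weighted sum of inputs i(n) and weights w(n), passed through a transfer function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Multiply each input by its connecting weight and sum the products.
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    # The transfer function turns the summed value into the neuron's output.
    return transfer(total)
```

For example, with inputs [1.0, 2.0] and weights [0.5, -0.25] the weighted sum is zero, so the sigmoid neuron outputs 0.5. Passing a different `transfer` function changes the neuron's behaviour without changing the summation.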

Generally, the products of the inputs and their weights w(n) are simply summed and fed to the transfer function to generate the output. Applications such as text recognition and speech recognition need to turn real-world inputs into discrete values. These applications do not always use networks composed of neurons that simply sum, and thereby smooth, their inputs. In software packages, these neurons are called processing elements and have many more capabilities than the basic artificial neuron described above.

IV. TYPES OF ARTIFICIAL NEURAL NETWORK
Researchers around the world have devised countless different structures of Artificial Neural Network. A short description of some of them is given below.

4.1 Feedforward Network
The feedforward network is the first and simplest form of ANN. In this network, information flows in only one direction, forward, from the input nodes via the hidden nodes to the output nodes. The network contains no loops or cycles: a neuron in layer a can only send data to a neuron in layer b if b > a. Learning is the adaptation of the free parameters of a neural network through a continuous process of stimulation by the environment in which it is embedded. Learning with a teacher is called (a) supervised training, and learning without a teacher is called (b) unsupervised training. The back-propagation algorithm gave rise to a class of layered feedforward networks called Multi-Layer Perceptrons (MLP). An MLP generally contains at least two layers of perceptrons: it has one input layer, one or more hidden layers and an output layer. The hidden layer plays a very important role and acts as a feature extractor. It uses a nonlinear function, such as a sigmoid or a radial-basis function, to generate complex functions of the input. To minimize classification error, the output layer acts as a logical net that chooses an index to send to the output on the basis of the input it receives from the hidden layer [14].

Fig. 3: A Fully Connected Feedforward Network With One Hidden Layer And One Output Layer.

4.2 Recurrent Neural Network
A Recurrent Neural Network (RNN) is a neural network that operates in time. An RNN accepts an input vector, updates its hidden state via a non-linear activation function and uses the hidden state to make a prediction of the output. In this network, the output of a neuron is multiplied by a weight and fed back to the inputs of the neuron with a delay. RNNs have achieved better speech recognition rates than MLPs, but the training algorithm is more complex and dynamically sensitive, which can cause problems [15].

Fig. 4: Structure of a Recurrent Neural Network.

4.3 Modular Neural Network
A Modular Neural Network (MNN) consists of several modules, each carrying out one sub-task of the neural network's global task, with all modules functionally integrated. The global task can be any neural network application, e.g., mapping, clustering, function approximation or associative memory [16].

Fig. 5: Modular Neural Network Architecture.

4.4 Kohonen Self Organizing Maps
Kohonen self-organizing maps are a type of neural network. They require no supervision and are hence called "self-organizing": they learn on their own through unsupervised competitive learning. They are called "maps" because they attempt to map their weights to conform to the given input data [17].

V. ADVANTAGES OF ARTIFICIAL NEURAL NETWORK
• ANNs have the ability to learn how to do a task based on the data given for training, learning and initial experience.
• ANNs can create their own organisation and require no supervision, as they can learn on their own through unsupervised competitive learning.
• Computations of ANNs can be carried out in parallel.
• ANNs can be used in pattern recognition, which is a powerful technique for harnessing data and generalizing about it.

• The system is developed through learning instead of programming.
• ANNs are flexible in changing environments.
• ANNs can build an informative model when conventional models fail. They can handle very complex interactions.
• An ANN is a nonlinear model that is easier to use and understand than statistical methods.

VI. LIMITATIONS OF ARTIFICIAL NEURAL NETWORK
• ANN is not an approach for solving everyday problems.
• No structured methodology is available for ANN.
• ANNs may give unpredictable output quality.
• The problem-solving methodology of many ANN systems is not described.
• Black box nature.
• Empirical nature of model development.

VII. CONCLUSION & FUTURE SCOPE
ANNs are one of the promising approaches for future computing. This paper shows that they can be very useful in speech signal classification. They operate more like the human brain than like conventional computer logic. Different types of ANN have been briefly discussed in this paper, and it can be concluded that RNNs have achieved better speech recognition rates than MLPs, but the training algorithm is more complex and dynamically sensitive, which can cause problems. Speech recognition has attracted many scientists and has had a technological influence on society. It is hoped that this paper conveys a basic understanding of ANN and inspires research groups working on Automatic Speech Recognition. The future of this technology is very promising, and the key lies in hardware development, as ANNs need faster hardware.

REFERENCES
[1] Wouter Gevaert, Georgi Tsenov, Valeri Mladenov, "Neural Network used for Speech Recognition", Journal of Automatic Control, University of Belgrade, Vol. 20, pp. 1-7, 2010. http://dx.doi.org/10.2298/JAC1001001G
[2] Vimal Krishnan VR, Athulya Jayakumar, Babu Anto P, "Speech Recognition of Isolated Malayalam Words Using Wavelet Feature and Artificial Neural Networks", 4th IEEE International Symposium on Electronic Design, Test and Application, 2008. http://dx.doi.org/10.1109/DELTA.2008.88
[3] Ganesh Tiwari, "Text Prompted Remote Speaker Authentication: Joint Speech & Speaker Recognition/Verification System".
[4] http://www.guidogybels.eu/asrp4.html
[5] Yashwanth H, Harish Mahendrakar and Suman Davia, "Automatic Speech recognition Using Audio Visual Cues", IEEE India Annual Conference, pp. 166-169, 2004.
[6] G. Saha, Sandipan Chakroborty, Suman Senapati, "A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications".
[7] Urmila Shrawankar, Dr. Vilas Thakare, "Techniques for Feature Extraction in Speech Recognition System: A Comparative Study".
[8] Lei Xie, Zhi-Qiang Liu, "A Comparative Study of Audio Feature for Audio Visual Conversion in MPEG-4 Compliant Facial Animation", Proc. of ICMLC, Dalian, 13-16 August 2006. http://dx.doi.org/10.1109/icmlc.2006.259085
[9] Florian Honig, George Stemmer, Christian Hacker, Fabio Brugnara, "Revising Perceptual Linear Prediction", in Interspeech 2005, pp. 2997-3000.
[10] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yennawar, "A Review on Speech Recognition Techniques", IJCA, Vol. 10, No. 3, pp. 16-24, November 2010. http://dx.doi.org/10.5120/1462-1976
[11] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yennawar, "A Review on Speech Recognition Techniques", IJCA, Vol. 10, No. 3, pp. 16-24, November 2010. http://dx.doi.org/10.5120/1462-1976
[12] Lindasalva Muda, "Voice Recognition Algorithm Using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Vol. 2, Issue 3, March 2010.
[13] Singh Satyanand, Dr. E. G. Rajan, "Vector Quantization Using MFCC and Inverted MFCC", International Journal of Computer Applications, Vol. 17, No. 1, pp. 1-7, March 2011.
[14] Sonali B. Maind, Priyanka Wankar, "Research Paper on Basic of Artificial Neural Network", International Journal on Recent & Innovation Trends in Computing & Communication, Vol. 1, Issue 1, pp. 96-100.
[15] Robinson, A. J., Cook, G. D., Ellis, D. P. W., Fosler-Lussier, E., Renals, S. J., Williams, D. A. G., "Connectionist Speech Recognition of Broadcast News", Speech Communication 37: 27-45, 2000.
[16] James Martens, Ilya Sutskever, "Learning Recurrent Neural Network with Hessian-Free Optimization", University of Toronto, Canada.
[17] Gasser Auda, Mohamed Kamel, "Modular Neural Network: A Survey", International Journal of Neural System, Vol. 9, No. 2, pp. 129-151, April 1999. http://dx.doi.org/10.1142/S0129065799000125
[18] Shyam M. Guthikonda, "Kohonen Self-Organizing Maps", Wittenberg University, December 2005.
