Thai Voice Recognition for Controlling Electrical Appliances Using Long Short-Term Memory
Anantaporn Hanskunatai
Department of Computer Science, Faculty of Science
King Mongkut’s Institute of Technology Ladkrabang
Bangkok 10520, Thailand
e-mail: [email protected]
Abstract—Human speech possesses characteristics in each word that can be recognized and learned by computers. This research proposes the use of deep learning models to predict speech commands that turn various electrical appliances on and off, using a sound-conversion process that extracts sound-wave values from the recordings and applies them to the training process in different ways. Because an utterance can have more than one syllable, and similar-sounding words share characteristics, prediction can be difficult. This research compares a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM), which is part of the Recurrent Neural Network (RNN) family, on a Thai-language speech dataset of turn-on and turn-off commands for 7 types of electrical appliances, 14 classes in total, preprocessed by reducing noise and trimming silence from the front and back of the audio files. The experimental results signify that the proposed Long Short-Term Memory achieves the best accuracy.

Keywords—deep learning; voice recognition; long short-term memory

I. INTRODUCTION

Speech recognition with machine learning is widely used. The method is to make the computer understand the form or nature of a sound so that it can decide the meaning of the sound. Human speech can be detected and converted into sound-wave signals. However, many general components must be implemented, such as determining the sample rate before training; research [1] shows this is important for determining the frequency resolution, which takes the form of a wavelet. The Mel Frequency Cepstral Coefficients (MFCC) [2] can also be used to identify the characteristics of a sound. The received data is divided into segments, and the power and frequency of the sound in each segment are extracted into a vector of numbers, in effect converting the audio signal into an image of its features. Afterward, the data is arranged in time order, which can be rendered as an audio waveform or stored as matrix data, so that it can be sent into the learning process of each deep learning model. Before training, noise must be eliminated or muted [3], since noise is an important variable that can cause significant problems in voice recognition [4]. Because machine learning can learn to remember images well, the Convolutional Neural Network (CNN) [5] has been used in speech recognition: an experiment in [3] achieved 89.7% accuracy predicting words of 1-2 syllables with different meanings by learning from images of sound waves, and in [6], with various environmental sounds, the accuracy was 77%, which shows the efficiency of Convolutional Neural Networks in speech recognition. In addition to the CNN, the Recurrent Neural Network (RNN) [7] has been used for speech recognition; it uses the previous state outputs for learning, training in a loop over states. However, the Recurrent Neural Network still suffers from the vanishing gradient problem [8]. Long Short-Term Memory (LSTM) [9], a member of the Recurrent Neural Network family, was adopted to solve the vanishing gradient problem. It learns iteratively from matrix data and has a memory that remembers the value of each state output, together with gates that decide whether to remember a value, clear it, or pass it on. Both learning methods require dropout to solve the overfitting problem [10]. In this article, the Convolutional Neural Network uses standard dropout, while for the LSTM model, following [11], Bayesian modeling is used to calculate the dropout, which gives better results.

This research proposes using a Convolutional Neural Network and Long Short-Term Memory to test and compare the accuracy of predicting speech commands that turn various electrical appliances on and off.

II. METHODOLOGY

A. Data Preparation and Preprocessing
In this research, speech data collected from 40 people, male and female (Thai language), is used: commands to turn electrical appliances on and off, such as light bulbs, air conditioners, computers, TVs, doors, fans, and curtains, 2,105 voice recordings in total. The recordings are of unequal length, and environmental noise is present. The voice recordings have a sample rate of 44.1 kHz and go through a process of silencing the front and back of each audio file and cancelling noise, using the FFmpeg and SoX libraries to produce audio files with less noise [3]; the speech lengths remain unequal, as the spoken words have different numbers of syllables. For the characteristics of each utterance, 20 MFCC features are used to make a spectrogram image with a size of 250x185; the color spectrogram images are used for training the CNN, and the spectrogram frequency data for training the LSTM. 80% of the data is used as the training set and the remaining 20% as the test set. The dataset is summarized in Table I.
TABLE I. THE DATASET (TOTAL: 2,105 RECORDINGS)
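As a concrete illustration of this preprocessing, the following is a minimal Python sketch. It assumes the sox and ffmpeg command-line tools are installed and the librosa library is available; the silence threshold (1%), the minimum silence duration (0.1 s), the afftdn denoise filter, and the file names are illustrative assumptions, not the paper's exact settings.

import subprocess

import librosa
import numpy as np

def trim_silence(src: str, dst: str) -> None:
    # Trim silence from the front, then reverse, trim again, and
    # reverse back to trim the tail (a standard SoX recipe).
    subprocess.run(
        ["sox", src, dst,
         "silence", "1", "0.1", "1%",
         "reverse",
         "silence", "1", "0.1", "1%",
         "reverse"],
        check=True,
    )

def denoise(src: str, dst: str) -> None:
    # Broadband noise reduction with FFmpeg's afftdn filter, used here
    # as an assumed stand-in for the paper's noise-cancellation step.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-af", "afftdn", dst],
                   check=True)

def extract_mfcc(path: str, n_mfcc: int = 20) -> np.ndarray:
    # Load the 44.1 kHz recording and compute the 20 MFCC features
    # described above; returns an (n_mfcc, frames) matrix.
    signal, sr = librosa.load(path, sr=44100)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

trim_silence("raw.wav", "trimmed.wav")   # hypothetical file names
denoise("trimmed.wav", "clean.wav")
features = extract_mfcc("clean.wav")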
B. Recurrent Neural Network and Long Short-Term Memory

This research uses an RNN, which works in a loop on the hidden layer of the neural network, using the prior state data in the calculation for the current state before sending data to the next state. The model is able to understand the current data from learning prior data, as shown in Figure 1.

Figure 1 shows the process of the RNN: x_t is the input data at that time, h_t is the value from the hidden state at that time, and y_t is the output. When the loop is unrolled, it can be seen that multiple hidden states work jointly in sequence. h_t is calculated as shown in (1):

h_t = f_h(U_h h_{t-1} + W_h x_t + b_h)    (1)

where f_h is the activation function of the hidden layer, U_h is the weight matrix of the previous hidden state, and W_h is the weight matrix of the current hidden state. y_t is calculated as shown in (2):

y_t = f_y(W_y h_t + b_y)    (2)
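To make (1) and (2) concrete, here is a minimal NumPy sketch of a single recurrent step; the choice of tanh for f_h and softmax for f_y, and all array shapes, are assumptions made for illustration.

import numpy as np

def rnn_step(x_t, h_prev, U_h, W_h, b_h, W_y, b_y):
    # Eq. (1): new hidden state from the previous state and current input.
    h_t = np.tanh(U_h @ h_prev + W_h @ x_t + b_h)
    # Eq. (2): output from the current hidden state (softmax assumed for f_y).
    logits = W_y @ h_t + b_y
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()
    return h_t, y_t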
Figure 2 shows the steps used. It starts by receiving speech, which is converted into a sound wave and stored in matrix format. After that, the data is fed in row by row for training, and there is a memory that records the state from previous sound waves. The Forget Gate (f_t) decides whether to remember previous information or not, using the sigmoid function as the decider, as shown in (3):

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (3)
The Input Gate (i_t) then decides which new information to store, again using the sigmoid function, as shown in (4):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (4)

The Input Modulation Gate uses the tanh function instead, producing the candidate values for the update, as shown in (5):

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (5)
In the next step, after getting the forget gate and input gate values, the cell state is updated as in (6):

c_t = f_t · c_{t-1} + i_t · C̃_t    (6)
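Putting (3) through (6) together, the following is a minimal NumPy sketch of one LSTM cell update. The output gate and hidden-state lines at the end do not appear in the excerpt above; they follow the standard formulation of [9] and are marked as such in the comments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev,
              W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])    # the stacked [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # eq. (3): forget gate
    i_t = sigmoid(W_i @ z + b_i)         # eq. (4): input gate
    c_hat = np.tanh(W_c @ z + b_c)       # eq. (5): input modulation gate
    c_t = f_t * c_prev + i_t * c_hat     # eq. (6): cell state update
    o_t = sigmoid(W_o @ z + b_o)         # standard output gate [9]
    h_t = o_t * np.tanh(c_t)             # standard hidden state [9]
    return h_t, c_t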
However, in this research, an experiment was made to adjust the cell size of the LSTM and the number of stacked LSTM layers, with the structure as in Figure 4.
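The sketch below shows how such a tunable stacked-LSTM classifier might look, assuming TensorFlow/Keras (the paper does not name its framework), 20 MFCC features per frame, and the 14 classes of Table I; the cell size (units), the number of stacked layers (n_stacks), and the dropout rate are the quantities being tuned, and the default values here are illustrative only.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_stacked_lstm(timesteps, n_features=20, n_classes=14,
                       units=128, n_stacks=2, dropout_rate=0.2):
    model = models.Sequential()
    for i in range(n_stacks):
        # Only the first layer needs the input shape; every layer but
        # the last returns the full sequence so the next LSTM can stack.
        kwargs = {"input_shape": (timesteps, n_features)} if i == 0 else {}
        model.add(layers.LSTM(units,
                              return_sequences=(i < n_stacks - 1),
                              dropout=dropout_rate,
                              recurrent_dropout=dropout_rate,
                              **kwargs))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Keras reuses the same recurrent-dropout mask at every timestep, which is in the spirit of the variational dropout of [11] used for the LSTM here.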
Figure 5. The confusion matrix of the Convolutional Neural Network, with accuracy 91.14%.

IV. CONCLUSION
In this paper, the results of the LSTM are quite high, partly because the noise in the sound dataset has been reduced. The experiments also show that an appropriate cell size increases the prediction results, but when the size is increased too much, the prediction efficiency decreases. A model that mixes CNN and RNN (CRNN) is an additional experimental approach. Overall, LSTM can be used to classify voice commands that turn electrical appliances on and off.
REFERENCES
[1] Bhushan C. Kamble, "Speech Recognition Using Artificial Neural Network – A Review", Int'l Journal of Computing, Communications & Instrumentation Engg. (IJCCIE), Vol. 1, Issue 1, 2016.
[2] Atik Charisma, M. Reza Hidayat, Yuda Bakti Zainal, "Speaker Recognition Using Mel-Frequency Cepstrum Coefficients and Sum Square Error", The 3rd International Conference on Wireless and Telematics, 2017.
[3] Pete Warden, "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition", arXiv:1804.03209v1 [cs.CL], April 2018.
[4] Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam, "Bengali Speech Recognition: A Double Layered LSTM-RNN Approach", 2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December 2017.
[5] Saad Albawi, Tareq Abed Mohammed, Saad Al-Zawi, "Understanding of a Convolutional Neural Network", ICET 2017, Antalya, Turkey.
[6] Aditya Khamparia, Deepak Gupta, Nhu Gia Nguyen, Ashish Khanna, Babita Pandey, Prayag Tiwari, "Sound Classification Using Convolutional Neural Network and Tensor Deep Stacking Network", IEEE Access, doi: 10.1109/ACCESS.2018.2888882, January 8, 2019.
[7] Alex Sherstinsky, "Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network", arXiv:1808.03314v4 [cs.LG], 4 November 2018.
[8] Stefano Squartini, Amir Hussain, Francesco Piazza, "Preprocessing Based Solution for the Vanishing Gradient Problem in Recurrent Neural Networks", IEEE, 2003.
[9] Sepp Hochreiter, Jürgen Schmidhuber, "Long Short-Term Memory", Neural Computation, 9(8):1735-1780, 1997.
[10] Imanol Bilbao, Javier Bilbao, "Overfitting Problem and the Over-Training in the Era of Data", The 8th IEEE International Conference on Intelligent Computing and Information Systems (ICICIS 2017).
[11] Yarin Gal, Zoubin Ghahramani, "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks", arXiv:1512.05287v5 [stat.ML], 5 October 2016.