
2020 IEEE 7th International Conference on Industrial Engineering and Applications

Thai Voice Recognition for Controlling Electrical Appliances Using Long Short-Term Memory

Wuttichai Saheaw
Department of Computer Science, Faculty of Science
King Mongkut’s Institute of Technology Ladkrabang
Bangkok 10520, Thailand
e-mail: [email protected]

Saichon Jaiyen
Department of Computer Science, Faculty of Science
King Mongkut’s Institute of Technology Ladkrabang
Bangkok 10520, Thailand
e-mail: [email protected]

Anantaporn Hanskunatai
Department of Computer Science, Faculty of Science
King Mongkut’s Institute of Technology Ladkrabang
Bangkok 10520, Thailand
e-mail: [email protected]

Abstract—Human speech has characteristics in each word that a computer can learn and recognize. This research proposes the use of deep learning models to predict spoken commands that turn various electrical appliances on and off. The recorded speech is converted into sound-wave features and applied to the training process in different ways. Because the commands contain more than one syllable and include similar-sounding words, they can be difficult to predict. A Convolutional Neural Network (CNN) is compared with Long Short-Term Memory (LSTM), which is part of the Recurrent Neural Network (RNN) family, on a Thai speech dataset of turn-on and turn-off commands for seven types of electrical appliances (14 classes in total), after reducing noise and trimming silence from the front and back of the audio files. The experimental results show that the proposed Long Short-Term Memory achieves the best accuracy.

Keywords—deep learning; voice recognition; long short-term memory

I. INTRODUCTION

Speech recognition with machine learning is widely used. The approach is to make the computer understand the form or nature of a sound and decide its meaning. Human speech can be detected and converted into sound-wave signals. However, many components must be implemented, such as determining the sampling rate before training, which is important for setting the frequency resolution, as in the wavelet-based approach of [1]. The sound can also be characterized by Mel Frequency Cepstral Coefficients (MFCC) [2]. The received signal is divided into segments, and the power and frequency of the sound are converted into a vector of numbers, or the audio signal is converted into an image by extracting the characteristics of each segment. Afterward, the data are arranged in time order, either as an audio waveform or as matrix data, so that they can be sent into the learning process of each deep learning model. Before training, noise must be eliminated or muted [3], since noise is an important variable that can cause significant problems in voice recognition [4].

The Convolutional Neural Network (CNN) [5], a machine learning model that learns well from images, has been used in speech recognition. An experiment that predicted one- to two-syllable words with different meanings by learning from images of sound waves achieved 89.7% accuracy [3], and on various environmental sounds the accuracy was 77% [6], which shows the effectiveness of CNNs in speech recognition. In addition to the CNN, the Recurrent Neural Network (RNN) [7] has been used for speech recognition; it uses the outputs of previous states for learning, so training proceeds in a loop over states. However, the RNN still has the vanishing gradient problem [8]. Long Short-Term Memory (LSTM) [9], a variant of the RNN, has been adopted to solve the vanishing gradient problem. It is trained iteratively on matrix data and has a memory that remembers the value of each state output, together with gates that decide whether to remember a value, clear it, or pass it on. Both learning methods require dropout to address the overfitting problem [10]. In this article the Convolutional Neural Network uses standard dropout, while the LSTM models use the Bayesian treatment of dropout from [11], which gives better results.

This research applies the Convolutional Neural Network and Long Short-Term Memory and compares their accuracy in predicting speech commands that turn various electrical appliances on and off.

II. METHODOLOGY

A. Data Preparation and Preprocessing

In this research, speech data were collected from 40 people, male and female, speaking Thai commands to turn electrical appliances on and off, such as light bulbs, air conditioners, computers, TVs, doors, fans, and curtains, for a total of 2,105 recordings. The recordings differ in length and contain environmental noise. The voice recordings have a sampling rate of 44.1 kHz and go through a process of trimming the silence at the front and back of each audio file and noise cancellation, using the FFmpeg and SoX libraries to produce audio files with less noise [3]. The speech lengths remain unequal because the spoken words have different numbers of syllables. From each utterance, 20 MFCC features are extracted to build a 250x185 spectrogram image; the color spectrogram images are used to train the CNN and the spectrogram frequency data to train the LSTM. 80% of the data is used as the training set and 20% as the test set. The dataset is summarized in Table I.

TABLE I. DATASET

Label                      Thai phonemic (IPA)   Syllables   Instances
Turn off lights            pìt + faj             2           194
Turn on lights             pɤ̀ːt + faj            2           202
Turn off air conditioner   pìt + ʔɛː             2           193
Turn on air conditioner    pɤ̀ːt + ʔɛː            2           189
Turn off computer          pìt + kʰɔːm           2           133
Turn on computer           pɤ̀ːt + kʰɔːm          2           135
Turn off television        pìt + tʰiː + wiː      3           124
Turn on television         pɤ̀ːt + tʰiː + wiː     3           128
Turn off door              pìt + pràʔ + tuː      3           130
Turn on door               pɤ̀ːt + pràʔ + tuː     3           123
Turn off fan               pìt + pʰát + lom      3           159
Turn on fan                pɤ̀ːt + pʰát + lom     3           149
Turn off curtain           pìt + mâːn            2           119
Turn on curtain            pɤ̀ːt + mâːn           2           127
TOTAL                                                        2,105
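To make the preprocessing described in Section A concrete, the sketch below trims silence with SoX, extracts 20 MFCCs per clip, and splits the data 80/20. The paper names FFmpeg and SoX for silence and noise removal but not their parameters, and it does not name a feature-extraction library, so the SoX arguments, the use of librosa and scikit-learn, and the fixed frame count of 185 are assumptions made here for illustration.

import subprocess
from pathlib import Path

import numpy as np
import librosa
from sklearn.model_selection import train_test_split

N_MFCC = 20      # MFCC coefficients per frame, as described above
N_FRAMES = 185   # assumed fixed frame count, to match the 250x185 spectrogram images

def trim_silence(src: Path, dst: Path) -> None:
    # Trim leading silence, then reverse, trim again, and reverse back so that
    # trailing silence is removed as well (thresholds are assumptions).
    subprocess.run(["sox", str(src), str(dst),
                    "silence", "1", "0.1", "1%",
                    "reverse", "silence", "1", "0.1", "1%", "reverse"],
                   check=True)

def mfcc_features(path: Path) -> np.ndarray:
    # Load at the 44.1 kHz sampling rate used in the dataset and compute 20 MFCCs.
    signal, sr = librosa.load(str(path), sr=44100)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T   # (frames, 20)
    if mfcc.shape[0] < N_FRAMES:                                    # zero-pad short clips
        mfcc = np.pad(mfcc, ((0, N_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:N_FRAMES]

def build_dataset(clip_paths, labels):
    # 80% training set, 20% test set, stratified over the 14 command classes.
    X = np.stack([mfcc_features(p) for p in clip_paths])
    y = np.array(labels)
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

Padding or truncating to a fixed number of frames is one simple way to handle the unequal clip lengths noted above.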

B. Proposed Method

This research uses an RNN, which works in a loop on the hidden layer of the neural network, using the prior state together with the current input to compute the current state and send it on to the next state. In this way the model can understand the current data from what it has learned from prior data, as shown in Figure 1.

Figure 1. The process of a Recurrent Neural Network (RNN).

Figure 1 shows the RNN process: $x_t$ is the input at time $t$, $h_t$ is the hidden state at that time, and $y_t$ is the output. When the loop is unrolled, it can be seen that multiple hidden states work jointly in sequence. $h_t$ is calculated as shown in (1):

$h_t = f_h(U_h h_{t-1} + W_h x_t + b_h)$    (1)

where $f_h$ is the activation function of the hidden layer, $U_h$ is the weight matrix of the previous hidden state, and $W_h$ is the weight matrix applied to the current input. The output $y_t$ is calculated as shown in (2):

$y_t = f_y(W_y h_t + b_y)$    (2)
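Read directly as code, (1) and (2) amount to the following small NumPy sketch of one RNN time step; the choices of tanh for $f_h$ and softmax for $f_y$ are assumptions, since the paper does not state them.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, U_h, W_h, b_h, W_y, b_y):
    # Hidden state update, equation (1); f_h assumed to be tanh.
    h_t = np.tanh(U_h @ h_prev + W_h @ x_t + b_h)
    # Output, equation (2); f_y assumed to be a softmax over the 14 classes.
    y_t = softmax(W_y @ h_t + b_y)
    return h_t, y_t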


However, the RNN still has the gradient vanishing problem. This research uses Long Short-Term Memory (LSTM) to solve the gradient vanishing problem, taking the spectrogram data with 20 MFCC features in vector form and sending it to the LSTM for training, as shown in Figure 2.

Figure 2. The training process of long short-term memory.

Figure 2 shows the steps used. The process starts by receiving speech, which is converted into a sound wave and stored in matrix format. The data are then fed row by row for training, and a memory records the state from the previous sound frames. The forget gate $f_t$ decides whether or not to remember the previous information, using the sigmoid function as the decider, as shown in (3):

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$    (3)

The input gate allows values to be updated, using the sigmoid function in the same way as the forget gate, as in (4):

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$    (4)

The input modulation gate instead uses the $\tanh$ function to produce the candidate update values, as shown in (5):

$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$    (5)

In the next step, after obtaining the forget gate and input gate values, the cell state is updated as in (6):

$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{C}_t$    (6)

The output gate produces a value from the sigmoid function, which is used in the pointwise calculation of the value forwarded to the next state, $h_t$, as in (7):

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$    (7)

Finally, the updated cell state is passed through the $\tanh$ function and multiplied pointwise with the value of the output gate $o_t$ to obtain the value sent to the next state, $h_t$, which also gives the resulting value of that state, $y_t$, as in (8):

$h_t = o_t \cdot \tanh(c_t)$    (8)
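Equations (3) to (8) describe one LSTM cell step. The NumPy sketch below implements them directly; the concatenation $[h_{t-1}, x_t]$ is formed explicitly, and the final projection to the class output $y_t$ is left out for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate, equation (3)
    i_t = sigmoid(W_i @ z + b_i)          # input gate, equation (4)
    c_hat = np.tanh(W_c @ z + b_c)        # input modulation gate, equation (5)
    c_t = f_t * c_prev + i_t * c_hat      # cell state update, equation (6)
    o_t = sigmoid(W_o @ z + b_o)          # output gate, equation (7)
    h_t = o_t * np.tanh(c_t)              # value passed to the next state, equation (8)
    return h_t, c_t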
In the training process, K-fold cross-validation is adopted: the data are divided into 10 folds, so that in each round 90% of the data forms the training set and 10% the test set. Each round records its model, and finally each model is used to predict its held-out set and the average accuracy over the rounds is calculated, as shown in Figure 3.

Figure 3. Training steps with K-fold cross-validation.

In addition, an experiment was carried out to adjust the cell size of the LSTM and the number of stacked cells, which has the structure shown in Figure 4.

Figure 4. RNN cell with multiple stacks.

The cell sizes tested are 128, 192, and 256 for one stack, and 128x128, 192x192, 256x256, 320x320, 384x384, and 512x512 for two stacked cells, to measure their effectiveness in predicting the results.

C. Performance Evaluation

In this research, the results are calculated on the 20% test data. The percentage of correctly predicted samples is calculated over the 14 labels, one for each command and type of electrical appliance, for both the CNN and the LSTM.
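The following sketch shows how the stacked LSTM classifier and the 10-fold procedure described above might be set up. The paper does not name its framework, so Keras and scikit-learn, as well as the training epochs and batch size, are assumptions here.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def build_lstm(cell_sizes=(320, 320), n_classes=14, n_frames=185, n_mfcc=20):
    # Stacked LSTM over MFCC frames, e.g. (320, 320) for the 320x320 configuration.
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_frames, n_mfcc)))
    for i, units in enumerate(cell_sizes):
        # all but the last LSTM layer return the full sequence to feed the next stack
        model.add(keras.layers.LSTM(units, return_sequences=(i < len(cell_sizes) - 1)))
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def cross_validate(X, y, n_splits=10):
    # 10-fold training: 90% train / 10% test per round, averaged accuracy.
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        model = build_lstm()
        model.fit(X[train_idx], y[train_idx], epochs=30, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))

Passing (128,), (192,), (256,) or the two-element tuples listed above as cell_sizes reproduces the single-stack and two-stack configurations compared in Table II.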
III. EXPERIMENTAL RESULT
In the experiments, the Long Short-Term Memory model provides the highest accuracy, 97.84%, which is more accurate than the Convolutional Neural Network, whose highest accuracy is 91.14%, as shown in Table II. Figures 5 and 6 show the corresponding confusion matrices. Figure 6 shows that the LSTM predictions are better than those of the CNN. However, commands whose first or second syllables sound similar may still be predicted incorrectly because of their similar audio patterns.
TABLE II. COMPARISON OF THE PREDICTION PERFORMANCE OF CNN AND LSTM

Model           Mean (%)   Max (%)   Min (%)
CNN             91.06      91.14     89.17
LSTM 128        94.25      94.73     93.54
LSTM 192        95.59      96.17     95.21
LSTM 256        95.83      96.65     95.45
LSTM 128x128    96.45      97.12     95.69
LSTM 192x192    97.41      97.60     97.12
LSTM 256x256    97.55      97.60     97.36
LSTM 320x320    97.79      97.84     97.60
LSTM 384x384    97.17      97.60     96.17
LSTM 512x512    95.02      96.17     94.01

Figure 5. The confusion matrix of the Convolutional Neural Network (accuracy 91.14%).

Figure 6. The confusion matrix of the Long Short-Term Memory with cell size 320x320 (accuracy 97.84%).

IV. CONCLUSION
In this paper, the results of the LSTM are quite high, partly because the noise in the sound dataset has been reduced. An appropriate cell size increases the prediction accuracy, but increasing it too much makes the prediction efficiency decrease. A model that mixes CNN and RNN (CRNN) could also be tried as an additional experimental approach. Overall, LSTM can be used to classify voice commands that turn electrical appliances on and off.
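As a pointer for the CRNN direction mentioned above, a minimal sketch (again assuming Keras, which the paper does not name) that places a 1-D convolution over the MFCC frames in front of an LSTM classifier:

from tensorflow import keras

def build_crnn(n_classes=14, n_frames=185, n_mfcc=20):
    # Hypothetical CRNN: Conv1D feature extractor followed by an LSTM classifier.
    return keras.Sequential([
        keras.Input(shape=(n_frames, n_mfcc)),
        keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        keras.layers.MaxPooling1D(pool_size=2),
        keras.layers.LSTM(128),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])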
REFERENCES
[1] Bhushan C. Kamble, “Speech Recognition Using Artificial Neural Network – A Review”, International Journal of Computing, Communications & Instrumentation Engineering (IJCCIE), Vol. 1, Issue 1, 2016.
[2] Atik Charisma, M. Reza Hidayat, Yuda Bakti Zainal, “Speaker Recognition Using Mel-Frequency Cepstrum Coefficients and Sum Square Error”, The 3rd International Conference on Wireless and Telematics, 2017.
[3] Pete Warden, “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition”, arXiv:1804.03209v1 [cs.CL], April 2018.
[4] Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam, “Bengali Speech Recognition: A Double Layered LSTM-RNN Approach”, 2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December 2017.
[5] Saad Albawi, Tareq Abed Mohammed, Saad Al-Zawi, “Understanding of a Convolutional Neural Network”, ICET 2017, Antalya, Turkey.
[6] Aditya Khamparia, Deepak Gupta, Nhu Gia Nguyen, Ashish Khanna, Babita Pandey, Prayag Tiwari, “Sound Classification Using Convolutional Neural Network and Tensor Deep Stacking Network”, IEEE Access, doi:10.1109/ACCESS.2018.2888882, January 8, 2019.
[7] Alex Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network”, arXiv:1808.03314v4 [cs.LG], 4 Nov 2018.
[8] Stefano Squartini, Amir Hussain, Francesco Piazza, “Preprocessing Based Solution for the Vanishing Gradient Problem in Recurrent Neural Networks”, 2003 IEEE.
[9] Sepp Hochreiter, Jürgen Schmidhuber, “Long Short-Term Memory”, Neural Computation, 9(8):1735-1780, 1997.
[10] Imanol Bilbao, Javier Bilbao, “Overfitting problem and the over-training in the era of data”, The 8th IEEE International Conference on Intelligent Computing and Information Systems (ICICIS 2017).
[11] Yarin Gal, Zoubin Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks”, arXiv:1512.05287v5 [stat.ML], 5 Oct 2016.
