Exploration of English Speech Translation Recognition Based On The LSTM RNN Algorithm
https://doi.org/10.1007/s00521-023-08462-8
S.I.: EVOLUTIONARY COMPUTATION BASED METHODS AND APPLICATIONS FOR DATA PROCESSING
Received: 10 October 2022 / Accepted: 3 March 2023 / Published online: 23 March 2023
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023
Abstract
In today’s information society, the demand for intelligence is increasing daily. English speech translation recognition
technology based on the LSTM (long short-term memory) recurrent neural network (RNN) algorithm is an important
manifestation of computer intelligence. In recent years, many scholars have conducted research on speech translation
recognition technology, including template matching and statistical pattern recognition. Each of these methods has its
drawbacks. This paper discusses English speech recognition techniques based on the basic principles of RNNs and
analyses their application and construction in practice, which can provide a useful reference for future researchers.
LSTM RNN is an intelligent system that is different from traditional pattern recognition methods. The greatest difference is
that it simulates the information processing of the human brain and realizes the intelligent information processing in a
distributed manner. It has a variety of automatic recognition and extraction functions, such as storage, association, and
retrieval, especially for speech translation and recognition problems with high perception ability. This new neural network
recognition system has a strong scientific nature and can store sound information in a decentralized manner, similar to the
human brain. The LSTM RNN has been widely used in the speech recognition field due to its excellent performance in
extraction and classification. The study found that the recognition accuracy of the original RNN generally stayed
between 48 and 54%, with a relatively high data loss rate. Speech recognition based on the LSTM RNN reached an
accuracy of up to 94% with high information storage efficiency, which largely avoided repetitive processing. Voice
data processing finished in as little as 4.5 s, which helps meet public expectations and the needs of social
development.
Keywords LSTM (long short-term memory) · Recurrent neural network algorithm · English voice translator · Translation recognition system
Yu Dai
[email protected]
Qiwei Yuan
[email protected]
Guangming Li
[email protected]
1 College of Foreign Language, Shaoyang University, Shaoyang 422000, Hunan, China
2 College of International Education, Hunan University of Medicine, Huaihua 418000, Hunan, China
3 Department of Mechanical and Energy Engineering, Shaoyang University, Shaoyang 422000, Hunan, China
Neural Computing and Applications (2023) 35:24961–24970
1 Introduction
Currently, the needs of human life are changing, especially in terms of computers. The initial simple applications can no longer meet people’s needs. The RNN is a neural network model that can process and predict sequence data. RNNs and their variant networks have been widely used in many tasks, especially those with time dependencies. In speech recognition, machine translation, language modelling, text classification, word vector generation, and information retrieval, a model is needed to integrate and train data. However, RNNs are generally difficult to train. After repeated loops, the gradient gradually vanishes, and
the problem of gradient explosion rarely occurs. Due to these problems in the practical application of RNNs, LSTM networks have attracted extensive attention as an effective method for preserving data. Improvements in English speech translation recognition with LSTM RNN algorithms have also appeared one after another.
In addition, in the past two decades, with the advancement of technology, computer performance has steadily improved. People’s understanding of speech translation is also becoming increasingly mature, and many functions that were once only imagined can now be applied in real life. Speech translation recognition technology is not only a science in its own right but is also closely related to the fields of acoustics, intelligent computer recognition, and data processing. Research and analysis usually proceed from three aspects: template matching, stochastic models, and probabilistic grammars. In addition, more research methods are constantly being discovered.
At present, English speech translation recognition systems are still based on hidden Markov models for modelling the time series of English speech signals. Their model parameters cannot be trained well, which seriously affects speech recognition and translation abilities. The English speech translation recognition system based on the LSTM RNN algorithm discussed in this paper is a forward neural network with multiple hidden layers, and each layer is trained separately (including the classifier at the last level). This enables more comprehensive training of the model parameters under limited training data, and the advantages of the LSTM RNN in the English speech translation recognition task are supported by corresponding evidence.
English education is an important means of cultivating language talent, but in practice its development has encountered many difficulties, which affects the final quality of education. English voice translators are an indispensable tool in today’s teaching and can effectively solve various problems encountered in daily life. The core of the English translation system is speech recognition. Through optimized design and comprehensive application, it can improve the level of English teaching.
This paper found that the recognition accuracy of the original RNN usually stays in the range of 48–54%, while the data loss rate is relatively high. Speech recognition using the LSTM RNN has a 94% accuracy rate, effectively saves a large amount of data, and greatly reduces repetitive processing. Voice data are also processed quickly, which plays a great role in meeting public expectations and the needs of social development: although speech processing with recurrent neural networks is relatively slow in general, the LSTM recurrent neural network can finish in as little as 4.5 s. At the same time, the LSTM recurrent neural network will also become a new development trend for English speech translation.
2 Related work
At present, due to the development of science and technology, people’s demand for computer intelligence is increasing, and speech translation and recognition technology is one of the main intelligent technologies. Many scholars have conducted research on related aspects and proposed many translation recognition algorithms, but these methods have certain drawbacks (such as poor-quality translations, delays in data processing, and translations that often differ greatly from the original intent). Based on this, this paper discusses the recognition technology of speech translation from the perspective of neural networks and analyses its application and implementation in practice to meet the needs of today’s society.
For the purpose of English phonetic translation retrieval, an English listening translation retrieval system can be designed and implemented, and different English phonetic retrieval, query and translation implementation methods can be analysed. Du S proposed an English speech translation recognition system based on the LSTM recurrent neural network algorithm to solve the translation ambiguity problem in bilingual dictionary-based information retrieval systems [1]. One difference between Chinese and English is that English attaches great importance to syllable stress. Therefore, syllable recognition is a key link in English teaching. Chen X studied English speech recognition based on the LSTM RNN. On this basis, an acoustic model based on weighted transfer was proposed, with a one-step method and a two-step method. In English speech recognition, the convolutional neural network (CNN) and the LSTM neural network were compared, and the comparison was summarized in a statistical chart. The experimental results showed that, compared with other methods, the LSTM neural network has obvious advantages in English translation and is suitable for subsequently teaching English speech recognition models and developing real-time translation equipment [2]. Zenkel T introduced an open source toolkit for English speech translation. Although a variety of open source tools can perform basic speech translation, the purpose of the toolkit is to provide a convenient method for overall English speech translation. To this end it uses Docker (an open source application container engine) and includes the following components: a neural speech recognition system, a sentence segmentation system, and an attention-based translation system. In addition, it provides pretrained models to promote the development of voice translation systems and encourage researchers to
improve overall translation recognition system performance [3]. Speech recognition, or speech-to-text conversion, has rapidly attracted great interest from large organizations seeking to simplify human–computer communication. Dharmale G found that optimizing the speech recognition process is crucial because real-time users want to perform actions based on voice input, and these actions sometimes define the user’s lifestyle, thus defining the speech process. Existing speech recognition software from Google, Tencent and Microsoft tends to be more than 90% accurate in real-time speech detection. The English speech translation recognition system based on the LSTM recurrent neural network algorithm combines the speech recognition methods used by these software programs with language processing, which improves overall accuracy with the help of speech analysis [4]. In English feature recognition, the accuracy of speech translation recognition technology is particularly prominent. Hou Q took an intelligent learning algorithm as the system algorithm and used linear and nonlinear classification methods to carry out the relevant subjective identification. By exploiting the insensitivity of speech to phase, it can achieve noise reduction and better recognize and translate English [5]. With the continuous progress of society, RNNs based on long short-term memory are increasingly favoured, as they solve many long-standing problems and contribute to national development. Therefore, English translation recognition technology based on the LSTM RNN algorithm has become an important direction in the field of human–computer interaction. In English teaching, speech recognition technology not only helps teachers correct pronunciation to a certain extent but also frees students from the limits of time and space. Duan R used the LSTM recurrent neural network algorithm to improve and analyse the speech recognition algorithm and used the resulting algorithm as the system algorithm for the English speech translation recognition model. Additionally, the basic speech-cutting process was described, and a control experiment was designed to verify and analyse the English speech translation recognition correction model based on the LSTM recurrent neural network algorithm [6]. Hai Y aimed to make use of English-specific syllables and prosodic features in spoken language data for English speech translation recognition and explored effective methods for speech detection and recognition systems. The method is based on a combination of classifiers, including syllable classifiers, together with other speech features based on speech rate, intensity, formant and energy statistics, and articulation rate. Compared with syllable classifiers trained on specific syllables, it achieved better recognition rates. It was found that the recognition performance of the English speech translation recognition system was significantly better than that of the traditional model [7]. Zhu H proposed a vocabulary classification scheme based on the LSTM RNN algorithm and used WordNet, an English dictionary based on the cognitive linguistics of synonyms, antonyms, and context words. The English speech translation recognition system based on the LSTM RNN algorithm can utilize the semantic relationships of WordNet to improve the ease of use of English speech translation recognition systems [8]. Worldwide, research on English speech translation recognition systems has turned to neural networks, which will remain a hot topic for a long time.
After a long period of development, the performance and role of speech translation and recognition systems have greatly improved, but many problems still need to be studied. Although speech recognition technology based on neural networks performs well, it is difficult to use widely in practice because of its unsatisfactory learning speed. With the continuous development of the algorithm and the simplification of the model, the recognition rate of the LSTM RNN will improve further, and it will gradually become mainstream in the market and occupy a place in various industries.
3 English speech translation recognition system based on the LSTM RNN algorithm
3.1 Technical basis for the design of an English speech translation recognition system
Currently, the trend of education informatization in China is becoming increasingly prominent. Both teachers and students need an information platform for English teaching activities, which overcomes the shortcomings of the previous English teaching mode (such as indoctrination teaching, time and space constraints, poor classroom atmosphere, and insufficient human translation ability). English speech recognition plays an important role in translation software, and its recognition process is shown in Fig. 1. It is mainly used for identifying multiple languages to help students quickly grasp English connotations. There are three main aspects of English speech translation recognition technology: feature extraction, pattern matching, and model training.
3.1.1 Feature extraction technology
The auxiliary function of the English language translation speech recognition system usually involves three aspects: collection, processing and transmission. In addition, computer language and natural language are very different. Therefore, correctly distinguishing the difference between
the two when translating is an urgent problem [9]. Feature extraction technology can extract features from the English language and transmit correct language signals to translators to improve computer translation accuracy.
[Figure, likely Fig. 1 (speech recognition process); recoverable labels: model library, output, result]
3.1.2 Pattern matching technology
Speech recognition systems can help students and teachers quickly understand the meaning of language through pattern matching technology and avoid the trouble caused by human language errors. Pattern matching technology uses intelligent pattern recognition to automatically recognize and analyse speech input, which reduces the difficulty of manual translation. It can automatically select a matching translation mode according to the structure, grammar and application of English words and sentences. People can obtain the final translation result by executing the program command, which is of great help to both students and teachers.
3.2 Model training technology
To realize teaching informatization, a translation recognition system based on speech recognition is proposed. In English teaching, it can help teachers solve translation problems and improve students’ ability to understand English knowledge. After speech recognition is completed, the translator conducts simulation training according to the actual situation to establish a virtual language training platform. The simulation training technology follows the design concept of man–machine integration. By combining translators and speech recognizers for training, it can quickly identify and judge the level of English pronunciation and guide students to adjust their speech [10].
3.3 LSTM RNN model structure
An FNN (feedforward neural network) can be considered a primitive neural network composed of a series of simple neurons. Figure 2 shows a simple FNN that includes the input layer, the hidden layer, and the output layer. There are no loops in the network, and there is no feedback between the output of the network and the model. Therefore, FNNs still have great drawbacks in many cases [11]. The most important feature of the RNN is that its recurrent network structure is well adapted to time series data and can maintain data correlation. Figure 3 shows the RNN network structure, which uses the loop on the hidden layer to reduce the number of neural network parameters to be trained.
In addition, because the parameters are shared across time steps, the network can be extended to data of different lengths, so the RNN input can be a sequence of indeterminate length. For example, to train on a fixed sentence, an FNN needs independent parameters for each input feature, whereas an RNN reuses the same parameters at every position. Although the original goal of RNNs is to learn long-term dependencies, extensive practice has shown that standard RNNs often struggle to preserve information for long periods of time [12]. Therefore, the RNN was not universally applicable in its initial form. To address this problem, this paper improved the traditional RNN, and the LSTM is currently the most effective method for doing so. Compared with the hidden RNN unit, the internal structure of the hidden LSTM unit is more complex and has more options. The LSTM RNN unit is shown in Fig. 4.
In addition, the LSTM layers are used for feature extraction, and the fully connected layers are used for regression classification. The LSTM RNN system consists of 5 layers: two LSTM layers, two fully connected layers, and an output layer. The structure is shown in Fig. 5.
3.4 Development status of English speech recognition and translation technology
Currently, speech translation recognition has made great progress in theory and practice, which has greatly promoted communication and collaboration between people of different languages and cultural backgrounds. The summary is as follows:
First, the number of translated words is increasing. With the progress of technology, the vocabulary of speech translation is also growing exponentially. From the initial hundreds of English words to the current tens of thousands, translation efficiency has been greatly improved.
Fig. 2 Simple structure of FNN (labels: input layer, hidden layer, output layer)
Fig. 3 Network structure of RNN (labels: input layer, hidden layer, output layer, connect to the next time step)
[Fig. 4, caption not recovered; labels: output gating unit, output extrusion unit, output gate unit, memory cells, forget gate unit]
recognition systems, the translation algorithm usually performs interactive processing through multiple translation methods. In this way, their respective advantages complement one another and the limitations of a single algorithm are overcome, thereby achieving the goal of a multiengine translation strategy. For example, translation software such as Google Translate and Youdao Dictionary uses this multiengine translation algorithm [15].
Fourth, a large amount of world knowledge and knowledge of the language expression environment is introduced into the speech recognition translation system. To improve translation accuracy, many research groups are working to introduce knowledge of social roles, conversation scenes, body movements, and expressions into speech recognition systems. Some research groups are also using television and image capture technology to aid speech recognition translation by collecting and analysing the speaker’s facial expressions, movements and environment. Even if the translation itself is not very good, the listener can roughly judge the other party’s meaning from this other information [16].
Fifth, the technology has begun to develop from a single language towards multilingual, multicontext, two-way communication. Previous translation systems took a single language as the object and completed one-way voice conversion. Current translation systems target multilanguage, multidomain two-way speech, which greatly promotes the exchange of information between the two parties.
Table 1 shows an overview of the current state of development.
[Fig. 5, caption not recovered (structure of the LSTM RNN system); recoverable labels: LSTM, result]
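The five-layer structure described in Sect. 3.3 (two LSTM layers, two fully connected layers, and an output layer) can be sketched as a simple layer specification. This is only an illustrative sketch: the paper does not report layer widths, so the sizes below are hypothetical, the output layer is assumed to be a dense layer, and the parameter count uses the standard LSTM formulation rather than anything stated by the authors.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    kind: str   # "lstm" or "dense"
    n_in: int   # input width
    n_out: int  # output width (hidden size for an LSTM layer)

    def param_count(self) -> int:
        if self.kind == "lstm":
            # Standard LSTM: four weight blocks (input, forget and output gates
            # plus the candidate), each with input weights, recurrent weights and a bias.
            return 4 * (self.n_out * (self.n_in + self.n_out) + self.n_out)
        if self.kind == "dense":
            return self.n_in * self.n_out + self.n_out
        raise ValueError(f"unknown layer kind: {self.kind}")


def build_stack(n_features: int, n_hidden: int, n_dense: int, n_classes: int) -> list:
    """Two LSTM layers, two fully connected layers, and an output layer."""
    return [
        Layer("lstm", n_features, n_hidden),
        Layer("lstm", n_hidden, n_hidden),
        Layer("dense", n_hidden, n_dense),
        Layer("dense", n_dense, n_dense),
        Layer("dense", n_dense, n_classes),  # output layer (assumed dense)
    ]


# Hypothetical sizes: 13 acoustic features per frame, 32 hidden units,
# 16 dense units, 10 output classes.
stack = build_stack(n_features=13, n_hidden=32, n_dense=16, n_classes=10)
for layer in stack:
    print(layer.kind, layer.n_in, layer.n_out, layer.param_count())
print("total parameters:", sum(layer.param_count() for layer in stack))
```

With these assumed sizes, most of the parameters sit in the two LSTM layers, which is consistent with the text’s description of the LSTM layers doing the feature extraction while the fully connected layers handle classification.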
Table 1 Overview of development status
Development status                     Vocabulary   Translation quality (%)   Translation needs (%)
Translation volume increased           More         85.69                     89.62
Wide range of translation              More         80.99                     93.69
Translation diversity                  More         90.15                     95.22
Wide range of translation situations   More         92.66                     93.88
[Figure, caption not recovered; recoverable labels: speech corpus, acoustic model, language model, text corpus, dictionary]
information. The information loss is $d_t$ and the total loss is as follows:
$$\sum_{t=1}^{T} d_t \tag{1}$$
the state of the current input information cannot affect the output information of the output gate. The increase in the connection of the current input gate to the output gate can better control the output data of each memory cell.
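To make Eq. (1) and the role of the gates concrete, the following minimal sketch runs a single LSTM cell over a short sequence and sums the per-step losses d_t into the total loss. It rests on assumptions not taken from the paper: the cell uses the standard LSTM formulation (forget, input and output gates plus a candidate state), the per-step loss is a squared error between the hidden state and a target, and all sizes, weights and data are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h, c, params):
    """One step of a standard LSTM cell (assumed formulation)."""
    z = x + h  # concatenate the input with the previous hidden state
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]  # memory cell update
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]           # gated output
    return h_new, c_new


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases for the four weight blocks."""
    rng = random.Random(seed)
    params = {}
    for gate in "fiog":
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
            for _ in range(n_hidden)
        ]
        params["b" + gate] = [0.0] * n_hidden
    return params


def total_loss(inputs, targets, params, n_hidden):
    """Sum the per-step losses d_t over t = 1..T, as in Eq. (1)."""
    h = [0.0] * n_hidden
    c = [0.0] * n_hidden
    step_losses = []
    for x, y in zip(inputs, targets):
        h, c = lstm_step(x, h, c, params)
        d_t = sum((hk - yk) ** 2 for hk, yk in zip(h, y))  # hypothetical squared-error loss
        step_losses.append(d_t)
    return sum(step_losses), step_losses


# Hypothetical toy sequence with T = 3 steps, 3 input features, 2 hidden units.
params = init_params(n_in=3, n_hidden=2)
inputs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
targets = [[0.0, 0.1], [0.1, 0.0], [0.2, -0.1]]
total, step_losses = total_loss(inputs, targets, params, n_hidden=2)
print("per-step losses:", step_losses)
print("total loss:", total)
```

In this sketch the output gate `o` scales what each memory cell exposes at every step, which is the kind of control over memory cell output that the text refers to.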
Fig. 7 Comparison of recognition accuracy (two panels; series: goal one to goal four; horizontal axis: first to sixth time; vertical axis: recognition accuracy, 45–55% in one panel and 86–95% in the other)
[Figure, caption not recovered (presumably Fig. 8); recoverable labels: storage integrity (axis 66–75%), first to fifth survey, network]
Fig. 9 Comparison of voice data processing speed (two panels; series: goal one to goal four; vertical axis 0–12 in one panel and 0–6 in the other)
5 Conclusions
Throughout the text, it can be seen that the intelligent translation system for English voice recognition is a new translation technology based on information technology and intelligent technology. After decades of development, it has made great progress in theory and practice and has realized the desire to communicate across languages. LSTM recurrent neural networks are an important direction in current learning research. They can process sequence data such as text, audio, and video and achieve significant results in many areas. Research on the components of the recurrent structure continues, and the computing components are continuously refined to improve performance. However, there are still many defects in current voice recognition translation systems. How to further improve translation quality has become a problem faced by scientists, which requires all researchers to work together to create tomorrow’s voice recognition translation technology.
Funding This work was supported by Shaoyang Science and Technology Planning Project (2021025ZD): Construction of College English online education Platform under the background of "Internet+".
Data availability statement Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Declarations
Conflict of interest There are no potential competing interests in our paper.
References
1. Du S (2019) Optimization of speech recognition system of English education industry based on machine learning. Computer-Aided Des Appl 17(1):124–136
2. Chen X (2021) Simulation of English speech translation recognition based on transfer learning and CNN neural network. J Intell Fuzzy Syst 40(2):2349–2360
3. Zenkel T, Sperber M, Niehues J (2018) An open source toolkit for speech-to-English text translation. Prague Bull Math Ling 111(1):125–135
4. Dharmale G, Thakare VM, Patil DD (2019) Implementation of efficient speech recognition system on mobile device for Hindi and English language. Int J Adv Comput Sci Appl 10(2):83–87
5. Hou Q, Li C, Kang M (2020) Intelligent model for speech recognition based on SVM: a case study on English language. J Intell Fuzzy Syst 40(7):1–11
6. Duan R, Wang Y, Qin H (2020) A speech recognition model for correcting spoken English teaching. J Intell Fuzzy Syst 40(1):1–12
7. Hai Y (2020) Computer-aided teaching mode of oral English intelligent learning based on speech recognition and network assistance. J Intell Fuzzy Syst 39(4):5749–5760
8. Zhu H (2020) Construction of English spoken language system based on machine learning algorithm and natural language recognition. J Intell Fuzzy Syst 39(99):1–12
9. Sangeetha J, Jothilakshmi S (2017) Speech translation system for English to Dravidian languages. Appl Intell 46(3):534–550
10. Mott M, Midgley KJ, Holcomb PJ (2020) Speech recognition translation initiation and image effects in American Sign Language deaf and English listening learners. Biling Lang Cognit 23(5):1032–1044
11. Mendel LL, Poussen M, Bass JK (2019) English speech recognition threshold test for Spanish children. Am J Audiol 28(1):1–8
12. Long Y, Li Y, Zhang Q (2020) Acoustic data augmentation for Mandarin-English code-switching speech recognition. Appl Acoust 161(11):107–125
13. Feng X, Zhou Y (2021) English translation language retrieval based on adaptive English phonetic adjustment algorithm. Complexity 202(1):1–12
14. Cao D, Guo Y (2020) Algorithm research of spoken English assessment based on fuzzy measure and speech recognition technology. Int J Biom 12(1):120–131
15. Miller MK, Calandruccio L, Buss E (2019) Masked English speech recognition performance in younger and older Spanish–English bilingual and English monolingual children. J Speech Lang Hear Res 62(12):1–14
16. Yun Z (2017) Research on spoken English speech recognition technology in computer network environment. Boletin Tecnico/Tech Bull 55(16):445–449
17. Zhang Y, Liu L (2018) Using computer speech recognition technology to evaluate spoken English. Educ Sci Theory Pract 18(5):20–31
18. Hidayat R, Winursito A (2021) Improved MFCC robust English speech recognition based on wavelet denoising. Int J Intell Eng Syst 14(1):12–21
19. Pathak A, Pakray P, Bentham J (2019) English-Mizo machine translation using neural and statistical approaches. Neural Comput Appl 31:7615–7631
20. Bawa S (2021) A Sanskrit-to-English machine translation using hybridization of direct and rule-based approach. Neural Comput Appl 33:2819–2838
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.