Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 199 (2022) 1432–1437
www.elsevier.com/locate/procedia
A Novel Machine Lip Reading Model

Hongyang Huang, Chai Song*, Jin Ting, Taoling Tian, Chen Hong, Zhang Di, Danni Gao

[email protected]
Southwest University for Nationalities
Chengdu, China
Abstract
Lip reading is the technology of obtaining language content by analyzing the changes in a speaker's lip shape and recognizing the information carried by the lip movement. Lip reading helps people with hearing disabilities understand what other people are saying, a task that is very difficult for humans. This paper proposes a novel lip reading model using the Transformer network to achieve high recognition accuracy. The main process of the model includes the processing of the data set, the extraction of lip features using a pre-trained neural network, and the input of these features into the Transformer network for training. Finally, our model achieves a word-level lip reading accuracy of 45.81% on the open-source GRID corpus.
© 2021 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of The 8th International Conference on Information Technology and Quantitative Management (ITQM 2020 & 2021)
Keywords: Lip Reading, Transformer, Transfer learning
1. Introduction
In China, in the sixth national census, the number of people with hearing and language disabilities reached 20.7 million, accounting for 1.67% of the total population of the country [1]. Lip reading can assist the hearing impaired to communicate with others through lip movement. However, lip reading is very difficult for humans. In the study of Easton et al. [2], for hearing-impaired people without lip reading training, the recognition rate reaches only 29% on a corpus of just 30 syllables, and only 32% when the corpus consists of 30 compound words. Obviously, reading a language from the lips is a very difficult task.
In recent years, deep learning has developed rapidly, and it has become possible for machines to understand lips. In 2017, Google proposed a new model, the Transformer network [3], which is constructed using a self-attention mechanism instead of the CNN and RNN structures commonly used in deep learning. In a traditional RNN, the input of each state contains the output of the previous state, causing the RNN to be slow in some sequential processing tasks. The Transformer network adopts a self-attention mechanism, which effectively solves the problem that the RNN cannot be parallelized and greatly improves the speed of model training. Since then, the Transformer has been widely used in the field of NLP and has achieved remarkable results in machine translation, speech recognition and other directions.
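To make the self-attention mechanism concrete, the following is a minimal sketch of scaled dot-product self-attention in the sense of [3], written in PyTorch (the framework used later in this paper); the function name, tensor shapes and projection sizes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) input; w_q, w_k, w_v: (d_model, d_k) projections.
    Every position attends to every other position in parallel, which is
    what removes the step-by-step dependency of an RNN.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # linear projections
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                   # attention weights
    return weights @ v                                    # weighted sum of values

# Example with shapes matching the 75 x 512 lip features used later.
x = torch.randn(75, 512)
w_q, w_k, w_v = (torch.randn(512, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                    # (75, 64)
```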
In this paper, the pre-trained neural network VGG16 is used to extract the features of the lips in the video. As the extracted feature dimensions are too high, we adopt dimensionality reduction operations to process these features. After obtaining lip features with lower dimensions, the features are input into our Transformer network for training. The experiments show that training these features with our Transformer network can significantly reduce the training cost and improve the lip reading accuracy of the model.
The rest of this paper is organized as follows: Section II reviews studies related to lip reading recognition; Section III analyzes the model in detail; Section IV describes the detailed process of the experiment; Section V concludes the paper.
2. Related Work
Lip reading technology was first proposed by W. H. Sumby and I. Pollack in 1954 [4], but the first real Automatic Lipreading System was established by Petajan at the University of Illinois in 1984 [5]. In recent years, computer vision and computer speech technology have continued to make breakthroughs, and lip recognition, as a comprehensive reflection of image, speech and natural language processing technology, has also made great progress.
In terms of lip reading techniques based on traditional computer methods: in 1984, Petajan et al. proposed, for the first time, a lip reading system with the single word as the minimum recognition unit. It calculates the features of the lip image sequence, carries out a Nearest Neighbor search over all samples in the feature database, and outputs the most similar feature samples as the predicted results. In 1998, Gerasimos Potamianos et al. [6] studied a visual front end for automatic lip reading based on the Hidden Markov model and proposed two methods for extracting lip features: a feature method based on the lip contour and a method based on image change. In 2007, Zhao et al. [7] proposed a spatiotemporal local binary pattern recognition method to solve the problem of isolated phrase recognition, and used an SVM (Support Vector Machine) to recognize phrases.
In terms of deep learning based lip reading: Wand et al. introduced LSTM (Long Short-Term Memory) networks into lip reading research, and the recognition accuracy of their model reached 79.6% in word-level lip reading [8]. In 2016, Chung et al. from the University of Oxford published the LRW data set for the field of lip reading and established the WLAS (Watch, Listen, Attend and Spell) network, which achieved a classification accuracy of 61.1% [9]. In the same year, the Oxford Artificial Intelligence Laboratory, the DeepMind team, and the Canadian Institute for Advanced Research (CIFAR) jointly released the LipNet lip reading model [10], the first end-to-end sentence-level lip reading model that can simultaneously learn spatiotemporal visual features and a sequence model. It adopted STCNN (Spatiotemporal Convolution), LSTM and CTC loss (Connectionist Temporal Classification loss), and was the best lip reading model at that time.
3. Model

Our model construction mainly consists of three parts: the first part is dataset processing, the second part is feature extraction, and the third part is model training. The overall structure of the model is shown in Figure 1.
Figure 1. The overall structure of the model: the lip region is cropped from each video clip (75 frames of size 3 × 224 × 224), passed through the VGG16 feature layers (stacks of 3 × 3 convolutions and pooling that map 3 × 224 × 224 to 64 × 112 × 112, 128 × 56 × 56, 256 × 28 × 28, 512 × 14 × 14 and finally 512 × 7 × 7 per frame), reduced to a 75 × 512 feature sequence, and fed into the Transformer encoder and decoder, which output the label probabilities.

(2). Feature extraction
In this paper, the pre-trained neural network VGG16 [11] is used for transfer learning. After fine-tuning the VGG16 network structure, features are extracted from the lips in the video, which not only effectively extracts high-dimensional lip features but also avoids the large time cost of training a model from scratch. Here, we used all the feature layers of VGG16 to extract the features of the lips. Since the feature dimensions extracted by the VGG16 pre-trained model are too large, it is necessary to carry out a corresponding dimensionality reduction operation before inputting the features into our model for training. Two dimensionality reduction methods are adopted in this paper: the first is to convolve the extracted features, reducing the number of convolution kernels to achieve dimensionality reduction; the second is to reduce the feature dimension by adding a fully connected layer.
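As a hedged illustration of this step, the sketch below loads the torchvision VGG16 feature layers and shows one plausible form of each reduction option described above; the exact kernel sizes and layer names are assumptions based on the shapes in Figure 1, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained VGG16 feature layers (transfer learning; weights frozen here).
vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

frames = torch.randn(75, 3, 224, 224)        # one clip: 75 lip-region frames
with torch.no_grad():
    feats = vgg(frames)                       # (75, 512, 7, 7) per-frame features

# Option 1: convolve the feature maps; a 7 x 7 convolution collapses the
# spatial grid so that each frame becomes a single 512-d vector.
reduce_conv = nn.Conv2d(512, 512, kernel_size=7)
seq1 = reduce_conv(feats).flatten(1)          # (75, 512)

# Option 2: flatten and apply a fully connected layer to reduce the dimension.
reduce_fc = nn.Linear(512 * 7 * 7, 512)
seq2 = reduce_fc(feats.flatten(1))            # (75, 512)
```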
(3). Transformer
Transformer is a model proposed by Google in 2017. The model follows the traditional encoder-decoder structure, but the Encoder and Decoder parts do not use structures such as RNN and CNN; instead, they use an attention mechanism to build the model. We have modified the Transformer network and added it to our model to train the lip reading dataset. The Transformer network structure is shown in Figure 4 below:
Figure 4. The structure of the Transformer network: the encoder takes the 75 × 512 lip features as input, the decoder applies masked multi-head attention to the label sequence, and the outputs are the label probabilities; each encoder and decoder block (repeated N times) contains multi-head attention, a feed-forward sub-layer and Add & Norm connections.

Encoder: The encoder is composed of N identical layers, and each layer has two sub-layers: a multi-head self-attention
mechanism and a position-wise fully connected feed-forward neural network. A residual connection is used around each of the two sub-layers, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)).
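The sub-layer wiring can be summarized in a few lines; this is a minimal sketch of the LayerNorm(x + Sublayer(x)) pattern, with the class name chosen for illustration.

```python
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Wraps a sub-layer (attention or feed-forward) with the residual
    connection and layer normalization described above."""
    def __init__(self, sublayer, d_model=512):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Output of each sub-layer: LayerNorm(x + Sublayer(x))
        return self.norm(x + self.sublayer(x))
```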
Decoder: The decoder is also composed of N identical layers. In addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer. This sub-layer is a multi-head attention mechanism that uses masking so that the prediction at the current position cannot be affected by subsequent positions.
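Such masking is commonly implemented as an upper-triangular mask over the attention scores; the small sketch below is illustrative rather than the authors' implementation.

```python
import torch

def subsequent_mask(size):
    """Boolean mask that blocks attention from position i to any j > i,
    so the prediction at the current position cannot see later states."""
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

mask = subsequent_mask(5)
# mask[i, j] is True where attention must be suppressed, typically by
# setting the corresponding scores to -inf before the softmax.
```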
The Transformer network has achieved significant results in machine translation tasks, and we need to modify the structure of the Transformer network for the task we are dealing with: 1) the encoder input of the model is a sequence of high-dimensional lip features, which does not require word embedding; 2) the encoder input and the decoder input of the model are not padded to a fixed length.
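Under these two modifications the encoder consumes the 75 × 512 lip-feature sequence directly; the following is a hedged sketch using PyTorch's built-in Transformer encoder modules, with the layer count and head count as assumptions since the paper does not state them.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters: 6 layers and 8 heads, as in the original
# Transformer; the paper does not report the values it used.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# The 75 x 512 lip features are fed in as-is: no word-embedding lookup,
# and no padding, since every clip already contains exactly 75 frames.
lip_features = torch.randn(75, 1, 512)    # (seq_len, batch, d_model)
memory = encoder(lip_features)            # (75, 1, 512)
```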
4. Experiment
The experimental model was trained on a Dell workstation running Ubuntu 18.04. The GPU is a 24 GB Quadro P6000, the processor is an Intel Xeon(R) Gold 512, and the memory is 128 GB. The model was built using the PyTorch (1.6.0) framework.
The optimizer used in the model is Adagrad, an adaptive optimization method that adaptively assigns a different learning rate to each parameter. In Adagrad optimization, we set the initial learning rate to 7e-4 and trained for 30 epochs. Finally, the accuracy of the model reached 45.81% in word-level lip reading. The change of the loss function and accuracy of the model is shown in Figure 5.
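The reported optimizer settings translate directly into a short training-loop sketch; the model, loss function, class count and data below are placeholders, since the paper does not specify them.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 52)                 # placeholder for the full network; 52 classes assumed
optimizer = torch.optim.Adagrad(model.parameters(), lr=7e-4)  # initial lr 7e-4
criterion = nn.CrossEntropyLoss()          # assumed loss; not stated in the paper

for epoch in range(30):                    # 30 epochs, as reported above
    features = torch.randn(75, 512)        # dummy batch of lip features
    labels = torch.randint(0, 52, (75,))   # dummy word labels
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```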
5. Conclusion

The accuracy of our model reaches 45.81% in word-level lip reading, which forms a simple lip reading system. The recognition accuracy of the model still needs to be improved. We believe that with continuous debugging and improvement of the model in the future, a lip reading model with high accuracy can be achieved.
6. Acknowledgement
This paper is supported by the Key Research and Development Project of Sichuan Province (2021YFG0358)
and the Fundamental Research Funds for the Central Universities, Southwest Minzu University (2021PTJS24).
References
[1] http://www.stats.gov.cn/tjsj/zxfb/201104/t20110428_12705.html.
[2] Easton R D, Basala M. Perceptual dominance during lipreading[J]. Perception & Psychophysics, 1982, 32(6): 562-570.
[3] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. arXiv preprint arXiv:1706.03762, 2017.
[4] Sumby W H, Pollack I. Visual contribution to speech intelligibility in noise[J]. The Journal of the Acoustical Society of America, 1954, 26(2): 212-215.
[5] Petajan E D. Automatic Lipreading to Enhance Speech Recognition (Speech Reading)[J]. 1985.
[6] Potamianos G, Graf H P, Cosatto E. An image transform approach for HMM based automatic lipreading[C]//Proceedings 1998
International Conference on Image Processing. ICIP98 (Cat. No. 98CB36269). IEEE, 1998: 173-177.
[7] Zhao G, Pietikäinen M, Hadid A. Local spatiotemporal descriptors for visual recognition of spoken phrases[C]//Proceedings of the
international workshop on Human-centered multimedia. 2007: 57-66.
[8] Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory[C]//2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2016: 6115-6119.
[9] Chung J S, Zisserman A. Lip reading in the wild[C]//Asian Conference on Computer Vision. Springer, Cham, 2016: 87-103.
[10] Assael Y M, Shillingford B, Whiteson S, et al. Lipnet: End-to-end sentence-level lipreading[J]. arXiv preprint arXiv:1611.01599,
2016.
[11] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556,
2014.
[12] Maas A, Xie Z, Jurafsky D, et al. Lexicon-free conversational speech recognition with neural networks[C]//Proceedings of the 2015
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015:
345-354.
[13] Zhang C, Zhang S. Lip Reading using CNN Lip Deflection Classifier and GAN Two-Stage Lip Corrector[C]//Journal of Physics:
Conference Series. IOP Publishing, 2021, 1883(1): 012134.