Multimodal Corpora for Silent Speech Interaction
Abstract
A Silent Speech Interface (SSI) allows speech communication to take place in the absence of an acoustic signal. This type of interface is an alternative to conventional Automatic Speech Recognition, which is not adequate for users with certain speech impairments or in the presence of environmental noise. The work presented here establishes the conditions to explore and analyze complex combinations of input modalities applicable to SSI research. Focusing on promising non-invasive modalities, we have selected the following sensing technologies used in human-computer interaction: Video and Depth input, Ultrasonic Doppler sensing and Surface Electromyography. This paper describes a novel data collection methodology in which these independent streams of information are synchronously acquired with the aim of supporting research and development of a multimodal SSI. The recordings were divided into two rounds: a first one in which the prompts were silently uttered and a second one in which the speakers pronounced the scripted prompts audibly, in a normal tone. In the first round, a total of 53.94 minutes was captured, of which 30.25% was estimated to be silent speech. In the second round, a total of 30.45 minutes was obtained, of which 30.05% was audible speech.
…conditions to record all signals with adequate synchronization. The challenge of synchronizing all the signals resided in the fact that a potential synchronization event would need to be captured simultaneously by all (four) input modalities. For that purpose, we selected the sEMG recording device, which had an available output channel, as the source that generates the alignment pulse for all the remaining modalities. After the data collection system setup was ready, the database described in this paper was collected for further analysis.

… Portuguese (EP) (Strevens, 1954).
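To make the role of this alignment pulse concrete, the sketch below is a minimal illustration of ours, not code from the original system: the modality names, timestamps and function are hypothetical. It shows how the instant at which each stream observes the shared pulse can be turned into per-stream offsets that bring all recordings onto a common time axis.

```python
# Minimal sketch (assumption): each modality reports the time, on its own clock,
# at which the sEMG-generated synchronization pulse was detected.
pulse_time = {
    "semg": 2.431,   # I/O flag recorded alongside the sEMG samples
    "uds": 2.447,    # voltage peak on the first UDS channel
    "video": 2.465,  # color frame in which the LED turns on
    "depth": 2.465,  # depth frame paired with that color frame
}

def alignment_offsets(pulse_time, reference="semg"):
    """Offset (in seconds) to subtract from each stream's timestamps so that
    the synchronization pulse coincides with the reference stream."""
    t_ref = pulse_time[reference]
    return {modality: t - t_ref for modality, t in pulse_time.items()}

offsets = alignment_offsets(pulse_time)
# A sample recorded at time t in the video stream is then re-stamped as
# t - offsets["video"], and likewise for the other modalities.
```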
…aligned offline. To align the RGB video and the depth streams with the remaining modalities, we used an image template matching technique that automatically detects the LED position in each color frame. For the UDS acquisition system, the activation of the output I/O flag of the sEMG recording device generates a small voltage peak on the signal of the first channel. To enhance and detect that peak, a second-order derivative is applied to the signal, followed by an amplitude threshold. To be able to detect this peak, we previously configured the external sound board channel with maximum input sensitivity. The time-alignment of the EMG signals is ensured by the sEMG recording device, since the I/O flag is recorded synchronously with the samples of each channel.

…the extra input channels provided by this device, we decided to collect a second round of recordings in which the audio channel from the UDS device is also synchronously acquired. In this round we collected 3 speakers: one from the previous data collection and two elderly speakers, all native EP speakers with no known history of speech disorders. The first speaker was a 31-year-old male and the two elderly speakers were females aged 65 and 71, respectively. In this second stage of data collection, each speaker recorded two sessions without removing the EMG electrodes or changing the recording position.
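The two detection steps described above for the video and UDS streams can be sketched roughly as follows. This is our own reconstruction for illustration only: the LED template image, the amplitude threshold and the function names are assumptions, not the implementation used in the recordings.

```python
import numpy as np
import cv2

def detect_uds_sync_peak(first_channel, fs, threshold):
    """Locate the small voltage peak induced by the sEMG I/O flag on the first
    UDS channel: take a second-order difference (the discrete counterpart of a
    second-order derivative) and return the time, in seconds, of the first
    sample whose magnitude exceeds the amplitude threshold (None if absent)."""
    d2 = np.diff(first_channel, n=2)
    above = np.flatnonzero(np.abs(d2) > threshold)
    return above[0] / fs if above.size else None

def detect_led_position(color_frame, led_template):
    """Find the LED in a color frame by normalized cross-correlation template
    matching; returns the (x, y) top-left corner of the best match."""
    scores = cv2.matchTemplate(color_frame, led_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, best_xy = cv2.minMaxLoc(scores)
    return best_xy
```

In such a sketch, the index of the first frame in which the LED is found lit would supply the video-side timestamp of the synchronization event, while the detected peak supplies the UDS-side timestamp.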
…end of the session), in a random order, with each prompt being pronounced individually in order to allow isolated word recognition. All prompts were repeated 3 times per recording session.

Ambient Assisted Living Word Set
Videos (Videos)      Ligar (Call/Dial)       Contatos (Contacts)        Mensagens (Messages)   Voltar (Back)
Pesquisar (Search)   Anterior (Previous)     Fotografias (Photographs)  Família (Family)       Ajuda (Help)
Seguinte (Next)      Lembretes (Reminders)   Calendário (Calendar)      E-Mail (E-Mail)        -

Table 1: Set of words of the EP vocabulary, extracted from AAL contexts.

3. Characterization of the Acquired Database

In this section we present some statistics of the acquired data. In the first round of recordings no audio was collected, so an automatic algorithm was used to estimate speech statistics. For the second round of recordings, audible utterances were recorded and the audio was used as auxiliary information for manually annotating the data.

3.1 First Round of Recordings

The data collected in the first round of recordings has a total elapsed duration of 56.11 minutes, with an average duration of 5.99 minutes per session and 3.74 seconds per utterance, not considering silence utterances. By applying a Voice Activity Detection (VAD) technique based on UDS alone, we estimate that 30.25% is silent speech (i.e. continuous facial movements) and that 69.75% is the silence before and after each utterance. The VAD algorithm uses the energy of the pre-processed UDS spectrum around the carrier and a mean reference value, extracted from the silence prompts of each speaker, to distinguish silent articulation. Each session presents an average speech duration of 1.81 minutes and 4.18 minutes of non-speech. The female speakers had an average speech proportion of 42.79% per session, while this figure for the male speakers was only 23.29%. Table 2 details the audio duration of the collected data by word set.

Word Set        Total Recorded Duration (minutes)   Silent Speech   Non-Speech
Digits          15.28                               26.78%          73.22%
Nasal Pairs     13.02                               28.90%          71.10%
AAL             25.63                               33.00%          67.00%
All word sets   53.94                               30.25%          69.75%

Table 2: Audio duration, speech time and non-speech time distribution by word set (excluding silence utterances) for the first round of recordings.

3.2 Second Round of Recordings

In the second round of recordings, since synchronously acquired audio was available, the estimation of the speech and non-speech characteristics was based on a manual annotation of the speech signal by the first author. As described in Table 3, this second round has a total elapsed duration of 30.45 minutes, with an average duration of 5.07 minutes per session and 3.17 seconds per utterance.

Word Set        Total Recorded Duration (minutes)   Speech    Non-Speech
Digits          8.78                                28.13%    71.87%
Nasal Pairs     7.48                                26.87%    73.13%
AAL             14.20                               32.91%    67.09%
All word sets   30.45                               30.05%    69.95%

Table 3: Audio duration, speech time and non-speech time distribution by word set (excluding silence utterances) for the second round of recordings.

In Table 4 the session statistics for the first and second rounds are presented. Based on these values, a larger duration can be noticed for the sessions where only silent speech was considered. This suggests a slower articulation when no acoustic feedback is available; however, it might also be related to, or influenced by, the lack of experience shown by most speakers when articulating the words without any acoustic feedback.

Data Collection Stage   Average Duration per session (minutes)   Average Speech per session (minutes)   Average Non-Speech per session (minutes)
1st round               5.99                                     1.81                                   4.18
2nd round               5.07                                     1.52                                   3.55

Table 4: Average session duration, speech time and non-speech time per session for the first and second rounds of recordings.
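The UDS-based VAD used for the first-round estimates (Section 3.1) can be sketched roughly as below. This is a minimal reconstruction of ours from the textual description: the bandwidth around the carrier, the decision margin and the framing parameters are assumed values, not those actually used for the corpus.

```python
import numpy as np
from scipy.signal import stft

def band_energy(signal, fs, carrier_hz, half_band_hz=500.0, nperseg=1024):
    """Frame-wise energy of the UDS spectrum in a band around the carrier."""
    freqs, _, spec = stft(signal, fs=fs, nperseg=nperseg)
    in_band = (freqs >= carrier_hz - half_band_hz) & (freqs <= carrier_hz + half_band_hz)
    return np.sum(np.abs(spec[in_band, :]) ** 2, axis=0)

def silent_speech_vad(signal, fs, carrier_hz, silence_prompts, margin=1.5):
    """Flag frames whose near-carrier energy exceeds a per-speaker reference,
    computed as the mean band energy over that speaker's silence prompts."""
    reference = np.mean(
        [band_energy(p, fs, carrier_hz).mean() for p in silence_prompts]
    )
    return band_energy(signal, fs, carrier_hz) > margin * reference
```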
…and 4.19 minutes for non-speech data, a similar result to what was obtained using the UDS algorithm.

4. Conclusion

This paper describes a multimodal data collection with 5 streams of data: Video, Depth, Surface EMG, Ultrasonic Doppler Sensing and audio. By using the surface EMG recording device we were able to synchronously combine these silent speech modalities and acquire information from multiple stages of the human speech production model. The data collection is divided into two rounds of recordings: in the first round only silent speech was recorded (i.e. no acoustic signal was produced by the speaker); in the second round, audible speech was captured in addition to the remaining modalities. We have also used an algorithm based on UDS energy to estimate the total speech time in the absence of the acoustic signal, and we have reported some statistics of how the data is distributed.

5. Future Work

The collected data opens several doors in terms of future research. It will potentially allow for the development of a multimodal SSI based on these modalities, where the strongest points of one modality can help to minimize the weakest points of the others. It will also allow us to look at other types of information, beyond the acoustic signal, for interesting research issues such as elderly speech characteristics and the production and recognition of nasal sounds.

6. Acknowledgements

This work was partially funded by Marie Curie Actions Golem (ref. 251415, FP7-PEOPLE-2009-IAPP) and IRIS (ref. 610986, FP7-PEOPLE-2013-IAPP), by FEDER through the Program COMPETE under the scope of QREN 5329 FalaGlobal, and by National Funds (FCT - Foundation for Science and Technology) in the context of IEETA Research Unit funding FCOMP-01-0124-FEDER-022682 (FCT-PEst-C/EEI/UI0127/2011). The authors would also like to thank the experiment participants.

7. References

Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J.M. and Brumberg, J.S. (2009). Silent speech interfaces. Speech Communication, 52(4), pp. 270--287.
Denby, B. and Stone, M. (2004). Speech synthesis from real time ultrasound images of the tongue. Internat. Conf. on Acoustics, Speech, and Signal Processing, Montreal, Canada, 1, pp. I685--I688.
Fagan, M.J., Ell, S.R., Gilbert, J.M., Sarrazin, E. and Chapman, P.M. (2008). Development of a (silent) speech recognition system for patients following laryngectomy. Med. Eng. Phys., 30(4), pp. 419--425.
Freitas, J., Teixeira, A., Dias, M.S. and Bastos, C. (2011). Towards a Multimodal Silent Speech Interface for European Portuguese. Speech Technologies, InTech.
Freitas, J., Teixeira, A., Vaz, F. and Dias, M.S. (2012). Automatic Speech Recognition based on Ultrasonic Doppler Sensing for European Portuguese. Advances in Speech and Language Technologies for Iberian Languages, vol. CCIS 328, Springer.
Freitas, J., Teixeira, A., Silva, S., Oliveira, C. and Dias, M.S. (2014). Velum Movement Detection based on Surface Electromyography for Speech Interface. Proceedings of Biosignals 2014, Angers, France.
Hueber, T., Chollet, G., Denby, B., Stone, M. and Zouari, L. (2007). Ouisper: Corpus Based Synthesis Driven by Articulatory Data. International Congress of Phonetic Sciences, Saarbrücken, pp. 2193--2196.
Plux Wireless Biosignals, Portugal (2014). Online: http://www.plux.info/, accessed on 17 March 2014.
Porbadnigk, A., Wester, M., Calliess, J. and Schultz, T. (2009). EEG-based speech recognition: impact of temporal effects. Biosignals 2009, Porto, Portugal, pp. 376--381.
Schultz, T. and Wand, M. (2010). Modeling coarticulation in large vocabulary EMG-based speech recognition. Speech Communication, 52(4), pp. 341--353.
Srinivasan, S., Raj, B. and Ezzat, T. (2010). Ultrasonic sensing for robust speech recognition. Internat. Conf. on Acoustics, Speech, and Signal Processing, pp. 5102--5105.
Strevens, P. (1954). Some observations on the phonetics and pronunciation of modern Portuguese. Rev. Laboratório Fonética Experimental, Coimbra II, pp. 5--29.
Tran, V.A., Bailly, G., Loevenbruck, H. and Jutten, C. (2008). Improvement to a NAM captured whisper-to-speech system. Proceedings of Interspeech 2008, pp. 1465--1468.
Wand, M. and Schultz, T. (2011). Investigations on Speaking Mode Discrepancies in EMG-based Speech Recognition. Proceedings of Interspeech 2011, Florence, Italy.
Zhu, B. (2008). Multimodal speech recognition with ultrasonic sensors. Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.