
Repositório ISCTE-IUL

Deposited in Repositório ISCTE-IUL:


2022-05-25

Deposited version:
Publisher Version

Peer-review status of attached file:


Peer-reviewed

Citation for published item:


Freitas, J., Teixeira, A. & Dias, J. (2014). Multimodal corpora for silent speech interaction. In Nicoletta
Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion
Moreno, Jan Odijk, Stelios Piperidis (Ed.), Proceedings of the Ninth International Conference on
Language Resources and Evaluation (LREC 2014). (pp. 4507-4511). Reykjavik: European Language
Resources Association (ELRA).

Further information on publisher's website:


--

Publisher's copyright statement:


This is the peer reviewed version of the following article: Freitas, J., Teixeira, A. & Dias, J. (2014).
Multimodal corpora for silent speech interaction. In Nicoletta Calzolari, Khalid Choukri, Thierry
Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios
Piperidis (Ed.), Proceedings of the Ninth International Conference on Language Resources and
Evaluation (LREC 2014). (pp. 4507-4511). Reykjavik: European Language Resources Association
(ELRA). This article may be used for non-commercial purposes in accordance with the Publisher's
Terms and Conditions for self-archiving.

Use policy

Creative Commons CC BY 4.0


The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or
charge, for personal research or study, educational, or not-for-profit purposes provided that:
• a full bibliographic reference is made to the original source
• a link is made to the metadata record in the Repository
• the full-text is not changed in any way
The full-text must not be sold in any format or medium without the formal permission of the copyright holders.

Serviços de Informação e Documentação, Instituto Universitário de Lisboa (ISCTE-IUL)


Av. das Forças Armadas, Edifício II, 1649-026 Lisboa Portugal
Phone: +(351) 217 903 024 | e-mail: [email protected]
https://repositorio.iscte-iul.pt
Multimodal Corpora for Silent Speech Interaction
João Freitas1,2, António Teixeira2, Miguel Sales Dias1,3
1 Microsoft Language Development Center, Lisboa, Portugal
2 Dep. Electronics Telecommunications & Informatics/IEETA, University of Aveiro, Portugal
3 ISCTE-Lisbon University Institute, Lisboa, Portugal
E-mail: [email protected], [email protected], [email protected]

Abstract
A Silent Speech Interface (SSI) allows speech communication to take place in the absence of an acoustic signal. This type of
interface is an alternative to conventional Automatic Speech Recognition, which is not adequate for users with certain speech
impairments or in the presence of environmental noise. The work presented here creates the conditions to explore and analyze
complex combinations of input modalities applicable in SSI research. Focusing on non-invasive and promising modalities, we have
selected the following sensing technologies used in human-computer interaction: Video and Depth input, Ultrasonic Doppler sensing
and Surface Electromyography. This paper describes a novel data collection methodology in which these independent streams of
information are synchronously acquired with the aim of supporting research and development of a multimodal SSI. The reported
recordings were divided into two rounds: a first one in which the prompts were silently uttered and a second round in which the
speakers pronounced the scripted prompts in an audible and normal tone. In the first round of recordings, a total of 53.94 minutes
were captured, of which 30.25% were estimated to be silent speech. In the second round of recordings, a total of 30.45 minutes were
obtained, of which 30.05% were audible speech.

Keywords: Silent Speech, Multimodal HCI, Data Collection

1. Introduction

Silent Speech designates the process of speech communication in the absence of an audible and intelligible acoustic signal (Denby et al., 2009). By extracting information from the human speech production process, an SSI is able to interpret and process the acquired data. Several SSIs based on different sensory types of data have been proposed in the literature (e.g. Electro-encephalographic sensors (Porbadnigk et al., 2009), Electromagnetic Articulography sensors (Fagan et al., 2008), etc.). Nonetheless, acquiring data from a single input modality limits the amount of useful information available for capture and further processing. Furthermore, in order to develop a multimodal SSI, it is necessary to collect data from multiple input modalities in a synchronous way, since no multimodal SSI data is currently available for research. However, satisfying the requirements and gathering all the necessary equipment for collecting such corpora is a complex and cumbersome task (Hueber et al., 2007). Hence, making multimodal corpora available to the community would not only increase the number of data resources accessible for further research, but would also pave the way for the development of a multimodal SSI, which could provide a more complete representation of the behavior of the speech production model during speech.

The work presented in this paper creates the conditions to explore and analyze more complex combinations of input modalities for SSI research. In exploring non-invasive and state-of-the-art modalities such as Ultrasonic Doppler (Srinivasan et al., 2010), we have selected several sensing technologies based on the following criteria: the possibility of being used in a natural manner, without complex medical procedures from the ethical and clinical perspectives; low cost; tolerance to noisy environments; and the ability to work with speech-handicapped users or elderly people, for whom speaking requires a substantial effort. Based on these requirements, we collected data from four SSI modalities with the following specifications: (1) video input, which captures the RGB color of each image pixel of the speakers' mouth region and its surroundings, including chin and cheeks; (2) depth input, which captures depth information for each pixel of the same areas (resulting, in this case, in a 3D point cloud in the sensor reference frame, represented by a grayscale image), providing useful information about mouth opening and, in some cases, tongue position; (3) surface EMG (sEMG) sensory data, which provides information about the myoelectric signal produced by the targeted facial muscles during speech movements; (4) Ultrasonic Doppler Sensing (UDS), a technique based on the emission of a pure tone in the ultrasound range towards the speaker's face, which is received by an ultrasound sensor tuned to the transmitted frequency. The reflected signal then contains Doppler frequency shifts that correlate with the movements of the speaker's face (Srinivasan et al., 2010).

Several studies combining two input modalities in addition to audio can be found in the literature (e.g. Denby and Stone (2004) and Tran et al. (2008)). Nonetheless, to the best of our knowledge, this is the first silent speech corpus that combines more than two input data types, and the first to synchronously combine the corresponding four modalities, thus providing the necessary information for future studies and research on multimodal SSIs.

2. Data Collection Setup

After assembling all the necessary data collection equipment, which, in the case of ultrasound, led us to develop custom-built equipment based on the work of Zhu (2008), we needed to create the necessary conditions to record all signals with adequate synchronization. The challenge of synchronizing all signals resided in the fact that a potential synchronization event would need to be captured simultaneously by all (four) input modalities. To that purpose, we selected the sEMG recording device, which had an available output channel, as the source that generates the alignment pulse for all the remaining modalities. After the data collection system setup was ready, the database described in this paper was collected for further analysis.

2.1 The individual data input modalities

The devices employed in this data collection, depicted in Figure 1, were: (1) a Microsoft Kinect for Windows, which acquires visual and depth information; (2) an sEMG sensor acquisition system from Plux (2014), which captures the myoelectric signal from the facial muscles; (3) a custom-built dedicated circuit board (referred to as the UDS device), which includes: 2 ultrasound transducers (400ST and 400SR, working at 40 kHz), a crystal oscillator at 7.2 MHz, frequency dividers to obtain 40 kHz and 36 kHz, and all the amplifiers and linear filters needed to process the echo signal (Freitas et al., 2012).

Figure 1: Acquisition devices and laptop with the data collection application running.

The Kinect sensor was placed at approximately 70cm from the speaker. It was configured, using Kinect SDK 1.5, to capture a color video stream with a resolution of 640x480 pixels, 24-bit RGB, at 30 frames per second, and a depth stream with a resolution of 640x480 pixels, 11 bits to code the Z dimension, at 30 frames per second. Kinect was also configured to use the Near Depth range (i.e. a range between 40cm and 300cm) and to track a seated skeleton.

The sEMG acquisition system consisted of 5 pairs of EMG surface electrodes connected to a device that communicates with a computer via Bluetooth. As depicted in Figure 2, the sensors were attached to the skin using single-use 2.50cm diameter clear plastic self-adhesive surfaces, with an approximate 2.00cm spacing between the electrode centers for bipolar configurations. Before placing the surface EMG sensors, the sensor locations were cleaned with alcohol. While uttering the prompts, no movement other than the one associated with speech production was made. The five electrode pairs were placed in order to capture the myoelectric signal from the following muscles: the zygomaticus major (channel 2); the tongue (channels 1 and 5); the anterior belly of the digastric (channel 1); the platysma (channel 4); and the last electrode pair was placed below the ear, between the mastoid process and the mandible. The sEMG channels 1 and 4 used a monopolar configuration (i.e. one of the electrodes of the respective pair was placed in a location with low or negligible muscle activity), with the reference electrodes placed on the mastoid portion of the temporal bone. The positioning of EMG electrodes 1, 2, 4 and 5 was based on previous work (e.g. Schultz and Wand, 2010), and the sEMG electrode from channel 3 was placed according to recent findings by the authors on the detection of nasality in SSIs (Freitas et al., 2014), a distinct characteristic of European Portuguese (EP) (Strevens, 1954).

Figure 2: Surface EMG electrodes positioning and the respective channels (1 to 5), plus the reference electrode (R).

The Ultrasonic Doppler sensing device was placed at approximately 40cm from the speaker and was connected to an external sound board (a Roland UA-25 EX in the first setup and a TASCAM US-1641 in the second setup), which in turn was connected to the laptop through a USB connection. Two recording channels of the external sound board were connected to the I/O channel of the sEMG recording device and to the UDS device. The Doppler echo and the synchronization signals were sampled at 44.1 kHz and, to facilitate signal processing, a frequency translation was applied to the carrier by modulating the echo signal with a sine wave and low-passing the result, obtaining a similar frequency-modulated signal centered at 4 kHz.
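The frequency translation just described can be illustrated with the following numpy sketch, which heterodynes a synthetic 40 kHz echo down to a 4 kHz centre frequency by mixing it with a 36 kHz sine and low-pass filtering. The paper does not specify whether this step is performed in the analog front end or in software, so the sample rate, filter order and modulation values below are illustrative assumptions only.

import numpy as np
from scipy.signal import butter, filtfilt

fs = 192_000                                   # illustration-only rate, high enough to represent 40 kHz
t = np.arange(0, 0.5, 1.0 / fs)

# Synthetic stand-in for the received echo: a 40 kHz carrier with a small
# Doppler-like frequency modulation (the real echo comes from the UDS device).
doppler_hz = 50 * np.sin(2 * np.pi * 3 * t)    # assumed +/-50 Hz excursion at a 3 Hz movement rate
echo = np.cos(2 * np.pi * 40_000 * t + 2 * np.pi * np.cumsum(doppler_hz) / fs)

# Frequency translation: mixing with a 36 kHz sine creates components at the
# difference (40 - 36 = 4 kHz) and sum (76 kHz) frequencies; the low-pass
# filter keeps only the difference band, i.e. the same FM signal centred at 4 kHz.
mixed = echo * np.sin(2 * np.pi * 36_000 * t)
b, a = butter(6, 8_000 / (fs / 2))             # cutoff well below the 76 kHz sum component
translated = filtfilt(b, a, mixed)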

2.2 Registration of all input modalities

In order to register all the mentioned input modalities via time alignment between the corresponding input streams, we used an I/O bit flag in the sEMG recording device, which has one input switch for debugging purposes and two output connections, as depicted in Figure 4. Synchronization occurs when the output of a synch signal, programmed to be automatically emitted by the sEMG device at the beginning of each prompt, is used to drive an LED and to feed an additional channel of an external sound card. Registration between the video and depth streams is ensured by the Kinect SDK.

Using the information from the LED and the auxiliary audio channel carrying the synch info, the signals were time-aligned offline. To align the RGB video and depth streams with the remaining modalities, we used an image template matching technique that automatically detects the LED position in each color frame.

For the UDS acquisition system, the activation of the output I/O flag of the sEMG recording device generates a small voltage peak on the signal of the first channel. To enhance and detect that peak, a second-degree derivative is applied to the signal, followed by an amplitude threshold. To be able to detect this peak, we previously configured the external sound board channel with maximum input sensitivity.

The time alignment of the EMG signals is ensured by the sEMG recording device, since the I/O flag is recorded synchronously with the samples of each channel.

Figure 4: Diagram of the time alignment scheme showing the I/O channel connected to the three outputs: debug switch, external sound card and a directional LED.
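A minimal sketch of the peak-detection step described above, assuming the synchronization channel has already been read from the recorded prompt wave; the threshold value and the channel layout used in the usage comment are assumptions, not values reported in the paper.

import numpy as np

def find_sync_pulse(sync_channel, threshold):
    """Locate the synchronization transient produced by the sEMG I/O flag on the
    sound-card channel: enhance it with a second-order difference (a discrete
    second-degree derivative) and apply an amplitude threshold."""
    second_diff = np.diff(sync_channel, n=2)
    candidates = np.flatnonzero(np.abs(second_diff) > threshold)
    return int(candidates[0]) if candidates.size else None

# Hypothetical usage: 'wav' holds the two-channel prompt recording sampled at
# 44.1 kHz, with the synchronization signal on the second channel.
# offset_samples = find_sync_pulse(wav[:, 1], threshold=0.1)
# The UDS stream can then be shifted by offset_samples so that time zero matches
# the sEMG I/O flag, as in the offline alignment described above.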
2.3 Acquisition Methodologies

The recordings took place in a quiet room with controlled illumination, with an assistant responsible for monitoring the data acquisition and for pushing a record/stop button in the recording tool interface in order to avoid unwanted muscle activity.

The data acquisition is divided into two distinct rounds, hereon referred to as the first and second rounds of recordings. The main difference between them is the acquisition of an audible acoustic signal (second round) versus silently articulating the words (first round).

The first round of our database contains the recordings of 9 sessions of 8 native EP speakers (one speaker recorded two sessions), 2 female and 6 male, with no history of hearing or speech disorders, an age range from 25 to 35 years old and an average age of 30 years. Due to hardware limitations and the differences found between silently articulated speech and audibly uttered speech, related with the lack of acoustic feedback (Wand and Schultz, 2011), in this first round we chose to record only silent speech. Thus, no audible acoustic signal was produced by the speakers during the recordings, and only one speaker had past experience with silent articulation.

In the second round, the previous sound card was replaced by a TASCAM US-1641, as depicted in Figure 3, and, for comparison purposes and to take advantage of the extra input channels provided by this device, we decided to collect a second round of recordings in which the audio channel from the UDS device is also synchronously acquired. As such, in this round we collected 3 speakers: one from the previous data collection and two elderly speakers with no known history of speech disorders, also native EP speakers. The first speaker was a 31-year-old male and the two elderly speakers were female, 65 and 71 years old, respectively. In this second stage of data collection, each speaker recorded two sessions without removing the EMG electrodes or changing the recording position.

Figure 3: TASCAM US-1641 device used in the second round of recordings.

Before each recording session, the participants received a 30-minute briefing that included instructions, speaker preparation and the voluntary signing of a consent form which accurately described the experiment, its duration and what kind of data was going to be collected. Each recording session took between 40 and 60 minutes, generating an average of 3.81GB of data per speaker, which includes: session metadata, such as device configuration; RGB and depth information of a 128x128 pixel square centered at the mouth center, and the coordinates of 100 facial points in the sensor reference frame, for each Kinect image; sEMG data from the 5 available channels; a two-channel wave file per prompt containing the UDS and synchronization signals; and a compressed video of the whole session. In the second round of recordings, we recorded a three-channel wave file containing the audio from the UDS device microphone, the ultrasonic signal and the synchronization signal.
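The per-session contents listed above can be pictured with the purely hypothetical manifest below; the actual file names and on-disk layout of the corpus are not specified in the paper.

# Hypothetical manifest of one recorded session (names are illustrative only).
session_contents = {
    "metadata": "session.xml",                  # device configuration and session information
    "kinect": {
        "mouth_roi": "prompt_###_kinect.bin",   # 128x128 RGB + depth square centred on the mouth
        "face_points": "prompt_###_points.csv", # 100 facial point coordinates, sensor reference frame
    },
    "semg": "prompt_###_emg.dat",               # 5 sEMG channels plus the recorded I/O flag
    "uds": "prompt_###_uds.wav",                # 2 channels in round 1, 3 channels in round 2
    "session_video": "session_overview.avi",    # compressed video of the whole session
}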
2.4 Corpora

For this data collection we selected a vocabulary of 32 EP words, which can be divided into 3 distinct sets. The first set, used in previous work for other languages (e.g. Srinivasan et al., 2010) and for EP in prior work of the authors (Freitas et al., 2012), consists of the 10 digits from zero to nine. The second set contains 4 minimal pairs of common EP words that only differ in the nasality of one of the phones (e.g. Cato/Canto [katu]/[kɐ̃tu] or Peta/Penta [petɐ]/[pẽtɐ]; see Freitas et al. (2011) for more details), and is directly related with previous investigation by the authors on the detection of nasality with SSIs. Table 1 shows the last (third) set, with 14 common EP words taken from context-free grammars of an Ambient Assisted Living (AAL) application that supports speech input, chosen based on past experience of the authors (Teixeira et al., 2012). A total of 99 scripted prompts per session were presented to the speaker (three additional silence prompts were also included at the beginning, middle and end of the session), in random order, with each prompt being pronounced individually in order to allow isolated word recognition. All prompts were repeated 3 times per recording session.

Ambient Assisted Living Word Set
Videos (Videos)       Ligar (Call/Dial)       Contatos (Contacts)         Mensagens (Messages)   Voltar (Back)
Pesquisar (Search)    Anterior (Previous)     Fotografias (Photographs)   Família (Family)       Ajuda (Help)
Seguinte (Next)       Lembretes (Reminders)   Calendário (Calendar)       E-Mail (E-Mail)        -

Table 1: Set of words of the EP vocabulary, extracted from AAL contexts.
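The prompt scripting described in this section can be sketched as follows; the exact randomization procedure of the recording tool is not described in the paper, so this is only an illustrative reading of the text (3 repetitions of the 32 words in random order, with silence prompts at the beginning, middle and end of the session).

import random

def build_session_script(vocabulary, repetitions=3, seed=None):
    """Assemble one session script: every word repeated `repetitions` times,
    shuffled, with silence prompts inserted at the beginning, middle and end."""
    rng = random.Random(seed)
    prompts = [word for word in vocabulary for _ in range(repetitions)]
    rng.shuffle(prompts)
    prompts.insert(0, "<silence>")
    prompts.insert(len(prompts) // 2, "<silence>")
    prompts.append("<silence>")
    return prompts

# Usage with the 32-word EP vocabulary of this section (digits, nasal minimal
# pairs and the AAL words of Table 1):
# script = build_session_script(ep_vocabulary)   # 96 word prompts + 3 silence prompts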
3. Characterization of the Acquired Database

In this section we present some statistics of the acquired data. In the first round of recordings no audio was collected, thus an automatic algorithm was used to estimate speech statistics. For the second round of recordings, audible utterances were recorded and the audio was used as auxiliary information for manually annotating the data.

3.1 First Round of Recordings

The data collected in the first round of recordings has a total elapsed duration of 56.11 minutes, with an average duration of 5.99 minutes per session and 3.74 seconds per utterance, not considering silence utterances. By applying a Voice Activity Detection (VAD) technique based on UDS alone, we estimate that 30.25% is silent speech (i.e. continuous facial movements) and that 69.75% is the silence before and after each utterance. The VAD algorithm uses the energy of the UDS pre-processed spectrum information around the carrier and a mean reference value, extracted from the silence prompts of each speaker, to distinguish silent articulation. Each session presents an average speech duration of 1.81 minutes and 4.18 minutes of non-speech. The female speakers had an average speech duration of 42.79% per session, while this figure for male speakers was only 23.29%. Table 2 details the audio duration of the collected data by word set.

Word Set        Total Recorded Duration (minutes)   Silent Speech   Non-Speech
Digits          15.28                               26.78%          73.22%
Nasal Pairs     13.02                               28.90%          71.10%
AAL             25.63                               33.00%          67.00%
All word sets   53.94                               30.25%          69.75%

Table 2: Audio duration, speech time and non-speech time distribution by word set (excluding silence utterances) for the first round of recordings.
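One plausible reading of the UDS-based VAD just described is sketched below: the energy in a band around the translated 4 kHz carrier is compared against a reference level taken from the speaker's silence prompts. The band width, STFT parameters and decision margin are assumptions for illustration; the paper does not report the values actually used.

import numpy as np
from scipy.signal import stft

def uds_vad(signal, silence_reference, fs=44_100, carrier=4_000, band=500, margin=1.5):
    """Frame-level silent-articulation decision from the energy around the carrier.
    `silence_reference` is the mean band energy measured on the speaker's silence
    prompts; frames whose band energy exceeds it by `margin` are marked as speech."""
    f, t, Z = stft(signal, fs=fs, nperseg=1024)
    in_band = (f >= carrier - band) & (f <= carrier + band)
    band_energy = (np.abs(Z[in_band, :]) ** 2).sum(axis=0)
    return t, band_energy > margin * silence_reference

# A refinement (also not detailed in the paper) would be to exclude the carrier
# bin itself, so that only the Doppler sidebands produced by facial movement
# contribute to the decision.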
3.2 Second Round of Recordings

In the second round of recordings, since synchronously acquired audio was available, the estimation of the speech and non-speech characteristics was performed based on the manual annotation of the speech signal by the first author. As described in Table 3, this second round has a total elapsed duration of 30.45 minutes, with an average duration of 5.07 minutes per session and 3.17 seconds per utterance.

Word Set        Total Recorded Duration (minutes)   Speech    Non-Speech
Digits          8.78                                28.13%    71.87%
Nasal Pairs     7.48                                26.87%    73.13%
AAL             14.20                               32.91%    67.09%
All Word Sets   30.45                               30.05%    69.95%

Table 3: Audio duration, speech time and non-speech time distribution by word set (excluding silence utterances) for the second round of recordings.

Table 4 presents the session statistics for the first and second rounds. Based on these values, a larger duration can be noticed for the sessions where only silent speech was considered. This suggests a slower articulation when no acoustic feedback is present; however, it might also be related to, or influenced by, the lack of experience of most speakers in articulating the words without any acoustic feedback.

Data Collection Stage   Average Duration per session (minutes)   Average Speech per session (minutes)   Average Non-Speech per session (minutes)
1st round               5.99                                     1.81                                   4.18
2nd round               5.07                                     1.52                                   3.55

Table 4: Average duration, speech time and non-speech time per session for the first and second rounds of recordings.

If, instead of estimating the characteristics of the first round with the automatic algorithm, we use the speech/non-speech distribution estimated in the second round and apply it to the average duration per session of the first round, we obtain 1.80 minutes of speech and 4.19 minutes of non-speech data, a result similar to the one obtained with the UDS algorithm.
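The figures quoted above follow directly from applying the second-round speech proportion to the first-round average session duration:

speech_ratio = 0.3005       # speech proportion in the second round (Table 3, all word sets)
session_minutes = 5.99      # average first-round session duration (Table 4)
speech_minutes = speech_ratio * session_minutes             # ~1.80 minutes
non_speech_minutes = (1 - speech_ratio) * session_minutes   # ~4.19 minutes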

4. Conclusion

This paper describes a multimodal data collection with 5 streams of data: Video, Depth, Surface EMG, Ultrasonic Doppler Sensing and audio. By using the surface EMG recording device we were able to synchronously combine these silent speech modalities and acquire information from multiple stages of the human speech production model. The data collection is divided into two rounds of recordings: in the first round only silent speech was recorded (i.e. no acoustic signal was produced by the speaker); in the second round, audible speech was captured in addition to the remaining modalities. We have also used an algorithm based on UDS energy to estimate total speech time in the absence of the acoustic signal, and report some statistics on how the data is distributed.

5. Future Work

The collected data opens several doors in terms of future research. It will potentially allow for the development of a multimodal SSI based on these modalities, where the strongest points of one modality can help to compensate for the weakest points of the other(s). It will also allow looking at other types of information, beyond the acoustic signal, for interesting research issues such as elderly speech characteristics and the production and recognition of nasal sounds.

6. Acknowledgements

This work was partially funded by Marie Curie Actions Golem (ref. 251415, FP7-PEOPLE-2009-IAPP) and IRIS (ref. 610986, FP7-PEOPLE-2013-IAPP), by FEDER through the Program COMPETE under the scope of QREN 5329 FalaGlobal, and by National Funds (FCT - Foundation for Science and Technology) in the context of IEETA Research Unit funding FCOMP-01-0124-FEDER-022682 (FCT-PEst-C/EEI/UI0127/2011). The authors would also like to thank the experiment participants.

7. References

Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J.M. and Brumberg, J.S. (2009). Silent speech interfaces. Speech Communication, 52(4), pp. 270--287.
Denby, B. and Stone, M. (2004). Speech synthesis from real time ultrasound images of the tongue. Internat. Conf. on Acoustics, Speech, and Signal Processing, Montreal, Canada, 1, pp. I685--I688.
Fagan, M.J., Ell, S.R., Gilbert, J.M., Sarrazin, E. and Chapman, P.M. (2008). Development of a (silent) speech recognition system for patients following laryngectomy. Med. Eng. Phys., 30(4), pp. 419--425.
Freitas, J., Teixeira, A., Dias, M.S. and Bastos, C. (2011). Towards a Multimodal Silent Speech Interface for European Portuguese. Speech Technologies, InTech.
Freitas, J., Teixeira, A., Vaz, F. and Dias, M.S. (2012). Automatic Speech Recognition based on Ultrasonic Doppler Sensing for European Portuguese. Advances in Speech and Language Technologies for Iberian Languages, vol. CCIS 328, Springer.
Freitas, J., Teixeira, A., Silva, S., Oliveira, C. and Dias, M.S. (2014). Velum Movement Detection based on Surface Electromyography for Speech Interface. Proceedings of Biosignals 2014, Angers, France.
Hueber, T., Chollet, G., Denby, B., Stone, M. and Zouari, L. (2007). Ouisper: Corpus Based Synthesis Driven by Articulatory Data. International Congress of Phonetic Sciences, Saarbrücken, pp. 2193--2196.
Plux Wireless Biosignals, Portugal (2014). Online: http://www.plux.info/, accessed on 17 March 2014.
Porbadnigk, A., Wester, M., Calliess, J. and Schultz, T. (2009). EEG-based speech recognition: impact of temporal effects. Biosignals 2009, Porto, Portugal, pp. 376--381.
Schultz, T. and Wand, M. (2010). Modeling coarticulation in large vocabulary EMG-based speech recognition. Speech Communication, 52(4), pp. 341--353.
Srinivasan, S., Raj, B. and Ezzat, T. (2010). Ultrasonic sensing for robust speech recognition. Internat. Conf. on Acoustics, Speech, and Signal Processing, pp. 5102--5105.
Strevens, P. (1954). Some observations on the phonetics and pronunciation of modern Portuguese. Rev. Laboratório Fonética Experimental, Coimbra II, pp. 5--29.
Tran, V.A., Bailly, G., Loevenbruck, H. and Jutten, C. (2008). Improvement to a NAM captured whisper-to-speech system. Proceedings of Interspeech 2008, pp. 1465--1468.
Wand, M. and Schultz, T. (2011). Investigations on Speaking Mode Discrepancies in EMG-based Speech Recognition. Proceedings of Interspeech 2011, Florence, Italy.
Zhu, B. (2008). Multimodal speech recognition with ultrasonic sensors. Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts.

