US6487531

The patent US 6,487,531 B1 describes a method for enhancing or replacing the natural excitation of the human vocal tract using artificial excitation means to improve voice recognition systems. This innovation aims to increase the robustness of recognition for both audible and inaudible speech, addressing challenges such as continuous speech recognition and variations in speaker voice patterns. The approach allows for improved accuracy and adaptability in speech recognition applications, even in noisy environments or when users speak softly.


(12) United States Patent                    (10) Patent No.: US 6,487,531 B1
Tosaya et al.                                (45) Date of Patent: Nov. 26, 2002

(54) SIGNAL INJECTION COUPLING INTO THE HUMAN VOCAL TRACT FOR ROBUST AUDIBLE AND INAUDIBLE VOICE RECOGNITION

(76) Inventors: Carol A. Tosaya, 897 Madonna Way, Los Altos, CA (US) 94024; John W. Sliwa, Jr., 897 Madonna Way, Los Altos, CA (US) 94024

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/348,600
(22) Filed: Jul. 6, 1999
(51) Int. Cl.: G10L 15/00
(52) U.S. Cl.: 704/246; 704/223
(58) Field of Search: 704/246, 223, 212, 251, 207, 264, 208

(56) References Cited

U.S. PATENT DOCUMENTS
3,766,318 A 10/1973 Webb
4,039,756 A 8/1977 Burtschi 179/1 AL
4,223,411 A 9/1980 Schoendorfer et al. 623/9
4,473,905 A * 9/1984 Katz et al. 381/70
4,502,150 A 2/1985 Katz et al. 381/70
4,520,499 A 5/1985 Montlick et al. 381/36
4,691,360 A 9/1987 Bloomfield 381/70
4,706,292 A 11/1987 Torgeson 381/70
4,709,390 A * 11/1987 Atal et al. 704/262
RE32,580 E * 1/1988 Atal et al. 704/219
4,821,326 A * 4/1989 MacLeod 704/261
4,993,071 A 2/1991 Griebel 381/70
5,111,501 A 5/1992 Shimanuki 379/355
5,326,349 A 7/1994 Baraff 623/9
5,390,278 A 2/1995 Gupta et al. 395/2.52
[two entries illegible in scan]
5,586,215 A 12/1996 Stork et al. 395/2.41
5,596,676 A 1/1997 Swaminathan et al. 395/2.17
5,621,809 A 4/1997 Bellegarda et al. 382/116
5,640,485 A 6/1997 Ranta 395/2.6
5,640,490 A 6/1997 Hansen et al. 395/263
5,664,052 A 9/1997 Nishiguchi et al. 704/214
5,706,397 A 1/1998 Chow 395/2.52
5,729,694 A * 3/1998 Holzrichter et al. 705/17
5,752,001 A 5/1998 Dulong 395/500
[further entries illegible in scan]

FOREIGN PATENT DOCUMENTS
WO 9711453 A1 3/1997 G10L/3/00

OTHER PUBLICATIONS
M. Al-Akaidi, "Simulation Model Of The Vocal Tract Filter For Speech Synthesis", Simulation, vol. 67, no. 4, pp. 241-246 (Oct. 1996).
B. Bergeron, "Using An Intraural Microphone Interface For Improved Speech Recognition", Collegiate Microcomputer, vol. 8, no. 3, pp. 231-238 (Aug. 1990).
A. Syrdal et al., Applied Speech Technology, CRC Press, p. 28 (1995).
(List continued on next page.)

Primary Examiner: Marsha D. Banks-Harold
Assistant Examiner: Daniel Abebe
(74) Attorney, Agent, or Firm: David W. Collins

(57) ABSTRACT

A means and method are provided for enhancing or replacing the natural excitation of the human vocal tract by artificial excitation means, wherein the artificially created acoustics present additional spectral, temporal, or phase data useful for (1) enhancing the machine recognition robustness of audible speech or (2) enabling more robust machine recognition of relatively inaudible mouthed or whispered speech. The artificial excitation (a) may be arranged to be audible or inaudible, (b) may be designed to be non-interfering with another user's similar means, (c) may be used in one or both of a vocal content-enhancement mode or a complementary vocal tract-probing mode, and/or (d) may be used for the recognition of audible or inaudible continuous speech or isolated spoken commands.

32 Claims, 3 Drawing Sheets

[Front-page figure (corresponding to FIG. 4): an overall speech signal is split by a separator algorithm into a natural speech signal and an artificial speech excitation signal; each path has its own representation and modeling/classification stage (30/32 and 30a/32a), feeding a search (40) that draws on excitation training data and artificial speech models to output recognized words (44).]
US 6,487,531 B1
Page 2

OTHER PUBLICATIONS (continued)

R. Cole et al., Eds., Survey of the State of the Art in Human Language Technology, Cambridge University Press and Giardini Editori e Stampatori in Pisa, vol. XII, XIII, Sections 9.4-9.6 (1997).
J. Epps et al., "A novel instrument to measure acoustic resonances of the vocal tract during phonation", Meas. Sci. and Technol., vol. 8, pp. 1112-1121 (1997).
D. Maurer et al., "Re-examination of the relation between the vocal tract and the vowel sound with electromagnetic articulography (EMA) in vocalization", Clinical Linguistics and Phonetics, vol. 7, no. 2, pp. 129-143 (1993).
Eagle Tactical Headset, http://www.streetpro.com/bonephone.html.
Pama Throat Microphone, http://www.pama.co.uk/pr01.html.
J. F. Holzrichter et al., "Speech articulator measurements using low power EM-wave sensors", http://speech.llnl.gov/SpeechArtMeasure.html.

* cited by examiner
U.S. Patent    Nov. 26, 2002    Sheet 1 of 3    US 6,487,531 B1

[FIG. 1 (prior art): source/filter model of the human vocal tract. An impulse train generator (10) and a random number generator (12), selected by a switch and scaled by a gain control, drive a time-varying digital filter whose coefficients (vocal tract parameters) are set by the pitch period and articulator positions, producing speech samples.]

[FIG. 2 (prior art): generic speech recognition system. A speech signal (28) passes through representation (30) and modeling/classification (32) into a search (34) that consults acoustic, lexical, and language models built from training data (42) to output recognized words (44).]
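The source/filter structure of FIG. 1 can be sketched in a few lines of code. The following is a generic illustration only, not an implementation from the patent: the sample rate, pitch period, and filter coefficients are invented for the example, and a fixed two-pole resonator stands in for the time-varying vocal-tract filter.

```python
import random

FS = 8000  # sample rate in Hz (illustrative)

def excitation(n, voiced, pitch_period=80, seed=0):
    # FIG. 1 sources: an impulse train (vibrating vocal chords, block 10)
    # when voiced, or white noise (aspiration, block 12) when unvoiced;
    # the boolean argument plays the role of the switch (18).
    if voiced:
        return [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def vocal_tract_filter(x, coeffs=(1.3, -0.6)):
    # Stand-in for the time-varying digital filter (22): a fixed, stable
    # two-pole recursion y[t] = x[t] + a1*y[t-1] + a2*y[t-2].
    y = []
    for t, s in enumerate(x):
        v = s
        for k, a in enumerate(coeffs, start=1):
            if t - k >= 0:
                v += a * y[t - k]
        y.append(v)
    return y
```

Swapping `coeffs` over time would play the role of block 24, the articulator-driven updating of the filter settings as speech proceeds.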
U.S. Patent    Nov. 26, 2002    Sheet 2 of 3    US 6,487,531 B1

[FIG. 3: the source/filter model of FIG. 1 supplemented by an artificial exciter (46, 48) alongside the impulse train generator (10) and random number generator (12); the time-varying digital filter now produces audible and/or inaudible speech samples.]

[FIG. 4: the invention integrated into a recognition system with separate processing paths. The overall speech signal enters a separator algorithm (56, 58) that splits it into a natural speech signal (28a) and an artificial speech excitation signal (60); each path has its own representation and modeling/classification stage, and excitation training data and artificial speech models support the search that outputs recognized words.]
U.S. Patent    Nov. 26, 2002    Sheet 3 of 3    US 6,487,531 B1

[FIG. 5: the invention with joint processing. Training data covers both natural and artificial speech; the overall speech signal passes through representation, modeling/classification, and a search using acoustic, lexical, and language models to output recognized words.]
SIGNAL INJECTION COUPLING INTO THE HUMAN VOCAL TRACT FOR ROBUST AUDIBLE AND INAUDIBLE VOICE RECOGNITION

TECHNICAL FIELD

The present invention is directed generally to voice recognition, and, more particularly, to a means and method for enhancing or replacing the natural excitation of a living body's vocal tract by artificial excitation means.

BACKGROUND ART

The ability to vocally converse with a computer is a grand and worthy goal of hundreds of researchers, universities and institutions all over the world. Such a capability is widely expected to revolutionize communications, learning, commerce, government services and many other activities by making the complexities of technology transparent to the user. In order to converse, the computer must first recognize what words are being said by the human user and then must determine the likely meaning of those words and formulate meaningful and appropriate ongoing responses to the user. The invention herein addresses the recognition aspect of the overall speech understanding problem.

It is well known that the human vocal system can be roughly approximated as a source driving a digital (or analog) filter; see, e.g., M. Al-Akaidi, "Simulation model of the vocal tract filter for speech synthesis", Simulation, vol. 67, no. 4, pp. 241-246 (October 1996). The source is the larynx and vocal chords and the filter is the set of resonant acoustic cavities and/or resonant surfaces created and modified by the many movable portions (articulators) of the throat, tongue, mouth/throat surfaces, lips and nasal cavity. These include the lips, mandible, tongue, velum and pharynx. In essence, the source creates one or both of a quasi-periodic vibration (voiced sounds) or a white noise (unvoiced sounds), and the many vocal articulators modify that excitation in accordance with the vowels, consonants or phonemes being expressed. In general, the frequencies between 600 to 4,000 Hertz contain the bulk of the necessary acoustic information for human speech perception (B. Bergeron, "Using an intraural microphone interface for improved speech recognition", Collegiate Microcomputer, vol. 8, no. 3, pp. 231-238 (August 1990)), but there is some human-hearable information all the way up to 10,000 hertz or so and some important information below 600 hertz. The variable set of resonances of the human vocal tract are referred to as formants and are indicated as F1, F2 .... In general, the lower frequency formants F1 and F2 are usually in the range of 250 to 3,000 hertz and contain a major portion of human-hearable information about many articulated sounds and phonemes. Although the formants are principal features of human speech, they are by far not the only features, and even the formants themselves dynamically change frequency and amplitude, depending on context, speaking rate, and mood. Indeed, only experts have been able to manually determine what a person has said based on a printout of the spectrogram of the utterance, and even this analysis contains best-guesses. Thus, automated speech recognition is one of the grand problems in linguistic and speech sciences. In fact, only the recent application of trainable stochastic (statistics-based) models using fast microprocessors (e.g., 200 MHz or higher) has resulted in 1998's introduction of inexpensive continuous speech (CS) software products. In the stochastic models used in such software, referred to as Hidden Markov Models (HMMs), the statistics of varying annunciation and temporal delivery are statistically captured in oral training sessions and made available as models for the internal search engine(s).

Major challenges to speech recognition software and systems development progress have historically been that (a) continuous speech (CS) is very much more difficult to recognize than single isolated-word speech and (b) different speakers have very different voice patterns from each other. The former is primarily because in continuous speech, we pronounce and enunciate words depending on their context, our moods, our stress state, and on the speed with which we speak. The latter is because of physiological, age, sex, anatomical, regional accent, and other reasons. Furthermore, another major problem has been how to reproducibly get the sound (natural speech) into the recognition system without loss or distortion of the information it contains. It turns out that the positioning of and type of microphone(s) or pickups one uses are critical. Head-mounted oral microphones, and the exact positioning thereof, have been particularly thorny problems despite their superior frequency response. Some attempts to use ear pickup microphones (see, e.g., Bergeron, supra) have shown fair results despite the known poorer passage of high frequency content through the bones of the skull. This result sadly speaks volumes to the positioning difficulty implications of mouth microphones, which should give substantially superior performance based on their known and understood broader frequency content.

Recently, two companies, IBM and Dragon Systems, have offered commercial PC-based software products (IBM ViaVoice™ and Dragon NaturallySpeaking™) that can recognize continuous speech with fair accuracy after the user conducts carefully designed mandatory training or "enrollment" sessions with the software. Even with such enrollment, the accuracy is approximately 95% under controlled conditions involving careful microphone placement and minimal or no background noise. If, during use, there are other speakers in the room having separate conversations (or there are reverberant echoes present), then numerous irritating recognition errors can result. Likewise, if the user moves the vendor-recommended directional or noise-canceling microphone away, or too far, from directly in front of the lips, or speaks too softly, then the accuracy goes down precipitously. It is no wonder that speech recognition software is not yet significantly utilized in mission-critical applications.

The inventors herein address the general lack of robustness described above in a manner such that accuracy during speaking can be improved, training (enrollment) can be a more robust if not a continuous improvement process, and one may speak softly and indeed even "mouth words" without significant audible sound generation, yet retain recognition performance. Finally, the inventors have also devised a means for nearby and/or conversing speakers using voice-recognition systems to automatically have their systems adapted to purposefully avoid operational interference with each other. This aspect has been of serious concern when trying to insert voice recognition capabilities into a busy office area wherein numerous interfering (overheard) conversations cannot easily be avoided.

The additional and more reproducible artificial excitations of the invention may also be used to increase the acoustic uniqueness of utterances, thus speeding up speech recognition processing for a given recognition-accuracy requirement. Such a speedup could, for example, be realized from the reduction in the number of candidate utterances needing software-comparison.
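The HMM-based modeling and search described above can be illustrated with a minimal discrete-observation example. This is a generic sketch of the standard forward algorithm, not code from the patent; the two "word" models, the quantized acoustic symbols, and all probabilities are invented for illustration.

```python
def forward(obs, pi, trans, emit):
    # Standard HMM forward algorithm: returns P(obs | model) by summing
    # over all state paths.
    alpha = [pi[i] * emit[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * trans[j][i] for j in range(len(alpha))) * emit[i][o]
                 for i in range(len(alpha))]
    return sum(alpha)

# Two hypothetical word models over quantized acoustic symbols 0..2.
PI = [1.0, 0.0]
TRANS = [[0.6, 0.4], [0.0, 1.0]]
EMIT_YES = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]  # tends to emit 0 then 1
EMIT_NO = [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]   # tends to emit 2

def recognize(obs):
    # Search step: pick the word model with the higher likelihood.
    scores = {"yes": forward(obs, PI, TRANS, EMIT_YES),
              "no": forward(obs, PI, TRANS, EMIT_NO)}
    return max(scores, key=scores.get)
```

In a real recognizer the observations would be the parameter vectors produced by the representation stage, and the search would additionally weigh lexical and language models.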
In fact, such reductions in utterance identification possibilities also improve recognition accuracy, as there are fewer incorrect conclusions to be made.

Utterance or speech-recognition practiced using the invention may have any purpose including, but not limited to: (1) talking to, commanding or conversing with local or remote computers, computer-containing products, telephony products or speech-conversant products (or with other persons using them); (2) talking to or commanding a local or remote system that converts recognized speech or commands to recorded or printed text or to programmed actions of any sort (e.g.: voice-mail interactive menus, computer game control systems); (3) talking to another person(s) locally or remotely-located wherein one's recognized speech is presented to the other party as text or as a synthesized voice (possibly in his/her different language); (4) talking to or commanding any device (or connected person) discretely or in apparent silence; (5) user identification or validation wherein security is increased over prior art speech fingerprinting systems due to the additional information available in the speech signal or even the ability to manipulate artificial excitations oblivious to the user; (6) allowing multiple equipped speakers to each have their own speech recognized free of interference from the other audible speakers (regardless of their remote locations or collocation); (7) adapting a user's "speech" output to obtain better recognition-processing performance, as by adding individually-customized artificial content for a given speaker and making that content portable if not network-available. (This could also eliminate or minimize retraining of new recognition systems by new users.)

DISCLOSURE OF INVENTION

In accordance with the present invention, a means and method are disclosed for enhancing or replacing the natural excitation of the human vocal tract by artificial excitation means wherein the artificially created acoustics present additional spectral, temporal or phase data useful for (1) enhancing the machine recognition robustness of audible speech or (2) enabling more robust machine-recognition of relatively inaudible mouthed or whispered speech. The artificial excitation may be arranged to be audible or inaudible, may be designed to be non-interfering with another user's similar means, may be used in one or both of a vocal content-enhancement mode or a complementary vocal tract-probing mode, and may be used for the recognition of audible or inaudible continuous speech or isolated spoken commands.

Specifically, an artificial acoustic excitation means is provided for acoustic coupling into a functional vocal tract working in cooperation with a speech recognition system wherein the artificial excitation coupling characteristics provide(s) information useful to the identification of speech by the system.

The present invention extends the performance and applicability of speech-recognition in the following ways:

(1) Improves speech-recognition accuracy and/or speed for audible speech;
(2) Eliminates recognition-interference (accuracy degradation) due to competing speakers or voices (e.g., as in a busy office with many independent speakers);
(3) Newly allows for voice-recognition of silent or mouthed/whispered speech (e.g., for discretely interfacing with speech-based products and devices); and
(4) Improves security for speech-based user-identification or user-validation.

In essence, the human vocal tract is artificially excited, directly or indirectly, to produce sound excitations, which are articulated by the speaker. These sounds, because they are artificially excited, have far more latitude than the familiar naturally excited voiced and aspirated human sounds. For example, they may or may not be audible, may excite natural vocal articulators (audibly or inaudibly) and/or may excite new articulators (audibly or inaudibly). Artificially excited "speech" output may be superimposed on normal speech to increase the raw characteristic information content. Artificially excited output may be relatively or completely inaudible, thus also allowing for good recognition-accuracy while whispering or even mouthing words. Artificial content may help discern between competing speakers thus-equipped, whether they are talking to each other or are in separate cubicles. Artificial content may also serve as a user voiceprint.

Systems taking advantage of this technology may be used for continuous speech or command-style discrete speech. Such systems may be trained using one or both of natural speech and artificial speech.

The artificial excitations may incorporate any of several features including: (a) broadband excitation, (b) narrowband excitation(s) such as a harmonic frequency of a natural formant, (c) multiple tones wherein the tones phase-interact with articulation (natural speech hearing does not significantly involve phase), (d) excitations which are delivered (or processed) only as a function of the success of ongoing natural speech recognition, and (e) excitations which are feedback-optimized for each speaker.

The user need not be aware of the added acoustic information nor of its processing.

Consumer/business products incorporating the technology may include computers, PCs, office-wide systems, PDAs, terminals, telephones, games, or any speech-conversant, speech-controlled or sound-controlled appliance or product. For the discrete inaudible option, such products could be used in public with relative privacy. Additional police, military and surveillance products are likely.

Other objects, features, and advantages of the present invention will become apparent upon consideration of the following detailed description and accompanying drawings, in which like reference designations represent like features throughout the FIGURES.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings referred to in this description should be understood as not being drawn to scale except if specifically noted.

FIG. 1 is a prior-art schematic digital representation of the source/filter model of the human vocal tract;
FIG. 2 is a prior-art generic representation of a typical modern speech recognition system;
FIG. 3 is a schematic diagram of the invention in the form of a source/filter model showing it working to supplement the natural vocal chord/larynx excitation sources;
FIG. 4 is a schematic diagram of the invention as integrated into a speech recognition system wherein the natural and artificial speech signals undergo separate processing; and
FIG. 5 is a schematic diagram of the invention as integrated into a speech recognition system wherein the natural and artificial speech signals, or content, are processed together.

BEST MODES FOR CARRYING OUT THE INVENTION

Reference is now made in detail to a specific embodiment of the present invention, which illustrates the best mode
presently contemplated by the inventors for practicing the invention. Alternative embodiments are also briefly described as applicable.

Definitions

Natural exciter or excitation: The vocal chords/larynx or other acoustics-producing parts of a natural living or human body; and the acoustic excitation naturally produced by such parts or organs.

Artificial exciter or excitation: A man-made acoustic-producing device acoustically coupled, directly or indirectly, into the vocal tract; and the acoustic excitation injected or caused by the device.

Pickup: A device which converts acoustic energy into a processable form, such as a microphone. Typically used to detect output coming directly or indirectly from the vocal tract as a result of an excitation of the tract.

Natural acoustics, sound or signal: That which emanates from the vocal tract or from any body part acoustically coupled to the vocal tract in response to the natural excitation of the larynx/vocal chords or of any other natural anatomical sound-producing organ.

Artificial acoustics, sound or signal: That which emanates from the vocal tract or from any body part acoustically coupled to the vocal tract in response to the artificial excitation caused by a man-made exciter directly or indirectly coupled to the vocal tract.

Speech: Spoken or articulated sounds uttered or silently mouthed for communication or command-giving. In the case of the artificial excitation of the present invention, the speech signal which is generated by that portion of the total excitation may or may not be audible and may or may not itself be understandable to a human.

Background

FIG. 1 depicts a prior-art digital schematic representation of a source/filter model of the human vocal apparatus. Humans have two general kinds of natural sound excitations, or sources, capable of driving their many natural resonant structures. The first type are quasi-pitched vibratory tones coming from the vibrating vocal chords. The second type is "white noise" coming from air aspirated through the vocal chords while they are held open and are not significantly vibrating. In both cases, air is forced past the chords from the lungs. In general, vowels primarily utilize the vibrating vocal chords and a relatively open vocal tract (filter) and are termed "voiced". Also, in general, many of the consonants utilize aspiration "white noise" and a relatively closed vocal tract and are termed "unvoiced".

On the left-hand side of FIG. 1 are seen two blocks 10, 12 representing the two natural human excitation sources described above. The "impulse train generator" 10 represents the vibrating vocal chords capable of producing quasi-pitched vibrations or sounds 14. The "random number generator" 12 represents the "white noise" generated as air is forced past (aspirated past) the open relaxed vocal chords to produce aperiodic sound vibrations 16. It will be noted that a switch 18 is shown capable of switching the excitation source between either type. Humans, in general, switch back and forth between source types (voiced 14 and unvoiced sounds 16) as they speak. Also shown in FIG. 1 is an amplitude or gain control 20 capable of controlling the amplitude of either excitation source. Humans, by varying their lung pressure and vocal chord tension, can control the loudness of the excitations 14 or 16. Moving to the right in the schematic signal path of FIG. 1, a schematic "time-varying digital filter" 22 is depicted. This is the filter of the source/filter model. In essence, filter 22 is a set of the various acoustic filters or is a "filter network" representing the many articulators in the vocal tract. The cooperative moving of these articulators modifies the filtering properties such that different sounds can be generated from the limited excitation sources. In natural speech, the brain controls how the vocal tract articulators (lips, tongue, mouth, vocal chords, etc.) should be positioned or arranged to create excitation modification recognizable as vowels, consonants or phonemes. Block 24 represents the dynamic positioning process of the many articulators. Overall, for a given set of articulator positions, a combined setting for filter 22 is established. As speech takes place, the filter settings vary to cause the desired phonemes or speech sounds. A sample of articulated speech 26 is indicated coming out of the filter 22.

Before proceeding, it is useful to review what a generic prior-art modern speech recognition system looks like. Referring to FIG. 2, a natural speech signal 28 is depicted, perhaps the output of a headset microphone, passing into a box 30 labeled "representation". Typically, representation would consist of sampling the speech signal 28 every 10 or 20 msec at a rate between 6.6 and 20 kHz. These samples are typically processed to produce a sequence of vectors, each of which usually contains 10 to 20 characteristic parameters. Modeling and classification of these vectors is done in the "modeling/classification" box 32. Finally, a search means 34 with access to acoustic model(s) 36, lexical model(s) 38, and language model(s) 40 determines the most likely identity of the sounds and the words they make up. A "training data" block 42 represents the pre-learned "enrollment" knowledge taught to the system. Based on the training data 42 and analysis thereof, the system assembles models 36, 38, and 40 before the user proceeds with routine use of the system. Thus, generally, when one thereafter speaks to the system, the pre-taught models 36, 38, 40 as well as training data are accessed in a real-time search process to understand what is being said. Training is generally done once only; however, during later use of the system, the user frequently needs to correct single-word errors or add new words, and these corrections represent further incremental training. "Recognized words" output 44 are the most likely uttered words, taking into account their fit to the acoustic (sound) models 36, the lexical (word) models 38, and the word co-relationship (language) models 40.

Useful prior art patents teaching such speech recognition systems' hardware and software include the following references: U.S. Pat. No. 5,111,501 ("Speech Recognition Telephone"), U.S. Pat. No. 5,390,278 ("Phoneme-Based Speech Recognition"), U.S. Pat. No. 5,502,774 ("Multiple Source Recognition"), U.S. Pat. No. 5,535,305 ("Vector Quantization"), U.S. Pat. No. 5,586,215 ("Acoustic/visual Speech Recognition Device"), U.S. Pat. No. 5,596,676 ("Recognition Algorithm"), U.S. Pat. No. 5,621,809 ("Multiple Source Recognition"), U.S. Pat. No. 5,640,485 ("Speech Recognition System"), U.S. Pat. No. 5,640,490 ("Speech Recognition Microphone System"), U.S. Pat. No. 5,664,052 ("Voiced/Unvoiced Detector"), U.S. Pat. No. 5,706,397 ("Acoustic Matching of Phones"), U.S. Pat. No. 5,752,001 ("Viterbi Scoring"), and U.S. Pat. No. 5,805,745 ("Facial Recognition"); European Patent EP 00138071 B1 ("Method of Determining Excitation Condition"); and PCT publication WO 09711453 A1 ("Voice Recognition Display Device Apparatus and Method").
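The "representation" stage just described (frames every 10 or 20 ms, sampled at 6.6-20 kHz, reduced to vectors of 10-20 parameters) can be sketched as follows. This is a generic illustration, not the patent's code: the 8 kHz rate, the 20 ms window with a 10 ms hop, and the two toy features (frame energy and zero-crossing count) are invented stand-ins for a real parameter vector.

```python
FS = 8000  # sample rate in Hz (illustrative, within the 6.6-20 kHz range quoted)

def frames(x, hop_ms=10, win_ms=20):
    # Slice the signal into overlapping 20 ms analysis windows every 10 ms.
    hop = FS * hop_ms // 1000
    win = FS * win_ms // 1000
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]

def features(frame):
    # Toy two-parameter vector per frame: energy and zero-crossing count.
    energy = sum(s * s for s in frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings]

def represent(x):
    # "Representation" box 30 of FIG. 2: signal in, vector sequence out.
    return [features(f) for f in frames(x)]
```

The resulting vector sequence is what the modeling/classification stage (box 32) and the downstream search would consume.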
Present Invention

One means of tackling a thorny problem is to change or modify the problem into a more amenable one. The present inventors realized that in order to further improve speech recognition accuracies, it would be highly advantageous to have more information regarding the detailed state of the many natural vocal articulators. Furthermore, it would also be advantageous to be able to drive or excite vocal tract portions or surfaces that do not currently contribute to natural speech, or to excite natural articulator portions in additional new ways. The important basic principle is the provision of new data for speech recognition processing.

Prior art commercial systems have only the natural sources 10 and 12 of FIG. 1 to excite the vocal tract filter system 22. Human evolution has admittedly produced a fine and recognizable speech output 26 for the ear and brain to discern and understand. However, human perception and human hearing are quite limited in what frequencies they can hear, even in an otherwise silent setting, and the very best recognition system available cannot compete with a human, especially in a noisy environment. The brain applies many knowledge systems to the problem, including contextual models not yet reproducible in software, nor even completely understood. However, in purely acoustic terms, the acoustic information the brain gets is limited by the acoustic perceptive ability of the human ear to hear tones and low-amplitude sounds and to discern them from each other and from interference; see A. Syrdal et al., Applied Speech Technology, CRC Press (1995), page 28.

An important aspect of the present invention is that the vocal tract can be thought of as a dynamic filter bank whose articulatory positions (and articulated acoustic output) can further be deduced (or enhanced) using additional excitations not necessarily hearable by the human ear. In this manner, one may artificially produce both "natural" and "unnatural" sounds (by driving natural articulators in old or new ways or by driving unnatural articulators such as throat or sinus mucous-membranes which may vibrate only under the influence of the artificial excitation) and/or be able to spectrally "probe" or map the acoustic admittance of the filter bank in more detail. Furthermore, by conducting training sessions using at least the artificial excitations and analyzing the system-detectable acoustic output or responses, we have basic new information for model building and searching activities supportive of recognition analysis.

Along these lines of "providing more information" to help make such systems more robust, we have seen several [gap in scan] ...ing particular lip positions, are not necessarily unique as had been thought for many years. In fact, a given vowel apparently can be enunciated by more than one set of articulator-filter states or positions.

U.S. Pat. No. 5,729,694, "Speech Coding, Reconstruction and Recognition Using Acoustics and Electromagnetic Waves", issued to J. F. Holzrichter et al. on Mar. 17, 1998, describes the innovative use of miniature radar-imaging systems to image the interior of the vocal tract in real time and help deduce what is being said with the help of that particular incremental and direct information on articulator positions. Some serious potential problems with this technique are electromagnetic exposure and, even more so, the fact that some articulatory states are very, very close to others and are exceedingly hard to discern even by direct observation (if that is possible). For example, the exact position of the tongue tip and the pressure with which it is held against (or very near) opposed oral tissue as air is forced past it makes a huge difference in how various consonants sound. MRI (magnetic resonance imaging) techniques, for example, have been shown to be too crude in spatial and temporal resolution to discern such tiny differences at speaking speed (or at any speed). The ambiguities discussed by Maurer et al., supra, compound these challenges.

In thinking about the problem of how voice recognition performance falls off so quickly in the presence of other speakers, interfering noises, or soft-spoken speech (and particularly whispered speech wherein voiced sounds are almost absent), the present inventors realized that what would be beneficial is a source, such as 10 and/or 12, which is artificial in nature, such as a sound injection or even an acoustic probing device. Unlike the natural excitations naturally available from the larynx and vocal chords, an artificial excitation may have any desired spectral shape and/or duty cycle and may even operate to drive characteristic resonances in the vocal tract which cannot possibly be driven by human excitation sources 10 and 12 of FIG. 1 because of either poorly matched source/filter frequency response or frequency limitations of the natural exciters. In fact, such an artificial exciter may excite natural-speech resonances as well as such "unnatural resonances". Furthermore, since it is a computer system doing the hearing and we have the accepted opportunity to "train" or "enroll" the computer system, we can use the exciter and artificial speech sounds generated by it in the vocal tract to train, further train, or
ongoing efforts. Ronald Cole et al, Survey of the State of the better train the computer. These new Sounds producable by
Art in Human Language Technology, Cambridge University the human users vocal articulators (as excited by the artifi
Press; Giardini Editori E Stampatori In Pisa (1997) in cial exciter) need only be “hearable” or detectable by the
Sections 9.4-9.6 describe attempts to utilize facial expres computer to be useful in improving robustness-they do not
Sions and/or body gestures in combination with the Speech 50 necessarily have to be audible to the user nor of normal
Signal 28 to better deduce what words are being Said and audible loudness. This also opens up the opportunity to
what their meaning might be. Video cameras which track the make different exciters operating for different Speakers
movement of the lipS and eyes as well as the hands and arms Systems purposefully non-interfering-So that one may have
have been designed and tested. These efforts will probably their voice recognized by their computer even with Several
eventually help to Some extent, but they demand the use of 55 other unrelated Speakers present and Speaking in the back
new equipment and the need for the user to be "on camera' ground. In fact, the exciter concept may also (or
even if the user is not visible to anyone other than the alternatively) be implemented in an instrument-probe form
computer itself. They are also fraught with their own unique wherein what one is doing is obtaining a full broadband
problems, Such as Sensitivity to lighting, head position, Spectral fingerprint of the articulation path and deducing
mood, use of makeup and the wearing of glasses or hands in 60 from its various attenuations and resonance couplings more
front of the face as well as the introduction of a Sensory detailed information regarding the articulator States (or
means not easily made portable. Finally, D. Maurer et al., complex impedances) vs. time. Indeed, J. Epps et al., “A
"Re-examination of the relation between the vocal tract and novel instrument to measure acoustic resonances of the
the vowel Sound with electromagnetic articulography vocal tract during phonation”, Meas. Sci. and Technol., Vol.
(EMA) in vocalizations”, Clinical Linguistics and 65 8, pp. 1112-1121 (1997) describe the use of such an acoustic
Phonetics, Vol. 7, No. 2, pp. 129-143 (1993) describes lab desktop instrument used now in a few such labs for
research which has shown that articulatory positions, includ purposes of Speech training and Speech therapy. It makes
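The point above, that an artificial excitation may have any desired spectral shape, placed if need be entirely above the energy range of natural voicing, can be illustrated with a brief sketch. This is purely illustrative and not from the patent: the function name, probe frequencies, sample rate, and duration are all invented for the example.

```python
import numpy as np

def make_probe_excitation(probe_freqs_hz, fs_hz=48_000, duration_s=0.1):
    """Sum-of-sines probe signal whose energy sits only at chosen frequencies.

    Unlike glottal excitation, the spectral content here is entirely under
    the designer's control (e.g., tones above the natural voicing range).
    """
    t = np.arange(int(fs_hz * duration_s)) / fs_hz
    x = sum(np.sin(2 * np.pi * f * t) for f in probe_freqs_hz)
    return x / np.max(np.abs(x))  # normalize to a +/-1 peak

# Example: two tones well above typical voiced-speech energy.
probe = make_probe_excitation([12_000, 15_000])

# The magnitude spectrum confirms energy only at the chosen frequencies.
spectrum = np.abs(np.fft.rfft(probe))
freqs = np.fft.rfftfreq(probe.size, d=1 / 48_000)
top2 = set(np.round(freqs[np.argsort(spectrum)[-2:]]))
```

In a real system the tone set (or a chirp, or band-limited noise) would be chosen per user and per session; the sketch only shows that the designer, not the larynx, fixes the spectrum.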
US 6,487,531 B1
It makes clear that, because of the richer harmonic content of such an artificial source, one may obtain more accurate estimates of spectral features such as formants, as well as values for the complex acoustic impedance of the vocal tract. The reference does not suggest supporting, backing up, or serving as a voice recognizer, nor does it demonstrate a comfortable acoustic injection device of a compact nature. Epps et al. also utilized computational capabilities unable to support real-time high sampling rates. This reference teaches the stripping out and discarding of some natural speech components to get at the formants in a more accurate manner. Herein, the present inventors preferably utilize the natural components to the extent that they are present, and in several of the embodiments recognition-processing of both artificially excited and naturally excited speech signals for the same speech is conducted.

Before moving to the next Figure (FIG. 3), it is important to emphasize that the artificial exciter(s) may inject their acoustic energy along one or more paths to couple into the vocal apparatus, including into the mouth (from outside or from within), through the cheek, throat, tongue, palate, gums, teeth, neck, or nasal passages, into other soft tissue or cartilage, into the facial bones or the skull, or into the chest. The artificial exciter(s), for example, may also be arranged to operate in parallel with, simultaneously with, interleaved with, overlaid on, or instead of the natural vocal chord exciters. It must also be emphasized that the recognition system of the invention may receive the returning, and likely modified, artificially excited acoustic signals by one or more means, such as: (a) via air coupling, as by emanation from the mouth or nose (or alternatively from a radiating solid-body skin surface into the air), as for natural speech signals being picked up from the mouth by an air-coupled microphone; (b) via skin-contact coupling of a receiving transducer or sensor (possibly using a coupling gel or liquid) after passage through skin, bone, cartilage, or mucous membranes; or (c) by optical tracking of a vibrating body portion, such as laser-displacement sensing of the lips, cheeks, or neck. In many of these cases, the reception means may also double as the excitation means, such as in the case of a send/receive piezoelectric transducer. It should be recognized that by injecting artificial acoustics which may be chosen to be different in nature from natural acoustic excitations (e.g., higher frequency, lower frequency, higher or lower amplitudes, added harmonics, phase-controlled, different duty cycles, mixed frequencies, etc.), one will have new articulators participating which may only be responsive to the artificial excitations. In the case of mixed signals and phased signals, one may also arrange for articulation to cause predictable signal interaction, reinforcement, or cancellation of injected components.

Included in the list of vocal tract articulators or portions that may modify or modulate artificial excitations are: the glottal opening, the glottis, the arytenoids, the pharynx, the esophagus, the tongue, the pharyngeal walls, the velum, the soft and hard palates, any mucous membrane, the alveolar ridge, the lips, the teeth, the gums, the cheeks, any nasal cavity or oral cavity, and even the larynx and vocal chords. It should be realized that an exciter for natural speech (e.g., the vocal chords) may double as, or become instead, an articulator of artificial excitations imposed on it.

Turning now to a consideration of what the human speech digital model might look like incorporating the exciter(s) of the invention, FIG. 3, similar in general nature to FIG. 1, depicts a schematic digital representation of the human vocal system incorporating the exciter of the present invention. On the bottom left-hand side of FIG. 3 are seen the familiar natural human-body exciters 10 and 12, related to the larynx and vocal chords. What is fundamentally new in FIG. 3 is the addition of artificial exciter 46. Exciter 46 is shown as depositing or injecting its acoustic energy (directly or indirectly) into the vocal tract filter bank 22, as is done by natural source exciters 10 and 12. Dotted phantom lines 48, 50, and 52 are shown to indicate that the control of artificial exciter 46 may utilize information regarding the state of natural vocal chord exciter 10, natural aspiration exciter 12, and filter bank 22 output, respectively. By way of more detailed example embodiments:

(a) If speech signal 54 were normally all naturally excited, and found to be even momentarily too low in amplitude and getting hard to computer-recognize, artificial exciter(s) 46 could add more system-detectable amplitude and/or frequency components so that the speech signal gains extra artificial components or content and is thus more easily recognized. The extra signal components may or may not be humanly audible, but they would at least be system-detectable for recognition purposes. Lines 48, 50, and/or 52 could represent detection of said insufficient natural excitation or naturally excited speech-signal output in this example.

(b) One could always have both natural 10, 12 and artificial 46 excitations operating, but only go back and analyze (model/classify and search) the artificial speech-signal components if such recognition processing failed using only the natural components. In this manner, processing is minimized relative to full-time analysis of both the artificial and natural signals, and the artificial information represents accessible backup information (avoiding re-utterance).

(c) If private "silent speech" via use of "mouthing the words" techniques were desired, exciter 46 could supplant exciters 10 and 12 and inject inaudible energy or frequency components (resulting in little or no human-audible "speech" sounds). The mouthing action would enunciate the words or utterances in the artificial spectral regime or time domain. Dedicated training for silent speaking could also be used if beneficial, as the "word" models may be substantially different for mouthing without natural excitation(s).

(d) If multiple talkers are present, then person #1 could have their exciter (e.g., 46-1) work on one frequency band (or bands), and person #2 (in another cubicle and having an unrelated but audible conversation) could have their exciter (e.g., 46-2) work on a second, non-interfering frequency band (or bands) or temporal duty cycle. (The individual exciters 46-1 and 46-2 are not shown in FIG. 3, but each comprises an exciter 46.) In this manner, substantial additional information is made available to each person's own speech recognizer, information which is known to be uniquely that of the person to be understood by that computer. Communication between such persons' systems in order to set such different frequencies or sampling schemes could easily be done automatically, as, for example, over a network, a wireless link, an infrared link, an acoustic link, or even a hardwired link. In this example, a person's system may process both natural and artificial signals full-time, or may process only the artificial signal full-time, possibly using the natural elements only as backup. Finally, the multiple speakers may also be co-located, as in a meeting, with the recognition system recognizing both speakers' speech.
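Embodiment (a) above presupposes a detector for moments when the naturally excited signal has become too weak to recognize. A minimal sketch of one such detector, per-frame RMS energy gating, is shown below; the frame length, threshold, and signal values are invented for illustration and are not specified by the patent.

```python
import numpy as np

def weak_frames(signal, frame_len=480, rms_threshold=0.05):
    """Return one boolean per frame: True where natural speech energy is
    too low, i.e., where supplemental artificial excitation (exciter 46)
    might be switched in."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms < rms_threshold

# Loud "speech" for three frames, then a soft-spoken stretch.
loud = 0.5 * np.sin(2 * np.pi * 200 * np.arange(1440) / 48_000)
soft = 0.01 * np.sin(2 * np.pi * 200 * np.arange(1440) / 48_000)
flags = weak_frames(np.concatenate([loud, soft]))
```

The flags correspond to what phantom lines 48, 50, and/or 52 carry in embodiment (a): a per-moment indication that the natural excitation is insufficient.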
(e) If the natural speech is one of voiced or unvoiced (as it usually is, with the very few exceptions being when it is a combination of those), one may add excitation energy or frequency content characteristic of the other, missing excitation, if that provides useful information to further delineate articulator states. The added artificial content may excite the tract in an "unnatural" manner, in a natural manner, or both, depending on how it is delivered and what its content is. Five preferred artificial excitations are: (1) driving the tract at one or more harmonics (or sub-harmonics) of a natural formant with at least one skin-contact transducer (possibly using broadband excitation); (2) driving the tract with inaudible excitations, such as ultrasonic tones or short frequency chirps, using a skin-contact transducer; (3) driving the tract with phase-controlled frequencies which either interact with each other or themselves phase-shift as a function of articulator positions or states, using one or more skin-contact transducers; (4) in the aforementioned probing approach, driving the tract with an air-coupled mouth speaker (injecting sound into the mouth); and (5) driving the tract with broadband excitation wherein induced tract resonances and off-resonance attenuations provide additional articulator position or state information, particularly wherein new articulators such as mucous membranes can be brought into play.

Again, it must be remembered that the excitation means may or may not also be the reception means, and because of this, one will be coupling to (loading) the tract differently with each arrangement. However, the important aspect is that, for any arrangement, there is a correlation between uttered speech and the received signal, which represents new information.

(f) The exciter(s) 46 may be used for training, wherein exciter(s) 46 gather(s) information regarding spectral details of the vocal system, such as precise resonances, formant values, or attenuation values, not measurable via natural voice recognition alone. Such training may be done by reading prepared text, as for training data 42 of FIG. 2, and/or by simply having the exciter(s) 46 spectrally map the vocal tract as the user speaks, such mapping contributing to the betterment of a model such as 36, 38, or 40 of FIG. 2. Such mapping would comprise taking spectral samples under various exciter 46 excitations. Recognition by the system of speech using one type of signal (e.g., the natural signal) allows the remaining type of signal (e.g., artificial) and its associated models to be associated with the recognized word. In this manner, system learning can also take place during normal use, in a manner transparent to the user.

(g) The exciter may contribute to user identity verification, wherein exciter(s) 46 provide(s) spectral maps of the user's vocal tract during speech or silence. The speech might be "good morning, computer", for example. The spectral map, either alone or in combination with the prior art recognition information, can enhance security by making the system more difficult to fool, and more friendly due to the familiar words said, as opposed to a mandatory recitation of randomly generated "I'm testing you" text. Artificial excitations for identity or user verification may be selected at the time of use to prevent the use of a prerecorded voice for break-in. Matching done by such a security system may use a prior-sampled voiceprint containing artificial content, or may even compare the user's voice (with a randomly selected artificial excitation) to the expected response of an acoustic vocal-tract model of the user.

(h) The exciter(s) 46, because there is complete control over it (or them), may introduce a signal with known phase information. Normal human hearing does not significantly process phase information, as far as is known. But using the present invention, one may sensitively detect, with the speech recognition system, the phase of one or more artificially excited speech signals, and indeed of their interaction with each other as a function of articulatory state. These represent additional, entirely new raw data.

(i) For any application, exciter(s) 46 could drive any known tract resonance at one of its higher (or lower) harmonics, either to add more information to the spectrum or to decrease the audibility of the excitation. It must be kept in mind that, with the exciter(s), one can drive portions of the vocal tract segments and surfaces at both higher and lower frequencies than the natural vocal chords or aspiration can, regardless of whether these are harmonics of anything, and regardless of whether they are being driven on-resonance or off-resonance.

Although the injection of acoustic excitation into the vocal tract filter system 22 has been shown in FIG. 3, an approach can be expressly incorporated wherein that acoustic content is injected (instead, or in addition) into an existing exciter 10 or 12, such that that exciter is further excited (or differently excited) than is humanly possible. One can easily appreciate, for example, that to supplement or substitute for the white noise (random aspiration noise) produced by air forced through open vocal chords (natural exciter 12), one could inject through the throat a more spectrally organized distribution of high-frequency sounds, particularly subject to substantial and obvious modification by a particular articulator such as the lips or tongue-tip. In this case, an information-enhanced artificial aspiration source is provided.

The exciter(s) 46 may take the form, for example, of a throat-mounted transducer or a bone- (head-) coupled or chest-coupled transducer. Bone-vibration headsets ("excitation sources") are widely used by police and special forces. These emit audio acoustics directly into the skull through the thin intervening layer of skin. An example of a bone microphone is the "New Eagle" headset made by Streetsmart Professional Equipment. A throat injector would look much like a throat microphone, such as that made by Pama Cellular Division of Manchester, England, except that it would emit rather than receive acoustic energy. Such transducers can be made, for example, using piezoceramics or miniature voice coils, as are widely known to the art.

FIG. 4 depicts a schematic of an example of a speech recognition system incorporating the invention. Starting from the left-hand side of FIG. 4, a speech input 56 labeled "overall speech input" will be seen. It must be emphasized that by overall "speech" is meant sounds emanated by or from the vocal tract (detectable via the mouth or via any other head, neck, or chest acoustic pickup, for example) containing one or both of natural sounds 28a and/or artificially excited sounds 28b (as excited by exciter 46 of FIG. 3). It should also be emphasized that any number of acoustic pickups may be used, including different ones for natural sounds 28a as opposed to artificially excited sounds 28b. Such pickups may be one or more of air-coupled, skin-contact coupled, or non-contact optically coupled.

An optional separator algorithm 58 operates, as necessary, to discern the natural sounds 28a from the artificial sounds 28b. Algorithm 58 may simply consist of a set of frequency (or temporal) electronic or software filters which control what input the recognition system or software hears and when it hears it. These filters are not to be confused with the anatomical acoustic filters of the vocal tract. Algorithm 58 may also consist of a complex signal-deconvolution means or of a simple signal-subtraction means.
This choice will depend significantly on whether the natural and artificial signals significantly interact with each other, or whether they are relatively independent and can be treated simply as superimposed or additive signals. The system itself may be arranged to determine the optimal arrangement of algorithm 58 based on the user's customized artificial excitation spectrum. In any event, the artificial signal content will be chosen based on its useful correlation to utterances of the tract.

Item 58 may be used, for example, in the depicted schematic approach wherein different sound modeling/classification (32a, 32b) is used for natural sounds 28a and artificial sounds 28b, respectively. At least the natural speech signal 28a is routed to the familiar representation unit 30a, modeling/classification unit 32a, and search unit 34a (as in FIG. 2). Again, search unit 34a has inputs from natural acoustic models 36, lexical models 38, and language models 40, which themselves are built upon connected natural training data 42a.

Also emanating from separator 58 is artificially excited signal content 28b. As with the natural signal portion 28a, artificial signal 28b is routed through its own artificial-signal representation module 30b, modeling/classification module 32b, and search module 34b. On the right-hand side of FIG. 4, the artificial-excitation search results from searcher 34b are shown being made available to natural search module 34a, and vice versa, to supplement the identification decision information available for speech-signal recognition. In FIG. 4, the artificial excitation sounds or "speech" 28b are also shown being routed to their own training data module 42b. It is to be emphasized that such a system may train itself incrementally during use (as well as before use, as for 42a, prior art) using the artificially induced excitation sounds 28b, which do not necessarily require any user awareness or attentive cooperation unless reading prepared teaching text is involved. In particular, feedback 60 from the combined search engine 34a, 34b results goes to the artificial excitation training module 42b. The idea here is that correlations between the natural models and the artificial models will exist and ought to be incrementally improved, kept track of, and used to advantage in co-communication between search modules 34a and 34b for purposes of more accurate recognition.

It is to be emphasized for FIG. 4 that the main point being made is that the artificial-excitation-induced acoustic signal 28b may beneficially be subject to processing similar to that of the conventional natural acoustic signal 28a. It is also to be emphasized that one may alternatively elect to treat the overall (combined) speech signal 56 as a single signal (shown in FIG. 5) not requiring breakdown by a separator 58, and thus there may then be only one module of each of the types 30, 32, 34, 42, 36, 38, and 40 to treat the total mixed signal. It will also be noted that excitation model(s) 62 (analogous to 36, 38, and 40) are indicated in support of searching artificial sounds 28b. The nature of the lower branch (the signal 28b path) in FIG. 4 should also be emphasized. As shown, largely parallel recognition subsystems for natural and artificial sound content are present, there being a final judgment at 34a at the end, based on a weighting or comparison of both types of analysis, 34a and 34b. It will be noted that the artificial-speech 34b search results are fed to natural search box 34a for such comparison and weighting. One could alternatively run the two indicated recognition processes in series and use one to narrow the search space for the other, in order to gain speed (or accuracy per unit time spent).

It has also been stated above that the artificial excitation(s) 46 can instead (or in addition) be treated as a fingerprinting device for characterizing the changing vocal tract filters 22. In this mode, rather than exciting acoustics analogous to the way the real vocal chords/larynx do, one can probe (via transmit/receive probing) the vocal tract in a broadband manner and obtain characteristic spectra which can be used as dynamic fingerprints, used in addition to (or instead of) the modeling/classification modules 32a and 32b. In other words, as shown in FIG. 4, the natural 28a and artificial 28b sound models are in modules 36, 38, 40, and 62, respectively. One could, in addition to or instead of those sound-model modules, have spectral models (not shown) whose data come from vocal tract spectra sampled by injecting artificial excitation(s) 46 and observing the response. Such spectra may be taken during speech or silence, for the purposes of recognition and calibration/training, respectively. It will be realized that the artificial exciter(s) 46 may inject a very broadband signal, allowing for the recording of very detailed response spectra across a frequency range beyond that necessary for audible hearing (or "silent" inaudible speech) but still very useful for determining articulation positions. It must be kept in mind that just because one may not hear it does not mean that it does not provide important information to the system. What matters is that the system can learn the association between artificially induced signals and any one or more of: (a) simultaneously heard natural signals; (b) words which are read in a teaching exercise; and (c) words recognized using natural signals.

Such "artificial speech signals" may be received by an external mouth microphone (with the natural signals) or may be received by the artificial exciter itself, in the described "probing" fashion.

It will be recognized that a good reason to have dedicated processing sections for natural sounds as opposed to artificial sounds (as shown in FIG. 4) is that, if discrete "silent speaking" is desired, wherein generally inaudible sounds are excited by exciter 46 and words are mouthed, then one would want models available for those artificial sounds, as the natural excitations are not active or are at a low level. It is widely known that "whispered speech" contains primarily aspirated sounds and little voiced sound, and it therefore currently has to be processed several times and averaged to identify utterances; even with that effort, the accuracy is extremely poor and such use is not recommended. The invention herein provides a broadband excitation (if desired) of inaudible mouthed speech, an excitation which can be arranged to be inaudible by at least one of low amplitudes or frequency excitations which are hard to naturally hear but easy to hear with the system hardware.

Finally, it will be obvious to the person skilled in this art that one may apply the embodiments of the present invention to one or both of continuous speech (discussed herein) or discrete command-style speech (not discussed). It should also be obvious that one may arrange for the artificial sounds to be optimized for the user to maximize recognition performance. Thus, the artificial sounds may adapt, via learning, to the user and be unique for each user. This adaptation may include changes in frequency/temporal content, phasing, or amplitude, as well as changes in when the artificial excitations are delivered as a function of what is being said. The adapted excitations may then be used with any recognition system arranged to receive such signal content, or may be used only with the original system on which they were learned. The portability of these learned excitations is a part of this invention.

Moving finally to FIG. 5, a combined natural and artificial speech signal 56 is processed through representation 30c, modeling/classification 32c, and searching 34c to produce identified words 44. The acoustic models 36a, lexical models 38a, and language models 40a may also be optimized for combined-excitation speech. Note also that feedback loop 64 allows real-time training to take place in training module 42b (along with optional pre-use training via reading text).
It is important to recognize that the invention is fundamentally different from the artificial sound sources used in patients who have had a laryngectomy. There is a considerable body of prior art patents pertaining to such devices; these include U.S. Pat. No. 3,766,318 ("Handheld Vibrator Artificial Larynx"), U.S. Pat. No. 4,039,756 ("Artificial Larynx with Prosodic Inflection Control"), U.S. Pat. No. 4,473,905 ("Artificial Larynx with Prosodic Inflection Control"), U.S. Pat. No. 4,502,150 ("Artificial Larynx with Prosodic Inflection Control"), U.S. Pat. No. 4,520,499 ("Combination Synthesis and Recognition Device"), U.S. Pat. No. 4,691,360 ("Handheld Voice Simulator"), U.S. Pat. No. 4,706,292 ("Speech Prosthesis"), U.S. Pat. No. 4,993,071 ("Post-Laryngectomy Speech Aid"), and U.S. Pat. No. 5,326,349 ("Artificial Larynx").

Firstly, the above-listed artificial sound sources are prostheses designed to recover some very small portion of lost natural speech in a dysfunctional anatomy. To date, none of these devices sounds even remotely natural; more often, they provide crude, gravelly, and unpleasant monotonic sound. In any case, the present invention is not replacing normal audible speech when audible speech recognition is the aim. Secondly, unlike such prostheses, the air flow out of the lungs or into the vocal tract is not utilized to aerodynamically generate or modify sound. Rather, sound is generated ignorant of and oblivious to airflow, and in fact, in the described embodiments, the artificial exciter(s) is/are usually outside of the airflow path. Most of the embodiments herein allow for recognition-accuracy improvement by having dual or redundant speech signals, or allow for inaudible mouthed speech. The aforementioned prostheses represent a single, much cruder speech signal, allowing far lower recognition accuracy than even the natural voice alone. In fact, the present inventors are not aware of any such prosthesis that allows continuous speech to be recognized with even very poor accuracy, nor of any prosthesis that produces speech content that could be overlaid on normal speech without being grossly unpleasant.

The present inventors also realize that, if the artificial exciter(s) are placed in a location other than at the vocal chord/larynx location, then they "see" an acoustic loading by the filter bank (vocal tract) different from that of the vocal chords. In fact, this is turned to advantage, in that one will get additional and different excited signals from the tract, and these different signals are discernible from any natural signals in many cases. Furthermore, it should again be specifically recognized that the generated artificial "speech" signal content need not be humanly audible nor humanly intelligible. In all circumstances, the artificially generated "speech" signal will correlate with articulatory positions or with mouthed or spoken utterances. There is no need for this correlation to be the same as that for the natural speech signal, and in fact its being different gives one added independent data to recognize such utterances.

INDUSTRIAL APPLICABILITY

The voice recognition scheme disclosed herein is expected to find use in a wide variety of applications, including (a) provision of a robust speech interface to […]

Thus, there has been disclosed a voice recognition scheme involving signal injection coupling into the human vocal tract for robust audible and inaudible voice recognition. It will be readily apparent to those skilled in this art that various changes and modifications of an obvious nature may be made, and all such changes and modifications are considered to fall within the scope of the present invention, as defined by the appended claims.

What is claimed is:

1. A speech recognition system for processing sounds emanating from a living body's vocal tract, said sounds including sounds excited by at least one artificial exciter coupled, either directly or indirectly, into said vocal tract to introduce artificial excitations, said at least one artificial excitation modified or modulated by said vocal tract and emanating therefrom, said speech recognition system including:
means for representation, modeling or classification or both, and searching of artificially excited speech signals or signal components;
means for representation, modeling or classification or both, and searching of naturally excited speech signals or signal components;
at least one said searching means having access to at least one of an acoustic model, lexical model, or language model;
at least one training means; and
means for directing at least a first modified or modulated artificially excited speech signal to a first speech representation means which samples at least said first signal to produce a first sequence of speech representation vectors, representative, at least in part, of said artificially excited signal, wherein both the artificially excited signal and the naturally excited signal are represented by a single set of representation vectors.

2. The speech recognition system of claim 1 wherein said sounds are one of continuous speech, command-style speech, or an utterance.

3. The speech recognition system of claim 1 further including means for modeling or classifying said first sequence of vectors.

4. The speech recognition system of claim 3 further including means for subjecting said modeled or classified vectors to a search in a search module, said search module having access to at least one of an acoustic model, a lexical model, or a language model.

5. The speech recognition system of claim 4 wherein two search modules operate, one arranged to process naturally excited signals and the other to process artificially excited signals, said system utilizing the results of both modules to decide what speech took place or what words were articulated.

6. The speech recognition system of claim 1 further including means for directing at least a naturally excited second modified or modulated signal to a speech representation means which samples said naturally excited signal to produce a second sequence of speech representation vectors, representative at least in part of said natural speech signal.

7. The speech recognition system of claim 6 further
computers, terminals, personal electronic products, games, including Second means for modeling or classifying Said
Security devices and identification devices, (b) for non 60 Second Sequence of vectors representative, at least in part, of
interfering recognition with multiple Speakers or Voices Said naturally excited Speech Signal.
present, (c) for the automatic recognition of multiple speak 8. The speech recognition system of claim 7 further
ers and discerning them from each other, (d) for discrete or including Second means for Subjecting Said modeled or
Silent Speaking or command-giving Speech recognition, and classified natural Speech vectors to a Search in a Second
(e) for the option of having a portable user-customized 65 Search module Said Search module having access to at least
artificial enhancement excitation useable with more than one one of an acoustic model, a lexical model or a language
recognition System. model.
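Claim 5 recites two search modules, one scoring the naturally excited signal and one scoring the artificially excited signal, with the system combining both results to decide what was spoken. The patent does not specify a fusion rule; the weighting scheme, floor penalty, and scores in the sketch below are invented for illustration only.

```python
# Hypothetical sketch of claim 5's dual-module arrangement: each search
# module produces per-word log-likelihood scores, and the system fuses
# them to pick the final hypothesis. All numbers are stand-ins.

def fuse_hypotheses(natural_scores, artificial_scores, weight=0.5):
    """Combine per-word log-likelihoods from the two search modules.

    `weight` balances the naturally excited module against the
    artificially excited one; both dicts map word -> log-likelihood.
    """
    words = set(natural_scores) | set(artificial_scores)
    floor = -100.0  # penalty for a word only one module hypothesized
    combined = {
        w: weight * natural_scores.get(w, floor)
           + (1.0 - weight) * artificial_scores.get(w, floor)
        for w in words
    }
    # The fused decision is the hypothesis with the highest combined score.
    return max(combined, key=combined.get)

# Example: the artificial-excitation module disambiguates "pin" vs "bin".
natural = {"pin": -2.1, "bin": -2.0}        # nearly tied acoustically
artificial = {"pin": -1.0, "bin": -6.0}     # articulation strongly favors "pin"
print(fuse_hypotheses(natural, artificial))  # -> pin
```

This captures the claim's point that the two signals carry partly independent evidence: a tie in the natural channel can be broken by the artificial one.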
US 6,487,531 B1
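Claim 12 below recites a separator, deconvolution, or subtraction means for discerning naturally excited sound components from artificially excited ones. A toy illustration of one such subtraction, with all parameters invented: when the artificial excitation is a known tone, its amplitude and phase can be estimated from a single DFT bin and the tone subtracted, leaving the natural component.

```python
# Illustrative only (not the patent's method): separate a known artificial
# tone from a mixed signal by estimating its in-phase/quadrature parts at
# the known frequency, resynthesizing it, and subtracting.
import math

def separate_tone(mixed, freq, rate):
    """Estimate the known artificial tone in `mixed` and subtract it."""
    n = len(mixed)
    re = sum(x * math.cos(2 * math.pi * freq * i / rate) for i, x in enumerate(mixed))
    im = sum(x * math.sin(2 * math.pi * freq * i / rate) for i, x in enumerate(mixed))
    # Resynthesize the tone from its single-bin DFT estimate.
    artificial = [
        (2.0 / n) * (re * math.cos(2 * math.pi * freq * i / rate)
                     + im * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    ]
    natural = [m - a for m, a in zip(mixed, artificial)]
    return natural, artificial

rate, n = 8000, 800
natural_in = [0.3 * math.sin(2 * math.pi * 200 * i / rate) for i in range(n)]
tone_in = [0.5 * math.sin(2 * math.pi * 1000 * i / rate) for i in range(n)]
mixed = [a + b for a, b in zip(natural_in, tone_in)]

natural_out, artificial_out = separate_tone(mixed, 1000, rate)
residual = max(abs(x - y) for x, y in zip(natural_out, natural_in))
print(residual < 1e-6)  # True: the natural component is recovered
```

A real separator would have to handle an excitation that is itself modulated by the vocal tract; this fixed-tone case only shows the subtraction principle.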
9. The speech recognition system of claim 1 wherein training means are provided for both naturally excited signals and artificially excited signals, said means being one of independent or the same means, said signals being one of separate or combined.
10. The speech recognition system of claim 1 wherein artificial excitations are adapted to an individual user.
11. The speech recognition system of claim 10 wherein said adapted excitations are portable across at least one of multiple recognition systems, computers, networks, and speech-conversant devices.
12. The speech recognition system of claim 1 further including a separator, deconvolution, or subtraction means to discern naturally excited sounds or sound components from artificially excited sounds or sound components.
13. The speech recognition system of claim 1 wherein said artificially excited sounds permit inaudible speaking or command-giving to a computer, computer-coupled device, or computer-containing device.
14. The speech recognition system of claim 1 adapted for processing sounds that are both naturally excited and artificially excited, said sounds, or signal representations thereof, being substantially processed as one of separate or separated signals or signal-components or as a combined signal.
15. The speech recognition system of claim 14 wherein said artificially excited sounds permit improved recognition accuracy or improved recognition-speed of natural speech, sounds or utterances.
16. The speech recognition system of claim 14 wherein said artificially excited and naturally excited speech sounds emanating from said tract temporally overlap at least part of the time.
17. The speech recognition system of claim 14 wherein said artificially excited and naturally excited speech sounds emanating from said tract are not identical in spectral content at least part of the time.
18. The speech recognition system of claim 14 wherein said artificially excited signal, before or after tract modification or modulation, includes at least one of the following aspects: (a) said artificially excited signal contains a harmonic or subharmonic of a natural formant, (b) said artificially excited signal contains phase information which is utilized in the recognizer, (c) said artificially excited signal is broadband in nature, (d) said artificially excited signal is selected or set as a function of any natural signal parameter, (e) said artificially excited signal contains tones or frequency components which interact with each other as a function of a vocal tract parameter, (f) said artificially excited signal contains at least one tone or frequency component which is modulated or modified by any portion of the vocal tract anatomy, (g) said artificially excited signal is generally inaudible to the unaided ear of a separate listener, and (h) said artificially excited signal is swept in frequency.
19. The speech recognition system of claim 1 wherein said vocal tract includes at least one element selected from the group consisting of vocal chords, larynx, laryngeal valve, the glottal opening, the glottis, the arytenoids, the pharynx, the esophagus, the tongue, the pharyngeal walls, the velum, the hard palate, the alveolar ridge, the lips, teeth, gums, cheeks or any nasal cavity, at least said one element modifying or modulating said artificial excitation as the speaker articulates speech either audibly or inaudibly.
20. The speech recognition system of claim 1 further including a training data means capable of supporting training using at least the artificially excited speech signals.
21. The speech recognition system of claim 1 wherein said at least one artificial excitation is chosen based on an optimized correlation between it and known words or utterances made available during training.
22. A method of minimizing degradation in the accuracy or speed of speech-recognition of a first speaker's speech or utterance caused by at least one second interfering background speaker, voice, or sound, said method comprising:
   coupling artificial acoustic excitation, directly or indirectly, into the vocal tract of the first speaker;
   allowing said first speaker to audibly speak in the potential acoustic presence of said at least one second background speaker or sound, thereby modifying or modulating said first speaker's artificial acoustic excitation as well as said first speaker's natural excitation; and
   processing at least a portion of said first speaker's artificially-produced acoustic output by a speech recognition means, said speech recognition means comprising:
      means for representation, modeling or classification, and searching of artificially excited speech signals or signal components;
      means for representation, modeling or classification, and searching of naturally excited speech signals or signal components;
      at least one of said searching means having access to at least one of an acoustic model, lexical model or language model; and
      at least one training means,
wherein said first speaker's output is known to be that of said first speaker due to its identifiable artificial acoustic content, or wherein said second speaker's or sound's interfering output is ignored or rejected because it does not contain the first speaker's identifying artificial excitations.
23. The method of claim 22 wherein at least two said equipped speakers are one of (a) speaking as part of a conversing group of at least two or (b) speaking to each other locally or from remote locations.
24. The method of claim 22 wherein speech recognition means process at least portions of both naturally-excited and artificially-excited output of said speaker.
25. The method of claim 24 wherein temporally and/or spectrally unique artificial excitations are provided to two or more thus-equipped speakers such that all such equipped speakers may speak and be recognized without recognition interference with each other, said unique excitations associable with particular speakers.
26. The method of claim 25 wherein a thus-equipped speaker's recognition system is arranged to ignore or reject inputs containing modifications of, modulations of, or elements of a potentially interfering speaker's different artificial excitation and audible speech associable with said interfering speaker.
27. The method of claim 25 wherein a computer provides or assigns said unique artificial excitations.
28. The method of claim 27 wherein information regarding at least one unique artificial excitation, or assignment thereof, is delivered by one of a computer network, telecommunications network, wireless signal, or is inputted manually or via speech-input.
29. The method of claim 22 further comprising:
   choosing said at least one artificial excitation based on an optimized correlation between it and known words or utterances made available during training.
30. A method of providing a speech-recognition based security function for user identification or validation comprising:
   (a) coupling, directly or indirectly, an artificial acoustic exciter into a user's vocal tract;
   (b) having the user speak, articulate or mouth an utterance wherein said utterance, at least in part, comprises a portion of the artificial excitation as-modified or modulated by said user's vocal tract;
   (c) applying speech recognition processing means to identify or validate said user, said means processing at least a portion of said artificially excited speech, utterance or signal-representation thereof; and
   (d) storing information relating to at least one characteristic of said user's vocal tract, or of its function, being used in said user identification or validation process,
wherein said speech-recognition processing includes processing said modified acoustic excitation through representation, modeling or classification or both, and searching to produce identified words.
31. The method of claim 30 wherein said user speaks or utters at least one designated entry-utterance for the purpose of said identification or validation, said audible or inaudible entry-utterance comprising at least one of:
   (a) including at least a portion of said user's name or alias;
   (b) including a welcoming greeting;
   (c) being revealed to said user only at the time of attempted entry; and
   (d) being revealed to said user after its random selection.
(4) Improves security for speech-based user-identification or user-validation.
32. The method of claim 30 further comprising:
   choosing said at least one artificial excitation based on an optimized correlation between it and known words or utterances made available during training.
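Claims 30 and 31 describe validating a user by processing the artificially excited entry-utterance against stored characteristics of that user's vocal tract. The comparison step might look like the following minimal sketch; the feature vectors, per-band interpretation, and tolerance are all invented stand-ins, not values from the patent.

```python
# Hypothetical sketch of the claim 30(d) matching step: an enrolled
# template of vocal-tract characteristics is compared against features
# extracted at entry time; validation succeeds only if they are close.
import math

def validate_user(stored_template, entry_features, tolerance=0.5):
    """Euclidean match of entry-time features against the enrolled template."""
    dist = math.dist(stored_template, entry_features)
    return dist <= tolerance

# Enrolled template: invented per-band responses of the user's tract
# to the artificial excitation, as might be measured during training.
template = [0.82, 0.31, 0.55, 0.12]

print(validate_user(template, [0.80, 0.33, 0.52, 0.15]))  # True: same user
print(validate_user(template, [0.20, 0.90, 0.10, 0.70]))  # False: impostor
```

Because the excitation can be inaudible (claim 18(g)) and user-unique (claims 25 and 27), such a check can run without the user audibly voicing a password.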