Speech Application Language Tags
ABSTRACT:
Speech applications that enable users to speak and listen to a computer will greatly enhance
the ability of users to access computers at any time from nearly any place. SALT may be
used to develop telephony applications (speech input and output only) and multimodal
applications (speech input and output, as well as keyboard and mouse input and display
output). SALT and the host programming language provide control structures not available in
VoiceXML.
Every day, people ask questions. They give instructions. Speaking and listening are necessary
for learning and training, for selling and buying, for persuading and agreeing, and for most
social interactions. For the majority of people, speaking and understanding spoken language is
simply the most convenient and natural way of interacting with other people.
Emerging technology enables users to speak and listen to the computer now. Speech
recognition converts spoken words and phrases into text, and speech synthesis converts text
into human-like spoken words and phrases. While speech recognition and synthesis have long
been in the research stage, three recent advances have enabled speech recognition and
synthesis technologies to be used in real products and services: (1) faster, more powerful
computer technology, (2) improved algorithms using speech data captured from the real
world, and (3) improved strategies for using speech recognition and speech synthesis in
applications.
Speech applications enable users to speak and listen to a computer despite physical
impairments such as blindness or poor physical dexterity. Speaking enables impaired callers
to access computers. Callers with poor physical dexterity (who cannot type) can use speech to
enter requests to the computer. The sight-impaired can listen to the computer as it speaks.
When visual and/or mechanical interfaces are not an option, callers can perform transactions
by saying what they want done and supplying the appropriate information. If a person with
impairments can speak and listen, that person can use a computer to bypass the limitations of
small keyboards and screens. As devices become smaller, our fingers do not. Keys on the
keypad shrink, often to the point where people with thick fingers press two or more keys with
one finger stroke. The small screens on some cell phones may be difficult to see, especially
in extreme lighting conditions. Even PDAs with QWERTY keyboards are awkward.
(QWERTY is a sequence of six keys found on traditional keyboards used by most English and
Western-European language speakers.) Users hold the device with one hand and “hunt and
peck” with the forefinger of the other hand. It is impossible to use both hands to touch-type
and hold the device at the same time. By speaking, callers can bypass the keypad (except
possibly for entering private data in crowded or noisy environments). By speaking and
listening, callers can bypass the small screen of many handheld electronic devices.
Speech also enables users to converse with household appliances. For example, stoves,
refrigerators, and heating and air conditioning thermostats have no
keyboards. These appliances may have a small control panel with a couple of buttons and a
dial. The physical controls are good for turning the appliance on and off and adjusting its
temperature and time. Without speech, a user cannot specify complex instructions such as,
“turn the temperature in the oven to 350 degrees for 30 minutes, then change the temperature
to 250 degrees for 15 minutes, and finally leave the oven on warm.” Without speech, the
appliance cannot ask questions such as, “When on Saturday morning do you turn the heat
on?” Any sophisticated dialog with these appliances will require speech input and output.
Speech is especially useful in situations where the caller’s eyes and/or hands are busy. Drivers need to
keep their eyes on the road and their hands on the steering wheel. If they must use a computer
when driving, the interface should be speech only. Operators of machines that require their
hands on the controls and their eyes focused on the machine’s activities can also use speech
to communicate with a computer. (It is not, however, recommended that you hold and use a
cell phone while driving a car.) Mothers and caregivers with children in their arms may also
appreciate speaking and listening to a doctor’s Web page or medical service. If a person can
speak and listen to others while they work, they can speak and listen to a computer while they
work.
Many services staffed by human operators are available only during working hours.
Computers can automate much of this activity, such as
accepting messages, providing information, and answering callers’ questions. Callers can
access these automated services 24 hours a day, 7 days a week via a telephone by speaking
and listening to a computer. If a person can speak and listen, they can interact with a
computer anytime.
Callers become
frustrated when they hear “your call is very important to us” because this message means
they must wait. “Thanks for waiting, all of our operators are busy” means more waiting.
When using speech to interact with an application, there are no hold times. The computer
responds quickly. (However, computers can become saturated, which results in delays, but
these delays occur less frequently than waits for a human operator.) Because many callers
can be serviced by voice-enabled applications, human operators are freed to resolve more
difficult problems.
Some languages do not lend themselves to data entry using the traditional QWERTY
keyboard. Rather than force Asian language users to mentally translate their words and
phrases to phonetic sounds and then press the corresponding keys on the QWERTY
keyboard, a much better solution is to speak and listen. Speech and handwriting recognition
will be the key to enabling Asian language speakers to gain full use of computers. If a person
can speak and listen to an Asian language, they can interact with a computer using that
language.
TO CONVEY EMOTION:
Users of e-mail and text messaging frequently use emoticons (keyboard symbols that convey
emotions) to enhance their text messages. Example emoticons include :) for happy or a joke
and :( for sad. With speech,
these emotions can be conveyed naturally by changing the inflection, speed, and volume of
the spoken words.
Increasingly, users and computers exchange information by transferring it in the most
appropriate mode: speech for simple requests and simple answers, and GUIs for complex
requests and detailed answers.
This new environment led to the creation of VoiceXML, an XML-based declarative language
for describing the exchange of spoken information between users and computers, and related
languages. The related languages include the Speech Recognition Grammar Specification
(SRGS) for describing what words and phrases the computer should listen for and the Speech
Synthesis Markup Language (SSML) for describing how text should be rendered as verbal
speech. VoiceXML is widely used to develop voice-only user interfaces for telephones and
cell phones.
VoiceXML uses predefined control structures, enabling developers to specify what should be
spoken and heard, but not the low-level details of how those operations occur. As is the case
with many special-purpose declarative languages, developers sometimes prefer to write their
own procedural instructions. Speech Application Language Tags (SALT) was developed to
enable Web developers to use traditional Web development languages to specify the control
flow of a speech application and to use a small number of XML elements for managing
speech. In addition to telephony applications, SALT can also be used for multimodal
applications where people use speech together with a keyboard, mouse, pen, and display.
The SALT Forum, founded by Cisco, Comverse, Intel, Microsoft, Philips, and SpeechWorks
(now ScanSoft), published the initial specification in June 2002. This specification was
contributed to the World Wide Web
Consortium (W3C) in August of that year. Later, in June 2003, the SALT Forum contributed
an updated version of the specification.
The SALT specification contains a small number of XML elements enabling speech output to
the user, called prompts, and speech input from the user, called responses. SALT elements
include:
• <prompt>—specifies output to be spoken to the user; it contains a prompt queue and
commands for managing the presentation of prompts on the queue to the user.
• <listen>—recognizes spoken words and phrases. There are three listen modes: automatic,
single, and multiple. In automatic mode, the platform rather than the application controls
when to stop the recognition facility.
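As a rough illustration of these elements (the namespace prefix, grammar file name, element
ids, and result path are assumptions, not taken from the specification), a prompt/listen pair
embedded in an HTML page might look like this:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body>
        <!-- Text box that will receive the recognized value -->
        <input id="txtCity" type="text" />
        <!-- Question spoken to the user -->
        <salt:prompt id="askCity">
          Which city would you like the weather for?
        </salt:prompt>
        <!-- Recognize the answer; the grammar lists the cities to listen for -->
        <salt:listen id="recoCity" mode="automatic">
          <salt:grammar src="cities.grxml" />
          <!-- Copy the recognized city into the text box -->
          <salt:bind targetelement="txtCity" value="//city" />
        </salt:listen>
      </body>
    </html>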
SALT designers partitioned the SALT functionality into multiple profiles that can be implemented
and used independently of the remaining SALT modules. Various devices may use different
combinations of profiles. Devices with limited processor power or memory need not support
all features (for example, mobile devices do not need to support dictation). Devices may be
tailored to particular environments (for example, telephony support may not be necessary for
television set-top boxes). While full application portability is possible within devices using
the same profile, there is limited portability across devices with different profiles.
SALT has no control elements, such as <for> or <goto>, so developers embed SALT
elements into other languages, called host languages. For example, SALT elements may be
embedded into languages such as XHTML, SVG, and JavaScript. Developers use the host
language to specify application functions and execution control while the SALT elements
provide advanced input and output using speech recognition and speech synthesis.
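A minimal sketch of this division of labor, reusing the illustrative ids and grammar assumed
above: the host page (HTML plus JavaScript) decides when to speak and when to listen,
while the SALT elements perform the speech work.

    <salt:prompt id="welcome" oncomplete="recoCity.Start()">
      Welcome to the weather service.
    </salt:prompt>
    <salt:listen id="recoCity" onreco="showResult()">
      <salt:grammar src="cities.grxml" />
    </salt:listen>
    <script type="text/javascript">
      // Host-language code controls the flow: speak first, then listen.
      window.onload = function () { welcome.Start(); };
      function showResult() {
        // Use the recognition result here (for example, update the page).
      }
    </script>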
Figure 1 illustrates an architecture for speech applications in which callers use a telephone,
cell phone, or other mobile device with a microphone and speaker. The architecture
contains:
• Web server—contains HTML, SALT and embedded scripts. The scripts control the
dialog flow, such as the order for playing audio prompts to the caller.
• Telephony server—connects client devices to the speech server via the
telephone network.
• Speech server—contains a speech recognition engine which converts spoken words and
phrases into text, a speech synthesis engine which converts text to human-sounding
speech, and an audio subsystem for playing prompts and responses back to the user.
• Client devices—devices to which the user listens and speaks, such as a telephone,
cell phone, or PDA.
There are numerous variations for the architecture shown in Figure 1. A small speech
recognition engine could reside in the user device (for example, to recognize a small number
of command and control instructions), or it may be distributed across the device and speech
server (the device performs DSP functions on spoken speech, extracting “speech features”
that are transmitted to the speech server, which completes the speech recognition processing).
The various servers may be combined or replicated depending upon the workload.
Some mobile devices—and most desktop devices—have screens and input devices such as
keyboards, mice, and styluses. These devices support multimodal applications, which accept
more than one mode of input from the user, including keyed text, handwriting, pen gestures,
and speech.
Figure 2 illustrates a sample telephony application written with SALT elements embedded in
HTML. The bolded code in Figure 2 will be replaced by the bolded code in Figure 3, which
illustrates the same application as a multimodal application.
Figure 3 illustrates a typical multimodal application written with SALT embedded in HTML.
In this application, the user may either speak or type to enter values into the text boxes. Note
that the code in Figure 3 is somewhat different from the code in Figure 2. This is because
many telephony applications are system-directed (the system guides the user by asking
questions which the user answers), while multimodal applications, like visual-only
applications, are often user-directed (the user indicates which data will be entered by clicking
on or tapping the corresponding field before speaking or typing).
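Since Figures 2 and 3 are not reproduced here, the following sketch shows one common
user-directed pattern (the ids, grammar file, and result path are assumptions): recognition for
a field starts when the user clicks or taps that field.

    <!-- Clicking the text box starts recognition for that field -->
    <input id="txtDate" type="text" onclick="recoDate.Start()" />
    <salt:listen id="recoDate">
      <salt:grammar src="date.grxml" />
      <!-- Place the recognized date into the text box the user selected -->
      <salt:bind targetelement="txtDate" value="//date" />
    </salt:listen>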
Programming with SALT is different from programming traditional visual applications in the
following ways:
• If the developer does not like how the speech synthesizer renders text as human-
understandable voice, the developer may add Speech Synthesis Markup Language
(SSML) elements to the text to provide hints for the speech synthesis system. For
example, the developer could insert a <break time = "500ms"/> element to instruct the
speech synthesizer to remain silent for 500 milliseconds. SSML is a W3C standard. (A
short example appears after this list.)
• The developer must supply a grammar to describe the words and phrases users are
likely to say. Grammars help the speech recognition system recognize words faster
and more accurately. SALT (and VoiceXML 2.0/2.1) developers specify grammars
using the Speech Recognition Grammar Specification (SRGS). Developers must anticipate
the words frequently spoken by the user at each point in the dialog, as well as fine-tune
the wording of the prompts to encourage users to speak those words and phrases. (A
sample grammar appears after this list.)
• Speech recognition systems do not understand spoken speech perfectly. (Even the best
speech recognition engines fail to accurately recognize three to five percent of spoken
words.) Developers compensate for poor speech recognition by writing event handlers
that ask the user to speak again, often rephrasing the question so the user responds by saying
different words. Example event handlers are illustrated in Figure 2, lines 35–37 and
lines 38–40. Developers may spend as much as 30 to 40 percent of their time writing
event handlers, which are needed only occasionally but are essential when the speech
recognizer fails to understand the user. (A sketch of such handlers appears after this list.)
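The following three sketches illustrate the points above. They are illustrative only; the ids,
file names, wording, and result paths are assumptions.

First, SSML hints inside a prompt (namespace declarations omitted for brevity):

    <salt:prompt id="confirmOrder">
      Your order has been placed.
      <!-- Pause for half a second before continuing -->
      <break time="500ms" />
      <!-- Speak the closing phrase more slowly -->
      <prosody rate="slow">Thank you for calling.</prosody>
    </salt:prompt>

Second, a minimal SRGS grammar listing the words the recognizer should listen for:

    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" xml:lang="en-US" root="city">
      <rule id="city" scope="public">
        <one-of>
          <item>Portland</item>
          <item>Seattle</item>
          <item>San Francisco</item>
        </one-of>
      </rule>
    </grammar>

Third, event handlers that re-prompt the user after a recognition failure or silence:

    <salt:prompt id="askAmount">How much would you like to transfer?</salt:prompt>
    <salt:prompt id="helpAmount">Please say a dollar amount, for example, fifty dollars.</salt:prompt>
    <salt:listen id="recoAmount" onreco="processAmount()"
                 onnoreco="retryAmount()" onsilence="retryAmount()">
      <salt:grammar src="amount.grxml" />
    </salt:listen>
    <script type="text/javascript">
      // onnoreco fires when speech was heard but did not match the grammar;
      // onsilence fires when the user says nothing within the timeout.
      function retryAmount() {
        helpAmount.Start();   // rephrase the question
        recoAmount.Start();   // listen again
      }
      function processAmount() { /* use the recognized amount */ }
    </script>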
SALT and VoiceXML enable very different approaches for developing speech applications.
SALT tags control the speech medium (speech synthesis, speech recognition, audio capture,
audio replay, and DTMF recognition). SALT tags are often embedded into another
language that specifies flow control and turn taking. On the other hand, VoiceXML is a
stand-alone language which controls the speech medium as well as flow control and turn
taking.
In VoiceXML, the details of flow control are managed by a special algorithm called the
Form Interpretation Algorithm. For this reason, many developers consider VoiceXML a
declarative language. SALT, on the other hand, is frequently embedded into host programming
languages that most developers consider procedural. It should be noted, however, that SALT can be used as
a stand-alone declarative language by using the assignment and conditional features of the
<bind> statement. Thus, SALT can be used in resource-scarce platforms such as cell phones
that cannot support a host language. For details, see section 2.6.1.3 in the SALT specification.
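A rough sketch of this declarative style, assuming a yes/no grammar and illustrative result
paths: the test and value attributes of <bind> perform the conditional check and the
assignment without any host-language script.

    <salt:listen id="recoConfirm">
      <salt:grammar src="yesno.grxml" />
      <!-- If the recognized answer is "yes", copy it into the confirmation field -->
      <salt:bind test="//answer = 'yes'"
                 targetelement="txtConfirm"
                 value="//answer" />
    </salt:listen>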
Although it is relatively easy to learn SALT, it is difficult to design a quality speech
application. An HTML programmer easily learns how to write SALT applications, but
designing a usable speech or multimodal application is still more of an art than a science.
Balentine and Morgan (1999) and Cohen, Giangola, and Balogh (2004) present guidelines
and heuristics for designing effective speech dialogs. A series of iterative designs and
usability tests is necessary to implement speech applications that users both enjoy and use
efficiently to complete their tasks.
CONCLUSION:
It is not clear, at the time this article was written, whether SALT will overtake and replace
VoiceXML as the most widely used language for writing telephony applications. It is also not
clear whether
SALT or some other language will become the preferred language for developing multimodal
applications. The availability of high-level design tools, code generators, and system
development environments that hide the choice of development language from the speech
application developer may make the underlying language less important.
FURTHER READING:
Balentine, B., & Morgan, D. P. (1999). How to Build a Speech Recognition Application: A
Style Guide for Telephony Dialogues (2nd edition). San Ramon, CA: Enterprise
Integration Group.
Cohen, M. H., Giangola, J. P., & Balogh, J. (2004). Voice User Interface Design.
Addison-Wesley.
Speech Synthesis Markup Language (SSML) Version 1.0, W3C Proposed Recommendation.