Voice Enabling Web
Applications: VoiceXML
and Beyond
KEN ABBOTT
All rights reserved. No part of this work may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including photocopying, recording,
or by any information storage or retrieval system, without the prior written
permission of the copyright owner and the publisher.
Trademarked names may appear in this book. Rather than use a trademark symbol
with every occurrence of a trademarked name, we use the names only in an editorial
fashion and to the benefit of the trademark owner, with no intention of infringement
of the trademark.
Editorial Directors: Dan Appleman, Gary Cornell, Jason Gilmore, Karen Watterson
Marketing Manager: Stephanie Rodriguez
Managing Editor: Grace Wong
Technical Reviewer: Dennis McCarthy
Developmental Editor: Marty Minner
Copy Editor: Nicole LeClerc
Production Editor: Laura Cheu
Compositor: Impressions Book and Journal Services, Inc.
Artists: Susan Glinert Stevens, Impressions Book and Journal Services, Inc.
Indexer: Valerie Haynes Perry
Cover Designer: Tom Debolski
The information in this book is distributed on an "as is" basis, without warranty.
Although every precaution has been taken in the preparation of this work, neither the
author nor Apress shall have any liability to any person or entity with respect to any
loss or damage caused or alleged to be caused directly or indirectly by the
information contained in this work.
Contents
Author's Note on VoiceXML 2.0
Preface
Author's Note on
VoiceXML 2.0
THE FIRST VERSION of VoiceXML, VoiceXML 1.0, was officially released in May 2000
by the VoiceXML Forum. Subsequently, the VoiceXML Forum turned over control
of the VoiceXML specification to the World Wide Web Consortium (W3C). The
next version of VoiceXML, popularly known as VoiceXML 2.0, has been pending
throughout the writing and production of this book, but still has not been
publicly released as of mid-October 2001.
As anyone who works with software technology knows, one and one half
years between releases of a burgeoning technology is an eternity. The delay has
been due to internal issues within the W3C regarding intellectual property rights.
In the past, the W3C has been a strong advocate of open-source (public
domain) technologies. Modern reality is that many, if not most, new and evolving
technologies are being developed by parties who hold some intellectual rights
to the technology, so the W3C must adapt. VoiceXML is one such technology
under the W3C's purview (but not the only one).
As a result of this turmoil, the anxiously awaited VoiceXML 2.0 specification
has been pending release as a W3C Working Draft for over a year. Both the internal
deliberations of the W3C and the contents of any unreleased work-in-progress are
closed to the public. Therefore, the specification cannot be discussed publicly,
and there has been no firm official information from the W3C about when it can
be. (And the W3C finds itself in the unique position of being an open standards
body fighting fiercely to keep a much-requested standard secret.)
People who buy technical books want information that is up-to-date and
timely. This presented a dilemma to authors and publishers. To provide timely
information on an infant technology such as VoiceXML, books are often rushed
into production. On the other hand, to provide up-to-date information, books
are often timed to appear as soon as a particular new technology is released. For
books on VoiceXML, the choice was to rush to market with books on VoiceXML
1.0 (already decrepit in Web time and due to be superseded by the imminent
VoiceXML 2.0), or wait for VoiceXML 2.0 (which was making little publicly visible
progress toward release).
Initially, Apress and I decided that the smart thing to do was write the manu-
script, wait for VoiceXML 2.0 publication, and then follow as soon as possible with
publication of this book, compatible with VoiceXML 2.0. However, books don't
hold well in captivity, and as months passed with no resolution to uncertainty
concerning the schedule for release of VoiceXML 2.0, we decided that the book
needed to get into people's hands.
So, strictly speaking, this book can only claim that it is compatible with the
VoiceXML 1.0 specification and that an online supplement will reconcile any
incompatibilities when VoiceXML 2.0 appears. However, due to the long gestation
of VoiceXML 2.0, a fair amount of information about VoiceXML 2.0 has become
publicly known, whether the W3C likes it or not. So there is some good news:
• This book is not just about VoiceXML. It's about voice enabling Web appli-
cations, and there is a lot of valuable information herein about integrating
voice with Web technologies and with existing applications that can be
found nowhere else.
So indulge your interest-use this book to start voice enabling your Web
applications right now!
Kenneth R. Abbott
October 2001
Holliston, MA
[email protected]
Preface
This book is about two topics that I've pretty well mixed together: using voice to
access the Web and the VoiceXML language. Of the two, the former topic is the
bigger, more conceptual one, and it is the one that will wear the best over time. I
believe that VoiceXML will enable the use of voice to access the Web in a big way.
VoiceXML is a hot new enabling technology destined to live its meteoric life in
Web time: new and brilliant today but commonplace tomorrow. However, in the
grand tradition of computer technology, details get the attention and the major
trends take care of themselves.
I am a "big picture" person, and for me, the key to mastering the ever-
changing details of technology is to keep the details in context. My years in the
computer software industry have taught me that not everyone thinks the way I
do, and for many technically oriented people, "God is in the details." In this book
I have attempted to strike a balance between providing context and explaining
the current technical details.
In the admixture of voice, computers, and the Web, I've observed the follow-
ing overlapping constituencies with strong interest in seeing VoiceXML succeed.
• Another group is voice technologists, who are grounded in the deep complexi-
ties of voice recognition, voice synthesis, and natural language processing.
They have been working for decades on a complex and frustrating technology,
and they feel it is now getting close to the point where the masses can use it.
• Web enthusiasts are technically oriented people who are deeply involved with
the development, care, and feeding of Web applications and the Web itself. To
them, voice is a new technology to be quickly mastered and assimilated.
• Finally, there are technology integrators, who occupy the shadowy realm
between business and technology. They are interested in finding better ways
to do business using technology. Technology integrators tend to be interested
in markets, products, architectures, and standards-they are the architects
and general contractors of the computer-system building industry.
Part II (Chapters 4-9) focuses on the nuts and bolts of the VoiceXML lan-
guage. Using a Simplified Personal Information Manager as a specimen
application, Part II guides you through an initial analysis of the application
(Chapter 4), introduces basic VoiceXML concepts (Chapter 5), helps you set
up your own VoiceXML development environment using software from the
companion CD (Chapter 6), and leads you through a hands-on VoiceXML
tutorial (Chapter 7). With the tutorial mastered, Chapters 8-10 examine
voice interface design, the machinery behind the language features, and
advanced topics that arise in real-world VoiceXML applications.
Part III (Chapters 11-13) dollies back from the details of the VoiceXML lan-
guage and explores the issues involved in building a single Web application
that incorporates multiple access modes, such as voice and graphical
interfaces. Chapter 11 briefly reviews major technologies that are used in
enterprise Web applications, including XML, XSL, JavaServer Pages, and
application servers. Chapter 12 introduces a transformational approach
for putting together the various technologies, including VoiceXML, into a
scalable, multiple access mode architecture. Chapter 13 presents a working
prototype that demonstrates the transformational architecture. Detailed
instructions are provided for installing the prototype from the companion
CD, and the various components are dissected. Finally, Chapter 14
explores future directions for VoiceXML.
Companion CD
The companion CD contains all the software you need to begin developing
VoiceXML applications on your PC. The CD includes IBM WebSphere Voice
Server SDK 1.5, IBM WebSphere Studio (trial edition), Allaire JRun 3.1 (developer
version), Altova XML Spy IDE for XML (30-day evaluation), plus an assortment of
goodies, such as XML Quick Reference Cards from MulberryTech and a small gallery
of computer-synthesized voices to break in your new headset.
If you're still not sure that you're ready to take the plunge into the world of the
voice-enabled Web, I suggest you visit some of the major resource sites on
the Web. You'll see that there are a lot of people getting excited about combining
voice and the Web, and there are a lot of resources for newbies. I recommend
starting with the following sites.
• VoiceXML Forum (http://www.voicexml.org/): The VoiceXML Forum is the
industry consortium that first standardized VoiceXML. This site has lots of
information about VoiceXML, including FAQs, technical resources, and
details about important activities such as user group meetings. In addition,
there are links to just about every business, individual, and organization
active in the VoiceXML area.
• XML.org (http://www.xml.org/): This site is the best jumping-off point for
immersing yourself in XML. XML.org is dedicated to promoting the use of
XML, and the site provides links to FAQs, books online, online courses,
examples, free software tools, and much more.
After you have surfed around a bit, I think you'll discover two things.
Acknowledgments
Thanks to Dennis McCarthy, both for getting me turned on to this whole
"VoiceXML thing" and for providing valuable comments and suggestions as the
principal technical reviewer for the manuscript. Thanks to Sue Spielman of
Switchback Software, who provided technical feedback from the perspective of
an expert Cocoon user and XML developer.
Special thanks to my wife Susan, who has seen more of me around the house
than she thought possible or tolerable, and who struggled through the first three
pages of the manuscript and concluded, "It's wonderful, dear." Which was exactly
the right thing to say.
Ken Abbott
August 2001
Holliston, MA
[email protected]
Part One
Retrospective on Voice
and the Web
THIS PART INTRODUCES and reviews the key concepts that underlie voice technology
and the World Wide Web. Voice technology and the Web have very different ori-
gins. Understanding the context and trajectory of each technology will help
answer many of the questions that will pop up in subsequent parts as you grapple
with the technical details: Why do things work this way? Why isn't this obvious
feature standardized?
Chapter 1 provides a brief introduction to voice and its significance.
Chapter 2 explores how and why voice and the Web are converging. Chapter 3
closes this part with a review of how the Web has evolved technically and draws
some parallels to the future evolution of VoiceXML.
CHAPTER 1
The Role of Voice on the Web
Due to its technical history and orientation, the Web has favored the
explosion of visual interfaces over auditory interfaces. Much of the thrust of new
developments in user interfaces in the Information Age has been to increase the
rate at which information can be exchanged and the volume of information
available at any given moment. (Think 21" monitors set at high resolution.) This
approach has favored the visual over the audible, because visual interfaces are
"scalable"-to increase the amount of visual information, you simply increase
the transfer rate and the display capacity. On the other hand, speech cannot be
significantly speeded up without becoming incomprehensible. To increase the
amount of information conveyed through speech, you increase the length of
the conversation.
As Table 1-1 shows, the different characteristics of sight and speech make
them useful in different situations.
Now that both sight and speech are viable modes for accessing the Web,
there are some interesting questions to be answered.
• What types of interactions are more appropriate for speech, and what
types are more appropriate for sight?
For example, consider the adage "A picture is worth a thousand words."
A picture that can be perceived instantaneously by the eyes may require a long
description to convey similar information through speech. On the other hand,
audible speech can rapidly convey emotions and shades of meaning that are lost
in sight-mediated representations (think of sending e-mail versus talking on
a phone). Entering names and addresses through a graphical interface requires
manipulating mice and keyboards, as opposed to simply saying the names
and addresses.
At this time, we are just becoming able to use speech as a means of (limited)
communication between people and machines. This book is about a key enabler
of this limited capability: VoiceXML. VoiceXML, in its current form, is strictly
concerned with speech interaction. However, it is important to bear in mind
that sight and speech are complementary, each with its own set of strengths
and weaknesses.
CHAPTER 2
The Convergence of
Speech and the Web
What Is VoiceXML?
VOICEXML IS A PROGRAMMING LANGUAGE for scripting voice interactions between
a computer and a person. The basic element of interaction is a spoken dialog in
which the computer produces spoken prompts to elicit spoken responses from
the user. VoiceXML prompts may be recorded or generated using Text-to-Speech
(TTS) synthesis. Spoken user responses are processed using speech recognition
and grammars defined in the VoiceXML program. Users may also respond
through a keypad (DTMF 1), as defined in the program.
1 DTMF stands for Dual Tone Multi-Frequency, which is techno-speak for the sounds
a Touch-Tone phone makes when you press the keys.
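To make this concrete, here is a minimal sketch of a VoiceXML 1.0 document. The element names are standard VoiceXML 1.0, but the prompt wording, field name, and grammar content are invented for illustration; the inline grammar format is platform-dependent in VoiceXML 1.0.

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A single dialog (form) containing one field to be filled by the caller -->
  <form id="confirm">
    <field name="answer">
      <!-- Spoken prompt, rendered by TTS (it could also reference recorded audio) -->
      <prompt>Would you like to hear your appointments? Please say yes or no.</prompt>
      <!-- Spoken responses the recognizer should accept (inline grammar) -->
      <grammar>yes | no</grammar>
      <!-- Keypad (DTMF) equivalents: 1 for yes, 2 for no -->
      <dtmf>1 | 2</dtmf>
      <!-- Executed once the field has been filled by a recognized response -->
      <filled>
        You said <value expr="answer"/>.
      </filled>
    </field>
  </form>
</vxml>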
Something you hear is sound. A sequence of sounds people make with their
voices with the intent to communicate is speech. Meaning is the message con-
veyed when speech is successfully understood.
During a conversation, several distinct processes occur. Speaking can be
defined as the generation of understandable sounds using the voice. Hearing is
the perception of those sounds as speech and the chunking of sounds into units
of speech. Cognition is the assembling of speech units into an understandable,
meaningful message. This speech processing model is illustrated in Figure 2-1.
[Figure 2-1. The speech processing model: speaking, speech, and hearing]
When people of a certain age hear about "talking to a computer," they often
conjure up images of HAL from the film 2001: A Space Odyssey. HAL was an amaz-
ing computer that could talk and see. He sounded a little nerdy, but he could
carry on a conversation with no problem whatsoever, and his eyesight was excel-
lent. In the year 2001, HAL is still amazing. Today's computers approach HAL in
their ability to speak understandably and to recognize the words that a person
speaks. However, in the area of cognition, HAL is still far beyond our current
technological capabilities. Consider the fact that HAL could not only understand
natural human language, but he could also lip-read it (despite his personal lack
of lips!).
HAL engenders the false expectation that you can talk to a computer and
have a conversation, just as you can strike up a conversation with someone
standing in line at the post office. That happens to be exactly what computers
can't yet do. That is why VoiceXML encompasses speaking and hearing but has no
cognition model other than standard computer programming.
Markup Languages
Markup languages had their heyday before graphical interfaces and the
Web. Markup languages were a way to embed text-formatting instructions into
text documents. The text and markup language were entered into a text file using
a text editor (which in those pre-GUI days were line-at-a-time monsters). To pro-
duce a formatted document, a special formatting program (a word processor)
read the text file, interpreted and stripped out the markup instructions, and pro-
duced a text file that would print nicely on a selected printer. Because pre-GUI
days were also pre-laser printer days, the printer might have been a line printer,
a dot matrix printer, or (for the utmost in quality) a daisy wheel printer. All these
technologies-markup language-based text processors, command-line text edi-
tors, and impact printers-were authoritatively eclipsed by GUIs, WYSIWYG
(What You See Is What You Get) editors, and laser printers.
However, the markup language approach solved some difficult publishing
problems that WYSIWYG word processors could not. Standard Generalized
Markup Language (SGML), a standard developed by IBM, continued to serve
a small but loyal market, and as a result it was still vital when someone began
envisioning a "World Wide Web."
HTML is a blessing and a curse to its aged parent SGML. It's a blessing because
it breathed life into a stagnant technology. It's a curse because it did so with
a brash, youthful disregard for the elegance and refinement of its progenitor.
appear visually, but it provides no hint as to why it should appear that way. Other
tags such as <H1> (header level 1) sound like they are describing the logical struc-
ture of the document, but in use, they really refer to particular rendering styles.
In contrast, content markup tags content by its meaning (for example,
<informalAside> ... </informalAside>) and leaves rendering decisions to the
renderer. For example, a voice interface might render an informal aside in a whis-
per, while a Web interface might simply italicize the text. Decisions about
rendering into a particular medium can be made while generating presen-
tation markup from content or they can be preprogrammed into the media
browser itself.
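As a sketch of the difference (the <informalAside> element and the rendering behavior described here are illustrative, not taken from any particular standard), the same sentence can be marked up for presentation or for content:

<!-- Presentation markup: says how the text should look -->
<p>The meeting ran long. <i>Nobody was surprised.</i></p>

<!-- Content markup: says what the text is, and leaves rendering to the renderer,
     which might whisper the aside in a voice interface or italicize it on a screen -->
<para>The meeting ran long.
  <informalAside>Nobody was surprised.</informalAside>
</para>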
Development of XML was spurred by a desire to generalize and extend the
success of HTML. Technically, the approach was to popularize SGML by creating
a simplified subset that would be useful to the broad audience of businesses try-
ing to exchange data over the Web. In its full generality, SGML has some finicky
and complex nooks and crannies that are only used to solve the hardest prob-
lems. XML removes some of the most obtrusive complexity (and hence some of
the power) of SGML by defining a restricted family of SGML-compliant lan-
guages. This family shares a strict, but tractable, syntax that mere mortals can
learn and use. Notice that in the grand scheme of things, SGML and XML
are both metalanguages, while HTML is a single SGML application (that is,
an instance).
HTML has two younger siblings: Wireless Markup Language (WML) and
VoiceXML (VXML). In the "family" analogy, one might say that HTML is a young
teenager, WML is a toddler, and VXML is a newborn. Both WML and VXML are
XML-compliant markup languages.
The term "Wireless Markup Language" seems to imply that there is some-
thing special or unique about wireless communication that requires separate
handling from other types of communication. In fact, WML does not rely on or
exploit the "wireless versus wired" distinction in any essential way. WML is better
understood as a low-bandwidth, low-resolution markup language. In other
words, WML is targeted at being rendered in environments (devices) with low
communications bandwidth, limited display capabilities, and limited computing
resources. It was simply a mark of the times that in the late 1990s wireless devices
happened to be low-bandwidth, low-resolution devices. In the coming years,
when wireless bandwidth improves and mobile gear has small, high-resolution
displays, wireless devices will probably render HTML (or its successor). On the
other hand, future toasters and other relatively low-tech household devices may
render WML on small, cheap displays, even though they are connected by a wired
household LAN.
• Extend the reach of the Web. People can access the Web from anywhere they
can make a phone call. People get increased access to goods and services
and businesses get increased access to their customers.
2 For an example of a low-end application, play the Tellme blackjack game by calling (800)
555-8355 and saying "entertainment" at the main menu followed by "blackjack." (Tellme is a
VoiceXML vendor, and its voice site is completely written in VoiceXML.)
Summary
This chapter reviewed some of the technologies converging in the emergent
voice-enabled Web. These technologies include speech recognition, speech syn-
thesis, and markup languages. Each is a powerful technology in its own right,
and each has its own history and drivers. This background sets the stage for the
next chapter, which explores how the architecture of Web applications draws on
these underpinnings.
CHAPTER 3
The Evolution
of Web Application
Architectures
THIS CHAPTER BRIEFLY TRACES the architectural evolution of the Web from a simple
mechanism for sharing published information electronically to its current role as
a public infrastructure for communication, interaction, and commerce. In the big
picture, enabling the Web with voice is just one piece of the larger mosaic that
comprises the Web. At a more detailed level, voice-enabling technologies such as
VoiceXML are just beginning on a growth path that has already been taken by
vision-enabling technologies such as HTML.
a person and content fetched from the Web. In visual terms, the browser acts as
a window in which various types of content can be rendered. The browser:
• Renders HTML.
Early on, the relationship between content and its rendition was simple:
Content arrived in files, content was typed (by MIME types and/or file
extensions), and content types had renderers. Things rapidly became more
sophisticated with plug-ins, applets, ActiveX controls, and so on, and the
distinction between content and computer programs was blurred.
HTML is the glue that ties together related content and drives the browser.
The browser renders its way through an HTML stream, and along the way it must
call on other renderers to render images, sounds, spreadsheets, and so on. In
a sense, HTML is a rendering language that is "interpreted" by the browser.
Before the browser paradigm, developing a GUI was a programming-
intensive activity. Software developers wrote software programs that embedded
calls to an underlying windowing system API, such as Windows, Motif, Apple, X,
and so on. The programs were expensive to write, debug, and deploy, and they
were specific to the underlying windowing system. With the advent of the Web
browser, anyone with a text editor and basic knowledge of HTML could program
a GUI that could run anywhere.
Sessions
As originally conceived, the HTTP protocol was stateless and anonymous.
A browser requested a document from a server, the server returned it, and the
transaction was done. Any browser that made the same request got exactly
the same document returned.
Some gross but effective techniques for maintaining information about
a user's interaction with a Web site were rapidly hacked into place. Cookies and
URL rewriting are the most common. Both are techniques for piggybacking infor-
mation about user identity onto HTTP requests: Cookies store information in
HTTP headers, while URL rewriting stores information in URL search strings.
Application Servers
The early techniques for managing user state and generating content on
demand, while minimally secure and far from perfect, enabled the evolution of
Web servers from document vending machines into sophisticated environments
for running complex distributed applications.
Early attempts to improve on CGI focused on directly extending the capabili-
ties of the Web server. A server-side include (SSI) is a mechanism for embedding
executable instructions into the content being served. As the server parses
through content, it recognizes and acts on server commands, which produce the
actual content served to the browser. Server extension APIs such as ISAPI and
NSAPI enabled programs running on the server to interact with the Web server
directly. Unfortunately, these direct extensions were implemented directly in the
Web server software, making Web servers proprietary. The extensions also com-
plicated Web server implementations by compounding technical issues
concerning threading and resource management that are much simpler in a pure
Web server.
The development of component-based server extensions sidestepped the
propriety issue. This approach leaves Web servers as they are but defines a com-
plementary component-hosting server that interacts with the Web server
through existing protocols. The component hosting server, or application server,
provides a robust, general-purpose environment for components. The appli-
cation server handles the OS-like functions of resource management, thread
management, I/O, and so on, while the components implement transformations
on content streams. This makes server-side components relatively easy to pro-
gram and deploy and enables Web servers to do what they do best: serve content.
Of course, the application server is tied to the component architecture it imple-
ments, so the issue of propriety was moved but not solved.
The markup languages for these three types of interfaces are all based on
XML. XML is a markup metalanguage derived from SGML. XML simplifies some
of the complex and little-used features of SGML, but it still provides a flexible and
extensible base for defining specialized markup languages. For more on XML, see
Chapter 11.
HTML
HTML was originally conceived as the "stitching" needed to weave together the
"World Wide Web" of multimedia documents. In this role, HTML is appropriate
because it is simple and platform independent, and it provides a lingua franca for
linking together documents. Basic HTML had a simple model of interaction with
the user. The browser rendered documents and forms onto screen real estate
owned by the browser. The user could "interact" with the browser by clicking a link,
entering text into a form field, or pressing a form button to submit data to a server.
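A minimal sketch of that basic interaction model follows; the link target, field names, and form action are invented for illustration:

<!-- A link the user can click -->
<a href="contacts.html">Review contacts</a>

<!-- A form the user fills in and submits to a server -->
<form action="/addContact" method="post">
  First name: <input type="text" name="firstName">
  Last name: <input type="text" name="lastName">
  <input type="submit" value="Add contact">
</form>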
Surprisingly, this simple model was rich enough to support the rapid morph-
ing of HTML into a platform for building GUis. As the Web took off, it became
possible for anyone with a text editor to put together a rudimentary GUI in
a matter of minutes, changing forever the economics of developing graphical
interfaces. More sophisticated HTML-based GUIs were enabled by the fol-
lowing innovations:
The net result is that today it is possible to build very sophisticated GUIs
using HTML and the Web. However, you can't do it using your trusty old text edi-
tor. In fact, HTML may be on the path to becoming a purely generated language
created by programs such as WYSIWYG GUI editors and application servers
solely to drive browsers and looked at by humans only in the direst need. (If you
have doubts about this trend, go to your favorite Web page and have your
browser display the HTML source.)
NOTE HTML is not, strictly speaking, a true XML application. Some syn-
tactically sloppy shortcuts in HTML violate the strict syntax rules of XML.
Extensible HTML (XHTML) is a slightly reformulated version of HTML
that does conform to XML. XHTML is an up-and-coming standard now
and is expected to supplant HTML soon. Although there are no major con-
ceptual or functional differences between HTML and XHTML (as far as
Web browsing goes), there are subtle but important technical differences.
As a well-behaved member of the XML family, XHTML enables a suite of
powerful XML-based technologies that HTML does not. These technologies
are explored in Chapter 11. For more information on XHTML, visit
XHTML.org (http://www.xhtml.org/).
WML
Linguistic hairsplitting aside, WML WUIs really work like lightweight HTML
GUIs. There's a screen, text and graphics, and links to click and buttons to push.
The overall structure of how the user interacts with the interface is the same for
WML and HTML.
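For a flavor of what such a lightweight interface looks like in markup, here is a minimal WML sketch (the deck, card names, and text are invented). A deck contains one or more cards, and the device displays one card at a time:

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <!-- One card is roughly one small screen on the device -->
  <card id="main" title="SPIM">
    <p>
      Welcome to SPIM.<br/>
      <a href="#contacts">Contacts</a>
    </p>
  </card>
  <card id="contacts" title="Contacts">
    <p>No contacts yet.</p>
  </card>
</wml>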
VoiceXML
Unlike HTML, which has roots in publishing, VoiceXML comes from a program-
ming language background. It has the earmarks of a programming language:
control constructs, variables, event handlers, nested scoping, and so on.
VoiceXML was designed from the beginning to be a lightweight, easy-to-learn,
interpreted programming language for developing VUIs.
VoiceXML structures interactions with the user into dialogs. A dialog consists
of a sequence of prompts spoken by the computer and responses spoken by
a person. The person can speak responses or key them in using a keypad. VUI
dialogs are by nature sequential and linear, in contrast to GUI windows, which
are multitasked and two-dimensional.
Architecturally, VoiceXML interfaces are event-driven interfaces just like
GUis. In a dialog, the computer speaks a prompt and then waits for the user to
respond to it. The computer then waits until a speech recognition event occurs.
A speech recognition event is initiated by the speech recognition engine, which is
continuously analyzing the user's speech and attempting to match it to expected
responses in the dialog. There are a number of possible speech recognition
events, including "recognized response blah blah blah," "got a response but didn't
recognize it," "no response," and so on. Therefore, unlike GUIs and WUIs, where
the events that drive the interface are low-level, unambiguous occurrences (but-
ton pressed, mouse clicked, and so on), events in VoiceXML interfaces are the
result of complex, computation-intensive, potentially erroneous processing.
Summary
At its inception, the Web was envisioned as a medium to publish, share, and
interlink electronic documents. As the Web became more popular, the focus
shifted from electronic publishing of static documents to an interactive infra-
structure for generating, processing, and displaying information. The content
comes from a variety of sources, including files, databases, and computer pro-
grams. A browser interprets markup language embedded in the content, renders
it to a particular medium, and interacts with the user. HTML renders to a com-
puter display (and a sound card) and interacts with the user through a keyboard
and mouse. WML works with low-capability handheld devices. The newcomer to
the scene, VoiceXML, renders content as speech and interacts with the user using
speech recognition and speech synthesis technologies.
Part Two
The VoiceXML Language
HAVING ESTABLISHED A technical context for voice in general, it's time to focus on the
specifics of the VoiceXML language. As languages go, VoiceXML is not par-
ticularly tough. However, developing interfaces for voice is quite different than
developing graphical or text-based interfaces. Voice and graphical interfaces dif-
fer in structure, in how errors are created and perceived, and in how information
is processed. Some of the material in this part is the usual "syntax and seman-
tics" common to all introductions to programming languages. Much of the
material, however, is intended to help you think about and design high-quality
voice interfaces.
Chapter 4 introduces the Simplified Personal Information Manager, a voice-
enabled Web application that animates the tutorial in this part and the prototype
in the next part. Chapter 5 introduces the concepts that the VoiceXML language is
built around. Chapter 6 dives into the nuts and bolts of setting yourself up (in
terms of hardware and software) as a VoiceXML developer. With your environ-
ment in place, Chapter 7 provides a step-by-step tutorial that acquaints you with
all the key features of VoiceXML 1.0. Building on your understanding of what
VoiceXML can do, Chapter 8 offers guidance about how to design a good voice
interface using VoiceXML. Chapter 9 provides a reference-style discussion of the
machinery behind features explored in the tutorial. Finally, Chapter 10 explores
some advanced topics and issues that become apparent when you develop real-
world VoiceXML applications.
CHAPTER 4
Simplified Personal
Information Manager
Example
THROUGHOUT THE REST of this book, you will work with a simplified personal infor-
mation manager (SPIM) application. A SPIM is a slimmed down personal
information manager (PIM) with the following features: address book, appoint-
ment calendar, and to-do list. A SPIM is accessible through voice, wireless, and
Web interfaces. As you work through the examples in the book, you will develop
fully functioning components of your SPIM, but your intent should not be to
implement a complete, robust PIM.
This example was chosen for the following reasons:
• It is simple to understand and useful. Many people are familiar with some
kind of PIM.
In the following sections, I analyze the SPIM as a serious but petite appli-
cation. The core use cases are elaborated and diagrammed in the Unified Modeling
Language (UML).1 The basic diagrams used here are pretty intuitive, so you
should be able to grasp the intent of the diagrams without knowing UML.
1 UML is a notation for object-oriented analysis and design. It was developed by Rational
Software and is called "unified" because it incorporates the methodologies championed
by Booch, Jacobson, and Rumbaugh, as well as others. For more information on UML, the
Rational Rose object-oriented design and development tool (which was used to draw
the diagrams here), and pretty much anything to do with developing object-oriented
software, visit the Rational Software Web site (http://www.rational.com/).
[Use case diagram: the SPIMUser and Attendee actors and the top-level SPIM use cases, including EditContactInfo and ReviewSchedule]
During use case analysis, SPIMUser's access mode (eye or ear) is not
assumed or implied unless the access mode is intrinsic to the requirements of
the use case. For example, the use case in the next section (titled "Edit Contact
Information Use Case") should be the same regardless of access mode, because it
is an essential function of the system. On the other hand, a hypothetical "Dial
Roadside Assistance from My Broken-Down Car" use case could, arguably, pre-
sume that the user is accessing the system through a VUI or WUI, but not a GUI.
The sections that follow drill down another level into these scenarios. Notice
that every case starts with the Authenticate use case. This shared case models the
process of identifying the current user to the system. Conventionally, authenti-
cation is performed by a username/password-based login, but it may use other
mechanisms for voice (as you will see in Chapter 12).
[Use case diagram for reviewing and editing contact information: SPIMUser with the Authenticate, SelectContact, BrowseContact, SummarizeContact, ReviewContactInfo, DetailContact, and EditAttribute use cases]
[Use case diagram for reviewing appointments: SPIMUser and Attendee with the Authenticate, SummarizeAppointment, ReviewAppointment, DetailAppointment, and NotifyAttendee use cases]
[Screen shot: the ReviewSchedule use case diagram in Rational Rose, showing SPIMUser with the Authenticate, SelectAppointment, and ReviewAppointment use cases]
Object Model
The core objects in a SPIM are as follows:
• A schedule is a list of appointments that fall in a given time span (this after-
noon, today, tomorrow, and so on).
30
Simplified Personal Information Manager Example
Contact
Person omeAddress · Address
Gbt>usinessAddress : Address
~rstName · Stringl«---l ~honeNumber : PhoneAddrss
~a stName · String O. 1 ~AXNumber PhoMAddress
~M ai!Address : EMai!Address
1 n
..t
Time
. ~ year Integer
~ont h : EnumMonth
~ay EnumOayOIWeek • elapsedYearsO
~our Integer • elapsedMonthsO
IIPOrrunute : Integer • elapsedOaysO
• elapsedHoursO
• elapsedMinutesO
• elapsedSecondsO
Summary
In this chapter, I've laid out the skeleton of a simple, but realistic, application
example. The core use cases were elaborated and diagrammed in UML. From the
use cases, a basic object model was derived and diagrammed. Applying this level
of formality to the example is intended to drive home two points.
• The examples presented throughout the book are not "concocted" exam-
ples that show off technology features but never occur in practice. Rather,
the examples are practical because they derive from a real application.
In the chapters that follow, you'll use the SPIM application to lead you into
VoiceXML programming.
CHAPTER 5
VoiceXML Concepts
THIS CHAPTER INTRODUCES VoiceXML. It begins with a brief history of how VoiceXML
was developed and provides an overview of VoiceXML technical concepts. This
sets the stage for subsequent chapters, which drill down into specific topics using
the SPIM application to illustrate how speech user interfaces are expressed
in VoiceXML.
VoiceXML History
The major driver for development of VoiceXML has been the desire of the tele-
phony industry to make existing telephone networks a vital part of the
Information Age. Obviously, telephone companies and our society as a whole
have a strong investment in the telephone and Public Switched Telephone
Network (PSTN). Access to the Internet over dial-up connections was and is
a major component of the Internet's success. Technically, however, the encoding
and transmission of digital data over telephone networks originally intended for
analog voice communication is somewhat inefficient. On the other hand, using
voice to access the Internet would capitalize directly on the existing, proven, and
highly tuned capabilities of Plain Old Telephone Service (POTS).
AT&T, Lucent Technologies, and Motorola began discussing the possibility of
developing a common language for voice-enabled applications in 1998. The
VoiceXML Forum was formed in 1999, and IBM became the fourth founding
member. The initial specification, VoiceXML 0.9, was released in August 1999.
A process of public review and response to comments culminated in the release
of the VoiceXML 1.0 specification in March 2000. Following development of the
language definition, the VoiceXML Forum turned custody of the specification
over to the World Wide Web Consortium (W3C). The Forum reorganized itself
and broadened its charter to include more of a general role as a technology and
industry advocate for the VoiceXML community. See Figure 5-1 for a detailed
timeline of the previously described events.
Figure 5-1. Timeline of speech and telephony activities (courtesy of the VoiceXML Forum). The timeline runs from AT&T, Motorola, and Lucent discussing a common language, through Motorola's VoxML, IBM's SpeechML, and Lucent's TelePortal announcements, the W3C's opening of a voice browser discussion, the formation of the VoiceXML Forum, and the VoiceXML 0.9 and 1.0 releases, to the Forum's reorganization (20+ promoters, 250+ supporters) and the W3C's preparation of the VoiceXML 2.0 draft along with speech grammar and TTS markup languages.
The VoiceXML Web site (http://www.voicexml.org/goals.html) describes the
Forum's goal as follows:
"The VoiceXML Forum is an industry organization founded by AT&T, IBM,
Lucent and Motorola, and chartered with establishing and promoting the Voice
Extensible Markup Language (VoiceXML), a new specification essential to mak-
ing Internet content and information accessible via voice and phone."
Now the W3C is overseeing the ongoing specification of the VoiceXML lan-
guage and other speech-related technologies, including
• Speech grammars
• Voice dialogs
• Voice synthesis
• Pronunciation lexicon
• Call control
• Multimodal systems
Visit the VoiceXML Forum's Web site (http://www.voicexml.org/) for infor-
mation about VoiceXML Forum activities and members. Visit the W3C's Voice
Browser Activity area (http://www.w3c.org/Voice) for information about the
W3C's work on various voice-related technologies.
1. You sit down at a computer, open a Web browser, and type in a Web
address (URL).

2. The Web browser sends a request to the Web server named in the address.

3. The Web server sends a response to the Web browser, which includes an
HTML document.
4. The Web browser "renders" the HTML document onto your screen.
5. You view the display and navigate the site by clicking and typing.
Getting the same information from the same Web site using VoiceXML goes
something like this:
1. You pick up a phone and dial the Web site's phone number.

2. The VoiceXML gateway that answers the call sends a request to the Web server.

3. The Web server sends a response to the gateway, which includes a VoiceXML document.

4. The VoiceXML interpreter "renders" the VoiceXML document by speaking or playing prompts to you over the phone.

5. You listen and then respond by speaking or pressing keys on the tele-
phone's keypad.
NOTE Even though you can use your wireless phone to access the Web
using voice (VXML) or Web browsing (WML), you can't do both together.
When using voice, the link is over the telephone network; when using
wireless, it's over a WAP link. Currently, handheld devices can only com-
municate over a single type of link at a time. Even if your handheld device
is capable of servicing both types of links simultaneously, there's no proto-
col to correlate what's happening on the phone with what's happening on
the WAP link.
From the user's perspective, the difference is in the medium by which input
and output are conveyed. With a conventional Web browser, you receive the out-
put from the computer through your eyes and send input to the computer with
your hands. With VoiceXML, you receive output from the computer through your
ears and send input to the computer with your voice (see Figure 5-2).
Figure 5-3. VoiceXML gateway subsystems: a telephony interface (voice) with speech recognition, DTMF, audio, and Text-to-Speech (TTS) engines, the VoiceXML interpreter, and a network interface (HTTP)
Elements of VoiceXML
In HTML, the basic element of retrieval is a page, which is a document with a
certain address. When a Web browser receives an HTML document from a Web
server, it renders the entire page to the screen at once. A VoiceXML document
specifies an entire conversation, which consists of interchanges between the
caller and VoiceXML interpreter. In each of these interchanges, the VoiceXML
interpreter reads or plays a prompt and the caller responds with information or
a command. The VoiceXML interpreter thus "renders" the VoiceXML document
one exchange at a time. (In this respect, VoiceXML is more similar to WML, where
the WAP phone displays one card at a time.)
Dialogs
In VoiceXML, dialogs are the building blocks for conversations. The VoiceXML
interpreter "renders" a VoiceXML document by carrying on a conversation with
the person at the other end of the telephone call. The person and the VoiceXML
interpreter take turns speaking and listening. At any point in time, the conver-
sation between the caller and gateway is "in" one dialog. During a call, the
conversation moves among dialogs in the application. The VoiceXML document
contains elements that specify what the VoiceXML interpreter can say or play to
the user and what the user can say or key to the VoiceXML interpreter.
Navigation
Web sites are usually organized into a shallow hierarchy of topics. Visually, this
hierarchy is shown as a row of links at the top of each page (or column of links at
the left of each page). Visitors click these links to navigate through the hierarchy.
VoiceXML menus provide the audio equivalent of visual pick lists. The computer
speaks the list of available choices, and the user responds by saying the desired
option. VoiceXML links enable people to jump directly to a destination at any
time simply by saying the name of the link.
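A sketch of both constructs in VoiceXML 1.0 follows; the menu choices, document names, and prompt wording are invented, not taken from the book's application:

<vxml version="1.0">
  <!-- A document-level link: saying "help" at any time jumps to the help dialog -->
  <link next="#help">
    <grammar>help</grammar>
  </link>

  <!-- A menu: the audio equivalent of a visual pick list of navigation links -->
  <menu id="main">
    <prompt>Say contacts, calendar, or to do list.</prompt>
    <choice next="contacts.vxml">contacts</choice>
    <choice next="calendar.vxml">calendar</choice>
    <choice next="todo.vxml">to do list</choice>
  </menu>

  <form id="help">
    <block>You can say contacts, calendar, or to do list at any time.</block>
  </form>
</vxml>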
Grammars
At any point in a VoiceXML dialog, there is a predefined set of valid responses
that a person can speak. A grammar specifies a set of responses that can be rec-
ognized by the computer. A simple grammar may specify some fixed phrases to
be used as commands (for example). A more complex grammar may specify a set
of basic words (the vocabulary) and multiple alternative rules determining the
order in which the words may appear to form expressions or sentences. A gram-
mar is the primary input to the voice recognition technology that underlies the
VoiceXML interpreter.
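VoiceXML 1.0 leaves the grammar format itself platform-dependent; the sketch below assumes a platform that accepts inline JSGF-style grammars (the field, prompt, and word list are invented). It defines a small vocabulary plus a rule allowing an optional lead-in phrase:

<field name="day">
  <prompt>Which day would you like to review?</prompt>
  <!-- Square brackets mark optional words; parentheses and vertical bars
       list the alternatives the recognizer will accept -->
  <grammar type="application/x-jsgf">
    [my schedule for] (today | tomorrow | Monday | Tuesday | Friday)
  </grammar>
</field>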
Events
Programming languages such as C++ and Java use exceptions to handle errors.
VoiceXML events are based on the exception concept. Events are thrown when
certain conditions are detected either by the VoiceXML interpreter or by the
VoiceXML program itself. Event handlers are fragments of executable code that
are invoked to catch an event.
Due to the nature of speech, conversational dialogs are intrinsically real-time
processes. Unlike graphical interfaces, where you can take a lunch break and pick
up right where you left off, speech interfaces require responses within certain
time periods and must explicitly accommodate real-time interruptions and dis-
tractions. In VoiceXML, events are not restricted to representing error
conditions-they are used to represent all kinds of real-time interactions.
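A sketch of two common handlers follows (the prompt wording is invented): <noinput> catches the event thrown when the caller stays silent past the timeout, and <nomatch> catches the event thrown when speech is detected but does not match the active grammar.

<field name="answer">
  <prompt>Please say yes or no.</prompt>
  <grammar>yes | no</grammar>
  <!-- Handler for the noinput event: the caller said nothing -->
  <noinput>
    Sorry, I did not hear you.
    <reprompt/>
  </noinput>
  <!-- Handler for the nomatch event: speech was heard but not recognized -->
  <nomatch>
    Sorry, I did not understand.
    <reprompt/>
  </nomatch>
</field>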
Summary
Beginning with a brief history ofVoiceXML, this chapter introduced the key con-
cepts that underlie VoiceXML. Browsing the Web by voice is a new activity
enabled by VoiceXML that differs from using a conventional Web browser. To
access a VoiceXML application, a person dials into a VoiceXML gateway, which
contains all the hardware and software required to recognize speech, synthesize
speech, and run the VoiceXML interpreter. The VoiceXML application is com-
posed of dialogs that structure the interaction between person and computer.
VoiceXML provides links, which are the verbal equivalent of hypertext links.
VoiceXML forms are composed of items, each of which elicits a particular piece
of information from a person. Grammars specify what responses are valid at
every point in a dialog. Events are used to handle errors and manage a conver-
sation in real time.
CHAPTER 6
Outfitting Your
VoiceXML Expedition
VOICEXML IS A NEW and burgeoning technology, so the landscape of available
VoiceXML tools and products is changing at "Web speed." This chapter provides
a quick orientation to the basic development approaches and tools. I'm not going
to try to cover the various characteristics and features of each tool, because that
information will undoubtedly be obsolete by the time you get it. However, I will
provide pointers to places where you can find up-to-date information.