Integrating VoiceXML with SIP Services
Abstract: We describe our Session Initiation Protocol (SIP)-based VoiceXML browser, sipvxml, that allows programming interactive voice response applications that are accessible from telephones as well as IP phones. We also describe how we have used sipvxml in our multi-party multimedia conferencing server. We propose other applications and extensions that can benefit from this technology in our IP telephony test bed. (keywords: Internet telephony; interactive voice response; SIP; VoiceXML; sipvxml)

I. Introduction

People are familiar with traditional interactive voice response (IVR) systems found in voice mail access, dial-in conferences, phone-based customer support and tele-banking. VoiceXML is an XML-based language developed by the W3C [1] to create voice dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input and recording of audio for telephony applications. It enhances the traditional proprietary and closed IVR systems to an open programmable architecture. It brings the advantage of web technologies to a telephony user by providing programmable dialogs, similar to HTML forms or CGI scripts.

The Session Initiation Protocol (SIP [7]) is an Internet telephony signaling protocol used for establishing and terminating Internet multimedia sessions. A SIP-based VoiceXML browser (or SIP-VoiceXML browser) allows a SIP user to take part in application-specific IVR systems, e.g., voice mail or tele-banking. It also brings the advantage of VoiceXML technology to a regular telephone user via a SIP-PSTN gateway. We have developed a SIP-VoiceXML browser, sipvxml, to enhance the services of our CINEMA test-bed [3], [4], [11]. In particular, we have extended our multimedia conferencing server, sipconf [12], and unified messaging (voicemail) server, sipum [13] to provide enhanced services and convenience to a telephone user.

We describe the architecture of sipvxml in Section II. Section III describes use of sipvxml with our conferencing server. More examples of SIP services enabled by VoiceXML are described in Section IV. We list some other related work in Section V and conclude in Section VI.

II. Architecture

A SIP-VoiceXML browser is similar to a web browser for a telephone instead of a desktop PC. The browser fetches the VoiceXML pages or pre-recorded media files from a web server and presents an interactive dialog to the telephone user. Fig. 1 shows an example scenario where the browser can be accessed from SIP phones as well as a regular telephone. The VoiceXML pages can either be statically stored on the web server or dynamically generated based on some server side programming logic like HTTP-CGI (Common Gateway Interface), Java servlet or Java server pages. The media files can either be stored on the web server or can be streamed in real-time from an RTSP [10] media server, such as rtspd, directly to the SIP caller using RTP [8].

Fig. 1. Example sipvxml scenario (components shown: telephone, PSTN, SIP-PSTN gateway, IP soft-phone, IP hardware phone, SIP based VoiceXML browser, web server (CGI, Servlet, JSP), media server (media files); arrows: call request, fetch VoiceXML pages, get streaming media)

A. VoiceXML page

The following example VoiceXML page prompts the caller with spoken audio: “Enter the ZIP code ...”. When the user presses a sequence of digits, say 10027#, the variable zipcode gets the value “10027” that gets passed to the URL http://myserver.com/weather.cgi?zipcode=10027. It is up to the script weather.cgi to process the input and generate further VoiceXML pages. If there is some error or user doesn’t press anything, then the prompt is repeated.

<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <field name="zipcode">
      <prompt>Enter the ZIP code of the location for which you
        want weather information.</prompt>
    </field>
    <catch event="noinput error help">
      Enter the ZIP code again followed by the pound key.
    </catch>
    <block>
      <submit next="http://myserver.com/weather.cgi" namelist="zipcode"/>
    </block>
  </form>
</vxml>

We have implemented a very simple DTMF grammar. A typical explicit dtmf tag in the VoiceXML page looks like:

<dtmf type="application/x-dtmf">
  1 | 2 | 3 | 4 | *
</dtmf>

The MIME type for this grammar is “application/x-dtmf”. Input is either a fixed length string or terminated by a “#”. An implicit timeout of 5 seconds is implemented so that the input is automatically accepted if the user does not press the terminating “#” key for some time. If no grammar is
specified, then the interpreter will accept any input. Users can press “**#” anytime to signal the help event.

We have implemented only a subset of VoiceXML tags as needed in our application: assign, audio, block, catch, clear, disconnect, dtmf, error, exit, field, filled, form, goto, help, noinput, nomatch, prompt, submit, value, var and vxml. We do not support any client side script (e.g., JavaScript) usually needed for arithmetic or string operations in the browser, as the same effect can be achieved using server side processing.

B. Operation of the browser

Fig. 2. Operation of sipvxml (components shown: SIP interface, interpreter thread, XML parser, web server, Form Interpretation Algorithm, RTP interface, RTP receive thread, DTMF detection, speech recognition, grammar matching rules, HTTP fetcher, text-to-speech SDK, RTP send thread; arrows numbered (1) to (16))

Fig. 2 shows the components of our SIP-VoiceXML browser, sipvxml. We use our SIP library¹ for implementing a SIP interface, the RTP library² for the RTP/RTCP interface, Apache’s XML parser³ with DOM interface, an HTTP fetcher⁴ for getting non-XML pages and IBM ViaVoice Text-To-Speech SDK⁵ for speech synthesis.

¹ http://www.cs.columbia.edu/~kns10/software/siplib
² http://www.cs.columbia.edu/~hgs/rtp/rtp-library.html
³ http://xml.apache.org
⁴ http://cs.nmu.edu/~lhanson/http_fetcher/
⁵ http://www-4.ibm.com/software/speech/dev/ttssdk_linux.html

1. When the browser receives a new incoming SIP call it creates three different threads: RTP receive thread, RTP send thread, and the VoiceXML interpreter thread. The RTP receive thread receives media packets from the caller and invokes the DTMF detection module. The RTP send thread streams out media packets to the caller. A separate send thread helps in maintaining a constant bandwidth (e.g., 64 kb/s for G.711 audio) for outgoing packets, irrespective of the speed of the speech synthesizer. The initial VoiceXML page URL can be preconfigured in the browser or encoded in the SIP request [6]. For example, if the caller dials sip:dialog.vxml.http%3a//dialogs.server.com/[email protected] then the call will reach the browser running at vxmlservers.com and it will fetch the initial VoiceXML page from http://dialogs.server.com/script32.vxml. On the other hand, if the Request-URI is sip:[email protected], then the interpreter is invoked with the default pre-configured initial VoiceXML URL, e.g., that of the conferencing service.

2. The interpreter thread calls the XML parser with the initial URL.

3. The XML parser fetches the page from the web server or a local file system (based on the initial URL).

4. It converts the returned XML document into a tree data structure.

5. The interpreter thread invokes the Form Interpretation Algorithm (FIA [1]) on the selected form from the VoiceXML document.

6. FIA invokes various other modules based on the content of the VoiceXML document. For example, it can invoke the text-to-speech SDK to synthesize any prompts. The current implementation does not use any speech recognition engine because user input is via touch-tone keys.

7. FIA can also invoke the HTTP fetcher module to fetch an external grammar file or a media file for an audio prompt. The XML parser has its own internal HTTP client to fetch VoiceXML pages.

8. The HTTP fetcher implements a simple HTTP GET method to retrieve a document.

9. The media file retrieved from the web server using the HTTP fetcher is fragmented into 20 ms packets for interactive telephony, and enqueued for streaming out to the caller by the send thread.

10. The speech synthesizer output is also fragmented and enqueued for sending out to the caller.

11. The VoiceXML document can specify the grammar rules in various scopes in the document. FIA can set the active grammar for the matching engine based on the current execution scope in the VoiceXML page.

12. The RTP receive thread receives the RTP media packets and invokes the DTMF detector.

13. Any detected DTMF digit is passed to the grammar matching engine.

14. DTMF tones can be transported from the caller to the browser in a number of ways. One approach is to not distinguish them from the spoken voice by encoding them using the same audio codec. However, a low bandwidth audio codec may distort the properties of the in-band DTMF tones making them hard to detect. A second, preferred way is to use “telephone-event” [9] containing the digit codes instead of the encoded audio in RTP packets. In the first case, the browser has to do the DTMF detection, whereas in the second case the caller or the gateway has to do the DTMF detection. The RTP receive module forwards telephone-events directly to the grammar matching engine. We have implemented both these methods. A third method of transporting DTMF in SIP INFO message is not used in our implementation.

15. The grammar matching engine tries to match the received digits with any active grammar, and informs
the FIA if a match is found.

16. The RTP send thread periodically sends media packets to the caller. No packets are sent during silence.

III. Multi-party Conferencing

Consider a SIP conferencing system where users join the conference by dialing in a conference URI sip:[email protected]. A regular telephone user with only a touch-tone phone cannot dial such a generic URI. We can assign one phone number per conference for direct inward dialing. However, it is preferred that the user always dials the number of the VoiceXML browser that in turn prompts him for the authentication PIN (personal identification number) and conference number. Once the user is authenticated the browser transfers the call to the selected conference. One can also use a single PIN to identify both the participant as well as the conference.

[Figure: (a) Message flow; (b) Architecture. Message flow fragments between the user's SIP phone and the VoiceXML browser: (1) INVITE sip:[email protected], 200 OK (accepted), ACK (confirmed); (2) “Welcome, please enter your four digit PIN code.”; (3) 1-2-3-4-#; (4) user auth/identification against the database, 1234=>Alice; ...]

7. The browser again checks if Alice is allowed to join the conference identified by number 23, which in this example is sip:[email protected].

8. Once the authentication is done, the browser transfers the call to the actual conference server using the SIP REFER method [14] containing the SIP URI of the conference.

9. Alice’s phone accepts the transfer and initiates a new call to the conference server.

10. Alice’s phone exchanges audio with the conference server directly, without going through the browser.

Note that the user authentication, conference look up and transfer are actually invoked by the conference service CGI scripts, whereas the browser just interprets the VoiceXML pages generated by the scripts to do the actual transfer or prompt the caller. For instance, the service script may generate the following transfer tag for the call transfer in step (9).

<block><prompt>Your call is being transferred,
  please wait.</prompt></block>
<transfer dest="sip:[email protected]" bridge="false" />

The transfer can be done in two modes: blind and bridged. The former is the transfer of the call to the conference server without consulting the server whereas the latter is the transfer after consulting such that the browser may choose to be in the media path. We have implemented the blind call transfer as shown in Fig. 3.
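The division of labor described above, where the service script makes the decisions and the browser only interprets the resulting page, can be sketched as a small server-side routine. The following Python sketch is an illustration only and is not the CINEMA service script: the function name, the PIN and conference tables, and the example.com URI are all hypothetical. It returns a VoiceXML page containing the blind-transfer tag when the PIN is known and the caller may join the requested conference, and a page with a spoken error otherwise.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical data: PIN -> user, and conference number -> (SIP URI, allowed users).
USERS = {"1234": "alice"}
CONFERENCES = {"23": ("sip:conf23@conference.example.com", {"alice"})}

VXML_HEADER = '<?xml version="1.0"?>\n<vxml version="1.0">\n<form>\n'
VXML_FOOTER = '</form>\n</vxml>\n'


def conference_page(pin, number):
    """Return a VoiceXML page: blind transfer on success, a spoken error otherwise."""
    user = USERS.get(pin)
    conference = CONFERENCES.get(number)
    if user is None or conference is None or user not in conference[1]:
        message = "Sorry, the PIN or conference number is not valid."
        body = "<block><prompt>" + escape(message) + "</prompt></block>\n"
    else:
        uri = conference[0]
        # bridge="false" requests a blind transfer, as in the tag shown in the text.
        body = ("<block><prompt>Your call is being transferred, "
                "please wait.</prompt></block>\n"
                "<transfer dest=" + quoteattr(uri) + ' bridge="false" />\n')
    return VXML_HEADER + body + VXML_FOOTER
```

Calling conference_page("1234", "23") yields a page with the transfer tag, while an unknown PIN yields only the error prompt. Keeping this logic on the server is consistent with the browser's lack of client-side scripting.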
Maintaining packet forwarding states for the duration of the conference limits the scalability of the browser on how many simultaneous callers it can handle. The browser may issue re-INVITEs with updated transport addresses for RTP/RTCP to both the caller and the conference server such that the media path is direct. However, this still needs the signaling state to be maintained for the duration of the call. On the other hand, a blind transfer does not require any call state in the browser for the duration of the conference. But it expects that the caller’s IP phone supports the REFER method.

IV. Other services

This section describes some of the current and future services that can be provided using VoiceXML in our SIP environment.

A. Unified messaging – voice mail

Sipum is a SIP/RTSP-based unified messaging system that provides a centralized voice mail and answering machine service. For example, when Alice calls [email protected], the SIP server for the cs.columbia.edu domain forks the call request to both Bob’s IP phone and the answering machine (sipum). If Bob picks up the phone, the call request to sipum is cancelled. If Bob does not pick up the phone after 10 seconds, sipum accepts the call on Bob’s behalf and prompts the caller, Alice, to leave a voice message. The same application, sipum, is also used by Bob to retrieve his voice mails, for example, by dialing a URL sip:[email protected] to retrieve the message with ID 672. There are other ways of retrieving voice mails, for instance, using a web browser or a media client. However, none of these are appropriate for a telephone user with limited touch tone capability. We use sipvxml with application-level logic for voice mail service to allow a more interactive interface to access and manage the voice mail box. From a user’s perspective this is similar to the traditional voice mail service. However, use of SIP and VoiceXML allows easy integration with web, email, instant messaging, and telephone.

The application logic to perform voice mail service is built as CGI scripts executed in the browser’s context. Once the user is authenticated using a PIN, the main menu is spoken out. This includes options for playing out new voice messages and other details like sender, subject and timestamp of the message. An RTSP media server (e.g., rtspd) can be used to stream the actual voice message directly to the caller’s phone using RTP. We also provide additional options like saving or deleting the message.

... Java Servlet. These programs generate VoiceXML pages based on the mail box content and user input. More information can be found at [5].

C. Event notification and scheduling

Asynchronous event notification is useful when polling for the event is inefficient. For example, the email-by-phone system can be modified to notify the user of any important email by calling the user’s cell phone. Text-to-speech is used to play out the email content on the phone. Alternatively, the user can go to a web page and schedule a birthday reminder or wakeup call by recording his own audio announcement or a text message. The system notifies the user by phone at the scheduled time. These simple systems do not need VoiceXML. However, a VoiceXML browser is needed to allow the user to schedule events from a phone, or to “snooze” and be notified again after a short while. We are extending sipvxml to allow initiating a new call for notification.

D. Audio volume level for conference

Multi-party audio conferencing among heterogeneous clients with different audio devices causes annoying distortion of audio. Some participants are heard very loud and some are not heard at all. Ideally, the conference server should balance the input audio level from the participants before mixing. However, this imposes additional processing requirement on the server for every audio packet. Another approach is to tell the participant to adjust his volume level for both microphone and speaker. The participant connects to an “audio level feedback” system before joining the conference and speaks into it. The system announces if the user’s microphone volume is acceptable, too high or too low. The system also plays back a pre-recorded audio file and allows the user to adjust his speaker volume. This processing is built in a server side CGI script that is accessible via a VoiceXML browser.

E. Advanced conference control

Our current conference server implementation provides a web interface for floor control by the moderator and participant list display. We can extend it such that conference control can be done using the same telephone that the participant or moderator is using for the conference.

F. Integrating speech recognition

Our current implementation accepts user input only via DTMF digits. VoiceXML is designed for spoken audio input as well as DTMF. Allowing both mechanisms will improve the user experience.