Thesis 2023

This document describes a study that implemented VoiceXML in a speech-driven online learning module for oral communication. The study aimed to develop such a module to serve as a substitute for a speech laboratory and evaluate its effectiveness. The resulting application, called Speech Tutor, was designed for college students enrolled in an English oral communication course. It includes tutorials on English sounds, exercises, and records and assesses students' pronunciation. The application was programmed in VoiceXML and PHP with a graphical user interface in HTML and PHP. Students and instructors evaluated Speech Tutor and found it effective in delivering course content and improving pronunciation, providing a cheaper and more innovative alternative to a physical speech laboratory.

IMPLEMENTATION OF VOICEXML IN A SPEECH-DRIVEN ONLINE LEARNING MODULE FOR ORAL COMMUNICATION

In Partial Fulfillment of the
Requirements for the Degree
BACHELOR OF SCIENCE IN INFORMATION TECHNOLOGY

STARTFORD INTERNATIONAL SCHOOL
J. Catolico Avenue, Lagao, 9500 General Santos City

Approval Sheet

In partial fulfillment of the requirements for the degree of BACHELOR OF SCIENCE IN INFORMATION TECHNOLOGY

IMPLEMENTATION OF VOICEXML IN A SPEECH-DRIVEN ONLINE LEARNING MODULE FOR ORAL COMMUNICATION

APPROVED in partial fulfillment of the requirements for the degree of BACHELOR OF SCIENCE IN INFORMATION TECHNOLOGY by the Oral Examination Committee:

YVONNE OCHAGABIA, MIT
DEAN
Table of Contents

Page
Chapter 1: Introduction
1.1 Background of the Study……..…………………………………………….. 1
1.2 Technology Application Context ……….………………………………… 2
1.3 Objectives of the Study ………………….………………………………... 4
1.4 Significance of the Study ……………….………………………………… 4
1.5 Scope and Limitations of the Study …….………………………………… 5
1.6 Definition of Terms ……….…………….………………………………… 6

Chapter 2: Review of Related Literature


2.1 Rise in the Demand for Speech Applications …………..…………………... 9
2.2 VoiceXML: the W3C Standard for Building the Voice Web ……..………. 9
2.3 Survey of Existing Voice-driven Applications Using VoiceXML
2.3.1 Voice Access Booking System (VABS) ……………………….. 10
2.3.2 Voice Inventory Management System (VIMS) ………………... 11
2.3.3 Speech-Driven Automatic Receptionist …….………………….. 13
2.4 Usability Issues of VoiceXML
2.4.1 Speech User Interface (SUI) Design …………………………… 14
2.4.2 Scalability Issues ………………………………………………. 16
2.5 Speech Technology in Computer-Aided Language Learning:
Strengths and Limitations of a New CALL Paradigm ………………. 18
2.6 E-Learning and VoiceXML ………..……………………………………... 18

Chapter 3: Methodology
3.1 Operational Framework …..………………………………………………. 20
3.2 Procedure in Developing the Speech-Driven Online Learning
Application ………………………………………………………….. 21

Chapter 4: Technology Background …………………………………………………... 26

Chapter 5: Results and Discussion ………………………….…………………………. 33

Chapter 6: Conclusion and Recommendations


6.1 Conclusion …………………………………………………………….... 43
6.2 Recommendations ……………………………………………………… 44

Bibliography ……………...…………………………………………………………… 46

Appendices
Appendix A: User’s Manual ……….………………..………………………………… 48
Appendix B: Relevant Codes ………………………………………………………….. 59
Appendix C: Speech Tutor Rubrics …………………………………………………… 61
Appendix D: Sample Answered Questionnaires ………………………………………. 62

List of Figures

Page

Figure 1. Voice Inventory Management System Control Flow Diagram ….………….. 12


Figure 2. Stages of Development of Receptionist …………………….………………. 13
Figure 3. Speech Tutor LAN Architecture …………………….…………………….... 20
Figure 4. VoiceXML Architecture ……………………………….……………….…… 27
Figure 5. VoIP Architecture …………………………………….………………….…. 28
Figure 6. Voxeo Softphone ……………………………………….……………….….. 33
Figure 7. MySQL Command Line Client ………………….…………………….…… 33
Figure 8. Skype Interface …………………………………….…………………….…. 34
Figure 9. Login Page ……………………………………….…………………….…… 36
Figure 10. Student Home Page ……………………………….………………….…… 37
Figure 11. Tutorial Page …………………………………….………………….……. 38
Figure 12. Playing of Recording in Windows Media Player ………….…………….… 38
Figure 13. Login Page of Faculty ……………………………………………….……. 39
Figure 14. Faculty Home Page – Student List ………………………………….…….. 39
Figure 15. Student Record ……………………………………………………….…… 40
Figure 16. Playing of Student Recording ……………………….…………….………. 40

List of Tables
Page

Table 1. Results of System Evaluation Based on Speech Tutor Rubrics ……………… 35


ABSTRACT

Technological advances such as the convergence of the Internet and telephony have brought about the creation of VoiceXML, a W3C standard markup language for dynamic voice applications. It uses two basic voice technologies, speech recognition and text-to-speech synthesis, which enable users to interact with applications simply through voice. As such, it is an appropriate tool for building online learning systems for speech courses. This project aimed to develop a speech-driven online learning module for Oral Communication (English 3). Specifically, it aimed to use the speech application as a substitute for a speech laboratory and to evaluate its effectiveness as a learning tool for Oral Communication.

The resulting speech application, named Speech Tutor, was designed to be used by college students enrolled in English 3, instructors handling the subject, and English learners in general. After the content and dialogues for the application were designed, grammar files were created and a MySQL database was developed to store student and assessment data. The application was then programmed in VoiceXML and PHP. A graphical user interface was also created in HTML and PHP to provide a visual guide for the students: it is where they create their accounts, read the tutorials, view their scores, and listen to their recordings. It is also where instructors view their students' scores and links to their recordings.

Using the Speech Tutor rubrics, the speech application was evaluated by students enrolled in English 3 as well as their respective instructors. The test users called the application through Voxeo Prophecy's softphone, heard the tutorials on each of the English vowel sounds, interacted with the system in the exercises, and recorded their voices while reading a sample passage. Based on the results of the rubrics, the test users found Speech Tutor effective in delivering the content of the English 3 course and improving the learners' pronunciation. More importantly, the application proved to be a cheaper and more innovative alternative to building a speech laboratory.
Chapter 1

INTRODUCTION

1.1 Background of the Study

The Internet has certainly become a part of the daily activities of people from
different walks of life. These activities include online shopping, telecommuting,
videoconferencing, online gaming, communicating, and online learning. Among these,
online learning is most relevant to both educators and learners because it enables them to
interact with each other by means of different communication technologies. Online
learning is more flexible than the traditional face-to-face lectures because students can
choose when and where to complete the course; all they need is a computer with Internet
access. Moreover, online learning provides convenient access to information and
resources through the Internet (Grant MacEwan College, 2006).

However, because most online learning materials require learners to read large
volumes of text with little or no interaction with the system aside from a few mouse
clicks, most students either lose interest or find the experience eye-straining. According to
Millbower (2003), learners complain that e-learning is boring, one-dimensional, and
impersonal. These complaints result from the absence of an instructionally designed
auditory component.
Many people consider voice capability one of the key factors behind the success of the
Internet. However, the vocal component of multimedia is typically one-directional and
not at all interactive unless complex and expensive telephony solutions are overlaid.
Thus, for the Internet to fulfill its role as a communications medium of choice, it needs to
fully embrace the concept of dialogue. This is especially true for education, where not
only the quality of the information transferred but also the instructor/learner relationship
fostered through dialogue is important (HorizonWimba, 2006).
Also, some instructional materials are better presented with audio: when teachers and
learners are on different continents, when learners need to listen to content repeatedly,
when they need to practice pronunciation repeatedly, or when they need to converse with
others (Ross, 2002).
Hence, this study was conducted to investigate the effectiveness of a speech-driven
online learning system as a medium of instruction in a course entitled Oral
Communication. The application was developed using a new technology called
VoiceXML, the "HTML of the voice Web" and the open standard markup language for
voice applications.

The developer of this study saw the great potential of a speech-driven online learning
system for enhancing the teaching-learning process of Oral Communication, one of the
basic courses offered by the English Department at MSU-GenSan. Oral Communication is
concerned with the study of the different sounds of English (vowels, consonants, and
diphthongs), as well as stress and intonation.

1.2 Technology Application Context


VoiceXML is an XML-based language created to develop speech user interfaces for
telephone users. It permits developers to create applications that can dialogue with users
simply through voice interactions. As such, it is the appropriate tool to use for building an
online learning module for Oral Communication, which is basically a subject on
improving speech.
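To make this concrete, a minimal VoiceXML document might look like the sketch below. The greeting text is illustrative and is not taken from the actual Speech Tutor source.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="welcome">
    <block>
      <!-- The gateway's text-to-speech engine reads this prompt to the caller -->
      <prompt>
        Welcome to the oral communication tutor.
        Today we will practice the English vowel sounds.
      </prompt>
    </block>
  </form>
</vxml>
```

When a caller dials the application's number, the VoiceXML interpreter fetches a document like this and speaks the prompt; no graphical interface is involved.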

Several schools invest large amounts of money in their facilities. Among these
facilities is the speech laboratory: a facility that provides various technical means for
teaching foreign languages and where students' speech presentations are held. The
equipment commonly found in a speech lab includes cassette players, cassette recorders,
overhead projectors, microphones, speakers, and sometimes video cameras. The speech
laboratory exists to help students enhance their communication skills in a language
(commonly English) and to motivate them to access various language resources for
personal development (De La Salle University, 2007).

Since MSU-GenSan is a state university, its budget for facilities is relatively low
compared to those of top private universities. The English faculty has long been
requesting a speech laboratory for its students, but the request has yet to materialize due
to a lack of funds. The university's need for a speech laboratory could be met by
developing voice applications using VoiceXML.

The speech application would be composed of VoiceXML, grammar, PHP, and
audio files, as well as a MySQL database. The user would access the application simply
by dialing the SIP number (sip:[email protected]) assigned to it through the use of a
softphone. The VoiceXML interpreter in the VoiceXML gateway would answer the call
and start interpreting the VoiceXML code. Next, the application would prompt the user
for input by playing an audio file or synthesizing speech with the use of the gateway’s
Text-to-Speech Synthesizer. Once the response is received by the application, the
gateway’s Speech Recognition System would recognize the speech spoken by the user.
The application would then take the appropriate action for the caller’s response. It may
play the tutorial by synthesizing speech or playing an audio file, engage the user with the
speech exercises and save his score in the database, or it can record the user’s voice and
save the recording. The application may also generate VoiceXML documents dynamically
through the use of PHP.
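The flow described above can be sketched in VoiceXML as follows. The audio file name, script name, and menu choices are illustrative assumptions rather than the actual Speech Tutor code.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main_menu">
    <field name="choice">
      <!-- Prompt the caller with a prerecorded audio file;
           the inline text is the TTS fallback if the file cannot be fetched -->
      <prompt>
        <audio src="menu.wav">Say tutorial, exercise, or record.</audio>
      </prompt>
      <!-- Inline grammar listing the utterances the recognizer will accept -->
      <grammar version="1.0" root="cmd" type="application/srgs+xml">
        <rule id="cmd">
          <one-of>
            <item>tutorial</item>
            <item>exercise</item>
            <item>record</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <!-- Send the recognized choice to a server-side PHP script, which
             generates the next VoiceXML document dynamically -->
        <submit next="menu.php" namelist="choice"/>
      </filled>
    </field>
  </form>
</vxml>
```

The `<submit>` at the end is what lets a PHP script on the web server drive the dialogue: it returns a freshly generated VoiceXML document for the interpreter to execute next.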

According to Voxeo (2006), the benefits of VoiceXML include delivery of web content
and services through the telephone, portability across implementation platforms,
flexibility, and low capital and support costs.

The voice application can be delivered through the telephone because VoiceXML was
designed to overcome the limitations of interactive voice response (IVR) systems. Since
the telephone (both wireless and landline) is a ubiquitous device, it increases the
application's accessibility. The Internet also provides great potential and opportunity to
route telephone calls over the medium at low cost through Voice over Internet Protocol
(VoIP). It enables almost anyone, anywhere to access the application's content.
VoiceXML applications are also portable; over 300 vendors have pledged their support
for the VoiceXML standards, so developers have a guarantee that their applications will
run on a wide variety of platforms.


VoiceXML applications can be created in Java, JSP, .NET, C, C#, C++, Visual Basic,
Delphi, Oracle, Perl, Python, PHP, ColdFusion, JavaScript, VBScript, or any other
web-capable programming language. Moreover, VoiceXML applications can run on any
web-capable operating system and work with any telephony infrastructure, including T1,
ISDN, SS7, and H.323 or SIP Voice over IP (VoIP).

Perhaps the greatest advantage of VoiceXML is its low capital and support costs.
VoiceXML applications can run on the same web application servers as the university’s
existing web site, avoiding the need for additional application server investments. Also,
the split client/server architecture lets developers deploy telephony applications at their
facility while outsourcing telephony infrastructure, and call processing to a web hosting
service such as Voxeo.

1.3 Objectives of the Study


General Objective: This study aimed to use VoiceXML in developing a speech-driven online
learning module for Oral Communication.

Specific Objectives:
Specifically, this study aimed to:
a. use the speech application as a substitute for a speech laboratory; and
b. evaluate the effectiveness of the speech application as a learning tool for Oral
Communication.

1.4 Significance of the Study


This study is significant to schools and universities that aim to improve their language
instruction, especially in subjects that require oral participation. It would help them see
that there is a better and cheaper alternative to putting up a speech laboratory. They no
longer have to spend large amounts of money to acquire the best facilities for their
students; all they need is to use technology to innovate.

Moreover, this study is helpful to people studying English as a second language, more
specifically those enrolled in Oral Communication, a subject offered at MSU-GenSan.
The voice application will serve as a good learning tool for improving the
learners’ English fluency and pronunciation. According to Zhang (2001), the development
of interactive tools which can be used by students working independently to improve
pronunciation and conversation will greatly increase the productivity of the teaching work
force.

In traditional face-to-face lectures, teachers read out the correct pronunciation or
intonation of words and have the students imitate the sounds. Some students cannot keep
up as well as others but are afraid to ask their instructors to slow down or repeat words.
The speech-driven system gives students an opportunity to learn independently, listen to
the lessons repeatedly, and interact with the system using voice inputs and commands.
Voice is used in the instruction, which is a more natural mode of interaction.

Also, one of the problems encountered by instructors in teaching Oral Communication is
the lack of audio equipment. This hinders them from using audio materials, which are
considered very effective instructional media for speech education. The proposed system
addresses this problem: learners can listen to pre-recorded audio or synthesized speech on
the basic principles of speech anytime, anywhere, as long as they have a computer with a
speaker and microphone and a connection to the application.

1.5 Scope and Limitations of the Study


This study focused on developing a voice application using VoiceXML. The voice application
covered only the topic of the vowel sounds of the English language.

Basically, the online learning tool contained voice tutorials for each of the vowel
sounds, composed of recorded audio and synthesized speech. Since it is a voice
application, navigation was done by issuing voice commands or by pressing keys on the
telephone keypad (DTMF). Moreover, the application included interactive drill exercises
and recording functionality for each lesson covered.
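Mixed voice-and-DTMF navigation of this kind is commonly expressed with VoiceXML's `<menu>` element. In the sketch below (the choice labels are illustrative), `dtmf="true"` automatically assigns keys 1, 2, and 3 to the choices in order, so the caller can either speak a command or press a key.

```xml
<menu id="navigation" dtmf="true">
  <prompt>
    Say tutorial or press 1 for the vowel lessons.
    Say exercise or press 2 for the drill exercises.
    Say record or press 3 to record a passage.
  </prompt>
  <!-- Each choice is matched by its spoken label or its implicit DTMF key -->
  <choice next="#tutorial">tutorial</choice>
  <choice next="#exercise">exercise</choice>
  <choice next="#record">record</choice>
</menu>
```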

A corresponding web site was also constructed for the speech application. It
allowed students to view their scores, listen to their recordings, and study the tutorials on
each of the vowel sounds. It also enabled the teacher to view the scores and listen to the
recordings of his or her students in each of the vowel sounds.

The voice application is available online, both on the local area network and the
public Internet. It was uploaded to Voxeo Evolution, the proponent's chosen hosting
service, whose VoiceXML interpreter was used for synthesizing speech and recognizing
speech input. The hosting service assigned a phone number to the application, which
users dialed through Skype to access the voice application.
One limitation of this study is that the questions in the exercises are hard-coded;
teachers cannot change them whenever they want to. Furthermore, this application is just
one medium of instruction; it is not intended to be the sole or primary mode of classroom
instruction.

1.6 Definition of Terms


Automatic Speech Recognition (ASR) – describes a group of special technologies that
allow callers to speak words, phrases, or utterances that are used to control
applications; in the case of voice processing, speech recognition is used to replace
touch-tone input; the process of using an automatic computation algorithm to
analyze spoken utterances to determine what words and phrases or semantic
information were present.

Audio file - a file format for storing audio data on a computer system (some audio file
formats: .wav, .mp3, .au)

DTMF – an acronym for Dual-Tone Multi-Frequency; a method used by the telephone system to
communicate the keys pressed on the phone; also known as touch-tone

English 3 (Oral Communication) – a course offered at MSU-GenSan concerned with the
study of the basic principles of speech and their application in various
communication situations, namely dyadic, small-group, and public speaking

Grammar file – a file, written in Grammar Specification Language (Nuance GSL) format
or in Speech Recognition Grammar Specification (SRGS) format, that specifies the
valid utterances a user can make to perform an action or supply information in a
voice application
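For illustration, a small grammar in SRGS XML form might look like the following. The minimal-pair words are hypothetical examples of answers a vowel drill could accept, not the study's actual grammar files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="answer">
  <!-- The recognizer will accept exactly one of these words -->
  <rule id="answer" scope="public">
    <one-of>
      <item>ship</item>
      <item>sheep</item>
      <item>chip</item>
    </one-of>
  </rule>
</grammar>
```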

H.323 – an ITU-T recommendation that defines protocols for real-time voice and
multimedia communication over packet-based networks; an alternative to SIP for
Voice over IP call signaling

ISDN – Integrated Services Digital Network is a circuit-switched telephone network system
designed to allow digital transmission of voice and data over ordinary telephone copper
wires, resulting in better quality and higher speeds than are available with the PSTN;
a set of protocols for establishing and breaking circuit-switched connections and
for advanced call features for the user

PSTN – Public Switched Telephone Network is the network of the world's public
circuit-switched telephone networks, in much the same way that the Internet is the
network of the world's public IP-based packet-switched networks; includes analog,
digital, fixed, and mobile phones

SIP – Session Initiation Protocol, a signaling protocol used for establishing sessions in an
IP network. A session could be a simple two-way telephone call or a collaborative
multimedia conference session. SIP is a request-response protocol that closely
resembles two other Internet protocols, HTTP and SMTP

SS7 – Signaling System #7 is a set of telephony signaling protocols used to set up the
vast majority of the world's public switched telephone network calls
Text-to-Speech Synthesis (TTS) – the artificial production of human speech; it converts ASCII
text into the spoken word

T1 – also known as digital signal or DS-1 – a T-carrier signaling scheme devised by Bell
Labs; a widely used standard in telecommunications in North America and Japan to
transmit voice and data between devices.

Voice application - any application or service which relies upon voice communications;
including PSTN voice, also known as POTS ("plain old telephone service");
features, such as voice mail; services, such as teleconferencing; and audiotext.

VoiceXML – Voice eXtensible Markup Language is an XML-based markup language for
creating distributed voice applications; W3C's standard XML format for specifying
interactive voice dialogues between a human and a computer.

Voice gateway – provides telephone access to voice-enabled web services; it retrieves
VoiceXML-formatted content from web servers, converting it into interactive voice
dialogs with end users via fixed or mobile phones

VoIP – Voice over Internet Protocol is the transmission of voice traffic over IP-based
networks; VoIP services convert your voice into a digital signal that travels over the
Internet.
Voxeo – a company that helps enterprises improve service and lower costs by automating
and connecting their most common phone calls with its Interactive Voice Response
or Voice over IP solutions

XML – acronym for Extensible Markup Language; a standard for creating markup
languages which describe the structure of data; a metalanguage or a language for
describing languages; enables authors to define their own tags

Chapter 2

REVIEW OF RELATED LITERATURE

2.1 Rise in the Demand for Speech Applications


In the past years, computing has extended its reach from PCs and laptops to a
growing number of devices – from PDAs to smartphones to automobiles (King, 2005).
This increase in the use of wireless devices has also increased the demand for wireless
services and applications. Among these services is access to Internet content (Touesnard,
2001). The need of people to access Internet content from anywhere, anytime gave rise to
wireless and voice recognition-based applications (Penumaka, 2006).

There is a huge market opportunity in voice applications. Voice or speech
applications are applications in which the input and/or output are through a spoken,
instead of a graphical, user interface (IBM, 2005).

Penumaka has identified several benefits of voice applications, including the
ubiquitous presence of the telephone and hands-free interaction using voice commands.
Access to business and information over the telephone is somewhat easier because of the
greater availability of landline and wireless telephones and their ease of use (Phonologies,
2007). Furthermore, the Internet currently provides great potential and opportunity to
route telephone calls over the medium at low cost through Voice over Internet Protocol
(VoIP).

Moreover, hands-free interaction through voice commands offers convenience and
safety to wireless phone users. Voice commands are far more convenient and intuitive to
use than punching in letters and numbers on the tiny keypad of a wireless phone.

2.2 VoiceXML: the W3C Standard for Building the Voice Web
Traditional Interactive Voice Response (IVR) applications have been deployed in
enterprises for decades, but they’ve faced serious limitations including poor usability and
the inability to go beyond access to proprietary information (Penumaka, 2006). Hence, the
companies Lucent, Motorola, IBM, and AT&T developed an industry standard for
building voice applications that could access Internet content by phone and voice, and
they called this VoiceXML (Mairesse, 2004). According to Dennis King, Director of
Architecture for the Pervasive Computing Division at IBM, VoiceXML is crucial in
propelling the speech industry forward and promises to be an important part of the
multimodal world. Kelsey Group also says that VoiceXML is the right standard that
enables developers to take advantage of the huge market opportunity in voice
applications.
VoiceXML hides the complexities of the telephony platform from developers and
provides an easy way to develop feature-rich and media-rich speech applications. It uses
speech recognition and DTMF for user input, and prerecorded audio and Text-to-Speech
for output (Phonologies, 2007).

Christer Granberg, Chief Executive Officer of PipeBeach also believes that VoiceXML is
a powerful business enabler for the rapid and cost-efficient development of interactive
speech services, especially innovative services for the mobile user.

Benefits of VoiceXML
Since VoiceXML is an international standard, it lets the developer write an application
once and run it anywhere. It is also independent of the Speech and Telephony platform,
giving the developer flexibility in choosing the platform of choice. VoiceXML is also a
simple scripting language, allowing developers to build applications without worrying
about the complexities of the platform (Phonologies, 2007).

Moreover, development of voice applications using VoiceXML is relatively cheap
because it does not require the infrastructure setup that traditional IVR applications do.
VoiceXML applications are typically hosted with a hosting service such as Voxeo or
BeVocal, which offer their services at relatively low cost (Penumaka, 2006).

2.3 Survey of Existing Voice-driven Applications Using VoiceXML
2.3.1 Voice Access Booking System (VABS)
Age Concern, an organization that cares for older people, has established a network of
Age Resource Desks to help older people benefit from Information and Communication
Technologies (ICT). Older people often have little or no knowledge of computing and
usually have age-associated impairments such as memory and sight loss, making it hard
for them to use desktop computers.

The Voice Access Booking System (VABS) was built for Age Concern
Oxfordshire and is based upon a Web-accessible database which holds the bookings for
IT taster sessions at the Age Resource Desk. The system allows older adults to book a
taster session with a reminder call, book a taster session without a reminder call, cancel a
taster session, and notify the database if they are going to be late. The system can also be
used to create reminder calls telling older adults that they have an IT taster session on
that day. The computer's utterances in the dialogue are produced using prerecorded
natural speech (Zajicek, 2002).

Zajicek concluded that the power of the VoiceXML system lies in its ability to set up
reminders for older adults. The system can telephone with a message reminding them to
take some medicine or keep an appointment. It can also allow a remote caregiver to
populate the database with reminders, which will prompt a telephone call at prearranged
times.

2.3.2 Voice Inventory Management System (VIMS)


The Voice Inventory Management System (VIMS) prototype was developed by
the NRC IIT eBusiness Human Web group. VIMS was designed to allow users to access
inventory information through the use of speech. It permits a mobile worker to easily
retrieve product and warehouse information out of the product database in real time using
a mobile or regular phone through a natural speech dialog. VIMS provides a quick and
easy way for managers and salespeople “in the field” to stay up-to-date on their inventory
(Rebolj et al., 2004).

The system provides an authentication menu to allow only authorized access to the
system. Users must say their name and a personal identification number to enter the
system. The PIN could be spoken or entered using a touchtone keypad. The VIMS
application keeps track of a series of products and warehouses in a database. Each of these
products and warehouses has a number of attributes. Each product has a price, product
number, and description, and is associated with the warehouses in which it is located.
Meanwhile, each warehouse has an address and a listing of its contents. The system also
keeps track of product types – represented by a tree that links particular types of products
together.

The user can browse through a hierarchical menu of product types if the name of
the product is not known. The user can also ask directly about a particular product or
warehouse and request information regarding the product or warehouse. Furthermore, the
user can browse an alphabetical listing of all products and warehouses stored in the
system. As searching through an alphabetical list of dozens of items can be time
consuming, the user also has the ability to skip to items beginning with a certain letter.

All products in the warehouse database are entered into the VIMS speech
recognition grammar so that the grammar is updated dynamically and concurrently with
the information on current products and warehouses in the database.

The control flow diagram for the VIMS system is presented in Figure 1.

Figure 1. Voice Inventory Management System Control Flow Diagram
2.3.3 Speech-Driven Automatic Receptionist
The Speech-Driven Automatic Receptionist is a voice application that was designed to be
used by smaller Swedish companies (Matzon, 2005). It answers calls coming into the
company and directs the calls to an employee based on speech input from the user. It also
has error handling capabilities for unrecognized names, busy signals, and unanswered
phone calls. If the requested person can be reached at several numbers, the application asks
which number it should connect to (mobile, home, or work). Once the system knows the
correct number, it connects the call.
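Matzon's thesis does not reproduce the receptionist's source, but call connection of this kind is typically done with VoiceXML's `<transfer>` element. In the sketch below, the destination number and prompt wording are purely illustrative.

```xml
<form id="connect_call">
  <!-- Bridged transfer: the platform dials dest and joins the two call legs;
       the number would normally be looked up in the company database -->
  <transfer name="result" dest="tel:+4685550100" bridge="true">
    <prompt>Connecting you now. Please hold.</prompt>
    <filled>
      <!-- The transfer outcome is reported in the field variable -->
      <if cond="result == 'busy'">
        <prompt>The line is busy. Please try again later.</prompt>
      <elseif cond="result == 'noanswer'"/>
        <prompt>There was no answer at that number.</prompt>
      </if>
    </filled>
  </transfer>
</form>
```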

It has a database that stores the data needed to make it more dynamic and to log call
statistics. It was programmed using VoiceXML and Coldfusion. A corresponding website
was developed to administrate the telephony application and allow companies to
customize the application as well as view statistics about their usage of the application.

It has the following capabilities:
• answers calls,
• recognizes speech input and commands,
• gives contact numbers of the employees in a company

Figure 3. Stages of Development of Receptionist

Matzon concluded that the receptionist was designed and implemented for small
companies where calls are easily directed and transferred to one of the employees.
She suggested that error handling should be improved in future versions of the system.
To do this, if the caller's first attempt is unsuccessful, the system should give a list of
people who work at the company with their corresponding positions. In this way, the
caller would immediately know if the person he is trying to call is working at the
company. This would work well for small companies but could be too tedious and time
consuming for larger companies.

2.4 Usability Issues of VoiceXML
2.4.1 Speech User Interface (SUI) Design
Speech User Interface (SUI), also known as Voice User Interface (VUI), is a type of
computer interface consisting of spoken text and other audible sounds. It enables users to
access information and perform transactions using spoken commands (IBM, 2005).

Human Communication
According to Matzon (2005), a designer of speech applications must consider all the
unwritten rules that exist in human conversation so that the people will find speaking to a
machine as natural as speaking to a human.

Grice (1975) has written about four well-known principles that should be
followed in order to have a satisfactory conversation. These four principles are:
• Quality – this means that a person must always be sincere in a conversation.
• Quantity – this means that a person should say neither too little nor too much. If a
person does not say enough, it could lead to confusion; the same could happen if
they say too much.

• Relevance – this means that what a person says at any point must always be
relevant in the conversation. The listeners would be confused if the speaker starts
to say something that is unrelated to the current subject of the conversation.

• Manner – this means that ambiguity must be avoided. Speakers must be clear and
direct to the point otherwise it can lead to confusion.

All of these principles must be followed in designing the dialogues so that the user will
be comfortable with the conversation. According to Norrby (1996), aside from these
principles, the conversation structure is also important to follow. Conversations between
humans are structured in turn construction units (TCU). Each speech act by a speech partner is
considered a TCU and these TCUs are surrounded by turn relevance places (TRPs). For
instance, if a person directs a question to another person, that is considered a TCU. When the
other person answers the question, that is considered another TCU. The time in between the
question and the answer is a TRP. TRPs are extremely important because they signal when
another party can take a turn in speaking. TRPs can be signaled by a longer pause, the
intonation at the end of a TCU, and other signals that humans perceive automatically. It is vital
for the users of the speech application to understand whether or not a pause is a TRP; otherwise,
a conversation can be frustrating for the user.

Importance of SUI Design


There is a credo among speech interface designers: “A good GUI and a good SUI are
both a pleasure to use, a bad GUI is hard to use, but a bad SUI isn’t used at all.”

Effective SUI design draws upon many disciplines. The key scientific disciplines are
Psychology, Human-Computer Interaction, Human Factors, Linguistics, and
Communication Theory, while the artistic disciplines of Auditory Design and Writing
(especially the techniques of writing dialog) are also very important. More importantly,
for true craftsmanship, there is no substitute for experience and the codification of best
practices (IBM, 2005).

End users call the speech application for the purpose of obtaining a service from
the service provider. Users want SUIs that are easy to use, allow efficient task completion,
and provide a pleasant user experience. Hence, developers must write the code that creates
the entire speech application, including the SUI. The primary objectives for developers
creating SUIs are that the interface be technologically feasible, capable of completion
given resource constraints, and require minimal effort (IBM, 2005).

In their influential book on SUIs, Balentine and Morgan stated that the main enemy of a
spoken user interface is time. Speech has a temporary existence, and listeners must
remember what they have heard. However, if prompts are too short, they can be subject to
multiple interpretations. Therefore, developers must avoid letting users hear more (or less)
than they need to hear, or say more (or less) than they need to say.

Usability Study on Speech User Interface (SUI) Design
The study conducted by Zajicek et al. (2002), entitled Towards VoiceXML
Dialogue Design for Older Adults, focused on the usability of VoiceXML dialogues for
older adults and the challenges of embedding context-sensitive help and instructions in
dialogues.

The researchers used the Voice Access Booking System (VABS) and found that
messages should be kept short so that older users do not forget what they have heard.
With shorter messages, users are questioned as they go through the dialogue rather than
being presented with a list of options, as in a hierarchical menu-driven system.

Also, the dialogue designer must provide context sensitive help and instructions to guide
the older adult in using the system. Confirmatory sentences are also important in that they
can help the user feel more in control of their interaction with the system. By providing
positive reinforcement, users are reassured that they are doing the right thing.

Other Design Issues


Knowing the User
It is important to know the audience before building applications. For example, if the
majority of the target audience is aged fifty and above, it would be wise to keep menus
concise, as memory normally becomes poorer with age. A developer who knows the users
will be able to predict potential problems. Otherwise, developers may create a general
interface that does not accommodate the majority of users (Touesnard, 2001).

Error Recovery
Users commonly make mistakes when interacting with computerized systems, and they
often become frustrated when they commit these errors. Hence, it is extremely important
to have a comprehensive recovery system that allows users to go back to where they made
a mistake and continue with minimal trouble.
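In VoiceXML, this kind of recovery is expressed declaratively with event handlers such as <noinput> and <nomatch>. The fragment below is a minimal sketch; the menu choices and the fallback document name are hypothetical, not taken from any system described here.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="main_menu">
    <field name="choice">
      <prompt>Say tutorial, exercises, or scores.</prompt>
      <grammar type="application/srgs+xml" root="menu" version="1.0" mode="voice">
        <rule id="menu">
          <one-of>
            <item>tutorial</item>
            <item>exercises</item>
            <item>scores</item>
          </one-of>
        </rule>
      </grammar>
      <!-- Recovery: explain what went wrong, then repeat the original prompt -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
      <!-- After three failed attempts, fall back to a simpler touch-tone menu
           (dtmf_menu.vxml is a hypothetical document) -->
      <nomatch count="3">
        <prompt>Let us try touch-tone input instead.</prompt>
        <goto next="dtmf_menu.vxml"/>
      </nomatch>
    </field>
  </form>
</vxml>
```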

2.4.2 Scalability Issues


Scalability issues must be foreseen by a system developer to prevent future problems.

Study on the Scalability of VoiceXML using the Voice Inventory Management System
(VIMS)

Database Growth
One aspect of scalability that must be investigated is database growth. To explore the
scalability of VoiceXML, Touesnard (2001) added more than a thousand computer
products to VIMS’ database of only sixteen products. He predicted that the system would
not perform well with the larger database. The system was tested by having the user say a
product name, after which the system would return the product information. The results
showed that the increased size of the database had little effect on performance.

Phrase Size
Another aspect of scalability investigated in the study is phrase size. Product names that
were added to the database were very long and used many abbreviations, making the
phrases difficult for the Text-to-Speech (TTS) engine to pronounce. The investigation
involved targeting the actual products individually and was based on the assumption that
as product phrases increase in length, recognition confidence decreases. To test this
assumption, long product phrases were gradually decreased in size and tested after each
decrease. The results showed that the assumption was correct: shorter product phrases
were recognized more confidently than longer ones. Furthermore, it was found that it is
good practice to keep phrases shorter than six words to yield better recognition and to
keep the dialog natural for the user to pronounce.

Grammar Growth
The Grammar Specification Language (GSL) grammar format has a limited set of
acceptable characters that can be used in its phrases. The characters include lowercase
letters, digits, hyphen, underscore, single quote, at sign, and period. The GSL grammar
parser will throw an error when presented with a character outside the set.

According to Touesnard, in terms of scalability, grammar can become a major
problem over time as the database grows. Attached grammars must be compiled every
time a VoiceXML page is accessed. The compile time is directly related to the size and
complexity of the grammar.

In VIMS, all the products were written in a single grammar file; hence, compile
time of the grammar would increase with the growth of the database. This scalability issue
can grow to completely disable the application. Once the grammar becomes large enough,
one of two things will occur: the compile-time delay will take longer than the user is
willing to wait, or the VoiceXML browser will timeout waiting for the grammar and the
application will exit.

To prevent this scalability problem, one solution formulated by Touesnard is to segregate
the product names into several portions and store the portions in different fields in the
database. The segregation of data can enhance the flexibility of the system architecture,
thus enabling a decrease in grammar sizes.

2.5 Speech technology in Computer-Aided Language Learning: Strengths and Limitations of a New CALL Paradigm
This is a paper by Ehsani and Knodt (1998) that studies the suitability of deploying
speech technology in computer-based systems that can be used to teach foreign language
skills. With the advancement of multimedia technology, computer-aided language
learning (CALL) has emerged as an alluring alternative to traditional modes of
supplementing or replacing direct student-teacher interaction, such as the language
laboratory and audio-tape-based self-study.

One of the most successful applications of speech recognition and processing technology
is in the area of pronunciation training. Voice interactive pronunciation tutors prompt
students to repeat spoken words and phrases or to read aloud sentences in the target
language to practice the sounds and intonation of the language.

According to this study, the key to teaching pronunciation successfully is corrective
feedback. Automatic pronunciation scoring has been used to evaluate spoken learner
productions in terms of fluency, segmental quality (phonemes), and suprasegmental
features (intonation).

2.6 E-learning and VoiceXML
According to Regruto (2003), the possibility of dialogue with Web applications and services
will help those who are not readily conversant or familiar with computers to use high-tech
networks such as the Internet. VoiceXML allows the developer to provide services that are
accessible by means of the “natural” interface of the voice and without the use of peripherals
such as mouse, keyboard, monitor, or other interfaces.

In a presentation by Ross (2002), he stated that online asynchronous learning has always
used text as the primary medium for both content delivery and discussion. Until recently,
most online courses were void of voice. However, he said that there are particular subjects
for which voice is essential, such as Language (English as a Second Language or
ESL), Public Speaking, Poetry, Acting, Singing, and Literacy (in children and adults). He
discussed that some instructional materials are better presented with audio; when learners
need to listen to content repeatedly, learners need to pronounce repeatedly, or learners
need to converse with others.

This is also supported by a paper presented by Reusch (2004), where he said that voice-
enabled applications can support e-learning in many ways. Reusch also stated that
developers of speech applications are no longer restricted to pre-recorded audio; they can
already bring any text to the ear of the user – a user who could be visually impaired and
needs a voice channel to communicate – or a user who can read but prefers to listen. And
the best way to write a voice-enabled application is through VoiceXML, wherein
applications can be accessed not only through computers at home or in computer labs, but
also through the telephone. This is even made better by the flat rates offered by
telecommunications providers. People can sit anywhere with their phones and listen to
e-learning applications. Indeed, voice-enabled applications add new channels for everyone.
Chapter 3

METHODOLOGY

3.1 Operational Framework

Figure 4. Speech Tutor LAN Architecture

A typical VoiceXML system contains four main components: the telephone network, the
VoiceXML gateway, an application server, and the TCP/IP network. The telephone
network may be accessed either through the Public Switched Telephone Network (PSTN)
or through a Voice over IP (VoIP) packet network. The VoiceXML gateway consists of the
VoiceXML interpreter, media resources such as Speech Recognition, Text-to-Speech, and
Audio Playback, and telephony resources such as Dual Tone Multi Frequency (DTMF)
and Call Control (Phonologies, 2007).

In this application, the developer used Voxeo Prophecy as the VoiceXML
gateway. It was installed in the application server, along with the VoiceXML, HTML,
GSL, audio, and PHP files. Voxeo’s SIP Softphone was used in testing but in the
deployment, SJphone will be used as the VOIP softphone and it will be installed in all
client machines. Users will use this softphone to call the application by dialing
sip:tutor@ip-address. A MySQL database was also used to store student and exercise
information, as well as the scores obtained by the students and their recording in each
exercise. Upon deployment, all of these components will be connected through the
university’s local area network. And the application’s simpler version (no scripts) will
also be uploaded to Voxeo Evolution to increase its availability and accessibility.

3.2 Procedure in Developing the Speech-Driven Online Learning Application
3.2.1 Designing the content of the application
The content of the proposed system was determined through interviews with the
instructors of the subject, review of the course syllabus and course handouts. The syllabus
of Oral Communication course was studied and the topics to be included in the system
were selected. Also, the instructional materials that are currently used by the instructors
were collected to obtain the necessary information on specific topics.

3.2.2 Speech User Interface


Since the application to be developed is speech-enabled, it is important to have
well-designed dialogues so that there will be a smooth flow in the “conversation” between
the user and the virtual speech partner (speech-enabled online learning application). If the
dialogues are well-designed, the user will be able to navigate the application by giving
voice commands without getting lost, confused or frustrated. The four principles of
communication developed by Grice (1975) must be followed in creating the dialogues to
have a satisfactory communication. These four maxims are: quality, quantity, relevance,
and manner.

The SUI design methodology outlined in IBM’s Programmer’s Guide (2005) was
followed. It involved an iterative four-phase process, which includes: design phase,
prototype phase, test phase, and the refinement phase.

3.2.2.1 Design Phase


1. Users of the system were analyzed – any user characteristics and requirements
that might influence application design (usage frequency, motivation,
environment type, connection type, language familiarity, age of users) were
identified

2. User tasks were analyzed – the user tasks that will be supported by the system were
determined

3. The conceptual design (vision clips) was developed – samples of conversations
to associate user tasks with user interface expectations were created. According
to IBM’s Programming Guide (2005), designers must be very familiar with the
capabilities of speech technologies to avoid preparing vision clips that would be
difficult or impossible to deploy as applications.

4. High-level decisions were made – the appropriate user interface, barge-in style,
prompt style, and help style were selected

5. Low-level decisions were made – a consistent “sound and feel” was adopted,
consistent timing was used, introductions were created, menus and prompts
were constructed, grammars were designed and created, error recovery and
confirmation of user input were constructed

6. The complete callflow was designed – Typical responses, unusual responses,


and any error conditions that might occur were identified

7. Initial dialog script between the application and the user was created – all the
text that will be spoken by the application were included, as well as expected
user responses

8. A plan for expert users was designed – Parts where expert users can cut through
were identified to allow them to perform tasks quickly

3.2.2.2 Prototype Phase


A prototype script was written on paper so that two people could test the dialogues. One
person played the role of the user while the other played the role of the computer system
(the wizard). This technique is known as “Wizard of Oz” testing.

The script included the proposed introductions, prompts, list of global commands, and all
self-revealing help. The two participants were physically separated to prevent them from
communicating through visual cues. Moreover, it was ensured that the wizard was
familiar with the script while the user did not see the script.

The “Wizard of Oz” technique helped the developer fix problems in the script and task
flow before coding, reducing the coding time for the application. However, this technique
could not detect other usability problems such as speech recognition and audio quality
problems. Hence, a working prototype of the system was built.
1. Grammar Creation and Voice Recording
After designing the content and the dialogues, the grammar files and the voice
recordings were created. A grammar is an enumeration of the set of utterances
– words and phrases – that constitute the acceptable user response to a given
prompt. Grammars may be inline (embedded within the VoiceXML document)
or placed in external files. The application used the Nuance Grammar
Specification Language (GSL) for creating the grammar files.
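For illustration, a field that loads its vocabulary from an external grammar file might look like the sketch below; the filename, MIME type, and prompt wording are assumptions, not the application's actual code.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="sound">
      <prompt>Which sound would you like to practice?</prompt>
      <!-- External grammar file (hypothetical name); the GSL file enumerates
           the words and phrases the recognizer will accept at this prompt -->
      <grammar src="sounds.grammar" type="application/x-nuance-gsl"/>
      <!-- An inline grammar could be embedded here instead of an external reference -->
    </field>
  </form>
</vxml>
```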

On the other hand, voice recording was done for some parts of the system
to create a more personal interaction; the students would feel that they are
having a one-on-one tutorial with their instructor. Also, the human voice is still
considered more pleasing to human listeners than computer-generated
voice. Recording was done using the necessary hardware and software tools,
such as a headset with microphone and a sound recorder application
(e.g., Windows Sound Recorder).


Prerecorded audio files were of the following formats:
• an 8KHz 8-bit mu-law .au file
• an 8KHz 8-bit mu-law .wav file
• an 8KHz 8-bit a-law .au file
• an 8KHz 8-bit a-law .wav file
• an 8KHz 16-bit linear .au file
• an 8KHz 16-bit linear .wav file
2. Creation of VoiceXML documents
While the grammar files and voice recordings were created, the VoiceXML
documents were also constructed. The introductions, menus, prompts, and
outputs that were created in the design phase were incorporated in the
VoiceXML documents. The XML documents were created using the
Macromedia Dreamweaver editor.

3. Database Design and Development


A simple database was created using the MySQL database
management system. It was used to store student, faculty, and exercise
information. It was also used to authenticate users and save their scores and
recordings in the exercises.

4. Web Site Construction


A corresponding web site was also developed for the speech
application. It provided a graphical user interface for the tutorials. It also
displayed the scores and recordings made by the students. It was constructed
using HTML and PHP.

5. Scripting
The server-side scripting language used in this application was PHP. It was
used to allow the application to connect to the MySQL database. Client-side
programming was done using JavaScript. The scripts were inserted in the
VoiceXML code as necessary.

6. Uploading the files to a Hosting Service


The VoiceXML and grammar files were also uploaded to a free Web hosting
service for VoiceXML. The hosting service then mapped a phone number to
the application, which was dialed by the users through a telephone or a
softphone (such as Skype). Voxeo does not allow developers to upload scripts
into their hosting system. Hence, the application that was uploaded to Voxeo is
the simpler version of the application that is void of scripts and database
connectivity.

3.2.2.3 Test Phase


This is the phase where speech recognition problems were identified. Also, special
attention was given to words that were consistently misrecognized. Potential user interface
breakdowns were also identified in this phase. Some factors that were taken into
consideration are: percentage of users who did not successfully complete the test
scenarios, points in the application where users experienced the most difficulty,
effectiveness of error recovery mechanisms, and time required to complete typical
transactions.
The SIP Phone, which is included in the Voxeo Prophecy platform, was used to
call and test the application. A call simulation was done by the developer to detect and
correct bugs in the speech application.

3.2.2.4 Refinement Phase


In this phase, the user interface was updated based on the results of the testing phase.
Scripts were revised; prompts, introductions, menus, and error-recovery mechanisms were
also added or modified.

3.3 Procedure in Usability Testing

Usability testing is a critical part of testing VoiceXML applications. This is done to detect
potential problems that were not anticipated during the design process. Usability testing
continues even after deployment since the data generated by users are important in
understanding user interactions. It is an ongoing cycle that captures user data to
continually measure and improve the user experience (Michael and Bhagayan, 2001).

According to Bryan Michael and Mukund Bhagayan, some of the metrics that can
measure how usable an application is are: task completion time, user experience in
completing a task, and out-of-grammar utterances.

Task completion time is just the amount of time it takes for a user to complete a task. On
the other hand, the second metric determines if the users were confused with certain
prompts, and the last metric determines the words that are uttered by the user that are not
included in the grammar.

Usability testing for this application was done by gathering a sample population of the
target users of the system. Each of them was asked to answer a survey questionnaire after
testing the system. The survey questionnaire named Speech Tutor Rubric (See Appendix
A) contained the following criteria: voice projection, pronunciation, pace, transitions,
appropriateness, organization, responsiveness, convenience, acceptance, and relevance.
Each of the criteria was rated from 1 to 4, corresponding to unacceptable, needs work,
competent, and excellent, respectively.

Chapter 4

TECHNOLOGY BACKGROUND

VoiceXML
VoiceXML or VXML is a W3C endorsed markup language that allows
developers to write advanced telephony applications with simplicity undreamed of until
recent years. As VXML is a tag-based markup language, its structure is very similar to
HTML in many ways, but instead of being a primarily visual medium, VoiceXML is an
auditory medium that allows the end user to navigate through his 'telephony page' by
using voice commands, rather than by clicking a button on a web page (VoiceXML
Forum, 2006).

VXML allows a user to interact with the Internet through voice-recognition
technology. VoiceXML voice recognition is more effective than dictation-style voice
recognition because the former uses a predefined set of grammars, which describe a list of
things the person can say at a given point. While voice recognition for dictation works
well on a computer with a great microphone, (and after tedious voice training), telephone
voice recognition for VoiceXML needs to work without training, on poor lines and noisy
cell phones, and for dozens to hundreds of callers with differing dialects and accents, all at
the same time (Voxeo, 2006).

Instead of a traditional browser that relies on a combination of HTML, keyboard,
and mouse, VXML relies on a voice browser and/or the telephone. Using VXML, the user
interacts with the voice browser by listening to audio output that is either pre-recorded or
computer-synthesized and submitting audio input through the user's natural speaking
voice or through a keypad, such as a telephone (Webopedia, 2006).
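A minimal VoiceXML document makes this concrete. The sketch below plays pre-recorded audio with a synthesized fallback; welcome.wav is an assumed filename.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <!-- Computer-synthesized (TTS) output -->
      <prompt>Welcome. This sentence is spoken to the caller.</prompt>
      <!-- Pre-recorded audio; the inner text is spoken via TTS
           if the audio file cannot be fetched -->
      <audio src="welcome.wav">Welcome to the application.</audio>
    </block>
  </form>
</vxml>
```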

As with HTML documents, VoiceXML documents have web URIs (Uniform
Resource Identifiers) and can be located on any web server. Yet a standard web browser
runs locally on the client’s machine, whereas the VoiceXML interpreter (or browser) runs
remotely, at the VoiceXML hosting site. One has to use a telephone or a free Internet call
service to access the VoiceXML browser (BeVocal Café, 2006).
A voice browser typically runs on a specialized voice gateway node that is
connected both to the Internet and to the public switched telephone network. The voice
gateway can support hundreds or thousands of simultaneous callers, and be accessed by
any one of the world's estimated 1,500,000,000 phones, from antique black candlestick
phones up to the very latest mobiles.

VoiceXML takes advantage of several trends: the growth of the World Wide Web
and its capabilities, improvements in computer-based speech recognition and
text-to-speech synthesis, and the spread of the WWW beyond the desktop computer
(VoiceXML Forum, 2006).

VoiceXML architecture

Figure 5. VoiceXML Architecture

There are four main components in a VoiceXML architecture: the telephone network, the
VoiceXML gateway, the application server, and the TCP/IP network.

Telephone Network
The users will need to access the VoiceXML application through the telephone network.
They may use either the Public Switched Telephone Network (PSTN) or a Voice over
Internet Protocol (VoIP) packet network. The PSTN includes the entire network of
telephones – from landlines to mobiles, and from analog to digital. A telephone network
can be thought of as a collection of wires strung between switching systems (Wikipedia, 2007).
Voice Over Internet Protocol (VoIP)
VoIP is simply the transmission of voice traffic over IP-based networks. VoIP services
convert the caller's voice into a digital signal that travels over the Internet. It has become
popular largely because of its cost advantages to consumers over traditional telephone
networks. If a person is calling a regular phone number, the signal is converted to a
regular telephone signal before it reaches the destination. VoIP can allow a person to make
a call directly from a computer, a special VoIP phone, or a traditional phone connected to
a special adapter.

A broadband (high-speed Internet) connection is required. Some VoIP services only work
over the computer or a special VoIP phone, while other services allow a person to use a
traditional phone connected to a VoIP adapter. If a computer will be used, software and a
microphone will be required. In addition, wireless “hotspots” in locations such as airports,
parks, and cafes allow a person to connect to the Internet and may enable the use of VoIP
wirelessly (Federal Communications Commission, 2007).

Figure 6. VoIP architecture

VoiceXML gateway and Web Server


The VoiceXML gateway consists of a VoiceXML interpreter integrated with media
resources such as Speech Recognition, Text-to-Speech and Audio Playback, and
telephony resources such as Dual Tone Multi-Frequency (DTMF) and Call Control
(Phonologies, 2007).

The VoiceXML interpreter is responsible for interpreting the VoiceXML code. It sends
parameter values to the web server as part of the request and it receives a VoiceXML
document as the response. The web server, on the other hand, receives requests and sends
responses back to the interpreter.

The other infrastructure components in the gateway are the telephony switch, voice
recognition software, and a speech synthesis engine. The gateway is responsible for
connecting to the Public Switched Telephone Network (PSTN), performing voice recognition,
playing audio files, and other supporting functions (Penumaka, 2006).

Grammar File
For the application to gather input from the caller, a field and a corresponding grammar
must be specified. The grammar helps define what the caller can say
to select or complete a given field in that form, much like predefined items within an
HTML drop-down form. The field element presents the user with a number of choices and
eventually returns a result based on the user input. Grammars are defined using a grammar
language: either the proprietary Nuance GSL format or the W3C SRGS (Speech
Recognition Grammar Specification) format.

A grammar file is a separate file from the VoiceXML document. According to
IBM (2005), speech recognition systems provide computers with the ability to recognize
what a user says through the use of grammars. Grammars are used to identify the words
and phrases that can be spoken by the user. They formally define the set of allowable
phrases that can be recognized by the speech engine. Command and control grammars
specify the valid utterances that a user can say to perform an action or supply information.
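The drop-down analogy can be made concrete with VoiceXML's <option> element, which builds a field's grammar from enumerated items much as <option> elements populate an HTML <select>. The department names in this sketch are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="department">
      <prompt>Which department do you want: sales, support, or billing?</prompt>
      <!-- Each option is an acceptable utterance; together they act as the
           field's grammar, like items in an HTML drop-down list -->
      <option>sales</option>
      <option>support</option>
      <option>billing</option>
      <filled>
        <prompt>Connecting you to <value expr="department"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```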
Automatic Speech Recognition
An Automatic Speech Recognition (ASR) system not only captures spoken words but
also distinguishes word groupings to form sentences. It contains a number of IT
components that work together: an input device such as a microphone, software to
distinguish words, and databases containing words to which spoken words are matched.

To distinguish words and sentences and match them to those in a database, an ASR
system follows three steps:

Step 1: Feature Analysis. In this step, words are captured as the user speaks into the
microphone, background noise is eliminated, and digital signals of the speech are
converted into phonemes. A phoneme is the smallest unit of speech, something
that most people equate with syllables. This is what a person usually sees when he
looks up the pronunciation of the word in a dictionary.

Step 2: Pattern Classification. The ASR system attempts to recognize the spoken
phonemes by locating a matching phoneme sequence among the words stored in
an acoustic model database. The acoustic model database is essentially the ASR
system’s vocabulary. If the ASR system finds possible matches for the spoken
word, it sends all these possibilities to language processing.

Step 3: Language Processing. The ASR system attempts to make sense of what a person
is saying by comparing the possible word phonemes generated in pattern
classification with a language model database. The language model database
includes grammatical rules, task-specific words, phrases, and sentences that are
frequently used. Once a match is found, the words are stored in digital form. This
is the most complicated step since the ASR system attempts to determine the exact
words that are spoken. The system must perform a number of tasks, including
evaluating the inflection of the speaker’s voice.

Many people believe that ASR systems will become standard technology on home computers
within the next few years (Haag et al., 2002).

Text to Speech Synthesis


The text-to-speech engine is the component that synthesizes speech output by:

• breaking down the words of the text into phonemes;
• analyzing the input for occurrences of text that require conversion to symbols, such as
numbers, currency amounts, and punctuation (a process known as text normalization, TN);
and
• generating the digital audio for playback.
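Text normalization can be sketched with two toy rules. The rules and word list below are hypothetical, not those of any particular TTS engine; real normalizers handle dates, abbreviations, ordinals, and much more.

```python
import re

# Hypothetical digit-to-word table for a minimal text normalization (TN) pass.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_number(n: int) -> str:
    # Spell a number digit by digit, e.g. 5 -> "five".
    return " ".join(ONES[int(d)] for d in str(n))

def normalize(text: str) -> str:
    # Currency amounts first: "$5" -> "five dollars".
    text = re.sub(r"\$(\d+)",
                  lambda m: spell_number(int(m.group(1))) + " dollars", text)
    # Then any remaining bare digits: "1" -> "one".
    text = re.sub(r"\d+", lambda m: spell_number(int(m.group(0))), text)
    return text

print(normalize("Press 1 to pay $5"))   # -> Press one to pay five dollars
```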

TTS engines in general can use one of two techniques:


• Formant TTS
• Concatenative TTS
Using linguistic rules and models, a formant TTS engine generates artificial
sounds similar to those created by human vocal cords and applies various filters to
simulate throat length, mouth cavity shape, lip shape, and tongue position. Although
formant TTS engines can produce highly intelligible speech output, the output still has a
"computer accent".

A concatenative TTS engine also uses linguistic rules to formulate output, but
instead of generating artificial sounds, the engine produces output by concatenating
recordings of units of real human speech. These units are usually phonemes or syllables
that have been extracted from larger units of recorded speech, but may also include words
or phrases.
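The concatenative technique amounts to joining prerecorded units end to end. In this sketch the "recordings" are short hypothetical sample arrays keyed by phoneme; a real engine would also smooth the joins between units.

```python
# Hypothetical unit inventory: phoneme -> recorded audio samples.
unit_inventory = {
    "hh": [0.1, 0.2],
    "ay": [0.3, 0.4, 0.5],
}

def synthesize(phonemes):
    """Produce output by concatenating recordings of real-speech units."""
    audio = []
    for p in phonemes:
        audio.extend(unit_inventory[p])
    return audio

print(synthesize(["hh", "ay"]))   # -> [0.1, 0.2, 0.3, 0.4, 0.5]
```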

32
The Speech Platform TTS engine utilizes the concatenative technique. Although
the speech output produced by the concatenative technique sounds more natural than the
speech output produced by the formant technique, it still tends to sound less human than
an individual, continuous recording of the same speech output produced by a human
speaker. Nevertheless, text-to-speech synthesis can be the better alternative in situations
where preparing individual audio recordings of every individual prompt required for an
application is inadequate or impractical.

Generally, developers consider using TTS when:

• Audio recordings are too large to store on disk, or are prohibitively expensive to record.

• The developer cannot predict what responses users will require from the application (such
as requests to read e-mail over the telephone).

• The number of alternate responses required makes recording and storing prompts
unmanageable or prohibitively expensive.

• The user prefers or requires audible feedback or notification from the application. For
example, a user may prefer to use TTS to perform audible proofreading of text and
numbers in order to catch typographical errors missed by visual proofreading.

Voxeo Prophecy
Voxeo describes Prophecy as the world’s first telephony application platform software that
is accessible, simple, 100% standards-based, and inexpensive. It is openly available,
requires no additional configuration, and comes with a speech recognition engine and a
speech synthesizer.
It includes everything needed to create and deploy IVR or VOIP applications, including
VoiceXML and CCXML browsers, a built-in SIP soft-phone, and support for hundreds of
SIP providers and devices. It also works with any web development language or server
including ASP, CGI, C#, Java, Perl, PHP, Python, and Ruby.

33
Voxeo Evolution

Evolution is Voxeo’s Developer and Customer Portal. The site provides a free
development platform, resources, and technical support for Interactive Voice Response
(IVR) and voice recognition applications, integrated with modern XML and web
development solutions.

Chapter 5

RESULTS AND DISCUSSION

The speech application named Speech Tutor contains tutorials and exercises on the
different vowel sounds. It allows the user to select the specific vowel sound that he wishes
to study, and choose whether he wants to listen to the tutorials on the correct
pronunciation, take the exercises, or create a recording. It also has an authentication
mechanism to allow only registered users to gain access to the system.

Figure 7. Voxeo Softphone

The system was developed and tested using Prophecy, Voxeo’s development kit
for VoiceXML, while the VoiceXML files and grammars were written using the Macromedia
Dreamweaver editor. Calls to the application were made through Voxeo’s SIP softphone.
MySQL was used to create and update the database of the application, while PHP was
used for server-side scripting.

34
Figure 8. MySQL Command Line Client
To make the application available on the web, a simplified version of the system
was uploaded to Evolution, Voxeo’s Developer and Customer Portal. This version has no
server-side scripts, since Voxeo does not allow developers to upload scripts to Evolution.
It was accessed by dialing its assigned number in Skype.

Figure 9. Skype Interface

The developer used an iterative development process in which future users were
closely involved. They tested the system and gave valuable suggestions. This is
essential since they are the ones who will use the system once it is deployed.

35
During the initial design phase of the system, the speech application used voice for
navigation. However, the developer observed that this design was not appropriate, since
the users were not yet familiar with speech applications: they would often say things
that were not included in the grammars. The application would also pick up background
noise, since it was sometimes used in a noisy environment. These problems resulted in
misrecognition and irritated most of the users. Hence, the developer decided to use
DTMF for navigation; it is far more convenient to use and significantly lessened
misrecognition.

Furthermore, due to the users’ unfamiliarity with speech applications, they often got lost
and did not know what to do or say next, especially when misrecognition or errors
occurred. This led to the use of self-revealing help, a design strategy wherein the system
responds with a new prompt every time an error occurs. It assists the users by
informing them of their options and suggesting what to do next.

In order to test the application’s usability and effectiveness, it was evaluated by
some of the faculty of the English Department in Mindanao State University – General
Santos City who teach Oral Communication, and by a number of students who were
enrolled in the said subject. The application was evaluated based on the following
criteria: voice projection, pronunciation, pace, transitions, appropriateness, organization,
responsiveness, convenience, acceptance, and relevance.

Table 2 shows the results of the system evaluation based on the Speech Tutor
Rubrics. According to the results, the system’s voice projection is clear and audible,
words are pronounced correctly, and pacing is mostly effective throughout the
application.

However, the instructors complained about transitions in the system. They stated
that it was hard for them to tell when it was their turn to speak because there was no
clear signal as to when to respond. This problem is known as spoke-too-soon and
spoke-way-too-soon incidents. Based on a paper released by IBM, spoke-too-soon and
spoke-way-too-soon incidents often occur when the developer has disabled barge-in.

Table 2. Results of System Evaluation Based on Speech Tutor Rubrics

36
Criteria               Average Rating
                       English Instructors    Students
Voice Projection              3.50              3.33
Pronunciation                 4.00              3.60
Pace                          2.75              3.07
Transitions                   2.75              3.40
Appropriateness               4.00              3.67
Organization                  4.00              3.27
Responsiveness                4.00              3.47
Convenience                   4.00              3.40
Acceptance                    4.00              3.20
Relevance                     4.00              3.53

Spoke-too-soon happens when the user starts speaking before the system is ready for
recognition. If the user continues speaking over the tone and into the recognition
timeframe, a portion of the utterance will not be included in the recognition, and the
input that the speech recognition engine receives will not match anything in the active
grammar, resulting in a no-match event. On the other hand, spoke-way-too-soon
occurs when the user finishes speaking before the application is ready for speech
recognition. The entire utterance falls outside of the recognition timeframe, so the
recognition engine receives no input at all, thus triggering a no-input event.
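The two incident types can be pictured as a simple timing check. This is a simplified sketch over hypothetical timestamps; real engines operate on audio streams rather than explicit start and end times.

```python
def classify(utt_start, utt_end, window_start):
    """Classify an utterance against the recognition window opening time."""
    if utt_start >= window_start:
        return "recognized"   # whole utterance falls inside the window
    if utt_end <= window_start:
        return "no input"     # spoke-way-too-soon: nothing reaches the engine
    return "no match"         # spoke-too-soon: a clipped utterance won't match

print(classify(5.0, 6.0, 4.0))   # -> recognized
print(classify(1.0, 2.0, 4.0))   # -> no input  (spoke way too soon)
print(classify(3.0, 5.0, 4.0))   # -> no match  (spoke too soon)
```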

To address this problem, the developer inserted a beep at the end of each
prompt to inform users that it is their turn to speak and that the system is ready for
recognition. The introductory message also warns users to wait for the beep before
speaking.

The teachers also suggested that a graphical user interface be provided so that the
users will not be confused about what the application is saying. According to them,
it would guide the students and help them understand the lessons better. The developer
therefore created an accompanying web site for the speech application. It was developed
using HTML and PHP, and it uses the speech application’s database and an Apache
server.

37
Figure 10. Login Page
Users need to enter their ID and PIN, and choose whether they are a faculty member or a
student. Once authenticated, the appropriate home page is displayed. A student is
brought to the Student’s Home Page, where his scores and recordings are displayed; he
can also view the tutorials on each of the vowel sounds by clicking the links on his home
page. A faculty member, on the other hand, is brought to the Faculty’s Home Page. The
names of his students, arranged by section and last name, are displayed together with
links to their individual records. The faculty member can then easily monitor each
student’s performance by viewing his scores in the exercises and listening to his
recordings.

38
Figure 11. Student Home Page

39
Figure 12. Tutorial Page

Figure 13. Playing of Recording in Windows Media Player

40
Figure 14. Login Page of Faculty

Figure 15. Faculty Home Page - Student List

41
Figure 16. Student Record

Figure 17. Playing of Student Recording

42
Despite the few problems encountered, the evaluators expressed their
acceptance of the application, stating that it would be highly useful in making up for
the absence of a speech laboratory.

Based on the cost analysis shown below, the university will save a substantial
amount of money if it chooses to use the speech application instead of putting up a
speech laboratory.

Candidate 1: Estimated Costs for Building a Speech Laboratory

Building
Construction of new room P 500,000.00

Equipment
Cubicles P 70,000.00
Chairs 9,000.00
Headset 15,000.00
Audio Control Unit 100,000.00
TV Component 30,000.00
Reading Guide Books 1,000.00
Total Development Costs P 725,000.00

Candidate 2: Estimated Costs for Using the Speech Tutor

Personnel
1 Programmer/Analyst ( 60 hours, Php 200/hour) P 12,000.00
1 Trainer (4 hours, Php 200/hour) P 800.00

Hardware and Software

1 Server Already existing


30 Client Workstations (present in the Already existing
computer laboratory)
30 Headset (Php 300 each) 9,000.00
1 Voxeo Prophecy license FREE for download
1 Apache Web server ( open source, integrated with FREE for download

43
Prophecy)
1 PHP (open source, integrated with Prophecy) FREE for download
1 MySQL Database (open source) FREE for download
1 SJphone (VoIP softphone) FREE for download
Total Development Costs P 21,800.00
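The cost comparison above can be verified with a short calculation; the figures are taken directly from the two candidate tables.

```python
# Candidate 1: estimated costs for building a speech laboratory (Php).
speech_lab = {
    "Construction of new room": 500_000,
    "Cubicles":                  70_000,
    "Chairs":                     9_000,
    "Headset":                   15_000,
    "Audio Control Unit":       100_000,
    "TV Component":              30_000,
    "Reading Guide Books":        1_000,
}

# Candidate 2: estimated costs for using the Speech Tutor (Php);
# servers, workstations, and all software are already available or free.
speech_tutor = {
    "Programmer/Analyst (60 hours x Php 200)": 60 * 200,
    "Trainer (4 hours x Php 200)":              4 * 200,
    "30 Headsets (Php 300 each)":              30 * 300,
}

lab_total   = sum(speech_lab.values())
tutor_total = sum(speech_tutor.values())
print(lab_total, tutor_total, lab_total - tutor_total)   # 725000 21800 703200
```

The speech application thus saves the university an estimated Php 703,200.00 in development costs.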

The teachers strongly agree that the application will aid their students in studying
the vowel sounds, and they would use it as soon as possible. They also recommended
that the application be used in training call center agents. According to them, the
application would expose trainees to the pronunciation, diction, and intonation of a
native speaker of English, since the synthesized speech sounds like a native English
speaker and its pronunciation is based on the International Phonetic Alphabet (IPA).

The version that was uploaded to Evolution was called and tested using Skype.
Testing showed that the application’s performance is directly affected by the available
network bandwidth: the greater the bandwidth, the better the performance. When only a
small amount of bandwidth was available, the application produced inaudible sounds and
was unusable. The developer recommends that the university acquire a SIP gateway,
which converts SIP to PSTN and vice versa, if it wants to allow users outside the local
area network to call the application through their landlines or mobile phones.

44
Chapter 6

CONCLUSION AND RECOMMENDATIONS

6.1 Conclusion
The general objective of this study, which is to use VoiceXML in developing a
speech-driven online learning module for Oral Communication, was achieved by the
developer.

The speech application was designed to answer the university’s need for a
speech laboratory. With this application, students are able to listen to voice
tutorials, take the speech exercises, and record their voices. The instructors, on the
other hand, are assured that their students hear the correct pronunciation of words,
since the synthesized speech of the application is based on the International
Phonetic Alphabet (IPA). It also provides them with an effective and innovative tool for
improving their instruction.

A set of rubrics was used by selected instructors and students in rating the
effectiveness of the speech application as a learning tool for Oral Communication. Based
on the results of the evaluation, the application rates between competent and excellent,
with an average rating of 3.5. This indicates that the users accept and acknowledge it as
an effective and adequate learning resource for a speech class such as Oral
Communication.

Furthermore, the cost analysis clearly shows that this application is a cost-effective
alternative to acquiring a speech laboratory’s facilities. The university will only spend a
minimal amount for this application, since it may be installed on one of the university’s
existing servers and users can access it using the computers already present in the
computer laboratories. Also, all the software used in this application can be downloaded
for free (as of this writing).

Hence, the developer concludes that VoiceXML is indeed a good tool for
building speech-driven e-learning applications. It utilizes the power of the web,

45
automatic speech recognition, text-to-speech synthesis, and telephony infrastructure in
innovating people’s way of doing things.

6.2 Recommendations
This application focused only on the different English vowel sounds. Future
researchers could explore how they can utilize VoiceXML in other learning areas. They
could also use this technology in business to replace the traditional interactive voice
response (IVR) systems.

Moreover, future versions of the speech application should include topics on intonation,
stress, and the other English speech sounds. The companion web pages should also be
synchronized with the user’s navigation of the application: wherever the user is in the
speech interface, the graphical web browser should display the corresponding page.

In this version of the application, the questions are hard-coded, making it difficult
for the teachers to modify them. Future developers should provide an easy-to-use facility
that allows teachers to change the questions in the exercises dynamically. The teachers
would then simply input their questions in text fields, and the application would save
them in the database and incorporate them into the exercises.
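Such a facility reduces to inserting teacher-entered rows into the question table. The sketch below uses an in-memory SQLite database with hypothetical table and column names (the actual system uses MySQL and its own schema); the point is the parameterized insert, which avoids both hard-coding and SQL injection.

```python
import sqlite3

# Hypothetical schema standing in for the application's exercise storage.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE exercise_question (exer_id INTEGER, prompt TEXT, answer TEXT)"
)

def add_question(exer_id, prompt, answer):
    # Parameterized query: values from the teacher's text fields are bound,
    # never concatenated into the SQL string.
    db.execute(
        "INSERT INTO exercise_question VALUES (?, ?, ?)",
        (exer_id, prompt, answer),
    )
    db.commit()

add_question(1, "Repeat after me: heal.", "heal")
rows = db.execute(
    "SELECT prompt, answer FROM exercise_question WHERE exer_id = ?", (1,)
).fetchall()
print(rows)   # -> [('Repeat after me: heal.', 'heal')]
```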

Furthermore, the developer recommends that a SIP gateway or device be
installed in the university to allow users to access the application through the telephone.
In the current setup, the application can only be accessed by users within the local area
network, which limits its accessibility. The SIP gateway will be used to convert SIP to
PSTN and vice versa.
BIBLIOGRAPHY

Balentine, Morgan, and Meisel. (2001). How to Build a Speech Recognition Application:
A Style Guide for Telephony Dialogues (Second Edition). Enterprise Integration
Group, San Ramon, CA.

BeVocal Café. VoiceXML Tutorial. Retrieved on 20 April 2006 from cafe.bevocal.com

46
De La Salle University. English Language Laboratory. Retrieved on 27 April 2007 from
http://www.dlsu.edu.ph/academics/colleges/ced/ell.asp

Ehsani, F., and Knodt, E. (1998). Speech Technology in Computer-Aided Language
Learning: Strengths and Limitations of a New CALL Paradigm. 2(1), 54-73.
Retrieved on 16 May 2006 from http://llt.msu.edu/vol2num1/pdf/article3.pdf

Federal Communications Commission. What is VoIP? Retrieved on 27 April 2007 from
http://www.fcc.gov/voip/

Granberg, C. (2006). Testimonials for VoiceXML. CEO, PipeBeach. Retrieved on 25 April
2007 from www.w3c.org

Grant MacEwan College. Is online learning for me? Retrieved on 16 May 2006 from
http://stats.macewan.ca/learn/students/tutorial/mod1/forme.html

Grice, H. P. (1975). Logic and Conversation.

Haag, Cummings, and McCubbrey. (2002). Management Information Systems for the
Information Age. Emerging Technologies, pp. 234-237.

HorizonWimba EduVoice. Put Voice on Your Web Site or WebCT Course. Retrieved on
18 May 2006 from
www.horizonwimba.com/docs/Introduction_to_Wimba_Voice_Tools.pdf

IBM WebSphere. (2005). VoiceXML Programmer’s Guide. Retrieved on 27 April 2007 from
http://www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi

King, D. (2006). Testimonials for VoiceXML. Director of Architecture, Pervasive
Computing Division, IBM. Retrieved on 25 April 2007 from www.w3c.org

Mairesse, F. (2007). An Introduction to VoiceXML. Retrieved on 7 May 2007 from
http://www.dcs.shef.ac.uk/~francois

Matzon, Katarina. (2005). Speech-Driven Automatic Receptionist Using VoiceXML.
Department of Linguistics and Philology, Uppsala University, Sweden. Retrieved on
28 July 2006 from http://stp.ling-uu.se/exarb/arch/2005_matzon.pdf

Millbower, L. (2003). The Auditory Advantage. American Society for Training &
Development. Virginia, USA.

Norrby, Catrin. (1996). Samtalsanalys. Studentlitteratur.

47
Penumaka, S. (2006). VoiceXML: An Emerging Standard for Creating Voice Applications.
Retrieved on 26 April 2007 from www.sonify.org/tutorials/other/voicexml

Phonologies: The Voice of Technology. What is VoiceXML?: The Voice eXtensible
Markup Language. Retrieved on 25 April 2007 from
http://www.phonologies.com/voicexml.html

Rebolj, D., and Menzel, K. (2004). Voice and Multimodal Technology for the Mobile
Worker. Retrieved on 25 April 2007 from http://www.itcon.org/2004/24/

Regruto, L. (2003). VoiceXML – Surfing on the Internet Using Voice. Retrieved on 25 April
2007 from www.vxmlitalia.com/bb2003.pdf

Reusch, P. (2004). VoiceXML Applications for E-Commerce and E-Learning. Retrieved on
14 July 2006 from www.fh-dortmund.de/de/ftransfer/medien/reusch1.pdf

Ross, Keith. (2002). Eurecom Institute – Asynchronous Voice in E-Learning. Retrieved
from http://cis.poly.edu/~ross/papers/alnTalk.ppt

Touesnard, B. (2001). VoiceXML: Usability, Scalability, and the Future. Retrieved on 26
April 2007 from brad.touesnard.com/docs/wtr3-nrc.pdf

VoiceXML Forum. VoiceXML Tutorial. Retrieved on 20 June 2006 from www.voicexml.org

Voxeo. VoiceXML Tutorial. Retrieved on 25 July 2006 from www.voxeo.com

Wikipedia. Definitions. Retrieved in 2007 from www.wikipedia.com

Webopedia. VoiceXML. Retrieved on 25 June 2006 from www.webopedia.com

Zajicek, M. (2002). Software Design for Older Adults to Support Memory. Joint
Proceedings of HCI 2002 and IHM 2001, pp. 503-513.

Zhang, Mengjie. (2001). A Speech Recognition System for Improving English
Pronunciation of Mandarin Speakers. School of Mathematical and Computing
Sciences, Victoria University of Wellington, New Zealand.
http://www.mcs.vuw.ac.nz/comp/Publications/archive/CS-TR-01/CS-TR-01-14.pdf
Appendix A
USER’S MANUAL

48
Outline
1. Creating an account
2. Logging in to the system via the Graphical User Interface (GUI)
a. Student
b. Faculty
3. Navigating the GUI (Student Account)
a. Viewing the scores and list of recordings
b. Reading the tutorials
c. Playing a recording
4. Navigating the GUI (Faculty Account)
a. Viewing the class list
b. Viewing the individual record of a student
5. Logging in to the system via the softphone
6. Navigating the Speech User Interface (SUI)
a. Choosing type of vowel sound to study
b. Listening to a tutorial
c. Answering an exercise
d. Making a recording

1. Creating an account

49
You must fill in the registration form with the following conditions:
a. ID must be unique
b. Position must be either faculty or student
c. If you are a student, you must enter your section

If the ID is already taken, an error message will be displayed:

If the registration is successful, the following screen is displayed:

50
2. Logging in to the system via the GUI
To log in to the system, you must enter your ID number, PIN, and position.
a. Student

b. Faculty

51
If login fails, the following screen will be displayed:

If login is successful, the following screen will be displayed:


a. Student

52
b. Faculty

3. Navigating the GUI (Student Account)


a. Viewing the scores and list of recordings

53
The student’s scores in each of the exercises that he has taken will be
displayed. If he has not taken a specific exercise yet, -1 will be displayed.
The list of the recordings made by the student will be shown too. The
recording is named by combining the ID number of the student and the
number of the vowel sound for which the recording is made.
b. Reading the tutorials – just click on the vowel sound you wish to study

54
c. Playing a recording – click on the recording you wish to hear. The recording
will be played by the default audio player (e.g., Windows Media Player).

4. Navigating the GUI (Faculty Account)


a. Viewing the class list

55
The class lists of the faculty member will be shown, sorted by the
students’ sections. Clicking the ID number of a specific student brings up
his individual record.
b. Viewing the individual record of a student

56
The student’s scores and recordings will be displayed, which makes it easy for
the faculty member to monitor each of his students’ performance.

5. Logging into the system via the softphone (Speech User Interface)

In order to use the speech user interface, the student must use a softphone.

The student will enter the following on the Dial String textbox:
sip:tutor@127.0.0.1
sip: Session Initiation Protocol, a signaling protocol used for establishing sessions
in an IP network. A session could be a simple two-way telephone call or it
could be a collaborative multimedia conference session.
tutor: name of the application being called
127.0.0.1: IP address of the server where the voice application resides

After which, he should click the Dial button.
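The dial string follows the generic sip:user@host form explained above; splitting it into its parts can be sketched as:

```python
def parse_sip(uri: str):
    """Split a sip:user@host dial string into its application name and host."""
    scheme, _, rest = uri.partition(":")
    if scheme != "sip":
        raise ValueError("not a SIP URI")
    user, _, host = rest.partition("@")
    return user, host

app, server = parse_sip("sip:tutor@127.0.0.1")
print(app, server)   # -> tutor 127.0.0.1
```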

Speech Tutor will then answer the call.

Speech Tutor: Please key in your 4-digit student I D


< Student then keys in his ID >
Speech Tutor: Please key in your 4-digit personal identification number or PIN
< System will check if the ID and PIN are valid and if they match >

If the login is successful, the system will read the options of the main menu. However
if login fails, the student will be asked to re-login.

57
6. Navigating the Speech User Interface

Speech Tutor: Welcome! I am your Speech Tutor.


Remember, you can start keying in your answer only when you
hear this beep.<beep> You can press the zero key on your keypad
to go back to this dialog at any time.

a. Choosing type of vowel sound to study


This is the main menu. Choose type of vowel sound to study.
For the front vowel sounds, press 1.
For the back vowel sounds, press 2.
For the center vowel sounds, press 3.

< student presses 1 >

You have chosen to study the front vowel sounds.


You will choose among the five front vowel sounds. Refer to your
workbook for the listing of the front vowel sounds. Press any
number from 1 to 5.

< student presses 1 >

You are now in front vowel1 tutorial.


Press one if you want to hear the tutorials.
Press two if you want to take the exercise.
Press three if you want to make a recording.
Press nine if you want to go back to the previous dialog.

b. Listening to a tutorial
< if the student presses 1 >
Speech Tutor will read the Vowel 1 tutorial

c. Answering an exercise

< if the student presses 2 >


Speech Tutor will read some words with the vowel 1 sound
and the student must repeat after her. If the student pronounces the
word correctly, then he will get 1 point. However if he pronounces

58
it wrongly, he will get a 0 point. After answering all the questions,
his total score for the specific exercise will be saved in the
database.

d. Making a recording

< if the student presses 3 >


Speech Tutor: You have one minute to make a recording.
Start speaking after you hear the beep. Press any
DTMF key to end the recording.

< student will read a passage on the specific vowel


sound.>
Appendix B
RELEVANT CODES USED IN THE SYSTEM

a. User Authentication

$id = $_REQUEST["student_id"];
$pin = $_REQUEST["pin"];

$sql = "SELECT pin,fname,lname,sec_id FROM student WHERE stud_id = $id";


$result = mysql_query($sql,$connection) or die (mysql_error());

if(!mysql_num_rows($result)) {
    $res = 0;   // no student with this ID
} else {
    $query_data = mysql_fetch_row($result);
    $val = $query_data[0];

    if($val == $pin)
        $res = 1;   // ID and PIN match
    else
        $res = 2;   // wrong PIN

    // only read the row's fields when a row actually exists
    $firstname = $query_data[1];
    $lastname  = $query_data[2];
    $sec_id    = $query_data[3];
}

echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";


echo "\t".'<vxml version="2.1">
<form id="loginResult">
<var name="student_id" expr="'.$id.'"/>
<var name="sec_id" expr="'.$sec_id.'"/>'."\n";

if($res == 1)
{
echo '<block> <prompt> Hello '.$firstname.' '. $lastname.' <break/> </prompt>
<submit next="t-home.php" method="" namelist="student_id sec_id"/></block> ';

59
}
else if($res == 2)
{
echo '<block> <prompt> Wrong PIN! <break/> You entered '.$pin.' , as your pin.
<break /> </prompt> <goto next="login.php"/> </block> ';
}

else
{
echo '<block> <prompt> Student I D is not in use! <break/> Please enter a valid
I D and PIN.
<break /> </prompt> <goto next="login.php"/> </block> ';
}
echo '</form> </vxml>';

b. Customizing error messages

<nomatch count="1">
No Match! I did not understand you. Could you please try that again?
<reprompt />
</nomatch>
<nomatch count="2">
I am sorry. But I still cannot understand you. Please try again.
<reprompt/>
</nomatch>
<nomatch count="3">
I still cannot understand you. Please choose only from the options given.
<reprompt/>
</nomatch>

<noinput count="1">
No Input! I did not hear anything. Could you try that again?
<reprompt />
</noinput>
<noinput count="2">
I am sorry, but I still did not hear you. Please try again.
<reprompt/>
</noinput>
<noinput count="3">
Sorry, but I still did not hear anything. Please check your microphone.
See to it that it is plugged.
<reprompt/>
</noinput>

c. Making an if-then-else statement

<field name="choice">
<prompt > for the front vowel sounds, press 1. <break
strength="medium"/> for the back vowel sounds, press 2. <break
strength="medium"/> for the center vowel sounds, press 3. <break
strength="medium"/>
<audio src="beep.wav"/>
</prompt>

60
<grammar type="text/gsl">
[dtmf-1 dtmf-2 dtmf-3 dtmf-0]
</grammar>

</field>
<filled>
<if cond="choice==1">
<submit next="front.php" method="" namelist="student_id sec_id"/>
<elseif cond="choice==2"/>
<submit next="back.php" method="" namelist="student_id sec_id"/>
<elseif cond="choice==3"/>
<submit next="center.php" method="" namelist="student_id sec_id"/>
<elseif cond="choice==0"/>
<submit next="t-home.php" method="" namelist="student_id sec_id"/>
</if>
</filled>

d. Making a grammar

<grammar type="text/gsl">
[ dtmf-1 dtmf-3 dtmf-0 dtmf-9 dtmf-8]
</grammar>
<grammar type="text/gsl">
<![CDATA[[
[scene] {<response "scene">}
[leave] {<response "leave">}
[deed] {<response "deed">}
[heal] {<response "heal">}
[peak] {<response "peak">}
[zero] {<response "zero">}
[nine] {<response "nine">}
[one] {<response "one">}
[three] {<response "three">}
[eight] {<response "eight">}
]]]>
</grammar>
e. Making a subdialog

<subdialog name="SubD4" src="questionnaire.xml">


<param name="question" expr="'Number four. The word is, heal, spelled as, h, e, a, l.
Repeat after me, heal.'"/>
<filled>
<if cond="SubD4.response=='heal'">
<prompt> Correct! <break/></prompt>
<assign name="score" expr="score+1"/>
<elseif cond="SubD4.response==8" />
<goto nextitem="SubD5" />
<else/>
<prompt> You didn't get it right.<break/></prompt>
<goto nextitem="SubD5" />
</if>
</filled>
</subdialog>

61
f. Submitting the score to the database

$score = $_REQUEST["score"];
$exer_id = $_REQUEST["exer_id"];
$id = $_REQUEST["student_id"];
$sec_id = $_REQUEST["sec_id"];

$sql = "UPDATE evaluation SET score = $score WHERE stud_id = $id AND exer_id =
$exer_id";
$result = mysql_query($sql) or die (mysql_error());

g. Making a recording

<form id="Recording">
<var name="student_id" expr="'.$id.'"/>
<var name="sec_id" expr="'.$sec_id.'"/>
<var name="exer_id" expr="1"/>
<record name="file" beep="true" maxtime="60s" finalsilence="3000ms">
<prompt bargein="false">
You have one minute to make a recording. Start speaking after you hear the beep. Press
any DTMF key to end the recording.
</prompt>

<filled>
<prompt bargein="false">
This is your recording. <break strength="medium"/> <audio expr="file"/> <break/> </prompt>
<var name="recSample" expr="application.lastresult$.recording" />
<submit next="upload.php" method="post" namelist="file exer_id student_id sec_id"
enctype="multipart/form-data"/>
<goto next="#MainForm"/>
</filled>
</record>
</form>

62
Appendix C
SPEECH TUTOR RUBRICS

Each criterion is rated Excellent (4), Competent (3), Needs Work (2), or Unacceptable (1);
the ratings are then totaled.

Voice Projection
    Excellent: Clearly and consistently audible
    Competent: Mostly audible
    Needs Work: Sometimes audible
    Unacceptable: Inaudible

Pronunciation
    Excellent: All words were pronounced correctly.
    Competent: At most 75% of the words were pronounced correctly.
    Needs Work: Only about 50% of the words were pronounced correctly.
    Unacceptable: Only less than or up to 25% of the words were pronounced correctly.

Pace
    Excellent: Consistently effective
    Competent: Mostly effective
    Needs Work: At times too fast or too slow
    Unacceptable: Consistently too fast or too slow

Transitions
    Excellent: Smooth transitions are used.
    Competent: Transitions are moderately acceptable.
    Needs Work: Transitions may be awkward.
    Unacceptable: Transitions may be needed.

Appropriateness
    Excellent: Language is familiar to the listener, appropriate for the presentation
    setting, and free of bias.
    Competent: Language is not disrespectful or offensive.
    Needs Work: Some biased or unclear language is used.
    Unacceptable: Language is questionable or inappropriate for some listeners,
    occasion, or setting.

Organization
    Excellent: Speech Tutor presents information in a logical, interesting sequence
    which the listener can follow.
    Competent: Speech Tutor presents information in a logical sequence which the
    audience can follow.
    Needs Work: Listener has difficulty following the presentation because Speech
    Tutor jumps around.
    Unacceptable: Listener cannot understand because there is no sequence of
    information.

Responsiveness
    Excellent: Speech Tutor keeps the listener engaged all the time.
    Competent: Speech Tutor is able to keep the listener engaged most of the time.
    Needs Work: Speech Tutor kept the audience engaged for a short time.
    Unacceptable: Speech Tutor is not able to keep the audience engaged.

Convenience
    Excellent: Speech Tutor is convenient to use all the time.
    Competent: Speech Tutor is convenient and easy to use most of the time.
    Needs Work: There are times when Speech Tutor is inconvenient and difficult to use.
    Unacceptable: Speech Tutor is inconvenient and difficult to use.

Acceptance
    Excellent: I will use Speech Tutor in teaching vowel sounds as soon as possible.
    Competent: I will use Speech Tutor in teaching vowel sounds after a few revisions
    are made.
    Needs Work: I will use Speech Tutor in teaching vowel sounds after major revisions
    are made.
    Unacceptable: I will not use Speech Tutor in teaching vowel sounds.

Relevance
    Excellent: I strongly agree that Speech Tutor will aid my students in studying
    vowel sounds.
    Competent: I agree that Speech Tutor will be helpful in studying the vowel sounds.
    Needs Work: Speech Tutor may be helpful in studying the vowel sounds.
    Unacceptable: I strongly disagree that Speech Tutor will aid my students in
    studying the vowel sounds.

TOTAL _______
