Voice User Interface
www.bestneo.com
Introduction
Definition
In its most generic sense, a voice portal can be defined as speech-enabled access to Web-based information. In other words, a voice portal provides telephone users with a natural-language interface to access and retrieve Web content. An Internet browser provides Web access from a computer but not from a telephone; a voice portal fills that gap.
Overview
The voice portal market is growing rapidly, creating opportunities for service providers to grow business and revenues. Voice-based Internet access uses rapidly advancing speech recognition technology to give users anytime, anywhere communication and access through the human voice, over an office, wireless, or home phone. This document describes the technology factors that are making voice portals the next big opportunity on the Web, as well as the approaches that service providers and developers of voice portal solutions can follow to make the most of this new market.
Why Voice?
Natural speech is the modality people use when communicating with one another, which makes it easier for a user to learn how to operate voice-activated services. As an output modality, speech has several advantages. First, auditory input does not interfere with visual tasks, such as driving a car. Second, it allows for easy incorporation of sound-based media, such as radio broadcasts, music, and voice-mail messages. Third, advances in TTS (text-to-speech) technology mean that text information can be transferred easily to the user. Natural speech also has an advantage as an input modality, allowing for hands-free and eyes-free use. With proper design, voice commands can be made easy for a user to remember, and they do not have to compete for screen space. In addition, unlike keyboard-based macros (e.g., Ctrl-F7), voice commands can be inherently mnemonic (for example, "call United Airlines"), obviating the need for hint cards. Speech can thus be used to create an interface that is easy to use and requires a minimum of user attention.
Voice recognition is difficult, yet sophisticated packages not only recognize a wide variety of speakers but also let experienced users interrupt menu prompts ("barge-in") and capture compound instructions such as "I'd like to transfer a thousand dollars from checking to savings" in one command rather than several. These features are designed not only to overcome the limitations of DTMF (touch-tone) input but also to increase customer use and acceptance of IVR (interactive voice response) systems. The hope is that customers will eventually be comfortable telling a machine "I want to add a driver to my Camry's policy." Besides taking some of the load off customer service representatives, VUI vendors promise an attractive ROI to help get these systems into insurers' IT budgets. ASR (automatic speech recognition) systems can also be enabled with voice authentication, eliminating the need for PINs and passwords.
Call centers themselves will likely transform into units designed to support customers regardless of whether contact comes from a telephone, the Web, e-mail, or a wireless device. At the same time, the "voice Web" is evolving, where browsers or Wireless Application Protocol (WAP)-enabled devices display information based on what the user vocally asks for. "We're definitely headed toward multi-modal applications," Ehrlich predicts. ASR vendors are working to make sure that VUI evolves to free staff from dealing with voice-related channels; it's better to have them supporting the various modes of service that are just now beginning to emerge.
Good VUI systems have multiple fallback strategies for speech recognition failure, such as asking callers to spell names or breaking a question into parts. The interface could revert to DTMF, especially if ASR was added to an existing DTMF-enabled system. Still, transferring an unintelligible customer to a human agent is frequently an early fallback option in order to minimize customer frustration.
Facilitating this evolution is XML's telephonic flavor, VoiceXML. As its name implies, VoiceXML is a markup language specification that defines the key elements of voice-enabled transactions, which also allows this data to be repurposed across any number of platforms and systems.
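The fallback strategies mentioned above (retrying by spelling, reverting to DTMF, handing off to an agent) can be sketched as an ordered list of input channels that the dialog tries in turn. This is only an illustrative sketch: the function name `run_fallbacks`, the channel labels, and the recognizer callables are hypothetical and do not correspond to any particular VUI product or to VoiceXML itself. The order of the list is a design decision; as noted above, some systems place the transfer to a human agent early.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical recognizer type: takes a prompt, returns the recognized
# text, or None when recognition fails.
Recognizer = Callable[[str], Optional[str]]


def run_fallbacks(
    strategies: List[Tuple[str, Recognizer]], prompt: str
) -> Tuple[str, Optional[str]]:
    """Try each (channel, recognizer) pair in order.

    Returns the first channel that produced a result together with that
    result. If every strategy fails, the caller is routed to a human
    agent, signalled here by the channel label "agent".
    """
    for channel, recognize in strategies:
        result = recognize(prompt)
        if result is not None:
            return channel, result
    return "agent", None
```

A caller might pass `[("speech", ask_speech), ("spelling", ask_spelling), ("dtmf", ask_dtmf)]`, where the three callables stand for whatever recognition back ends the platform provides.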
Speaker Verification
The speaker-specific characteristics of speech are due to differences in physiological and behavioral aspects of the speech production system in humans. The main physiological aspect of the human speech production system is the vocal tract shape. The vocal tract is generally considered as the speech production organ above the vocal folds, which consists of the following: (i) laryngeal pharynx (beneath the epiglottis), (ii) oral pharynx (behind the tongue, between the epiglottis and velum), (iii) oral cavity (forward of the velum and bounded by the lips, tongue, and palate), (iv) nasal pharynx (above the velum, rear end of nasal cavity), and (v) nasal cavity (above the palate and extending from the pharynx to the nostrils). The shaded area in figure 1 depicts the vocal tract.
Figure 1:
The vocal tract modifies the spectral content of an acoustic wave as it passes through it, thereby producing speech. Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract. In order to characterize the features of the vocal tract, the human speech production mechanism is represented as a discrete-time system of the form depicted in figure 2.
The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these. Phonated excitation occurs when the airflow is modulated by the vocal folds. Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilage at the rear of the nearly closed vocal folds. Frication excitation is produced by constrictions in the vocal tract. Compression excitation results from releasing a completely closed and pressurized vocal tract. Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue. Speech produced by phonated excitation is called voiced, that produced by phonated excitation plus frication is called mixed voiced, and that produced by other types of excitation is called unvoiced.
It is possible to represent the vocal tract in a parametric form as the transfer function H(z). In order to estimate the parameters of H(z) from the observed speech waveform, it is necessary to assume some form for H(z). Ideally, the transfer function should contain poles as well as zeros. However, if only the voiced regions of speech are used, then an all-pole model for H(z) is sufficient. Furthermore, linear prediction analysis can be used to efficiently estimate the parameters of an all-pole model. Finally, it can also be noted that the all-pole model is the minimum-phase part of the true model and has an identical magnitude spectrum, which contains the bulk of the speaker-dependent information. The above discussion also underlines the text-dependent nature of the vocal-tract models. Since the model is derived from the observed speech, it is dependent on the speech. Figure 3 illustrates the differences in the models for two speakers saying the same vowel.
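As a rough illustration of how linear prediction estimates an all-pole model, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion, a common choice that the text does not itself name. It assumes the model has the form H(z) = G / (1 - a_1 z^-1 - ... - a_p z^-p), so that each sample is predicted as a weighted sum of the p previous samples. The function names and the pure-Python style are illustrative assumptions; a real system would normally window each frame and use an optimized numerical library.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the linear-prediction normal equations.

    Returns (a, err) where a[k-1] is the predictor coefficient a_k in
    s[n] ~ sum_k a_k * s[n-k], and err is the final prediction error.
    """
    a = [0.0] * (order + 1)  # a[0] is unused; a[k] holds a_k
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Example: a decaying exponential is the impulse response of a
# one-pole filter with a_1 = 0.5, so the second coefficient should
# come out close to zero.
frame = [0.5 ** n for n in range(60)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```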
Figure 3:
Choice of Features:
LPC (linear predictive coding) features were very popular in early speech-recognition and speaker-verification systems. However, comparing two LPC feature vectors requires computationally expensive similarity measures such as the Itakura-Saito distance, and hence LPC features are unsuitable for use in real-time systems. Furui suggested the use of the cepstrum, defined as the inverse Fourier transform of the logarithm of the magnitude spectrum, in speech-recognition applications. With the cepstrum, the similarity between two cepstral feature vectors can be computed as a simple Euclidean distance. Furthermore, Atal has demonstrated that the cepstrum derived from the LPC features results in the best performance, in terms of false acceptance rate (FAR) and false rejection rate (FRR), for a speaker verification system. Consequently, we have decided to use the LPC-derived cepstrum for our speaker verification system.
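The following sketch shows one way to obtain cepstral coefficients from LPC coefficients and to compare two cepstral vectors with a Euclidean distance. It uses the standard recursion for the cepstrum of a minimum-phase all-pole model and assumes the sign convention H(z) = G / (1 - a_1 z^-1 - ... - a_p z^-p); the gain term c_0 is omitted. The function names are illustrative, and the source does not say which variant of the recursion or which number of coefficients its system uses.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients a_1..a_p to cepstrum c_1..c_n_ceps.

    a[k-1] holds a_k. The recursion is
        c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}
    where a_n is taken as zero for n > p.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is not computed
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def euclidean_distance(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

For a single pole at 0.5, the recursion reproduces the known closed form c_n = 0.5^n / n, which is a convenient sanity check.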
Speaker Modeling:
Using cepstral analysis as described in the previous section, an utterance may be represented as a sequence of feature vectors. Utterances spoken by the same person at different times result in similar but not identical sequences of feature vectors. The purpose of voice modeling is to build a model that captures these variations in the extracted set of features. Two types of models have been used extensively in speaker verification and speech recognition systems: stochastic models and template models. The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. However, recent work on stochastic models has demonstrated that these models are more flexible and hence allow for better modeling of the speech production process. A very popular stochastic model for the speech production process is the Hidden Markov Model (HMM). HMMs are extensions of conventional Markov models in which the observations are a probabilistic function of the state; that is, the model is a doubly embedded stochastic process where the underlying stochastic process is not directly observable (it is hidden). The HMM can only be viewed through another set of stochastic processes that produce the sequence of observations.
Thus, the HMM is a finite-state machine, where a probability density function p(x | s_i) is associated with each state s_i. The states are connected by a transition network, where the state transition probabilities are a_{ij} = p(s_i | s_j). A fully connected three-state HMM is depicted in figure 4. For speech signals, another type of HMM, called a left-right model or a Bakis model, is found to be more useful. A left-right model has the property that as time increases, the state index increases (or stays the same); that is, the system states proceed from left to right. Since the properties of a speech signal change over time in a successive manner, this model is very well suited for modeling the speech production process.
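To make the left-right constraint concrete, the sketch below builds a transition matrix in which each state may only stay where it is or advance to the next state. Note that the matrix here is indexed as `A[prev][next]`, a layout chosen for this example, so that each row sums to one; the self-loop probability `stay` and the function name are illustrative assumptions rather than values taken from the system described above.

```python
def left_right_transitions(n_states, stay=0.6):
    """Build a simple left-right (Bakis) transition matrix.

    A[prev][next] is the probability of moving from state `prev` to
    state `next`. Each state either repeats (probability `stay`) or
    advances by one; the last state can only repeat.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        if s == n_states - 1:
            A[s][s] = 1.0
        else:
            A[s][s] = stay
            A[s][s + 1] = 1.0 - stay
    return A


# A three-state example: every entry below the diagonal is zero, so
# the state index can never decrease over time.
A = left_right_transitions(3)
```

A fully connected model, by contrast, would allow non-zero entries everywhere in the matrix.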
Pattern Matching:
The pattern matching process involves the comparison of a given set of input feature vectors against the speaker model for the claimed identity and computing a matching score. For the Hidden Markov models discussed above, the matching score is the probability that a given set of feature vectors was generated by the model.
A Speaker Verification System:
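The matching score described above can be computed with the forward algorithm. The sketch below is a simplified illustration: it assumes discrete observation symbols with a per-state emission table instead of the continuous densities p(x | s_i) used for cepstral vectors, it indexes transitions as `trans[prev][next]`, and the acceptance threshold is a hypothetical tuning parameter. For long utterances, practical implementations work with log probabilities or scaling to avoid numerical underflow.

```python
def forward_likelihood(init, trans, emit, observations):
    """Probability that the model generated the observation sequence.

    init[s]          : probability of starting in state s
    trans[prev][nxt] : transition probability from prev to nxt
    emit[s][symbol]  : probability of emitting `symbol` in state s
    """
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * trans[prev][s] for prev in range(n)) * emit[s][obs]
            for s in range(n)
        ]
    return sum(alpha)


def accept_claim(score, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold


# Toy two-state left-right model with two observation symbols (0 and 1).
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
matching_score = forward_likelihood(init, trans, emit, [0, 1])
mismatched_score = forward_likelihood(init, trans, emit, [1, 0])
```

Raising the threshold reduces false acceptances at the cost of more false rejections, which is the FAR/FRR trade-off a verification system has to tune.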
Background Noise: Because mobile phone users often access voice portals, background noise has been difficult for voice recognition developers to filter out. The development of better microphones has helped, but wind, murmurs, and music have made it a challenge to properly isolate voice from noise.
5. Continuous speech recognition: Designing systems that are powerful enough to understand and respond to continuous speech requires a large amount of processing power that was not available at reasonable cost in the past. When a person speaks at a natural rate, it has been difficult to distinguish which sounds are associated with which words. For example, the phrase "to recognize speech" could be misunderstood as "to wreck a nice beach". Users do not naturally speak with pauses between words. As a result, processing phrases in real time as they are naturally spoken has been a major challenge.
Many of the solutions to the problems associated with speech recognition are still being fine-tuned. Powerful computers have provided the processing power to overcome many of the limitations of the past. However, the most common speech-recognition systems of today are still very different from the way in which people naturally interact.
Conclusion
Speech-enabled Internet portals, or voice portals, are quickly becoming the hottest trend in e-commerce, broadening access to Internet content to everyone with the most universal communications device of all: a telephone. Voice portals put all kinds of information at a customer's fingertips anytime, anywhere. Customers just dial the voice portal's prescribed number and use simple voice commands to access whatever information they need. It's quick, easy, and effective, even from a car or the airport. The potential for voice portals is as wide as the reach of telephones, which today number 1.3 billion around the world. Compare that to the 250 million computers with Internet access and it is easy to understand why analysts believe voice-enabled Web access will take off. Feasibility studies suggest that by 2005, 45 million wireless users around the world will regularly use voice portals to handle their everyday cyber chores. Today's voice portals are just the tip of the iceberg: the first step in changing the way people access Internet content and, ultimately, how businesses and consumers will conduct business over the Internet. Over the next few years, voice portals and the core technologies behind them are poised to change how businesses view and interact with their customers. Voice portals are changing the telephone interaction from a vendor-centric to a customer-centric experience, increasing satisfaction for customers while improving efficiency and cutting costs for businesses. A voice portal provides telephone users with a natural-language interface to access and retrieve Web content. An Internet browser can provide Web access from a computer but not from a telephone; a voice portal fills that gap. Of course, simple access and retrieval of information is just the beginning.
A voice portal can also provide users with access to virtual personal assistance and Web-based unified messaging applications. Voice portals can also cut operating expenses by freeing up agent time and replacing human operators with an easy-to-use automated solution. They also provide new revenue opportunities by opening up the possibility of new subscription services or building revenue through advertising. Voice portals are the next frontier in convergence, the intersection of the Internet and telecommunications, blurring the distinctions between voice and data and between computers and telephones. As with any frontier, the rewards are great for staking your claim early. The voice portal reference system is a packaged, integrated hardware and software reference system for building hardened e-business and speech-enabled voice portal solutions. Combining the power of server technology with telephony interface boards in an integrated server reference system, it is embraced by leading speech-technology providers.