Volume 1, Issue 5 - May 2001
   
 

Ten Steps to a Commercial-grade VoiceXML Application

By T. Todd Elvins

The VoiceXML revolution is just beginning. Only within the past few months have VoiceXML interpreters become robust, feature-rich and capable of supporting a carrier-grade commercial application. Work still remains to further improve scalability, efficiency and density, and even to agree on a common interpretation for every tag and attribute in the specification. This article assumes that carrier-grade VoiceXML interpreters will continue to mature and evolve, and focuses instead on the challenges of developing a commercial-grade VoiceXML application to run on the interpreter.

Superficially, it might seem that developing a commercial-grade application in VoiceXML would save months of development time. In practice, developing a commercial-grade VoiceXML application for the first time requires nearly as much effort as developing an application hard-coded to a particular speech recognition API, and, as discussed below, the two processes involve most of the same steps. Nevertheless, the resulting VoiceXML application is well worth the effort, possessing a number of advantages over an equivalent hard-coded application: it is portable across platforms and somewhat portable across ASR vendors, it can interoperate with other VoiceXML applications, and it benefits from a distributed, HTTP-based architecture.
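For readers who have not yet seen the language, the sketch below shows roughly what a minimal static VoiceXML 1.0 document looks like; the prompt wording is invented for illustration and is not taken from any Indicast application.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <!-- A single dialog (form) containing one spoken prompt. -->
    <form id="welcome">
      <block>
        <prompt>
          Welcome. This document was fetched over HTTP,
          just like an ordinary web page.
        </prompt>
      </block>
    </form>
  </vxml>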

Experience on the Leading Edge

Indicast is a premier provider of private-label voice portal services to the telecommunications, Web, and enterprise industries. Indicast has amassed the largest database of professionally produced audio content from leading brands, such as ABCNEWS.com, The Wall Street Journal, and Associated Press, covering more than 1,000 topics. Indicast also offers voice-activated dialing, unified messaging services, business finder services, driving directions, and other voice-activated telephone services.

This comprehensive voice portal content and services suite, combined with Indicast's innovative "playlist" user interface design, provides a compelling voice portal solution available on a private-label basis.

Indicast decided early in its history to develop in 100% VoiceXML with no proprietary extensions. The Indicast voice portal service has been launched in the USA by Centennial Wireless and is now commercially available, demonstrating that it is possible today to deploy real voice applications written in 100% VoiceXML. Along the way, Indicast has gained valuable expertise in VoiceXML development and deployment. The following 10-step program is based on our pioneering experiences in this area, and will help minimize the risks associated with designing a carrier-grade system based on an emerging standard like VoiceXML.

The 10-Step Program

The tasks required to design, build, and deploy a commercial-grade VoiceXML application are listed below. All of the steps are challenging, but the two most demanding are requirements and design, and speech recognition tuning.

1. Attend a course on speech recognition. A number of the speech recognition companies offer high-quality classes on designing and developing speech-based applications, building and tuning grammars, and managing speech-based software projects. For developers lacking ASR experience, this training will save months of trial and error. Attending such a course can be an eye opener for those uninitiated in the subtleties of speech recognition. A list of speech recognition companies can be found at the VoiceXML Forum's web site (www.voicexml.org).

2. Design your application and voice interface. Probably the most challenging phase of developing a commercial-grade speech application, whether in VoiceXML or any other technology, is arriving at a good, usable voice interface design. Begin by enumerating all of the application requirements for the first few versions. Next, hire a linguist with a background in voice interface design to create an overall voice interaction philosophy and the interface designs for all components of the application. The voice interaction philosophy should be dictated by the type of data being accessed. For example, a unified messaging virtual assistant like Webley should have a strong persona, while Indicast's voice portal, which delivers primarily personalized audio content, should provide a more passive, deferential, reactive voice interface. This data-driven philosophy results in "content-specific voice interfaces." Once the voice interface is designed, the next step is to conduct user studies, observations of naïve users, and "Wizard of Oz" experiments to determine which components are understandable and which interactions are problematic. Some of these studies can be done orally or on paper, while others may require a rough prototype. The results of these studies should be used to redesign, tune, refine, and then redesign again. Any effort expended at this early stage will pay off ten-fold later. If you cannot find a linguist to help with this step, contact professional services at one of the ASR companies and ask to purchase some time with one of their linguists.
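As a rough illustration of the two interaction philosophies mentioned above, the sketch below places a strong-persona prompt and a more deferential, content-first prompt side by side. The wording of both prompts is invented; in a real service only one style would be used consistently throughout.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form id="style_contrast">
      <block>
        <!-- Strong persona, as a virtual assistant might use
             (wording invented for illustration): -->
        <prompt>
          Hi, this is your assistant. You have three new messages.
          Would you like me to read them to you?
        </prompt>
        <!-- Deferential, content-first style, as an audio portal
             might use (wording invented for illustration): -->
        <prompt>
          Top stories. Say next, back, or main menu at any time.
        </prompt>
      </block>
    </form>
  </vxml>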

3. Make use of available VoiceXML tools. A number of VoiceXML development environments exist and can be found via the VoiceXML Forum's web site (www.voicexml.org). These development environments are a good way to get started with static VoiceXML; however, they will probably not, by themselves, yield the full-featured, dynamic VoiceXML application you have specified, and some features will undoubtedly require hand-coding and external calls. Some VoiceXML developer options are described below, followed by a sketch of the kind of static document these tools produce.

  • VoiceXML URL registration on a web site. The URL of a prototype static VoiceXML file or a VoiceXML generator can be registered at a web site, and the developer can then call a phone number to interact with the VoiceXML file. A number of these web sites are available; some offer logging and debugging facilities, and some do not.

  • Web-based development environments. These offer the features described above, plus additional capabilities that let developers build more efficiently, including VoiceXML debuggers, editors, and grammar modules.

  • VoiceXML development environments. Several full-featured visual VoiceXML development environments have appeared recently. These work much the way Visual Java works, except that they lack a compiler. These visual development environments include a visual editor, a debugger, grammar builders, grammar modules, and many other features. Some of these packages include modules for dynamically generating VoiceXML.
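For reference, here is a sketch of the kind of static document these tools help produce, including the hand-off to a server-side script that generates the next document dynamically. The URL is a placeholder, and the inline GSL grammar is only one possible format; grammar syntax and the grammar type attribute vary by interpreter.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form id="cityweather">
      <field name="city">
        <prompt>Which city would you like the weather for?</prompt>
        <!-- Placeholder grammar in Nuance GSL; the format and
             type attribute depend on the interpreter. -->
        <grammar type="application/x-gsl">
          [ boston chicago seattle ]
        </grammar>
        <noinput>
          <prompt>Sorry, I did not hear you.</prompt>
          <reprompt/>
        </noinput>
        <filled>
          <!-- Hand off to a server-side script (placeholder URL)
               that generates the next VoiceXML document. -->
          <submit next="http://www.example.com/weather.cgi"
                  namelist="city"/>
        </filled>
      </field>
    </form>
  </vxml>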

4. Tune end points. An utterance is a single instance of a spoken command from a particular user, or "talker." End points are parameters that describe the speech recognition listening window, for example the expected length range of an utterance, the expected amount of silence on either end of the utterance, and other settings that govern how the utterance is processed. Tuning the end-point parameters can have a dramatic effect on the speech recognizer's ability to determine which voice command was spoken. By recording a few dozen utterances from each of a few hundred talkers (in varying environments from quiet to noisy), and then listening to and/or transcribing the utterances, a developer can determine an optimal listening window. Developers may want to enlist tools and professional services from their ASR provider to complete this step of the project, at least the first time.
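Several of the listening-window parameters described above surface in VoiceXML as recognizer properties. The sketch below is illustrative only: the timeout values are placeholders, the right numbers come out of the transcription exercise just described, and the exact set of supported properties varies by interpreter.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <!-- Placeholder end-point settings at document scope; derive
         real values from transcribed utterance data, and check
         which properties your interpreter supports. -->
    <property name="timeout" value="5s"/>             <!-- silence allowed before a noinput event -->
    <property name="completetimeout" value="0.75s"/>  <!-- trailing silence after a complete grammar match -->
    <property name="incompletetimeout" value="1.5s"/> <!-- trailing silence after a partial match -->
    <form id="mainmenu">
      <field name="choice">
        <prompt>Say news, weather, or sports.</prompt>
        <!-- Placeholder GSL grammar; format varies by interpreter. -->
        <grammar type="application/x-gsl">
          [ news weather sports ]
        </grammar>
      </field>
    </form>
  </vxml>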


Continued...


 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).