VoiceXML Review - Feature Articles

Volume 2, Issue 7 - November/December 2002

Standardizing VoiceXML Generation Tools

By David L. Thomson

Introduction
Development and runtime tools are an area where we have an opportunity to make VoiceXML easier to use and more portable. Compared to previous methods, VoiceXML provides two significant advantages in authoring speech-enabled applications. It allows a developer to build speech services with less effort, and it allows applications written for one speech platform to run on another. These advantages are diminished, however, if the software tools used to create and support VoiceXML code are inadequate or incompatible. The VoiceXML Tools Committee, under the direction of the VoiceXML Forum, has been working on methods for improving the quality and uniformity of tools, as described below.

To define a process for improvement, we must first outline an architecture that illustrates how tools are connected. Companies currently building tools include application developers, speech server suppliers, speech engine vendors, speech hosting service bureaus, stand-alone tool developers, and customers. An informal survey of commercial tools suggests that the architecture illustrated in Figures 1-3 describes most VoiceXML toolsets currently available. While the interfaces are often proprietary and vary from system to system, and not all modules are available from every vendor, most products fit this general framework.

Figure 1 shows a complete VoiceXML development system divided into three parts: an application development environment, a VoiceXML page server, and a VoiceXML gateway. Development tools may include a grammar compiler; a call flow editor that could be table-based, GUI-based, wizard-based, or script-based; a waveform editor; an expert system that assists the developer in making wise service design choices; error checking routines; and finite-state and n-gram grammar generation software. The output of the development environment may be VoiceXML code or a representation of the service call flow in a form that is later converted to VoiceXML pages by the VoiceXML server.

During runtime, the VoiceXML page server provides VoiceXML pages to the gateway in response to user input and other events. The VoiceXML gateway executes VoiceXML code and uses text-to-speech and speech recognition software to communicate with callers. The VoiceXML gateway, together with its associated speech recognition and synthesis software, the VoiceXML interpreter, and related software, lies largely outside the scope of the VoiceXML tools effort and is treated only lightly in this paper.

Figure 2 shows a detailed view of the application development tools block. The call flow designer helps the developer write an application, either via a text editor or a GUI (graphical user interface). It uses a grammar builder that creates grammar structures for use by a speech recognizer. In addition, it may support pre-built high-level scripting objects that encode common user interactions.

Another feature of the application development tools block is service analysis and testing. Data collected from the server and the gateway during development, trials, and live service is used to iteratively improve the application. This information is an example of runtime data created during operation for which few standards currently exist.

An important characteristic of a service creation environment is the form of its output. While simple applications may be written directly in VoiceXML, many services (particularly complex services) are written in an intermediate form such as Java, C++, ASP, proprietary scripts, or XML, and converted to VoiceXML by the VoiceXML page server. For our purposes, we call the intermediate form meta code, written in a given meta language. The meta code created by the development tools specifies the call flow (the behavior of the system in response to a caller) and is used by the VoiceXML page server.
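To make the idea of meta code concrete, here is a minimal sketch in Python of how one call-flow state, held as plain data, might be rendered into a VoiceXML page. The dictionary layout, its field names, the grammar file name, and the `/flow/confirm_city` URL are all assumptions made for illustration; they do not correspond to any vendor's meta language or to anything proposed by the Tools Committee.

```python
import xml.etree.ElementTree as ET

# Hypothetical meta code: one dialog state described as plain data.
# All field names and values here are illustrative assumptions.
STATE = {
    "name": "get_city",
    "prompt": "Which city are you calling about?",
    "grammar_src": "city.grxml",
    "next": "/flow/confirm_city",
}


def render_vxml(state):
    """Render one meta-code state as a VoiceXML 2.0-style page (a string)."""
    vxml = ET.Element(
        "vxml", {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"}
    )
    form = ET.SubElement(vxml, "form", {"id": state["name"]})
    field = ET.SubElement(form, "field", {"name": state["name"]})
    # The prompt is played to the caller; the grammar constrains recognition.
    ET.SubElement(field, "prompt").text = state["prompt"]
    ET.SubElement(field, "grammar", {"src": state["grammar_src"]})
    # When the field is filled, submit the recognized value to the page server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(
        filled, "submit", {"next": state["next"], "namelist": state["name"]}
    )
    return ET.tostring(vxml, encoding="unicode")


page = render_vxml(STATE)
```

The point of the sketch is the separation of concerns: the development tools would produce something like `STATE`, and the page server would own a function like `render_vxml`. Portability breaks down when the two sides disagree on the shape of that intermediate data.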

Figure 3 is a detailed view of the VoiceXML page server. It receives the service description represented in meta code and generates VoiceXML pages (and accepts corresponding signals from the VoiceXML gateway) during runtime. The process is controlled by a conversation manager, which may be a state machine or other similar software. The conversation manager may have access to customer, service, and other data. It may also access external systems such as e-mail, instant messaging, web pages, and even live agents when and if necessary.
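As a rough illustration of a conversation manager built as a state machine, the following Python sketch advances through a call flow based on what the gateway reports the caller said. The state names, the transition table, and the class itself are hypothetical; a production conversation manager would also consult customer data and external systems, which are omitted here.

```python
# Hypothetical call flow: state -> {recognized input -> next state}.
CALL_FLOW = {
    "main_menu": {"balance": "read_balance", "agent": "transfer_to_agent"},
    "read_balance": {"repeat": "read_balance", "menu": "main_menu"},
    "transfer_to_agent": {},
}


class ConversationManager:
    """Minimal state machine that tracks where a caller is in the call flow."""

    def __init__(self, flow, start):
        self.flow = flow
        self.state = start

    def handle(self, recognized):
        """Advance on caller input; stay in place (reprompt) if it is unknown."""
        transitions = self.flow[self.state]
        self.state = transitions.get(recognized, self.state)
        return self.state


manager = ConversationManager(CALL_FLOW, "main_menu")
first = manager.handle("balance")      # moves to "read_balance"
second = manager.handle("gibberish")   # no matching transition, so no move
third = manager.handle("menu")         # returns to "main_menu"
```

In a page server, the state returned by `handle` would determine which VoiceXML page is generated next for the gateway.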

The page server may generate runtime data related to caller actions, system parameters, or external information. This data, plus additional data received from the speech server, is stored in a logging database for use in billing, OAM&P, service analysis, and other purposes.
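One way to picture such a logging database is the Python sketch below, which uses the standard `sqlite3` module. The table name, the column names, and the sample values are assumptions chosen for illustration only; no schema is specified in this article.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a hypothetical runtime log with a single table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runtime_log ("
        " session_id TEXT, source TEXT, element TEXT, value TEXT)"
    )
    return conn


def log_event(conn, session_id, source, element, value):
    """Store one runtime data element reported by a page server or gateway."""
    conn.execute(
        "INSERT INTO runtime_log VALUES (?, ?, ?, ?)",
        (session_id, source, element, value),
    )
    conn.commit()


def events_for_session(conn, session_id):
    """Return (source, element, value) rows for one call, in insertion order."""
    cur = conn.execute(
        "SELECT source, element, value FROM runtime_log"
        " WHERE session_id = ? ORDER BY rowid",
        (session_id,),
    )
    return cur.fetchall()


conn = open_log()
log_event(conn, "s1", "page_server", "Database response time", "120 ms")
log_event(conn, "s1", "gateway", "Playback completed", "true")
log_event(conn, "s2", "gateway", "Telephony card failure", "card 3")
rows = events_for_session(conn, "s1")
```

Billing, OAM&P, and service analysis tools could then each query the same table, which is only practical across vendors if the stored elements are named and formatted consistently.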

There are many points in the tools domain that might benefit from standardization. We might define standards for all interfaces between tools. We might set up an open source network for developing tools. With finite resources, we must be realistic and choose those areas where we expect to reap the greatest benefit. The Tools Committee has identified two topics of particular interest: runtime data and the meta language. We treat each separately in the following two sections.

Runtime Data

In a live service, data is generated that is not represented in the VoiceXML language. This data includes quality of service information, OAM&P (operations, administration, maintenance, and provisioning) data, billing information, and data related to individual and collective call traffic. Since this data is not entirely covered by industry standards, each technology vendor uses a different approach for formatting, transporting, and storing the information.

Our approach to creating standards for the runtime data begins with an attempt to list the data elements we wish to capture. We divide the elements into six categories:

Data generated by the VoiceXML Gateway
Hardware and software processes
VoiceXML application data
ASR Performance and activity
TTS Performance and activity
Data processing

We estimate that there may be 100-200 data elements, a large but tractable number. A few illustrative examples include:

Conferencing 3rd party
Resetting speech channel
Telephony card failure
Playback completed
CPU idle percentage
ASR version number
TTS memory usage
VoiceXML audio cache hits/misses
Maximum call duration
VoiceXML session ID
ANI
Database response time

Once we have a reasonably complete list of elements, refined with input from industry participants, our next step is to define a transport and storage format. A successful runtime data standard will enable service providers to interchange VoiceXML gateways and page servers from different vendors.
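Since the transport and storage format has not yet been defined, the following Python sketch is purely illustrative. It shows one possible shape for a runtime data record and a line-oriented encoding of it. The field names, the use of JSON, and the sample values are all assumptions, not part of any committee proposal.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RuntimeEvent:
    """One hypothetical runtime data element, ready for transport or storage."""

    session_id: str  # VoiceXML session ID
    category: str    # one of the six categories, e.g. "ASR Performance and activity"
    element: str     # element name, e.g. "ASR version number"
    value: str       # element value, kept as text for simplicity
    timestamp: str   # time the element was recorded, as text


def encode(event):
    """Serialize an event to a single JSON line with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(line):
    """Rebuild an event from a line produced by encode()."""
    return RuntimeEvent(**json.loads(line))


event = RuntimeEvent(
    session_id="s1",
    category="ASR Performance and activity",
    element="ASR version number",
    value="example-version",
    timestamp="2002-11-01T12:00:00Z",
)
line = encode(event)
restored = decode(line)
```

Whatever encoding is eventually chosen, the useful property is the one shown here: a record written by one vendor's gateway can be read back without loss by another vendor's page server or analysis tool.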

The Meta Language

Development tools and runtime software on the VoiceXML page server must use the same meta language. Since the meta language is generally unique to a given tool vendor, runtime software on the VoiceXML page server will only work with development tools from the same vendor. One unfortunate consequence of this restriction is that applications written with one toolset will not necessarily run on a page server built by a different vendor. This incompatibility threatens to thwart one cause for which VoiceXML was created, that of application portability between platforms. If the target application code is written in VoiceXML, then the systems may be compatible, but our observation is that many applications are represented in a vendor-proprietary form (ASP, scripts, etc.) and then converted to VoiceXML at runtime.

In an effort to solve this incompatibility, the VoiceXML Tools Committee is studying ways to standardize the meta language. Vendors would then use the standard meta language to represent parameters of the call flow, even if vendor tools otherwise provide different features. Two proposals under consideration are 1) the XForms standard under development by the W3C and 2) an XML-based standard where style sheets convert between formats used by different vendors. This rather ambitious goal will, if successful, improve the interoperability of development and runtime tools and make applications portable across vendors.
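The second proposal can be pictured with a small Python sketch. Note that it substitutes a plain element-renaming function for a real style sheet, simply to keep the example self-contained, and that both the vendor vocabulary (`dialog`, `ask`, `say`) and the shared vocabulary (`callflow`, `question`, `prompt`) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one vendor's element names to a shared vocabulary.
VENDOR_A_TO_COMMON = {"dialog": "callflow", "ask": "question", "say": "prompt"}


def to_common(xml_text, mapping):
    """Rename elements according to the mapping; attributes and text are kept."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


vendor_a = '<dialog><ask id="city"><say>Which city?</say></ask></dialog>'
common = to_common(vendor_a, VENDOR_A_TO_COMMON)
```

A real conversion would need to handle structural differences between formats, not just names, which is part of what makes a standard meta language an ambitious goal.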

Conclusion

Tools for developing VoiceXML-based speech applications are a critical factor in making VoiceXML easy to use. While VoiceXML itself may be well-defined, industry software for generating VoiceXML code lacks uniformity. We have launched an effort to define two standards that will help VoiceXML systems interoperate across different vendors. The effort will define how applications are represented and how runtime data is transported and stored. We hope that this effort will foster the creation of better tools and make developing VoiceXML services faster and easier.


Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).