VoiceXML Review - Feature Articles

Volume 2, Issue 1 - January 2002

Building the VoiceXML Forum Certification Program

By Greg Harman

(Continued from Part 1)

The important thing to notice here is that the user interfaces change, but the rest of the application stays the same. At this abstract level, we could think of this as not three, but one single application with three user interfaces (see Figure 3).

Figure 3 - Integrated Application Architecture

At this point it should be clear that the fundamental difference between the three applications is the user interface, so we will now delve into some of the details of user interface design for v-commerce.

Audio data is sequential: the user must remember everything that was said, because it is difficult at best to go back and hear earlier material. Any menu presented must therefore be kept short, as in an m-commerce application. However, a voice application can increase usability by making use of grammars, natural-language constructs that allow users to select menu options without remembering an exact keyword. In addition, grammars can enable a mixed-initiative dialog, which lets users who are familiar with the application navigate globally and skip the menu system. For example, a user could say "black top hats" at the first prompt and skip to the ordering section, rather than saying "top hats" at the style menu, waiting for the color menu, and then saying "black."
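The "black top hats" shortcut can be illustrated with a minimal Python sketch. It is not VoiceXML and not the way any particular voice platform implements grammars; it only shows the idea of filling several menu choices from one utterance. The vocabulary, the menu order, and the function names (parse_utterance, next_step, run_dialog) are all assumptions made for illustration.

```python
# Hypothetical vocabulary for the hat store; the article does not list the real one.
STYLE_PHRASES = {"top hat": "top hat", "top hats": "top hat",
                 "bowler": "bowler", "bowlers": "bowler"}
COLOR_PHRASES = {"black": "black", "gray": "gray", "grey": "gray"}

# Menus are visited in this order when the caller gives one answer at a time.
MENU_ORDER = ["style", "color"]


def parse_utterance(utterance):
    """Return every slot (style, color) that the utterance mentions."""
    # Normalize case and spacing, then pad so phrases match on word boundaries.
    padded = " " + " ".join(utterance.lower().split()) + " "
    slots = {}
    for phrase, style in STYLE_PHRASES.items():
        if " " + phrase + " " in padded:
            slots["style"] = style
    for phrase, color in COLOR_PHRASES.items():
        if " " + phrase + " " in padded:
            slots["color"] = color
    return slots


def next_step(slots):
    """Name the first menu still unanswered, or "order" when none remain."""
    for name in MENU_ORDER:
        if name not in slots:
            return name
    return "order"


def run_dialog(utterances):
    """Feed caller utterances in turn; return the filled slots and turns used."""
    slots = {}
    turns = 0
    for utterance in utterances:
        turns += 1
        slots.update(parse_utterance(utterance))
        if next_step(slots) == "order":
            break
    return slots, turns
```

Under these assumptions, run_dialog(["black top hats"]) reaches the ordering step in one turn, while run_dialog(["top hats", "black"]) needs two, which is the saving that a mixed-initiative dialog offers an experienced caller.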

As with m-commerce applications, the biggest challenge in v-commerce is data input. ASR (Automatic Speech Recognition) does not yet provide accurate transcription of free-form speech, so another method is required. Text could be entered one letter at a time using an alphabet grammar, but this is far too cumbersome to be practical, and it is also limited by ASR's ability to distinguish similar-sounding letters (such as the letters that rhyme with 'E'). Where VoiceXML excels is in creating menus with a fixed number of options and providing intuitive grammars for those options. Therefore, we must find a way to let the user supply arbitrary text information through set menus and patterns.

There are two primary methods of handling arbitrary text information in current enterprises: keyboard entry and tried-and-true human conversation. For a small enterprise (like our hat store), or for an enterprise that can afford powerful transcription software, speech input may be a reality. VoiceXML lets the user record information, such as a shipping address. That recording is then entered into the shipping database either manually or by the transcription software. This method can be used, but it is important to understand that the VoiceXML system cannot interact with this information; it can only record it. Furthermore, the input field must be very specialized (e.g. labeled as the shipping address) for transcription software to make effective use of it.

Perhaps the best option (until multi-modal technology becomes widely available) is to make use of a connected e-commerce application to enter arbitrary text. A user enters arbitrary text "profiles" through the web interface, and these profiles are saved and made available to the audio interface as menu options. For example, a user can enter two shipping addresses, and then have them available from the audio interface as a two-option menu.
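As a rough sketch of this profile approach, the Python below keeps saved addresses in a small in-memory store that stands in for the shared database. The web interface writes to it and the audio interface reads from it to build a fixed menu. The class name ProfileStore, the function build_address_menu, the labels, and the prompt wording are hypothetical and not taken from any real product.

```python
class ProfileStore:
    """In-memory stand-in for the shared database (hypothetical)."""

    def __init__(self):
        self._addresses = {}  # user id -> list of (label, address) pairs

    def save_address(self, user_id, label, address):
        """Called by the web interface, where typing free text is easy."""
        self._addresses.setdefault(user_id, []).append((label, address))

    def addresses_for(self, user_id):
        """Return a copy of the user's saved (label, address) pairs."""
        return list(self._addresses.get(user_id, []))


def build_address_menu(store, user_id):
    """Called by the voice interface: turn saved profiles into a spoken menu.

    Returns the prompt to speak and a mapping from each recognizable
    (lower-cased) label to the address it selects.
    """
    entries = store.addresses_for(user_id)
    if not entries:
        return "No saved shipping addresses. Please add one on the website.", {}
    labels = [label for label, _ in entries]
    prompt = "Ship to " + " or ".join(labels) + "?"
    choices = {label.lower(): address for label, address in entries}
    return prompt, choices
```

A user who saved addresses labeled "Home" and "Office" on the website would then hear a two-option menu, and the caller only ever has to speak a short label rather than a full address.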

So in the end, what is needed is an application architecture in which the database and application logic support multiple user interfaces, and the user interfaces communicate with each other (in practice, they communicate through the lower application layers).

Figure 4 - Complete Architecture for an e-m-v Commerce Application
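The layering described above can be sketched in a few lines of Python. This is only an illustration of the principle, with one shared data and logic layer and three thin presentation functions; the catalog contents and all function names are assumptions, and a real deployment would generate HTML, WML, and VoiceXML rather than plain strings.

```python
# Shared lower layers: data and application logic, written once.
CATALOG = {"top hat": ["black", "gray"], "bowler": ["black"]}  # hypothetical data


def place_order(style, color):
    """Application logic used by every interface."""
    if color not in CATALOG.get(style, []):
        raise ValueError(f"no {color} {style} in the catalog")
    return {"style": style, "color": color, "status": "ordered"}


# Interface layer: each function only decides how to present the same data.
def render_web(catalog):
    """Full listing; a web page has room to show every option at once."""
    return [f"{style} ({', '.join(colors)})" for style, colors in catalog.items()]


def render_mobile(catalog):
    """Short list of styles only, to suit a small screen."""
    return list(catalog)


def render_voice(catalog):
    """A single spoken prompt, because audio is sequential."""
    return "Which style would you like: " + ", or ".join(catalog) + "?"
```

Because all three front ends call the same place_order function and read the same catalog, adding a voice interface in this arrangement means writing one more presentation function, not a second application.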

Constructing an application to meet this architecture at first seems to require a large development effort, especially if there is an e-commerce and/or m-commerce application already in place. However, as explained above, v-commerce is really another aspect of the same application, not a new application to be created independently of the others. Unfortunately, many existing e-commerce applications were not designed with multiple user interfaces in mind, and certainly not with interaction between the different interfaces in mind. A modern application server makes this sort of functionality possible, but using one requires rebuilding the entire existing e-commerce application in order to add mobile or audio functionality.

The Clickmarks Platform is a tool that enables rapid extension of existing applications to new user interfaces, such as VoiceXML, and allows communication between those interfaces. The platform simply re-uses the existing components (website, database, etc.) and provides an easy mechanism to convert the web or mobile interface into an audio-only voice interface. It provides an easy way to alter the web application as needed to add support for audio/mobile input, and provides tools for recognizing pieces of web content and translating them into a voice application.

In addition, the Clickmarks Platform supports VoiceLet technology. VoiceLets are small standalone VoiceXML applications that provide functions such as POP3 or IMAP e-mail access, web access, and LDAP directory access. They can be inserted into any existing VoiceXML application to add functionality easily, without programming.

In summary, v-commerce, e-commerce, and m-commerce applications should be viewed as different user interfaces to the same application. These interfaces should communicate with each other, as each has different strengths and weaknesses, and all can be used to enhance the whole.


Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).