VoiceXML Review - Feature Articles

Volume 3, Issue 5 - September/October 2003

Developing Multimodal Applications using XHTML+Voice

By:

Richard Miranti (Advisory Software Engineer, IBM)
David Jaramillo (Senior Software Engineer, IBM)
Soonthorn Ativanichayaphong (Staff Software Engineer, IBM)
Marc White (Staff Software Engineer, IBM)

Executive Summary

On the Internet, people use browsers to visit Web sites, access documents from networks, and fill out forms. With this growing capability to retrieve information, communication between users and their devices is receiving more attention. As devices become smaller, other means of input -- in addition to keyboard or tap screen -- are becoming necessary. Small handheld devices, including cell phones and PDAs, now contain sufficient processing power to handle multiple tasks. On some devices it is difficult to perform these tasks using only a keyboard, stylus, or handwriting recognition. This has led to a new application technology called multimodal: the use of multiple methods of communication between the user and a device. These methods include keypad, touch or tap screen, handwriting recognition, and voice recognition.

This paper illustrates the basic structure and contents of an XHTML+Voice multimodal application, describing its fundamental building blocks. It is intended for those who are familiar with XHTML, VoiceXML, and HTML.

Each of the building blocks is described and coding samples are provided. A multimodal implementation of a hypothetical Pizza Order Form application is presented as an example.

The Structure of an XHTML+Voice Application

A basic XHTML+Voice multimodal application consists of four parts: a Namespace Declaration, a Visual Part, a Voice Part, and a Processing Part. Figure 1 illustrates these components and their relationship to each other.

Namespace Declaration

The Namespace Declaration for a typical XHTML+Voice application is written in XHTML, with additional declarations for VoiceXML and XML Events. Figure 2 is an example of the namespace declaration for an XHTML+Voice application.

Figure 2 -- Namespace declaration
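
As a minimal sketch of what such a declaration can look like, the root element below declares XHTML as the default namespace and binds prefixes for VoiceXML and XML Events. The namespace URIs are the standard W3C ones; the prefixes vxml and ev, the title, and the placeholder body text are assumptions chosen for illustration and are not taken from the original figure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Default namespace: XHTML. Prefixes vxml and ev are illustrative choices. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Pizza Order Form</title>
    <!-- Voice Part (VoiceXML forms) and event handlers would go here -->
  </head>
  <body>
    <!-- Visual Part (ordinary XHTML form markup) would go here -->
    <p>Pizza Order Form</p>
  </body>
</html>
```

With the prefixes bound on the root element, VoiceXML elements can be written as vxml:form, vxml:field, and so on, and XML Events attributes as ev:event and ev:handler, anywhere in the document.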

Visual Part

The Visual Part of an XHTML+Voice application is XHTML code that displays the various form elements on the device’s screen, if one is available. This can be ordinary XHTML code and may include check boxes and other common form items. For example, Figure 3 displays the pizza size choices and their appropriate radio buttons. Figure 4 illustrates a typical form using XHTML+Voice.

Size:

Small 12"

Medium 16"

Large 22"

Figure 3 -- Visual part of a multimodal application
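
The size choices shown in Figure 3 could be produced by ordinary XHTML markup along the following lines. This is only a sketch: the form id, the field name size, the input ids, and the value strings are hypothetical and are not taken from the original application.

```xml
<!-- Hypothetical Visual Part: one radio-button group for the pizza size -->
<form id="pizza_order" action="">
  <p>
    Size:
    <!-- Radio buttons sharing the same name form one mutually exclusive group -->
    <input type="radio" name="size" id="size_small" value="small"/> Small 12"
    <input type="radio" name="size" id="size_medium" value="medium"/> Medium 16"
    <input type="radio" name="size" id="size_large" value="large"/> Large 22"
  </p>
</form>
```

Because the three inputs share one name, a browser lets the user select only one size at a time, which matches the single-choice nature of the field.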

Voice Part

The Voice Part of an application is the section of code that prompts the user for the value of a field within a form. This VoiceXML code uses an external grammar to define the possible field choices. If there are many choices, or a combination of choices is required, the external grammar can be used to handle the valid combinations.

For example, a user can say the vegetable topping selections for a pizza in multiple ways. The VoiceXML code in Figure 5 is used with the vegtoppings.jsgf grammar file to prompt the user to select vegetable toppings for the pizza. To add more vegetable topping choices, modify the vegtoppings.jsgf file.
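
A voice form of this kind might look roughly like the sketch below. Only the grammar file name vegtoppings.jsgf comes from the text above; the form id, the field name, the prompt wording, and the grammar type value application/x-jsgf are assumptions made for illustration.

```xml
<!-- Hypothetical Voice Part, placed in the XHTML head. Assumes the vxml prefix
     is bound to http://www.w3.org/2001/vxml on the root html element. -->
<vxml:form id="voice_vegtoppings">
  <vxml:field name="vegtoppings">
    <!-- Spoken prompt played when the field becomes active -->
    <vxml:prompt>Which vegetable toppings would you like?</vxml:prompt>
    <!-- The external grammar defines the valid spoken choices and combinations -->
    <vxml:grammar src="vegtoppings.jsgf" type="application/x-jsgf"/>
  </vxml:field>
</vxml:form>
```

Keeping the choices in the external grammar file means the list of toppings can change without touching this markup. In XHTML+Voice, a voice form like this is typically activated from a visual element through XML Events attributes that refer to the form's id.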

Continued...


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).