Volume 3, Issue 5 - September/October 2003

Developing X+V Applications Using the Multimodal Tools

By:

Executive Summary

As computing devices become smaller and more pervasive, customers expect access to their data anytime, anywhere. With advances in the function, speed, and size of Personal Digital Assistants (PDAs) and cellular phones, coupled with an increasingly diverse set of users, the demands on application developers for flexible user interfaces have multiplied. Traditional visual interfaces, such as those provided by HTML pages, are no longer adequate to meet the public’s rising expectations for convenience, performance, and usability. End-users of these embedded devices are no longer satisfied with low-resolution versions of desktop-based Web applications and cumbersome methods of data entry. Consumers expect multiple methods, or modes, for interacting with a device. They want the ability to use the interaction method that most naturally fits the situation – to make the interface work for them, instead of being forced to work with an interface.

Traditionally, to create these “multimodal” applications, developers would have to master the development of both visual and voice software, resulting in a daunting learning curve. Many of these applications required extensive porting work to adapt to new platforms, and there was no way to leverage existing Web applications.

Drawing on its 40-year commitment to voice technology, and on emergent software and hardware that make applications faster and more powerful, IBM has created a practical solution for application developers seeking to integrate voice and visual technologies: the Multimodal Toolkit and Multimodal Browser.

This document provides an overview of the XHTML+Voice (X+V) language, an introduction to the development toolkit, and a general description of how to use the features in the toolkit to develop a multimodal application. For specific details on creating and implementing a multimodal application, refer to the documents that accompany the Multimodal Tools.

The Multimodal Tools

The Multimodal Tools release builds on the WebSphere® Studio framework to add the functionality you need to create, test, and run multimodal applications.

The Multimodal Toolkit V4.1 for WebSphere Studio, which adds extensions to a WebSphere Studio development product to provide multimodal functionality, introduces a user interface that reduces both the skill and the time needed to develop such applications for PDAs and other handheld, wireless devices. IBM has used similar technology for years to facilitate the rapid development of server-based voice applications.

The Multimodal Toolkit provides an integrated development environment that lets you integrate visual and voice applications efficiently without requiring expertise in all the development languages. The toolkit provides multiple tools, editors, and views that are operated using standard menus, icons, toolbars, and basic XHTML and VoiceXML programming skills.

The toolkit’s Reusable Dialog Components let you add common form components, such as mailing address, credit card, and social security number fields, with only a few button clicks, and each field provides the user with multiple methods of data entry.

The WebSphere Everyplace® Multimodal Browser V1.0, developed in a strategic relationship with Opera Software, provides a Web browser in which you can test voice-enabled Web applications. The browser is enhanced with extensions that include IBM's automatic speech recognition and text-to-speech technology, allowing you to view and interact with multimodal applications that you have built using XHTML+Voice. When you install the Multimodal Browser, the icon for the Opera Browser appears on your desktop, and you can use it to open the browser and run your multimodal applications.

The Voice Server SDK V3.1.1 contains the programs that are needed to play and compose pronunciations in the Multimodal Toolkit.

Multimodal applications consist of visual (XHTML) and voice (VoiceXML) components.

What is XHTML?

The eXtensible HyperText Markup Language (XHTML) is an XML-based markup language for creating visual applications that users can access from their desktops or wireless devices. XHTML is the next generation of HTML: a reformulation of HTML 4.01 in XML.
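
For illustration, a minimal XHTML 1.0 page looks like the following. The title and body text are placeholders, not part of any toolkit sample:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Departure City</title>
    </head>
    <body>
      <p>Please enter your departure city.</p>
    </body>
  </html>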

If you have existing applications with HTML pages, you will need to make some simple structural changes so that those pages comply with XHTML conventions. XHTML has replaced HTML as the markup language endorsed by the World Wide Web Consortium® (W3C), so future-proofing your Web pages by using XHTML will not only help you with multimodal applications, but will also help ensure that users with all types of devices can access your pages correctly.
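
The changes are mostly mechanical. As a rough, hypothetical example, an HTML fragment such as

  <INPUT TYPE=text NAME=city><BR>

becomes the following in XHTML, with lowercase element names, quoted attribute values, and every element explicitly closed:

  <input type="text" name="city" /><br />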

For more information, refer to the XHTML 1.0 specification on the W3C Web site (see the References section at the end of this paper).

What is VoiceXML?

The Voice eXtensible Markup Language (VoiceXML) is an XML-based markup language for creating distributed voice applications, just as HTML is a language for distributed visual applications. VoiceXML was defined and promoted by an industry forum, the VoiceXML Forum™, founded by AT&T®, Lucent®, Motorola®, and IBM, and supported by approximately 500 member companies. Updates to VoiceXML are now produced by the W3C Voice Browser working group. The language is designed to create audio dialogs that feature text-to-speech, pre-recorded audio, recognition of both spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its goal is to provide voice access and interactive voice response (such as by telephone, PDA, or desktop) to Web-based content and applications.
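
As a sketch of what such a dialog looks like, the following VoiceXML 2.0 document prompts the caller for a departure city and confirms the answer. The form and field names, the prompts, and the short inline grammar are illustrative only:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="departure">
      <field name="city">
        <prompt>Which city are you leaving from?</prompt>
        <!-- Small inline SRGS grammar listing the cities this field accepts -->
        <grammar version="1.0" root="cities"
                 xmlns="http://www.w3.org/2001/06/grammar">
          <rule id="cities" scope="public">
            <one-of>
              <item>Boston</item>
              <item>Chicago</item>
              <item>Denver</item>
            </one-of>
          </rule>
        </grammar>
        <filled>
          <prompt>You said <value expr="city"/>.</prompt>
        </filled>
      </field>
    </form>
  </vxml>

A voice browser walks the form, speaks the prompt, listens for an utterance that matches the grammar, and then executes the filled block.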

Users can interact with these Web-based voice applications by speaking or by pressing telephone keys rather than solely through a graphical user interface.

For more information, refer to the VoiceXML 2.0 specification on the W3C Web site (see the References section at the end of this paper).

What is XHTML+Voice?


XHTML+Voice, or X+V for short, is a markup language for multimodal Web pages. With X+V, Web developers can create Web pages that let end-users select voice input and output as well as traditional visual (GUI) interaction. X+V does this by providing a simple way to add voice markup to XHTML. Hence the name "XHTML plus Voice."

X+V fits into the Web environment by taking a normal visual Web user interface and speech-enabling each part of it. That is, if you take a visual interface and break it up into its basic parts (such as an input field for a time of day, a check box for AM or PM, and so on), you can then enable the use of voice simply by adding voice markup to the visual markup. X+V consists of visual markup, a collection of snippets of voice markup for each element in the user interface, and a specification of which snippets to activate when. For visual markup, X+V uses the familiar XHTML standard. For voice markup, it uses a simplified subset of VoiceXML. For associating the snippets of VoiceXML with user-interface elements, X+V uses the XML Events standard. All of these are official Web standards defined by the World Wide Web Consortium (W3C).
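
To make this concrete, here is a rough sketch of an X+V page, not taken from the toolkit’s samples, in which focusing the city input field activates a VoiceXML form that speaks a prompt and copies the recognized value into the field. The element names and the grammar file cities.grxml are hypothetical:

  <?xml version="1.0" encoding="UTF-8"?>
  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:vxml="http://www.w3.org/2001/vxml"
        xmlns:ev="http://www.w3.org/2001/xml-events">
    <head>
      <title>Departure City</title>
      <!-- Voice snippet: a one-field VoiceXML dialog for the city -->
      <vxml:form id="voice_city">
        <vxml:field name="city">
          <vxml:prompt>Which city are you leaving from?</vxml:prompt>
          <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
          <vxml:filled>
            <!-- Copy the recognized value into the visual input field -->
            <vxml:assign name="document.getElementById('in_city').value"
                         expr="city"/>
          </vxml:filled>
        </vxml:field>
      </vxml:form>
    </head>
    <body>
      <p>Departure city:
        <!-- XML Events: focusing the field activates the voice dialog -->
        <input type="text" id="in_city"
               ev:event="focus" ev:handler="#voice_city"/>
      </p>
    </body>
  </html>

The division of labor is exactly the one described above: XHTML carries the visual markup, the vxml:form in the document head carries the snippet of voice markup, and the ev: attributes from XML Events specify when that snippet is activated.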

Motorola, Opera Software ASA, and IBM submitted the X+V specification to the W3C, which passed it to its multimodal working group in January 2002. For the location of the XHTML+Voice Profile 1.0 specification, see the References section at the end of this paper.

Note: The specific details of creating and implementing a multimodal application are beyond the scope of this white paper. See the companion article, also published in this issue of the VoiceXML Review, for a discussion of how to write XHTML+Voice markup.


Continued...


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).