The World Wide Web Consortium's (W3C) Activities in Multimodal Interaction
By Deborah Dahl, Unisys Corporation, Chair,
W3C Multimodal Interaction Working Group
[NOTE: This article is published with the express permission of Unisys Corporation. Unisys
Corporation retains ownership of the copyright of this article.]
Currently, interaction with the web takes place primarily through standard web pages on desktop browsers. Newer forms of interaction, such as voice interaction and mobile handset applications, are also becoming more widespread. Multimodal interaction adds new dimensions to the experience of interacting with web applications. It goes beyond GUI-only or voice-only input by allowing users to interact with applications in multiple ways, combining several modes. Eventually, input modes could include speech, keyboard, pointing devices, and handwriting, as well as other modes that might become popular in the future, such as gestures.
Perhaps the most obvious reason to use multimodal technology is to enrich the web experience. Another important feature of multimodal interaction is that it can play to the respective strengths and weaknesses of different input and output modalities, making the user's experience much more natural and efficient. For example, in a travel planning application, it's very natural to speak the name of a destination rather than scrolling through a long list to select one. On the output side, however, it's much more efficient to view a list of flights on a display than to listen to the items in the list being read out one by one.
Another significant advantage to multimodal applications is that they can adapt to different situations and different users by taking advantage of the characteristics of the different modes. For example, voice interaction with an application might be inappropriate in a crowded meeting, while GUI interaction would be unsafe for someone who's driving. Ideally, the same application could support both voice and GUI interaction, depending on the user's environment. The advantages and disadvantages of different modalities also vary depending on the device that's being used. For example, voice is a very natural input modality on small devices such as cell phones with awkward keypads.
Finally, multimodality can play an important role in making the web accessible to users with disabilities. Users who are unable to use a standard input modality will be able to switch to an alternate modality that suits them.
What is the W3C's role in multimodal interaction?
The W3C has recently started a Working Group to define multimodal specifications for the web. Although some earlier work on multimodal requirements was done in the W3C's Voice Browser Working Group, the Multimodal Interaction Working Group itself is very new, having been chartered in February of 2002. Its charter is effective for two years, until February of 2004. The group's work is done during weekly teleconferences as well as periodic face-to-face meetings. The first face-to-face meeting was held on February 28 and March 1 during the 2002 W3C Technical Plenary meeting in Cannes, France; that meeting defined the immediate directions for the group's activities. A second face-to-face meeting took place in Boston on June 20-21.
Currently the group's primary activities include compiling use cases and requirements for a multimodal specification. There are also several individual teams working on specific exploratory efforts in the areas of events, architectures, natural language, and ink. Events, for example, are particularly important to multimodal applications because of the need to synchronize and coordinate inputs from different modalities. As the group completes the compilation of use cases and requirements, it will begin working on the specifications that define standards for multimodal markup.
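To give a flavor of the declarative event wiring involved, the W3C's XML Events specification (one of the relevant standards listed later in this article) lets a host document attach a handler to an event without scripting glue. The following is a minimal sketch only: the element ids and the handler reference are hypothetical, and real multimodal markup would need considerably more machinery to coordinate events across modalities.

    <!-- Attach a handler to a button's activation event using XML Events.
         When the user activates "speakButton", the handler at #startListening
         runs; a multimodal runtime could use such an event to start speech
         input in response to a GUI action. Ids and handler are hypothetical. -->
    <listener xmlns="http://www.w3.org/2001/xml-events"
              event="DOMActivate"
              observer="speakButton"
              handler="#startListening"/>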
Ideally, these standards will both accommodate current technologies and be extensible to future input modalities as they become available. In addition to a multimodal specification, the group is also chartered to define an ancillary specification for representing user input in a normalized form, so that inputs from different modalities will have compatible representations. This work builds on the Voice Browser group's earlier work on the Natural Language Semantics Markup Language (NLSML). For a description of the NLSML specification, please refer to my earlier VoiceXML Review article.
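To suggest what such a normalized representation looks like, here is a simplified sketch along the lines of the NLSML working draft. The overall structure (result, interpretation, input, instance) follows that draft, but the namespace is omitted, and the application-specific instance data and the confidence value shown here are hypothetical.

    <!-- A single spoken utterance, normalized into application data.
         <input> records what the user said; <instance> carries the
         extracted semantics in application-defined XML, so a GUI or
         handwriting front end could produce the same representation. -->
    <result>
      <interpretation confidence="85">
        <input mode="speech">I want to fly to Boston</input>
        <instance>
          <airline>
            <destination>Boston</destination>
          </airline>
        </instance>
      </interpretation>
    </result>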
What the Multimodal Interaction Working Group isn't doing.
While the Multimodal Interaction Working Group and the W3C will play an important role in standardization efforts, a great deal of very important work in multimodal application development falls outside of the standardization process. For example, although we understand a lot about what makes a GUI usable, and we're starting to acquire the same kind of knowledge about voice interfaces, there is still a tremendous amount to be learned about how to design easy-to-use multimodal interfaces. Such results will begin to emerge from the multimodal research community, independent developers, usability researchers, and commercial deployments as applications are developed and used in real situations. Another important area of work that's outside the scope of standardization is the development of tools for application development.
What other activities are relevant to Multimodal Interaction?
Because multimodal interaction includes so many other components, the number of other activities and standards, both inside and outside of the W3C, that are potentially relevant is quite large. It's important for the multimodal work to leverage these other standards, both to fit into the web and wireless environments and to avoid reinventing the wheel.
Here are a few examples of standards and activities that the multimodal group needs to be familiar with.
- W3C standards relevant to web documents in general such as XHTML, XForms, and XML Events are clearly important.
- The W3C Speech Interface Framework is producing several important speech-related standards, in particular VoiceXML, the Speech Synthesis Markup Language (SSML), and the Speech Recognition Grammar Specification (SRGS).
- Closely related to multimodal input is multimedia output such as audio, video, and animations. Fortunately, there is a very comprehensive W3C Recommendation, SMIL 2.0, which defines a standard for coordinated multimedia output (a short SMIL sketch follows this list).
- Because multimodal applications are of tremendous interest in wireless and telephony environments, it's also important for the Working Group to be informed about telephony standards. Organizations such as the European Telecommunications Standards Institute (ETSI) and 3GPP do work that is complementary to the multimodal group's. A newly announced initiative, the Open Mobile Alliance (OMA), has the goal of creating an interoperable global market for future mobile services based on open standards.
- Because wireless devices are much less capable than desktop systems, issues such as the distribution of processing across client and server devices become important. For example, standards for distributed speech recognition, such as the Aurora standard being developed by ETSI, are extremely relevant.
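As a concrete illustration of the coordinated multimedia output that SMIL 2.0 standardizes, here is a minimal sketch. The par element plays its children in parallel, so the video and its audio narration start together; the media file names are hypothetical.

    <!-- Play a video and an audio narration in parallel;
         <par> synchronizes its children to start together. -->
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <body>
        <par>
          <video src="flight-list.mpg"/>
          <audio src="narration.wav"/>
        </par>
      </body>
    </smil>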
Collaboration among companies on developing new approaches prior to standardization is also important. For example, the SALT Forum has developed a specification for tags that can be embedded in HTML or XHTML documents to support the development of GUI applications that include speech interaction. Similarly, another industry group has developed a multimodal specification based on integrating XHTML and VoiceXML, which has been submitted to and acknowledged by the W3C. These specifications are clearly of great interest to the multimodal interaction group.
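To give a sense of the SALT approach, here is a heavily simplified sketch of speech tags embedded in an XHTML page. The prompt, listen, grammar, and bind elements follow the SALT Forum's specification as I understand it, but the namespace URI, ids, grammar file, and binding details shown here are illustrative assumptions rather than a definitive example.

    <!-- An XHTML text field that can be filled by speech: the prompt asks
         a question, listen activates recognition against a grammar, and
         bind copies the recognized city into the GUI field.
         Namespace, ids, and file names are assumptions for illustration. -->
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body>
        <input type="text" id="city"/>
        <salt:prompt id="askCity">Which city are you flying to?</salt:prompt>
        <salt:listen id="recoCity">
          <salt:grammar src="cities.grxml"/>
          <salt:bind targetelement="city" value="//city"/>
        </salt:listen>
      </body>
    </html>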
How can I find out more?
The W3C Multimodal Interaction Working Group maintains a public page describing its activities as part of the main W3C web site. You can find links to the group's charter, the public email archive, and many other related documents there. Employees of W3C member organizations can also access the group's internal web pages and email archive. The W3C Multimodal Interaction Working Group clearly has some exciting challenges ahead of it. The final result will move us closer to the vision of transparent access to the web by anyone, anytime, anywhere.
Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).