VoiceXML Review - Feature - Twenty Multimodal Projects Using X+V on the Opera Browser

Volume 6, Issue 2 - Mar/Apr 2006

Twenty Multimodal Projects Using X+V on the Opera Browser

By James A. Larson

Student projects

Opera, with its new voice feature, enables developers to create and deploy multimodal applications. To experience the multimodal applications below, you will need the Opera version 8.0 browser for Windows with the voice plug-in installed.
Installation tips

Opera can support three types of user interfaces:

Dialog Type	URI	Comments
Verbal only	jim/paymentVerbal.xml	VoiceXML code embedded into HTML so Opera browser users can hear the verbal-only dialog
GUI only	jim/paymentGUI.xml	A traditional HTML application
Multimodal	jim/paymentMM.xml	VoiceXML code embedded into HTML code using X+V

During the last two weeks of the school term, students in the Oregon Health & Science University course CSE 564 and the Portland State University course CS 410/510 created several XHTML plus Voice (X+V) applications. Students had a working knowledge of XHTML and VoiceXML but no previous knowledge of X+V. I presented a one-hour overview of X+V syntax and two sample applications to students in both courses. Teams of two students were formed to design and implement a multimodal application within two weeks. The table below contains some of the student projects. If you install the Opera browser and the voice plug-in, you can experience these multimodal applications. By selecting "source" under the "view" pulldown, you can examine the source code to see how students implemented the multimodal user interface.

Project Number	Project Name	Author	URI	Comments
1	Kid's holiday craft projects	Ashley Irving	ashley/home.xml	Choose the project by saying its number. Page through the instructions by saying "next." Do you think that the content should be spoken in addition to being displayed on the screen?
2	Making a peanut butter sandwich	Emerson Murphy-Hill	emerson/page1.xml	Page through the instructions by saying "next," "back," or "read."
3	Buying a car	Glenn Diviney	glenn/index.xml
4	Origami	Khanh Duong	khanh/index.xml
5	Collecting health data	Sunil Lahudia	sunil/healthData.xml
6	Buy movie tickets	Oindrila Mukherjee	qindrila/movieSel_test.xml	Use the name "James Bond" on the second screen to reserve tickets.
7	National park tours	Medha Nirguide	medha/Main.xml
8	Restaurant menu	Quang Nguyen	quang/project3.xml
9	Banking	Rajeshwari Patil	raj/homepage.xml	Account = "10"; Passcode = "hello". Amounts must be in increments of 100.
10	Library	Frank Adrian and Ken Anderson	ken/main.xml
11	Quick finder	Driss Takir	driss/voice.xml
12	Order computer	Ashwini Kulkarni	ashwini/login.xml	Use "david" for the user name and "capital" for the password.
13	Personal travel pictures	Chris Holm	chris/index.vxml	Say "next" or "previous" to view the cities.
14	Animal shelter	David Graves and Dina Suehiro	dina/index.xml
15	Cyber reader	Tom Feliz	tom/index.xml	Hear how much better a recording is than TTS.
16	Picture album	Ricky Cancro	ricky/index.php	PHP to generate XML code for a picture album.
17	Flash cards (addition)		jim/flashCardAddition.xml
18	Tune your violin	Based on a SALT program developed by Deborah Dahl, chair of the W3C Multimodal Interaction Working Group	dahl/tune.xml	Both hands are busy as a violin player requests to hear tones so the violin player can tune the violin.
19	Music world	Doan Ng	doan/mainpage.xml	This application simulates a hardware device with a push-to-speak button. You must click the "push to speak" soft button as you press the (default) ScrLk button on your computer.
20	The game of "go"	Sean Pearson	pearson/capturego.cgi

I have categorized the student projects into the following categories.

Hands-busy instruction. This category of project provides incremental instruction while the user manipulates real-world artifacts with their hands. It includes applications in which the user diagnoses a problem (determining what is wrong with a car's motor), repairs a device (fixing a leaky faucet), or constructs an artifact (project 1: a talking holiday project book for children; project 2: a talking recipe book; and project 4: creating a complex origami artifact). Such applications could let users diagnose or repair products without calling a live-help agent. The application may be available by connecting to an application server, or it may be supplied on a CD-ROM packaged with the product. Project 17 (not necessarily a hands-busy instruction application) illustrates the multimodal equivalent of "flash cards," which some children use to learn their addition tables. Repetitive drills such as flash cards can be used not only for math skills but also for language training, spelling, and other rote memory training.
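
The flash-card drill in project 17 reduces to a simple loop: present two addends, listen for an answer, and compare. The following Python sketch is not taken from the student project; the function names `make_card` and `check_answer` are hypothetical, and it assumes the speech recognizer hands back the answer as a string of digits.

```python
import random


def make_card(rng: random.Random, max_addend: int = 9) -> tuple[int, int]:
    """Pick two addends for one addition flash card."""
    return rng.randint(0, max_addend), rng.randint(0, max_addend)


def check_answer(card: tuple[int, int], recognized: str) -> bool:
    """Return True if the recognized answer equals the sum on the card.

    Assumes the recognizer returns digits such as "12"; anything else
    (silence, noise, a misrecognition) counts as a wrong answer.
    """
    first, second = card
    try:
        return int(recognized.strip()) == first + second
    except ValueError:
        return False


if __name__ == "__main__":
    rng = random.Random()
    card = make_card(rng)
    print(f"What is {card[0]} plus {card[1]}?")
    # Simulate a correct spoken answer.
    print(check_answer(card, str(card[0] + card[1])))
```

The same structure would serve the spelling and language drills mentioned above, with the comparison step swapped for one suited to the material.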

Entertainment. This category includes audio poems, stories (project 3: a child's fairy tale book), music (an audio-controlled juke box), and games. Gaming enthusiasts can use voice as a "third hand" while manipulating controls with both real hands.

Data collection. These applications are primarily verbal forms. For each electronic form slot or menu, the user speaks an answer to a question that is presented verbally, visually, or both. As a fall-back, voice users may also enter information directly into a GUI-based form. Projects falling into this category include:

3: buy a car
5: collecting health data
6: buy movie tickets
8: order meals from a restaurant menu
9: banking
11: quick finder for locating local businesses
12: ordering a computer
14: reviewing homeless pets at an animal shelter

These applications can be very useful if your hands are busy, or if these applications are implemented on a handheld device. Usability testing is needed to determine if voice really benefits these applications when used on a desktop PC in an office environment.
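
The fall-back behavior described above, in which a slot is filled by voice when possible and through the GUI otherwise, can be sketched in a few lines of Python. This is an illustration only, not how X+V or the student projects implement it; `fill_slots`, the slot names, and the per-slot vocabularies are all hypothetical.

```python
def fill_slots(grammars, spoken, typed):
    """Fill each form slot from speech if recognized, else from typed input.

    grammars: slot name -> set of phrases the recognizer accepts (lowercase)
    spoken:   slot name -> what the user said
    typed:    slot name -> what the user entered in the GUI form
    Returns (filled, missing): the completed slots and those still empty.
    """
    filled, missing = {}, []
    for slot, accepted in grammars.items():
        utterance = spoken.get(slot, "").strip().lower()
        if utterance in accepted:
            filled[slot] = utterance          # voice input matched the grammar
        elif typed.get(slot):
            filled[slot] = typed[slot]        # fall back to the GUI field
        else:
            missing.append(slot)              # this slot still needs a prompt
    return filled, missing


# Hypothetical movie-ticket form with two slots.
grammars = {"movie": {"casablanca", "vertigo"}, "tickets": {"one", "two"}}
filled, missing = fill_slots(
    grammars,
    spoken={"movie": "Vertigo", "tickets": "a pair"},
    typed={"tickets": "2"},
)
print(filled)   # {'movie': 'vertigo', 'tickets': '2'}
print(missing)  # []
```

Here the spoken ticket count falls outside the vocabulary, so the typed value is used instead; a slot with neither input would be reported as missing and prompted for again.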

Photo album tour. Another popular class of applications is the photo album tour. Some projects consist of an ordered sequence of pictures that the user navigates by saying "next," "previous," or "home" (project 13: personal travel pictures). More complicated projects support a hierarchical structure in which users move up or down a hierarchy of pictures and navigate within sequential sets of pictures at the leaves of the hierarchy (project 7: national parks tour). Project 19 (music world) replaces the photos with audio clips, which enables users to create their own playlist of downloaded tunes. These are simple applications to construct. One student constructed a PHP program to generate a photo album tour (project 16: picture album). Almost anyone can use such a generation tool to create their own personal photo album tour. Just as enabling users to author text, spreadsheets, presentations, and e-mail made GUI-based office applications popular, enabling users to be authors may be the key to popularizing multimodal applications.
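
A generator like the one in project 16 essentially computes, for each picture, where each voice command should lead. The Python sketch below illustrates that idea for a flat, ordered album; it is not the student's PHP program, and `build_album` and the file names are hypothetical.

```python
def build_album(photos):
    """Return one page record per photo, with the target of each voice command."""
    last = len(photos) - 1
    pages = []
    for i, photo in enumerate(photos):
        pages.append({
            "photo": photo,
            "next": photos[min(i + 1, last)],      # the last page stays put
            "previous": photos[max(i - 1, 0)],     # the first page stays put
            "home": photos[0],
        })
    return pages


album = build_album(["lisbon.jpg", "madrid.jpg", "paris.jpg"])
print(album[1]["next"])      # paris.jpg
print(album[0]["previous"])  # lisbon.jpg
```

A real generator would go on to emit one page of markup per record. The hierarchical tours would need a tree of such sequences rather than a single list.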

Multimodal interface to a traditional Web page. Project 10 (library application) is an example of a GUI application that also supports verbal input. The user may switch between the traditional GUI and the multimodal user interface at any time. Because factors such as the user's typing speed, background noise, and privacy concerns all affect the choice of mode, extensive usability testing is needed to determine whether multimodal user interfaces such as the library application will become popular.

Novel applications. These applications are not possible with a traditional GUI. For example, project 18 illustrates an interesting "hands busy" application: the user asks for musical notes to be played while using both hands to tune a violin or other musical instrument. Project 19 illustrates how to speak the name of a tune to be played. Imagine speaking to your iPod to select tunes!

Students' experience with Opera's implementation of X+V

Students knowledgeable in VoiceXML were able to design and implement these applications for the Opera browser in two weeks. Most students were pleased to be able to create a multimodal application that anyone worldwide can access using the popular Opera browser with the voice plug-in. Many students asked their friends and family to interact with their applications and were encouraged by the positive comments.

While most students were able to complete their projects within two weeks, they experienced several types of problems:

Lack of documentation. While the Opera documentation was useful for getting started, it lacked a good description of the element. The IBM documentation described the syntax of the element, but did not provide a good example.
Browser problems. Several students reported that the Opera browser's speech recognition system stopped after several errors. Even when a document was refreshed, the Opera browser still did not recognize user speech. Students worked around the problem by closing the browser and restarting it.
Debug tools. Experienced student programmers complained about the lack of development and debug tools. Several times students received the message, "an error has occurred," with no further information about the error.
X+V language. Students complained about several limitations of X+V, including: (1) no support for the W3C Speech Synthesis Markup Language (SSML); (2) no support for the W3C Speech Recognition Grammar Specification (SRGS); and (3) several missing VoiceXML 2.0 elements including the element.

I spoke with an Opera representative who indicated that the Opera documentation will be improved, and W3C SSML and SRGS will be supported in the future. He also indicated that Opera is considering development tools.

IBM recently announced the Multimodal Tools Project for Eclipse [http://www.alphaworks.ibm.com/tech/mmtp], which avoids or overcomes the problems listed above. Many papers and articles are referenced at http://www-306.ibm.com/software/pervasive/multimodal/ and http://www-128.ibm.com/developerworks/. There is also an X+V Programmer's Reference Guide in the IBM Multimodal Tools Package. For debugging, the IBM Browser has an X+V debugger and a voice log window where programmers can trace their application. The IBM version also supports the W3C Speech Synthesis Markup Language (SSML) and both versions of the W3C Speech Recognition Grammar Specification (SRGS).

If you are inspired by these examples and have a novel multimodal application in mind, I invite you to build a demo using X+V. See getting started.

Conclusion

Clever students are able to create interesting and novel applications as long as some training in VoiceXML, HTML, XPath, JavaScript, and X+V is provided. Usability testing is needed to (1) refine and improve the user interfaces of the initial prototypes and (2) determine whether each application has both wide appeal and use.

I challenge students to use X+V to create other new and novel multimodal applications. I'll post them along with the applications shown above if they (1) are error free, (2) conform to general moral standards (no pornography, please), and (3) represent a new use of multimodal technology.
