Volume 1, Issue 4 - April 2001
   
 

Answers to Your Questions About VoiceXML

By Jeff Kunins

In this monthly column, an industry expert will answer common questions about VoiceXML and related technologies. Readers are encouraged to submit questions about VoiceXML, including development, voice-user interface design, and speech technology in general, or how VoiceXML is being used commercially in the marketplace. If you have a question about VoiceXML, e-mail it to and be sure to read future issues of VoiceXML Review for the answer.

This month we received a few more great questions from the readership. It's great to see momentum begin to develop here, and I look forward to the point where too many questions are coming in each month to publish answers to every one.

Q: I'm working with VoiceXML and want to know how I can submit recorded sound. In which format is the recorded sound stored, and how is it matched with our written grammar?

A: VoiceXML supports recording audio from the caller, such as a personal voicemail message. Once the audio has been recorded, it can be played back, posted to your Web server for offline processing and permanent storage, or both. The <record> element initiates a recording and stores the result in a variable, which the <value> and <submit> elements can then use to play the audio back or post it.

For example (from the VoiceXML 1.0 spec):

<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <record name="greeting" beep="true" maxtime="10s" dtmfterm="true" type="audio/wav">
      <prompt>At the tone, please say your greeting.</prompt>
      <noinput>I didn't hear anything, please try again.</noinput>
    </record>
    <field name="confirm" type="boolean">
      <prompt>Your greeting is <value expr="greeting"/></prompt>
      <prompt>To keep it, say yes. To discard it, say no.</prompt>
      <filled>
        <if cond="confirm">
          <submit next="save_greeting.pl" method="post" namelist="greeting"/>
        </if>
        <clear/>
      </filled>
    </field>
  </form>
</vxml>


In this example, the caller is prompted to record a simple voicemail greeting. The recording is played back to the caller for confirmation and, if the caller approves, is posted back to the Web server (presumably for storage as the caller's official greeting for this voicemail system).

Here are a few additional points about recorded audio and VoiceXML:

  • Audio formats. VoiceXML 1.0 does not mandate any particular file format for recorded audio. The "type" attribute lets the application developer request a preferred MIME type. If it is not specified, the recording "defaults to a platform-specific format". VoiceXML 1.0 does not specify how a VoiceXML platform should advertise which formats it supports, or how it should behave if a developer requests an unsupported format. The documentation for the VoiceXML platform you're using should provide this information; however, most of today's commercially available VoiceXML platforms support standard 8-bit, 8 kHz RIFF-encoded Windows .WAV files.
  • Grammars active while recording. VoiceXML 1.0 specifies the "modal" attribute for the <record> element. If "modal" is set to 'true' (the default), then no grammars are active while recording. If, however, "modal" is set to 'false', then all appropriately scoped grammars will be active while recording. VoiceXML 1.0 does not explicitly specify what should happen to the audio captured so far if a grammar is matched during recording; implementations may choose to discard that audio and jump to the appropriate <filled> handler for the matched grammar. That said, most if not all commercially available VoiceXML platforms today do not support simultaneous recording and recognition, and therefore do not support modal="false" for <record>. The VoiceXML 1.0 specification explicitly calls out this point.
  • HTTP POST and audio data. When using the <submit> element to POST audio data back to your Web server, VoiceXML 1.0 does not explicitly specify how the VoiceXML platform should send the data. Two methods in commercial use today are HTTP multipart MIME form-data and HTTP URL-encoded data. For binary audio, a multipart request is typically up to about three times smaller (and correspondingly faster to transmit) than a URL-encoded one, because URL encoding expands every byte outside a small safe character set into a three-character %XX escape. Your Web server must, however, be properly configured to accept multipart form data. See the documentation of your VoiceXML platform of choice for details on how audio data is posted.
  • Secure POST of audio data. To POST audio data securely over the Internet using SSL (Secure Sockets Layer), both your Web server and your VoiceXML platform of choice must be configured to support SSL. Not all VoiceXML platforms support SSL for HTTP POST. This is typically a concern only if your VoiceXML platform runs remotely, on a different network from your application Web servers.
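
The size difference between the two POST encodings can be checked with a short script. The following is a minimal sketch in Python rather than a description of any particular VoiceXML platform: it builds a 10-second, 8-bit, 8 kHz mono WAV test tone in memory and compares the body size of a URL-encoded request with that of a multipart form-data request. The field name "greeting" comes from the example above; the boundary string, the file name, and the helper function names are hypothetical, and `quote_from_bytes` only approximates the form-urlencoding a real platform would apply.

```python
import io
import math
import wave
from urllib.parse import quote_from_bytes


def make_wav(seconds: int = 10, rate: int = 8000) -> bytes:
    """Build an 8-bit, 8 kHz mono WAV holding a 440 Hz test tone."""
    samples = bytes(
        int(128 + 100 * math.sin(2 * math.pi * 440 * n / rate))
        for n in range(seconds * rate)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)  # 1 byte per sample = 8-bit unsigned PCM
        w.setframerate(rate)
        w.writeframes(samples)
    return buf.getvalue()


def urlencoded_size(field: str, payload: bytes) -> int:
    """Body size of 'field=<percent-encoded payload>'."""
    return len(field) + 1 + len(quote_from_bytes(payload, safe=""))


def multipart_size(field: str, payload: bytes,
                   boundary: str = "example-boundary") -> int:
    """Body size of a single-part multipart/form-data request."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{field}.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return len(head) + len(payload) + len(tail)


audio = make_wav()
url_bytes = urlencoded_size("greeting", audio)
mp_bytes = multipart_size("greeting", audio)
print(f"raw audio:   {len(audio)} bytes")
print(f"URL-encoded: {url_bytes} bytes")
print(f"multipart:   {mp_bytes} bytes")
print(f"ratio:       {url_bytes / mp_bytes:.2f}")
```

The exact ratio depends on the audio content: bytes that happen to fall on letters, digits, and a few punctuation characters are sent unescaped, so real recordings usually land somewhat below the three-times worst case, while silence (a run of identical 0x80 bytes in 8-bit PCM) hits it almost exactly.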

Q: Is VoiceXML only usable for telephony applications, or can it also be used for PC (client) applications?

A: VoiceXML 1.0 is explicitly designed for developing voice-enabled telephony applications. That is why it includes some elements for call control (e.g., <transfer>), as well as the basics of voice recognition and audio playback. However, nothing would prevent the development of a VoiceXML interpreter/platform focused on PC applications. For instance, many companies have begun using Web pages and HTML as a way to develop self-contained "client only" applications that don't explicitly require external Internet access. A VoiceXML platform tailored for PC applications would likely implement a subset of the full VoiceXML 1.0 specification.

 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).