Volume 5, Issue 1 - January / February 2005
 
   
 



First Words

Welcome to “First Words” – the VoiceXML Review’s column to teach you about VoiceXML and how you can use it. We hope you enjoy the lesson.

VoiceXML 2.1

In this lesson, we’re going to continue investigating VoiceXML 2.1.

You may recall that as VoiceXML platform vendors and application developers began to deploy VoiceXML applications widely, they identified potential future extensions to the language. The result of this experience is a collection of field-proven features that are candidates for addition to the language. These features are being proposed as part of VoiceXML 2.1.

Just as a reminder, VoiceXML 2.1 has been released as a Last Call Working Draft. Here is a pointer:

http://www.w3.org/TR/2004/WD-voicexml21-20040728/

Note: if you’re reading this article after VoiceXML 2.1 has been finalized and published, you should spend a few minutes tracking down the final specification rather than this link, as the specification may have undergone minor changes.

The new features proposed for VoiceXML 2.1 are based on feedback from application developers and VoiceXML platform developers. The features we’ve covered already include:

  • Referencing Grammars Dynamically – Generation of a grammar URI reference with an expression;
  • Referencing Scripts Dynamically – Generation of a script URI reference with an expression;
  • Recording user utterances while attempting recognition – Provides access to the actual caller utterance, for use in the user interface, or for submission to the application server;
  • Adding namelist to <disconnect> – The ability to pass information back to the VoiceXML platform environment (for example, if the application wishes to pass results to a CCXML session related to the call).

Here are the links to the previous articles in this series:

https://voicexmlreview.org/Sep2004/columns/sep2004_first_words.html
https://voicexmlreview.org/Nov2004/columns/nov2004_first_words.html

This issue, we’re going to look at:

  • Using <mark> to detect barge-in during prompt playback – Placement of ‘bookmarks’ within a prompt stream to identify where a barge-in has occurred.

 

The <mark> Tag

As the reader may know, VoiceXML is designed to work together with other standards developed within the W3C Voice Browser Working Group. In particular, the Speech Recognition Grammar Specification (SRGS, see http://www.w3.org/TR/2004/REC-speech-grammar-20040316/) and the Speech Synthesis Markup Language (SSML, see http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/) are designed to be used hand-in-hand with VoiceXML.

The SSML specification defines the ‘mark’ tag, which allows the placement of a bookmark or marker into an SSML fragment that is going to be rendered by the SSML processor. When the SSML processor encounters such a mark tag in the SSML, it is required to inform the VoiceXML ‘interpreter context’ that it has done so.

VoiceXML 2.1 allows <mark> to be easily used within a VoiceXML application. In particular, the following additions to VoiceXML 2.0 are specified:

‘nameexpr’ attribute on <mark> – SSML defines a ‘name’ attribute for <mark>, which allows the bookmark to be identified by the application. Each SSML fragment might therefore have multiple markers, each identified by a name. VoiceXML 2.1 extends this by adding the ‘nameexpr’ attribute, which allows the name to be specified as an ECMAScript expression. This provides more flexibility on the client side, and is consistent with the rest of the language (where most elements accept both static and expression versions of particular attributes). As is usual for such attribute pairings, if the application specifies both ‘name’ and ‘nameexpr’, an error.semantic event will be thrown.
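To make this concrete, here is a minimal sketch (not taken from the specification; the variable name ‘adId’ and the grammar file ‘choice.grxml’ are invented for illustration) showing a static ‘name’ alongside a ‘nameexpr’ whose value is computed at runtime:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
  <!-- 'adId' and choice.grxml are made up for this sketch -->
  <var name="adId" expr="'ad_' + 42"/>
  <form>
    <field name="choice">
      <prompt>
        <!-- static bookmark name -->
        <mark name="intro_done"/>
        Welcome back to the hockey line.
        <!-- bookmark name computed from an ECMAScript expression; evaluates to 'ad_42' -->
        <mark nameexpr="adId"/>
        Please say scores or schedules.
      </prompt>
      <grammar type="application/srgs+xml" src="choice.grxml"/>
    </field>
  </form>
</vxml>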

Application-level access to bookmark information – Although VoiceXML 2.0 allows the use of <mark>, it defines no mechanism for the application itself to access the information returned to the interpreter context. That is, when a <mark> is processed, the relevant information is not available to the VoiceXML application. VoiceXML 2.1 specifies two properties on the application.lastresult$ object: markname and marktime. These hold the name of the last <mark> processed in the SSML fragment, and the number of milliseconds that elapsed since it was processed. Note that processing of the SSML ends when the fragment has been completed, or when a barge-in event occurs. This allows the application to determine where the barge-in occurred, using the time, the bookmark name, or both (as shown below).

In addition to the properties on the application.lastresult$ object, if a successful recognition occurs as part of form filling, the markname and marktime shadow variables for the form item will also be set to the same values as those in application.lastresult$.
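As a quick illustration (a sketch only, not taken from the specification; the field name ‘city’ and the grammar file ‘cities.grxml’ are invented), the same information can be read either from the form item’s shadow variable or from application.lastresult$ once recognition completes:

<!-- 'city' and cities.grxml are made up for this sketch -->
<field name="city">
  <prompt>
    <mark name="menu_start"/>
    Say the name of a city.
  </prompt>
  <grammar type="application/srgs+xml" src="cities.grxml"/>
  <filled>
    <!-- shadow variable on the form item -->
    <log>Last mark (shadow variable): <value expr="city$.markname"/></log>
    <!-- the same information on application.lastresult$ -->
    <log>Last mark: <value expr="application.lastresult$.markname"/>,
         with <value expr="application.lastresult$.marktime"/> ms of audio played after it</log>
  </filled>
</field>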

Here is an example from the VoiceXML 2.1 Last Call Working Draft.

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
  <var name="played_ad" expr="false"/>
  <form>
    <field name="team">
      <prompt>
        <mark name="ad_start"/>
        Hockey scores brought to you by Elephant Peanuts.
        There's nothing like the taste of fresh roasted peanuts.
        Elephant Peanuts. Ask for them by name.
        <mark name="ad_end"/>
        <break time="500ms"/>
        Say the name of a team. For example, say Toronto Maple Leafs.
      </prompt>
      <grammar type="application/srgs+xml" src="teams.grxml"/>
      <filled>
        <prompt>
          Sorry, there is no hockey this year. Boo hoo.
        </prompt>
        <if cond="typeof(team$.markname) == 'string'
              &amp;&amp; (team$.markname == 'ad_end'
                || (team$.markname == 'ad_start' &amp;&amp; team$.marktime &gt;= 5000))">
          <assign name="played_ad" expr="true"/>
        <else/>
          <assign name="played_ad" expr="false"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>

Now, beyond the fact that this has been shamelessly converted to a hockey-based example, and that the advertisement is perhaps interesting only to elephants, it demonstrates an interesting use-case for the <mark> tag. We want to be sure that the listener has in fact heard the advertisement, and that we can bill the sponsor for having played their ad to another caller. To do this, we check that we have either completed the entire ad (the ‘ad_end’ mark has been processed), or that we have started and played at least five seconds of audio. This length of time will be the time since the last mark (ad_start) was encountered. If either of these conditions is true, then the ECMAScript snippet in the <filled> block will assume the ad has been played to the listener. Ka-ching.

Another use-case might be allowing the application to restart prompt or SSML playback partway through a long fragment, by keeping track of how far into the original playback the interruption occurred; a rough sketch of this idea follows.
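Here is one way this might be approached (a sketch under assumptions, not from the specification: the field, variable, and grammar names are invented, and resumption is only possible at bookmark granularity rather than at the exact word where the caller barged in). The idea is to remember the last bookmark heard, and then use a conditional prompt to skip material the caller has already finished:

<!-- 'story', 'resume_from' and story.grxml are made up for this sketch -->
<form>
  <!-- remembers the last bookmark the caller heard before interrupting -->
  <var name="resume_from" expr="''"/>
  <field name="story">
    <!-- skip the first chapter once the caller has heard past its closing mark -->
    <prompt cond="resume_from != 'chapter1_end'">
      <mark name="chapter1_start"/>
      Once upon a time there was a very, very long announcement.
      <mark name="chapter1_end"/>
    </prompt>
    <prompt>
      Say continue or stop.
    </prompt>
    <grammar type="application/srgs+xml" src="story.grxml"/>
    <filled>
      <if cond="story == 'continue'">
        <!-- only update if a bookmark was actually processed during this playback -->
        <if cond="typeof(story$.markname) == 'string'">
          <assign name="resume_from" expr="story$.markname"/>
        </if>
        <clear namelist="story"/>
      </if>
    </filled>
  </field>
</form>

On the next visit to the field, the first prompt is skipped only if the caller had already heard the ‘chapter1_end’ bookmark; otherwise the chapter is replayed from its start, which is the granularity that <mark> alone can offer.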

Summary

Here is the direct link to the ‘mark’ tag feature:

http://www.w3.org/TR/2004/WD-voicexml21-20040728/#sec-mark

In future issues, we’re going to look at the following features:

  • Using <data> to fetch XML without requiring a dialog transition – Retrieval of XML data, and construction of a related DOM object, without requiring a transition to another VoiceXML page;
  • Concatenating prompts dynamically using <foreach> – Building of prompt sequences dynamically using ECMAScript;
  • Adding type to <transfer> – Support for additional transfer flexibility (in particular, a supervised transfer), among other capabilities.

These are features that will likely get a full article each, as they are powerful, and can provide the VoiceXML developer with new ways to build applications.

VoiceXML 2.1 proposes some useful additions to VoiceXML 2.0, based on real-world deployment experience. We’re going to continue drilling down into these features in forthcoming issues. As always, if you have questions or topics for VoiceXML 2.0 or 2.1, drop us a line!



Copyright © 2001-2005 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).