VoiceXML 2.0 from the Inside

By Dr. Scott McGlashan

1. Introduction

With the publication in October 2001 of VoiceXML 2.0 as a W3C Working Draft, VoiceXML is finally on its way to becoming a W3C standard. VoiceXML 2.0 [VOICEXML-2.0] is based on VoiceXML 1.0 [VOICEXML-1.0], which was submitted to the W3C Voice Browser Working Group by the VoiceXML Forum in May 2000. In this article, we examine some of the key changes in the first public working draft of VoiceXML 2.0 as compared to the VoiceXML 1.0 specification.

We begin by explaining the background and the scope of VoiceXML 2.0 before looking at some of the changes in detail. A full list of changes can be found in Appendix J - Changes from VoiceXML 1.0 (hyperlinks to sections and appendices refer to the October 2001 VoiceXML 2.0 Working Draft). Application developers familiar with VoiceXML 1.0 need to be especially aware of these changes, since VoiceXML 1.0 documents may require some modifications to work on VoiceXML 2.0 platforms. We finish up with some comments on where we believe the language will change as it moves through the W3C standardization process, and indicate some of the requested changes deferred to a future version of VoiceXML.

2. Background

Since its founding in March 1999, the Voice Browser Working Group has had the mission of developing a suite of standards related to speech and dialog. These standards form the W3C Speech Interface Framework [SIF] and cover markup languages for speech synthesis, speech recognition, natural language and dialog, amongst others. Since the VoiceXML Forum had made clear its intention to develop VoiceXML 1.0 and submit it to the Voice Browser Working Group, the dialog team focused its efforts on specifying requirements for a W3C dialog markup language [DIALOG-REQS] and providing detailed technical feedback to the Forum as VoiceXML 1.0 evolved.

With the submission of VoiceXML 1.0, the dialog team began working in earnest on developing VoiceXML into a dialog markup language for the Speech Interface Framework. A change request process was established to manage requests for changes to VoiceXML from members of the Working Group and other interested parties; changes could range from editorial fixes and clarifications, through functional enhancements, all the way up to a complete redesign of the language. Rather than try to incorporate every possible change into VoiceXML 2.0, we decided to limit the scope of changes.

3. Scope

In determining the scope of VoiceXML 2.0, the dialog team was guided by the general principle that VoiceXML 1.0 offered approximately the right level of functionality for a dialog markup language but required improvements in the following areas:

Interoperability
One of the goals of a standard such as VoiceXML is to provide document portability: the application developer should be able to write documents which run on all VoiceXML platforms, and which demonstrate approximately the same behavior independent of the platform (performance differences between speech recognition and synthesis engines being acceptable). Given that such behavior must be testable, as for example in the VoiceXML Forum's Conformance Program, the standard must mandate a minimum level of support. For example, while VoiceXML 1.0 offered openness in terms of supported speech grammar and audio formats, it did not require that any particular format be supported, thus severely limiting document portability.
Functional Completeness
In order to build useful applications, the standard must describe key aspects of the cycle of generating system output, interpreting user input and transitioning from one dialog to another. In VoiceXML 1.0 there were gaps; for example, it was not specified when prompts were actually played to the user. Furthermore, although we intended to minimize the addition of new functionality, there were a number of cases where functional enhancements were desirable to build compelling applications.
Clarity
The standard must provide a clear description and interpretation of all elements (and their attributes) in the language, how they interact with one another, and the expected platform behavior. VoiceXML 1.0 contained a number of omissions and contradictions which needed to be dealt with so that platform developers and application developers alike could clearly understand how the language worked, and what behavior could be expected.

Consequently, while it was not our intention to re-design VoiceXML 1.0 for a W3C standard, there were a number of changes which we needed to pursue in order to deliver these improvements. Many key improvements are described in the following sections, together with the motivation for each change and illustrations of typical use cases.

4. Interoperability Changes

The key changes for interoperability concern a number of formats and protocols which must be supported by all VoiceXML 2.0 platforms:

  1. the XML format of the W3C speech grammar specification [SRGS], for both speech and DTMF grammars
  2. the Speech Synthesis Markup Language [SSML] for speech output
  3. the HTTP protocol [RFC2616] for fetching resources
  4. basic audio formats for playback and recording

Developers using these formats are guaranteed that their applications will run on any VoiceXML platform which conforms to the VoiceXML 2.0 specification.

4.1 Input Changes

VoiceXML 1.0 did not require that any particular speech grammar format be supported by platforms. The JSpeech Grammar Format [JSGF] was used extensively in examples, and this gave the (misleading) impression that it was required - it was not.

In general, VoiceXML 2.0 delegates part of its behavior to other specifications in the W3C Speech Interface Framework (and in turn becomes dependent on these specifications). In VoiceXML 2.0 the XML format of the grammar specification [SRGS] is required (Section 3.1). All conforming platforms must support the XML format of [SRGS] for speech and DTMF grammars. Other formats, such as the JSpeech Grammar Format as well as proprietary ones, may be supported. Although [SRGS] also includes an ABNF format (a compact format familiar to traditional speech grammar developers), the decision to standardize on the XML format was motivated by the general tendency towards XML-based languages in W3C standards, the availability of XML tools and knowledge, and the fact that the XML format provides the same functionality as the ABNF format.

In the following example, the <grammar> element references an external speech grammar "number.grxml":

<form>
  <field name="travellers">
      <prompt>How many are traveling?</prompt>
      <grammar mode="voice" src="http://www.example.com/number.grxml" 
                  type="application/grammar+xml"/>
  </field>
</form>

Note that the recommended file suffix for this grammar format is ".grxml", and that the media type of the resource is "application/grammar+xml" (the media type is tentative and awaiting approval from the IETF). Alternatively, the XML format of the grammar can be specified inline:

<form>
  <field name="travellers">
   <prompt>How many are traveling?</prompt>
   <grammar mode="voice" type="application/grammar+xml" root="num">
       <rule id="num">
         <one-of>
             <item>one</item>
             <item>two</item>
             ...
         </one-of>
       </rule>
   </grammar>
  </field>
</form>

where the "root" attribute in <grammar> indicates the top-level rule to activate during recognition.

When grammars are specified in this XML format, a VoiceXML platform is not permitted to reject the grammar by throwing an "error.unsupported.format". However, it may reject a grammar which is specified in a language not supported by the platform (e.g. a Swedish grammar, <grammar xml:lang="sv" ... />), or which is ill-formed, missing, and so on.

The <grammar> element also contains a "mode" attribute with values to indicate it is a 'voice' grammar or a DTMF grammar. The <dtmf> element of VoiceXML 1.0 is replaced in VoiceXML 2.0 with a <grammar> element with its mode attribute set to "dtmf". VoiceXML 2.0 platforms must support this XML format for DTMF grammars.
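
For example, a DTMF version of the earlier travellers grammar might be written inline as follows (a minimal sketch; the prompt wording and digit choices are illustrative):

<field name="travellers">
   <prompt>Press a key between 1 and 5 to say how many are traveling.</prompt>
   <grammar mode="dtmf" type="application/grammar+xml" root="digit">
       <rule id="digit">
         <one-of>
             <item>1</item>
             <item>2</item>
             <item>3</item>
             <item>4</item>
             <item>5</item>
         </one-of>
       </rule>
   </grammar>
</field>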

By using a single mandatory XML grammar format in VoiceXML 2.0, application developers can write portable speech and DTMF grammars.

4.2 Output Changes

Just as VoiceXML 2.0 delegates the required grammar format to another Speech Interface Framework specification, so too it delegates speech output. All platforms must support the Speech Synthesis Markup Language [SSML] (Section 4.1.2). For example:

<form>
  <field name="travellers">
   <prompt>How <emphasis> many </emphasis> are traveling?</prompt>
  </field>
</form>

SSML elements such as <emphasis> can only appear inside the <prompt> element; in general, it is good practice to put all output inside a <prompt> element.

Finally, note that the speech markup elements used in VoiceXML 1.0 (<emp>, <div>, <pros>, and <sayas>) are no longer supported in VoiceXML 2.0; they have been superseded by the corresponding elements in SSML.

4.3 Protocol Changes

Unlike VoiceXML 1.0, VoiceXML 2.0 requires that the HTTP protocol [RFC2616] be supported (Section 6.1.4). Thus grammars and other resources which need to be fetched by the platform can always be specified using HTTP:

<form>
  <field name="travellers">
   <prompt> <audio src="http://www.example.com/travel.wav"/> </prompt>
   <grammar src="http://www.example.com/number.grxml" 
               type="application/grammar+xml"/>
  </field>
</form>

Furthermore, VoiceXML 2.0 platforms must follow the cache correctness rules of HTTP (Section 6.1.2). These specify in detail when resources in the platform cache can be reused, and when fresh resources must be fetched. VoiceXML 2.0 also makes available the fetching attributes "maxage" and "maxstale", which give developers fine-grained control over when cached resources are used; for example, they can be used to ensure that a large audio or grammar file is only fetched once, thus minimizing delay for the end user. Other changes which affect fetching behavior in VoiceXML 2.0 include removing the 'caching' attribute and property, and removing 'stream' as a value of 'fetchhint'.

4.4 Audio

The audio formats which platforms are recommended to support for playback and recording in VoiceXML 1.0 have become mandatory in VoiceXML 2.0 (Appendix E - Audio File Formats).

Applications which use 8-bit, 8 kHz, mono, PCM mu-law or A-law audio, either raw or with a WAV (RIFF) header, will be portable across VoiceXML 2.0 platforms. Of course, as with VoiceXML 1.0, platforms may also support other formats such as G.729, GSM and MP3, to name but a few.

5. Functional Changes

Although our primary intention was not to extend the functionality of the language, a number of features were added which we felt would significantly contribute to its usefulness.

5.1 Accessing Information about the Last Recognition Result

VoiceXML 2.0 defines a new variable, application.lastresult$, which provides information about the last recognition in the application (Section 5.1.5). The variable is read-only and contains the information about the string recognized, its confidence, its input mode (DTMF or voice), and its semantic interpretation.

One motivation for this variable is that VoiceXML 1.0 did not provide a mechanism for inspecting the recognition result arising from grammars in the <link> element. If user input in an active dialog (typically a <field> inside a <form>) matched a grammar in a <link>, there was no way to evaluate the recognition result before executing the link's action. In comparison, when user input matches a grammar in a <field>, the result can be evaluated using the <filled> element; for example, the developer can check the confidence by inspecting the field's "confidence" shadow variable.
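
For example, a sketch of the field case (the 0.5 threshold and field name are illustrative): the shadow variable "travellers$" exposes the confidence of the last match in its "confidence" property, and the field can be cleared and collected again if the confidence is too low:

<field name="travellers">
    <prompt>How many are traveling?</prompt>
    <grammar src="http://www.example.com/number.grxml"
                type="application/grammar+xml"/>
    <filled>
        <if cond="travellers$.confidence &lt; 0.5">
            <prompt>Sorry, I am not sure I heard that correctly.</prompt>
            <clear namelist="travellers"/>
        </if>
    </filled>
</field>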

In VoiceXML 2.0, application.lastresult$ can be used to evaluate the recognition result of a <link> grammar; for example:

<link event="linkevent">
    <grammar src="./linkgrammar.grxml" type="application/grammar+xml"/>
</link>

<form>
    ...
    <field>
        ...
        <catch event="linkevent">
            <if cond="application.lastresult$.confidence &lt; 0.7">
                <goto nextitem="confirmlinkdialog"/>
            <else/>
                <goto next="./main_menu.html"/>
            </if>
        </catch>
    </field>
    ...
</form>

The <link> throws the event "linkevent" which is caught by the field-level <catch>. The executable content in the <catch> is then able to inspect the recognition result and transition to different dialog states depending on the confidence. Crucially, the event handler here is able to evaluate the recognition result and run a confirmation dialog (assume a <subdialog> field called "confirmlinkdialog") which, if disconfirmed by the user, allows the application to carry on as before. Finally, "lastresult$" is defined at the application level so that, if the <link> and <catch> had been specified in the application root document, the recognition result would still be accessible.

A second motivation for "application.lastresult$" is to support developer access to the 'n-best' recognition results. "application.lastresult$" is also an array, where each element, referenced as "application.lastresult$[i]", describes a possible recognition result. The results are ordered by recognition confidence and then by grammar precedence. A developer is able to inspect these results by iterating through the array, as shown in the following example:

<form>
    ...
    <field>
        ...
        <filled>
            <script>
            <![CDATA[
             // number of results
             var len = application.lastresult$.length;
             // iterate through the array
             for (var i = 0; i < len; i++) {
               // check whether this result was DTMF input
               if (application.lastresult$[i].mode == "DTMF") {
                 ...
               }
             }
            ]]>
            </script>
        </filled>
    </field>
    ...
</form>

5.2 Logging for Developers

The <log> element is new in VoiceXML 2.0 (Section 5.3.13). It is typically used by developers during the debugging process to generate a debug message. It can appear in executable content such as in a <filled> element, for example:

<field name="destcity">
   <filled>
      <log>confidence: <value expr="destcity$.confidence"/></log>
   </filled>
</field>

Here the recognition confidence of the user's input is logged. Although exactly how it is logged is platform-dependent, many platforms will support sending debugging information to the email account of the document maintainer (See Section 6.2).

5.3 Types of Bargein

VoiceXML 2.0 provides the developer with more control over the type of bargein performed by the platform through the new "bargeintype" attribute on the <prompt> element (Section 4.1.5.1). This attribute has a number of values which determine how aggressively bargein is performed.

With the values "energy" or "speech", the prompt will be stopped whenever energy or speech respectively (or DTMF) is detected. Of course, if the user input does not satisfactorily match the active speech or DTMF grammars, then a <nomatch> event is thrown. While this behavior can be appropriate for many applications, it can be frustrating for users if a long prompt, such as a news broadcast audio file, needs to be re-started due to mis-recognition.

When "bargeintype" has the value "recognition", a <nomatch> event will never be generated. The prompt is only stopped when the user input satisfactorily matches an active grammar; input not matching the grammar is ignored and the platform continues to recognize. This is particularly useful if one or more 'hotwords' are to be recognized while all other input is to be ignored; for example, a 'stop' command during playback of a long audio file.

Note that "recognition" bargein is inappropriate when more than a simple phrase is to be recognized. Using it with an data input application where the user can say "I want to travel from London to Paris" can be problematic: since the prompt will not be stopped when the user starts speaking, the user needs to continue speaking over the prompt!

5.4 Controlling Grammar Generation in <menu>

In VoiceXML 1.0, the text content of <choice> elements in <menu> was used to generate a grammar which also matched sub-phrases. For example,

<menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice ... >     Sports news                  </choice>
  <choice ... >     Weather news                 </choice>
  <choice ... >     Stargazer astrophysics news  </choice>
</menu>

The last <choice> would be matched if the user said phrases such as "Stargazer", "Stargazer news", "astrophysics news" and so forth. The exact grammar generation mechanism may be language- and platform-dependent. While there are some use cases for this mechanism, there is also a strong use case for introducing a strict form of grammar generation where a <choice> is matched if and only if the user says exactly its content. If alternative phrases are required, these can be specified in multiple <choice>s. This gives the developer more control over what is recognized, rather than leaving it up to the platform vendor (and thereby making application behavior less consistent across platforms).

To provide this control, an "accept" attribute on <menu> was introduced in VoiceXML 2.0 (Section 2.2): with the value "exact" (the default) the content of a <choice> defines the exact phrase to be recognized, while the value "approximate" indicates the earlier 'approximate' matching. The attribute is also defined on <choice> so that specific <choice> elements can override the general <menu> strategy.
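
For example, the menu above could be rewritten so that the first two choices must be spoken exactly, while the longer third choice still allows approximate matching (the target URIs are illustrative):

<menu accept="exact">
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice next="http://www.example.com/sports.vxml">Sports news</choice>
  <choice next="http://www.example.com/weather.vxml">Weather news</choice>
  <choice accept="approximate"
          next="http://www.example.com/astro.vxml">Stargazer astrophysics news</choice>
</menu>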

5.5 Universal Grammars

VoiceXML 2.0 also provides developers with greater control over universal grammars.

Universal grammars can be defined by a platform for 'cancel', 'exit', and 'help' events; for example, a platform may define "help me" as part of the help grammar so that any time the user says "help me" a 'help' event is thrown. While these can be very useful when initially developing an application, they can also be seriously problematic when deploying production-grade applications since:

  1. the grammars are platform-dependent: this detracts from document portability, and does not allow the developer to define their own universal grammars
  2. the grammars may not be appropriate in all applications and developers cannot switch them off in a platform-independent manner

In VoiceXML 2.0, a "universals" property has been introduced (Section 6.3.6). The property is used to determined whether universal grammars are enabled (by default, they are disabled - universals="none"), and if they are enabled, then provides control over which grammars are enabled (universals="all" to enable all grammars, universals="help exit" to enable only the 'help' and 'exit' grammars). Note that this property does not affect the default catch handlers for the 'exit', 'help' and 'cancel' events. If a developer wants to define their own "help" grammar for an application, they can do so by defining a <link> as follows:

<link event="help">
   <grammar mode="voice" type="application/grammar+xml" root="root">
       <rule id="root">
         <one-of>
             <item>what can I do</item>
             <item>can you help me</item>
             ...
         </one-of>
       </rule>
   </grammar>
</link>

The 'help' event thrown when this grammar is matched can then be caught by the default 'help' handler, or by a developer-specified <catch> handler appropriate to the application.
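
As a sketch, a document that wants only the platform 'help' and 'exit' universal grammars re-enabled could set the "universals" property at document level:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <property name="universals" value="help exit"/>
   ...
</vxml>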

5.6 <throw> and <catch>

<throw> and <catch> have been enhanced in VoiceXML 2.0 to provide additional information (Section 5.2.1, Section 5.2.2).

<throw> now has attributes which allow developers to specify additional information besides the event name; the "message" attribute is used to specify the additional information statically, while "messageexpr" specifies it dynamically, for example:

<field name="destcity">
   <filled>
         <throw event="com.example.matched" messageexpr="destcity$.confidence"/>
   </filled>
</field>

where the event "com.example.matched" is thrown together with a message containing the confidence score of the recognized field "destcity".

One of the problems with <catch> in VoiceXML 1.0 was that it was impossible to specify a handler for a general event type and then process specific event types in different ways. In VoiceXML 2.0, <catch> handlers have the new anonymous variables "_event", containing the full name of the event that was thrown, and "_message", containing the value of the message string from the corresponding <throw>, for example:

<catch event="com.example">
     <if cond="_event == 'com.example.matched'">
        <prompt>The confidence was <value expr="_message"/></prompt>
        <elseif cond="_event == 'com.example.notmatched'"/> 
        ...
        <elseif cond="_event == 'com.example.anothercase'"/> 
        ...
        <else/>
        ...
     </if>
</catch>

5.7 <audio>

The <audio> element has been enhanced with an "expr" attribute (Section 4.1.3). In addition to providing the obvious capability of dynamically setting the audio to be played back, this enhancement also allows an <audio> element to be silently ignored if the "expr" attribute evaluates to ECMAScript undefined. Application developers can use this feature to specify a list of <audio> elements in their document where each <audio> element is only activated if its "expr" has a defined value. The following example shows how this can be used to read out the names of playing cards using concatenated audio files:

<form>
  <script src="cardgame.js"/>
  <!-- the script contains the function sayCard(position), which returns a URI
       to an audio file reading out the name of the card in the specified position;
       if there is no card in that position, it returns undefined -->

  <field name="takecard" type="boolean">
       <prompt>
           <audio src="you_have.wav"/>
           <!-- a maximum hand of 5 cards is described -->
           <audio expr="sayCard(1)"/>
           <audio expr="sayCard(2)"/>
           <audio expr="sayCard(3)"/>
           <audio expr="sayCard(4)"/>
           <audio expr="sayCard(5)"/>
           <audio src="another.wav"/>
        </prompt>
        ...
   </field>
</form>

In each of the <audio> elements reading out the card values, the value of "expr" is determined by the script function "sayCard(position)". The function returns either the location of the audio file to be played, or undefined. This allows the code fragment to read out up to five cards; if there are fewer than five cards, the remaining <audio> elements are silently ignored since their "expr" attributes evaluate to undefined.

As mentioned in Section 4.3 above, VoiceXML 2.0 also provides better control over resource fetching using HTTP with the attributes "maxage" and "maxstale" (Section 6.1.1). These attributes can appear on elements where documents and resources are fetched, including <submit>, <grammar>, <script> and <audio>. One typical use of this feature is to minimize the fetching of large audio files which change infrequently. For example:

<audio src="very-large-audio-file.wav" maxage="1000s" maxstale="1000000s"/>

where the fetched resource will 'live' in the cache for 1000 seconds ("maxage") and, if the cached resource has not been expired by more than 1,000,000 seconds ("maxstale"), the cached copy will be used rather than fetched again. Depending on the default values for these attributes, a VoiceXML platform might otherwise attempt to fetch this file each time it is encountered in a VoiceXML document. By specifying them appropriately, a developer can significantly reduce system latency.

5.8 Specifying Language

In VoiceXML 2.0, the "lang" attribute of <vxml> has been replaced with "xml:lang" to bring VoiceXML into alignment with other W3C XML languages (Section 1.5.1). The application developer can specify the language for both spoken input and output by assigning this attribute a language value defined in [RFC1766]; for example for interaction in Swedish ("sv"):

<vxml xml:lang="sv" version="2.0" xmlns="http://www.w3.org/2001/vxml">
   ...
</vxml>

The language specified on the <vxml> element is inherited down the document hierarchy so that, unless specified otherwise, speech grammar and synthesis elements inherit this value. The attribute can also be specified on <grammar> and <prompt> in order to override the inherited value (Section 3.1.1.5, Section 4.1). Developers are thus able to specify that prompts are spoken in different languages:

<vxml xml:lang="sv" version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
      <field name="email-body">
          <grammar src="http://grammarlib/email-commands-sv.grxml"
                      type="application/grammar+xml"/>
          <prompt xml:lang="en"> 
          ... text in English ... 
          </prompt>
          <prompt xml:lang="fr"> 
          ... text in French ... 
          </prompt>
      </field>
  </form>
</vxml>

In this email application example, the <grammar> element inherits the Swedish language specification - so users can speak commands in Swedish - but the <prompt>s override this and specify that part of the email is to be read out in English and another part in French. Note that not all VoiceXML platforms will support multiple languages, and those that do will vary in their language support. If an unsupported language is encountered by a platform, an "error.unsupported.language" event is thrown (with "_message" specifying the unsupported language).
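
A sketch of handling this event; as described above, "_message" carries the unsupported language tag:

<catch event="error.unsupported.language">
    <log>unsupported language requested: <value expr="_message"/></log>
    <prompt>Sorry, that language is not available on this service.</prompt>
</catch>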

6. Clarification Changes

This section presents some of the major changes which clarify inconsistencies in VoiceXML 1.0 and cases where the specification lacked sufficient detail for application developers or platform implementers.

6.1 Subdialogs

The <subdialog> element provides a mechanism for decomposing complex sequences of dialogs in order to structure them better, or to create reusable components. Its description in VoiceXML 1.0 was unclear about the relationship between the calling dialog and the subdialog itself, as well as the relationship between the subdialog and its document context (this was not helped by the inclusion of a "modal" attribute in the <subdialog> description but no such attribute in the DTD!).

In VoiceXML 2.0, a subdialog context is independent of its calling dialog, but the subdialog context follows normal scoping rules for grammars, events and variables (Section 2.3.4).

A subdialog is independent of its calling dialog because the subdialog executes in a new execution context where counters and variables are reinitialized. No state or variable instances are shared between the calling dialog and the subdialog context. The relationship can be seen as analogous to two universes connected by a wormhole: the wormhole is the conduit by which parameters are passed from the calling dialog into the subdialog, and by which values or an event are <return>ed from the subdialog to the calling dialog. For example, consider the following document with two forms, the first containing a <subdialog> element defining the calling dialog and the second defining the subdialog itself:

<!-- document variable -->
<var name="name"/>

<!-- form dialog that calls a subdialog -->
<form>
  <subdialog name="result" src="#getdriverslicense">
   <assign name="name" expr="'John Doe'"/>
   <param name="birthday" expr="'2000-02-10'"/>
   <filled>
      ...
   </filled>
  </subdialog>
</form>

<!-- subdialog to get drivers license -->
<form id="getdriverslicense">
  <var name="birthday"/>
  <field name="drivelicense">
   <grammar src="http://grammarlib/drivegrammar.grxml"
      type="application/grammar+xml"/>
   <prompt> Please say your driver's license, <value expr="name"/>. </prompt>
   <filled>
     <if cond="validdrivelicense(drivelicense,birthday)">
       <var name="status" expr="true"/>
     <else/>
       <var name="status" expr="false"/>
     </if>
     <return namelist="drivelicense status"/>
   </filled>
  </field>
</form>

When the <subdialog> in the first form is executed, a new execution context is created for the subdialog described in the second form. This subdialog context has access to the <param> "birthday" and <return>s the "drivelicense" and "status" variables to the calling dialog. No other information is directly shared between the two contexts. In particular, the "name" variable in the calling context is assigned the value 'John Doe', but that variable instance is not available to the subdialog since it is in a separate execution context - the variable is declared, but it has an undefined value in the subdialog context. For this example to speak out the user's name using <value>, the "name" variable should have been passed as a parameter into the subdialog context.

Within the subdialog context, however, normal scoping rules for grammars, events and variables apply. Active grammars in the subdialog context include those defined in the same document or in the application root document. Event handling and variable binding likewise follow the standard scoping hierarchy. Events thrown in a subdialog are handled by event handlers defined within its context; they can only be passed to the calling context by a local event handler which explicitly returns the event to the calling context. And, as illustrated above, the document-level variable "name" is accessible in the subdialog context (albeit with a freshly initialized value).

From a programming perspective, subdialogs behave differently from subroutines because the calling and called contexts are independent. While a subroutine can access variable instances in its calling routine, a subdialog cannot access variable instances defined in its calling dialog. Similarly, subdialogs do not follow the exception propagation model of languages like Java, where an exception thrown in a subroutine automatically propagates up to the calling context if not handled in the called context.

6.2 Root and Leaf Documents

VoiceXML 1.0 lacked clarity in its definition of, and the transitions between, root and leaf documents. VoiceXML 2.0 explicitly defines these transitions in terms of the <choice>, <goto>, <link>, <subdialog>, and <submit> elements and explains whether the application root context is preserved or initialized (Section 1.5.2).

In root-to-leaf and leaf-to-leaf transitions, the root context is preserved. In leaf-to-root transitions, the root context is initialized when the transition is caused by a <submit>; other transitions result in the root context being preserved. In root-to-root transitions, the application root context is always initialized. In the case of transitions to subdialogs, the calling dialog's application root context (if any) is preserved untouched during subdialog execution; if the subdialog is invoked with an empty URI reference, as with "#getdriverslicense" in the example above, new root and leaf document contexts are initialized from the same root and leaf documents used in the calling dialog. Finally, all other transitions cause the application root context to be initialized.
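
As an illustration (the document names are hypothetical), a leaf document can transition to another leaf in the same application with <goto> and the root context, including any application variables, is preserved; a <submit> whose target is the application root document would instead cause the root context to be initialized:

<!-- leaf1.vxml: a leaf document of the application rooted at app-root.vxml -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="http://www.example.com/app-root.vxml">
  <form>
    <block>
      <!-- leaf to leaf: the root context (e.g. application.* variables) is preserved -->
      <goto next="http://www.example.com/leaf2.vxml"/>
      <!-- a leaf-to-root transition via <submit> would initialize the root context instead:
           <submit next="http://www.example.com/app-root.vxml"/>  -->
    </block>
  </form>
</vxml>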

6.3 Prompt Queuing and Input Collection

VoiceXML 2.0 clarifies the relationship between prompt queuing and input collection (Section 4.1.8). A VoiceXML interpreter is always in one of two states:

  1. waiting for input, or
  2. transitioning between field items in response to input received while in the wait state.

In terms of the Form Interpretation Algorithm, the wait state is entered in the collect phase of a field item, while the transitioning state covers the process and select phases (and the collect phase of non-field items such as <block>). This model clarifies that executable content is always run to completion, since it is executed in the transitioning state - and no input can be received in that state.

Furthermore, the model also clarifies when prompts are queued and when they are played. Prompts are queued in the transitioning state, and are played either when the interpreter reaches the waiting state or when it begins fetching a resource for which fetchaudio is specified. Take the simple case below:

<form>

  <block>
   <script src="http://www.example.com/init1.jsp"/>
   <prompt>Welcome to the game!</prompt>
   <script src="http://www.example.com/init2.jsp"/>
  </block>

  <field name="name">
   <prompt>Please say your name</prompt>
   <grammar src="http://grammarlib/getplayername.grxml"
      type="application/grammar+xml"/>
  </field>

</form>

The interpreter is in the transitioning state when executing the <block>: the first <script> is executed to completion, the "Welcome to the game!" <prompt> is queued, and the second <script> is then executed to completion. The interpreter stays in the transitioning state as it enters the "name" <field> and queues the field prompt "Please say your name". It then moves into the waiting state once it has played the queued prompts, activated the grammars and is awaiting input.

6.4 Variables in VoiceXML and ECMAScript

VoiceXML 2.0 clarifies the relationship between VoiceXML and ECMAScript variables (Section 5.1).

VoiceXML variables are in almost all respects equivalent to ECMAScript variables: declaring a variable using <var> is equivalent to using a var statement within a <script> element. The two also share the same variable space, so VoiceXML variables can be used in scripts just as variables defined in scripts can be used in VoiceXML, and both kinds of variables can be <submit>ed. There is, however, one important difference: ECMAScript allows the use of undeclared variables, whereas undeclared variables cannot be used in VoiceXML.
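
A minimal sketch of this shared variable space: the variable "count", declared with <var>, is updated inside a <script>, and the variable "greeting", declared with a var statement in the script, is then read back in VoiceXML:

<form>
  <var name="count" expr="0"/>
  <block>
    <script>
      count = count + 1;       // the VoiceXML variable is visible in the script
      var greeting = "hello";  // and this variable is visible to VoiceXML
    </script>
    <prompt>
      <value expr="greeting"/>, this is visit number <value expr="count"/>.
    </prompt>
  </block>
</form>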

This equivalence imposes some constraints on variable naming in VoiceXML, since names must adhere to the variable naming conventions of ECMAScript; for example, they cannot be ECMAScript reserved words. They must also follow the rules for referential correctness; for example, they cannot contain a dot. A VoiceXML variable, such as a field item variable, with the name "name.firstname" is illegal in ECMAScript and causes an 'error.semantic' to be thrown. Furthermore, VoiceXML imposes additional restrictions on variable names: names beginning with "_" or ending with "$" are reserved for internal use, and names of logical scopes (such as "session" and "dialog") are not recommended since they can unexpectedly hide pre-defined variables due to variable scoping.
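
For example (the field names are illustrative):

<!-- illegal: the dot means this is not a valid ECMAScript variable name,
     so an error.semantic event would be thrown -->
<field name="name.firstname"> ... </field>

<!-- legal alternative -->
<field name="firstname"> ... </field>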

6.5 Conformance

VoiceXML 2.0 clarifies conformance both in terms of VoiceXML documents and in terms of VoiceXML processors (Appendix F - Conformance). This aligns VoiceXML with other W3C specifications and the definitions are, to a large extent, aligned with those in the Speech Grammar [SRGS] and Speech Synthesis [SSML] specifications.

To be conforming, a VoiceXML 2.0 document must be a well-formed XML document and must provide a namespace declaration on the <vxml> element:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   ...
</vxml>

Elements and attributes from non-VoiceXML namespaces are permitted if and only if, when such elements and attributes are removed from the document, the resulting document is a valid VoiceXML document.

A conforming VoiceXML processor (typically part of a VoiceXML platform or gateway) must be able to parse and process conforming VoiceXML documents. It does not need to be a validating XML parser. However it must be a conforming [SSML] processor and a conforming XML grammar processor [SRGS]. In other words, it must be able to correctly interpret these embedded markup languages as defined in their respective specifications (note that a VoiceXML platform may delegate their processing to distributed SSML and XML speech grammar processors).

A conforming processor must also support the syntax and semantics of all elements in the VoiceXML 2.0 specification - a conforming platform cannot pick an arbitrary subset of elements to support. While most elements must be supported, the <transfer> element is effectively optional: a conforming platform can legally throw an "error.unsupported.transfer" event.
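
For example, a document that uses <transfer> can still behave gracefully on such a platform by catching the event (a sketch; the destination number is illustrative):

<form>
    <transfer name="agent" dest="tel:+1-555-555-0100">
        <catch event="error.unsupported.transfer">
            <prompt>Sorry, call transfer is not available on this service.</prompt>
        </catch>
    </transfer>
</form>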

Finally, while the Voice Browser Working Group in W3C is responsible for the VoiceXML specification and its conformance definition, it is the VoiceXML Forum, specifically its Conformance Committee, which is responsible for determining whether a given VoiceXML platform is actually conformant or not.

7. Further Changes in VoiceXML 2.0 During the W3C Recommendation Process

Although VoiceXML 2.0 has been published as a Working Draft, it is not yet a W3C standard. To achieve that, VoiceXML 2.0 needs to go through the remaining stages of the W3C Recommendation track:

  1. Last Call Working Draft, where the Working Group believes the specification is complete and solicits final technical review
  2. Candidate Recommendation, where implementation experience is gathered
  3. Proposed Recommendation, where the specification is reviewed by the W3C Membership
  4. W3C Recommendation

However, given the widespread support for VoiceXML in terms of existing commercial implementations and applications, it is the intention of the Voice Browser Working Group to make the next version of the Working Draft the last one with significant changes, although later changes may be unavoidable due to inconsistencies, comments from other W3C working groups, etc.

At the time of writing, there are over 100 change requests which we expect to incorporate into the next version. Some of the more significant planned changes and clarifications include, for example:

The next version is expected to be published by the end of Q1 2002. From that time onwards, changes to the VoiceXML 2.0 specification should be minor.

8. Changes Deferred beyond VoiceXML 2.0

During the process of developing VoiceXML 2.0, there have been many requests for new features and functionality which the Voice Browser Working Group has deferred beyond VoiceXML 2.0. While there is no guarantee that the following features will actually make it into the next version, they serve to illustrate the type of additional functionality which has been requested:

Note that issues relating to call control, lexica, natural language semantics, and embedding in a multi-modal context are also under consideration and specific proposals will be developed by other committees within the Voice Browser Working Group as well as other W3C Working Groups.

9. Conclusions

VoiceXML 2.0 provides a significant advance over VoiceXML 1.0, not so much in terms of functionality but in terms of interoperability and clarity.

The dialog team of the Voice Browser Working Group will continue to drive the VoiceXML 2.0 specification through the W3C standards process as efficiently as possible so that a mature and stable specification is available to VoiceXML Forum application and platform developers as quickly as possible.

Requests for further changes to the language, as well as comments on the language or our process, can be submitted to .

Acknowledgments

As the chairman of the dialog team, I would like to thank my co-editors of VoiceXML 2.0 for their insight and hard work, and the rest of the Working Group for their comments and support. Any errors in this document are exclusively mine though!

References

[DIALOG-REQS]
"Dialog Requirements for Voice Markup Languages". McGlashan. W3C Working Draft, December 1999.
See http://www.w3.org/TR/voice-dialog-reqs/
[JSGF]
"JSpeech Grammar Format". Andrew Hunt. W3C Note, June 2000.
See http://www.w3.org/TR/2000/NOTE-jsgf-20000605/
[RFC1766]
"Tags for the Identification of Languages". IETF RFC 1766, 1995.
See http://www.ietf.org/rfc/rfc1766.txt
[RFC2616]
"Hypertext Transfer Protocol -- HTTP/1.1". IETF RFC 2616, 1999.
See http://www.ietf.org/rfc/rfc2616.txt
[SIF]
"Introduction and Overview of W3C Speech Interface Framework". Larson. W3C Working Draft, December 2000.
See http://www.w3.org/TR/voice-intro/
[SRGS]
"Speech Recognition Grammar Specification for the W3C Speech Interface Framework". Hunt and McGlashan. W3C Working Draft, August 2001.
See http://www.w3.org/TR/2001/WD-speech-grammar-20010820/
[SSML]
"Speech Synthesis Markup Language". Walker and Hunt. W3C Working Draft, January 2001.
See http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/
[VOICEXML-1.0]
"Voice eXtensible Markup Language 1.0". Boyer et al. W3C Note, May 2000.
See http://www.w3.org/TR/2000/NOTE-voicexml-20000505
[VOICEXML-2.0]
"Voice Extensible Markup Language (VoiceXML) Version 2.0". McGlashan et al. W3C Working Draft, October 2001.
See http://www.w3.org/TR/2001/WD-voicexml20-20011023/

 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).