What's New with VoiceXML 2.0?
1. Voice Browser Working Group
The Voice Browser Working Group was chartered by the World Wide Web Consortium (W3C) within the User Interface Activity in May 1999 to prepare and review markup languages that enable voice browsers. Members meet weekly via telephone and quarterly in face-to-face meetings.
The W3C Voice Browser Working Group is open to any member of the W3C. The Working Group has also invited experts whose affiliations are not W3C members. The four founding members of the VoiceXML Forum, as well as telephony application vendors, speech recognition and text-to-speech engine vendors, web portals, hardware vendors, software vendors, telcos, and appliance manufacturers, have representatives who participate in the Voice Browser Working Group. Current members include AT&T, Avaya, BeVocal, BT, Canon, Cisco, Converse, Comverse, Dynamicsoft, General Magic, HP, IBM, Informio, Infospace, Intel, Lernout & Hauspie, Loquendo, Lucent, Microsoft, Mitre, Motorola, Nokia, Nortel, Nuance, Openwave, Philips, PipeBeach, Snowshore Networks, SpeechWorks, Sun, Syntellect, Telera, TellMe.com, Unisys, Verascape, VoiceGenie, and Voxeo, in addition to several invited experts.
So what's new with VoiceXML 2.0?
Plenty.
What was a single language, VoiceXML 1.0, has been extended into several related markup languages, each providing a useful facility for developing web-based speech applications. These facilities are organized into the W3C Speech Interface Framework.
2. W3C Speech Interface Framework
The Voice Browser Working Group has defined the W3C Speech Interface Framework, shown in Figure 1. The white boxes represent typical components of a speech-enabled web application. The black arrows represent data flowing among these components. The blue ovals indicate data, specified using markup languages, that guides the components in accomplishing their respective tasks.
Figure 1. W3C Speech Interface Framework
The components of any specific voice browser may differ significantly from the components shown in Figure 1. For example, the Context Interpretation, Language Generation, and Media Planning components may be incorporated into the Dialog Manager, or the tone recognizer may be incorporated into the Context Interpretation component. However, most voice browser implementations will still be able to make use of the various markup languages defined in the W3C Speech Interface Framework.
Components of the W3C Speech Interface Framework include the following:
Automatic Speech Recognizer (ASR)--accepts speech from the user and produces text. The ASR uses a grammar to recognize words in the user's speech. Some ASRs use grammars specified by a developer using the Speech Grammar Markup Language. Grammars may contain elements of a Semantic Attachment Language that instruct the ASR how to generate a semantic representation of the meaning of the user's utterance. Other ASRs use statistical grammars generated from large corpora of speech data; these grammars are represented using the N-gram Grammar Markup Language. The ASR also uses the Pronunciation Lexicon to obtain pronunciations of words.
DTMF Tone Recognizer--accepts touch-tones produced by a telephone when the user presses the keys on the telephone's keypad. Telephone users may use touch-tones to enter digits or make menu selections.
Language Understanding Component--extracts semantics from a text string using a prespecified grammar. The text string may be produced by an ASR or entered directly by a user via a keyboard. The Language Understanding Component may also use grammars specified using the Speech Grammar Markup Language or the N-gram Grammar Markup Language. The output of the Language Understanding Component is expressed using the Natural Language Semantics Markup Language (a sketch appears after this list).
Context Interpreter--enhances the semantics from the Language Understanding Component by obtaining context information from a dialog history (not shown in Figure 1). For example, the Context Interpreter may replace a pronoun with the noun to which the pronoun referred. The input and output of the Context Interpreter are expressed using the Natural Language Semantics Markup Language.
Dialog Manager--prompts the user for input, makes sense of the input, and determines what to do next according to instructions in a dialog script specified using VoiceXML 2.0 (modeled after VoiceXML 1.0). Depending upon the input received, the Dialog Manager may invoke application services, download another dialog script from the web, or cause information to be presented to the user. The Dialog Manager accepts input specified using the Natural Language Semantics Markup Language. Dialog scripts may refer to Reusable Dialog Components, portions of another dialog script that can be reused across multiple applications. The Dialog Manager may fetch dialog scripts from the Web for interpretation and submit changes to databases on the Web. The Dialog Manager may also issue telephony control commands to the Telephone System, using the Call Control Markup Language to control telephone connections.
Media Planner--determines whether output from the dialog manager should be presented to the user as synthetic speech or prerecorded audio.
Recorded audio player--replays prerecorded audio files to the user, either in conjunction with, or in place of synthesized voices.
Language Generator--accepts text from the Media Planner and prepares it for presentation to the user as spoken voice via a text-to-speech synthesizer (TTS). The text may contain markup tags, expressed using the Speech Synthesis Markup Language, which provide hints about how the speech should be rendered. These tags may be produced automatically by the Language Generator or inserted manually by a developer.
Text-to-Speech Synthesizer (TTS)--accepts text from the Language Generator and produces acoustic signals, which the user hears as a human-like voice, according to hints specified using the Speech Synthesis Markup Language (a short example appears after this list). The TTS also uses the Pronunciation Lexicon to obtain pronunciations of words.
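Because several of the components above exchange data using the Natural Language Semantics Markup Language, a rough sketch may help. The following is an illustrative (not normative) result for a spoken pizza order; the element and attribute names follow the working draft at the time, while the order structure and grammar name are hypothetical.
<result grammar="pizza_order">
  <interpretation>
    <instance>
      <order>
        <size>large</size>
        <topping>mushroom</topping>
      </order>
    </instance>
    <input mode="speech">I'd like a large mushroom pizza</input>
  </interpretation>
</result>
The <instance> element carries application-specific markup describing the meaning of the utterance, while <input> records what the user actually said.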
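Similarly, the following is a minimal sketch of the kind of Speech Synthesis Markup Language hints that a Language Generator or a developer might add to output text. The wording is hypothetical, and only elements from the working drafts are used.
<speak>
  <voice gender="female">
    Your pizza will arrive in <emphasis>thirty</emphasis> minutes.
    <break/>
    <prosody rate="slow">Thank you for calling Ajax Pizza.</prosody>
  </voice>
</speak>
Here <emphasis>, <break>, and <prosody> give the TTS hints about stress, pausing, and speaking rate without dictating exactly how the audio is produced.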
The Voice Browser Working Group is also defining an architecture for Reusable Dialog Components. As their name suggests, reusable components can be reused in other dialog scripts, decreasing the implementation effort and increasing user interface consistency. The Working Group may also define a collection of reusable components that, for example, solicit the user's credit card number and expiration date, or solicit the user's address.
3. Individual Markup Language Overviews
To review the latest requirements and specification documents for each of the following languages, see the section titled Requirements and Language Specification Documents on our W3C Voice Browser home web site. The remainder of this section contains a brief overview of four of these markup languages.
3.1. VoiceXML 2.0
VoiceXML 2.0 supports four I/O modes: speech recognition and DTMF as input, with synthesized speech and prerecorded speech as output. VoiceXML 2.0 supports system-directed speech dialogs, in which the system prompts the user for responses, makes sense of the input, and determines what to do next. VoiceXML 2.0 also supports mixed-initiative speech dialogs. In addition, VoiceXML 2.0 supports task switching and the handling of events, such as recognition errors, incomplete information entered by the user, timeouts, barge-in, and developer-defined events. Barge-in allows users to speak while the browser is speaking. VoiceXML 2.0 is modeled after VoiceXML 1.0, designed by the VoiceXML Forum, whose founding members are AT&T, IBM, Lucent, and Motorola. VoiceXML 2.0 contains clarifications and minor enhancements to VoiceXML 1.0. VoiceXML 2.0 also contains a new <log> tag for use in debugging and application evaluation.
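For example, a dialog script might use <log> to record debugging information; the label and field name below are illustrative.
<log label="order_debug">The caller chose a <value expr="pizza_size"/> pizza.</log>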
As an example, the following is a dialog fragment used in a system-directed speech dialog that prompts the user for the size of a pizza. The grammar is expressed using the XML speech grammar notation.
<field name="pizza_size">
  <prompt>What size pizza?</prompt>
  <grammar mode="voice" root="size">
    <rule id="size" scope="public">
      <one-of>
        <item>small</item>
        <item>medium</item>
        <item>large</item>
      </one-of>
    </rule>
  </grammar>
</field>
This dialog fragment consists of a single prompt, "What size pizza?" and a grammar that describes valid values the user may say in response to the prompt. If the user responds with a pizza size of "large", the ECMAScript string value "large" is assigned to the "pizza_size" field.
As another example, the following is a dialog fragment in which the user speaks a sentence containing several pieces of information. This example assumes that the pizza grammar illustrated above is stored in a separate file named "pizza_order.gram".
<form id="Ajax">
  <grammar src="pizza_order.gram" type="application/grammar+xml"/>
  <initial name="start">
    <prompt>
      <voice gender="female" category="adult">
        Welcome to <emphasis level="strong">Ajax Pizza.</emphasis>
      </voice>
    </prompt>
  </initial>
  <field name="order">
    <prompt>What would you like to order?</prompt>
  </field>
</form>
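The contents of the external grammar file are not shown above. As a rough sketch only, assuming the pizza_order.gram file uses the XML grammar notation, it might contain something like the following; the phrasing and rule names are hypothetical.
<grammar mode="voice" root="order">
  <rule id="order" scope="public">
    I'd like a
    <one-of>
      <item>small</item>
      <item>medium</item>
      <item>large</item>
    </one-of>
    pizza
  </rule>
</grammar>
A real mixed-initiative grammar would typically cover toppings, quantities, and other order details as well, with semantic attachments that fill the corresponding form fields.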
The Voice Browser Working Group reviewed over 250 change requests against VoiceXML 1.0; the resulting clarifications, corrections, and explanations produced VoiceXML 2.0.
Continued...