Volume 1, Issue 4 - April 2001
   
 

Introduction to the W3C Grammar Format

By Andrew Hunt

Introduction

The W3C Voice Browser Working Group [1] has released a draft specification for the W3C Speech Recognition Grammar Format [2] that promises to enhance the interoperability of VoiceXML Browsers and drive the portability of VoiceXML applications. This article summarizes the key features of the draft specification and the application of the specification to VoiceXML application development.

The role of grammars in a spoken dialog application is to define for the VoiceXML browser the words and patterns of words that a user can say at any particular point in a dialog. For example, the following grammar allows a caller to say the name of one of four cities: "New York," "Sydney," "Boston," or "Berlin."

<?xml version="1.0"?>
<grammar xml:lang="en" version="1.0">
  <rule id="city" scope="public">
    <one-of>
      <item> new york </item>
      <item> sydney </item>
      <item> boston </item>
      <item> berlin </item>
    </one-of>
  </rule>
</grammar>
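
Because the XML Form is ordinary XML, a general-purpose XML parser can read a grammar like the one above. The following is a minimal sketch in Python, using only the standard library, that lists the phrases in the city rule. The function name city_phrases is hypothetical and is not part of the specification, and a real browser would compile the grammar for its recognizer rather than list phrases.

```python
import xml.etree.ElementTree as ET

# The city grammar from the article, embedded as a string.
GRAMMAR = """<?xml version="1.0"?>
<grammar xml:lang="en" version="1.0">
  <rule id="city" scope="public">
    <one-of>
      <item> new york </item>
      <item> sydney </item>
      <item> boston </item>
      <item> berlin </item>
    </one-of>
  </rule>
</grammar>"""


def city_phrases(grammar_xml):
    """Return the phrases listed as <item> alternatives in the "city" rule."""
    root = ET.fromstring(grammar_xml)
    rule = root.find("rule[@id='city']")
    # Collapse the padding white space around each item's tokens.
    return [" ".join(item.text.split()) for item in rule.iter("item")]


if __name__ == "__main__":
    for phrase in city_phrases(GRAMMAR):
        print(phrase)
```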


Grammar authoring is a critical facet in the development of robust, usable telephony speech applications. When an application's grammars accurately model the speech input from callers, the usability of the application is enhanced and caller satisfaction is likely to be higher. With the rapid growth of the speech technology market and the increasing deployment of commercial applications, grammar authoring is becoming an important skill for speech developers and is increasingly becoming an area of specialization.

The VoiceXML 1.0 specification [3] documents the use of the Java Speech Grammar Format (JSGF) to describe grammars but does not mandate that browsers support JSGF. Despite the use of JSGF in the VoiceXML 1.0 specification, the language is agnostic to the grammar format and it is acceptable for an application to use any grammar format supported by a browser.

Current deployments of VoiceXML and other speech applications most often use proprietary grammar formats, typically the native format of the speech recognizer embodied in the browser. However, VoiceXML promises platform interoperability for the application, so there is a compelling need to standardize on a common cross-platform grammar format.

The VoiceXML 2.0 specification [see footnote] being developed in the W3C will require that all VoiceXML 2.0 browsers support the XML Form of the W3C Speech Recognition Grammar Format. This will provide a common baseline for grammar interoperability. The W3C grammar specification is modeled on the JSpeech Grammar Format [4], submitted to the W3C by Sun Microsystems in June 2000. The current grammar draft is in its Last Call release and is planned to proceed to finalization by late 2001. The W3C process encourages open participation and comments on the current draft are welcome.

Two Grammar Standards!

The W3C Speech Recognition Grammar Format specification embodies two equivalent languages.

  • XML Form of the W3C Speech Recognition Grammar Format: Represents a grammar as an XML document with the logical structure of the grammar captured by XML elements. This format is ideal for computer-to-computer communication of grammars because widely available XML technology (parsers, XSLT, etc.) can be used to produce and accept the grammar format.

  • Augmented BNF (ABNF) Form of the W3C Speech Recognition Grammar Format: The logical structure of the grammar is captured by a combination of traditional BNF (Backus-Naur Form) and a regular expression language. This format is familiar to many current speech application developers, is similar to the proprietary grammar formats of most current speech recognizers and is a more compact representation than XML. However, a special parser is required to accept this format.

Grammars written in either format can be converted to the other format without loss of information (except formatting). The two formats co-exist because the Working Group found it important to support both a computer-to-computer communication format and a more familiar human-readable format (but, as with all decisions reached by a committee, there is a spectrum of opinion on these matters).

Importantly, the Working Group has decided that the XML Grammar Format is the required grammar format for VoiceXML 2.0; that is, all compliant VoiceXML 2.0 browsers will be required to support the XML Grammar Format. Support for the ABNF format is recommended, but optional.

As a result, the XML language is used for most examples in this article. For examples of ABNF see the W3C specification (http://www.w3.org/TR/speech-grammar/).

Basic Grammar Document

The body of a grammar defines a set of rules. Each rule has a name and that name must be unique within the grammar. The scope of each rule is declared as either public or private. A public rule may be activated for recognition; for example, when referenced by a <grammar> element in VoiceXML. A public rule may also be imported into other grammars. All non-public rules are private. Private rules can be referenced only by other rules within the same grammar but they can reference public rules imported from other grammars. This public/private distinction should be familiar to Java developers.

Most importantly, a rule defines an expansion that declares how the rule is expanded into words, references to other rules, and patterns of words and references.
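
The naming and scope conventions described above can be checked mechanically. The following Python sketch assumes the scope attribute takes the values "public" and "private" and treats any rule not marked public as private, as the text describes. The function name summarize_rules and the sample grammar are hypothetical illustrations.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-rule grammar: one public rule and one private rule.
SAMPLE = """<grammar xml:lang="en" version="1.0">
  <rule id="city" scope="public"> boston </rule>
  <rule id="filler" scope="private"> please </rule>
</grammar>"""


def summarize_rules(grammar_xml):
    """Return (public_ids, private_ids); raise ValueError on a duplicate rule name."""
    root = ET.fromstring(grammar_xml)
    public, private, seen = [], [], set()
    for rule in root.iter("rule"):
        rule_id = rule.get("id")
        if rule_id in seen:
            # Rule names must be unique within a grammar.
            raise ValueError(f"duplicate rule name: {rule_id}")
        seen.add(rule_id)
        # Every rule that is not public is treated as private.
        if rule.get("scope") == "public":
            public.append(rule_id)
        else:
            private.append(rule_id)
    return public, private


if __name__ == "__main__":
    print(summarize_rules(SAMPLE))
```

Only the rules in the first list could be activated for recognition or imported by another grammar.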

Rules and Tokens

Words, or more precisely tokens, are the basic units of a grammar and indicate those things that a user can say. Any token is a legal expansion in a rule definition. If a token contains white-space (e.g., "Rio de Janeiro") it should be contained in quotes. Sequences of individual tokens are separated by white space and the sequence is a legal expansion. Tokens can be enclosed in a <token> element that may be used to indicate the language of the contained token. For example:

hello
new york
"Rio de Janeiro"
to be or not to be
<token xml:lang="fr">francois corriveau</token> <!-- French -->


A rule reference is a legal expansion and is represented by a <ruleref> element. A rule reference is equivalent to a non-terminal reference in a traditional grammar. The referenced rule is identified by a URI. The referenced rule may be local to the grammar, in which case the URI is of the form "#rulename". Alternatively, the referenced rule may be any public rule of another grammar, in which case a relative or absolute URI is used. The <ruleref> element is always an empty element (it contains no text or other elements).

<ruleref uri="#city"/>
<ruleref uri="../locations.xml#city"/>
<ruleref uri="http://myexample.com/grammars/locations.xml#city"/>
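
The three forms of reference shown above differ only in how the URI is resolved. The following Python sketch uses the standard urllib.parse module to split each URI into a grammar location and a rule name. The base URI of the referencing grammar is a made-up example, and the function name resolve_ruleref is hypothetical.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical location of the grammar document that contains the <ruleref> elements.
BASE = "http://myexample.com/grammars/travel/booking.xml"


def resolve_ruleref(uri, base_uri):
    """Split a ruleref URI into (grammar_location, rule_name).

    grammar_location is None for a local reference of the form "#rulename".
    """
    location, rule_name = urldefrag(uri)
    if location == "":
        return None, rule_name
    # Relative locations are resolved against the referencing grammar's URI.
    return urljoin(base_uri, location), rule_name


if __name__ == "__main__":
    for uri in ("#city",
                "../locations.xml#city",
                "http://myexample.com/grammars/locations.xml#city"):
        print(uri, "->", resolve_ruleref(uri, BASE))
```

With this assumed base URI, the relative and absolute references resolve to the same grammar document.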


Logical Operations

A sequence of legal expansions is itself a legal expansion. The sequence may be enclosed in an <item> element or in other elements such as <count> or <rule>. As mentioned previously, tokens in sequence should be separated by white space. Sequential elements other than tokens (the <token>, <ruleref>, <item>, <count> and <one-of> elements) do not require white-space separation. The following are each examples of sequences:

phone home
call the "Rio de Janeiro" office
call<ruleref uri="#location"/>
<item>call <ruleref uri="#location"/></item>
<count num="optional">please</count> call home


The <one-of> element is used to declare a set of alternative expansions. The <one-of> element must contain one or more <item> elements, each of which declares one of the alternatives. In the following example, each alternative is a single token but any legal expansion can be contained within the item.

<one-of>
<item> new york </item>
<item> sydney </item>
<item> boston </item>
<item> berlin </item>
</one-of>
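
To see how sequences and alternatives combine, the following Python sketch enumerates every phrase a small grammar accepts. It is an illustration only, under several assumptions: it handles plain tokens, <item>, <one-of>, and local rule references, ignores quoted tokens and <count>, and would not terminate on a recursive grammar. The function name expand and the sample grammar are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical grammar: a sequence that refers to a rule holding alternatives.
SAMPLE = """<grammar xml:lang="en" version="1.0">
  <rule id="request" scope="public">
    call the <ruleref uri="#city"/> office
  </rule>
  <rule id="city" scope="private">
    <one-of>
      <item> new york </item>
      <item> sydney </item>
      <item> boston </item>
      <item> berlin </item>
    </one-of>
  </rule>
</grammar>"""


def expand(element, rules):
    """Return every token sequence (a list of tuples) that element can expand to."""
    if element.tag == "one-of":
        # Alternatives: the union of the expansions of each <item>.
        return [seq for item in element.findall("item") for seq in expand(item, rules)]
    if element.tag == "ruleref":
        # Only local references of the form "#rulename" are handled here.
        return expand(rules[element.get("uri")[1:]], rules)
    # <rule> and <item>: a sequence of text tokens and child elements.
    parts = [[tuple((element.text or "").split())]]
    for child in element:
        parts.append(expand(child, rules))
        parts.append([tuple((child.tail or "").split())])
    # One choice from each part, concatenated in order.
    return [sum(combo, ()) for combo in itertools.product(*parts)]


def phrases(grammar_xml, rule_id):
    """Return the phrases accepted by the named rule, as strings."""
    root = ET.fromstring(grammar_xml)
    rules = {rule.get("id"): rule for rule in root.iter("rule")}
    return [" ".join(seq) for seq in expand(rules[rule_id], rules)]


if __name__ == "__main__":
    for phrase in phrases(SAMPLE, "request"):
        print(phrase)
```

A sequence multiplies the choices of its parts, while <one-of> adds them, which is why a few alternatives placed in sequence can describe a large number of phrases.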


The <count> element indicates that the expansion it contains is optional (zero or one occurrences), or may occur zero-or-more or one-or-more times.

this is <count num="optional">not</count> good
this is <count num="0+">very</count> good
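
The effect of <count> resembles the ?, * and + quantifiers of regular expressions. The following Python sketch translates a flat fragment of tokens and <count> elements into a regular expression and tests phrases against it. The values "optional" and "0+" come from the examples above; "1+" is assumed by analogy for one-or-more. The function names to_regex and accepts are hypothetical, and a speech recognizer would not match text this way.

```python
import re
import xml.etree.ElementTree as ET

# "optional" and "0+" appear in the article; "1+" is an assumption by analogy.
QUANTIFIER = {"optional": "?", "0+": "*", "1+": "+"}


def to_regex(fragment_xml):
    """Compile a flat fragment of tokens and <count> elements into a regex.

    Assumes every child element is a <count> that contains only plain tokens.
    """
    root = ET.fromstring(f"<item>{fragment_xml}</item>")

    def words(text):
        # Each token must be followed by white space in the normalized phrase.
        return [re.escape(word) + r"\s+" for word in (text or "").split()]

    pieces = words(root.text)
    for child in root:
        inner = "".join(words(child.text))
        pieces.append(f"(?:{inner}){QUANTIFIER[child.get('num')]}")
        pieces += words(child.tail)
    return re.compile("".join(pieces))


def accepts(fragment_xml, phrase):
    """Return True if the phrase is covered by the fragment."""
    # Normalize spacing and add a trailing space to match the final token.
    normalized = " ".join(phrase.split()) + " "
    return to_regex(fragment_xml).fullmatch(normalized) is not None


if __name__ == "__main__":
    optional = 'this is <count num="optional">not</count> good'
    repeated = 'this is <count num="0+">very</count> good'
    print(accepts(optional, "this is not good"))
    print(accepts(repeated, "this is very very good"))
```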

Continued...

Footnote 1: While we fully expect that the dialog language from the W3C will be called VoiceXML 2.0, it's not official until we have the first public document from the W3C using this name.

 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).