Volume 2, Issue 5 - Jul/Aug 2002
   
 

Update on SSML

By Daniel C. Burnett

The Speech Synthesis Markup Language (SSML) [1], as its name implies, provides a standardized annotation for instructing speech synthesizers on how to convert written language input into spoken language output. This language has been under development within the Voice Browser Working Group (VBWG) of the World Wide Web Consortium (W3C) for a few years. This article provides a brief update on the status and future of SSML. For background on SSML and an introduction to its features, see [2].
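As a brief illustration, here is a sketch of a small SSML document (the namespace URI and version value follow the current Working Draft; attribute values such as the prosody rate are only suggestions to the processor):

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    Your flight departs at <emphasis> nine thirty </emphasis> tomorrow.
    <break/>
    <prosody rate="slow"> Please arrive two hours early. </prosody>
  </speak>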

Status

In the previous VoiceXML Review article on SSML [2], the authors indicated that SSML was "nearly completed". Although the January 2001 version of the specification was issued as a Last Call Working Draft (WD) [3], it was found to have a number of contentious items. In April of this year, the Voice Browser Working Group issued another Working Draft [1] (not a Last Call this time) with some minor content changes. The group is now working towards publication of a new Last Call WD.

Changes in the most recent Working Draft

The April 2002 draft has a fairly small number of changes from the January 2001 draft. It was released primarily to provide XML Schema support for use in VoiceXML [4] and to bring the definition of valid SSML documents in line with that in the other Voice Browser Working Group specifications.

Schema

Programmers in the world of XML are probably familiar with Document Type Definitions (DTDs) [5], documents that define rough syntax constraints for XML documents. Although DTDs can provide some help with XML document validation, they are notoriously weak at representing complicated mixed content models (both elements and text permitted as content) and cross-element constraints. XML Schema is a more powerful constraint language, allowing validating parsers to catch more of a markup language's syntactic requirements.
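To illustrate the difference (element names are taken from SSML, with the content model simplified), a DTD can only list which children may be interleaved with text, while a schema expresses the same mixed content model with typed, reusable declarations:

  <!-- DTD: any interleaving of text and these elements is valid. -->
  <!ELEMENT speak (#PCDATA | voice | say-as | prosody)*>

  <!-- XML Schema: the same mixed content model, but with typed
       attributes and declarations that other schemas can reuse. -->
  <xsd:complexType name="speakType" mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="voice"/>
      <xsd:element ref="say-as"/>
      <xsd:element ref="prosody"/>
    </xsd:choice>
  </xsd:complexType>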

The W3C has now moved from encouraging the use of XML Schema to the stronger position of explicitly discouraging the use of DTDs. While the creation of a schema when you already have a DTD is fairly straightforward, the fact that SSML is expected to be embedded in other markup languages (of which VoiceXML is the first example) brought additional requirements to the table:

  1. the need to be able to incorporate SSML elements into the host language namespace
  2. the need to modify the SSML elements to add host language-specific attributes and functionality

In the SSML specification the DTD is now informational only, while the schema provides the normative definition of the syntax of the language.
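As a sketch of how such embedding might look (the host-language namespace and the schemaLocation value here are hypothetical; only the SSML namespace is taken from the specification), a host language's schema could import the SSML definitions and then reference them:

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
              xmlns:ssml="http://www.w3.org/2001/10/synthesis"
              targetNamespace="http://www.example.org/host-language">
    <!-- Pull in the normative SSML element definitions. -->
    <xsd:import namespace="http://www.w3.org/2001/10/synthesis"
                schemaLocation="synthesis.xsd"/>
    <!-- Host-language element types can now reference SSML elements,
         e.g. to permit <ssml:prosody> inside a prompt. -->
  </xsd:schema>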

Document structure

All of the Voice Browser Working Group specifications have undergone revision in how they define valid documents. The most recent Working Draft of SSML brought the definition of valid SSML documents in line with the definitions of valid VoiceXML and SRGS [6] documents. The most obvious change is the addition of the "SSML documents" section (now section 3). This section describes the headers required and permitted for valid SSML documents.

Some key things to note here are that

  1. the XML declaration is required
  2. a DOCTYPE declaration is optional, but if present should reference the public and system identifiers given in section 3.1
  3. the <speak> root element is required, and the xmlns attribute must specify the SSML namespace as given in section 3.1

There is also now a version attribute on <speak>.
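Putting these requirements together, a minimal valid SSML document might look like the following sketch (the namespace URI and version value are those given in the draft; the DOCTYPE identifiers are elided here):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- The DOCTYPE is optional; if present it must use the public and
       system identifiers from section 3.1. -->
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    Hello, world.
  </speak>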

The Conformance section (now section 4) has been cleaned up a bit as well, again following the example of the VoiceXML and SRGS specifications. In particular, it more clearly distinguishes SSML fragments from SSML documents and specifies the requirements for each.

Other miscellaneous changes

In an attempt to move non-normative sections out of the main body, the Future Study, Examples, and DTD sections have been moved into appendices. There are also new appendices on:

  • Audio file formats -- This appendix lists the audio formats that an SSML processor must be able to read and play out.
  • Internationalization -- It is important that the syntax of SSML be able to indicate the input language (the language in which the text content is written) and the output language/dialect/speaker. Of course, SSML is not a universal translator and will not arbitrarily convert text written in one human language into the spoken form of some other human language. Also, there is no guarantee that any particular written (input) or spoken (output) language will be supported by a given SSML processor. Nevertheless, it would be convenient to allow the output language for items like dates (say, in a year-month-day format) to be changed with no more work than updating a flag indicating the output language or speaker.
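A sketch of that convenience, reusing the draft's <say-as> element (whether xml:lang can legitimately select the output language this way is one of the open questions discussed later in this article):

  <!-- The written form stays fixed; only the language indication
       changes. -->
  <say-as type="date:ymd" xml:lang="en-US"> 2002-07-15 </say-as>
  <say-as type="date:ymd" xml:lang="fr-FR"> 2002-07-15 </say-as>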

The availability and operating characteristics of features such as pitch, volume, rate, etc. depend heavily on the synthesis technology used in any given implementation. Thus, requests for changes in these values via SSML elements will have varied effects across platforms. While some amount of testing can be done to increase interoperability at a gross level, different engines will never produce the same synthesized output given the same input. The most recent Working Draft includes a new section, section 1.4, whose purpose is to forewarn a potential user of SSML about this issue.

The references section has also been cleaned up a bit and converted into the format used by the other Voice Browser WG specifications.

What to expect in the future

Any changes for the next draft are likely to fall into two categories: clarifications of ambiguous or confusing features and text, and the addition of features requested or encouraged by other groups in the W3C. Two portions of the specification that were vague in the last Working Draft are the use of the xml:lang attribute and the <say-as> element.

Clarification and refinement

xml:lang

In the XML namespace there is an attribute, lang [7]. The valid values for this attribute are the human language identifiers defined by IETF RFC3066 [8]. In most specifications, this attribute indicates the language in which the enclosed text is written. While it has that meaning in SSML as well, in both the <voice> and <say-as> elements there is a question as to whether the xml:lang attribute also indicates what the output language/voice should be. If so, does it represent both the written language and the intended output language? If not, how are the two distinguished? These questions arise because SSML is a markup language that governs how text is spoken, rather than one that merely consumes text as input.
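For example, it is not obvious what the following fragment requests (a sketch; the document header follows the examples above):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    <!-- Does xml:lang here declare that the enclosed text is written
         in French, request a French-speaking voice, or both? -->
    <voice xml:lang="fr"> Bonjour tout le monde. </voice>
  </speak>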

say-as

Whether or not the original intention of the <say-as> element was clear, it is not clear today. Although this element has some characteristics of a formatter for spoken output, its most common use is as an input formatter -- indicating how the text content should be interpreted ("interpret-as") when it might otherwise be ambiguous, as in <say-as type="date"> 5/12 </say-as>. The optional format specifier can in this case clarify whether this is May 12 ("date:md"), December 5 ("date:dm"), May 2012 ("date:my"), etc.

In one sense the <say-as> element is unnecessary: any text to be spoken can always be written out in the SSML document in its full orthographic form and will be spoken as such. In other words, if you really care about which words are spoken, you can just write them out yourself. However, it is extremely common for an application to receive data such as a date or time in a compact form (such as "1/1/2000" or "23:59:59") and expect the synthesis engine to be able to render it. In fact, most synthesis engines have significant built-in knowledge about how best to read out dates, times, etc. -- frequently more knowledge than the application author has.
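The two approaches can be contrasted in a short sketch (the type value follows the format-specifier convention described above):

  <!-- Fully written out: the author controls exactly what is spoken. -->
  The deadline is January first, two thousand.

  <!-- Compact form: the engine's built-in knowledge renders the date. -->
  The deadline is <say-as type="date:mdy"> 1/1/2000 </say-as>.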

Given the input-vs-output role confusion of this element, there are at least two problems:

  1. Descriptions of some of the types (currency, measure, and address in particular) are too sketchy to determine either how the input should be interpreted or how it should be spoken.
  2. The output, if not otherwise specified, is assumed to be based on the current locale. This is problematic for applications that routinely output the same content into multiple languages. In order to successfully build such applications, authors must do these conversions outside the markup itself, even if the underlying synthesis engine is capable of such transformations.

The <say-as> element is a significant convenience, but in order for it to reach its full potential, it needs to better distinguish between and allow for indication of the input and output formats.

Alignment with other W3C work

The second category of likely changes is in the area of features encouraged by other work in the W3C. Both the VoiceXML specification and the Speech Recognition Grammar Specification have added support for xml:base and <metadata>/rdf, so it is reasonable to consider that these might be added to SSML at some point in the future.

xml:base

Many HTML programmers are familiar with the <BASE> element, which sets the base path for resolution of relative URIs. The XML Base specification [9] standardizes this by establishing a common attribute in the XML namespace, xml:base, that can be used to indicate the base path. Although a base path for relative URI resolution can often be obtained from protocol header information, it is still convenient to be able to set the path directly within the document.
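If xml:base were adopted, an SSML document could resolve relative audio references against a declared base, as in this sketch (the <audio> element and its src attribute are from the SSML draft; the URIs are hypothetical):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:base="http://www.example.com/prompts/"
         xml:lang="en-US">
    <!-- With xml:base support, this relative URI would resolve to
         http://www.example.com/prompts/welcome.wav -->
    <audio src="welcome.wav"> Welcome! </audio>
  </speak>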

metadata/rdf

The <metadata> element in VoiceXML and SRGS provides a mechanism for expressing information about the document. Both recommend the use of the Resource Description Framework (RDF) syntax [10] and schema [11] as the content format for this element. RDF "provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web." ([4], section 6.2.2).
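If SSML adopts the same mechanism, a document header might carry something like the following sketch (the <metadata> element and RDF pattern follow the VoiceXML and SRGS examples; the property values are hypothetical):

  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Statements about this document: its title and author. -->
      <rdf:Description rdf:about=""
                       dc:title="Welcome prompt"
                       dc:creator="Jane Doe"/>
    </rdf:RDF>
  </metadata>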

This element (with suggested content structure) is part of the W3C's Semantic Web Initiative, an attempt to develop standard ways of representing the meaning of XML-structured data on the World Wide Web. As such, it is likely that such a capability will be encouraged for SSML.

Conclusion

Although the movement of SSML from a Last Call Working Draft back to a generic Working Draft may at first suggest that the specification is not progressing, the contrary is true. The most recent Working Draft represents an improvement in clarity over the prior version and sets the stage for clearing out some of the last substantial ambiguities in the specification. It also paves the way for the introduction of features that connect it more fully with the W3C's vision for the World Wide Web. The Conclusion of the previous article on SSML began, "Widespread adoption of SSML by TTS engine developers may energize the development of new classes of speech-enabled applications . . . ." [2] SSML is already supported by a handful of text-to-speech engines, with more expected as the specification moves closer to Recommendation.

References

[1] D. C. Burnett, M. R. Walker and A. Hunt, editors, Speech Synthesis Markup Language Specification, W3C Working Draft, April 5, 2002, work in progress. (http://www.w3.org/TR/2002/WD-speech-synthesis-20020405/)

[2] M. R. Walker and A. Hunt, "The Speech Synthesis Markup Language for the W3C VoiceXML Standard", VoiceXML Review, April 2001, Feature article #2. (https://voicexmlreview.org/Apr2001/features/ssml1.html)

[3] M. R. Walker and A. Hunt, editors, Speech Synthesis Markup Language Specification, W3C Last-Call Working Draft, Jan 3, 2001, work in progress. (http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/)

[4] S. McGlashan et al., editors, Voice Extensible Markup Language (VoiceXML) Version 2.0, W3C Last-Call Working Draft, April 24, 2002, work in progress. (http://www.w3.org/TR/2002/WD-voicexml20-20020424/)

[5] See Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML), ISO 8879:1986. (http://www.iso.ch/cate/d16387.html)

[6] A. Hunt and S. McGlashan, editors, Speech Recognition Grammar Specification Version 1.0, W3C Candidate Recommendation, June 26, 2002, work in progress. (http://www.w3.org/TR/2002/CR-speech-grammar-20020626/)

[7] See Section 2.12 of T. Bray, et al., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, October 6, 2000. (http://www.w3.org/TR/2000/REC-xml-20001006/)

[8] H. Alvestrand, Tags for the Identification of Languages, IETF RFC3066, January 2001. (http://www.ietf.org/rfc/rfc3066.txt)

[9] J. Marsh, editor, XML Base, W3C Recommendation, June 27, 2001. (http://www.w3.org/TR/2001/REC-xmlbase-20010627/)

[10] O. Lassila and R. R. Swick, editors, Resource Description Framework (RDF) Model and Syntax Specification, W3C Recommendation, February 22, 1999. (http://www.w3.org/TR/REC-rdf-syntax/)

[11] D. Brickley and R.V. Guha, editors, Resource Description Framework (RDF) Schema Specification, W3C Candidate Recommendation, March 27, 2000, work in progress. (http://www.w3.org/TR/2000/CR-rdf-schema-20000327/)




 
