VoiceXML is a language for creating voice-user interfaces, particularly for the telephone. It uses speech recognition and touchtone (DTMF keypad) for input, and pre-recorded audio and text-to-speech synthesis (TTS) for output. It is based on the World Wide Web Consortium's (W3C's) Extensible Markup Language (XML), and leverages the web paradigm for application development and deployment. By having a common language, application developers, platform vendors, and tool providers can all benefit from code portability and reuse.
With VoiceXML, speech recognition application development is greatly simplified by using familiar web infrastructure, including tools and Web servers. Instead of using a PC with a Web browser, any telephone can access VoiceXML applications via a VoiceXML "interpreter" (also known as a "browser") running on a telephony server. Whereas HTML is commonly used for creating graphical Web applications, VoiceXML can be used for voice-enabled Web applications.
There are two schools of thought regarding the use of VoiceXML:
As a way to voice-enable a Web site, or
As an open-architecture solution for building next-generation interactive voice response telephone services.
One popular type of application is the voice portal, a telephone service where callers dial a phone number to retrieve information such as stock quotes, sports scores, and weather reports. Voice portals have received considerable attention lately, and demonstrate the power of speech recognition-based telephone services. They are, however, certainly not the only application for VoiceXML. Other application areas, including voice-enabled intranets and contact centers, notification services, and innovative telephony services, can all be built with VoiceXML.
By separating application logic (running on a standard Web server) from the voice dialogs (running on a telephony server), VoiceXML and the voice-enabled Web allow for a new business model for telephony applications known as the Voice Service Provider. This permits developers to build phone services without having to buy or run equipment.
Although VoiceXML was originally designed for building telephone services, other applications, such as speech-controlled home appliances, are starting to be developed.
VoiceXML Features
The rapid growth of the Web was due largely to its open architecture and high-level common interfaces to differing computing resources. HTML and HTTP hide much of the complexity of building interactive applications. Just as an HTML developer doesn't need to know how bits paint the screen of a web user's PC, VoiceXML shields developers from many of the complexities of telephony platforms.
VoiceXML has features to control audio output; audio input; presentation logic and control flow; event handling; and basic telephony connections. These and other features are described as follows:
Dialogs
Audio Output
  Speech synthesis controls (text-to-speech, or TTS)
  Pre-recorded audio (files or streams)
Audio Input
  Speech recognition (ASR)
  Audio recording
  Touchtone (Dual-tone Multi-Frequency, or DTMF)
Presentation logic
  Control flow
  ECMAScript client-side scripting
  Server-side/dynamic content generation
Event handling
  Bad input
  Shorthand
Basic Connection Control
  Call transfer and bridging
  Disconnect
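To make the feature list more concrete, the sketch below holds a small dialog in a Python string and checks that it is well-formed XML. It is an illustration, not part of the original text: the element names follow VoiceXML 2.0 syntax as an assumption, and the prompt wording, the field name `city`, the grammar file `city.grxml`, and the URL `/weather` are all hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "{http://www.w3.org/2001/vxml}"

# A minimal VoiceXML 2.0-style dialog (illustrative; names and URLs are hypothetical).
WEATHER_DIALOG = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <!-- Audio output: text-to-speech prompt -->
      <prompt>Which city would you like the weather for?</prompt>
      <!-- Audio input: speech recognition grammar (hypothetical file) -->
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <!-- Event handling: the caller said nothing, or was not understood -->
      <noinput>Sorry, I did not hear you. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
      <!-- Control flow: send the recognized value to the application server -->
      <filled>
        <submit next="/weather" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>
"""


def element_names(document: str) -> list[str]:
    """Parse the document and return its element names in document order."""
    root = ET.fromstring(document.encode("utf-8"))
    # Skip anything that is not a regular element (e.g. comments, if retained).
    return [el.tag.replace(VXML_NS, "") for el in root.iter()
            if isinstance(el.tag, str)]


if __name__ == "__main__":
    print(element_names(WEATHER_DIALOG))
```

Parsing only confirms that the markup is well-formed; it does not validate the dialog against the VoiceXML schema or run it. A real interpreter on a telephony platform would be needed for that.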
Beyond the scope of the language are application logic, state management, dialog generation and sequencing, database operations, and interfaces to legacy systems (e.g., "screen scraping"). These are handled by traditional Web application programming techniques.
Architecture
A VoiceXML application consists of several components, as shown in Figure 1:
Application Server: Typically a Web server, which runs the application logic, and may contain a database or interfaces to an external database or transaction server.
VoiceXML Telephony Server: A platform that runs a VoiceXML interpreter that acts as a client to the application server. The interpreter understands VoiceXML dialogs and controls speech and telephony resources. These resources include ASR, TTS, audio play and record functions, as well as a telephone network interface.
Internet-style network: A TCP/IP-based packet network that connects the application server and telephony server via HTTP.
Telephone Network: Typically the Public Switched Telephone Network (PSTN), but could be a private telephone network (e.g. PBX), or VoIP packet network.
Caller: Any telephone that can connect to the telephone network.
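The application server's role in this architecture can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the `FORECASTS` data, the `city` query parameter, the handler and function names, and the port are hypothetical, and the markup again assumes VoiceXML 2.0 syntax. The interpreter on the telephony server would fetch the page over HTTP, much as a Web browser fetches HTML.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from xml.sax.saxutils import escape

# Hypothetical data; a real application would query a database or back end.
FORECASTS = {"boston": "sunny", "seattle": "rainy"}


def render_forecast(city: str) -> str:
    """Application logic: build a VoiceXML page that speaks one forecast."""
    forecast = FORECASTS.get(city.lower(), "unavailable")
    # Escape the text so characters such as '&' cannot break the XML.
    text = escape(f"The forecast for {city} is {forecast}.")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        f'  <form><block><prompt>{text}</prompt><disconnect/></block></form>\n'
        '</vxml>\n'
    )


class VoiceAppHandler(BaseHTTPRequestHandler):
    """Answers HTTP GET requests from the VoiceXML interpreter."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        city = query.get("city", ["Boston"])[0]
        body = render_forecast(city).encode("utf-8")
        self.send_response(200)
        # Assumed media type for VoiceXML documents.
        self.send_header("Content-Type", "application/voicexml+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the application server until interrupted (not called here)."""
    HTTPServer(("", port), VoiceAppHandler).serve_forever()
```

Calling `serve()` starts the server. The speech and telephony resources stay entirely on the telephony server, so the application side needs nothing beyond ordinary Web tooling.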