An Introduction to Speech Recognition
(Continued from Part 1)
How it Works
Now that we've discussed some of the basic terms and concepts involved in speech recognition, let's put them together and take a look at how the speech recognition process works.
As you can probably imagine, the speech recognition engine has a rather complex task to handle. The task involves translating raw audio input into recognized text that an application can process. As shown in Figure 1 below, the major components we want to discuss are:
- Audio input
- Grammar(s)
- Acoustic Model
- Recognized text
Figure 1: Components of a Speech Recognition Engine
The first thing we want to take a look at is the audio input coming into the recognition engine. It is important to understand that this audio stream is rarely pristine. It contains not only the speech data (what was said) but also background noise. This noise can interfere with the recognition process, and the speech engine must handle (and possibly even adapt to) the environment in which the speech is captured.
As we've discussed, it is the job of the speech recognition engine to convert spoken input into text. To do this, it employs all sorts of data, statistics, and software algorithms. Its first job is to process the incoming audio signal and convert it into a format best suited for further analysis. Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating (for VoiceXML, this is the telephony environment). The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string.
Most speech engines try very hard to find a match, and are usually very "forgiving." But it is important to note that the engine is always returning its best guess for what was said.
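The "best guess" behavior described above can be illustrated with a toy sketch. This is not how a real engine works internally: a real engine scores acoustic features against an acoustic model, whereas this example substitutes plain string similarity as a stand-in. The function name `best_match`, the sample grammar, and the misspelled input are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(hypothesis, grammar):
    """Return the (phrase, score) pair from the grammar closest to the hypothesis.

    String similarity is an assumed stand-in for the acoustic scoring
    a real speech engine would perform.
    """
    scored = [
        (phrase, SequenceMatcher(None, hypothesis.lower(), phrase.lower()).ratio())
        for phrase in grammar
    ]
    # The highest-scoring phrase wins, however poor the score is.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical active grammar for a banking application.
grammar = ["check balance", "transfer funds", "speak to an agent"]

# An indistinct utterance, modeled here as a misspelled string.
phrase, score = best_match("chek balence", grammar)
print(phrase, round(score, 2))
```

Note that `best_match` always returns some phrase from the grammar, even for input that resembles none of them, which mirrors the point that the engine's result is only ever its best guess.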
Acceptance and Rejection
When the recognition engine processes an utterance, it returns a result. The result can be one of two states: acceptance or rejection. An accepted utterance is one in which the engine returns recognized text.
Whatever the caller says, the speech recognition engine tries very hard to match the utterance to a word or phrase in the active grammar. Sometimes the match may be poor because the caller said something that the application was not expecting, or the caller spoke indistinctly. In these cases, the speech engine returns the closest match, which might be incorrect. Some engines also return a confidence score along with the text to indicate the likelihood that the returned text is correct.
Not every utterance that the speech engine processes is accepted, however. The engine flags each processed utterance as either accepted or rejected.
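One way to picture the accept/reject decision is as a comparison against a confidence threshold. The sketch below is an assumption for illustration only: the text does not say how an engine decides, and the `RecognitionResult` type, the `decide` function, and the 0.5 threshold are hypothetical names and values rather than part of any particular engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    accepted: bool
    text: Optional[str]   # recognized text, present only when accepted
    confidence: float     # assumed to lie between 0.0 and 1.0


def decide(text, confidence, threshold=0.5):
    """Flag an utterance as accepted or rejected.

    The threshold value is an illustrative assumption, not a standard.
    """
    if confidence >= threshold:
        return RecognitionResult(True, text, confidence)
    # A rejected utterance carries no recognized text.
    return RecognitionResult(False, None, confidence)


good = decide("check balance", 0.92)
poor = decide("check balance", 0.18)
print(good.accepted, poor.accepted)
```

An application would typically act on the text of an accepted result and re-prompt the caller after a rejected one.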
Speech Recognition in the Telephony Industry
VoiceXML uses speech recognition over the telephone, and this introduces some unique challenges. First and foremost is the bandwidth of the audio stream. The plain old telephone system (POTS), as we know and love it, uses an 8 kHz audio sampling rate. This yields a much lower bandwidth than, say, the desktop, which uses a 22 kHz sampling rate. The quality of the audio stream is considerably degraded in the telephony environment, thus making the recognition process more difficult.
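The link between sampling rate and bandwidth follows from the Nyquist limit: audio sampled at a given rate can represent frequencies only up to half that rate. A short calculation using the two sampling rates quoted above shows the difference. The function name `nyquist_limit_hz` is a hypothetical helper chosen for this sketch.

```python
def nyquist_limit_hz(sampling_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2


telephone = nyquist_limit_hz(8_000)    # 4000.0 Hz
desktop = nyquist_limit_hz(22_000)     # 11000.0 Hz

print(f"Telephone audio carries frequencies up to {telephone:.0f} Hz")
print(f"Desktop audio carries frequencies up to {desktop:.0f} Hz")
```

Any speech energy above the telephone limit is simply unavailable to the recognition engine.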
The telephony environment can also be quite noisy, and the equipment is quite variable. Users may be calling from their homes, their offices, the mall, the airport, their cars--the possibilities are endless. They may also call from cell phones, speaker phones, and regular phones. Imagine the challenge that is presented to the speech recognition engine when a user calls from the cell phone in her car, driving down the highway with the windows down and the radio blasting!
Another consideration is whether or not to support barge-in. Barge-in (also known as cut-thru) refers to the ability of a caller to interrupt a prompt as it is playing, either by saying something or by pressing a key on the phone keypad. This is often an important usability feature for expert users looking for a "fast path" or in applications where prompts are necessarily long.
When the caller barges in with speech, it is essential that the prompt be terminated immediately (or, at least, that the caller perceive it to be immediate). If there is any noticeable delay (more than 300 milliseconds) between when the caller says something and when the prompt ends, the caller often assumes that the system did not hear what was said and repeats it, leaving both the caller and the system in a confusing situation. This is known as the "Stuttering Effect."
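The 300 millisecond figure above lends itself to a simple latency check. The sketch below assumes that an application can log two timestamps, the moment caller speech is detected and the moment the prompt actually stops. Both timestamps, the function names, and the constant name are hypothetical and do not belong to any VoiceXML platform API.

```python
STUTTER_THRESHOLD_MS = 300  # noticeable-delay figure taken from the text above


def prompt_stop_delay_ms(speech_start_ms, prompt_stop_ms):
    """Delay between detected caller speech and the end of the prompt."""
    return prompt_stop_ms - speech_start_ms


def risks_stuttering(speech_start_ms, prompt_stop_ms,
                     threshold_ms=STUTTER_THRESHOLD_MS):
    """True when the delay is long enough that the caller may repeat themselves."""
    return prompt_stop_delay_ms(speech_start_ms, prompt_stop_ms) > threshold_ms


# Hypothetical log entries, in milliseconds from the start of the call.
quick = risks_stuttering(speech_start_ms=1000, prompt_stop_ms=1250)
slow = risks_stuttering(speech_start_ms=1000, prompt_stop_ms=1400)
print(quick, slow)  # False True
```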
Another phenomenon related to barge-in is "Lombard Speech." Lombard Speech refers to the tendency of people to speak louder in noisy environments, in an attempt to be heard over the noise. Callers barging in tend to speak louder than they need to, which can be problematic in speech recognition systems. Speaking louder doesn't help the speech recognition process. On the contrary, it distorts the voice and hinders recognition.
Conclusions
Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses. VoiceXML ties speech recognition and telephony together and provides the technology with which businesses can develop and deploy voice-enabled Web solutions today! These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access, and, at the same time, leverage a business' existing Web investments. Speech recognition and VoiceXML clearly represent the next wave of the Web.
Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).