VoiceXML Review - Columns - Speak & Listen

Volume 3, Issue 4 - July/August 2003

In this monthly column, an industry expert will answer common questions about VoiceXML and related technologies. Readers are encouraged to submit questions about VoiceXML, including development, voice-user interface design, and speech technology in general, or how VoiceXML is being used commercially in the marketplace. If you have a question about VoiceXML, e-mail it to and be sure to read future issues of VoiceXML Review for the answer.

Q: How do I begin playing an audio file starting from an offset?

A: You can implement code on your Web server that accepts an HTTP request including the following parameters:

- The path to the audio file on the server or an uploaded audio file (e.g. via a submitted variable)
- The offset from which you want the interpreter to begin playing

The code parses, modifies, and sends the updated audio file header to the VoiceXML interpreter. The code then seeks to the designated offset in the data segment of the original file and streams the remainder of the file back to the interpreter.
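The header rewrite described above can be sketched in a few lines. This is a minimal illustration written in JavaScript rather than a production handler, and it rests on several assumptions that do not come from the column: the function name `trimWav` is hypothetical, the offset is taken in milliseconds, and the file is assumed to be uncompressed PCM with the common 44-byte layout (one "fmt " chunk immediately followed by the "data" chunk). Files with extra chunks, and the other formats in Appendix E, would need more parsing.

```javascript
// Sketch: trim the start of a PCM WAV file held in memory.
// Assumes the common 44-byte header layout; not a general RIFF parser.
function trimWav(bytes, offsetMs) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const byteRate = view.getUint32(28, true);   // bytes of audio per second
  const blockAlign = view.getUint16(32, true); // bytes per sample frame
  const dataSize = view.getUint32(40, true);   // length of the data segment

  // Convert the offset to bytes and round down to a whole sample frame.
  let skip = Math.floor((byteRate * offsetMs) / 1000);
  skip -= skip % blockAlign;
  skip = Math.min(skip, dataSize);

  const out = new Uint8Array(44 + dataSize - skip);
  out.set(bytes.subarray(0, 44));                         // copy the header
  out.set(bytes.subarray(44 + skip, 44 + dataSize), 44);  // copy remaining audio

  // Patch both size fields so the header matches the shorter data segment.
  const outView = new DataView(out.buffer);
  outView.setUint32(4, 36 + dataSize - skip, true);  // RIFF chunk size
  outView.setUint32(40, dataSize - skip, true);      // data chunk size
  return out;
}
```

A streaming server would write the patched 44 bytes first and then copy the file from byte `44 + skip` onward instead of building the whole result in memory.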

Appendix E of the VoiceXML 2.0 specification describes the various audio file formats that a VoiceXML interpreter must support. A robust server-side script should be able to handle all of these formats. If you know the format of the audio files that comprise your voice application, your job is easier.

Even if you only have to deal with a single type of audio file, unraveling the file format may seem daunting.
Fortunately, open source utilities are available to ease the burden. For example, you can use "Sound Exchange" (SOX).

1. Download SOX from http://sox.sourceforge.net/
2. Compile it into an executable (Windows developers can use the Win32 binary 'out of the box')
3. Run SOX, passing the following four arguments:

- The path to the input file
- The path to the output file
- The value 'trim' (without the quotes)
- The number of milliseconds to remove from the beginning of the file.

The following command creates a new file, "sample-new.wav", containing the contents of "sample.wav" excluding the first 15 seconds.

sox sample.wav sample-new.wav trim 15000

If performance is not an issue (it's best to avoid creating new processes and writing large files to disk in response to an HTTP request), you can call this utility from a CGI script running on your Web server. If your application requires optimal performance, you can write code that integrates more tightly into your Web server environment. SOX is not a bad place to start, however, and an example CGI script follows:

#!/usr/local/bin/perl -w
use strict;
use CGI qw(param);

# TODO: location of SOX utility on your Web server
use constant SOX => "sox";
# TODO: base location of the audio files
use constant AUDIODIR => "/var/audio/myapp";
# TODO: temporary location to store 'trimmed' files
use constant TEMPDIR => "/var/tmp/";

my $wav = param("wav");
my $offset = param("offset");

if (!$wav)
{
    Log("Missing parameter 'wav'");
    print "Status: 400 Bad Request\n\n";
    exit(0);
}

# SECURITY: don't allow unrestricted access to your file system
$wav =~ s/(^[A-Za-z]:[\/\\]|^[\/\\]|\.\.)//g;
# AUDIODIR has no trailing slash, so add the separator explicitly
my $in = AUDIODIR . "/" . $wav;

if (!-f $in)
{
    Log("Invalid path: $in");
    print "Status: 400 Bad Request\n\n";
    exit(0);
}

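# send the HTTP header, followed by the blank line that ends the header block
print <<HEADER;
Content-Type: audio/x-wav

HEADER

my $tmpfile = $in;
if (defined($offset) && $offset > 0)
{
    # the process ID keeps concurrent requests from sharing a temp file
    $tmpfile = TEMPDIR . "trim-$$.wav";
    Log("Running SOX on $in to produce $tmpfile [$offset]");
    # list form of system() bypasses the shell, so $offset is not interpreted
    system(SOX, $in, $tmpfile, "trim", $offset);
}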

# enable autoflush; aka disable output buffering
my $old = select STDOUT; $| = 1; select $old;

# open the temp file and stream it to the client/interpreter
if (!open(HAUDIO, "<$tmpfile"))
{
    Log("Cannot open $tmpfile: $!");
    exit(1);
}
binmode HAUDIO, ':raw';
binmode STDOUT, ':raw'; # keep the audio bytes intact on Windows
while (<HAUDIO>)
{
    print $_;
}
close HAUDIO;

if (defined($offset) && $offset > 0)
{
    unlink $tmpfile; # clean up
}

# diagnostic messages; check server log
sub Log
{
    print STDERR $_[0] . "\n";
}

The script accepts two parameters:

wav - the name of the .wav file to be returned to the browser
offset - the offset into the file where playing should begin

Given these two parameters, the script checks if the file exists.
If not, it returns an HTTP error to the client. Otherwise, it
sends an HTTP "Content-Type" header indicating the response
contains an audio file. If the offset parameter is missing
or its value is not greater than zero, the script simply returns
the original audio file. Otherwise, it runs SOX to generate the trimmed audio to a temporary file,
returns the contents of the temporary file to the client, and deletes the temporary file.

To use this script on your own Web server, you'll need Perl.
You'll also need to update your Web server configuration to allow Perl scripts to be executed,
and to update the constants at the top of the script to point to valid locations
specific to your server environment.

Now that we've got a server-side script that trims audio files, let's use it.
The following snippet of VoiceXML calls the CGI, passing both the 'wav' and 'offset' parameters.

[VoiceXML snippet lost in extraction; only its namespace declaration, xmlns="http://www.w3.org/2001/vxml", survives.]
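One way to form such a request from ECMAScript is sketched here. The script URL held in `TRIM_CGI` and the helper name `trimUrl` are hypothetical, chosen only for illustration; the parameter names 'wav' and 'offset' are the ones the CGI script reads. The sketch also assumes the interpreter's ECMAScript engine provides the standard `encodeURIComponent` function.

```javascript
// Hypothetical script location; substitute the URL where the CGI is installed.
var TRIM_CGI = "http://www.example.com/cgi-bin/trim.pl";

// Build the request URL carrying the 'wav' and 'offset' parameters.
// Both values are escaped so that slashes and spaces survive the query string.
function trimUrl(wav, offset) {
  return TRIM_CGI +
         "?wav=" + encodeURIComponent(wav) +
         "&offset=" + encodeURIComponent(String(offset));
}
```

The resulting string could be supplied to an audio element through its expr attribute, for example expr="trimUrl('sample.wav', 15000)", so that the interpreter fetches the trimmed audio from the CGI.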

Q: You've shown me how to play an audio file beginning at an offset. I want to write a voice mail
application that allows the user to move forward and back at any point while listening to a message.
How do I do that?

A: To allow the user to move forward and back while listening to a message, you need to determine
the position in the message where the user "barged in", for example, by pressing 3 on their telephone keypad
to move "forward" in the message, or by pressing 1 to move backward.
Although some VoiceXML interpreter implementations may support this feature today,
the VoiceXML 2.0 specification doesn't formally specify how this bargein data should be exposed.

If you want to stick to the standard and write code that's portable across implementations, you can obtain a rough approximation of where bargein occurred using two ECMAScript Date objects and some simple math. I need to emphasize that this will only provide you with an approximate result. Without native support for determining when a bargein occurred during audio playback, you will undoubtedly obtain slightly different behavior on different platforms due to disparities in execution and recognition performance. Because voice recognition performance will vary most markedly across implementations, it's best to stick to DTMF input for the commands that you allow while the message is being played back.

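Somewhere in your application, you'll have a dialog that plays back the current message in the user's voice mailbox. Before queueing the message (via the <audio> element), record the start time in an ECMAScript Date object. When the user barges in, compute the new offset as follows: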

elapsed = current_time - start_time
new_offset = old_offset + elapsed + increment - fudge_factor

Since the audio file containing the recorded message doesn't get altered each time you play it,
you'll need to keep track of the accumulated offsets (old_offset). The elapsed time is easy to calculate:
simply instantiate a new ECMAScript Date object and subtract the Date object that you initialized prior to queueing the message. The increment is up to you. In the sample code below, I've chosen an increment of plus or minus 5 seconds; see the variables FORWARD and REVERSE.

The fudge_factor will depend on platform performance. In the sample code below, I've chosen 3 seconds. See the variable FUDGE.
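The arithmetic can be sketched in ECMAScript as follows. The constant names FORWARD, REVERSE, and FUDGE come from the text; everything else is an assumption made for illustration: the helper name `newOffset` is hypothetical, all values are taken to be milliseconds (the unit that subtracting two Date objects yields), and the clamp to zero is an added guard rather than part of the formula.

```javascript
// Values follow the prose: +/- 5 second increment and a 3 second fudge factor.
var FORWARD = 5000;   // move ahead 5 seconds
var REVERSE = -5000;  // move back 5 seconds
var FUDGE = 3000;     // platform-dependent allowance; tune per interpreter

// startTime: Date created just before the message was queued.
// now:       Date created when the user's key press was handled.
// oldOffset: accumulated offset from earlier moves, in milliseconds.
function newOffset(oldOffset, startTime, now, increment) {
  var elapsed = now.getTime() - startTime.getTime();
  var offset = oldOffset + elapsed + increment - FUDGE;
  // Assumption: never ask the server to seek before the start of the message.
  return offset < 0 ? 0 : offset;
}
```

For instance, if the user presses the forward key 10 seconds into the first playback, newOffset(0, startTime, now, FORWARD) yields 12000, so the next request starts playback 12 seconds into the message.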

Once you've calculated the new offset, you'll pass the value to a server-side script like the one demonstrated in the Q&A above.

You'll probably modify the script to accept a unique message identifier that integrates with your backend message storage system.

Continued...