Speaker Authentication and VoiceXML: combining technologies for next-generation interactive voice applications
By Judith Markowitz and Kenneth Rehor
Speaker Authentication (SA) refers to speech-processing technologies that answer the question "Is this person likely to be who she/he claims to be?" Because it uses features of that person's voice to answer the question, SA is called a biometric, like automatic fingerprint or face recognition. Because a claim of identity has been made, the object is to authenticate that claim. Several interrelated technologies are used in SA:
- Speaker Verification determines whether a speaker is who she/he claims to be. It always involves a one-to-one matching between the voice of the claimant and a sample known to belong to the claimed identity. The system accepts or rejects the identity claim of the speaker. Speaker verification is always a form of SA, but SA is not always speaker verification.
- Speaker Identification attaches an identity to the voice of an unknown speaker and does not necessarily require a claim of identity. It always involves a one-to-many matching between the voice of the unknown speaker and voice samples for known individuals who have been entered into the system. Finding the matching identity may be restricted to individuals who have registered with the system (called closed-set identification), or the system may be allowed to conclude "This voice does not belong to any of the identities in the system" (called open-set identification). Closed-set identification is useful for small, relatively fixed groups (e.g., members of the biometrics SIG). Open-set identification is useful for rapidly-changing and large-scale operations, such as identification of terrorists. When a claim of identity is made (for example, "I'm a member of the VoiceXML Biometrics SIG group"), speaker identification is synonymous with speaker authentication.
- Speaker Recognition has two meanings. It is a generic term that covers both speaker identification and verification: the process of automatically recognizing who is speaking on the basis of individual information included in speech signals, whether or not that person has provided a claim of identity. Its second meaning is as a synonym for speaker identification.
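The difference between closed- and open-set identification is simply whether a "none of the above" outcome is allowed. As an illustration (not tied to any particular product), identification can be sketched as a nearest-neighbor search over stored voice models, with open-set operation adding a rejection threshold; the cosine-similarity scoring and the vectors here are toy stand-ins for real voice models:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (toy voice models here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identify(sample, enrolled, threshold=None):
    """Return the best-matching enrolled identity.

    Closed set: threshold=None -> always returns someone in the system.
    Open set:   threshold set  -> may return None ("not in the system").
    """
    best_id, best_score = None, -1.0
    for identity, model in enrolled.items():
        score = cosine_similarity(sample, model)
        if score > best_score:
            best_id, best_score = identity, score
    if threshold is not None and best_score < threshold:
        return None  # open-set rejection
    return best_id

enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
print(identify([0.85, 0.15, 0.25], enrolled))              # closed-set match
print(identify([0.0, 0.0, 1.0], enrolled, threshold=0.9))  # open-set rejection
```

The one-to-many loop is what distinguishes identification from verification, which compares against a single stored model.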
How SA works
Typically, SA involves a two-step process: enrollment/training and authentication. In the enrollment step an individual provides one or more samples of her/his voice. As in speech recognition, the speaker-authentication system extracts salient features from the spoken input and uses them to create a model of that person's voice. The model created during enrollment is stored in a database as the enrollee's reference model. Depending upon the system, that model can be a password/phrase, a series of utterances (e.g., 24 58 92 or Chicago, Illinois; Atlanta, Georgia; Los Angeles, California), or freely spoken speech.
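The enrollment step can be sketched in miniature. In this toy example (the fixed-length vectors stand in for real acoustic features such as cepstral coefficients), several enrollment utterances are reduced to feature vectors and averaged into a single reference model:

```python
def extract_features(utterance):
    # Stand-in for real acoustic feature extraction; in this sketch the
    # caller already supplies a fixed-length feature vector.
    return utterance

def enroll(samples):
    """Build a reference model by averaging the feature vectors
    extracted from several enrollment utterances."""
    vectors = [extract_features(s) for s in samples]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Three enrollment samples of the same (toy) utterance.
reference = enroll([[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]])
print(reference)  # stored in the database as the enrollee's reference model
```

Real systems build far richer models (hidden Markov models, Gaussian mixtures, or templates), but the flow — extract features, summarize them, store the summary under the enrollee's identity — is the same.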
The second step occurs when an unknown individual presents him/herself to the speaker-authentication system. The system obtains a claim of identity from that person. The claim of identity can take many forms: something the person tells the system (e.g., an employee ID number, a customer account number, their name), something the person presents to the system (e.g., a token), or something the system determines automatically (e.g., the telephone ANI). The system retrieves the voice model for that identity and asks the claimant to provide a voice sample. The voice model created from that new sample is compared with the reference model. The nature of the comparison varies with the underlying technology and is similar to that used for speech recognition. For example, some systems use stochastic processing on hidden Markov models or Gaussian classifiers, and some use dynamic time warping on templates. If the two models are sufficiently similar, the system accepts the claim of identity as true. If they differ significantly, the system rejects the claim as false and labels the claimant an impostor. How similar the two models must be is governed by a threshold set in the system. Some systems also update the reference model with the new information.
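The accept/reject decision reduces to a score compared against a threshold. A minimal sketch, using cosine similarity and invented account numbers, vectors, and threshold purely for illustration:

```python
import math

def similarity(a, b):
    # Toy model comparison; real systems score hidden Markov models,
    # Gaussian classifiers, or templates via dynamic time warping.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(claimed_identity, sample_model, reference_db, threshold=0.8):
    """Accept or reject a claim of identity by comparing the model built
    from the new voice sample against the stored reference model."""
    reference = reference_db[claimed_identity]
    return similarity(sample_model, reference) >= threshold

db = {"acct-4417": [0.9, 0.1, 0.3]}
print(verify("acct-4417", [0.88, 0.12, 0.28], db))  # genuine speaker
print(verify("acct-4417", [0.1, 0.9, 0.2], db))     # impostor
```

Raising the threshold makes the system stricter (more false rejections, fewer false acceptances); lowering it does the opposite — the trade-off discussed under Performance below.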
Applications of SA
The most widespread uses of SA today are performed over the telephone and are designed to automate security or monitoring operations. Increasingly, those applications are being coupled with speech recognition and text-to-speech synthesis.
Figure 1 displays a typical security application involving both speech recognition and speaker authentication. The system pictured uses the same utterance for both the claim of identity and SA. Speech recognition decodes the spoken account number (the identity claim) and hands the results to the SA system for authentication.
Figure 1: Typical Security Application
The approach to SA used in the example in Figure 1 employs text-dependent SA because it operates on a pre-determined utterance. That utterance was what the system asked the caller to say during the enrollment step.
Figure 2 shows a typical example of an offender monitoring system. It would be used with offenders confined to their homes (called home incarceration) or for offenders who are allowed to move among a clearly-specified number of locations/telephone numbers (e.g., home, work, school, AA meetings). Generally, these systems are programmed to call the offender at a specified telephone number at random times during a 24-hour period.
Figure 2: Offender-Monitoring System
In this example, speech recognition is used to capture the yes or no response to the first question. It might also be used to check whether the person repeated the items correctly. The approach to SA used here is called challenge-response or text-prompted SA. The sequences of numbers and/or words the system asks the offender to repeat are random to ensure that no tape recorder can be substituted for the offender. These systems also generally have controls that detect call-forwarding and similar features that might be used to circumvent the conditions of sentencing.
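The challenge-response idea can be sketched as two small pieces: generating a random prompt, and letting speech recognition confirm the right items were repeated (speaker authentication, not shown, then confirms who repeated them). The number ranges and sequence length are illustrative:

```python
import random

def make_challenge(n_items=3):
    """Generate a random sequence of two-digit numbers so that a tape
    recording of an earlier session cannot simply be replayed."""
    return [str(random.randint(10, 99)) for _ in range(n_items)]

def check_response(challenge, asr_transcript):
    """Speech recognition checks that the right items were repeated;
    a separate SA step would check the speaker's identity."""
    return asr_transcript.split() == challenge

challenge = make_challenge()
print("Please repeat:", " ".join(challenge))

# Using the document's example sequence "24 58 92":
print(check_response(["24", "58", "92"], "24 58 92"))  # repeated correctly
print(check_response(["24", "58", "92"], "24 58 93"))  # wrong items
```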
The third approach, called text-independent SA, is gaining momentum. As its name suggests, it can accept free-flowing speech, but it often requires more data than the other two approaches. Sometimes, text-independent SA is invoked by a call-center agent when the caller requests a transaction that involves security. In such situations, the SA system captures the caller's spoken responses to the agent and generally runs until it has sufficient data to render a decision about the identity claim of the caller. Text-independent SA can also be part of a series of turns between a caller asking for access to a secured system and an automated ASR system. In a more complex application, the biometric information about the person's voice is one element of an authentication decision based on multiple factors that may include another biometric or knowledge of personal information (e.g., birth date, mother's maiden name) and activities (e.g., the person's last transaction).
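Multi-factor decisions of this kind are often implemented as score fusion. The sketch below is one simple, hypothetical way to do it — the equal weights, the 0..1 scale, and the 0.7 threshold are invented for illustration, not taken from any particular product:

```python
def fused_decision(voice_score, factors, threshold=0.7):
    """Combine a voice-biometric score (0..1) with pass/fail results from
    other factors (knowledge questions, a second biometric, etc.) into a
    single accept/reject decision, using a simple equal-weight average."""
    checks = [voice_score] + [1.0 if ok else 0.0 for ok in factors.values()]
    score = sum(checks) / len(checks)
    return score >= threshold

# Voice match is marginal, but the birth-date and last-transaction
# knowledge checks both pass, so the fused decision is to accept.
print(fused_decision(0.6, {"birth_date": True, "last_transaction": True}))
```

Weighted or sequential schemes (e.g., only asking knowledge questions when the voice score is borderline) are common refinements of the same idea.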
Performance
Like all biometric systems, SA products make three basic types of errors:
- They reject a valid user as an impostor (false non-match/false rejection)
- They accept an impostor as a valid user (false match/false acceptance)
- They are unable to process the input data (failure to enroll or failure to verify)
Unlike fingerprint and face recognition, there has been no systematic, third-party benchmarking of SA products and technologies. Only one study has compared biometrics using common methods for all products. The study was performed in 2001 by the Centre for Mathematics & Scientific Computing of the United Kingdom's National Physical Laboratory. The results of one test in that study are presented in Figure 3, which shows the false match and false non-match errors made by a selection of biometric products at a spectrum of acceptance/rejection threshold points.
Figure 3: Comparison of biometric products
As mentioned earlier, the threshold represents the degree of correspondence that must exist between two models in order for a biometric system to accept a claim of identity as valid. When the threshold is set to require close correspondence between the models, biometric systems become more likely to make false non-match/false rejection errors. This is the situation on the left of the graph in Figure 3. The movement from left to right represents a gradual loosening of the matching requirements. As that happens, all biometric systems begin to make fewer false non-match/false rejection errors and more false match/false acceptance errors. The figure also shows that the one SA product included in the test (the brown line) out-performed many of the other biometric products at most threshold levels.
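The threshold trade-off shown in Figure 3 is easy to reproduce numerically. Given match scores for genuine users and impostors (the score lists below are invented for illustration), the two error rates move in opposite directions as the threshold changes:

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """False non-match rate: genuine attempts scoring below the threshold.
    False match rate: impostor attempts scoring at or above it."""
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fnmr, fmr

genuine = [0.91, 0.84, 0.78, 0.95, 0.88]
impostor = [0.35, 0.52, 0.61, 0.44, 0.80]

# Tightening the threshold lowers false matches but raises false
# non-matches, and vice versa -- the same trade-off Figure 3 plots.
for t in (0.5, 0.7, 0.9):
    print(t, error_rates(genuine, impostor, t))
```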
These findings, while encouraging, do not represent real-world conditions, where all of the challenges that face ASR can also cause problems for SA. They include channels that are noisy and/or of poor quality, speakers whose behavior is unpredictable, and loud background noise.
Standards for SA
There are several standards that apply to SA or can be applied to it. The biometrics industry has developed a generic API specification called BioAPI. It was originally developed by a consortium of public and private organizations interested in creating an API that would support multi-biometric deployments and reduce the learning curve for application development. It is now an American National Standards Institute standard (ANSI/INCITS 358-2002) and will soon be an International Organization for Standardization standard (ISO/IEC 19784). ISO is now working on conformance testing for BioAPI. BioAPI is written in C++. A Java Native Interface is available. BioAPI runs under Windows and Linux/UNIX and is ANSI X9.84 and CBEFF compliant (see below). BioAPI version 1.0 can be downloaded for free from www.bioapi.org.
ANSI X9.84, Biometric Information Management and Security for Financial Services is a life-cycle management standard that specifies the minimum security requirements for effective management of biometrics data throughout the life cycle of a deployment. It covers biometric data: enrollment, transmission, storage, verification/identification, and termination and includes security issues related to data integrity, confidentiality, authenticity, and non-repudiation. It is BioAPI and CBEFF compliant. As with the other two biometric standards, ANSI X9.84 is on a fast track for ISO approval.
MRCP v2.0, the Media Resource Control Protocol, version 2.0, is a protocol developed in the IETF to enable distributed speech processing. A typical use is to connect a speech engine to a VoiceXML platform. MRCP v2.0 supports ASR, TTS, and both speaker verification and identification. (See the article about MRCP in the August 2002 issue of the VoiceXML Review https://voicexmlreview.org/Aug2002/features/ietf_speech.html) W3C VoiceXML 2.0 and 2.1 were both designed to support ASR and TTS but do not directly support SA. The record feature of VoiceXML 2.0, and the recordutterance feature in VoiceXML 2.1 (http://www.w3.org/TR/voicexml21/#sec-reco_reco), would permit any sort of server-side processing of an utterance. However, this approach does not permit simultaneous ASR and SA.
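Since VoiceXML 2.x has no native SA tags, one pattern is to record the utterance on the platform and submit it to a server that performs the authentication. The following is a hypothetical sketch of the server side — the field names, the reference database, and the scoring function are all invented; a real deployment would call an SA engine (for example, over MRCP v2.0) where the placeholder sits:

```python
REFERENCE_DB = {"acct-4417": b"stored-voice-model"}  # hypothetical store

def score_voice_sample(audio_bytes, reference_model):
    # Placeholder for a real SA engine call; the fixed score returned
    # here exists only so the sketch runs end to end.
    return 0.9 if audio_bytes else 0.0

def handle_recorded_utterance(form):
    """Handle audio submitted from a VoiceXML record/submit pair.
    `form` stands in for the parsed multipart/form-data request."""
    audio = form["recording"]      # the recorded utterance
    claimed_id = form["account"]   # the claim of identity
    reference = REFERENCE_DB.get(claimed_id)
    if reference is None:
        return {"decision": "reject", "reason": "unknown identity"}
    score = score_voice_sample(audio, reference)
    decision = "accept" if score >= 0.8 else "reject"
    return {"decision": decision, "score": score}

print(handle_recorded_utterance({"recording": b"audio-bytes",
                                 "account": "acct-4417"}))
```

The server's accept/reject result would then drive the next VoiceXML dialog turn.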
VoiceXML Forum Speech Biometrics technical activity
To address the increasing industry interest in SA, the VoiceXML Forum created an ad hoc special interest group (SIG) to investigate SA technologies and how they might be applied to VoiceXML-based systems. Interest in the initial activity has prompted a call for a formal technical committee; formation of that committee is underway.
The ad hoc group has been defining and analyzing requirements and use cases related to Voice User Interface concerns; distributed system design and architecture; security of biometrics data (e.g., voice prints and transaction details); and the trust relationship between application entities, particularly where they are independently owned or controlled. A review of existing proprietary VoiceXML SA extensions as well as server-side processing architectures is underway.
As with other technical activities in the VoiceXML Forum, this work is being conducted in cooperation with the W3C Voice Browser Working Group (http://www.w3.org/voice), the steward for VoiceXML and related standards. The proposed technical committee's goals include:
- Develop a common understanding of various industry standards, their relationship to one another, and their practical use in VoiceXML and VUI applications
- Serve as a focal point for SIV (speaker identification and verification) VUI infrastructure standardization, in particular for telephone-based VUIs
- VUI design best practices
- Application architecture and system design recommendations
- Coordination between relevant standards groups:
  - W3C, BioAPI Consortium, NIST, ANSI, ISO, etc.
  - Development of standards where appropriate (e.g. CBEFF structure)
- Provide recommendations to other standards and groups including:
  - W3C Voice Browser and Multimodal Interaction Working Groups
  - IETF (MRCP)
  - NIST (CBEFF)
  - Others as appropriate
- Synchronization with the work in multimodal and multichannel standards groups:
  - How to combine SA with other modalities, biometric or otherwise (finger/thumb/hand print, retinal scan, etc. and handwriting, geolocation, position, buttons, etc.)
The goal is not to provide a specific design for SA in VoiceXML; however, the committee may provide VoiceXML language design suggestions and analysis of the W3C's work. Participants in the initial ad hoc group include VoiceXML Forum members and invited experts from a variety of areas such as speech technology vendors, VoiceXML platform vendors, and voice application and tool developers. Feel free to contact the VoiceXML Forum for more information, or to participate in the Speaker Authentication and related technical activities.
1 Traditionally, these models have been called "voiceprints". This term was coined to express the uniqueness of each person's voice by comparing the model to a fingerprint. Today, the term voice model is preferred because it is a more accurate description of the data construct.
2 Most speaker authentication involves verification (one-to-one comparison of models). Sometimes, the identity may contain more than one voice model (e.g., for a jointly-owned bank account). Then a separate comparison is made between the newly-created model and each reference model in order to determine whether the claimant is any one of them.
3 All biometric comparison is statistical. No two samples of the same fingerprint, for example, are the same, either.
4 The magenta dot on the left represents the iris recognition product which has a fixed threshold. All other products tested have adjustable thresholds.