- What is Speaker Verification?
- What can Speaker Verification do?
- Using Speaker Verification
- Best Practice
Aculab Cloud's Voice Biometrics employs recent advances in Artificial Intelligence (AI) and the science of Big Data to verify a speaker, rapidly and reliably. In this first release, Voice Biometric Speaker Verification (SV) is available via a suite of web services that let you send your voice audio to the biometrics engine as a file. Each user you want to verify by voice can say the same phrase to register, verify and update (text-dependent mode), or they can register, verify and update using different phrases (text-independent mode).
Speaker verification is available on a cloud account and is secured by access keys that you can create on your account. You group these keys and the users that create registrations with them in a User Group. Users register samples of their voice to establish a voice model that can be updated with additional samples to improve the model or to account for changes in environment or normal vocal variations.
In the initial release of Aculab Cloud's Voice Biometrics, a user's audio data can be supplied directly as a file or streamed to a WebSocket in discrete phrases. In the latter case, this may entail the client using Voice Activity Detection (VAD) to identify individual phrases from a stream of audio. The results of the verification will only be available after all the audio has been sent on the WebSocket. In future the Voice Biometric analysis may incorporate VAD and thus be able to analyse continuous streamed audio, optionally providing verification results on an ongoing basis.
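As a rough illustration of the client-side phrase detection mentioned above, the sketch below uses a naive energy-based VAD to split a stream of 16-bit PCM samples into discrete phrases. This is an illustrative assumption only, not part of Aculab's API; the frame size, energy threshold and silence length are arbitrary, and a production client would use a proper VAD library.

```python
def split_into_phrases(samples, frame_len=160, threshold=500, min_silence_frames=25):
    """Split a list of 16-bit PCM samples into phrases using a naive
    energy-based VAD: a phrase ends after `min_silence_frames`
    consecutive low-energy frames."""
    phrases, current, silence = [], [], 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        # Mean absolute amplitude as a crude energy measure.
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy >= threshold:
            current.extend(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= min_silence_frames:
                phrases.append(current)
                current, silence = [], 0
    if current:
        phrases.append(current)
    return phrases
```

Each returned phrase could then be sent on the WebSocket as a separate unit of audio.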
What is Speaker Verification?
Speaker Verification (SV) is the process of verifying a speaker’s identity claim. It presupposes that the speaker (or user) is claiming to be a person who has previously registered with the system. The system verifies the user based on a sample of their speech.
Text dependency
There are two main classes of Speaker Verification: text-dependent and text-independent. Text-dependent SV requires the user to say a specific phrase, or a specific sequence of words, both when registering with the system and at verification. Text-independent SV verifies the user regardless of the words spoken.
The words used for text-dependent verification must always be the same (i.e. they form a passphrase) and the whole phrase must be said during a verification attempt. Consequently, text-dependent systems have a significant advantage: they can be more demanding about exactly what sound characteristics are acceptable at each point in time. This means that they can provide results with a high degree of confidence from a relatively short sample of speech.
Text independent verification has the advantage that it can be used at any point, or several times, during an extended dialogue. It can also be used in the background, for example while the user is talking with an Interactive Voice Response system or human operator.
Note: Text-dependent mode is the default for Aculab Cloud Voice Biometrics.
What can Speaker Verification do?
Speaker Verification is primarily used to restrict access to confidential information and secure computer systems. It normally forms part of a larger access control system, which combines information from different sources to separate out genuine identity claims from those of impostors. This is termed Multi-Factor Authentication (MFA).
There are many ways to restrict access to secure systems, but very few which allow automatic verification of an individual’s identity using their voice, for example on a phone call. In this scenario, the information available is:
- the voice characteristics of the individual
- the sequence of words spoken
- any phone keys pressed
- any related metadata
The metadata, along with a combination of Speaker Verification (SV), Speech Recognition (SR), and key-presses (DTMF), provides coverage of all these factors. For example, the user first enters an account or user number (either using the phone keypad or by saying the digits); this forms their identity claim. They then say a passphrase, or may be prompted to say certain digits from their PIN or give some other information verbally. An SR system identifies the words spoken, while the SV system analyses the vocal characteristics. The results of the analyses are then combined to confirm or deny the identity claim.
Beyond this example, significantly improved security can be achieved by continuing the verification process throughout any interaction. In that way, not only is it possible to improve the system’s confidence in its initial decision about the user’s identity, but it can also detect any changes in the voice of the user. This can be a very valuable method for detecting a replay attack or, in extreme cases, coercion of a user.
A Multi-Factor Authentication system will include checks on, for example: an account number, PIN, cryptographic response code, a caller’s CLI, behavioural statistics, etc. - as well as speaker verification. If any of these factors casts doubt on the user’s identity then there is generally a set of fallback checks, designed to clear up any level of doubt. Fallback checks may take more time and require interaction with a live operator, so to minimise costs it is important that the MFA system can make use of not just the results, but also any confidence measures associated with them. Additional checks should only be invoked when there is a high level of confidence that the initial decision was wrong. If all the component parts of an access control system are well integrated, the use of speaker verification can simultaneously reduce costs and increase security.
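The confidence-weighted decision described above can be sketched as follows. The factor names, weights and thresholds are illustrative assumptions, not part of any Aculab interface; the point is that a middling combined confidence triggers fallback checks rather than an outright accept or reject.

```python
def mfa_decision(factors, accept=0.8, reject=0.3):
    """Combine per-factor confidence scores into one of three outcomes:
    'accept', 'reject', or 'fallback' (invoke additional checks, e.g.
    security questions or a live operator).

    `factors` maps a factor name to a (score, weight) pair, where score
    is a confidence in 0.0-1.0. Thresholds are illustrative only."""
    total_weight = sum(weight for _, weight in factors.values())
    combined = sum(score * weight for score, weight in factors.values()) / total_weight
    if combined >= accept:
        return "accept"
    if combined <= reject:
        return "reject"
    return "fallback"  # doubt remains: run the fallback checks
```

For example, a strong SV score combined with a correct PIN and a matching CLI would clear the accept threshold, while a single weak factor among otherwise good ones lands in the fallback band instead of forcing an immediate rejection.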
Using Speaker Verification
Registering a user involves “training” the Speaker Verification system about the range of sounds which are characteristic of that user. This is done by providing the system with samples of the user’s voice. The system analyses these samples and builds a model of the user so that it can estimate a confidence that subsequent sounds were produced by the same person. The accuracy of the SV system is primarily dependent on the data used for registration. This should be sufficient in terms of coverage of speech sounds, or ‘phonemes’, to match the range of sounds expected to be encountered during verification.
In practice, the robustness of the verification result can be affected by noise and channel distortion. The verification confidence is formed by estimating a number of similarities between the audio sample and the user's model. If the signal is corrupted with noise or distortion, then the similarities will drop. Although Aculab's Voice Biometric speaker verification is robust against noise and other distortions, clean signals will give the best results.
A user’s voice will vary very little when prompted to speak several times in quick succession (as is commonly done during registration). However, their voice will be noticeably different when they have been using the system for verification for many months, and have become familiar with the whole process, or are using different audio input equipment (e.g. a different phone).
Adaptation is required to counter the effects of insufficient or atypical registration data. This is essential to improving, and ultimately maintaining, verification accuracy. The more the system is used for verification, the more examples of the user's voice parameters become available, and these can be used to re-train the user's model. This will improve the model's coverage and make it more specific to an individual's voice. It is essential that the model is only updated once the user's identity has been thoroughly checked and confirmed, to avoid "hijacking" of the user's model by an impostor.
Just as the acoustic characteristics of any individual’s speech are not strictly unique, they also vary over time. As a person ages, their voice will change. Even over a shorter timeframe, they may develop new vocal habits and mannerisms. Allowing the user’s model to track changes in their voice, through adaptation, will ensure that the system remains accurate.
Resisting Presentation Attacks (Spoofing)
A user's biometric data might be artificially produced (synthesized), covertly recorded, or obtained through a hacked system, and then presented to speaker verification as a fake user identity claim. Aculab Cloud Voice Biometrics includes optional Presentation Attack Detection (PAD) during verification, which attempts to detect audio that has been used before, manipulated speech or human mimicry, and synthetically generated or converted speech. Although PAD is designed to detect the majority of attacks, some can be extremely sophisticated, so Multi-Factor Authentication (MFA) is recommended to provide full protection from attack. For instance, MFA can help to establish that the speaker is live and not a recording.
Note: presentation attack detection is available when running verification only. Registration and updates must be performed in a controlled manner and environment that precludes presentation attacks.
There are some general principles which will affect any SV system’s performance. The most important part of the process is registration.
Registering using a fixed passphrase in text dependent mode will usually involve analysing several repetitions of that passphrase. For text-independent operation the registration should include most of the words that the user is most likely to say. This normally requires saying several complete sentences. As a general rule, any speech used for registration should contain at least 4 syllables, preferably more. More syllables will generally produce more precise models, and thus better verification accuracy.
We recommend using register with three individual speech samples, each containing the whole of the user's pass-phrase (if text dependent) or a word-rich sentence. Following registration, we recommend using verify at least twice, each time with a new sample, which is then used to update the user's model. After each verification, take note of the confidence score. If the user is failing to be verified, continue to perform the verification-update sequence, each time with a new sample. Once the user consistently passes verification, the user's model can be considered to be trained.
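The recommended register-then-verify-and-update flow could be sketched as below. The `client` object is a hypothetical wrapper around the register, verify and update web services (its method names and the confidence scale are assumptions, not Aculab's actual API); each speech sample is used exactly once.

```python
def train_user_model(client, user_id, samples, pass_threshold=0.9, required_passes=2):
    """Sketch of the recommended flow: register with three samples, then
    verify with fresh samples, updating the model after each attempt,
    until the user passes verification consistently.

    `client` is a hypothetical wrapper exposing register/verify/update;
    `verify` is assumed to return a confidence score in 0.0-1.0."""
    client.register(user_id, samples[:3])       # initial model from three samples
    consecutive_passes = 0
    for sample in samples[3:]:
        confidence = client.verify(user_id, sample)
        client.update(user_id, sample)          # adapt the model; sample used only once
        if confidence >= pass_threshold:
            consecutive_passes += 1
            if consecutive_passes >= required_passes:
                return True                     # model can be considered trained
        else:
            consecutive_passes = 0
    return False                                # ran out of samples before converging
```

In practice each sample would be a fresh recording gathered during a dialogue with the user, and a production version would also check the user's identity independently before each update, as discussed below.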
Verification does not typically require as much data as registration, but, as with registration, the longer and more phonetically varied the speech, the more accurate the verification result will be.
Following training, it may become necessary, in time, to adapt the user's model to improve accuracy using update. This would typically be passed the sample used in a recent verification attempt. However, this should only be done once the user’s identity has been confirmed by security questions or other independent methods, even if the user passed verification. In general, adaptation should be employed if a user is found to consistently fail verification, indicating that there has either been insufficient training or some systematic change in the user’s voice or the acoustic environment. Adaptation need not be performed after good verification scores, or on the basis of one or two failed attempts. More than one update may be required, each with a different sample. A sample of speech must not be used more than once to update a user's model.
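The adaptation policy in the paragraph above can be expressed as a simple gate. This is a sketch under assumptions of our own (score scale, thresholds and the shape of the inputs are all illustrative): update only after independent identity confirmation, never reuse a sample, and adapt only when verification is consistently failing.

```python
def should_update_model(identity_confirmed, sample_id, used_sample_ids,
                        recent_scores, fail_threshold=0.5, min_fails=3):
    """Decide whether to adapt a user's model with a new sample.
    Thresholds are illustrative; `recent_scores` holds the confidence
    scores of the user's recent verification attempts."""
    if not identity_confirmed:
        return False    # never adapt on an unconfirmed identity claim
    if sample_id in used_sample_ids:
        return False    # a sample must not be used more than once
    fails = [s for s in recent_scores if s < fail_threshold]
    return len(fails) >= min_fails   # adapt only after consistent failures
```

One or two failed attempts, or a run of good scores, therefore never trigger an update; only a sustained pattern of failures does, and only for a confirmed user with a fresh sample.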
The key to a successful SV deployment lies in the details of the system design, and, most importantly, in the handling of the dialogue. There are some decisions which need to be made at the outset. Firstly, the reason for the use of SV must be understood. Typically, there are only a few common reasons, but they need to be prioritised for each deployment:
- To reduce the time taken to complete a transaction.
- To reduce the need for direct interaction between the user and any human operator.
- To make the security process simpler and more natural.
- To make the user feel more confident that every precaution has been taken to secure their data.
- To reduce the success rate of impostors trying to access the system.
These priorities will determine trade-offs between speed of registration, ease of access for genuine users, reliable identification of impostors, and perceived security.
If usability and minimal disruption to the user are the most important aims, then Speaker Verification can still provide high security, provided best practice for registration is followed. As described earlier: three utterances should be used to create the initial model. This should be followed with at least two verification attempts, each using a new sample, and each followed by an update. The verification plus update loop should be continued until the user consistently passes verification.
If correct identification of impostors is deemed critically important, it is not enough to base the verification decision on a single, short utterance such as a pass-phrase. Instead, the speech must be analysed continuously. The best way to achieve this, without placing an unacceptable burden on the user, is to use a hybrid text-dependent and text-independent approach and to carefully tune the structure of the dialogue.
For the hybrid system, the user will require two models: a text-dependent model for verification of the pass-phrase, and a text-independent model for verification at points during the interaction. Each model requires its own training phase that follows the best practice described above. The text-dependent model will be based on repetitions of the pass-phrase; the text-independent model will be based on general samples of speech.
For example, a user could initiate verification by entering an account number using a phone keypad - this is the identity claim - followed by a simple text-dependent Speaker Verification transaction where the user is asked to say their pass-phrase. Then, during interaction with an operator, all the speech is collected and used for text-independent SV at regular intervals. Additionally, and especially in the case of a spoken PIN or a numerical passphrase, the spoken words can be identified by a SR system. The combination of all these factors should be sufficient to reject the vast majority of impostors without unduly delaying access to a legitimate caller. The dialogue of the interaction should be arranged such that any particularly critical transactions are delayed until sufficient examples of the user’s speech have been verified. If, by that point, the user has been flagged as a possible impostor, the system would prompt them to answer additional security questions.
By designing the dialogue carefully, a significant amount of speech can be collected without the user becoming impatient or frustrated, and this can improve the impostor detection rate.
Voice Biometric speaker verification is charged per User API web service call to register and verify, per minute with 15-second granularity. So, for example:
- A single call to register with three 7-second audio files is charged for 30 seconds.
- A call to verify with one 10-second audio file is charged for 15 seconds.
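The examples above amount to summing the audio durations in a call and rounding up to the next 15-second increment, which can be sketched as:

```python
import math

def charged_seconds(audio_durations, granularity=15):
    """Charged time for one register/verify call: the total audio
    duration, rounded up to the next `granularity`-second increment."""
    total = sum(audio_durations)
    return math.ceil(total / granularity) * granularity
```

Three 7-second files total 21 seconds, which rounds up to 30 charged seconds; a single 10-second file rounds up to 15.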
You can obtain charging information for a particular voice biometric API call using the application_status web service, passing it the application_instance_id returned by register or verify.
You can obtain charging information for voice biometric calls over a period of time using the Managing Reports web services.