What is voice biometrics?

Biometrics is the science of measuring and analysing human biological characteristics. Voice biometrics, therefore, involves measuring and analysing a person’s voice. The technology works on the principle that an individual’s voice is unique, in the same way that a fingerprint and other biometric characteristics are unique.
The uniqueness of a person’s voice is due to their physical characteristics and their speech habits. Due to that uniqueness, the identity of a speaker can be established or verified through an analysis of their voice. When a user registers in a system by providing samples of their voice, personal voice patterns are extracted from the audio and a unique reference model is created. The model or template is called a voiceprint, which is analogous to a fingerprint.
To identify or verify that a person is who they claim to be, voice biometrics technology uses algorithms to analyse their speech and compare that to the previously created model. If there is a match, the voice biometrics system will confirm that the speaker is the person registered against the voiceprint. Using a person’s voice in such a way has become an accepted and established practice.
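The enrol-then-compare process described above can be sketched in code. This is an illustrative simplification, not any vendor's actual algorithm: it assumes each speech sample has already been reduced to a fixed-length feature vector, averages enrolment vectors into a reference voiceprint, and compares a new sample using cosine similarity. The function names and the 0.8 threshold are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two fixed-length voiceprint vectors (1.0 = same direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def enrol(samples):
    # Average several per-sample feature vectors into one reference voiceprint
    dims = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dims)]

def verify(voiceprint, probe, threshold=0.8):
    # Accept the identity claim only if the probe is similar enough to the reference
    return cosine_similarity(voiceprint, probe) >= threshold
```

In a real system the feature vectors would come from acoustic analysis of the audio, and the comparison and threshold logic would be far more sophisticated; the point here is only the shape of the enrol/verify flow.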
What are the main uses of voice biometrics?

Voice biometrics is used in two main ways: for the purposes of verification and identification.
The important use cases are: a) to validate the identity of a person making an identity claim; and b) to identify who an individual is.
The first case is proof of identity, and is analogous to providing a PIN or password to authorise access to a service. Instead of providing their password, a user of the system gives a sample of their voice. That sample is then analysed and compared with a voiceprint registered to the genuine user. The use case is commonly referred to as speaker verification. We can also say that the user’s identity has been authenticated.
The other use case is to determine the identity of a person when e.g., they are claiming to be someone else or attempting to remain anonymous. By analysing a sample of speech and comparing that with all voiceprints in a database or hotlist, we can identify the individual. That’s why the use case is referred to as speaker identification. It is particularly useful for detecting and identifying known fraudsters, or for recognising repeat malicious callers.
Is voice better than other biometric factors?

Voice is comparable to other biometrics in many ways. However, voice does have some advantages, not least because the user doesn’t need a scanner, as is required for iris and fingerprint recognition.
Voice is extremely easy to use and, because of that, it has a higher level of user acceptance than many other biometric identity verification methods. In terms of accuracy, voice is broadly equivalent to other methods, and it is no less secure than fingerprint, retina, or facial recognition. Essentially, voice is both convenient and reliable, with none of the concerns about fingerprint residues or poor lighting that affect other modalities.
A primary advantage of voice is that it is the only biometric technology that can be used remotely over the telephone. That means it is particularly suitable for authenticating callers to a contact centre or an inbound, IVR-driven self-service platform. In a similar way, it can be used to verify the identity of people called from an outbound contact centre solution.
How does voice biometrics compare to traditional forms of identification?

Current, non-biometric methods involve shared secret knowledge and physical tokens. Secret knowledge takes the form of a PIN or password, or the answer to a security question i.e., it’s something you know. Examples of physical tokens i.e., something you have, include keys, ID cards, security fobs, drivers’ licences, and passports.
Unfortunately, the traditional methods are vulnerable to social engineering and theft. Tokens are routinely counterfeited and stolen, and passwords are forgotten, left in plain sight, or stolen. Moreover, tokens can’t guarantee the positive identification of a person.
In contrast, biometrics is less open to being copied, hacked, shared, or stolen. And aside from jokes about losing your voice, as it involves something that relates to who you are, it can’t be lost. Furthermore, in terms of the inherent security of voice biometrics, a voiceprint is a derived code, not a recording; it can’t be reverse engineered to reproduce speech, and if it were to be accessed by a hacker, the data would appear as a meaningless string of numbers that is functionally useless.
Using a biometric in combination with other methods equates to strong, multi-factor authentication. Having e.g., a mobile phone, knowing your PIN, and verifying your identity via voice biometrics is a secure method of reducing the vulnerability of systems and services to unauthorised access.
How accurate is voice biometrics?

First of all, no biometric system is 100% foolproof. Industry reports and studies indicate that success rates above 90% should be the minimum acceptable, where success means that a person is able to authenticate their rightfully claimed identity.
Regardless of competitive claims, a clue to the nature of accuracy lies in the existence of three important error metrics that reflect the performance of a system. Those metrics are the equal error rate (EER), the false acceptance rate (FAR), and the false rejection rate (FRR). Allowing an impostor to get into the system is an error of false acceptance. Denying a genuine user access to the system is a false rejection error.
Industry reports indicate that optimal, text-dependent voice biometric engines can achieve a FAR of below 1%, with a corresponding FRR of less than 3%. However, it is difficult to compare systems based on published ‘accuracy’ figures. That’s partly because there is no industry standard dataset against which to measure performance, and partly because threshold settings and similarity ratings are vendor and implementation specific.
Notably, the point at which the FAR and FRR curves intersect is the EER. Thus, EER is the commonly accepted metric used to compare the separability of systems i.e., their effectiveness at differentiating between genuine users and impostors. That’s because unlike FAR and FRR, EER is independent of the threshold setting.
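To make the relationship between FAR, FRR, and EER concrete, here is a minimal sketch. The scores and function names are assumptions for illustration, not a vendor implementation: it computes FAR and FRR from lists of impostor and genuine scores, then sweeps the threshold to find the point where the two rates are closest, approximating the EER.

```python
def far(impostor_scores, threshold):
    # Fraction of impostor attempts wrongly accepted at this threshold
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)

def frr(genuine_scores, threshold):
    # Fraction of genuine attempts wrongly rejected at this threshold
    return sum(s < threshold for s in genuine_scores) / len(genuine_scores)

def equal_error_rate(genuine_scores, impostor_scores, steps=1000):
    # Sweep the threshold across the score range and return the point
    # where FAR and FRR are closest, i.e. an approximation of the EER
    lo = min(impostor_scores + genuine_scores)
    hi = max(impostor_scores + genuine_scores)
    best = None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        fa, fr = far(impostor_scores, t), frr(genuine_scores, t)
        if best is None or abs(fa - fr) < abs(best[1] - best[2]):
            best = (t, fa, fr)
    return best  # (threshold, FAR, FRR) at the crossover
```

Raising the threshold in this sketch drives the FAR down and the FRR up, which is exactly the trade-off discussed later in relation to threshold settings.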
Notwithstanding all that, the sensible thing to do in relation to voice biometrics is to run trials in your target environment, based on real-world users.
If voice biometrics isn’t 100% accurate, why should it be used?

Notwithstanding its imperfection, voice biometrics is a valuable technology in several ways. All businesses know about risk assessment, particularly in relation to IT and data security. If you conduct a risk assessment, the output will be preventative actions to mitigate the risk(s). Managing a risk means prioritising actions appropriate to the risk level.
Voice biometrics confers benefits beyond fraud detection and the mitigation of risks. However, in relation to mitigating fraud, voice biometrics offers a valuable resource in that:
a) its presence deters fraudsters;
b) its use makes the fraudster’s task very much harder; and
c) the results of using it mean the business will lose less to fraud.
Consequently, voice biometrics can be very effective in helping to manage fraud risk.
The vulnerability of PINs, passwords and security questions has led to some very public data breaches at well-known organisations, putting customers at risk. Voice biometrics can reduce the security risk and mitigate fraud, whilst offering a more convenient user experience. However, the best practice is to implement voice biometrics in combination with other methods, leading to a strong, multi-factor authentication solution that will reduce the vulnerability of systems and services to unauthorised access.
What other benefits will I get from implementing voice biometrics?

The business case for voice biometrics centres around a four-fold benefit, namely:
1. Fraud and security risks are mitigated;
2. The cost of authentication is reduced;
3. The customer experience is improved; and
4. Call takers’ morale and motivation are heightened.
In addition, there is a clear return on investment (ROI). An ROI is an accounting measure, but there can be other, ‘non-bean-counter’ benefits, which include the security outcomes, enrolment take-up, and customer satisfaction.
The benefits of increased security mean that:
i. you will lose less to fraud;
ii. you will save by not having to pursue fraudsters through the legal system; and
iii. you will save through having to pay out less in compensation and reimbursement.
The cost reduction benefit comes from replacing manual authentication with an automated system. The more the process is automated, the greater the saving. A voice biometric system can shave two-thirds or more off the cost of identifying and verifying callers. In UK contact centres, for example, the average time to authenticate a call via an agent is around 30 seconds. Contrast that with an automated, voice biometric system that achieves the same result in 10 seconds or less and you can see the potential for savings.
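The arithmetic behind that saving can be laid out explicitly. The 30-second and 10-second figures come from the text; the hourly agent cost and annual call volume below are purely hypothetical assumptions used to illustrate the calculation.

```python
# Hypothetical inputs: the 30 s vs 10 s figures are from the text above;
# the agent cost and call volume are illustrative assumptions only.
AGENT_COST_PER_HOUR = 30.0   # assumed fully loaded agent cost, in GBP
calls_per_year = 1_000_000   # assumed annual authenticated-call volume

manual_seconds = 30          # average agent-led authentication time
automated_seconds = 10       # automated voice biometric authentication time
cost_per_second = AGENT_COST_PER_HOUR / 3600

saving_per_call = (manual_seconds - automated_seconds) * cost_per_second
annual_saving = saving_per_call * calls_per_year
reduction = 1 - automated_seconds / manual_seconds  # the two-thirds cited above
```

Plugging in your own agent cost and call volume gives a first-order estimate of the authentication saving, before counting reduced fraud losses.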
Automating the authentication process undoubtedly makes it easier and more convenient for customers. In addition, removing the tedium of having to ask the same security questions, day after day, is bound to have a positive effect on call takers’ morale. Customers will surely recognise the security benefits to them, and prefer using technology to having to remember the name of their first pet (school/car/etc).
How those benefits are measured will depend on the business. However, in almost any scenario, a subscription licensing model will deliver an ROI in less than a year, and continue to do so year after year.
What about languages?

Fortunately, language isn’t a big issue, because voice biometrics works on the sounds that people make, rather than on what they say. Some vendors need to train their voice biometric engine for each language, or produce a separate engine per language, because different languages use different sets of phonemes. However, the best systems operate independently of language, because they cater for a wide range of language sounds.
That’s not to say that a system can’t be fine-tuned, but training a system in such a way can involve more than just language. It’s feasible for a system to be fine-tuned for a fixed, text-dependent passphrase and for the speaker domain (e.g., environment, networks, and devices), but the most beneficial effects can be gained by applying best practices during set-up and implementation.
Language is a factor where speech recognition is used in tandem with speaker verification (speaker recognition) to validate what is said in addition to who said it. However, that is a separate issue, which has no bearing on the performance of the voice biometric engine itself.
How is voice biometrics implemented?

Voice samples
The first step is to determine how you will get voice samples from your customers, in order to create their individual, reference voiceprints. You need a method of collecting audio from each person, and you need to make sure you get that one voice, and that voice only, in each file i.e., you must avoid capturing others’ voices.
The most common method of getting audio is over the telephone, which is perfect, because that’s where voice biometrics comes into its own. In fact, some vendors will have optimised their speaker verification system for telephone speech. The telephone can be a landline phone, mobile phone, VoIP phone, or a PBX handset; it can also be a SIP client, WebRTC in a browser, or Skype.
You will need to record audio from the caller, with just the caller’s voice, and pass that to the voice biometrics system. Integration with an IVR platform is the best method of achieving that, on the basis that future authentications are likely to be triggered by calls to the IVR. Other methods of recording audio include using a laptop with a microphone, a native ‘app’ on a mobile phone, or a studio microphone. You just need to ensure the recording format is suitable for the voice biometric system, and to minimise background sounds and noise.
If you are thinking of using database recordings of contact centre customers, most likely you will need to process the audio to separate the customer’s voice from that of any other person on the call e.g., an agent or supervisor.
Enrolment is the process of creating a reference voiceprint for the individual. The enrolment step is important as you need to consider what each person will have to say or speak in order to be enrolled in the system. To have enough audio to analyse for adequate enrolment in the system, you will need several (typically, at least three) distinct samples of each person’s voice. That requirement applies regardless of the source of the audio.
There are a number of options for enrolment (and later verification), under the labels of active and passive methods.

Passive enrolment

This method gets its name from the idea that speech samples for enrolment are recorded passively while the speaker is in conversation with e.g., a call taker. Consequently, the speaker doesn’t have to say anything specific, such as a passphrase.
However, the label is somewhat of a misnomer, because the speaker must actively consent to the recording. And, since the introduction of the European Union’s General Data Protection Regulation (GDPR), biometric data (a ‘special category of personal data’), such as voiceprints, cannot be created and stored without the explicit consent of the ‘data subject’.
Because biometrics is a special category under the GDPR, it requires more protection in the sense that, in order to lawfully process voiceprints, you need to identify and document a lawful basis under Article 6 and a separate condition for processing under Article 9. The lawful basis for processing voiceprints may be e.g., consent, or the legal (regulatory) obligation of the controller/processor, or in order to protect the vital interests of the data subject by e.g., verifying their identity prior to transacting on their behalf.
Regulations notwithstanding, the benefit of this so-called passive approach is that you do not need to train the speaker, or ask them to say anything specific in order to enrol. It is perhaps ideal for the user, because they don’t have to remember a special passphrase.
A potential downside of this approach is that the caller may not say much, or what they do say isn’t of sufficient ‘phonetic diversity’. For an adequate enrolment, the recording samples need to contain enough unique spoken sounds. Ideally, that means the person speaking all the phonemes in their language twice, which for English means 44 (x2) phonemes.
As it happens, a caller is likely to speak all phonemes twice or more in a two- to three-minute phone conversation. However, as it is rare for one speaker in a telephone dialogue to talk continuously for more than 10 to 20 seconds, recordings should be multiple, shorter passages captured throughout the duration of the call. If you can get from three to five samples of usable speech, each of say 10 to 20 seconds duration, with silence removed and exclusively from the caller, you will get a more precise reference model, and subsequently, better verification accuracy.
Active enrolment

This method is so-called because the speaker must knowingly speak a specific sequence of words in order to enrol a voiceprint. Typically, a unique passphrase of about three seconds’ duration (about six to 10 words), repeated three to five times, is needed for an effective enrolment. Note that consent is needed regardless of the method used. It’s simply more overt with this method, as the speaker is fully engaged in repeating the passphrase to enrol.
An advantage of the active method is that the process of authentication is faster and more efficient. That’s because the same passphrase is used for enrolment and verification, and the length of speech sample that’s needed is short. An easy to say and remember phrase, containing a minimum of four syllables, is recommended. Very few phonemes are needed as long as you get enough samples. And, on the basis that users will enrol by repetition, repeating sounds or phonemes within the passphrase can be a good thing, because they are never said in exactly the same way.
Active participation means the speaker has to devote time to the process of enrolment, which may be considered a downside. However, as the speaker is aware of the process and its purpose in either case, that’s unlikely to be a real issue. And considering the benefits to the user, having to follow ‘repeat after me’ instructions is an investment in future convenience. It’s also worth adding that if the system allows autonomous passphrase selection per user, such objections disappear.
Other active methods
As alternatives or in addition to the classic passphrase option, there are other active things you can ask your customers to do for enrolment, and verification. With a primary purpose of voice biometrics being to enhance security and mitigate fraud, how you manage enrolment and verification will have a bearing on the effectiveness of any solution.
A fixed or static passphrase, whether based on words (text) or numbers (digits), has a weakness: a determined, high-tech fraudster can record the genuine speaker’s voice and replay it to try to fool the system. Vendors go to great lengths to be able to detect recorded playbacks and other methods of spoofing. However, in terms of risk assessment, in a given application, the designer must consider how likely it is that an attack will be attempted, and what the consequences of any resultant security breach would be.
If the likelihood of a breach is significant, or the consequences severe, then a simple, fixed passphrase verification is unlikely to be suitable. The best practice is always for applications to be designed to use strong, multi-factor authentication, usually including text-prompted speaker verification with different randomly selected prompts for each access attempt. In that way, your solution will be at its most effective.
Adding a second prompt to get the caller to speak a non-static, random word or number sequence gives two benefits: i) it adds an additional authentication factor; and ii) it serves as a form of liveness detection i.e., if the caller responds with the correct sequence, it’s more likely to be a human than a machine. To be able to verify using such active methods, enrolment must involve getting the customer to speak enough words and numbers to be able to generate a viable voiceprint, rather than simply repeating a passphrase.
For example, in order to enrol, you could get the customer to speak the numbers zero to nine in ascending order, followed by speaking them in descending order, followed by speaking two, five-digit sets. That will give you the spoken diversity to have customers verify by repeating, say, one or two sets of randomly generated four-, five-, or six-digit strings, which are of course different every time. That means you will collect a robust speech sample in a simple way, which is an excellent approach that will give you high security and convenience.
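The digit-prompt scheme just described can be sketched as follows. This is an illustrative generator for the prompts, not part of any product; the function names are assumptions, and a real deployment would use a cryptographically secure random source rather than the standard `random` module.

```python
import random

def enrolment_prompts():
    # Prompts that together cover every digit several times, as described above:
    # ascending, descending, then two random five-digit sets
    ascending = " ".join(str(d) for d in range(10))
    descending = " ".join(str(d) for d in range(9, -1, -1))
    random_sets = [" ".join(random.choice("0123456789") for _ in range(5))
                   for _ in range(2)]
    return [ascending, descending] + random_sets

def verification_prompt(length=5):
    # A fresh random digit string for each access attempt, so a recorded
    # response from a previous call is useless to a fraudster
    return " ".join(random.choice("0123456789") for _ in range(length))
```

Because the verification string is different every time, it doubles as a simple liveness check, as noted above.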
If your voice biometric system includes speech recognition and DTMF recognition, you get the added advantage of being able to verify what is said in addition to who said it. At the end of the day, the methods you choose will depend on the level of security and user experience you wish to provide.
Scoring and thresholds

Voice biometric systems employ scoring to express the similarity between a user’s voice and their reference voiceprint. It follows that the higher the score, the greater the probability that the voice belongs to the claimed person. It also follows that authentication depends on the score being higher than a pre-determined threshold.
In real-world systems, the FAR and FRR errors have to be managed. Because a voice biometric system is a statistical system based on probabilities, there is a trade-off between those error rates. They are inversely related, and if you were to plot them against a threshold axis, you’d see that the curves overlap.
Setting a threshold within that area of overlap means some impostors will undoubtedly score higher than the threshold, and some genuine users will score lower. Therefore, it is unavoidable that some classification errors will occur.
If you set a higher confidence level, such that no impostor scores higher, there will be no false acceptances. However, at that same threshold, some genuine users may fail to score high enough and be falsely rejected. Conversely, if you set a low threshold, such that no user is falsely rejected, some impostors will score high enough to be falsely accepted. If you choose an optimum threshold between those two points, it is inevitable that both false rejections and false acceptances will occur.
The necessary trade-off is a balance of security versus convenience. Set a high threshold to block impostors and you will inconvenience some genuine users. Therefore, the threshold setting depends on the application and the relative importance of those two considerations.
To maintain both high security and convenience, the best practice is to set the threshold sensitivity high enough to strongly reject impostors. It also makes sense to enable user retries. Retries are commonplace in any IVR, contact centre, or self-service platform, and by allowing retries you increase the chances of a genuine user being authorised. It works on the principle of ‘if we’re sure, we’ll let you in; if not, we’ll ask for another sample’. However, you should not allow more than two or three attempts, because each retry also gives an impostor another chance.
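The ‘high threshold plus limited retries’ policy can be sketched in a few lines. This is a hedged illustration, not a product API: `score_sample` stands in for whatever function scores one spoken attempt against the voiceprint, and the 0.85 threshold and three-attempt limit are assumed values.

```python
def authenticate(score_sample, threshold=0.85, max_attempts=3):
    # score_sample() returns the similarity score for one spoken attempt.
    # Accept as soon as any attempt clears the threshold; after the
    # maximum number of failed attempts, lock the account.
    for _attempt in range(max_attempts):
        if score_sample() >= threshold:
            return "accepted"
    return "locked_out"
```

A genuine user who fails once (e.g. due to background noise) gets another chance, while an impostor exhausts the attempt limit and is locked out, which mirrors the industry practice described below.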
It is industry standard security practice to lock accounts after two or three failed attempts at verification. Furthermore, adding several layers of security in tandem with multi-factor authentication is also considered best practice when implementing voice biometrics for high-risk transactions.