Prosody application note: isolated word recognition

Human Factors

Many factors can affect acceptability, recognition accuracy, and the users' perceptions of how useful an ASR system is to them. The design of the prompts given to each caller is critical to all of these. The prompts must:

  1. Inform the caller that they are talking to a speech recogniser
  2. State unambiguously and clearly what information is required of them and how it should be presented to the system
  3. Encourage them to speak in a clear and definite fashion
  4. Not confuse them with too many options
  5. Be clear and concise, but not so concise that a brief distraction causes them to miss important instructions
  6. Describe "error recovery" procedures (e.g. if all else fails, say "help" to request the attention of a human operator)
  7. Ensure that they know that they will benefit from using the system (and that it is worth their while to cooperate with it)

It is also important that each caller can reach the information they need quickly and without excessive prompting, and that they are passed to a human operator promptly if they are having difficulty. Such difficulties can be detected from the proportion of recognition results that are rejected or deemed uncertain by the ASR algorithm. Problems can also be detected from the interval between a prompt and the corresponding recognition result: if this is too long, something is probably wrong, and if the problem persists it is best to transfer the caller to an operator. Such human factors issues can be crucial to acceptance of the system by the general public.
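The two warning signs described above (a high proportion of rejected or uncertain results, and long delays between prompt and result) can be tracked per caller. The following is a minimal sketch only: the class name, the window size, and all thresholds are illustrative assumptions, not values taken from Prosody, and the application is assumed to supply the accept/reject decision and the measured delay for each result.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CallerMonitor:
    """Tracks recent recognition outcomes for one caller (hypothetical helper)."""

    window: int = 5                # number of recent results considered (assumed)
    max_bad_fraction: float = 0.6  # rejected/uncertain share that triggers transfer (assumed)
    max_delay_s: float = 8.0       # prompt-to-result delay treated as too long (assumed)
    max_slow_in_row: int = 2       # consecutive slow responses tolerated (assumed)
    _bad: deque = field(default_factory=deque, init=False)
    _slow_in_row: int = field(default=0, init=False)

    def record(self, accepted: bool, delay_s: float) -> None:
        """Record one result: whether it was accepted, and the delay since the prompt."""
        self._bad.append(not accepted)
        if len(self._bad) > self.window:
            self._bad.popleft()
        # A single slow response is tolerated; only a persistent problem counts.
        self._slow_in_row = self._slow_in_row + 1 if delay_s > self.max_delay_s else 0

    def should_transfer(self) -> bool:
        """True when the caller appears to be struggling and should reach an operator."""
        window_full = len(self._bad) == self.window
        too_many_bad = window_full and sum(self._bad) / self.window >= self.max_bad_fraction
        persistently_slow = self._slow_in_row >= self.max_slow_in_row
        return too_many_bad or persistently_slow
```

An application would call record() after each recognition result and check should_transfer() before playing the next prompt. Suitable thresholds depend on the application and are best tuned against real call data.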

Note also that well-designed prompts can play an important role in synchronising the speaker and the recogniser: if sm_asr_listen_for() is invoked while a caller is in the process of speaking, the first part of their utterance will be lost. If the asr_mode is anything other than kSMASRModeDisabled, the remainder of the utterance will then be treated as part of the following speech. This may lead to an error in the immediately following ASR result. Playing the caller a voice prompt immediately before activating ASR on a channel is a good way to synchronise the caller's speech with the ASR system (it encourages them to stop talking and start again).

Factors Affecting Recognition

The biggest single factor affecting recognition performance in real applications (both in terms of accuracy and latency) is the choice of vocabulary. In particular, the pronunciation of short and potentially emotive words such as "no" is highly variable. By comparison, digits are rarely used to convey emotion, and are spoken much more consistently. It is therefore wise to minimise the number of active vocabulary words and their confusability. For example, the word "no" can easily be confused with "oh" or even "nine" (especially if "no" is actually pronounced "nah" in vernacular speech). Recognition performance will therefore improve dramatically if a digit vocabulary (zero/nought/oh/one/two/.../nine) is kept separate from a confirmation (yes/no) vocabulary.
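One way to keep the digit and confirmation vocabularies separate is to select the active word list from the current dialogue state. The sketch below is illustrative only: the state names and helper functions are hypothetical, the word lists follow the British English table in this note, and the confusable pairs are the two examples given above.

```python
# Word lists follow the British English vocabulary table; state names are hypothetical.
DIGITS = ("zero", "nought", "oh", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine")
CONFIRMATION = ("yes", "no")

VOCABULARY_BY_STATE = {
    "collect_digits": DIGITS,
    "confirm": CONFIRMATION,
}

# Confusions named in the text: "no" with "oh", and "no" with "nine".
KNOWN_CONFUSIONS = (("no", "oh"), ("no", "nine"))


def active_vocabulary(state):
    """Return the only words that should be active in the given dialogue state."""
    return VOCABULARY_BY_STATE[state]


def has_confusable_pair(vocabulary):
    """True if both words of any known confusable pair are active together."""
    words = set(vocabulary)
    return any(a in words and b in words for a, b in KNOWN_CONFUSIONS)
```

With this arrangement, has_confusable_pair() is false for each state's vocabulary but true for the combined list DIGITS + CONFIRMATION, which is the situation the text advises against.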

Barge-In

When speech recognition is active on a channel, any incoming signal may give rise to a recognition result. The more speech-like the signal, the greater the chance of it producing a result. The more clearly it is spoken, the greater the chance of that result being correct.

If ASR is active while a spoken prompt is being replayed to the caller, the echo from the caller's telephone is often sufficiently loud and clear to cause spurious recognition results. The simplest way to avoid this problem is to disable recognition until the replay of the prompt has completed. However, this can slow down the caller's navigation of a menu-based application: once callers become familiar with the structure of a series of ASR prompts, they will often find it more efficient to pre-empt the prompts, and say the next word before the prompt has finished, a process known as "barge-in".

Prosody's ASR algorithm can selectively ignore the echo component, and thus allow barge-in. The recogniser can be activated as soon as a replay starts, provided the channel performing the replay is specified as the sidetone channel when invoking sm_asr_listen_for().

In deciding whether to allow barge-in in an application, the developer should keep in mind the restrictions associated with the use of the sidetone channel: both the input (ASR) and the output (prompt) channels must reside on the same module, and that module must therefore have sufficient processing resources to perform both replay and ASR.
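The same-module restriction can be checked before deciding how to start recognition. This is a sketch under stated assumptions: the Channel class and both functions are hypothetical stand-ins rather than Prosody types, and the check covers only module placement. Whether the module has enough processing resources has to be established separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Channel:
    """Stand-in for a channel; holds only the fields needed here (hypothetical)."""
    channel_id: int
    module_id: int


def sidetone_for_barge_in(asr: Channel, prompt: Channel) -> Optional[Channel]:
    """Return the prompt channel if it can serve as the sidetone channel, else None."""
    return prompt if asr.module_id == prompt.module_id else None


def listen_plan(asr: Channel, prompt: Channel) -> str:
    """Choose when to activate recognition relative to the prompt replay."""
    if sidetone_for_barge_in(asr, prompt) is not None:
        return "listen_during_prompt"  # barge-in: echo can be ignored via sidetone
    return "listen_after_prompt"       # fall back: enable ASR once replay completes
```

When the two channels are on different modules, the fallback is the simple approach described earlier: recognition stays disabled until the prompt has finished.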

Languages and Vocabularies

The available vocabularies for Prosody ASR are distributed in the files listed in the following tables. The complexity of each model is different, so the files have different sizes. Note also that some filenames ("zero.sas", "stop.sas", etc.) are used in more than one language's vocabulary. Despite having the same names, these files are not interchangeable, and it is important that the correct version of each is used when recognising the respective language. The files for a particular language are distributed in a directory called $(TiNG)/iwr/gen/$lang, where $lang is a language code consistent with ISO 639-1, ISO 3166, and IETF RFC 3066.
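Because identical filenames occur in several languages, it helps to build every vocabulary path from the language code rather than from the filename alone. The sketch below assumes the application knows the value of $(TiNG) and passes it in as a string. The function names are hypothetical, and language tags such as "en-GB" and "fr-FR" are examples of RFC 3066 style codes, not confirmed directory names from the distribution.

```python
from pathlib import PurePosixPath


def vocabulary_file(ting_root, lang, filename):
    """Build <ting_root>/iwr/gen/<lang>/<filename> for one vocabulary item."""
    return PurePosixPath(ting_root) / "iwr" / "gen" / lang / filename


def vocabulary_files(ting_root, lang, filenames):
    """Build the paths for a whole vocabulary, all from the same language directory."""
    return [vocabulary_file(ting_root, lang, name) for name in filenames]
```

For example, vocabulary_file(root, "en-GB", "zero.sas") and vocabulary_file(root, "fr-FR", "zero.sas") name different files, so a vocabulary assembled through vocabulary_files() cannot accidentally mix languages.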

British (UK) English

Word            File
One             one.sas
Two             two.sas
Three           three.sas
Four            four.sas
Five            five.sas
Six             six.sas
Seven           seven.sas
Eight           eight.sas
Nine            nine.sas
Zero            zero.sas
Nought          nought.sas
Oh              oh.sas
Yes             yes.sas
No              no.sas
Help            help.sas
Start           start.sas
Restart         restart.sas
Stop            stop.sas
Erase           erase.sas
Delete          delete.sas
Cancel          cancel.sas
Double          double.sas
Triple          triple.sas
Treble          treble.sas
Phone           phone.sas
Call            call.sas
Get me          get-me.sas
Save            save.sas
Store           store.sas
Remember        remember.sas
New             new.sas
Name            name.sas
Number          number.sas
Dial            dial.sas
Record          record.sas
End             end.sas
Operator        operator.sas
Emergency       emergncy.sas
Directory       directry.sas

American (US) English

Word            File
One             one.sas
Two             two.sas
Three           three.sas
Four            four.sas
Five            five.sas
Six             six.sas
Seven           seven.sas
Eight           eight.sas
Nine            nine.sas
Zero            zero.sas
Double          double.sas
Oh              oh.sas
Yes             yes.sas
No              no.sas
Help            help.sas
Start           start.sas
Restart         restart.sas
Stop            stop.sas
Erase           erase.sas
Delete          delete.sas
Cancel          cancel.sas
Directory       directry.sas
Phone           phone.sas
Call            call.sas
New             new.sas
Name            name.sas
Number          number.sas
Dial            dial.sas
Save            save.sas
End             end.sas
Operator        operator.sas
Emergency       emergncy.sas

German (DE) German

Word            File
Eins            eins.sas
Zwei            zwei.sas
Zwo             zwo.sas
Drei            drei.sas
Vier            vier.sas
Fünf            funf.sas
Sechs           sechs.sas
Sieben          sieben.sas
Acht            acht.sas
Neun            neun.sas
Null            null.sas
Ja              ja.sas
Nein            nein.sas
Ne              ne.sas
Na              na.sas
Andere wahl     andrwahl.sas
Korrigieren     korrigrn.sas
Bestätigung     bestgung.sas
Befragen        befragen.sas
Telephonistin   telfnstn.sas
Stornierung     stornrng.sas
Zuhören         zuhoren.sas
Wiederholung    wdrholng.sas
Inhalt          inhalt.sas
Vohrer          vohrer.sas
Hilfe           hilfe.sas
Information     informtn.sas
Zurück          zuruck.sas
Stop            stop.sas
Beenden         beenden.sas
Anfang          anfang.sas
Weiter          weiter.sas
Aufnehmen       aufnehmn.sas
Nachricht       nachrcht.sas
Bestellen       bestelln.sas

French (FR) French

Word            File
Un              un.sas
Deux            deux.sas
Trois           trois.sas
Quatre          quatre.sas
Cinq            cinq.sas
Six             six.sas
Sept            sept.sas
Huit            huit.sas
Neuf            neuf.sas
Zéro            zero.sas
Oui             oui.sas
Non             non.sas
Autre choix     autrechx.sas
Correction      correctn.sas
Validation      validatn.sas
Consultation    conslttn.sas
Opérateur       operteur.sas
Quitter         quitter.sas
Écouter         ecouter.sas
Répéter         repeter.sas
Sommaire        sommaire.sas
Mode d'emploi   mddmploi.sas
Guide           guide.sas
Information     infrmatn.sas
Précédent       precednt.sas
Suivant         suivant.sas
Retour          retour.sas
Stop            stop.sas