Speech Recognition

Overview

Aculab Cloud uses Google Speech-to-Text, a multilingual natural language speech recogniser powered by machine learning. In practice, this means you tell it which language it will be hearing, and it will do its best to transcribe whatever you say to it. You can optionally provide hint information to adapt the recogniser to words or phrases that are more likely to be said.

In combination with Text To Speech (TTS), our Speech Recognition allows your application to present a natural, conversational interface to the user. This gives you a great deal of flexibility in how to drive the conversation, including the use of AI-driven chatbots.

Our Speech Recognition is available for REST applications only, and requires REST API v2. It may be accessed using the get_input, play, run_speech_menu, start_transcription and stop_transcription actions.

Languages

Currently, our Speech Recognition supports 120 languages and language variants. For the up-to-date list, see Speech Recognition Languages.

Models

Google Speech-to-Text defines a number of models that have been trained from millions of examples of audio from specific sources, for example phone calls or videos. Recognition accuracy can be improved by using the specialised model that relates to the kind of audio data being analysed.

For example, the phone_call model used on audio data recorded from a phone call will produce more accurate transcription results than the default, command_and_search, or video models.

Premium models

Google have made premium models available for some languages, for specific use cases (e.g. medical_conversation). These models have been optimised to recognise audio data from these specific use cases more accurately. See Speech Recognition Languages to find out which premium models are available for your language.

Use cases

Conversations

In most applications, the main action used to drive a conversation with the user is get_input. This allows you to play a file or TTS prompt, then receive a transcription of the user's response passed to your next_page. An example interaction might be:

  • Prompt: "What would you like to do?"
  • Response: "Pay a bill."

The run_speech_menu action, being somewhat more restricted, is ideal for menu-driven applications. Here, a file or TTS prompt is played, and the user's response, which must be one of a set of specified words or short phrases, is passed to the selected next_page. An example interaction here might be:

  • Prompt:"Would you like to speak to Sales, Marketing or Support?"
  • Response: "Support"

Play with selective barge-in

The play action seems at first sight an odd place to feature Speech Recognition. However, consider the case where the user is listening to a long recorded voicemail. They may say a small number of things to stop it, for example "Next", "Again" or "Delete", but, with it being a long voicemail, there is always a chance of the Speech Recognition transcribing some background speech. The play action therefore allows the application to specify whether barge-in on speech is allowed and, if so, whether it is restricted to specific supplied phrases.
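
As a sketch of the idea (every field name here is an illustrative placeholder, not the documented play parameters), the play action would carry the voicemail audio plus the short list of phrases that are allowed to interrupt it:

    # Illustrative sketch only: "file_to_play", "barge_in_phrases" and friends
    # are placeholder names used to show the concept, not the documented schema.
    import json

    def play_voicemail_page(voicemail_url):
        """Play a long voicemail, allowing barge-in only on a few commands."""
        reply = {
            "actions": [
                {
                    "action": "play",
                    "file_to_play": voicemail_url,
                    "speech_settings": {"language": "en-GB"},
                    # Only these phrases stop playback; other background speech
                    # picked up by the recogniser is ignored.
                    "barge_in_phrases": ["Next", "Again", "Delete"],
                    "next_page": "/voicemail_command",
                }
            ]
        }
        return json.dumps(reply)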

Live transcription

The start_transcription and stop_transcription actions allow your application to receive a live transcription, sent to its chosen page, of the speech on any combination of the inbound and outbound audio streams, all performed outside the ongoing IVR call flow. These actions would typically be used to allow the application to be aware of, and react to, the content of human-to-human conversations. For example, a section of the agent's or receptionist's screen might update to display a 'Book an appointment' button if the caller mentions they would like to book one. Alternatively, a manager's screen might update to flag that an agent is involved in a particularly difficult conversation.
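
A sketch of how a page receiving live transcription results might react, assuming (purely for illustration) that each result arrives as JSON with "transcription" and "direction" fields and that a hypothetical notify_agent_ui helper pushes updates to the agent's screen:

    # Illustrative sketch only: the shape of the posted result and the
    # notify_agent_ui helper are assumptions, not part of the documented API.
    def notify_agent_ui(event: str) -> None:
        """Placeholder for however the application updates the agent's screen
        (websocket, message queue, ...)."""
        print(f"UI event: {event}")

    def live_transcription_page(posted_result: dict) -> None:
        """Called each time a live transcription result arrives."""
        text = posted_result.get("transcription", "").lower()
        direction = posted_result.get("direction", "inbound")  # inbound = caller

        if direction == "inbound" and "appointment" in text:
            # The caller mentioned an appointment: offer the booking button.
            notify_agent_ui("show_booking_button")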

Live translation

The connect action can be configured to include an AI translator in the conversation. The translator will use TTS to say translations of the speech recognised from each user to both parties.
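
Purely as an illustration of the concept, something along the lines of the sketch below. Every field name here, including "translator" and its languages, is a placeholder rather than the documented connect parameters.

    # Illustrative sketch only: "translator" and its fields are placeholders
    # showing the concept, not the documented connect parameters.
    import json

    def connect_with_translator_page(destination):
        """Connect the caller to a destination with an AI translator included."""
        reply = {
            "actions": [
                {
                    "action": "connect",
                    "call_to": destination,
                    "translator": {
                        "language_a": "en-GB",  # language spoken by the caller
                        "language_b": "fr-FR",  # language spoken by the other party
                    },
                }
            ]
        }
        return json.dumps(reply)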

Speech adaptation

When starting speech recognition, as well as specifying the language, the application may optionally provide a set of words or phrases, word_hints, to adapt the recognition to what is more likely to be said. This reflects the fact that very few conversations are open-ended - the application generally has some prior knowledge of the speech it is expecting to receive. For example, when asking the caller to pick a colour, including "aquamarine" in the word hints will make the recogniser more likely to transcribe that than "aqua marine". Similarly, when asking the caller to say a digit, providing a set of word_hints comprising all the digits will improve the accuracy of the transcription.
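
For example, a digit-collection prompt might pass all the digit words as word_hints. In the sketch below, word_hints is the parameter described above; the surrounding field names are illustrative placeholders.

    # word_hints is the parameter described above; the surrounding field names
    # are illustrative placeholders rather than the documented schema.
    import json

    DIGIT_WORDS = ["zero", "oh", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]

    def ask_for_digit_page():
        """Ask for a single digit, biasing recognition towards digit words."""
        reply = {
            "actions": [
                {
                    "action": "get_input",
                    "prompt": "Please say a digit.",
                    "speech_settings": {
                        "language": "en-GB",
                        "word_hints": DIGIT_WORDS,  # bias towards spoken digits
                    },
                    "next_page": "/handle_digit",
                }
            ]
        }
        return json.dumps(reply)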

Note that, in spite of the above, our Speech Recognition is not a grammar-based speech recogniser, so, for example, you can't constrain its output to be four digits, a time or a date. However, armed with word_hints and some post-processing of the transcription, it can allow very natural, expressive dialogues and is well matched to the increasingly human-like conversations of modern AI chatbots.
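
As a sketch of the kind of post-processing that can stand in for a grammar, the hypothetical helper below (not part of the API) normalises a free-form transcription into a four-digit string and rejects anything else:

    import re

    # Hypothetical post-processing helper, not part of the API: map a free-form
    # transcription such as "one two 3 four" to a four-digit string, or None.
    WORD_TO_DIGIT = {
        "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
        "four": "4", "five": "5", "six": "6", "seven": "7",
        "eight": "8", "nine": "9",
    }

    def extract_four_digits(transcription: str):
        digits = []
        for token in re.findall(r"[a-z]+|\d", transcription.lower()):
            if token.isdigit():
                digits.append(token)
            elif token in WORD_TO_DIGIT:
                digits.append(WORD_TO_DIGIT[token])
        return "".join(digits) if len(digits) == 4 else None

    print(extract_four_digits("one two 3 four"))  # -> "1234"
    print(extract_four_digits("pay a bill"))      # -> None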

Charging

On a trial account you can start using Speech Recognition straight away.

For other accounts, our Speech Recognition is charged per recognition, per minute, with 15-second granularity. So, for example:

  • A get_input which listens for 12 seconds will be charged for 15 seconds.
  • A start_transcription for separate outbound and inbound audio which listens for 3 minutes 20 seconds will be charged for 7 minutes in total (each of the two transcriptions is charged for 3 minutes 30 seconds).
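
The examples above follow from rounding each recognition up to the next 15-second boundary, which can be sketched as:

    import math

    def charged_seconds(listen_seconds: float, granularity: int = 15) -> int:
        """Round a recognition's listening time up to the next 15-second boundary."""
        return int(math.ceil(listen_seconds / granularity)) * granularity

    print(charged_seconds(12))       # 15 s, as in the get_input example
    # Separate outbound and inbound transcription means two recognitions,
    # each of 3 minutes 20 seconds (200 s):
    print(2 * charged_seconds(200))  # 420 s = 7 minutes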

You can obtain detailed charge information for a specific call using the Application Status web service. You can obtain detailed charge information for calls over a period of time using the Managing Reports web services. When using transcription in Separate mode there will be two corresponding entries in the Feature Data Record (FDR), one for each direction.