You can use TwiML's <Gather> verb to collect digits or transcribe speech during a call.
Starting Oct 1, 2024, you can set Google Speech-to-Text (STT) V2 as your default speech-to-text provider for <Gather>. Additionally, you can specify Google STT V2 or Deepgram as a provider in your <Gather> speechModel attribute. This feature is currently in Public Beta.
To learn more, see Enable Google STT V2 on your account and associated updates to the <Gather> hints, language, and speechModel attributes.
The following example shows the most basic use of <Gather> TwiML:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather/>
</Response>
You can always send Twilio plain TwiML, or leverage the helper libraries to add TwiML to your web applications:
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
response.gather();

console.log(response.toString());

<?xml version="1.0" encoding="UTF-8"?>
<!-- page located at http://example.com/simple_gather.xml -->
<Response>
  <Gather/>
</Response>
When Twilio executes this TwiML, it will pause for up to five seconds to wait for the caller to enter digits on their keypad. A few things might happen next:
- The caller enters digits followed by the # symbol or 5 seconds of silence. Twilio then sends the user's input as a parameter in a POST request to the URL hosting the <Gather> TwiML (if no action attribute is provided) or to the action URL.
- The caller enters nothing for 5 seconds. <Gather> ends and call flow continues with the next verb in the TwiML document.
By nesting <Say> or <Play> in your <Gather>, you can read some text or play music for your caller while waiting for their input. See "Nest other verbs" below for examples and more information.
To enable V2 of Google Cloud's STT as the default on your Twilio account and override your existing <Gather> Speech configuration, do the following:
Once enabled, Twilio will map your existing <Gather> attributes to use Google STT V2 as the underlying speech-to-text provider. To explicitly specify a supported Google STT V2 model, see the speechModel attribute.
<Gather> supports the following attributes that change its behavior. Review the table for updates related to Google STT V2 and the ability to specify a provider (googlev2 or deepgram) on the speechModel attribute.
Attribute name | Allowed values | Default value |
---|---|---|
action | URL (relative or absolute) | current document URL |
finishOnKey | 0-9, #, *, and '' (the empty string) | # |
hints | "words, phrases that have many words". Supported class tokens or keywords vary according to your provider and version | none |
input | dtmf, speech, dtmf speech | dtmf |
language | Google STT V1: BCP-47 language tags. Google STT V2 and Deepgram: supported languages vary according to your selected speechModel | en-US |
method | GET, POST | POST |
numDigits | positive integer | unlimited |
partialResultCallback | URL (relative or absolute) | none |
partialResultCallbackMethod | GET, POST | POST |
profanityFilter | true, false | true |
speechTimeout | positive integer or auto | timeout attribute value |
timeout | positive integer | 5 |
speechModel | Google STT V1: default, numbers_and_commands, phone_call, experimental_conversations, experimental_utterances. Google STT V2 or Deepgram: see the provider-specific examples for passing {provider_model} as a value. These Google STT V2 models are currently not supported: chirp, chirp_telephony, and chirp_2 | default |
enhanced | true, false. enhanced only applies to the phone_call model in Google STT V1 | false |
actionOnEmptyResult | true, false | false |
Use one or more of these attributes in a <Gather> verb like so:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech dtmf" timeout="3" numDigits="1">
    <Say>Please press 1 or say sales for sales.</Say>
  </Gather>
</Response>
The action attribute takes an absolute or relative URL as a value. When the caller finishes entering digits (or the timeout is reached), Twilio will make an HTTP request to this URL. That request will include the user's data and Twilio's standard request parameters.
If you do not provide an action parameter, Twilio will POST to the URL that houses the active TwiML document.
Twilio may send some extra parameters with its request after the <Gather> ends:
If you gather digits from the caller, Twilio will include the Digits parameter containing the numbers your caller entered.
If you specify speech as an input with input="speech", Twilio will include SpeechResult and Confidence:
- SpeechResult contains the transcribed result of your caller's speech.
- Confidence contains a confidence score between 0.0 and 1.0. A higher confidence score means a better likelihood that the transcription is accurate.
Note: Your code should not expect Confidence as a required field, as it is not guaranteed to be accurate, or even set, in any of the results.
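As a rough illustration of the parameters above, the following sketch parses the form-encoded body of an action request and builds a reply. It uses only Node built-ins instead of the helper library so that it runs on its own. The function names (escapeXml, parseGatherResult, respondToGather) are made up for this example, and the reply wording is an assumption.

```javascript
// Escape text before placing it inside a TwiML element.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Parse the form-encoded body that Twilio POSTs to the action URL.
function parseGatherResult(rawBody) {
  const params = new URLSearchParams(rawBody);
  const confidence = params.get('Confidence');
  return {
    digits: params.get('Digits'),        // null when no digits were gathered
    speech: params.get('SpeechResult'),  // null when no speech was gathered
    // Confidence is optional, so keep null instead of assuming a number.
    confidence: confidence === null || confidence === '' ? null : Number(confidence),
  };
}

// Hypothetical handler: turn the parsed result into the next TwiML document.
function respondToGather(rawBody) {
  const { digits, speech } = parseGatherResult(rawBody);
  const heard = digits || speech;
  const message = heard ? `You entered ${heard}` : 'We did not receive any input.';
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    `<Response><Say>${escapeXml(message)}</Say></Response>`;
}
```

In a real application, respondToGather would be called from whatever web framework serves your action URL, with the raw request body as its argument.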
After <Gather> ends and Twilio sends its request to your action URL, the current call will continue using the TwiML you send back from that URL. Because of this, any TwiML verbs that occur after your <Gather> are unreachable.
However, if the caller did not enter any digits or speech, call flow would continue in the original TwiML document.
Without an action URL, Twilio will re-request the URL that hosts the TwiML you just executed. This can lead to unwanted looping behavior if you're not careful. See our example below for more information.
If you started or updated a call with a twiml parameter, the action URLs for <Record>, <Gather>, and <Pay> must be absolute.
The Call Resource API Docs have language-specific examples of creating and updating Calls with the twiml parameter.
Imagine you have the following TwiML hosted at http://example.com/complex_gather.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather>
    <Say>
      Please enter your account number,
      followed by the pound sign
    </Say>
  </Gather>
  <Say>We didn't receive any input. Goodbye!</Say>
</Response>
Scenario 1: If the caller:
- does not press any keys for 5 seconds, or
- presses the # key (the default finishOnKey value) before entering any other digits
then they will hear, "We didn't receive any input. Goodbye!"
Scenario 2: If the caller:
- presses a digit while the prompt is still being read
then the <Say> verb will stop speaking and wait for the user's action.
Scenario 3: If the caller:
- enters 12345 and then presses #, or
- enters 12345 and then lets 5 seconds pass
then Twilio will submit the digits and request parameters to the URL hosting this TwiML (http://example.com/complex_gather.xml). Twilio will fetch this same TwiML again and execute it, getting the caller stuck in this <Gather> loop.
To avoid this behavior, it's best practice to point your action URL to a new URL that hosts some other TwiML for handling the duration of the call.
The following code sample is almost identical to the TwiML above, but we've added the action and method attributes:
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const gather = response.gather({
  action: '/process_gather.php',
  method: 'GET'
});
gather.say('Please enter your account number,\nfollowed by the pound sign');
response.say('We didn\'t receive any input. Goodbye!');

console.log(response.toString());

<?xml version="1.0" encoding="UTF-8"?>
<!-- page located at http://example.com/complex_gather.xml -->
<Response>
  <Gather action="/process_gather.php" method="GET">
    <Say>
      Please enter your account number,
      followed by the pound sign
    </Say>
  </Gather>
  <Say>We didn't receive any input. Goodbye!</Say>
</Response>
Now when the caller enters their input, Twilio will submit the digits and request parameters to the process_gather.php URL.
If we wanted to read back this input to the caller, our code hosted at /process_gather.php might look like:
<?php
// page located at http://yourserver/process_gather.php
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<Response><Say>You entered " . $_REQUEST['Digits'] . "</Say></Response>";
?>
finishOnKey lets you set a value that your caller can press to submit their digits.
For example, if you set finishOnKey to # and your caller enters 1234#, Twilio will immediately stop waiting for more input after they press #.
Twilio will then submit Digits=1234 to your action URL (note that the # is not included).
Allowed values for this attribute are:
- # (this is the default value)
- *
- 0-9
- An empty string ('')
If you use an empty string, <Gather> will capture all user input and no key will end the <Gather>. In this case, Twilio submits the user's digits to the action URL only after the timeout is reached.
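The finishOnKey rules above can be mimicked with a small toy model. This is only a sketch of the documented behavior, not Twilio's implementation. The collectDigits function and its endedBy field are invented for illustration.

```javascript
// Toy model: `keys` is the sequence of keys the caller pressed, e.g. '1234#'.
function collectDigits(keys, finishOnKey = '#') {
  let digits = '';
  for (const key of keys) {
    if (finishOnKey !== '' && key === finishOnKey) {
      // The finish key ends the gather and is not included in Digits.
      return { digits, endedBy: 'finishOnKey' };
    }
    digits += key;
  }
  // No finish key was pressed, so only the timeout can end the gather.
  return { digits, endedBy: 'timeout' };
}

console.log(collectDigits('1234#'));    // { digits: '1234', endedBy: 'finishOnKey' }
console.log(collectDigits('12#4', '')); // { digits: '12#4', endedBy: 'timeout' }
```

With an empty finishOnKey, every key (including #) is treated as input, which matches the "capture all user input" description.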
If the following TwiML is used, finishOnKey has no impact once the caller starts speaking.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech dtmf" finishOnKey="#" timeout="5">
    <Say>
      Please say something or press * to access the main menu
    </Say>
  </Gather>
  <Say>We didn't receive any input. Goodbye!</Say>
</Response>
You can improve Twilio's recognition of the words or phrases you expect from your callers by adding hints to your <Gather>.
The hints attribute contains a list of words or phrases that Twilio should expect during recognition. Supported class tokens or keywords will vary according to your provider and version.
You may provide up to 500 words or phrases in this list, separating each entry with a comma. Your hints may be up to 100 characters each, and you should separate each word in a phrase with a space, e.g.:
hints="this is a phrase I expect to hear, keyword, product name, name"
We have implemented Google's V1 and V2 class token lists to improve recognition. You can pass a class token directly in the hints:
hints="$OOV_CLASS_ALPHANUMERIC_SEQUENCE"
Hints are passed as keywords to Deepgram.
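If you assemble the hints value in code, a small helper can enforce the limits stated above before the value reaches your TwiML. The sketch below is one possible approach. The buildHints name is hypothetical, and rejecting phrases that contain a comma is an assumption made here because a comma would split one phrase into two entries.

```javascript
const MAX_HINTS = 500;        // maximum number of entries, per the text above
const MAX_HINT_LENGTH = 100;  // maximum characters per entry, per the text above

// Build a comma-separated hints string from an array of phrases.
function buildHints(phrases) {
  const cleaned = phrases
    .map((phrase) => phrase.trim().replace(/\s+/g, ' ')) // single spaces between words
    .filter((phrase) => phrase.length > 0);

  if (cleaned.length > MAX_HINTS) {
    throw new RangeError(`Too many hints: ${cleaned.length} (limit ${MAX_HINTS})`);
  }
  for (const phrase of cleaned) {
    if (phrase.length > MAX_HINT_LENGTH) {
      throw new RangeError(`Hint longer than ${MAX_HINT_LENGTH} characters: ${phrase}`);
    }
    if (phrase.includes(',')) {
      throw new Error(`Hint must not contain a comma: ${phrase}`);
    }
  }
  return cleaned.join(', ');
}

console.log(buildHints(['this is a phrase I expect to hear', 'keyword', 'product name']));
// this is a phrase I expect to hear, keyword, product name
```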
Specify which inputs (DTMF or speech) Twilio should accept with the input attribute.
The default input for <Gather> is dtmf. You can set input to dtmf, speech, or dtmf speech.
If you set input to speech, Twilio will gather speech from the caller for a maximum duration of 60 seconds.
Please note that <Gather> speech recognition is not yet optimized for alphanumeric inputs (e.g., ABC123). Because this can lead to inaccurate results, we do not recommend using it for such inputs.
If you set dtmf speech for your input, the first detected input (speech or dtmf) will take precedence. If speech is detected first, finishOnKey will be ignored.
The following example shows a <Gather> that specifies speech input from the user. When this TwiML executes, the caller will hear the <Say> prompt. Twilio will then collect speech input for up to 60 seconds.
Once the caller stops speaking for five seconds, Twilio posts their transcribed speech to your action URL.
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const gather = response.gather({
  input: 'speech',
  action: '/completed'
});
gather.say('Welcome to Twilio, please tell us why you\'re calling');

console.log(response.toString());

<?xml version="1.0" encoding="UTF-8"?>
<!-- page located at http://example.com/simple_gather.xml -->
<Response>
  <Gather input="speech" action="/completed">
    <Say>Welcome to Twilio, please tell us why you're calling</Say>
  </Gather>
</Response>
The language attribute specifies the language Twilio should recognize from your caller.
This value defaults to en-US, but you can set it to any supported language of your provider.
You can set your language to any of our supported languages. See the full list.
Google STT V2 and Deepgram models map to specific languages. Please use the appropriate language supported by the selected speechModel. The language attribute is passed through to the provider unchanged.
The method you set on <Gather> tells Twilio whether to request your action URL via HTTP GET or POST.
POST is <Gather>'s default method.
You can set the number of digits you expect from your caller by including numDigits in <Gather>.
The numDigits attribute only applies to DTMF input.
For example, you might wish to set numDigits="5" when asking your caller to enter their 5-digit zip code. Once the caller enters the final digit of 94117, Twilio will immediately submit the data to your action URL.
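Because a gather can also end through the timeout or the finish key, the handler behind your action URL may still receive fewer than five digits. The sketch below shows one way a handler for the zip code example might check the input. The handleZipCode function and the /zip-prompt URL are hypothetical.

```javascript
// Hypothetical action handler for a <Gather numDigits="5"> zip code prompt.
function handleZipCode(digits) {
  if (/^\d{5}$/.test(digits || '')) {
    // Space the digits out so they are read one at a time.
    const spoken = digits.split('').join(' ');
    return `<Response><Say>Thanks. You entered ${spoken}.</Say></Response>`;
  }
  // Anything else (too short, missing, or containing * or #) sends the
  // caller back to the assumed prompt document.
  return '<Response><Redirect method="POST">/zip-prompt</Redirect></Response>';
}

console.log(handleZipCode('94117'));
// <Response><Say>Thanks. You entered 9 4 1 1 7.</Say></Response>
```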
If you provide a partialResultCallback URL, Twilio will make requests to this URL in real-time as it recognizes speech. These requests will contain a parameter labeled UnstableSpeechResult which contains partial transcriptions. These transcriptions may change as the speech recognition progresses.
The webhooks Twilio makes to your partialResultCallback are asynchronous. They do not accept any TwiML in response. If you want to take more actions based on this partial result, you need to use the REST API to modify the call.
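A partialResultCallback endpoint therefore only needs to record what it receives and acknowledge the request. The following sketch keeps the most recent partial transcription per call in memory. It assumes the request also carries the standard CallSid parameter, and the handlePartialResult name and the in-memory Map are illustrative choices, not part of any Twilio API.

```javascript
// Latest partial transcription per call. A real deployment would likely
// use shared storage instead of process memory.
const latestPartial = new Map();

// Handle the form-encoded body of one partialResultCallback request.
function handlePartialResult(rawBody) {
  const params = new URLSearchParams(rawBody);
  const callSid = params.get('CallSid'); // assumed to be present
  const unstable = params.get('UnstableSpeechResult');
  if (callSid && unstable !== null) {
    // Newer partials replace older ones because the text may change.
    latestPartial.set(callSid, unstable);
  }
  // TwiML in the response would not be acted on, so send an empty reply.
  return { status: 204, body: '' };
}
```

Any follow-up action, such as redirecting the call once a keyword appears, would go through the REST API as described above.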
The profanityFilter attribute specifies whether Twilio should filter profanities out of your speech transcription. This attribute defaults to true, which replaces all but the initial character in each filtered profane word with asterisks, e.g., 'f***.'
If you set this attribute to false, Twilio will no longer filter any profanities in your transcriptions.
When collecting speech from your caller, speechTimeout sets the limit (in seconds) that Twilio will wait after a pause in speech before it stops its recognition. After this timeout is reached, Twilio will post the SpeechResult to your action URL.
If you use both timeout and speechTimeout in your <Gather>, timeout will take precedence for DTMF input and speechTimeout will take precedence for speech.
If you set speechTimeout to auto, Twilio will stop speech recognition when there is a pause in speech and return the results immediately.
timeout allows you to set the limit (in seconds) that Twilio will wait for the caller to press another digit or say another word before it sends data to your action URL.
For example, if timeout is 3, Twilio waits three seconds for the caller to press another key or say another word before submitting their data.
Twilio will wait until all nested verbs execute before it begins the timeout period.
The default timeout value is 5.
speechModel allows you to select a specific model that is best suited for your use case to improve the accuracy of speech to text.
Starting October 1, 2024, you can specify a provider along with a supported model from Google Speech-to-Text (STT) V2 (googlev2) or Deepgram (deepgram). The following Google STT V2 models are currently not supported:
- chirp
- chirp_telephony
- chirp_2
The speechModel has to be specified in the format {provider_model}, where provider is either googlev2 or deepgram and the model is a supported model from Google Speech-to-Text (STT) V2 or Deepgram. See below for examples.
See Google STT V2's mapping of BCP-47 language and models for more details.
<Gather input="speech" speechModel="googlev2_telephony">
  <Say>Please tell us why you're calling.</Say>
</Gather>
See Deepgram's language and model mapping for more details.
<Gather input="speech" speechModel="deepgram_nova-2">
  <Say>Please tell us why you're calling.</Say>
</Gather>
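The {provider_model} format can be produced by a small helper that also rejects the unsupported Google STT V2 models listed above. This is a sketch only: the speechModelValue function is hypothetical, and it does not check whether a model name actually exists at the provider.

```javascript
const PROVIDERS = ['googlev2', 'deepgram'];
const UNSUPPORTED_GOOGLE_V2_MODELS = ['chirp', 'chirp_telephony', 'chirp_2'];

// Join a provider and a model into a speechModel attribute value.
function speechModelValue(provider, model) {
  if (!PROVIDERS.includes(provider)) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  if (provider === 'googlev2' && UNSUPPORTED_GOOGLE_V2_MODELS.includes(model)) {
    throw new Error(`Google STT V2 model not supported: ${model}`);
  }
  return `${provider}_${model}`;
}

console.log(speechModelValue('googlev2', 'telephony')); // googlev2_telephony
console.log(speechModelValue('deepgram', 'nova-2'));    // deepgram_nova-2
```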
For users on Google STT V1, the attribute supports default, numbers_and_commands, phone_call, experimental_conversations, and experimental_utterances.
numbers_and_commands and phone_call are best suited for use cases where you'd expect to receive short queries such as voice commands or voice search. phone_call is best for audio that originated from a PSTN phone call (typically an 8 kHz sample rate).
<Gather input="speech" enhanced="true" speechModel="phone_call">
  <Say>Please tell us why you're calling.</Say>
</Gather>
The phone_call value for speechModel currently supports only the following languages: en-US, en-GB, en-AU, fr-FR, fr-CA, ja-JP, ru-RU, es-US, es-ES, and pt-BR. You must also set the speechTimeout value to a positive integer, rather than using auto.
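These two constraints are easy to check before generating TwiML. The sketch below returns a list of problems for a phone_call configuration. The checkPhoneCallGather name is hypothetical, and treating a missing speechTimeout as a problem is one reading of "you must also set".

```javascript
// Languages listed above as supported by the phone_call model.
const PHONE_CALL_LANGUAGES = [
  'en-US', 'en-GB', 'en-AU', 'fr-FR', 'fr-CA',
  'ja-JP', 'ru-RU', 'es-US', 'es-ES', 'pt-BR',
];

// Return an array of problems; an empty array means the settings look fine.
function checkPhoneCallGather({ language = 'en-US', speechTimeout } = {}) {
  const problems = [];
  if (!PHONE_CALL_LANGUAGES.includes(language)) {
    problems.push(`language ${language} is not supported by phone_call`);
  }
  const seconds = Number(speechTimeout); // 'auto' and undefined become NaN
  if (!Number.isInteger(seconds) || seconds <= 0) {
    problems.push('speechTimeout must be set to a positive integer');
  }
  return problems;
}

console.log(checkPhoneCallGather({ language: 'en-GB', speechTimeout: 3 })); // []
```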
Experimental models are designed to give access to the latest speech technology and machine learning research, and can provide higher accuracy for speech recognition over other available models. However, some features that other available models support, such as confidence scores, are not yet supported by the experimental models.
The experimental_utterances model is for short utterances that are a few seconds in length and is useful for trying to capture commands or other single shot directed speech use cases; think "press 0 or say 'support' to speak with an agent." The experimental_conversations model supports spontaneous speech and conversations; think "tell us why you're calling today."
Both the experimental_conversations and experimental_utterances values for speechModel currently support only a set of languages.
Please explore all options to see which works best for your use case.
The enhanced attribute instructs <Gather> to use a premium speech model that will improve the accuracy of transcription results. The premium speech model is only supported with the phone_call model in Google STT V1. At $0.03 per 15s of <Gather>, it costs 50% more than the standard phone_call model. The premium phone_call model was built using thousands of hours of training data. It produces 54% fewer errors when transcribing phone conversations compared to the basic phone_call model.
The following TwiML instructs <Gather> to use the premium phone_call model:
<Gather input="speech" enhanced="true" speechModel="phone_call">
  <Say>Please tell us why you're calling.</Say>
</Gather>
<Gather> will ignore the enhanced attribute if any speechModel other than phone_call is used.
For example, in the following TwiML, <Gather> ignores the enhanced attribute and applies the standard numbers_and_commands speechModel.
<Gather input="speech" enhanced="true" speechModel="numbers_and_commands">
  <Say>Please tell us why you're calling.</Say>
</Gather>
The premium enhanced phone_call model is priced at $0.03 per utterance while the standard phone_call model is priced at $0.02 per utterance. For Public Beta, use of either Google STT V2 or Deepgram API is priced at $0.025 per utterance.
Read the Language appendix to see if the enhanced model is available for your language.
actionOnEmptyResult allows you to force <Gather> to send a webhook to the action URL even when there is no DTMF input. By default, if <Gather> times out while waiting for DTMF input, it will continue on to the next TwiML instruction.
For example, in the following TwiML, when <Gather> times out, the <Say> instruction is executed.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather>
    <Say>
      Please enter your account number,
      followed by the pound sign
    </Say>
  </Gather>
  <Say>We didn't receive any input. Goodbye!</Say>
</Response>
To always force <Gather> to send a webhook to the action URL, use the following TwiML:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather actionOnEmptyResult="true" action="/gather-action">
    <Say>
      Please enter your account number,
      followed by the pound sign
    </Say>
  </Gather>
</Response>
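With actionOnEmptyResult enabled, the handler at the action URL has to cope with requests that carry no Digits. One possible pattern, sketched below, re-prompts a limited number of times and then ends the call. The attempt query parameter, the MAX_ATTEMPTS limit, and the handleGatherAction name are assumptions made for this example and not Twilio features.

```javascript
const MAX_ATTEMPTS = 3; // illustrative limit chosen for this sketch

// rawBody: form-encoded request body; attempt: count read from the
// hypothetical ?attempt= query parameter on the action URL (default 1).
function handleGatherAction(rawBody, attempt = 1) {
  const digits = new URLSearchParams(rawBody).get('Digits');
  if (digits) {
    return '<Response><Say>Thank you.</Say></Response>';
  }
  if (attempt >= MAX_ATTEMPTS) {
    return '<Response><Say>We did not receive any input. Goodbye!</Say><Hangup/></Response>';
  }
  // Empty result: prompt again and carry the attempt count forward.
  return [
    '<Response>',
    `<Gather actionOnEmptyResult="true" action="/gather-action?attempt=${attempt + 1}">`,
    '<Say>Please enter your account number, followed by the pound sign</Say>',
    '</Gather>',
    '</Response>',
  ].join('');
}
```

Carrying the counter in the action URL keeps the handler stateless, at the cost of trusting a value that travels through the request.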
You can nest the following verbs within <Gather>:
- <Say>
- <Play>
- <Pause>
The following example shows a <Gather> with a nested <Say>. This will read some text to the caller, and allows the caller to enter input at any time while that text is read to them:
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const gather = response.gather({
  input: 'speech dtmf',
  timeout: 3,
  numDigits: 1
});
gather.say('Please press 1 or say sales for sales.');

console.log(response.toString());

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech dtmf" timeout="3" numDigits="1">
    <Say>Please press 1 or say sales for sales.</Say>
  </Gather>
</Response>
When a <Gather> contains nested <Say> or <Play> verbs, the timeout begins either after the audio completes or when the caller presses their first key. If <Gather> contains multiple <Play> verbs, the contents of all files will be retrieved before the <Play> begins.
If you are using <Play> verbs, we recommend hosting your media in AWS S3 in us-east-1, eu-west-1, or ap-southeast-2 depending on which Twilio Region you are using. No matter where you host your media files, always ensure that you're setting appropriate Cache-Control headers. Twilio uses a caching proxy in its webhook pipeline and will cache media files that have cache headers. Serving media out of Twilio's cache can take 10ms or less. Keep in mind that we run a fleet of caching proxies, so it may take multiple requests before all of the proxies have a copy of your file in cache.
When a <Gather> reaches its timeout without any user input, call control will fall through to the next verb in your original TwiML document.
If you wish to have Twilio submit a request to your action URL even if <Gather> times out, include a <Redirect> after the <Gather> like this:
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const gather = response.gather({
  action: '/process_gather.php',
  method: 'GET'
});
gather.say('Enter something, or not');
response.redirect({
  method: 'GET'
}, '/process_gather.php?Digits=TIMEOUT');

console.log(response.toString());

<?xml version="1.0" encoding="UTF-8"?>
<!-- page located at http://example.com/gather_hints.xml -->
<Response>
  <Gather action="/process_gather.php" method="GET">
    <Say>Enter something, or not</Say>
  </Gather>
  <Redirect method="GET">
    /process_gather.php?Digits=TIMEOUT
  </Redirect>
</Response>
With this code, Twilio will move to the next verb in the document (<Redirect>) when <Gather> times out. In our example, we instruct Twilio to make a new GET request to /process_gather.php?Digits=TIMEOUT.
A few common problems users face when working with <Gather>:
Problem: <Gather> doesn't receive caller input when the caller is using a VoIP phone.
Solution: Some VoIP phones have trouble sending DTMF digits. This is usually because these phones use compressed bandwidth-conserving audio protocols that interfere with the transmission of the digit's signal. Consult your phone's documentation on DTMF problems.
Problem: Twilio does not send the Digits parameter to your <Gather> URL.
Solution: Check to ensure your application is not responding to the action URL with an HTTP 3xx redirect. Twilio will follow this redirect, but won't resend the Digits parameter.
If you encounter other issues with <Gather>, please reach out to our support team for assistance.