Media Streams Overview
Learn how to use Twilio Media Streams to capture, transmit, and receive raw audio from live voice calls over WebSockets in near real time. This feature gives you access to both the inbound and outbound audio tracks for immediate external processing. You can use this guide to implement self-service automation, build inbound and outbound contact centers, deploy AI agents, and produce AI/ML transcriptions.
See Related reference documentation to learn more about the TwiML and API resources used in this guide.
Media Streams provides access to the raw audio from a Programmable Voice call by streaming it over WebSockets to a destination you specify. This enables use cases such as real-time transcriptions, sentiment analysis, voice authentication, and more. You can also stream raw audio into a Twilio Voice call from another application, enabling conversational Interactive Voice Response (IVR) or real-time interactions with an AI chatbot.
Support for Twilio Regions
You can use Media Streams in the Ireland (IE1) and Australia (AU1) Regions. To set up Media Streams in these Regions, follow the guides for non-US outbound and inbound calls. The default Region remains US1.
To get started with Media Streams, first make sure you're familiar with making and receiving voice calls with Twilio. If this is your first time building with Twilio Programmable Voice, complete one of the Voice quickstarts.
Before you build with Media Streams, decide whether you need a unidirectional or a bidirectional stream. The sections below explain the differences between the two options and link to helpful docs and resources.
In a unidirectional Media Stream, your WebSocket application receives the call's audio but can't send audio back to Twilio for playback.
With unidirectional Media Streams, you can receive the inbound audio track (the audio that is incoming to Twilio), the outbound audio track (the audio that Twilio is generating on the call), or both tracks.
DTMF isn't supported with unidirectional streams.
Start a unidirectional stream with the <Start><Stream> TwiML verb or through the Stream resource REST API.
If you use TwiML for the stream, Twilio executes <Start><Stream>, which initiates the audio stream with your WebSocket server, and then executes the next TwiML instruction you provide.
You can stop a unidirectional Media Stream using <Stop><Stream> or via the Stream resource.
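As a rough illustration of the TwiML flow described above, the sketch below builds a `<Start><Stream>` document using only the Python standard library. The function name `build_start_stream_twiml`, the URL `wss://example.com/audio`, and the `<Say>` text are hypothetical placeholders. The `track` values are assumed from the <Stream> TwiML Reference, so check them there before relying on them.

```python
import xml.etree.ElementTree as ET


def build_start_stream_twiml(ws_url: str, track: str = "inbound_track") -> str:
    """Return TwiML that starts a unidirectional stream and then continues the call."""
    # Track names are an assumption taken from the <Stream> TwiML Reference.
    allowed = {"inbound_track", "outbound_track", "both_tracks"}
    if track not in allowed:
        raise ValueError(f"track must be one of {sorted(allowed)}")

    response = ET.Element("Response")
    start = ET.SubElement(response, "Start")
    ET.SubElement(start, "Stream", {"url": ws_url, "track": track})

    # <Start><Stream> does not block, so Twilio goes on to execute this next verb.
    say = ET.SubElement(response, "Say")
    say.text = "This call is being streamed."
    return ET.tostring(response, encoding="unicode")


# Hypothetical WebSocket endpoint; replace it with your own server.
twiml = build_start_stream_twiml("wss://example.com/audio", track="both_tracks")
print(twiml)
```

Your webhook would return this string as the response body for the call. In practice, a Twilio helper library may be a more convenient way to generate the same TwiML.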
Use the following resources to build your unidirectional Media Streams application:
- <Stream> TwiML Reference
- Stream Resource API Reference
- WebSocket Messages: Learn about the format and contents of the WebSocket messages that Twilio sends to your server
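To give a sense of what a receiving application does with those messages, here is a minimal sketch of a handler that decodes incoming audio and groups it by track. The message shapes (`start`, `media`, and `stop` events, with a base64 `payload` and a `track` field) are assumptions based on the WebSocket Messages reference. The `StreamReceiver` class and the `MZ123` identifier are hypothetical. The sketch is independent of any WebSocket library: pass each text frame you receive to `handle`.

```python
import base64
import json
from collections import defaultdict


class StreamReceiver:
    """Collects decoded audio bytes per track from Media Streams messages."""

    def __init__(self) -> None:
        self.stream_sid = None
        self.audio = defaultdict(bytearray)  # track name -> raw audio bytes
        self.stopped = False

    def handle(self, raw_message: str) -> str:
        """Process one JSON text frame and return its event name."""
        message = json.loads(raw_message)
        event = message.get("event")
        if event == "start":
            # Assumed layout: stream metadata is nested under "start".
            self.stream_sid = message["start"]["streamSid"]
        elif event == "media":
            media = message["media"]
            self.audio[media["track"]].extend(base64.b64decode(media["payload"]))
        elif event == "stop":
            self.stopped = True
        return event


# Simulated frames with placeholder values, standing in for a live connection.
frames = [
    json.dumps({"event": "start", "start": {"streamSid": "MZ123"}}),
    json.dumps({"event": "media", "media": {
        "track": "inbound",
        "payload": base64.b64encode(b"\xff\xff").decode("ascii"),
    }}),
    json.dumps({"event": "stop"}),
]
receiver = StreamReceiver()
events = [receiver.handle(frame) for frame in frames]
print(events, len(receiver.audio["inbound"]))
```

Keeping a separate buffer per track means a stream configured for both tracks does not mix caller audio with the audio Twilio generates.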
In a bidirectional Media Stream, your WebSocket application receives audio from Twilio and can also send audio back to Twilio, which plays it on the Call. An example use case is a real-time voice conversation with an AI assistant.
With bidirectional Media Streams, you can receive only the inbound track.
DTMF is supported with bidirectional Media Streams only in the inbound direction, from Twilio toward your media server. Sending DTMF outbound from your media server toward Twilio is not supported.
To start a bidirectional Media Stream, use <Connect><Stream>. This TwiML instruction blocks subsequent TwiML instructions until the WebSocket connection is disconnected.
You can't use the Stream resource to start a bidirectional Media Stream.
You can stop a bidirectional Media Stream by closing the WebSocket connection from your server or by ending the Call.
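The following sketch shows one way to generate `<Connect><Stream>` TwiML with the Python standard library. The function name, the URL, the `<Say>` text, and the `caller_tier` parameter are hypothetical. The nested `<Parameter>` element for passing custom values is assumed from the <Stream> TwiML Reference.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_connect_stream_twiml(
    ws_url: str, parameters: Optional[Dict[str, str]] = None
) -> str:
    """Return TwiML that opens a bidirectional stream to ws_url."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": ws_url})

    # Optional custom parameters (assumed <Parameter name="..." value="..."/> syntax).
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})

    # <Connect><Stream> blocks, so this verb runs only after the WebSocket disconnects.
    say = ET.SubElement(response, "Say")
    say.text = "The assistant has disconnected. Goodbye."
    return ET.tostring(response, encoding="unicode")


# Hypothetical endpoint and parameter, shown for illustration only.
twiml = build_connect_stream_twiml(
    "wss://example.com/agent", parameters={"caller_tier": "gold"}
)
print(twiml)
```

Because the stream blocks the rest of the document, any verb placed after `<Connect>` acts as a fallback for when your server closes the connection.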
Check out the following resources to help you build your bidirectional Media Streams application:
- <Stream> TwiML Reference
- WebSocket Messages: The "Send WebSocket messages to Twilio" section covers how to send audio back to Twilio.
- Basic bidirectional stream sample application
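For orientation, the sketch below builds the JSON messages a server might send to Twilio on a bidirectional stream. The message shapes (`media`, `mark`, and `clear` events keyed by `streamSid`) and the audio format (base64-encoded mu-law at 8000 Hz with no file header) are assumptions drawn from the WebSocket Messages reference. The helper names and the `MZ123` identifier are hypothetical.

```python
import base64
import json


def media_message(stream_sid: str, mulaw_audio: bytes) -> str:
    """Wrap raw mu-law audio bytes in a 'media' message for playback on the call."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })


def mark_message(stream_sid: str, name: str) -> str:
    """Ask to be notified when the audio sent before this mark has finished playing."""
    return json.dumps({"event": "mark", "streamSid": stream_sid, "mark": {"name": name}})


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to discard audio that is buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


# 160 bytes of 0xFF is 20 ms of mu-law silence at 8000 samples per second.
silence = b"\xff" * 160
outgoing = [
    media_message("MZ123", silence),
    mark_message("MZ123", "greeting-done"),
    clear_message("MZ123"),
]
for frame in outgoing:
    print(frame)
```

Each string would be sent as a text frame on the same WebSocket connection that delivers the inbound audio, using the `streamSid` your server received when the stream started.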
For unidirectional Streams, you can stream up to four tracks at a time on a Call.
For bidirectional Streams, you can have only one Stream per Call.
Each Media Stream is associated with one WebSocket connection.
Your Media Streams application must be able to communicate with Twilio.
Configure your firewall rules to allow secure WebSocket connections (TCP port 443) from Twilio to your WebSocket servers from any public IP address.
You must also configure your application to validate the X-Twilio-Signature header. This is how your application verifies that a Media Stream is coming from an authentic Twilio source. Learn more on the General Usage - Security page.
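As an illustration of what signature validation involves, the sketch below recomputes a signature with HMAC-SHA1 and compares it to the header value in constant time. It assumes the scheme described in Twilio's security documentation: the request URL, followed by any POST parameters sorted by name, signed with your Auth Token and base64-encoded. Exactly which URL form Twilio signs for a WebSocket handshake is also an assumption here, so confirm it on the General Usage - Security page. The function names and sample values are hypothetical.

```python
import base64
import hashlib
import hmac
from typing import Dict, Optional


def expected_signature(
    auth_token: str, url: str, params: Optional[Dict[str, str]] = None
) -> str:
    """Compute the signature this sketch expects for a request to url."""
    data = url
    # POST parameters, if any, are appended as name + value in sorted name order.
    for key in sorted(params or {}):
        data += key + params[key]
    digest = hmac.new(
        auth_token.encode("utf-8"), data.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_signature(
    auth_token: str,
    url: str,
    header_value: str,
    params: Optional[Dict[str, str]] = None,
) -> bool:
    """Compare the X-Twilio-Signature header value against the expected signature."""
    return hmac.compare_digest(expected_signature(auth_token, url, params), header_value)


# Placeholder token and URL; a real check uses your Auth Token and the header you received.
token = "test_auth_token"
stream_url = "wss://example.com/audio"
header = expected_signature(token, stream_url)
print(is_valid_signature(token, stream_url, header))
print(is_valid_signature(token, stream_url + "/", header))
```

A production application would more likely rely on the request validator shipped with a Twilio helper library. The sketch is only meant to show why the URL your server sees must match the URL Twilio used.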
Media Streams can support the following use cases:
For self-service automation, you can use Media Streams to build intelligent, interactive voice responses that understand spoken intent dynamically. By streaming raw call audio to your server, you can process callers' voice input with custom speech recognition models and fulfill requests automatically.
To learn about more advanced features that you can use with self-service automation, see Voice self-service automation.
For inbound contact centers, you can use Media Streams to improve live call handling in support centers. Real-time media streaming lets backend software monitor caller audio, provide immediate context, or assist with automated verification steps before the call is routed to an agent.
To learn about more advanced features that you can use with inbound contact centers, see Voice inbound contact center.
For outbound contact centers, you can use Media Streams to improve outreach quality and tracking. Outbound calls can stream their live audio directly to your analytics infrastructure, where you can monitor script compliance or measure engagement programmatically.
To learn about more advanced features that you can use with outbound contact centers, see Voice outbound contact center.
For AI agents, you can use Media Streams to set up bidirectional conversations between human callers and conversational AI models. Real-time audio streaming feeds backend AI engines, which can immediately generate voice responses and stream them back to the caller over WebSockets.
To learn about more advanced features that you can use with AI agents, see Voice AI agents.
For AI and ML transcription, you can use Media Streams to capture the raw audio of your calls for immediate machine-driven evaluation. Streaming that audio to machine learning pipelines supports automated post-call summarization, live language translation, or continuous sentiment profiling.
To learn about more advanced features that you can use with AI or ML transcription, see Voice AI and ML transcription.
Explore the following guides to build on what you've learned here:
- Consume a real-time Media Stream using WebSockets, Python, and Flask: Set up a practical developer environment to analyze raw call audio with Python.
- Connect Twilio with your Dialogflow Agent: Integrate automated natural language processing into your active voice channels.
- Sample application GitHub repo: Explore the sample application that demonstrates how to implement Media Streams.