This documentation is for reference only. We are no longer onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2026.
We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared this migration guide to assist you in minimizing any service disruption.
Twilio recommends Compositions to mix all track recordings from a single Room, as it is a refined product that solves the various edge cases of mixing recordings with different start or offset times. Only use the approach described here if you have a business reason for not using Compositions. The transcoding tools described below are not Twilio products and as such we will not provide support for errors or problems that may arise when developers try to compose Recordings with the following procedure.
When you record a Twilio Video Group Room, you will most likely end up with multiple individual track recordings once the Room is completed. By default, when you record a Group Room, Twilio records all participants' audio and video tracks and creates an individual recording for each of those tracks. This provides the flexibility to compose the recordings into a single video output and gives you control over which particular tracks will be displayed in the final composition.
If you don't want to automatically record the audio and video tracks for all participants in a Group Room, you can use Recording Rules to specify the tracks you want to record. Also, you can configure your Group Room default settings in the Twilio console. Here you can set the rooms to default to group rooms and to turn recording on/off. (You can't record peer-to-peer rooms because the media does not go through Twilio servers.)
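For example, a Recording Rules request like the sketch below would record only audio tracks in a Room. The endpoint and rule format come from the Recording Rules REST API; treat this as an illustrative sketch and consult the Recording Rules documentation for the full rule grammar.

```bash
# Illustrative sketch: an include-rule that records only audio tracks.
# RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX is a placeholder Room SID.
curl -X POST 'https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/RecordingRules' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN \
  --data-urlencode 'Rules=[{"type": "include", "kind": "audio"}]'
```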
Twilio offers Compositions, a service that creates playable files from Recordings and automatically takes into account the Recordings' timing variations. However, if your use case does not fit the Compositions product, you can manually mix recordings into a single playable file.
If you choose not to use Compositions, there are several factors to consider when manually mixing the recordings into a single output. In particular, you will need to take into account each recording's `start_time` and `offset` values, as these might differ for each participant and cause synchronization issues when mixed.
In this tutorial, you'll learn how to synchronize two participants' track recordings, each with different `start_time` and `offset` values. The output will be a single video in either `webm` or `mp4` format, with the participants' videos side-by-side in a 2x1 grid.
To follow this tutorial, you will need:

- `ffmpeg`, for mixing Recordings into a single file. See the ffmpeg download page and the official documentation.
- `ffprobe`, to gather information about each Recording's start time. See the official documentation.

Note that if you want the output in `mp4` format, you will need to compile a version of `ffmpeg` that includes the `libfdk_aac` audio codec.
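If you are not sure whether your `ffmpeg` build already includes `libfdk_aac`, you can list the available encoders (a quick local check, not part of the Twilio procedure):

```bash
# Prints an encoder entry for libfdk_aac if your build supports it.
# No output means you need a build configured with --enable-libfdk-aac.
ffmpeg -hide_banner -encoders | grep fdk
```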
After you have recorded a Group Room, you might want to merge all of the recorded tracks into a single playable file so you can review the full contents of the Room. If you merge recorded tracks without considering their different `start_time` or `offset` values, the output will not be synchronized.
There are several reasons tracks from the same Room might have different `start_time` and `offset` values: participants can join the Room at different times, tracks can be published or unpublished at any point during the Room's lifetime, and a Room recovery can shift a participant's tracks relative to one another.

To make it possible to line the tracks back up, Twilio includes an `offset` in the recordings' metadata. The offset is the time in milliseconds elapsed between an arbitrary point in time, common to all Group Rooms, and the moment when the source Room of this track started.
The example for this tutorial will be a scenario in which you want to mix Recordings from a Group Room with two participants, Alice and Bob. In this scenario, Alice joins the Room about 20 seconds before Bob, and each participant publishes one audio and one video track. As a result, the four track recordings have different `start_time` and `offset` values.
Mixing both Alice's and Bob's tracks together without taking into account the different `start_time` and `offset` values will result in a media file with synchronization issues, where Alice's and Bob's tracks are not playing at the proper times.
The output file this tutorial produces will mix the two video and two audio tracks and ensure they are correctly synchronized. The video tracks will be placed side by side in a 2x1 grid layout, with a resolution of `1024x768`.
First, you will need to find the SID for each of the recordings you would like to mix. You can do this via the REST API. Below is the API call to retrieve the recording SIDs. (Note that you should pass the Room SID in the `GroupingSid` argument as an array with a single item.) You will need these recording SIDs in the next step.
Click "Show Sample Response" in the bottom left corner of the code samples below to see the JSON response that would be returned from making the API calls. In this example, you should retrieve the sid
for each of the recordings.
```javascript
// Download the helper library from https://www.twilio.com/docs/node/install
const twilio = require("twilio"); // Or, for ESM: import twilio from "twilio";

// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = twilio(accountSid, authToken);

async function listRecording() {
  const recordings = await client.video.v1.recordings.list({
    groupingSid: ["RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"],
    limit: 20,
  });

  // Log the SID of each track recording; you will need these in the next step.
  recordings.forEach((r) => console.log(r.sid));
}

listRecording();
```
1{2"recordings": [3{4"account_sid": "ACaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",5"status": "completed",6"date_created": "2015-07-30T20:00:00Z",7"date_updated": "2015-07-30T21:00:00Z",8"date_deleted": "2015-07-30T22:00:00Z",9"sid": "RTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",10"source_sid": "MTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",11"size": 23,12"type": "audio",13"duration": 10,14"container_format": "mka",15"codec": "OPUS",16"track_name": "A name",17"offset": 10,18"status_callback": "https://mycallbackurl.com",19"status_callback_method": "POST",20"grouping_sids": {21"room_sid": "RMaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",22"participant_sid": "PAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"23},24"media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",25"encryption_key": "public_key",26"url": "https://video.twilio.com/v1/Recordings/RTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",27"links": {28"media": "https://video.twilio.com/v1/Recordings/RTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/Media"29}30}31],32"meta": {33"page": 0,34"page_size": 50,35"first_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&PageSize=50&Page=0",36"previous_page_url": null,37"url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&PageSize=50&Page=0",38"next_page_url": null,39"key": "recordings"40}41}
The next step is to extract the `offset` value for each of the four recordings from its metadata. You can do this using the REST API. Keep track of these offsets, as you will need them in a later step.
```javascript
// Download the helper library from https://www.twilio.com/docs/node/install
const twilio = require("twilio"); // Or, for ESM: import twilio from "twilio";

// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = twilio(accountSid, authToken);

async function fetchRecording() {
  const recording = await client.video.v1
    .recordings("RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    .fetch();

  console.log(recording.offset);
}

fetchRecording();
```
1{2"account_sid": "ACaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",3"status": "processing",4"date_created": "2015-07-30T20:00:00Z",5"date_updated": "2015-07-30T21:00:00Z",6"date_deleted": "2015-07-30T22:00:00Z",7"sid": "RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",8"source_sid": "MTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",9"size": 0,10"url": "https://video.twilio.com/v1/Recordings/RTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",11"type": "audio",12"duration": 0,13"container_format": "mka",14"codec": "OPUS",15"track_name": "A name",16"offset": 10,17"status_callback": "https://mycallbackurl.com",18"status_callback_method": "POST",19"grouping_sids": {20"room_sid": "RMaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"21},22"media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",23"encryption_key": "public_key",24"links": {25"media": "https://video.twilio.com/v1/Recordings/RTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/Media"26}27}
Next, you should download each recording. You can do this via the REST API. The following curl command retrieves the URL that you can use to download the media content of a Recording.
```bash
curl 'https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```
You will get back a JSON response that contains a `redirect_to` URL, similar to the response below. Go to this URL to download the recording file.

```json
{"redirect_to": "https://com-twilio-us1-video-recording..."}
```
The audio files you download will be in `.mka` format and the video files will be in `.mkv` format.
At this point, you should have four recordings downloaded on your machine, as well as the offset values for each of these recordings.
This next step uses `ffprobe` to retrieve the `start_time` for each recording. You will need to perform this step on each of the four recordings.
Below is an example of how to get the `start_time` of Alice's audio track with the following `ffprobe` command:
```bash
ffprobe -show_entries format=start_time alice.mka
```
The output will look similar to the example below, and it will include the `start_time`:
```
Input #0, matroska,webm, from 'alice.mka':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:03:44.000000Z
  Duration: 00:13:09.36, start: 1.564000, bitrate: 48 kb/s
    Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : Audio
start_time=1.564000
```
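Since you need the `start_time` of all four recordings, a small loop saves some typing. A sketch, assuming this tutorial's filenames:

```bash
# Print each recording's container start_time (in seconds).
for f in alice.mka alice.mkv bob.mka bob.mkv; do
  printf '%s: ' "$f"
  ffprobe -v error -show_entries format=start_time \
    -of default=noprint_wrappers=1:nokey=1 "$f"
done
```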
After retrieving both the `offset` from the Recording's metadata and the `start_time` from running `ffprobe` on the Recording media, you can create a table like the one below. (The `creation_time` will also appear in the output of the `ffprobe` command above; we include it here only to demonstrate that it is not the correct value to use when mixing tracks. It is not needed in any of the following steps and will be removed from the table going forward.)
A table of each recording's offset and start time, along with its creation time.
| Track Name | offset (in ms) | start_time (in ms) | creation_time |
|---|---|---|---|
| alice.mka | 163481005731 | 1564 | 2017-06-30T09:03:44.000000Z |
| alice.mkv | 163481005731 | 1584 | 2017-06-30T09:03:44.000000Z |
| bob.mka | 163481005732 | 20789 | 2017-06-30T09:04:03.000000Z |
| bob.mkv | 163481005732 | 20814 | 2017-06-30T09:04:03.000000Z |
The `start_time` and `offset` for a participant's audio and video are not required to be the same; this can happen, for example, after a Room recovery. You can also see the approximately 20 seconds that Alice was in the Room before Bob reflected in the `start_time` values of each participant's recordings.
It is important to use `start_time` as the reference and not `creation_time`. The recording's `creation_time` is the time at which the user joined the call, whereas `start_time` refers to when the first sample of data was received for the recording. Additionally, `creation_time` does not have millisecond precision and could lead to synchronization issues.
Next, you will need to calculate the relative offset of each track so that the tracks will be synchronized. To calculate the relative offset:

1. Add the `offset` to the `start_time`. In the sample table below, we store this value in the `Addition` column.
2. Find the lowest `Addition` value of all tracks. In the sample, that's `alice.mka`, with an `Addition` value of 163481007295. Copy this value into every row of the `Reference Value` column, as you will need to reference it in the next step.
3. Subtract the `Reference Value` from the `Addition` value for each recording to create the `relative_offset` in milliseconds. You will need the `relative_offset` value when mixing the tracks together.
The following table shows the current values for our tracks, in which `alice.mka` provides the reference value of 163481007295.
A table of each recording's offset, start time, and calculated fields for determining the relative offset.
| Track Name | offset (in ms) | start_time (in ms) | Addition | Reference Value | relative_offset (in ms) |
|---|---|---|---|---|---|
| alice.mka | 163481005731 | 1564 | 163481007295 | 163481007295 | 0 |
| alice.mkv | 163481005731 | 1584 | 163481007315 | 163481007295 | 20 |
| bob.mka | 163481005732 | 20789 | 163481026521 | 163481007295 | 19226 |
| bob.mkv | 163481005732 | 20814 | 163481026546 | 163481007295 | 19251 |
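The arithmetic above is simple enough to script. Here is a minimal sketch (bash 4+ for associative arrays) that hard-codes the sample table's values and reproduces the `relative_offset` column:

```bash
#!/usr/bin/env bash
# relative_offset = (offset + start_time) - min(offset + start_time), all in ms.
declare -A offset_ms=( [alice.mka]=163481005731 [alice.mkv]=163481005731
                       [bob.mka]=163481005732   [bob.mkv]=163481005732 )
declare -A start_ms=(  [alice.mka]=1564  [alice.mkv]=1584
                       [bob.mka]=20789   [bob.mkv]=20814 )
declare -A addition_ms

ref=""
for t in "${!offset_ms[@]}"; do
  addition_ms[$t]=$(( ${offset_ms[$t]} + ${start_ms[$t]} ))
  if [[ -z "$ref" || ${addition_ms[$t]} -lt $ref ]]; then
    ref=${addition_ms[$t]}
  fi
done

for t in alice.mka alice.mkv bob.mka bob.mkv; do
  echo "$t: relative_offset=$(( ${addition_ms[$t]} - ref )) ms"
done
```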
The final step is to mix all the tracks into a single file. The command will:

- Scale each video to half the output width and pad it to a full 512x768 cell.
- Prepend black frames covering each video track's relative offset, so the tracks stay in sync.
- Stack the two videos side by side into the 2x1 grid.
- Delay each audio track by its relative offset and mix the audio into a single stream.
Below is the complete command to obtain the mixed file in `webm` format with a 1024x768 (width x height) resolution. It's a long command! You can see an explanation of each section below.
```bash
ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
-filter_complex " \
[0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0], \
color=black:size=512x768:duration=0.020[b0], \
[b0][vs0]concat[r0c0]; \
[1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1], \
color=black:size=512x768:duration=19.251[b1], \
[b1][vs1]concat[r0c1]; \
[r0c0][r0c1]hstack=inputs=2[video]; \
[2]aresample=async=1[a0]; \
[3]aresample=async=1,adelay=19226.0|19226.0[a1]; \
[a0][a1]amix=inputs=2[audio]" \
-map '[video]' \
-map '[audio]' \
-acodec libopus \
-vcodec libvpx \
output.webm
```
```
ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
```
In the first line of this command, you specify the input files, which are the four recordings.
The following section breaks down each line of the filter operation.
```
-filter_complex <script>
```
a. This performs the filter operation specified in the following string.
"[0]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs0]
b. This section selects the first input file (here, Alice's video) and scales it to half the width of the desired resolution (512) while maintaining the original aspect ratio. Additionally, it pads the scaled video (pad
) and tags it [vs0]
.
```
color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b0],\
```
c. The next step generates black frames for the duration of the track's `relative_offset` (which you calculated earlier), converted to seconds. This delays the track to keep it in sync with the other recordings. For Alice's video, the 20 ms `relative_offset` becomes `duration=0.020`.
```
[b0][vs0]concat[r0c0];\
```
d. This step concatenates the black stream `[b0]` with the padded video stream `[vs0]` (black frames first, then video) and tags the result `[r0c0]`.
```
[1]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs1],\
```
e. This step is the same as step b, repeated for the second input file (Bob's video). The output of this line is tagged `[vs1]`.
```
color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b1],
```
f. This step is the same as step c, except the `duration` is set to the `relative_offset`, in seconds, that you calculated for the second participant's video recording. In this example, that's 19.251. The output of this line is tagged `[b1]`.
```
[b1][vs1]concat[r0c1]
```
g. This is the same as step d. It concatenates the black stream `[b1]` with the padded stream `[vs1]` and tags the result `[r0c1]`.
```
[r0c0][r0c1]hstack=inputs=2[video]
```
h. This line performs the horizontal video stacking that creates the 2x1 grid. In this example, there are two video tracks, so `hstack` takes `inputs=2`; the stacked output is tagged `[video]`.
```
[2]aresample=async=1[a0];\
```
i. This line resamples the first audio input track (Alice's audio, which was the input at index `[2]` in the input list). `aresample` fills and trims the audio track if needed (see the resampler docs for more information). The resampled audio is tagged `[a0]`.
```
[3]aresample=async=1,adelay=19226.0|19226.0[a1];\
```
j. This line similarly resamples the second audio input track, which in this example is Bob's audio. Here, the relative offset was 19226 ms; `adelay` applies that delay, in milliseconds, to both the left and right channels. The resampled and delayed audio is tagged `[a1]`.
```
[a0][a1]amix=inputs=2[audio]" \
```
k. This performs the audio mixing. In this sample case, there are two audio tracks, so `amix` takes `inputs=2`; the mixed output is tagged `[audio]`. This is the final line of the filter script.
Below are the remaining options used to produce the output:
```
-map '[video]'
```
a. This selects the stream tagged `[video]` to be used in the output.
```
-map '[audio]'
```
b. This selects the stream tagged `[audio]` to be used in the output.
```
-acodec libopus
```
c. The audio codec to use. For `mp4`, use `libfdk_aac`. (See the note in the requirements above about compiling a version of `ffmpeg` with `libfdk_aac` if you want to create an `mp4` output file.)
```
-vcodec libvpx
```
d. The video codec to use. For `mp4`, use `libx264`.
```
output.webm
```
e. The output file name.
The following command produces an output file in `mp4` format. It follows the same structure as the `webm` command above, with a few alterations:

- The audio codec is `libfdk_aac` and the video codec is `libx264`.
- There is a `-vsync 2` line immediately following the `-map '[audio]'` line. This option works with the `libx264` video encoder.
- The output file name is `output.mp4`.
```bash
ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
-filter_complex "\
[0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],\
color=black:size=512x768:duration=0.020[b0],\
[b0][vs0]concat[r0c0];\
[1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],\
color=black:size=512x768:duration=19.251[b1],\
[b1][vs1]concat[r0c1];\
[r0c0][r0c1]hstack=inputs=2[video];\
[2]aresample=async=1[a0];\
[3]aresample=async=1,adelay=19226.0|19226.0[a1];\
[a0][a1]amix=inputs=2[audio]" \
-map '[video]' \
-map '[audio]' \
-vsync 2 \
-acodec libfdk_aac \
-vcodec libx264 \
output.mp4
```
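As a quick sanity check (our suggestion, not part of the Twilio procedure), the mixed file's duration should be roughly the largest relative offset plus that track's own duration:

```bash
# Print the mixed file's duration in seconds.
ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 output.mp4
```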
There are many situations where developers want to know the start, end, or duration of a track. For example, if you would like to concatenate black frames after a video track ends, you would need to know the `start` and `end` of the media track. You can leverage `ffprobe` to find these values.
The examples below demonstrate how to use `ffprobe` to find the start time, end time, and duration of a track, using the example video track `alice.mkv`.
```bash
ffprobe -i alice.mkv -show_frames 2>/dev/null | head -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'
```
This command outputs the start time of `alice.mkv`, which is 1564 ms.
```bash
ffprobe -i alice.mkv -show_frames 2>/dev/null | tail -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'
```
This command outputs the end time of `alice.mkv`, which is 142242 ms.
The duration of the track is the difference between the `end_time` (142242 ms) and the `start_time` (1564 ms), which results in a duration of 140678 ms.
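If you do this often, the two commands above can be combined into one small script. A sketch (it probes the file twice, which is slow on long recordings, but keeps the logic obvious):

```bash
#!/usr/bin/env bash
# Derive the start, end, and duration (ms) of a track from its frame timestamps.
f=alice.mkv
start=$(ffprobe -i "$f" -show_frames 2>/dev/null | grep -w pkt_dts | grep -Eo '[0-9]+' | head -n 1)
end=$(ffprobe -i "$f" -show_frames 2>/dev/null | grep -w pkt_dts | grep -Eo '[0-9]+' | tail -n 1)
echo "start=${start} ms  end=${end} ms  duration=$(( end - start )) ms"
```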