AudioFrame carries a single packet of voice — typically 20 ms of µ-law or
PCM. It rides on the same envelope as every other RPC, but on uni-streams
to keep latency low and avoid pairing every send with an ack.
AudioFrame schema
| Field | Type | Notes |
|---|---|---|
| call_sid | string | Identifies the call this packet belongs to. |
| payload | string | Raw codec bytes (length-prefixed). Treated as opaque. |
| codec | string | "PCMU" (G.711 µ-law), "PCMA" (G.711 A-law), or "PCM16". |
| sequence_number | uint64 | Monotonic per (call_sid, direction). Used to detect loss. |
| end_of_stream | bool | Final frame; the gateway will close the audio uni-stream. |
method_id for an audio frame is always 2991054320 (0xb241_b9b0).
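
For client code, the schema can be mirrored as a small value type. The sketch below is illustrative only: the class name, the constant names, and the validation rules are assumptions, and `payload` is typed as `bytes` for convenience even though the table lists it as a string. It does not implement the wire encoding.

```python
from dataclasses import dataclass

# Decimal method_id for audio frames, as stated in the text.
AUDIO_FRAME_METHOD_ID = 2991054320

# Codec names listed in the schema table.
SUPPORTED_CODECS = ("PCMU", "PCMA", "PCM16")


@dataclass(frozen=True)
class AudioFrame:
    """In-memory mirror of the AudioFrame fields (hypothetical helper type)."""

    call_sid: str
    payload: bytes  # raw codec bytes, treated as opaque
    codec: str
    sequence_number: int  # uint64 on the wire
    end_of_stream: bool = False

    def __post_init__(self) -> None:
        if self.codec not in SUPPORTED_CODECS:
            raise ValueError(f"unsupported codec: {self.codec!r}")
        if not 0 <= self.sequence_number < 2**64:
            raise ValueError("sequence_number must fit in a uint64")
```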
Frame layout on the wire
Outbound (your mic → trunk)
Open exactly one client-initiated unidirectional stream per call and write framed AudioFrames back-to-back. Don’t open a stream per packet:
that exhausts the gateway’s flow-control budget within seconds.
After sending the frame with end_of_stream = true, close the stream. The gateway will not accept
more frames on it.
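
The outbound lifecycle (one stream per call, frames written back-to-back, close after the final frame) might be sketched as follows. Everything named here is hypothetical: `UniStream` stands in for whatever your transport library returns for a unidirectional stream, `encode_frame` stands in for the envelope encoder, and starting `sequence_number` at 0 is an assumption. Pacing is left out to keep the lifecycle visible.

```python
from typing import Callable, Iterable, Protocol


class UniStream(Protocol):
    """Minimal interface assumed for a client-initiated uni-stream."""

    def write(self, data: bytes) -> None: ...
    def close(self) -> None: ...


def send_call_audio(
    stream: UniStream,
    payloads: Iterable[bytes],
    encode_frame: Callable[[int, bytes, bool], bytes],
) -> int:
    """Write all payloads for one call onto a single stream, then close it.

    encode_frame(sequence_number, payload, end_of_stream) is a hypothetical
    encoder that returns one framed AudioFrame. Returns the frame count.
    """
    it = iter(payloads)
    try:
        current = next(it)
    except StopIteration:
        stream.close()  # nothing to send; still release the stream
        return 0

    seq = 0  # assumed starting value
    for upcoming in it:
        # Another payload follows, so this one is not the final frame.
        stream.write(encode_frame(seq, current, False))
        seq += 1
        current = upcoming

    # Last payload: mark end_of_stream, then close the stream.
    stream.write(encode_frame(seq, current, True))
    stream.close()
    return seq + 1
```

The one-item lookahead exists only so the final frame can carry `end_of_stream = true` without knowing the payload count in advance.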
Pacing
For µ-law @ 8 kHz with a 20 ms ptime, payload = 160 bytes. Send one frame every 20 ms (50 fps). Faster pacing is buffered by the trunk and arrives late on the far end; slower pacing causes audible gaps.
Inbound (trunk → your speaker)
The gateway opens server-initiated uni-streams. After your EventStreamRequest
subscription, every uni-stream is multiplexed: each frame’s method_id
determines whether it’s audio (2991054320) or a CallEvent (959835745).
A typical demuxer:
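
One way to sketch such a demuxer in Python, assuming the envelope has already been parsed into `(method_id, body)` pairs. The envelope parser is not shown, and the function and callback names are hypothetical; only the two method_id values come from the text.

```python
from typing import Callable, Iterable, Optional, Tuple

AUDIO_FRAME_METHOD_ID = 2991054320
CALL_EVENT_METHOD_ID = 959835745


def demux(
    frames: Iterable[Tuple[int, bytes]],
    on_audio: Callable[[bytes], None],
    on_event: Callable[[bytes], None],
    on_unknown: Optional[Callable[[int, bytes], None]] = None,
) -> None:
    """Route each parsed frame to a handler based on its method_id."""
    for method_id, body in frames:
        if method_id == AUDIO_FRAME_METHOD_ID:
            on_audio(body)
        elif method_id == CALL_EVENT_METHOD_ID:
            on_event(body)
        elif on_unknown is not None:
            # Unrecognised method_id: hand it off instead of failing.
            on_unknown(method_id, body)
```

Keeping the audio handler cheap matters here, since any work done in `on_audio` delays the frames queued behind it on the same stream.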
Codec choices
| Codec | Bitrate | Use when |
|---|---|---|
| PCMU | 64 kbit/s | Talking to PSTN/SIP. The default for trunks. |
| PCMA | 64 kbit/s | EU/PSTN. Same wire shape, different table. |
| PCM16 | 256 kbit/s | Sending TTS or studio-quality content into the gateway. The gateway re-encodes to the trunk’s codec. |
Set OriginateRequest.default_app_args accordingly.
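
The payload size per frame follows directly from the bitrates in the table: bytes = bitrate × ptime / 8. The short sketch below works that out for a given ptime; the function and dictionary names are illustrative, and the arithmetic assumes the listed bitrates are constant.

```python
# Bitrates from the codec table, in bits per second.
CODEC_BITRATE_BPS = {"PCMU": 64_000, "PCMA": 64_000, "PCM16": 256_000}


def payload_bytes(codec: str, ptime_ms: int = 20) -> int:
    """Payload size in bytes for one frame of the given duration."""
    bits_times_ms = CODEC_BITRATE_BPS[codec] * ptime_ms
    # Divide by 1000 to convert ms to seconds, and by 8 to convert bits to bytes.
    return bits_times_ms // (1000 * 8)


def frames_per_second(ptime_ms: int = 20) -> int:
    """Number of frames to send each second at the given ptime."""
    return 1000 // ptime_ms
```

With the default 20 ms ptime this gives 160 bytes for PCMU and PCMA, 640 bytes for PCM16, and 50 frames per second.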
Loss handling
Use sequence_number to detect dropped packets. The gateway does not
retransmit audio (retransmission would defeat the low-latency goal). For PSTN calls, any missing
sequence number on egress is heard as a 20 ms silence — fine for voice,
catastrophic for DTMF, so use INFO-method DTMF rather than in-band tones
when possible.
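
Gap detection on the receive side can be sketched as below, with one tracker per (call_sid, direction). The class name and its policy of ignoring late or duplicate frames are assumptions, not gateway behaviour; the 20 ms default matches the ptime used in the Pacing section.

```python
from typing import Optional


class LossTracker:
    """Counts frames skipped in a stream of received sequence numbers."""

    def __init__(self, ptime_ms: int = 20) -> None:
        self.ptime_ms = ptime_ms
        self.expected: Optional[int] = None  # next sequence number we expect
        self.lost = 0  # total frames never seen

    def observe(self, seq: int) -> int:
        """Record a received frame; return how many frames were skipped before it."""
        if self.expected is None:
            gap = 0  # first frame: nothing to compare against
        elif seq < self.expected:
            return 0  # late or duplicate frame; ignored in this sketch
        else:
            gap = seq - self.expected
        self.lost += gap
        self.expected = seq + 1
        return gap

    @property
    def lost_ms(self) -> int:
        """Audio missing so far, assuming every frame carries ptime_ms of audio."""
        return self.lost * self.ptime_ms
```

A non-zero return from `observe` is the point where a player would insert silence or concealment for the missing frames.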