AudioFrame carries a single packet of voice, typically 20 ms of µ-law or PCM. It uses the same envelope as every other RPC but travels on unidirectional streams (uni-streams), which keeps latency low and avoids pairing every send with an ack.

AudioFrame schema

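+-----------------+--------+-----------------------------------------------------------+
| Field           | Type   | Notes                                                     |
+-----------------+--------+-----------------------------------------------------------+
| call_sid        | string | Identifies the call this packet belongs to.               |
| payload         | string | Raw codec bytes (length-prefixed). Treated as opaque.     |
| codec           | string | "PCMU" (G.711 µ-law), "PCMA" (G.711 A-law), or "PCM16".   |
| sequence_number | uint64 | Monotonic per (call_sid, direction). Used to detect loss. |
| end_of_stream   | bool   | Final frame; the gateway will close the audio uni-stream. |
+-----------------+--------+-----------------------------------------------------------+
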
The method_id for an audio frame is always 2991054320 (0xb247_ddf0).

Frame layout on the wire

+----------------+----------------+----------------------+
| u32 length LE  | method_id      | AudioFrame envelope  |
+----------------+----------------+----------------------+
This is the standard envelope format — audio is not special.
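
As an illustration of the layout above, here is a minimal Python sketch that frames and unframes already-serialized envelopes. Two details are assumptions, because the diagram does not state them: that method_id is a 4-byte little-endian integer like the length field, and that the length counts every byte after the length field (method_id plus envelope). The names encode_frame and decode_frames are hypothetical helpers, not part of any SDK.

```python
import struct

AUDIO_FRAME_METHOD_ID = 2991054320  # AudioFrame method_id from the schema section


def encode_frame(method_id: int, envelope: bytes) -> bytes:
    """Prefix a serialized envelope with the length and method_id fields.

    ASSUMPTION: method_id is a u32 LE, and length covers method_id + envelope.
    """
    body = struct.pack("<I", method_id) + envelope
    return struct.pack("<I", len(body)) + body


def decode_frames(buffer: bytes):
    """Split a byte buffer into complete frames.

    Returns (frames, leftover): frames is a list of (method_id, envelope)
    pairs, leftover holds the bytes of a trailing incomplete frame.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= 4:
        (length,) = struct.unpack_from("<I", buffer, offset)
        if length < 4:
            raise ValueError("frame too short to hold a method_id")
        if len(buffer) - offset - 4 < length:
            break  # incomplete frame: wait for more bytes from the stream
        (method_id,) = struct.unpack_from("<I", buffer, offset + 4)
        envelope = buffer[offset + 8 : offset + 4 + length]
        frames.append((method_id, envelope))
        offset += 4 + length
    return frames, buffer[offset:]
```

Returning the leftover bytes lets a caller append the next read from the stream and call decode_frames again, so a frame split across two reads is still recovered.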

Outbound (your mic → trunk)

Open exactly one client-initiated unidirectional stream per call and write framed AudioFrames back-to-back. Don’t open a stream per packet — that overwhelms the gateway’s flow-control budget within seconds.
client uni-stream  ─────[frame][frame][frame] … [frame eos=true]─────▶
After end_of_stream = true, close the stream. The gateway will not accept more frames on it.

Pacing

For µ-law at 8 kHz with a 20 ms ptime, each payload is 160 bytes (8000 samples/s × 0.020 s × 1 byte per sample). Send one frame every 20 ms (50 frames per second). Faster pacing is buffered by the trunk and arrives late on the far end; slower pacing causes audible gaps.
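
The pacing rule can be sketched as a loop that schedules each send against an absolute deadline, so small sleep overruns do not accumulate into drift. This is a synchronous sketch with hypothetical names (payload_bytes, paced_send); the send callable stands in for whatever writes a framed AudioFrame to your uni-stream. An asyncio client would await asyncio.sleep instead of calling time.sleep.

```python
import time


def payload_bytes(sample_rate_hz: int, ptime_ms: int, bytes_per_sample: int) -> int:
    """Bytes of audio in one frame, e.g. 8000 Hz * 20 ms * 1 byte for µ-law."""
    return sample_rate_hz * ptime_ms * bytes_per_sample // 1000


def paced_send(payloads, send, ptime_ms=20, clock=time.monotonic, sleep=time.sleep):
    """Call send(payload) once per ptime, timed from a fixed start instant.

    Frame i is due at start + i * ptime. Computing each deadline from the
    start time (rather than sleeping a fixed 20 ms after each send) keeps
    the long-run rate at exactly one frame per ptime.
    """
    start = clock()
    for i, payload in enumerate(payloads):
        delay = start + i * ptime_ms / 1000 - clock()
        if delay > 0:
            sleep(delay)
        send(payload)
```

The clock and sleep parameters exist only so the loop can be exercised with a fake clock in tests; production code can leave them at their defaults.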

Inbound (trunk → your speaker)

The gateway opens server-initiated uni-streams. After your EventStreamRequest subscription, every uni-stream is multiplexed: each frame’s method_id determines whether it’s audio (2991054320) or a CallEvent (959835745). A typical demuxer:
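AUDIO_FRAME = 2991054320   # AudioFrame method_id
CALL_EVENT = 959835745     # CallEvent method_id

# read_frame, parse_audio_frame, parse_call_event, speaker, and on_event
# stand in for your own transport, decoding, and playback code.
async def demux(incoming_uni_streams):
    async for frame in incoming_uni_streams:
        method_id, body = read_frame(frame)
        if method_id == AUDIO_FRAME:
            af = parse_audio_frame(body)
            speaker.play(af.payload)     # payload is the raw codec bytes
        elif method_id == CALL_EVENT:
            ev = parse_call_event(body)
            on_event(ev)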

Codec choices

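+-------+------------+---------------------------------------------------------+
| Codec | Bitrate    | Use when                                                |
+-------+------------+---------------------------------------------------------+
| PCMU  | 64 kbit/s  | Talking to PSTN/SIP. The default for trunks.            |
| PCMA  | 64 kbit/s  | EU/PSTN. Same wire shape, different table.              |
| PCM16 | 256 kbit/s | Sending TTS or studio-quality content into the gateway. |
|       |            | The gateway re-encodes to the trunk’s codec.            |
+-------+------------+---------------------------------------------------------+
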
The gateway transcodes for you on ingress; on egress it sends whatever the far-end negotiated. If you need a specific egress codec, set OriginateRequest.default_app_args accordingly.
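
To size buffers for each codec, the payload length per frame follows directly from the bitrates in the table. The sketch below is plain arithmetic with hypothetical names (CODEC_BITRATE_KBPS, frame_payload_bytes); it assumes constant-bitrate audio and the 20 ms ptime used elsewhere on this page. Note that 256 kbit/s for PCM16 would correspond to 16-bit mono samples at 16 kHz, but this page does not state the PCM16 sample rate, so treat that as an inference.

```python
# Bitrates copied from the codec table, in kbit/s.
CODEC_BITRATE_KBPS = {"PCMU": 64, "PCMA": 64, "PCM16": 256}


def frame_payload_bytes(codec: str, ptime_ms: int = 20) -> int:
    """Payload bytes per frame for a constant-bitrate codec.

    kbit/s * ms gives bits (the factors of 1000 cancel); divide by 8 for bytes.
    """
    bits = CODEC_BITRATE_KBPS[codec] * ptime_ms
    return bits // 8


for name in CODEC_BITRATE_KBPS:
    print(name, frame_payload_bytes(name), "bytes per 20 ms frame")
```

With a 20 ms ptime this gives 160 bytes for PCMU and PCMA, and 640 bytes for PCM16.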

Loss handling

Use sequence_number to detect dropped packets. The gateway does not retransmit audio, because a retransmitted frame would arrive too late to be played. On PSTN calls, each missing sequence number on egress is heard as one frame of silence (20 ms at the usual ptime). That is tolerable for voice but catastrophic for DTMF, so prefer DTMF sent via the SIP INFO method over in-band tones when possible.
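
A gap detector keyed on (call_sid, direction) might look like the sketch below. It assumes sequence numbers increase by exactly 1 per frame; the schema only says "monotonic", so verify that against real traffic before relying on the counts. LossDetector is a hypothetical helper, not a gateway API.

```python
class LossDetector:
    """Counts missing frames per (call_sid, direction).

    ASSUMPTION: consecutive frames carry consecutive sequence numbers.
    """

    def __init__(self):
        self._next = {}  # (call_sid, direction) -> next expected sequence_number

    def observe(self, call_sid: str, direction: str, sequence_number: int) -> int:
        """Record a received frame and return how many frames were skipped before it."""
        key = (call_sid, direction)
        expected = self._next.get(key)
        if expected is None:
            missing = 0  # first frame seen for this key: nothing to compare against
        elif sequence_number < expected:
            return 0  # late or duplicate frame: do not move the expectation back
        else:
            missing = sequence_number - expected
        self._next[key] = sequence_number + 1
        return missing


detector = LossDetector()
for seq in (10, 11, 14):
    lost = detector.observe("CA-example", "inbound", seq)
    if lost:
        print(f"lost {lost} frame(s) before sequence_number {seq}")
```

A non-zero return value is the number of frames to conceal, for example by playing that many frames of silence or comfort noise.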