WebSocket Protocol

Connection Flow

After obtaining a session token, connect to the WebSocket endpoint and follow this sequence:

Client                                    Server
  |                                          |
  |  1. Open WebSocket connection            |
  |---------------------------------------->>|
  |                                          |
  |  2. Send: { "token": "<jwt>" }           |
  |---------------------------------------->>|
  |                                          |
  |  3. Receive: { "type": "connected" }     |
  |<<----------------------------------------|
  |                                          |
  |  4. Start sending mic audio (PCM)        |
  |========================================>>|
  |                                          |
  |  5. Receive: { "type": "agent_ready" }   |
  |<<----------------------------------------|
  |                                          |
  |  6. Bidirectional PCM audio              |
  |<<======================================>>|
  |                                          |
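
The ordering of steps 2 to 5 can be sketched in Python as follows. This is a minimal illustration, not a reference client: the `Transport` interface, `wait_for`, and `handshake` are hypothetical names, and a real client would stream live microphone frames at step 4 rather than a single frame of silence.

```python
import json
from typing import Protocol, Union

Message = Union[str, bytes]


class Transport(Protocol):
    """Hypothetical blocking WebSocket wrapper (an assumption, not a documented API)."""

    def send(self, data: Message) -> None: ...
    def recv(self) -> Message: ...


SILENCE_FRAME = bytes(640)  # one 20 ms frame of s16le silence


def wait_for(transport: Transport, expected_type: str) -> dict:
    """Read messages until a JSON control message of the expected type arrives."""
    while True:
        msg = transport.recv()
        if isinstance(msg, bytes):
            continue  # binary messages are agent audio, not control messages
        event = json.loads(msg)
        if event.get("type") == "error":
            raise RuntimeError(f'{event.get("code")}: {event.get("message")}')
        if event.get("type") == expected_type:
            return event


def handshake(transport: Transport, token: str) -> None:
    transport.send(json.dumps({"token": token}))  # step 2: authenticate
    wait_for(transport, "connected")              # step 3: auth accepted
    transport.send(SILENCE_FRAME)                 # step 4: begin sending audio
    wait_for(transport, "agent_ready")            # step 5: agent is listening
```

After `handshake` returns, the connection is in the full-duplex phase (step 6) and both sides exchange binary PCM frames.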

Authentication

The first message you send over the WebSocket must be a JSON text message containing your session token in a `token` field. If the server does not receive this message within 10 seconds, it closes the connection.

{ "token": "<your_session_token>" }
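
A small sketch of building the authentication message and tracking the 10-second window on the client side. The helper names (`build_auth_message`, `auth_time_left`) are illustrative, and the assumption that the window starts when the connection opens is ours; the server's exact timing is not specified here.

```python
import json
import time
from typing import Optional

AUTH_TIMEOUT_S = 10.0  # the authentication window described above


def build_auth_message(token: str) -> str:
    """Serialize the first message to send as a WebSocket text message."""
    if not token:
        raise ValueError("session token must be non-empty")
    return json.dumps({"token": token})


def auth_time_left(opened_at: float, now: Optional[float] = None) -> float:
    """Seconds left before the server may close an unauthenticated connection.

    Assumes the window starts at `opened_at`, a time.monotonic() reading taken
    when the socket opened.
    """
    if now is None:
        now = time.monotonic()
    return max(0.0, AUTH_TIMEOUT_S - (now - opened_at))
```

Sending the token immediately in the socket's open handler keeps the client well inside the window, so the countdown helper is mainly useful for logging or tests.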

Server Messages

The server sends two types of messages: text (JSON control messages) and binary (PCM audio frames from the AI agent).

Message Type	Format	Description
`connected`	JSON text	Authentication succeeded. You can start sending microphone audio.
`agent_ready`	JSON text	The AI agent is listening. Full-duplex voice conversation is active.
`session_ended`	JSON text	Session has ended. Includes a `reason` field (`agent_disconnected`, `room_closed`, or `server_shutdown`). Clean up resources.
`error`	JSON text	An error occurred. Includes `code` and `message` fields.
(binary)	Raw bytes	PCM audio frame from the AI agent. Play it back to the user.

Message Formats

// Connected
{ "type": "connected" }

// Agent Ready
{ "type": "agent_ready" }

// Session Ended ("reason" is one of "agent_disconnected", "room_closed", or "server_shutdown")
{ "type": "session_ended", "reason": "agent_disconnected" }

// Error
{ "type": "error", "code": "AUTH_FAILED", "message": "Invalid or expired token" }
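
One way to separate the two message kinds on the client is to branch on whether the payload is text or binary. The sketch below uses illustrative types (`ControlMessage`, `AudioFrame`) and a hypothetical `parse_server_message` helper; only the field names `type`, `reason`, `code`, and `message` come from the formats above.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ControlMessage:
    type: str                      # "connected", "agent_ready", "session_ended", "error"
    reason: Optional[str] = None   # present on "session_ended"
    code: Optional[str] = None     # present on "error"
    message: Optional[str] = None  # present on "error"


@dataclass
class AudioFrame:
    pcm: bytes  # raw PCM audio from the agent


def parse_server_message(raw: Union[str, bytes]) -> Union[ControlMessage, AudioFrame]:
    """Binary payloads are audio; text payloads are JSON control messages."""
    if isinstance(raw, (bytes, bytearray)):
        return AudioFrame(pcm=bytes(raw))
    data = json.loads(raw)
    return ControlMessage(
        type=data["type"],
        reason=data.get("reason"),
        code=data.get("code"),
        message=data.get("message"),
    )
```

A receive loop can then hand `AudioFrame` values to the playback buffer and route `ControlMessage` values by their `type`.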

Audio Format

Audio is streamed as raw PCM in both directions, using the same format for microphone input and agent output.

Property	Value
Encoding	PCM signed 16-bit little-endian (s16le)
Sample Rate	16,000 Hz
Channels	1 (mono)
Frame Duration	20 ms
Samples per Frame	320
Bytes per Frame	640

Send microphone audio as binary WebSocket messages. Each message should contain exactly one 20 ms frame: 16,000 samples/s × 0.020 s = 320 samples, at 2 bytes per sample, which is 640 bytes. Agent audio arrives in the same format.
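
Capture APIs rarely deliver buffers that are an exact multiple of 640 bytes, so a client typically re-chunks them. The following sketch derives the frame size from the table values and splits a capture buffer into complete frames; `split_into_frames` is an illustrative helper, and carrying the leftover bytes into the next buffer is one possible design, not a documented requirement.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20
BYTES_PER_SAMPLE = 2  # signed 16-bit
CHANNELS = 1          # mono

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000               # 320
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE * CHANNELS   # 640


def split_into_frames(pcm: bytes) -> Tuple[List[bytes], bytes]:
    """Return (complete frames, leftover bytes).

    The leftover is shorter than one frame and should be prepended to the
    next capture buffer before splitting again.
    """
    usable = len(pcm) - len(pcm) % BYTES_PER_FRAME
    frames = [pcm[i:i + BYTES_PER_FRAME] for i in range(0, usable, BYTES_PER_FRAME)]
    return frames, pcm[usable:]
```

Each element of `frames` can be sent as its own binary WebSocket message.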

Session Lifecycle

idle --> connecting --> connected --> agent_ready --> ended
            |                              |
            +---------- error <------------+

State	Description
`idle`	No active session. Ready to connect.
`connecting`	Fetching token and opening WebSocket.
`connected`	WebSocket authenticated. You can start sending audio.
`agent_ready`	AI agent is listening. Full-duplex conversation active.
`ended`	Session terminated. Clean up resources.
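
The lifecycle can be modeled as a small state machine. The sketch below encodes only the transitions drawn in the diagram above; the `State` enum, the `ALLOWED` table, and `transition` are illustrative names. Treating `ended` and `error` as terminal is an assumption, and a real client may need extra transitions (for example, from `connected` directly to `ended` or `error`).

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    CONNECTED = "connected"
    AGENT_READY = "agent_ready"
    ENDED = "ended"
    ERROR = "error"


# Transitions taken literally from the lifecycle diagram.
ALLOWED = {
    State.IDLE: {State.CONNECTING},
    State.CONNECTING: {State.CONNECTED, State.ERROR},
    State.CONNECTED: {State.AGENT_READY},
    State.AGENT_READY: {State.ENDED, State.ERROR},
    State.ENDED: set(),   # assumed terminal
    State.ERROR: set(),   # assumed terminal
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the diagram does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting unexpected transitions early makes protocol bugs, such as sending audio before `connected`, easier to spot during development.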

Next Steps

Now that you understand the protocol, head to the Web Guide or Flutter Guide for a complete implementation walkthrough.
