Connection Flow

After obtaining a session token, connect to the WebSocket endpoint and follow this sequence:
Client                                  Server
  |                                          |
  |  1. Open WebSocket connection            |
  |---------------------------------------->>|
  |                                          |
  |  2. Send: { "token": "<jwt>" }           |
  |---------------------------------------->>|
  |                                          |
  |  3. Receive: { "type": "connected" }     |
  |<<----------------------------------------|
  |                                          |
  |  4. Start sending mic audio (PCM)        |
  |========================================>>|
  |                                          |
  |  5. Receive: { "type": "agent_ready" }   |
  |<<----------------------------------------|
  |                                          |
  |  6. Bidirectional PCM audio              |
  |<<======================================>>|
  |                                          |
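
The sequence above can be sketched in Python as follows. This is a minimal illustration, not a reference client: `Transport`, `expect_control`, and `handshake` are hypothetical names, and `Transport` stands in for whatever WebSocket library you use (it only assumes awaitable `send` and `recv` methods). Treating `error` and `session_ended` as handshake failures is also an assumption.

```python
import asyncio
import json
from typing import Protocol, Union


class Transport(Protocol):
    """Hypothetical minimal WebSocket interface (not a real library API)."""

    async def send(self, data: Union[str, bytes]) -> None: ...

    async def recv(self) -> Union[str, bytes]: ...


async def expect_control(ws: Transport, expected_type: str) -> dict:
    """Read messages until a JSON control message of the expected type arrives."""
    while True:
        message = await ws.recv()
        if isinstance(message, (bytes, bytearray)):
            continue  # binary agent audio; ignored in this sketch
        payload = json.loads(message)
        if payload.get("type") == expected_type:
            return payload
        if payload.get("type") in ("error", "session_ended"):
            # Assumption: either message means the handshake cannot complete.
            raise ConnectionError(f"handshake failed: {payload}")


async def handshake(ws: Transport, token: str, first_frame: bytes) -> None:
    """Run steps 2 to 5 of the flow on an already open connection."""
    # Step 2: the session token must be the first message sent.
    await ws.send(json.dumps({"token": token}))
    # Step 3: wait for the server to confirm authentication.
    await expect_control(ws, "connected")
    # Step 4: microphone audio may start flowing now.
    await ws.send(first_frame)
    # Step 5: wait until the agent is listening.
    await expect_control(ws, "agent_ready")
```

After `handshake` returns, the connection is in the bidirectional audio phase (step 6), where sending and receiving would typically run as two concurrent tasks.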

Authentication

The first message you send over the WebSocket must be a JSON object, sent as a text message, that contains your session token. You have 10 seconds to send this message; otherwise the server closes the connection.
{ "token": "<your_session_token>" }

Server Messages

The server sends two types of messages: text (JSON control messages) and binary (PCM audio frames from the AI agent).

| Message Type | Format | Description |
|---|---|---|
| connected | JSON text | Authentication succeeded. You can start sending microphone audio. |
| agent_ready | JSON text | The AI agent is listening. Full duplex voice conversation is active. |
| session_ended | JSON text | Session has ended. Includes a reason field. Clean up resources. |
| error | JSON text | An error occurred. Includes code and message fields. |
| (binary) | Raw bytes | PCM audio frame from the AI agent. Play it back to the user. |

Message Formats

// Connected
{ "type": "connected" }

// Agent Ready
{ "type": "agent_ready" }

// Session Ended
{ "type": "session_ended", "reason": "agent_disconnected" }

// Error
{ "type": "error", "code": "AUTH_FAILED", "message": "Invalid or expired token" }
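
One way a client might tell these messages apart is sketched below. The function name `classify` and the `(kind, detail)` return shape are illustrative choices, not part of the protocol; the sketch only relies on the `type`, `reason`, `code`, and `message` fields shown above.

```python
import json
from typing import Union


def classify(message: Union[str, bytes]):
    """Return (kind, detail) for a message received from the server."""
    if isinstance(message, (bytes, bytearray)):
        # Binary messages carry PCM audio from the agent.
        return ("audio", bytes(message))

    payload = json.loads(message)
    kind = payload.get("type")
    if kind in ("connected", "agent_ready"):
        return (kind, {})
    if kind == "session_ended":
        return (kind, {"reason": payload.get("reason")})
    if kind == "error":
        return (kind, {"code": payload.get("code"),
                       "message": payload.get("message")})
    # Assumption: unrecognised control messages are surfaced, not dropped.
    return ("unknown", payload)
```

For example, `classify('{"type": "session_ended", "reason": "agent_disconnected"}')` yields `("session_ended", {"reason": "agent_disconnected"})`, while any `bytes` input is returned as an `"audio"` item ready for playback.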

Audio Format

Audio is streamed as raw PCM in both directions, with the same format for microphone input and agent output.

| Property | Value |
|---|---|
| Encoding | PCM signed 16-bit little-endian (s16le) |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Frame Duration | 20 ms |
| Samples per Frame | 320 |
| Bytes per Frame | 640 |

Send microphone audio as binary WebSocket messages. Each message should be exactly 640 bytes (one 20 ms frame: 16,000 samples/s × 0.02 s = 320 samples at 2 bytes each). Agent audio arrives in the same format.
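
The frame arithmetic and the 640-byte framing can be sketched like this. The helper names `pack_frame` and `split_into_frames` are hypothetical, and zero-padding a short final chunk with silence is an assumption; the protocol description does not say how a partial frame should be handled.

```python
import struct

SAMPLE_RATE_HZ = 16_000
FRAME_DURATION_MS = 20
BYTES_PER_SAMPLE = 2  # signed 16-bit
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_DURATION_MS // 1000
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE


def pack_frame(samples) -> bytes:
    """Encode exactly one frame of integer samples as s16le bytes."""
    if len(samples) != SAMPLES_PER_FRAME:
        raise ValueError(f"expected {SAMPLES_PER_FRAME} samples, got {len(samples)}")
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack(f"<{SAMPLES_PER_FRAME}h", *samples)


def split_into_frames(pcm: bytes) -> list:
    """Split s16le mono PCM into frames of BYTES_PER_FRAME bytes."""
    frames = []
    for start in range(0, len(pcm), BYTES_PER_FRAME):
        frame = pcm[start:start + BYTES_PER_FRAME]
        if len(frame) < BYTES_PER_FRAME:
            # Assumption: pad a short tail with silence (zero samples).
            frame += b"\x00" * (BYTES_PER_FRAME - len(frame))
        frames.append(frame)
    return frames
```

Each element returned by `split_into_frames` can then be sent as one binary WebSocket message.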

Session Lifecycle

idle --> connecting --> connected --> agent_ready --> ended
            |                              |
            +---------- error <------------+

| State | Description |
|---|---|
| idle | No active session. Ready to connect. |
| connecting | Fetching token and opening WebSocket. |
| connected | WebSocket authenticated. You can start sending audio. |
| agent_ready | AI agent is listening. Full duplex conversation active. |
| ended | Session terminated. Clean up resources. |
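
A client could track these states with a small state machine such as the sketch below. `SessionState`, `ALLOWED_TRANSITIONS`, and `Session` are hypothetical names. The transition table is read directly off the lifecycle diagram (error is reachable from connecting and agent_ready); a real client may need additional edges, for example if the session ends before the agent is ready.

```python
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    CONNECTED = "connected"
    AGENT_READY = "agent_ready"
    ENDED = "ended"
    ERROR = "error"


# Transitions taken from the lifecycle diagram; anything else is rejected.
ALLOWED_TRANSITIONS = {
    SessionState.IDLE: {SessionState.CONNECTING},
    SessionState.CONNECTING: {SessionState.CONNECTED, SessionState.ERROR},
    SessionState.CONNECTED: {SessionState.AGENT_READY},
    SessionState.AGENT_READY: {SessionState.ENDED, SessionState.ERROR},
    SessionState.ENDED: set(),
    SessionState.ERROR: set(),
}


class Session:
    """Tracks the current lifecycle state and validates each change."""

    def __init__(self) -> None:
        self.state = SessionState.IDLE

    def transition(self, new_state: SessionState) -> None:
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

Rejecting unexpected transitions early makes protocol mistakes, such as treating audio as ready before `agent_ready` arrives, easier to spot during development.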

Next Steps

Now that you understand the protocol, head to the Web Guide or Flutter Guide for a complete implementation walkthrough.