## Overview
The WebSocket endpoint provides real-time speech-to-text conversion over streaming audio. This is ideal for applications requiring low-latency transcription (e.g. interactive assistants). For one-shot requests, refer to the STT REST endpoint.
## Authentication
All WebSocket connections require the following headers:

| Header | Required | Description | Example |
|---|---|---|---|
| x-api-key-id | Yes | API key identifier for authentication. | api_key_id_123 |
| x-api-request-id | No | Unique request correlation ID (UUID). | c2ddae6a-da67-47dc-b0e7-70865a3701bc |
## Connection Flow
- Client opens a WebSocket connection to `/stt/v3` with the required auth headers.
- Server immediately sends a `connected` message with the active configuration.
- Client continuously sends binary audio frames (raw PCM, 16-bit LE, 16 kHz, mono).
- Server detects speech segments via VAD and responds with `processing` and `transcript` messages.
- Either side may close the connection at any time.
## Audio Format
| Property | Value |
|---|---|
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (32 ms per frame) |
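The table above implies each frame is 512 × 2 = 1024 bytes. As a minimal sketch (constant and function names are our own), the frame size can be derived and a raw PCM buffer split into frames; note that dropping a trailing partial frame is a simplification here, and a real client would buffer it until full:

```python
SAMPLE_RATE = 16_000      # Hz, mono
BYTES_PER_SAMPLE = 2      # PCM signed 16-bit little-endian
CHUNK_SAMPLES = 512       # samples per frame

FRAME_BYTES = CHUNK_SAMPLES * BYTES_PER_SAMPLE   # 1024 bytes per frame
FRAME_MS = 1000 * CHUNK_SAMPLES // SAMPLE_RATE   # 32 ms per frame

def frames(pcm: bytes) -> list[bytes]:
    """Split raw PCM into full 1024-byte frames; a trailing partial frame is dropped."""
    return [pcm[i:i + FRAME_BYTES]
            for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
```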
## Sending Audio (Client -> Server)
The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 × 16-bit samples = 32 ms of audio). Frames should be sent in real time; buffering or bursting may degrade VAD accuracy.

## Server Responses (Server -> Client)

The server sends JSON text frames for control and transcription. All messages share a common `type` discriminator field and an ISO-8601 timestamp.
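A minimal sketch of dispatching on the `type` discriminator. Field names beyond `type` and `timestamp` (e.g. `text`) are illustrative assumptions, not confirmed by the spec:

```python
import json

def handle(raw: str) -> str:
    """Dispatch a server JSON frame on its `type` discriminator field."""
    msg = json.loads(raw)
    kind = msg["type"]
    ts = msg.get("timestamp", "")  # ISO-8601 per the spec
    if kind == "connected":
        return f"connected at {ts}"
    if kind == "processing":
        return f"speech detected at {ts}"
    if kind == "transcript":
        # `text` is an assumed field name for the transcription payload
        return msg.get("text", "")
    raise ValueError(f"unknown message type: {kind}")
```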