Overview

The WebSocket endpoint provides real-time speech-to-text conversion over streaming audio. It is suited to applications that need low-latency transcription (e.g. interactive assistants). For the one-shot HTTP endpoint, refer to STT REST.

Endpoint

wss://api.vachana.ai/stt/v3

Authentication

WebSocket connections accept the following headers. Only x-api-key-id is required.

| Header | Required | Description | Example |
| --- | --- | --- | --- |
| x-api-key-id | Yes | API key identifier for authentication. | api_key_id_123 |
| x-api-request-id | No | Unique request correlation ID (UUID). | c2ddae6a-da67-47dc-b0e7-70865a3701bc |
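
As a minimal sketch, the headers above can be assembled before opening the connection. The helper name `build_auth_headers` is hypothetical, and generating a UUID when no request ID is supplied is a client-side choice assumed here, not something the service requires.

```python
import uuid
from typing import Optional


def build_auth_headers(api_key_id: str, request_id: Optional[str] = None) -> dict:
    """Return the handshake headers listed in the table above."""
    if not api_key_id:
        raise ValueError("x-api-key-id is required")
    # x-api-request-id is optional. Generating one here (an assumption of this
    # sketch) makes it easier to correlate client and server logs.
    return {
        "x-api-key-id": api_key_id,
        "x-api-request-id": request_id or str(uuid.uuid4()),
    }
```

Pass the resulting mapping to whatever WebSocket client you use; the keyword for extra handshake headers differs between libraries.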

Connection Flow

  1. Client opens a WebSocket connection to /stt/v3 with the required auth headers.
  2. Server immediately sends a connected message with the active configuration.
  3. Client continuously sends binary audio frames (raw PCM, 16-bit LE, 16 kHz, mono).
  4. Server detects speech segments via VAD and responds with processing and transcript messages.
  5. Either side may close the connection at any time.

Audio Format

| Property | Value |
| --- | --- |
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (32 ms per frame) |

Sending Audio (Client -> Server)

The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 × 16-bit samples = 32 ms of audio). Frames should be sent in real time; buffering or bursting may degrade VAD accuracy.
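
The framing rule above can be sketched as follows. `iter_frames` and `send_paced` are hypothetical helper names, and `send` stands for whatever function your WebSocket client uses to send a binary frame. Padding a short final frame with zero bytes (digital silence) is an assumption of this sketch; the service does not document how a trailing partial frame should be handled.

```python
import time
from typing import Callable, Iterable, Iterator

SAMPLE_RATE_HZ = 16_000
SAMPLES_PER_FRAME = 512
BYTES_PER_SAMPLE = 2  # 16-bit PCM
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE    # 1024 bytes
FRAME_SECONDS = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ    # 0.032 s


def iter_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield consecutive 1024-byte frames from raw 16-bit LE mono PCM."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        # Assumption: pad a short final frame with silence so that every
        # frame sent is exactly FRAME_BYTES long.
        yield frame.ljust(FRAME_BYTES, b"\x00")


def send_paced(
    frames: Iterable[bytes],
    send: Callable[[bytes], None],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Send one frame per 32 ms, scheduling against the start time to limit drift."""
    start = clock()
    for i, frame in enumerate(frames):
        send(frame)
        # Sleep until the moment the next frame is due, not a fixed 32 ms,
        # so time spent inside send() does not accumulate as lag.
        delay = start + (i + 1) * FRAME_SECONDS - clock()
        if delay > 0:
            sleep(delay)
```

The `sleep` and `clock` parameters exist only so the pacing can be exercised without waiting in real time.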

Server Responses (Server -> Client)

The server sends JSON text frames for control and transcription. All messages share a common type discriminator field and an ISO-8601 timestamp.
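
A minimal sketch of decoding one server frame and checking the shared fields. The function name `parse_server_message` is hypothetical, and converting the timestamp to a `datetime` is a convenience chosen here rather than anything the service requires.

```python
import json
from datetime import datetime


def parse_server_message(raw: str) -> dict:
    """Decode one JSON text frame and validate the type/timestamp envelope."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or "type" not in msg or "timestamp" not in msg:
        raise ValueError("message lacks the type/timestamp envelope")
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    msg["timestamp"] = datetime.fromisoformat(msg["timestamp"].replace("Z", "+00:00"))
    return msg
```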

Connected Status

Sent once immediately after the WebSocket handshake succeeds.
{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "request_id": "<x-api-request-id>",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512
  }
}
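
Because the connected message reports the active configuration, a client can compare it with the audio format it is about to send before streaming any frames. The sketch below assumes the already-decoded message is a `dict`; `config_mismatches` and `EXPECTED_CONFIG` are hypothetical names.

```python
from typing import Optional

# Values the client intends to use, taken from the Audio Format section.
EXPECTED_CONFIG = {"sample_rate": 16000, "chunk_size": 512}


def config_mismatches(connected: dict, expected: Optional[dict] = None) -> dict:
    """Return {key: (expected, actual)} for every config value that differs."""
    expected = EXPECTED_CONFIG if expected is None else expected
    actual = connected.get("config", {})
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

An empty result means the server configuration matches what the client will send; anything else is worth logging before streaming begins.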

Processing Status

Emitted when the VAD has detected the end of a speech segment and transcription has begun. Acts as a low-latency acknowledgement that audio was captured and is being processed.
{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}

Transcript Response

Contains the transcribed text for a completed speech segment.
{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": "<segment_index>",
  "latency": 320,
  "detected_language": "en"
}
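
Each transcript message covers one speech segment, so a client that wants the full utterance history has to accumulate them. A minimal sketch, assuming segments are kept in arrival order (the type of segment_index is not specified above, so it is not used for sorting here); `TranscriptLog` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptLog:
    """Collects decoded transcript messages in the order they arrive."""
    segments: list = field(default_factory=list)

    def add(self, msg: dict) -> None:
        # Ignore connected/processing/error messages; keep only transcripts.
        if msg.get("type") == "transcript":
            self.segments.append(msg)

    @property
    def text(self) -> str:
        """All non-empty segment texts joined with single spaces."""
        parts = (s.get("text", "").strip() for s in self.segments)
        return " ".join(p for p in parts if p)

    @property
    def audio_ms(self) -> int:
        """Total transcribed audio, summed from audio_duration_ms."""
        return sum(s.get("audio_duration_ms", 0) for s in self.segments)
```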

Error State

Sent when the server encounters a recoverable or fatal error.
{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}
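
The four message types can be routed on the type field. The sketch below is one possible shape: `handle_message` and its callback parameters are hypothetical. The error payload shown above carries no field that separates recoverable from fatal errors, so this sketch only surfaces the message and leaves the decision to reconnect to the caller.

```python
from typing import Callable


def handle_message(
    msg: dict,
    on_transcript: Callable[[str], None],
    on_error: Callable[[str], None],
) -> str:
    """Route one decoded server message and return its type."""
    kind = msg.get("type", "")
    if kind == "transcript":
        on_transcript(msg.get("text", ""))
    elif kind == "error":
        on_error(msg.get("message", ""))
    # "connected" and "processing" are status messages and need no callback
    # here; unknown types are ignored so new message types do not break
    # the client.
    return kind
```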