Overview

Stream audio in real time and receive transcriptions as speech is detected. This mode is best suited to live conversations and interactive applications. Realtime STT is multilingual and detects the spoken language automatically. For pre-recorded audio, use STT REST.

Endpoint

wss://api.vachana.ai/stt/v3/stream

Authentication

All Realtime connections require the following headers:

| Header | Required | Description | Example |
| --- | --- | --- | --- |
| x-api-key-id | Yes | API key identifier for authentication. | api_key_id_123 |
| lang_code | Yes | Language code for transcription. Defaults to en-IN. See supported language codes below. | en-IN |
| x-sample-rate | No | Sample rate of the audio in Hz. Accepted values: 8000, 16000, 44100, 48000. Defaults to 16000. | 16000 |

Choosing the right sample rate:
  • 48000 — Mac microphone or browser getUserMedia default
  • 44100 — Mac microphone alternate / CD-quality audio
  • 16000 — Wideband telephony; sent as-is with no resampling
  • 8000 — Narrow-band telephony (legacy PSTN / VoIP)
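
As a minimal sketch, the headers above can be assembled and checked before opening the connection. The helper name `build_headers` is hypothetical; only the header names, the defaults, and the accepted sample rates come from this page.

```python
from __future__ import annotations

# Accepted values for x-sample-rate, as listed in the header table.
ACCEPTED_SAMPLE_RATES = (8000, 16000, 44100, 48000)


def build_headers(api_key_id: str, lang_code: str = "en-IN",
                  sample_rate: int = 16000) -> dict[str, str]:
    """Return the connection headers as a plain dict (hypothetical helper)."""
    if not api_key_id:
        raise ValueError("x-api-key-id is required")
    if sample_rate not in ACCEPTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return {
        "x-api-key-id": api_key_id,
        "lang_code": lang_code,
        "x-sample-rate": str(sample_rate),  # header values are sent as strings
    }
```

Rejecting an unsupported rate on the client side avoids opening a connection that the server would have to refuse or reinterpret.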

Supported Language Codes

| Language | Code | Native Script | Example Text |
| --- | --- | --- | --- |
| Bengali | bn-IN | Bengali (বাংলা) | “আমি ভাত খাই” |
| English | en-IN | Latin | “I am going to the market” |
| Gujarati | gu-IN | Gujarati (ગુજરાતી) | “હું બજાર જાઉં છું” |
| Hindi | hi-IN | Devanagari (हिन्दी) | “मैं बाज़ार जा रहा हूँ” |
| Kannada | kn-IN | Kannada (ಕನ್ನಡ) | “ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ” |
| Malayalam | ml-IN | Malayalam (മലയാളം) | “ഞാൻ ചന്തയിലേക്ക് പോകുന്നു” |
| Marathi | mr-IN | Devanagari (मराठी) | “मी बाजारात जातोय” |
| Punjabi | pa-IN | Gurmukhi (ਪੰਜਾਬੀ) | “ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ” |
| Tamil | ta-IN | Tamil (தமிழ்) | “நான் சந்தைக்கு செல்கிறேன்” |
| Telugu | te-IN | Telugu (తెలుగు) | “నేను మార్కెట్‌కి వెళ్తున్నాను” |
| Hinglish (Latin) (experimental) | en-hi-IN-latn | Latin | “Main market ja raha hu” |
| Hinglish (experimental) | en-hi-in-cm | Latin + Devanagari (हिन्दी) | “मैं market जा रहा हूँ” |
| Auto-detect (experimental) | en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN | All supported | Automatically detects language |
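
The codes listed above can be kept in one place on the client so that a typo in lang_code is caught before connecting. This is a sketch only: the constant and function names are hypothetical, and it assumes the auto-detect value must be sent exactly as listed (whether the server accepts a subset or a different order is not stated here).

```python
# Single-language codes, in the order used by the auto-detect value.
LANGUAGE_CODES = (
    "en-IN", "hi-IN", "ta-IN", "te-IN", "kn-IN",
    "ml-IN", "gu-IN", "mr-IN", "bn-IN", "pa-IN",
)

# Experimental Hinglish codes, copied verbatim from the table.
EXPERIMENTAL_CODES = ("en-hi-IN-latn", "en-hi-in-cm")

# Experimental auto-detect value: every single-language code, comma-separated.
AUTO_DETECT = ",".join(LANGUAGE_CODES)


def is_known_lang_code(value: str) -> bool:
    """True if value matches one of the lang_code values listed on this page."""
    return (
        value in LANGUAGE_CODES
        or value in EXPERIMENTAL_CODES
        or value == AUTO_DETECT
    )
```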

Connection Flow

  1. Client opens a Realtime connection to /stt/v3/stream with the required auth headers.
  2. Server immediately sends a connected message with the active configuration.
  3. Client continuously sends binary audio frames (raw PCM, signed 16-bit little-endian, mono; 16 kHz unless x-sample-rate specifies another rate).
  4. Server detects speech segments via VAD and responds with processing and transcript messages along with the detected language.
  5. Either side may close the connection at any time.
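
The flow above can be sketched as a small asyncio driver. It is transport-agnostic: `conn` is a hypothetical connection object that supports `await conn.send(bytes)`, `await conn.close()`, and `async for` over incoming text frames, which is the shape commonly exposed by Python WebSocket client libraries. Opening the connection with the auth headers is left out, and closing right after the last frame is a simplification that may drop a transcript still in flight.

```python
import asyncio
import json


async def run_session(conn, frames, frame_interval_s=0.032):
    """Send audio frames over `conn` and collect transcript texts (sketch)."""
    ready = asyncio.Event()
    transcripts = []

    async def send_audio():
        await ready.wait()                       # wait for step 2
        for frame in frames:                     # step 3: steady cadence
            await conn.send(frame)
            await asyncio.sleep(frame_interval_s)
        await conn.close()                       # step 5: client closes

    sender = asyncio.create_task(send_audio())
    try:
        async for raw in conn:                   # ends when the connection closes
            msg = json.loads(raw)
            if msg["type"] == "connected":       # step 2
                ready.set()
            elif msg["type"] == "transcript":    # step 4
                transcripts.append(msg["text"])
            elif msg["type"] == "error":
                raise RuntimeError(msg["message"])
    finally:
        # Stop the sender if it is still running; its own errors are ignored here.
        sender.cancel()
        await asyncio.gather(sender, return_exceptions=True)
    return transcripts
```

Sending and receiving run concurrently because the server can emit processing and transcript messages while audio is still being streamed.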

Audio Format

For 16 kHz

| Property | Value |
| --- | --- |
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (32 ms per frame) |

For 8 kHz

| Property | Value |
| --- | --- |
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 8,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (64 ms per frame) |
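
The per-frame figures above follow directly from the chunk size and the sample rate. A short sketch of the arithmetic, with hypothetical helper names:

```python
def frame_bytes(samples_per_chunk: int = 512, bytes_per_sample: int = 2,
                channels: int = 1) -> int:
    """Size of one audio frame in bytes (16-bit mono by default)."""
    return samples_per_chunk * bytes_per_sample * channels


def frame_duration_ms(sample_rate: int, samples_per_chunk: int = 512) -> float:
    """Duration of one frame in milliseconds at the given sample rate."""
    return samples_per_chunk * 1000 / sample_rate
```

With the defaults, `frame_bytes()` gives 1024, and `frame_duration_ms` gives 32.0 at 16000 Hz and 64.0 at 8000 Hz, matching the tables.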

Sending Audio (Client -> Server)

The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 samples × 2 bytes per 16-bit sample), which is 32 ms of audio at 16 kHz or 64 ms at 8 kHz. Send frames in real time; buffering or bursting may degrade VAD accuracy.
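
One way to meet the fixed frame size is to slice a PCM buffer into 1024-byte pieces before sending. The sketch below pads the final short frame with zero bytes (digital silence); that padding is an assumption, since this page does not say how a trailing partial frame should be handled. The name `iter_frames` is hypothetical.

```python
FRAME_BYTES = 1024  # 512 samples x 2 bytes per sample


def iter_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES):
    """Yield fixed-size frames from raw 16-bit PCM, zero-padding the last one."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            # Assumption: pad the tail with silence rather than dropping it.
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame
```

Each yielded frame can then be sent as one binary WebSocket message, with a pause of one frame duration between sends to keep the real-time cadence.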

Server Responses (Server -> Client)

The server sends JSON text frames for control and transcription. All messages share a common type discriminator field and an ISO-8601 timestamp.

Connected Status

Sent once immediately after the WebSocket handshake succeeds.
{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512
  }
}
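
Because the connected message carries the active configuration, a client can confirm that the server is using the sample rate it asked for before streaming audio. A minimal sketch, with a hypothetical function name:

```python
import json


def read_connected(raw: str, requested_sample_rate: int = 16000) -> dict:
    """Parse the first server message and return its config block."""
    msg = json.loads(raw)
    if msg.get("type") != "connected":
        raise ValueError(f"expected 'connected', got {msg.get('type')!r}")
    config = msg.get("config", {})
    if config.get("sample_rate") != requested_sample_rate:
        raise ValueError("server sample rate differs from the requested one")
    return config
```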

Processing Status

Emitted when the VAD has detected the end of a speech segment and transcription has begun. Acts as a low-latency acknowledgement that audio was captured and is being processed.
{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}

Transcript Response

Contains the transcribed text for a completed speech segment.
{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": "<segment_index>",
  "latency": 320
}
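
A transcript message can be mapped onto a small typed record so the rest of the client does not handle raw dicts. The class and function names are hypothetical. The type of segment_index and the unit of latency are not stated on this page, so the sketch leaves the first untyped and only assumes milliseconds for the second.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Transcript:
    text: str
    audio_duration_ms: int
    segment_id: str
    segment_index: Any   # type not specified in the source
    latency: int         # unit not specified; assumed to be milliseconds
    timestamp: str       # ISO-8601 string, kept as received


def parse_transcript(raw: str) -> Transcript:
    """Build a Transcript from a JSON text frame of type 'transcript'."""
    msg = json.loads(raw)
    if msg.get("type") != "transcript":
        raise ValueError(f"not a transcript message: {msg.get('type')!r}")
    return Transcript(
        text=msg["text"],
        audio_duration_ms=msg["audio_duration_ms"],
        segment_id=msg["segment_id"],
        segment_index=msg["segment_index"],
        latency=msg["latency"],
        timestamp=msg["timestamp"],
    )
```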

Error State

Sent when the server encounters a recoverable or fatal error.
{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}
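
Since every server message carries a type field, a client can route on it in one place. The sketch below turns error messages into an exception and passes transcript text to a callback. `SttError` and `handle_message` are hypothetical names, and because the error payload shown above has no field that separates recoverable from fatal errors, the sketch treats them all the same way.

```python
import json


class SttError(Exception):
    """Raised when the server sends a message of type 'error'."""


def handle_message(raw: str, on_transcript) -> str:
    """Route one JSON text frame by its type and return that type."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        raise SttError(msg.get("message", "unknown error"))
    if kind == "transcript":
        on_transcript(msg["text"])
    # 'connected' and 'processing' need no action in this sketch.
    return kind
```

For example, `handle_message(raw, print)` prints the text of each transcript and returns the message type, so a caller can still react to connected or processing if it needs to.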