Overview
Stream audio in real time and receive transcriptions as speech is detected. Realtime STT is best suited to live conversations and interactive applications. It is multilingual and can detect the spoken language automatically. For pre-recorded audio, use STT REST.
Endpoint
Realtime connections are opened at /stt/v3/stream.
Authentication
Realtime connections are configured with the following headers:
| Header | Required | Description | Example |
|---|---|---|---|
| x-api-key-id | Yes | API key identifier for authentication. | api_key_id_123 |
| lang_code | Yes | Language code for transcription. Defaults to en-IN. See supported language codes below. | en-IN |
| x-sample-rate | No | Sample rate of the audio in Hz. Accepted values: 8000, 16000, 44100, 48000. Defaults to 16000. | 16000 |
Choosing the right sample rate:
- 48000: Mac microphone or browser getUserMedia default
- 44100: Mac microphone alternate / CD-quality audio
- 16000: Wideband telephony; sent as-is with no resampling
- 8000: Narrow-band telephony (legacy PSTN / VoIP)
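As an illustration of the header table above, the following minimal sketch assembles the connection headers and rejects sample rates outside the accepted set. The function name `build_headers` and the constant `ACCEPTED_SAMPLE_RATES` are hypothetical helpers for this example, not part of the API; only the header names, defaults, and accepted values come from the table.

```python
# Accepted values for x-sample-rate, as listed in the header table.
ACCEPTED_SAMPLE_RATES = (8000, 16000, 44100, 48000)


def build_headers(api_key_id, lang_code="en-IN", sample_rate=16000):
    """Return the header dict for a Realtime connection (hypothetical helper)."""
    if sample_rate not in ACCEPTED_SAMPLE_RATES:
        raise ValueError(
            f"unsupported sample rate {sample_rate}; "
            f"expected one of {ACCEPTED_SAMPLE_RATES}"
        )
    return {
        "x-api-key-id": api_key_id,
        "lang_code": lang_code,
        # Header values are sent as strings.
        "x-sample-rate": str(sample_rate),
    }
```

For example, `build_headers("api_key_id_123", lang_code="hi-IN", sample_rate=8000)` would describe a Hindi narrow-band telephony stream.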
Supported Language Codes
| Language | Code | Native Script | Example Text |
|---|---|---|---|
| Bengali | bn-IN | Bengali (বাংলা) | “আমি ভাত খাই” |
| English | en-IN | Latin | “I am going to the market” |
| Gujarati | gu-IN | Gujarati (ગુજરાતી) | “હું બજાર જાઉં છું” |
| Hindi | hi-IN | Devanagari (हिन्दी) | “मैं बाज़ार जा रहा हूँ” |
| Kannada | kn-IN | Kannada (ಕನ್ನಡ) | “ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ” |
| Malayalam | ml-IN | Malayalam (മലയാളം) | “ഞാൻ ചന്തയിലേക്ക് പോകുന്നു” |
| Marathi | mr-IN | Devanagari (मराठी) | “मी बाजारात जातोय” |
| Punjabi | pa-IN | Gurmukhi (ਪੰਜਾਬੀ) | “ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ” |
| Tamil | ta-IN | Tamil (தமிழ்) | “நான் சந்தைக்கு செல்கிறேன்” |
| Telugu | te-IN | Telugu (తెలుగు) | “నేను మార్కెట్కి వెళ్తున్నాను” |
| Hinglish (Latin) (experimental) | en-hi-IN-latn | Latin | “Main market ja raha hu” |
| Hinglish (experimental) | en-hi-in-cm | Latin + Devanagari (हिन्दी) | “मैं market जा रहा हूँ” |
| Auto-detect (experimental) | en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN | All supported | Automatically detects language |
Connection Flow
- Client opens a Realtime connection to /stt/v3/stream with the required auth headers.
- Server immediately sends a connected message with the active configuration.
- Client continuously sends binary audio frames (raw PCM, 16-bit LE, 16 kHz, mono).
- Server detects speech segments via VAD and responds with processing and transcript messages along with the detected language.
- Either side may close the connection at any time.
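The flow above can be sketched in a transport-agnostic way. This is only an outline under stated assumptions: `ws` is a hypothetical adapter object exposing async `send()` and `recv()` methods, and its `recv()` is assumed to return `None` once the connection has closed. Real WebSocket libraries signal closure differently (often by raising an exception), so the adapter would need to wrap whichever client is actually used.

```python
import asyncio
import json


async def run_session(ws, frames, on_message):
    """Drive one Realtime session over `ws` (hypothetical adapter, see above)."""
    # Step 2: the server is expected to greet with a `connected` message.
    first = json.loads(await ws.recv())
    if first.get("type") != "connected":
        raise RuntimeError(f"expected 'connected', got {first.get('type')!r}")
    on_message(first)

    async def send_audio():
        # Step 3: stream binary audio frames to the server.
        for frame in frames:
            await ws.send(frame)

    async def receive():
        # Step 4: handle `processing` / `transcript` messages as they arrive.
        while True:
            raw = await ws.recv()
            if raw is None:  # assumption: adapter signals closure with None
                return
            on_message(json.loads(raw))

    # Sending and receiving run concurrently until the connection closes.
    await asyncio.gather(send_audio(), receive())
```

Running the sender and receiver concurrently matters because transcripts can arrive while audio is still being streamed.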
Audio Format
For 16 kHz
| Property | Value |
|---|---|
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (32 ms per frame) |
For 8 kHz
| Property | Value |
|---|---|
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 8,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (64 ms per frame) |
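The chunk sizes in the tables above follow from simple arithmetic: 512 samples at 2 bytes each is 1024 bytes, and the frame duration is 512 divided by the sample rate. The sketch below checks that arithmetic and splits a raw PCM buffer into fixed-size frames. The helper names are hypothetical, and zero-padding the final partial frame with silence is an assumption made for this example; the service may expect a different treatment of trailing audio.

```python
SAMPLES_PER_FRAME = 512
BYTES_PER_SAMPLE = 2  # signed 16-bit PCM
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 1024 bytes


def frame_duration_ms(sample_rate):
    """Duration of one 512-sample frame in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate


def iter_frames(pcm):
    """Yield 1024-byte frames from raw mono 16-bit little-endian PCM bytes."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if len(frame) < FRAME_BYTES:
            # Assumption: pad the last partial frame with silence (zero bytes).
            frame += b"\x00" * (FRAME_BYTES - len(frame))
        yield frame
```

To keep a steady cadence, a sender would wait `frame_duration_ms(sample_rate)` milliseconds between frames when the audio does not already come from a live source.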
Sending Audio (Client -> Server)
The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 samples × 2 bytes per 16-bit sample), which is 32 ms of audio at 16 kHz or 64 ms at 8 kHz. Send frames in real time; buffering or bursting may degrade VAD accuracy.
Server Responses (Server -> Client)
The server sends JSON text frames for control and transcription. All messages share a common type discriminator field and an ISO-8601 timestamp.
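A minimal sketch of reading these messages follows. The `type` field and the message types `connected`, `processing`, and `transcript` come from this page; the field name `timestamp` is an assumption, since only its ISO-8601 format is stated here. The function `parse_server_message` is a hypothetical helper, and it returns the full payload untouched rather than guessing at other field names.

```python
import json
from datetime import datetime

KNOWN_TYPES = {"connected", "processing", "transcript"}


def parse_server_message(raw):
    """Return (type, timestamp, payload) for one JSON text frame."""
    payload = json.loads(raw)
    msg_type = payload.get("type")
    # Unknown types are passed through so new message kinds do not break clients.
    ts_text = payload.get("timestamp")  # assumed field name
    timestamp = None
    if ts_text:
        # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
        timestamp = datetime.fromisoformat(ts_text.replace("Z", "+00:00"))
    return msg_type, timestamp, payload


def is_known(msg_type):
    """True if the type is one of the message types named on this page."""
    return msg_type in KNOWN_TYPES
```

A client would typically branch on the returned type, for example showing a progress indicator on `processing` and rendering text on `transcript`.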