## Overview
The WebSocket endpoint provides real-time speech-to-text conversion over streaming audio. This is ideal for applications requiring low-latency transcription (e.g. interactive assistants). For one-shot requests, refer to the STT REST endpoint.
## Authentication
All WebSocket connections require the following headers:

| Header | Required | Description | Example |
|---|---|---|---|
| x-api-key-id | Yes | API key identifier for authentication. | api_key_id_123 |
| x-api-request-id | No | Unique request correlation ID (UUID). | c2ddae6a-da67-47dc-b0e7-70865a3701bc |
## Connection Flow
- Client opens a WebSocket connection to `/stt/v3` with the required auth headers.
- Server immediately sends a `connected` message with the active configuration.
- Client continuously sends binary audio frames (raw PCM, 16-bit LE, 16 kHz, mono).
- Server detects speech segments via VAD and responds with `processing` and `transcript` messages.
- Either side may close the connection at any time.
## Audio Format
| Property | Value |
|---|---|
| Encoding | PCM signed 16-bit little-endian |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Chunk Size | 512 samples (32 ms per frame) |
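The table above implies each frame is 512 × 2 = 1024 bytes. As a minimal sketch (constant and function names are our own), the frame size can be derived and a raw PCM buffer split into frames; note that dropping a trailing partial frame is a simplification here, and a real client would buffer it until full:

```python
SAMPLE_RATE = 16_000      # Hz, mono
BYTES_PER_SAMPLE = 2      # PCM signed 16-bit little-endian
CHUNK_SAMPLES = 512       # samples per frame

FRAME_BYTES = CHUNK_SAMPLES * BYTES_PER_SAMPLE   # 1024 bytes per frame
FRAME_MS = 1000 * CHUNK_SAMPLES // SAMPLE_RATE   # 32 ms per frame

def frames(pcm: bytes) -> list[bytes]:
    """Split raw PCM into full 1024-byte frames; a trailing partial frame is dropped."""
    return [pcm[i:i + FRAME_BYTES]
            for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
```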
## Sending Audio (Client -> Server)
The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 × 16-bit samples = 32 ms of audio). Frames should be sent in real time; buffering or bursting may degrade VAD accuracy.

## Server Responses (Server -> Client)

The server sends JSON text frames for control and transcription. All messages share a common `type` discriminator field and an ISO-8601 timestamp.
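A minimal sketch of dispatching on the `type` discriminator. Field names beyond `type` and `timestamp` (e.g. `text`) are illustrative assumptions, not confirmed by the spec:

```python
import json

def handle(raw: str) -> str:
    """Dispatch a server JSON frame on its `type` discriminator field."""
    msg = json.loads(raw)
    kind = msg["type"]
    ts = msg.get("timestamp", "")  # ISO-8601 per the spec
    if kind == "connected":
        return f"connected at {ts}"
    if kind == "processing":
        return f"speech detected at {ts}"
    if kind == "transcript":
        # `text` is an assumed field name for the transcription payload
        return msg.get("text", "")
    raise ValueError(f"unknown message type: {kind}")
```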