
Overview

Stream audio in real time and receive transcriptions as speech is detected. Best suited for live conversations and interactive applications. Realtime STT is multilingual and detects the spoken language automatically. For pre-recorded audio, use STT REST.

Endpoint

wss://api.vachana.ai/stt/v3/stream

Authentication

All Realtime connections require the following headers:
Header          Required   Description                                                                                       Example
x-api-key-id    Yes        API key identifier for authentication.                                                            api_key_id_123
lang_code       Yes        Language code for transcription. Defaults to en-IN. See supported language codes below.           en-IN
x-sample-rate   No         Sample rate of the audio in Hz. Accepted values: 8000, 16000, 44100, 48000. Defaults to 16000.    16000
Choosing the right sample rate:
  • 48000 — Mac microphone or browser getUserMedia default
  • 44100 — Mac microphone alternate / CD-quality audio
  • 16000 — Wideband telephony; sent as-is with no resampling
  • 8000 — Narrow-band telephony (legacy PSTN / VoIP)
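
If you are connecting to the endpoint directly rather than through the Python SDK below, the headers are sent with the WebSocket handshake. A minimal connection sketch, assuming the third-party websockets package (the keyword argument is extra_headers on older releases and additional_headers on websockets 14+):
import asyncio
import websockets

HEADERS = {
    "x-api-key-id": "your-api-key-id",
    "lang_code": "hi-IN",
    "x-sample-rate": "16000",
}

async def main():
    # The auth headers go out with the handshake request.
    async with websockets.connect(
        "wss://api.vachana.ai/stt/v3/stream",
        extra_headers=HEADERS,  # use additional_headers= on websockets >= 14
    ) as ws:
        print(await ws.recv())  # the first message is the "connected" status

asyncio.run(main())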

Supported Language Codes

Language                         Code            Native Script                 Example Text
Bengali                          bn-IN           Bengali (বাংলা)               "আমি ভাত খাই"
English                          en-IN           Latin                         "I am going to the market"
Gujarati                         gu-IN           Gujarati (ગુજરાતી)             "હું બજાર જાઉં છું"
Hindi                            hi-IN           Devanagari (हिन्दी)            "मैं बाज़ार जा रहा हूँ"
Kannada                          kn-IN           Kannada (ಕನ್ನಡ)               "ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ"
Malayalam                        ml-IN           Malayalam (മലയാളം)            "ഞാൻ ചന്തയിലേക്ക് പോകുന്നു"
Marathi                          mr-IN           Devanagari (मराठी)             "मी बाजारात जातोय"
Punjabi                          pa-IN           Gurmukhi (ਪੰਜਾਬੀ)             "ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ"
Tamil                            ta-IN           Tamil (தமிழ்)                 "நான் சந்தைக்கு செல்கிறேன்"
Telugu                           te-IN           Telugu (తెలుగు)               "నేను మార్కెట్‌కి వెళ్తున్నాను"
Hinglish (Latin) (experimental)  en-hi-IN-latn   Latin                         "Main market ja raha hu"
Hinglish (experimental)          en-hi-in-cm     Latin + Devanagari (हिन्दी)    "मैं market जा रहा हूँ"
Auto-detect (experimental)       en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN   All supported   Automatically detects language
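
For auto-detect, the comma-separated list above is the value to pass as the language code. A sketch of the corresponding header, assuming lang_code is supplied as described under Authentication:
HEADERS = {
    "x-api-key-id": "your-api-key-id",
    # Auto-detect (experimental): pass the full comma-separated code list as lang_code
    "lang_code": "en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN",
}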

Connection Flow

  1. Client opens a Realtime connection to /stt/v3/stream with the required auth headers.
  2. Server immediately sends a connected message with the active configuration.
  3. Client continuously sends binary audio frames (raw PCM, signed 16-bit little-endian, mono, at the configured sample rate).
  4. Server detects speech segments via VAD and responds with processing and transcript messages along with the detected language.
  5. Either side may close the connection at any time.

Audio Format

For 16 kHz

Property      Value
Encoding      PCM signed 16-bit little-endian
Sample Rate   16,000 Hz
Channels      1 (mono)
Chunk Size    512 samples (32 ms per frame)

For 8 kHz

Property      Value
Encoding      PCM signed 16-bit little-endian
Sample Rate   8,000 Hz
Channels      1 (mono)
Chunk Size    512 samples (64 ms per frame)
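
The chunk sizes above follow directly from the sample rate; a quick sanity check in Python:
BYTES_PER_SAMPLE = 2     # PCM signed 16-bit
CHUNK_SAMPLES = 512

for sample_rate in (16_000, 8_000):
    frame_bytes = CHUNK_SAMPLES * BYTES_PER_SAMPLE    # 1024 bytes per frame
    frame_ms = CHUNK_SAMPLES / sample_rate * 1000     # 32.0 ms at 16 kHz, 64.0 ms at 8 kHz
    print(sample_rate, frame_bytes, frame_ms)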

Sending Audio (Client -> Server)

The client sends binary WebSocket frames at a steady cadence. Each frame must be exactly 1024 bytes (512 × 16-bit samples), which is 32 ms of audio at 16 kHz or 64 ms at 8 kHz. Frames should be sent in real time; buffering or bursting may degrade VAD accuracy.
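
A minimal pacing sketch for a raw connection, assuming ws is a connection opened as in the Authentication example and the file contains 16 kHz PCM audio:
import asyncio

FRAME_BYTES = 1024       # 512 samples x 2 bytes per sample
FRAME_SECONDS = 0.032    # 32 ms per frame at 16 kHz

async def send_file(ws, path: str):
    # Read the raw PCM file frame by frame and send each frame as a binary
    # WebSocket message, sleeping between frames for real-time cadence.
    with open(path, "rb") as f:
        while frame := f.read(FRAME_BYTES):
            await ws.send(frame)
            await asyncio.sleep(FRAME_SECONDS)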

Server Responses (Server -> Client)

The server sends JSON text frames for control and transcription. All messages share a common type discriminator field and an ISO-8601 timestamp.

Connected Status

Sent once immediately after the WebSocket handshake succeeds.
{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512,
  }
}

Processing Status

Emitted when the VAD has detected the end of a speech segment and transcription has begun. Acts as a low-latency acknowledgement that audio was captured and is being processed.
{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}

Transcript Response

Contains the transcribed text for a completed speech segment.
{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": "<segment_index>",
  "latency": 320,
}

Error State

Sent when the server encounters a recoverable or fatal error.
{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}
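
On the client side, each text frame can be parsed and dispatched on the type field. A minimal sketch for raw connections (the SDK below does this for you), using only the fields documented above:
import json

def handle_message(raw: str) -> None:
    msg = json.loads(raw)
    if msg["type"] == "connected":
        print("Ready:", msg["config"])
    elif msg["type"] == "processing":
        print("Speech segment captured, transcribing...")
    elif msg["type"] == "transcript":
        print(f'[{msg.get("segment_index")}] {msg["text"]} ({msg["audio_duration_ms"]} ms)')
    elif msg["type"] == "error":
        print("Server error:", msg["message"])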

Python SDK

The official Python SDK wraps the WebSocket connection, audio pacing, and event parsing into a clean async interface so you can focus on your application logic.

Installation

pip install gnani-vachana
Requires Python 3.9+.

Authentication

The streaming client requires only your API key.
from gnani.stt import GnaniSTTStreamClient

stream = GnaniSTTStreamClient(api_key="your-api-key", language_code="hi-IN")

Stream Audio from a File

The recommended approach — use the async context manager and the stream_audio helper. It handles real-time pacing automatically so audio is sent at the correct cadence for VAD.
import asyncio
from gnani.stt import GnaniSTTStreamClient, StreamTranscriptEvent

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
        sample_rate=16000,
    ) as stream:
        with open("audio.pcm", "rb") as f:
            transcripts = await stream.stream_audio(
                f,
                on_transcript=lambda t: print(f"Transcript: {t.text}"),
                on_processing=lambda p: print("Processing..."),
                realtime_pace=True,   # sends frames at real-time cadence
            )

    print(f"Total segments: {len(transcripts)}")

asyncio.run(main())

Iterate Over Events Manually

If you need lower-level control — for example to handle each event type differently or interleave sending and receiving — iterate over the stream directly.
import asyncio
from gnani.stt import GnaniSTTStreamClient, StreamTranscriptEvent, StreamProcessingEvent

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
    ) as stream:
        # Send audio chunks
        with open("audio.pcm", "rb") as f:
            while chunk := f.read(1024):
                await stream.send_audio(chunk)
                await asyncio.sleep(0.032)  # 32 ms per frame

        # Process events
        async for event in stream:
            if isinstance(event, StreamTranscriptEvent):
                print(f"[Segment {event.segment_index}] {event.text}")
                print(f"  Duration: {event.audio_duration_ms}ms  Latency: {event.latency}ms")
            elif isinstance(event, StreamProcessingEvent):
                print("Processing speech...")

asyncio.run(main())

Using 8 kHz Audio (Telephony)

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="en-IN",
    sample_rate=8000,
)

SDK Event Types

All events are typed dataclasses with a raw field containing the full server JSON.
Event                    Key Fields                                          Description
StreamConnectedEvent     message, sample_rate, chunk_size                    Sent once after the WebSocket handshake. Confirms the active config.
StreamProcessingEvent    timestamp                                           VAD detected end-of-speech. Transcription has started.
StreamTranscriptEvent    text, segment_index, audio_duration_ms, latency     Completed transcript for a speech segment.
StreamErrorEvent         message, timestamp                                  Server-side error, recoverable or fatal.
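
A short dispatch sketch covering all four event types, assuming the connected and error event classes are importable from gnani.stt like the others; event and field names follow the table above:
from gnani.stt import (
    StreamConnectedEvent,
    StreamProcessingEvent,
    StreamTranscriptEvent,
    StreamErrorEvent,
)

def handle(event) -> None:
    if isinstance(event, StreamConnectedEvent):
        print(f"Connected: {event.sample_rate} Hz, chunk size {event.chunk_size}")
    elif isinstance(event, StreamProcessingEvent):
        print("Processing speech...")
    elif isinstance(event, StreamTranscriptEvent):
        print(f"[{event.segment_index}] {event.text} ({event.latency} ms)")
    elif isinstance(event, StreamErrorEvent):
        print(f"Error: {event.message}")
    # Every event also carries the full server JSON in its raw field
    print(event.raw)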

Error Handling

import asyncio

from gnani.stt import (
    GnaniSTTStreamClient,
    StreamConnectionError,
    StreamClosedError,
    StreamError,
)

async def main():
    try:
        async with GnaniSTTStreamClient(api_key="your-api-key") as stream:
            with open("audio.pcm", "rb") as f:
                while chunk := f.read(1024):
                    await stream.send_audio(chunk)
    except StreamConnectionError as e:
        print(f"Could not connect: {e}")
    except StreamClosedError as e:
        print(f"Stream was already closed: {e}")
    except StreamError as e:
        print(f"Server error: {e.message} (at {e.timestamp})")

asyncio.run(main())