Overview
Contact centers handling financial services, insurance, or healthcare operate under strict regulatory requirements. Agents must follow scripts, disclose specific information, and avoid prohibited language. Traditional QA reviews 2–5% of calls after the fact. By the time a violation is caught, it has already happened hundreds of times.
This guide shows you how to build a system that monitors every call in real time. Audio streams to the Vachana WebSocket STT API. Transcripts arrive within milliseconds of speech completion. A compliance and quality engine processes each segment, matches it against rule sets, and fires alerts to your backend while the call is still live.
| Capability | Implementation |
|---|---|
| Live transcription | WebSocket stream to wss://api.vachana.ai/stt/v3/stream with per-segment transcript events |
| Compliance detection | Keyword and phrase matching on each transcript event with configurable rule sets |
| Quality monitoring | Silence detection, interruption tracking, escalation phrase matching from segment metadata |
| Real-time alerts | Async alert dispatcher — webhook, queue, or supervisor dashboard |
| Reconnect handling | Exponential backoff with session continuity across drops |
Architecture
The system has three logical layers: audio ingestion, transcription, and monitoring. Each runs concurrently in an async event loop.
Prerequisites
| Requirement | Details |
|---|---|
| Vachana API key | Available from the Vachana dashboard. Used as the x-api-key-id header on the WebSocket connection. |
| Python 3.9+ | Required by the SDK. The full example uses asyncio, dataclasses, and typed event classes. |
| Audio source | PCM 16-bit LE, mono. Either 8kHz (PSTN/legacy VoIP) or 16kHz (wideband VoIP). Defaults to 16kHz. |
| Alert target | An HTTP endpoint, message queue, or Redis channel to receive alerts. |
Authentication
Authentication is performed at connection time via HTTP headers on the WebSocket upgrade request. There is no separate auth step: the connection either opens or returns 401.
| Header | Required | Description |
|---|---|---|
| x-api-key-id | Yes | Your Vachana API key. |
| lang_code | Yes | BCP-47 language code. Defaults to en-IN. Pass comma-separated codes for multilingual auto-detection. |
| x-sample-rate | No | Audio sample rate in Hz. Accepted: 8000, 16000, 44100, 48000. Defaults to 16000. |
| x-format | No | Set to transcribe for ITN (inverse text normalization: numbers, currency, dates in written form). ITN applies to hi-IN and en-IN only. |
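The header table above can be captured in a small helper that validates values before the connection is attempted. This is a minimal sketch using only the standard library; the function name `build_stream_headers` and its parameters are illustrative assumptions, not part of the Vachana SDK.

```python
from typing import Dict, Sequence

# Accepted values copied from the header table above.
ACCEPTED_SAMPLE_RATES = (8000, 16000, 44100, 48000)


def build_stream_headers(
    api_key: str,
    lang_codes: Sequence[str] = ("en-IN",),
    sample_rate: int = 16000,
    itn: bool = False,
) -> Dict[str, str]:
    """Build the WebSocket upgrade headers (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key is required")
    if not lang_codes:
        raise ValueError("at least one language code is required")
    if sample_rate not in ACCEPTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")

    headers = {
        "x-api-key-id": api_key,
        # Comma-separated codes request multilingual auto-detection.
        "lang_code": ",".join(lang_codes),
        "x-sample-rate": str(sample_rate),
    }
    if itn:
        # ITN applies to hi-IN and en-IN only, per the table.
        headers["x-format"] = "transcribe"
    return headers
```

Failing fast on an unsupported sample rate keeps configuration errors out of the retry path, where they would otherwise look like connection failures.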
End-to-End Workflow
1. Call starts: open WebSocket connection
   Connect to wss://api.vachana.ai/stt/v3/stream with auth headers and language config. A session object is created and keyed to the call ID.
2. Receive connected event: confirm config
   The server sends a connected event confirming sample rate and chunk size. Any mismatch (wrong sample rate, unsupported language) surfaces immediately.
3. Stream audio in 1024-byte frames
4. VAD triggers: receive processing event
   When VAD detects end of speech, the server sends a processing event. Use this timestamp to measure speech-to-transcript latency and to start a silence timer in the quality engine.
5. Transcript arrives: run compliance and quality engines
   Each transcript event carries text, segment_index, audio_duration_ms, and latency. Both engines process the text synchronously. Alerts are dispatched async so they never block the next transcript.
6. Alerts fire: supervisor is notified
   A CRITICAL alert hits the supervisor dashboard immediately; a WARNING alert queues for post-call review.
Connecting to the WebSocket API
The SDK’s GnaniSTTStreamClient wraps the WebSocket connection, frame pacing, and event parsing. Use it as an async context manager.
Streaming Audio
Audio format requirements
| Property | 16kHz (wideband VoIP) | 8kHz (PSTN / legacy) |
|---|---|---|
| Encoding | PCM signed 16-bit little-endian | PCM signed 16-bit little-endian |
| Channels | 1 (mono) | 1 (mono) |
| Frame size | 1024 bytes (512 samples = 32ms) | 1024 bytes (512 samples = 64ms) |
| x-sample-rate | 16000 | 8000 |
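The frame sizes in the table imply a pacing interval: 512 samples per 1024-byte frame, so 32 ms at 16 kHz and 64 ms at 8 kHz. The sketch below shows one way to slice raw PCM into frames and emit them at real-time pace. The function names are illustrative, and zero-padding the final short frame with silence is an assumption rather than documented API behavior.

```python
import asyncio
from typing import AsyncIterator, Iterator

FRAME_BYTES = 1024      # frame size from the table above
BYTES_PER_SAMPLE = 2    # PCM signed 16-bit, mono


def frame_interval_seconds(sample_rate: int, frame_bytes: int = FRAME_BYTES) -> float:
    """Duration of audio carried by one frame."""
    return (frame_bytes // BYTES_PER_SAMPLE) / sample_rate


def split_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield fixed-size frames; pad the last one with silence (assumption)."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame


async def paced_frames(pcm: bytes, sample_rate: int) -> AsyncIterator[bytes]:
    """Yield frames no faster than real time."""
    interval = frame_interval_seconds(sample_rate)
    for frame in split_frames(pcm):
        yield frame
        # Sleeping one frame interval avoids burst-sending audio.
        await asyncio.sleep(interval)
```

A caller would iterate with `async for frame in paced_frames(pcm, 16000)` and pass each frame to whatever send method the client exposes.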
WebSocket Event Reference
| Event type | When sent | Key fields |
|---|---|---|
| connected | Once, immediately after handshake. | message, config.sample_rate, config.chunk_size, timestamp |
| processing | Each time VAD detects end-of-speech. | timestamp |
| transcript | After transcription of a VAD segment completes. | text, segment_index, segment_id, audio_duration_ms, latency, timestamp |
| error | Server-side error, recoverable or fatal. | message, timestamp |
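If you handle raw messages rather than the SDK's typed event classes, the event table above maps naturally onto a small parser. This sketch assumes each message is a JSON object whose event name sits under a `type` key; that wire format is an assumption, as are the names `parse_event` and `as_transcript`.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Event names from the table above.
KNOWN_EVENT_TYPES = {"connected", "processing", "transcript", "error"}


@dataclass
class TranscriptEvent:
    text: str
    segment_index: int
    latency_ms: Optional[float]
    audio_duration_ms: Optional[float]


def parse_event(raw: str) -> Dict[str, Any]:
    """Decode one server message and validate its event type.

    Assumes a JSON object with a "type" key (not confirmed by the docs).
    """
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("event payload must be a JSON object")
    if event.get("type") not in KNOWN_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event.get('type')!r}")
    return event


def as_transcript(event: Dict[str, Any]) -> Optional[TranscriptEvent]:
    """Return a typed transcript, or None for other event types."""
    if event["type"] != "transcript":
        return None
    return TranscriptEvent(
        text=event.get("text", ""),
        segment_index=int(event["segment_index"]),
        latency_ms=event.get("latency"),
        audio_duration_ms=event.get("audio_duration_ms"),
    )
```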
The latency field (milliseconds from end of speech to transcript delivery) is your primary observability metric for pipeline health. Track p50, p95, and p99 per call session and alert if p95 consistently exceeds your SLA threshold.
Compliance Detection
The compliance engine runs on each transcript event. It checks segment text against three rule categories: prohibited keywords, risk phrases, and required disclosures. All checks are synchronous string operations that complete in under 1ms per segment.
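The three rule categories can be sketched as a small engine with patterns compiled once at construction. This is an illustrative outline rather than the guide's own implementation: the class names, the severity assigned to each category, and the sample phrases in the usage note are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class ComplianceHit:
    rule: str
    category: str       # "prohibited" or "risk"
    severity: str       # assumed mapping: prohibited=CRITICAL, risk=WARNING
    segment_index: int


def compile_phrase(phrase: str) -> "re.Pattern[str]":
    """Case-insensitive, word-bounded pattern tolerant of extra whitespace."""
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


class ComplianceEngine:
    def __init__(
        self,
        prohibited: Iterable[str],
        risk_phrases: Iterable[str],
        required_disclosures: Iterable[str],
    ) -> None:
        # Compile everything up front, never inside check().
        self._prohibited: Dict[str, "re.Pattern[str]"] = {p: compile_phrase(p) for p in prohibited}
        self._risk: Dict[str, "re.Pattern[str]"] = {p: compile_phrase(p) for p in risk_phrases}
        self._required: Dict[str, "re.Pattern[str]"] = {p: compile_phrase(p) for p in required_disclosures}
        self._seen: Set[str] = set()

    def check(self, text: str, segment_index: int) -> List[ComplianceHit]:
        """Return hits for one segment and record any disclosures heard."""
        hits: List[ComplianceHit] = []
        for rule, pattern in self._prohibited.items():
            if pattern.search(text):
                hits.append(ComplianceHit(rule, "prohibited", "CRITICAL", segment_index))
        for rule, pattern in self._risk.items():
            if pattern.search(text):
                hits.append(ComplianceHit(rule, "risk", "WARNING", segment_index))
        for rule, pattern in self._required.items():
            if pattern.search(text):
                self._seen.add(rule)
        return hits

    def missing_disclosures(self) -> List[str]:
        """Required disclosures not yet heard on this call."""
        return sorted(set(self._required) - self._seen)
```

Required disclosures differ from the other two categories because their absence is the violation, so `missing_disclosures()` is meant to be checked at a deadline or at call end rather than per segment.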
Quality Monitoring
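The capability table describes quality monitoring as silence detection, interruption tracking, and escalation phrase matching. The sketch below covers silence gaps and escalation phrases only, since interruption tracking needs speaker attribution that the transcript event fields do not show. It assumes timestamps are numeric seconds and that processing and transcript events alternate for each segment; the class name, threshold, and flag format are illustrative.

```python
import re
from typing import Iterable, List, Optional


class QualityMonitor:
    def __init__(self, escalation_phrases: Iterable[str], silence_threshold_s: float = 10.0) -> None:
        self._patterns = [
            (p, re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE))
            for p in escalation_phrases
        ]
        self._threshold = silence_threshold_s
        self._prev_end: Optional[float] = None   # end of speech, previous segment
        self._curr_end: Optional[float] = None   # end of speech, current segment
        self.max_silence_gap_s = 0.0

    def on_processing(self, timestamp_s: float) -> None:
        """Record an end-of-speech timestamp (assumed to be in seconds)."""
        self._prev_end, self._curr_end = self._curr_end, timestamp_s

    def on_transcript(self, text: str, audio_duration_ms: float) -> List[str]:
        """Return quality flags for the segment that just completed."""
        flags: List[str] = []
        if self._prev_end is not None and self._curr_end is not None:
            # Speech started one segment-duration before its end-of-speech mark.
            speech_start = self._curr_end - audio_duration_ms / 1000.0
            gap = max(0.0, speech_start - self._prev_end)
            self.max_silence_gap_s = max(self.max_silence_gap_s, gap)
            if gap >= self._threshold:
                flags.append(f"silence:{gap:.1f}s")
        for phrase, pattern in self._patterns:
            if pattern.search(text):
                flags.append(f"escalation:{phrase}")
        return flags
```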
Error Handling & Reconnect Logic
WebSocket connections drop. The reconnect loop below uses exponential backoff with full jitter and caps at a configurable maximum. Session state is preserved across reconnects using processed_indices to deduplicate segments.
| Error | Cause | Strategy |
|---|---|---|
| StreamConnectionError | 401, invalid API key, unsupported language code. | Do not retry. Fix config and redeploy. |
| StreamClosedError | Server closed cleanly (service restart, session timeout). | Retry with backoff. Session state is preserved. |
| ConnectionResetError / OSError | Network drop, TCP reset, intermediary timeout. | Exponential backoff + jitter. Cap at MAX_RECONNECTS. |
| StreamError | STT engine failure reported in an error event. | Log, retry once. Flag the call for manual review on repeat failures. |
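The retry strategy in the table can be outlined as follows. This is a sketch under stated assumptions: the two exception classes are local stand-ins for the SDK names in the table (their real import path is not shown here), `stream_once` is a hypothetical coroutine that runs one connection to completion, and the base delay, cap, and `MAX_RECONNECTS` value are placeholders.

```python
import asyncio
import random
from typing import Awaitable, Callable


# Local stand-ins for the SDK exception names listed in the table.
class StreamConnectionError(Exception):
    pass


class StreamClosedError(Exception):
    pass


MAX_RECONNECTS = 5  # placeholder value


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


async def run_with_reconnect(
    stream_once: Callable[[], Awaitable[None]],
    max_reconnects: int = MAX_RECONNECTS,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Run stream_once until it returns; return how many reconnects it took."""
    reconnects = 0
    while True:
        try:
            await stream_once()
            return reconnects
        except StreamConnectionError:
            # Auth or config failure: retrying cannot help.
            raise
        except (StreamClosedError, ConnectionResetError, OSError):
            if reconnects >= max_reconnects:
                raise
            await sleep(backoff_delay(reconnects))
            reconnects += 1
```

Full jitter spreads reconnect attempts across the whole backoff window, which helps when many calls drop at once during a service restart.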
Production Best Practices
Concurrency model
Each call runs as its own asyncio.Task. The audio producer and event consumer run concurrently within that task. Do not use threads: the WebSocket library is async-native. A single well-tuned Python process handles 100+ concurrent calls comfortably; the bottleneck is network I/O, not CPU.
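The concurrency model described here, one task per call with a producer and a consumer running side by side, can be sketched with standard asyncio primitives. In this illustration an `asyncio.Queue` stands in for the WebSocket round trip, and the consumer records frame lengths instead of real transcripts; all function names are hypothetical.

```python
import asyncio
from typing import Dict, List


async def audio_producer(frames: List[bytes], link: asyncio.Queue) -> None:
    """Push frames into the link, then signal end of stream."""
    for frame in frames:
        await link.put(frame)
        await asyncio.sleep(0)  # real code would sleep one frame interval
    await link.put(None)


async def event_consumer(link: asyncio.Queue, results: List[int]) -> None:
    """Stand-in for reading server events: records each frame's length."""
    while (frame := await link.get()) is not None:
        results.append(len(frame))


async def monitor_call(call_id: str, frames: List[bytes]) -> List[int]:
    """One call: producer and consumer run concurrently inside this coroutine."""
    link: asyncio.Queue = asyncio.Queue(maxsize=8)
    results: List[int] = []
    await asyncio.gather(audio_producer(frames, link), event_consumer(link, results))
    return results


async def monitor_many(calls: Dict[str, List[bytes]]) -> Dict[str, List[int]]:
    """One asyncio.Task per call, all sharing a single event loop."""
    tasks = {
        call_id: asyncio.create_task(monitor_call(call_id, frames))
        for call_id, frames in calls.items()
    }
    return {call_id: await task for call_id, task in tasks.items()}
```

The bounded queue applies backpressure: if the consumer stalls, the producer blocks on `put` rather than buffering audio without limit.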
Alert dispatch: never block the transcript consumer
Dispatch alerts with asyncio.create_task(). A slow downstream system under load must never delay the next transcript event.
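A fire-and-forget dispatcher built on `asyncio.create_task()` might look like the sketch below. The `AlertDispatcher` class and the `send` callable are illustrative assumptions; `send` would wrap whichever webhook, queue, or dashboard client you use.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Set


class AlertDispatcher:
    def __init__(self, send: Callable[[Dict[str, Any]], Awaitable[None]]) -> None:
        self._send = send
        self._pending: Set["asyncio.Task[None]"] = set()
        self.failures = 0

    def dispatch(self, alert: Dict[str, Any]) -> None:
        """Schedule delivery and return immediately (must run inside the loop)."""
        task = asyncio.create_task(self._deliver(alert))
        # Hold a reference so the task is not garbage-collected mid-flight.
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)

    async def _deliver(self, alert: Dict[str, Any]) -> None:
        try:
            await self._send(alert)
        except Exception:
            # A failing alert target must not crash the transcript consumer.
            self.failures += 1

    async def drain(self) -> None:
        """Wait for in-flight alerts, for example at call end."""
        if self._pending:
            await asyncio.gather(*self._pending)
```

Because `dispatch()` is synchronous, the transcript handler can call it inline and move straight on to the next event.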
Latency optimization
| Optimization | Impact |
|---|---|
| Co-locate with telephony bridge | Run the monitor in the same region as the Vachana API. Cross-region adds 50–150ms RTT per frame delivery. |
| 16kHz over 8kHz when possible | Higher accuracy transcripts mean fewer false positives in compliance matching. |
| Pre-compile compliance patterns | Compile all regex at engine __init__. Never compile inside the hot path. |
| Buffer writes, not reads | Write to an in-memory session buffer. Flush to the database at call end or on CRITICAL alerts only. |
Key metrics to track per session
| Metric | Source |
|---|---|
| transcript_latency_ms | latency field on each transcript event. Track p50/p95/p99. |
| segment_count | Increment on each transcript event. |
| compliance_hit_rate | Compliance hits / total segments per call. |
| silence_gap_seconds | Max silence gap derived from processing event timestamps. |
| reconnect_count | session.reconnect_count, incremented on each reconnect. |
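The metrics in the table above can be accumulated in a per-session object and summarized at call end. This is a minimal sketch: the `SessionMetrics` class, the summary key names, and the choice of the nearest-rank percentile method are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class SessionMetrics:
    latencies_ms: List[float] = field(default_factory=list)
    segment_count: int = 0
    compliance_hits: int = 0
    max_silence_gap_s: float = 0.0
    reconnect_count: int = 0

    def record_transcript(self, latency_ms: float, hits: int = 0) -> None:
        """Call once per transcript event."""
        self.latencies_ms.append(latency_ms)
        self.segment_count += 1
        self.compliance_hits += hits

    def summary(self) -> Dict[str, float]:
        rate = self.compliance_hits / self.segment_count if self.segment_count else 0.0
        out: Dict[str, float] = {
            "segment_count": self.segment_count,
            "compliance_hit_rate": rate,
            "silence_gap_seconds": self.max_silence_gap_s,
            "reconnect_count": self.reconnect_count,
        }
        if self.latencies_ms:
            for pct in (50, 95, 99):
                out[f"transcript_latency_ms_p{pct}"] = percentile(self.latencies_ms, pct)
        return out
```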
Debugging
| Symptom | Cause | Fix |
|---|---|---|
| Connection closes immediately with no connected event | Invalid API key, wrong lang_code, missing required headers. | Log the WebSocket close code; 4001 indicates auth failure. |
| Transcripts arrive but text is empty or garbled | x-sample-rate does not match the actual audio sample rate. Audio is not mono PCM. | Run ffprobe on the source. Convert stereo to mono before streaming. |
| VAD fires too often — sentences cut mid-utterance | Frames being burst-sent faster than real time. | Enforce asyncio.sleep(frame_interval) after every send. |
| VAD never fires — no processing or transcript events | Audio buffer is all zeros. Audio source is not connected. | Print frame[:32].hex(). All zeros = silent source. |
| Compliance rules fire on unrelated text | Substring match without word boundaries. | Switch to word-boundary regex. Lowercase and strip punctuation before matching. |
| Duplicate alerts on reconnect | Session buffer re-processed after reconnect. | Check segment_index in session.processed_indices before dispatching any alert. |
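The last row's fix, checking segment_index against session.processed_indices, can be sketched as a guard on the session object. The `CallSession` class and `should_process` method are illustrative, and the sketch assumes segment_index keeps counting across reconnects rather than resetting to zero.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CallSession:
    call_id: str
    processed_indices: Set[int] = field(default_factory=set)
    reconnect_count: int = 0

    def should_process(self, segment_index: int) -> bool:
        """True the first time a segment is seen, False on any replay."""
        if segment_index in self.processed_indices:
            return False
        self.processed_indices.add(segment_index)
        return True
```

Calling `should_process()` before running either engine keeps both alerts and metrics free of duplicates, since a replayed segment never reaches them.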
Full Runnable Example
What to Build Next
- Speaker Diarization: Separate agent and customer voices. Attribute compliance hits to the correct speaker.
- Sentiment Analysis: Feed each segment’s text to a sentiment model. Track the sentiment arc across the call.
- Agent Assist: On each transcript event, call an LLM with the running conversation context to surface next-best-action suggestions in real time.
- LLM Summarisation: At call end, send the full session transcript to an LLM for structured output: issue, resolution, action items, disposition.
- Compliance Scoring: Build a per-call compliance score (0–100) based on rule severity, frequency, and placement in the call.
pip install gnani-vachana