Overview
Transcribe two-speaker audio at scale using the Vachana Batch STT API. This guide walks through every step, from submitting an audio file to receiving a clean, speaker-separated transcript, using podcast transcription as the working example.
Audio-first content — podcasts, interview recordings, panel discussions — carries information that stays locked unless it is transcribed. Speaker-level transcription is what separates a readable document from a wall of undifferentiated text. You know who said what, when they said it, and for how long.
| Capability | What it enables downstream |
|---|---|
| Speaker-separated output | Per-speaker text blocks mean editors can review one voice at a time, and content teams can attribute quotes accurately. |
| Time-aligned segments | Every segment carries a start_time and end_time, enabling subtitle generation, chapter markers, and clip extraction at a specific timestamp. |
| Segment-level confidence | Flag low-confidence segments for human review rather than reviewing the entire transcript. |
| 10 Indian languages | Transcribe Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, and English without switching providers or pipelines. |
| Batch processing | Submit up to 10 files in a single API call. Run overnight jobs, backfill archives, or process weekly episode batches without managing queues yourself. |
Other Use Cases
The same submit-poll-parse pipeline applies to any long-form recording with two speakers where you need speaker-separated text.
| Use Case | Description |
|---|---|
| Journalist Interviews | Transcribe field recordings with interviewer and subject separated. Feed directly into editorial workflows without manual formatting. |
| Parliamentary & Panel Debates | Attribute statements to the correct speaker for political reporting, fact-checking, or archival, within the two-speaker limit noted below. Supports Devanagari and regional scripts natively. |
| EdTech Lecture Recordings | Transcribe faculty and student exchanges. Generate accessible transcripts for students, search indexes for course platforms, and study material exports. |
| Legal Depositions & Hearings | Produce verbatim speaker-attributed records of proceedings. Use confidence scores to flag segments requiring court reporter verification. |
| Radio Archive Digitisation | Backfill years of archived broadcasts into searchable, attributed text. Batch processing handles large volumes without manual queuing. |
| Corporate Town Halls & Earnings Calls | Generate attributed transcripts of leadership Q&A sessions. Surface speaker-specific statements for internal comms or investor relations. |
| Documentary & Film Production | Auto-generate interview transcripts for rough-cut editing. Export time-coded speaker lines directly to editing software. |
| Doctor-Patient Consultations | Transcribe recorded consultations with doctor and patient separated. Enable structured documentation workflows for EMR systems. |
Two-speaker limit: The Vachana Batch STT API supports a maximum of two distinct speakers per file. It is optimised for two-party audio — interviews, conversations, and one-on-one recordings. Panel discussions with three or more speakers are outside the current scope.
Prerequisites
| Requirement | Details |
|---|---|
| Vachana API key | Available from the Vachana dashboard. You will use this as the X-API-Key-ID header on every request. |
| Python 3.9+ | The examples target Python 3.9 or later. They use f-strings, pathlib, and typing annotations. |
| Audio files | Supported formats: AAC, WAV, FLAC, ALAC, OGG (Vorbis), Opus. Each file must be under 1 hour and the total payload under 80 MB. |
| ffmpeg | Required only if you plan to split files longer than 1 hour using pydub. Install with brew install ffmpeg (macOS) or apt install ffmpeg (Linux). |
# HTTP client (used for submit and poll calls)
pip install requests
# Audio chunking — only needed for files over 1 hour
pip install pydub
No SDK is required for this pipeline. All calls use the standard HTTP REST endpoints.
Authentication
| Header | Required | Description |
|---|---|---|
| X-API-Key-ID | Yes | Your Vachana API key. Required on both the submit and status calls. |
| X-API-Request-ID | No | A UUID you assign for tracing. Useful for correlating your application logs with platform support. |
Store your API key as an environment variable. Never hardcode it in source files or commit it to version control.
GNANI_API_KEY=your-api-key-here
import os
API_KEY = os.getenv("GNANI_API_KEY")
HEADERS = {"X-API-Key-ID": API_KEY}
Never hardcode API keys. Do not commit credentials to version control. Use environment variables, a secrets manager, or a vault. Rotate your key immediately if it is exposed.
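`os.getenv` returns `None` when the variable is unset, so a missing key only surfaces later as a rejected request. The sketch below shows one way to fail early and to attach the optional tracing header. The helper name `build_headers` is hypothetical; only the two header names come from the table above.

```python
import os
import uuid


def build_headers(env_var: str = "GNANI_API_KEY") -> dict:
    """Return request headers, raising early if the API key is not configured."""
    api_key = os.getenv(env_var)
    if not api_key:
        # Fail here rather than sending a request without credentials.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "X-API-Key-ID": api_key,
        # Optional tracing header: a fresh UUID for each request.
        "X-API-Request-ID": str(uuid.uuid4()),
    }
```

Logging the generated request ID next to your own job records makes it easier to quote when contacting platform support.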
| Item | Limit |
|---|---|
| Max audio duration | Less than 1 hour per file |
| Max files per request | 10 files per API call |
| Max total payload size | 80 MB across all files and form fields combined |
| Minimum poll interval | 60 seconds between status calls for the same job_id |
| Speaker diarization | Maximum 2 speakers per file |
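For archive backfills, the file-count and payload limits above can be checked before any upload. The following is a minimal sketch of a greedy batch planner. The name `plan_batches`, the reading of 80 MB as 80 × 1024 × 1024 bytes, and the 1 MB headroom reserved for form fields are all assumptions, not values documented by the API.

```python
from typing import List, Tuple

MAX_FILES_PER_REQUEST = 10
MAX_PAYLOAD_BYTES = 80 * 1024 * 1024   # assumption: 1 MB = 1024 * 1024 bytes
FORM_OVERHEAD_BYTES = 1 * 1024 * 1024  # assumption: headroom for form fields


def plan_batches(files: List[Tuple[str, int]]) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs into request-sized batches."""
    budget = MAX_PAYLOAD_BYTES - FORM_OVERHEAD_BYTES
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        if size > budget:
            raise ValueError(f"{path} alone exceeds the payload limit")
        # Start a new batch when either limit would be exceeded.
        if len(current) == MAX_FILES_PER_REQUEST or current_size + size > budget:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

File sizes can be read with `Path(p).stat().st_size`. Each returned batch then maps to one submit call.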
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Uncompressed. Highest quality but largest file size. |
| FLAC | .flac | Lossless compression. Good balance of quality and size for archival audio. |
| AAC | .m4a | Common podcast export format. Well-supported across all recording tools. |
| ALAC | .m4a | Lossless Apple format. Use when source is from Apple recording tools. |
| OGG (Vorbis) | .ogg | Open format. Common in Linux recording pipelines. |
| Opus | .opus | Efficient lossy compression. Smallest file sizes — recommended for high-volume batch jobs. |
Files over 1 hour: Split into chunks before submitting, for example with pydub. The full script below does not include a split_audio() helper, so chunking has to happen before submit_job() is called. Stitch the resulting transcripts in order after parsing.
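A possible shape for such a splitting helper is sketched below. It is an illustration under several assumptions: the 55-minute chunk length, the FLAC export format, and the function names are all hypothetical choices. Each chunk must still fit the payload limit, so shorter chunks or a lossy format may be needed. Speaker labels are assigned per file, so they may not line up across chunks; verify that before merging.

```python
from pathlib import Path
from typing import List, Tuple

MAX_CHUNK_MS = 55 * 60 * 1000  # assumption: 55 minutes, leaving headroom under 1 hour


def chunk_boundaries(total_ms: int, max_chunk_ms: int = MAX_CHUNK_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover the audio without overlap."""
    return [(start, min(start + max_chunk_ms, total_ms))
            for start in range(0, total_ms, max_chunk_ms)]


def split_audio(path: str, out_dir: str, fmt: str = "flac") -> List[Tuple[str, float]]:
    """Write chunk files and return (chunk_path, offset_seconds) pairs."""
    from pydub import AudioSegment  # imported lazily; pydub relies on ffmpeg

    audio = AudioSegment.from_file(path)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    chunks = []
    for i, (start, end) in enumerate(chunk_boundaries(len(audio))):
        chunk_path = Path(out_dir) / f"{Path(path).stem}_part{i:02d}.{fmt}"
        audio[start:end].export(str(chunk_path), format=fmt)
        chunks.append((str(chunk_path), start / 1000.0))
    return chunks


def shift_segments(segments: List[dict], offset_s: float) -> List[dict]:
    """Move chunk-relative timestamps back onto the original recording's timeline."""
    return [{**seg,
             "start_time": seg["start_time"] + offset_s,
             "end_time": seg["end_time"] + offset_s}
            for seg in segments]
```

After parsing each chunk, apply `shift_segments` with that chunk's offset and concatenate the lists in chunk order to rebuild one timeline.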
Supported Languages
Pass the BCP-47 code in the language_code field of your submit request.
| Language | Code | Native Script | ITN |
|---|---|---|---|
| Bengali | bn-IN | বাংলা | — |
| English | en-IN | Latin | Yes |
| Gujarati | gu-IN | ગુજરાતી | — |
| Hindi | hi-IN | हिन्दी | Yes |
| Kannada | kn-IN | ಕನ್ನಡ | — |
| Malayalam | ml-IN | മലയാളം | — |
| Marathi | mr-IN | मराठी | — |
| Punjabi | pa-IN | ਪੰਜਾਬੀ | — |
| Tamil | ta-IN | தமிழ் | — |
| Telugu | te-IN | తెలుగు | — |
ITN (Inverse Text Normalization) converts spoken numbers, currency, dates, times, and phone numbers into compact written form. Currently available for hi-IN and en-IN only. Set format=transcribe in the request to enable it.
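Because ITN covers only two of the ten languages, a pipeline that handles mixed-language batches may want to pick the `format` value per language. This is a small sketch; `choose_format` is a hypothetical helper, and how the API treats format=transcribe for the other eight languages is not stated here, so falling back to `verbatim` is a defensive assumption.

```python
ITN_LANGUAGES = {"hi-IN", "en-IN"}  # languages marked "Yes" in the table above


def choose_format(language_code: str, want_itn: bool = True) -> str:
    """Return the value to send in the `format` form field."""
    if want_itn and language_code in ITN_LANGUAGES:
        return "transcribe"
    # Assumption: request plain output where ITN is not available.
    return "verbatim"
```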
Pipeline
Submit the audio file
POST to /stt/v3/batch/submit with your audio file, language code, and format preference. Receive a job_id immediately. Save it — you will need it in every poll call.
Poll for completion
GET /stt/v3/batch/status/{job_id} every 60 seconds. Loop until status reaches completed or failed. The results field is null until the job is complete.
Parse the segments
Iterate over results[].segments. Group by speaker_id. Build per-speaker text blocks with timestamps and confidence scores.
Save the transcript
Write the speaker-labelled transcript to a text file, along with a JSON file containing per-speaker talk time and segment metadata.
Step 1 — Submit
import os
from pathlib import Path

import requests

BATCH_SUBMIT = "https://api.vachana.ai/stt/v3/batch/submit"


def submit_job(
    audio_path: str,
    language_code: str = "hi-IN",
    itn: bool = True,
) -> str:
    """Submit one audio file and return the job_id assigned by the API."""
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    data = {
        "language_code": language_code,
        "is_multi_channel": "false",
        # format=transcribe enables ITN (hi-IN and en-IN only).
        "format": "transcribe" if itn else "verbatim",
    }
    # The file must stay open until the upload has finished.
    with open(audio_path, "rb") as f:
        # The content type is hard-coded for WAV input; adjust it for other formats.
        files = [("audio_files", (Path(audio_path).name, f, "audio/wav"))]
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)
    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    print(f"Submitted. job_id: {job_id}")
    return job_id
Submitting multiple files: Pass additional ("audio_files", ...) tuples to the files list. Up to 10 files are accepted per request, as long as the total payload stays under 80 MB.
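One way to build that list while keeping every file handle under control is sketched below. `open_audio_parts` is a hypothetical helper. It guesses each part's content type from the file extension, and whether the API inspects that value is not documented here, so treat the guess as a convenience.

```python
import mimetypes
from contextlib import ExitStack
from pathlib import Path
from typing import List


def open_audio_parts(stack: ExitStack, audio_paths: List[str]) -> list:
    """Open each file under `stack` and return tuples for the `files=` argument."""
    if len(audio_paths) > 10:
        raise ValueError("At most 10 files are accepted per request")
    parts = []
    for p in audio_paths:
        # Guess the content type from the extension; fall back to a generic type.
        content_type = mimetypes.guess_type(p)[0] or "application/octet-stream"
        fh = stack.enter_context(open(p, "rb"))
        parts.append(("audio_files", (Path(p).name, fh, content_type)))
    return parts
```

Call it inside `with ExitStack() as stack:` and make the `requests.post(...)` call within the same block. All handles are closed when the block exits, including when the upload raises.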
Step 2 — Poll
import os
import time
from typing import Optional

import requests

BATCH_STATUS = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60  # seconds; enforced minimum, do not reduce


def poll_until_complete(job_id: str) -> Optional[list]:
    """Poll the status endpoint until the job completes or fails."""
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url = BATCH_STATUS.format(job_id=job_id)
    print(f"Polling job {job_id} every {POLL_INTERVAL}s...")
    while True:
        # Sleep before each call so status requests stay at least POLL_INTERVAL apart.
        time.sleep(POLL_INTERVAL)
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        status = payload["status"]
        progress = payload.get("overall_progress", "–")
        print(f" [{status}] progress: {progress}%")
        if status == "completed":
            print(f"Job complete. {payload['completed_files']} file(s) transcribed.")
            return payload.get("results", [])
        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None
Minimum poll interval: 60 seconds. The API enforces a rate limit of one status call per 60 seconds per job_id. Do not reduce it.
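The loop above runs until the job reaches a terminal status. For unattended jobs, a cap on the total wait can be useful. The sketch below shows one way to add that; `poll_with_deadline`, the injected `fetch_status` callable, and the 4-hour default are hypothetical choices. Any status other than completed or failed is treated as still running.

```python
import time
from typing import Callable, Optional


def poll_with_deadline(
    fetch_status: Callable[[], dict],
    interval_s: float = 60,
    max_wait_s: float = 4 * 60 * 60,  # assumption: give up after 4 hours
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list]:
    """Poll until the job completes, fails, or the deadline passes."""
    waited = 0.0
    while waited < max_wait_s:
        # Sleep first so calls stay at least interval_s apart.
        sleep(interval_s)
        waited += interval_s
        payload = fetch_status()
        status = payload.get("status")
        if status == "completed":
            return payload.get("results") or []
        if status == "failed":
            return None
    raise TimeoutError(f"Job still running after {max_wait_s} seconds")
```

Here `fetch_status` would wrap the GET call on the status URL and return the decoded JSON. Passing the sleep function as a parameter keeps the loop testable without real delays.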
Step 3 — Parse
import json
from pathlib import Path
from typing import Dict


def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    """Write one transcript and one metadata file per successfully transcribed file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for file_result in results:
        fname = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])
        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []
        for seg in segments:
            spk = seg.get("speaker_id", "UNKNOWN")
            text = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end = seg.get("end_time", 0.0)
            # MM:SS is enough because each file is under 1 hour.
            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            # Accumulate talk time per speaker, in seconds.
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id": seg.get("segment_id"),
                "speaker_id": spk,
                "start_time": start,
                "end_time": end,
                "text": text,
                "confidence": seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename": file_result["filename"],
            "total_duration_s": file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments": segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {
            "transcript_path": str(transcript_path),
            "metadata_path": str(metadata_path),
        }
        print(f"Parsed: {fname} → {len(lines)} segments, {len(speaker_times)} speaker(s)")
    return outputs
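The parser above writes one line per segment. Where a reader-friendly transcript is preferred, consecutive segments from the same speaker can be merged into a single block. The following is a sketch of that grouping; `merge_speaker_turns` is a hypothetical helper that expects the same segment keys used above.

```python
from typing import List


def merge_speaker_turns(segments: List[dict]) -> List[dict]:
    """Merge consecutive segments from the same speaker into one block."""
    turns: List[dict] = []
    for seg in segments:
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        spk = seg.get("speaker_id", "UNKNOWN")
        if turns and turns[-1]["speaker_id"] == spk:
            # Same speaker is still talking: extend the current block.
            turns[-1]["text"] += " " + text
            turns[-1]["end_time"] = seg.get("end_time", turns[-1]["end_time"])
        else:
            turns.append({
                "speaker_id": spk,
                "start_time": seg.get("start_time", 0.0),
                "end_time": seg.get("end_time", 0.0),
                "text": text,
            })
    return turns
```

Each returned block keeps the start of its first segment and the end of its last, so timestamps in the merged transcript still point to the right place in the audio.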
Full Script
import json
import os
import time
from pathlib import Path
from typing import Dict, List, Optional

import requests

BATCH_SUBMIT = "https://api.vachana.ai/stt/v3/batch/submit"
BATCH_STATUS = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60  # seconds; enforced minimum, do not reduce
OUTPUT_DIR = "outputs"
Path(OUTPUT_DIR).mkdir(exist_ok=True)


def submit_job(audio_paths: List[str], language_code: str = "hi-IN", itn: bool = True) -> str:
    """Submit up to 10 audio files in one request and return the job_id."""
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    # The content type is hard-coded for WAV input; adjust it for other formats.
    files = [("audio_files", (Path(p).name, open(p, "rb"), "audio/wav")) for p in audio_paths]
    data = {"language_code": language_code, "is_multi_channel": "false",
            "format": "transcribe" if itn else "verbatim"}
    try:
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)
        resp.raise_for_status()
    finally:
        # Close every file handle, even if the upload fails.
        for _, (_, fh, _) in files:
            fh.close()
    job_id = resp.json()["job_id"]
    print(f"Submitted {len(audio_paths)} file(s). job_id: {job_id}")
    return job_id


def poll_until_complete(job_id: str) -> Optional[list]:
    """Poll the status endpoint until the job completes or fails."""
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url = BATCH_STATUS.format(job_id=job_id)
    print(f"Polling every {POLL_INTERVAL}s...")
    while True:
        time.sleep(POLL_INTERVAL)
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        status = payload["status"]
        print(f" [{status}] {payload.get('overall_progress', '–')}%")
        if status == "completed":
            return payload.get("results", [])
        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None


def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    """Write one transcript and one metadata file per successfully transcribed file."""
    outputs = {}
    for file_result in results:
        fname = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])
        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []
        for seg in segments:
            spk = seg.get("speaker_id", "UNKNOWN")
            text = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end = seg.get("end_time", 0.0)
            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id": seg.get("segment_id"),
                "speaker_id": spk,
                "start_time": start,
                "end_time": end,
                "text": text,
                "confidence": seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename": file_result["filename"],
            "total_duration_s": file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments": segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {"transcript_path": str(transcript_path), "metadata_path": str(metadata_path)}
        print(f"Saved: {transcript_path.name}")
    return outputs


if __name__ == "__main__":
    job_id = submit_job(audio_paths=["/path/to/episode_01.wav"], language_code="hi-IN", itn=True)
    results = poll_until_complete(job_id)
    if results:
        output_dir = Path(OUTPUT_DIR) / f"job_{job_id}"
        output_dir.mkdir(parents=True, exist_ok=True)
        outputs = parse_results(results, output_dir)
        print(f"\nDone. {len(outputs)} transcript(s) saved to {output_dir}/")
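Since every segment carries a start_time and end_time, the segment list written to the metadata file can also be turned into subtitles. The sketch below converts segments to SubRip (SRT) text. The function names are hypothetical, and the cue layout with a `SPEAKER_n:` prefix is one possible choice.

```python
from typing import List


def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Build SRT text from segments with start_time, end_time, speaker_id, and text."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start_time"])
        end = to_srt_timestamp(seg["end_time"])
        cues.append(f"{index}\n{start} --> {end}\nSPEAKER_{seg['speaker_id']}: {seg['text']}\n")
    # Each cue ends with a newline, so joining on "\n" leaves a blank line between cues.
    return "\n".join(cues)
```

Write the returned string to a `.srt` file with `encoding="utf-8"` so that native-script text is preserved.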
Sample Output
outputs/
└── job_batch_7f3a92c1d4e8/
    ├── episode_01_transcript.txt   ← speaker-labelled, time-stamped transcript
    └── episode_01_metadata.json    ← duration, talk time, segment detail
episode_01_transcript.txt
[00:00] SPEAKER_1: नमस्ते, मैं हूँ रवि शर्मा और आज हम बात करेंगे भारत के स्टार्टअप इकोसिस्टम के बारे में।
[00:07] SPEAKER_2: हाँ रवि जी, बहुत अच्छा विषय है। पिछले पाँच साल में बहुत कुछ बदला है।
[00:14] SPEAKER_1: बिल्कुल। ₹2,00,000 करोड़ से ज़्यादा की फंडिंग आई है 2024 में।
[00:22] SPEAKER_2: और यूनिकॉर्न्स की संख्या भी 100 के पार पहुँच गई है।
episode_01_metadata.json
{
  "filename": "episode_01.wav",
  "total_duration_s": 2847.5,
  "speaker_talk_time": {
    "SPEAKER_1": 1423.8,
    "SPEAKER_2": 1389.2
  },
  "segments": [
    {
      "segment_id": 0,
      "speaker_id": 1,
      "start_time": 0.0,
      "end_time": 6.8,
      "text": "नमस्ते, मैं हूँ रवि शर्मा...",
      "confidence": 0.96,
      "language_detected": "hi-IN"
    }
  ]
}
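The `confidence` field in each segment makes targeted review possible. Below is a minimal sketch that picks out segments for human review. The helper name and the 0.80 threshold are assumptions to tune against your own audio; the API does not prescribe a cut-off here.

```python
from typing import List


def flag_low_confidence(segments: List[dict], threshold: float = 0.80) -> List[dict]:
    """Return segments whose confidence is missing or below `threshold`."""
    flagged = []
    for seg in segments:
        confidence = seg.get("confidence")
        # Treat a missing score as needing review rather than assuming it is fine.
        if confidence is None or confidence < threshold:
            flagged.append(seg)
    return flagged
```

The input can be loaded from a metadata file with `json.loads(Path(metadata_path).read_text(encoding="utf-8"))["segments"]`.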
ITN Reference
When format=transcribe is set on the submit request, ITN post-processes every transcript — converting spoken-form numbers, currency, dates, times, and phone numbers into compact written form. Available for hi-IN and en-IN only.
| Category | Spoken input (ASR) | Written output (ITN) |
|---|---|---|
| Numbers | पाँच लाख बीस हज़ार | 5,20,000 |
| Currency | तीन रुपये पचास पैसे | ₹3.50 |
| Currency (en) | five thousand rupees | ₹5,000 |
| Dates | बीस जनवरी दो हज़ार पच्चीस | 20 जनवरी 2025 |
| Times | शाम पाँच बजे | शाम 17:00 |
| Phone numbers | नौ आठ सात छह पाँच चार तीन दो एक शून्य | 9876543210 |
| Code-mixed | pay do lakh rupees by fifteenth march | pay ₹2,00,000 by 15th March |
Native script digits: Pass itn_native_numerals=true alongside format=transcribe to render digits in the native script of the target language — for example, Devanagari numerals (₹५,०००) for Hindi. English always outputs Western Arabic digits regardless of this setting.
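As an illustration of the native-numerals option, the form fields for a submit request could be assembled as follows. `build_form_data` is a hypothetical helper. Sending the flag as the lowercase string "true" mirrors how `is_multi_channel` is sent in the submit examples and is an assumption about the accepted encoding.

```python
def build_form_data(language_code: str, itn: bool = True, native_numerals: bool = False) -> dict:
    """Assemble the form fields for a submit request."""
    data = {
        "language_code": language_code,
        "is_multi_channel": "false",
        "format": "transcribe" if itn else "verbatim",
    }
    if itn and native_numerals:
        # Only meaningful alongside format=transcribe.
        data["itn_native_numerals"] = "true"
    return data
```

The returned dictionary can be passed as the `data=` argument of the submit call.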
What ITN does not change: Idiomatic and ambiguous phrases are preserved intentionally. दो तीन meaning “a few” stays as text, not 2 or 3. Imperative verbs like कर दो or ले दो are kept as words.