Overview

Transcribe multi-speaker audio at scale using the Vachana Batch STT API. This guide walks through every step, from submitting an audio file to receiving a clean, speaker-separated transcript, using podcast transcription as the working example.

Audio-first content such as podcasts, interview recordings, and two-person conversations carries information that stays locked until it is transcribed. Speaker-level transcription is what separates a readable document from a wall of undifferentiated text: you know who said what, when they said it, and for how long.
| Capability | What it enables downstream |
| --- | --- |
| Speaker-separated output | Per-speaker text blocks mean editors can review one voice at a time, and content teams can attribute quotes accurately. |
| Time-aligned segments | Every segment carries a start_time and end_time, enabling subtitle generation, chapter markers, and clip extraction at a specific timestamp. |
| Segment-level confidence | Flag low-confidence segments for human review rather than reviewing the entire transcript. |
| 10 Indian languages | Transcribe Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, and English without switching providers or pipelines. |
| Batch processing | Submit up to 10 files in a single API call. Run overnight jobs, backfill archives, or process weekly episode batches without managing queues yourself. |

Other Use Cases

The same submit-poll-parse pipeline works for any long-form audio with two speakers where you need speaker-separated text.
| Use Case | Description |
| --- | --- |
| Journalist Interviews | Transcribe field recordings with interviewer and subject separated. Feed directly into editorial workflows without manual formatting. |
| Parliamentary & Panel Debates | Attribute statements to the correct speaker for political reporting, fact-checking, or archival. Supports Devanagari and regional scripts natively. |
| EdTech Lecture Recordings | Transcribe faculty and student exchanges. Generate accessible transcripts for students, search indexes for course platforms, and study material exports. |
| Legal Depositions & Hearings | Produce verbatim speaker-attributed records of proceedings. Use confidence scores to flag segments requiring court reporter verification. |
| Radio Archive Digitisation | Backfill years of archived broadcasts into searchable, attributed text. Batch processing handles large volumes without manual queuing. |
| Corporate Town Halls & Earnings Calls | Generate attributed transcripts of leadership Q&A sessions. Surface speaker-specific statements for internal comms or investor relations. |
| Documentary & Film Production | Auto-generate interview transcripts for rough-cut editing. Export time-coded speaker lines directly to editing software. |
| Doctor-Patient Consultations | Transcribe recorded consultations with doctor and patient separated. Enable structured documentation workflows for EMR systems. |
Two-speaker limit: The Vachana Batch STT API supports a maximum of two distinct speakers per file. It is optimised for two-party audio — interviews, conversations, and one-on-one recordings. Panel discussions with three or more speakers are outside the current scope.

Prerequisites

| Requirement | Details |
| --- | --- |
| Vachana API key | Available from the Vachana dashboard. You will send it as the X-API-Key-ID header on every request. |
| Python 3.9+ | The pipeline targets Python 3.9 or later and relies on f-strings, pathlib, and typing annotations. |
| Audio files | Supported formats: AAC, WAV, FLAC, ALAC, OGG (Vorbis), Opus. Each file must be under 1 hour and the total payload under 80 MB. |
| ffmpeg | Required only if you plan to split files longer than 1 hour using pydub. Install with brew install ffmpeg (macOS) or apt install ffmpeg (Linux). |
# HTTP client (used for submit and poll calls)
pip install requests

# Audio chunking — only needed for files over 1 hour
pip install pydub
No SDK is required for this pipeline. All calls use the standard HTTP REST endpoints.

Authentication

| Header | Required | Description |
| --- | --- | --- |
| X-API-Key-ID | Yes | Your Vachana API key. Required on both the submit and status calls. |
| X-API-Request-ID | No | A UUID you assign for tracing. Useful for correlating your application logs with platform support. |
Store your API key in an environment variable and read it at runtime.
.env
GNANI_API_KEY=your-api-key-here
loading credentials
import os

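# os.getenv reads the process environment only. Export GNANI_API_KEY in your
# shell, or load the .env file yourself, before running this code.
API_KEY = os.getenv("GNANI_API_KEY")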
HEADERS = {"X-API-Key-ID": API_KEY}
Never hardcode API keys. Do not commit credentials to version control. Use environment variables, a secrets manager, or a vault. Rotate your key immediately if it is exposed.
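
The credential handling above can be made to fail early when the key is missing. The following is a minimal sketch, assuming the key lives in GNANI_API_KEY as shown; the helper name build_headers is illustrative and not part of any Vachana SDK. It also attaches the optional X-API-Request-ID tracing header.

```python
import os
import uuid
from typing import Dict, Optional


def build_headers(request_id: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, raising early if the API key is not set."""
    api_key = os.getenv("GNANI_API_KEY")
    if not api_key:
        raise RuntimeError("GNANI_API_KEY is not set in the environment")
    return {
        "X-API-Key-ID": api_key,
        # Optional tracing header: use the caller's ID or generate a fresh UUID.
        "X-API-Request-ID": request_id or str(uuid.uuid4()),
    }
```

Logging the generated request ID next to each call makes it easier to quote the right request when contacting support.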

Limits & Supported Formats

| Item | Limit |
| --- | --- |
| Max audio duration | Less than 1 hour per file |
| Max files per request | 10 files per API call |
| Max total payload size | 80 MB across all files and form fields combined |
| Minimum poll interval | 60 seconds between status calls for the same job_id |
| Speaker diarization | Maximum 2 speakers per file |
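
Two of these limits, file count and total payload size, can be checked locally before a request is sent. The sketch below is one way to do that; validate_batch is a hypothetical helper, and it assumes "80 MB" means 80 × 1024 × 1024 bytes, which you should confirm against the API. It does not check audio duration, since that requires decoding the file.

```python
from pathlib import Path
from typing import List

MAX_FILES = 10
MAX_PAYLOAD_BYTES = 80 * 1024 * 1024  # assumption: "80 MB" is interpreted as 80 MiB


def validate_batch(audio_paths: List[str]) -> int:
    """Check file count and combined size; return the total size in bytes."""
    if not audio_paths:
        raise ValueError("no audio files given")
    if len(audio_paths) > MAX_FILES:
        raise ValueError(f"{len(audio_paths)} files exceeds the {MAX_FILES}-file limit")
    total = sum(Path(p).stat().st_size for p in audio_paths)
    # Form fields and multipart framing add some overhead, so compare strictly.
    if total >= MAX_PAYLOAD_BYTES:
        raise ValueError(f"total size {total} bytes is not under {MAX_PAYLOAD_BYTES} bytes")
    return total
```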

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| WAV | .wav | Uncompressed. Highest quality but largest file size. |
| FLAC | .flac | Lossless compression. Good balance of quality and size for archival audio. |
| AAC | .m4a | Common podcast export format. Well-supported across all recording tools. |
| ALAC | .m4a | Lossless Apple format. Use when source is from Apple recording tools. |
| OGG (Vorbis) | .ogg | Open format. Common in Linux recording pipelines. |
| Opus | .opus | Efficient lossy compression. Smallest file sizes; recommended for high-volume batch jobs. |
Files over 1 hour: Split into chunks of under 1 hour before submitting, for example with pydub. The full script in this guide does not include a splitting helper, so you need to add your own. After parsing, stitch the chunk transcripts back together in order.
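
A possible shape for such a splitting helper is sketched below. The names chunk_ranges and split_audio, the 55-minute chunk length, and the FLAC export default are all assumptions made for illustration. The pydub calls used (AudioSegment.from_file, slicing by milliseconds, and export) require ffmpeg to be installed.

```python
from pathlib import Path
from typing import List, Tuple

MAX_CHUNK_MS = 55 * 60 * 1000  # 55 minutes, leaving headroom under the 1-hour limit


def chunk_ranges(total_ms: int, max_chunk_ms: int = MAX_CHUNK_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover total_ms in order."""
    if total_ms <= 0 or max_chunk_ms <= 0:
        raise ValueError("durations must be positive")
    return [(s, min(s + max_chunk_ms, total_ms)) for s in range(0, total_ms, max_chunk_ms)]


def split_audio(path: str, out_dir: str, fmt: str = "flac") -> List[Tuple[str, float]]:
    """Write chunks to out_dir; return (chunk_path, offset_seconds) pairs."""
    from pydub import AudioSegment  # imported here so chunk_ranges works without pydub

    audio = AudioSegment.from_file(path)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    chunks = []
    for i, (start, end) in enumerate(chunk_ranges(len(audio))):
        out = Path(out_dir) / f"{Path(path).stem}_part{i:02d}.{fmt}"
        audio[start:end].export(str(out), format=fmt)
        # Keep the offset so segment timestamps can be shifted when stitching.
        chunks.append((str(out), start / 1000.0))
    return chunks
```

When stitching, add each chunk's offset to the start_time and end_time of its segments. Chunk length and export format also need to keep each request under the 80 MB payload limit, and speaker labels may not line up across chunks, so treat cross-chunk speaker identity as something to verify.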

Supported Languages

Pass the BCP-47 code in the language_code field of your submit request.
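| Language | Code | Native Script | ITN |
| --- | --- | --- | --- |
| Bengali | bn-IN | বাংলা | No |
| English | en-IN | Latin | Yes |
| Gujarati | gu-IN | ગુજરાતી | No |
| Hindi | hi-IN | हिन्दी | Yes |
| Kannada | kn-IN | ಕನ್ನಡ | No |
| Malayalam | ml-IN | മലയാളം | No |
| Marathi | mr-IN | मराठी | No |
| Punjabi | pa-IN | ਪੰਜਾਬੀ | No |
| Tamil | ta-IN | தமிழ் | No |
| Telugu | te-IN | తెలుగు | No |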
ITN (Inverse Text Normalization) converts spoken numbers, currency, dates, times, and phone numbers into compact written form. Currently available for hi-IN and en-IN only. Set format=transcribe in the request to enable it.
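
Because ITN is limited to two languages, it can help to pick the format value from the language code rather than hard-coding it. The sketch below is illustrative: choose_format is a hypothetical helper, the "verbatim" value is taken from the submit code later in this guide, and falling back to it for non-ITN languages is an assumption about what you want, not documented API behaviour.

```python
SUPPORTED_LANGUAGES = {
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "ml-IN", "mr-IN", "pa-IN", "ta-IN", "te-IN",
}
ITN_LANGUAGES = {"hi-IN", "en-IN"}


def choose_format(language_code: str, want_itn: bool = True) -> str:
    """Return the `format` field value for a given language and ITN preference."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language_code}")
    # Request ITN only where the table above lists it as available.
    if want_itn and language_code in ITN_LANGUAGES:
        return "transcribe"
    return "verbatim"
```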

Pipeline

1. Submit the audio file. POST to /stt/v3/batch/submit with your audio file, language code, and format preference. Receive a job_id immediately. Save it; you will need it in every poll call.

2. Poll for completion. GET /stt/v3/batch/status/{job_id} every 60 seconds. Loop until status reaches completed or failed. The results field is null until the job is complete.

3. Parse the segments. Iterate over results[].segments. Group by speaker_id. Build per-speaker text blocks with timestamps and confidence scores.

4. Save the transcript. Write the speaker-labelled transcript to a text file, along with a JSON file containing per-speaker talk time and segment metadata.

Step 1 — Submit

submit_job()
import os
import requests
from pathlib import Path

BATCH_SUBMIT = "https://api.vachana.ai/stt/v3/batch/submit"

def submit_job(
    audio_path: str,
    language_code: str = "hi-IN",
    itn: bool = True,
) -> str:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}

    with open(audio_path, "rb") as f:
        # The MIME type is hard-coded for WAV input; adjust it if you submit another format.
        files = [("audio_files", (Path(audio_path).name, f, "audio/wav"))]
        data  = {
            "language_code":   language_code,
            "is_multi_channel": "false",
            "format":          "transcribe" if itn else "verbatim",
        }
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)

    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    print(f"Submitted. job_id: {job_id}")
    return job_id
Submitting multiple files: Pass additional ("audio_files", ...) tuples to the files list. Up to 10 files are accepted per request, as long as the total payload stays under 80 MB.
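
One way to build that files list for several paths, while making sure every handle is closed afterwards, is sketched below. open_audio_files is a hypothetical helper; guessing the MIME type from the file extension is an assumption, since this guide does not say which content types the API expects.

```python
import mimetypes
from contextlib import ExitStack
from pathlib import Path
from typing import BinaryIO, List, Tuple

FileField = Tuple[str, Tuple[str, BinaryIO, str]]


def open_audio_files(stack: ExitStack, audio_paths: List[str]) -> List[FileField]:
    """Open each path under `stack` and return a `files` list for requests.post."""
    fields = []
    for p in audio_paths:
        # Guess the MIME type from the extension; fall back to a generic binary type.
        mime = mimetypes.guess_type(p)[0] or "application/octet-stream"
        fh = stack.enter_context(open(p, "rb"))
        fields.append(("audio_files", (Path(p).name, fh, mime)))
    return fields


# Usage (BATCH_SUBMIT, headers, and data as in submit_job above):
#
#     with ExitStack() as stack:
#         files = open_audio_files(stack, ["ep1.wav", "ep2.flac"])
#         resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)
```

The ExitStack closes every opened file when the with block exits, including when the request raises.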

Step 2 — Poll

poll_until_complete()
import os
import time
from typing import Optional

import requests

BATCH_STATUS  = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60  # seconds — enforced minimum; do not reduce

def poll_until_complete(job_id: str) -> Optional[list]:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url     = BATCH_STATUS.format(job_id=job_id)

    print(f"Polling job {job_id} every {POLL_INTERVAL}s...")

    while True:
        time.sleep(POLL_INTERVAL)

        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        payload  = resp.json()
        status   = payload["status"]
        progress = payload.get("overall_progress", "–")

        print(f"  [{status}]  progress: {progress}%")

        if status == "completed":
            print(f"Job complete. {payload['completed_files']} file(s) transcribed.")
            return payload.get("results", [])

        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None
Minimum poll interval: 60 seconds. The API enforces a rate limit of one status call per 60 seconds per job_id. Do not reduce it.
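
The loop above runs until the job reaches completed or failed, so a job that never reaches either state would block forever. A bounded variant is sketched below. poll_with_deadline, its fetch_status callback, and the 4-hour default are assumptions for illustration; the 60-second interval is the documented minimum.

```python
import time
from typing import Callable, Optional

POLL_INTERVAL = 60  # seconds; the documented minimum between status calls


def poll_with_deadline(
    fetch_status: Callable[[], dict],
    max_wait_s: float = 4 * 60 * 60,
    interval_s: float = POLL_INTERVAL,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch_status every interval_s until the job finishes or max_wait_s elapses."""
    waited = 0.0
    while waited < max_wait_s:
        sleep(interval_s)
        waited += interval_s
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):
            return payload
    return None  # gave up locally; the job may still be running server-side
```

Here fetch_status would wrap the GET call, for example a function that requests the status URL and returns resp.json(). Passing sleep as a parameter keeps the loop testable without real waiting.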

Step 3 — Parse

parse_results()
import json
from pathlib import Path
from typing import Dict

def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    outputs = {}

    for file_result in results:
        fname    = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])

        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []

        for seg in segments:
            spk   = seg.get("speaker_id", "UNKNOWN")
            text  = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end   = seg.get("end_time",   0.0)

            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id":        seg.get("segment_id"),
                "speaker_id":        spk,
                "start_time":        start,
                "end_time":          end,
                "text":              text,
                "confidence":        seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename":          file_result["filename"],
            "total_duration_s":  file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments":          segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {
            "transcript_path": str(transcript_path),
            "metadata_path":   str(metadata_path),
        }
        print(f"Parsed: {fname}: {len(lines)} segments, {len(speaker_times)} speaker(s)")

    return outputs
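
The confidence value stored with each segment can drive a human review queue. The following is a minimal sketch; low_confidence_segments is a hypothetical helper and the 0.80 threshold is an arbitrary starting point, not a value recommended by the API.

```python
from typing import List


def low_confidence_segments(segments: List[dict], threshold: float = 0.80) -> List[dict]:
    """Return segments whose confidence is missing or below threshold, in original order."""
    flagged = []
    for seg in segments:
        conf = seg.get("confidence")
        # Treat a missing score as needing review rather than silently passing it.
        if conf is None or conf < threshold:
            flagged.append(seg)
    return flagged
```

Run it over the segments list saved in the metadata JSON and tune the threshold against a sample you have checked by hand.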

Full Script

podcast_transcription.py
import os
import json
import time
import requests
from pathlib import Path
from typing  import Dict, List, Optional

BATCH_SUBMIT  = "https://api.vachana.ai/stt/v3/batch/submit"
BATCH_STATUS  = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60
OUTPUT_DIR    = "outputs"

Path(OUTPUT_DIR).mkdir(exist_ok=True)


def submit_job(audio_paths: List[str], language_code: str = "hi-IN", itn: bool = True) -> str:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}

    data = {"language_code": language_code, "is_multi_channel": "false",
            "format": "transcribe" if itn else "verbatim"}

    # Open the files, then close them in `finally` so handles are released
    # even if the request fails.
    handles = [open(p, "rb") for p in audio_paths]
    try:
        files = [("audio_files", (Path(p).name, fh, "audio/wav"))
                 for p, fh in zip(audio_paths, handles)]
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)
        resp.raise_for_status()
    finally:
        for fh in handles:
            fh.close()

    job_id = resp.json()["job_id"]
    print(f"Submitted {len(audio_paths)} file(s). job_id: {job_id}")
    return job_id


def poll_until_complete(job_id: str) -> Optional[list]:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url     = BATCH_STATUS.format(job_id=job_id)

    print(f"Polling every {POLL_INTERVAL}s...")
    while True:
        time.sleep(POLL_INTERVAL)
        resp    = requests.get(url, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        status  = payload["status"]
        print(f"  [{status}]  {payload.get('overall_progress', '–')}%")
        if status == "completed":
            return payload.get("results", [])
        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None


def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    outputs = {}

    for file_result in results:
        fname    = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])

        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []

        for seg in segments:
            spk   = seg.get("speaker_id", "UNKNOWN")
            text  = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end   = seg.get("end_time",   0.0)

            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id":        seg.get("segment_id"),
                "speaker_id":        spk,
                "start_time":        start,
                "end_time":          end,
                "text":              text,
                "confidence":        seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename":          file_result["filename"],
            "total_duration_s":  file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments":          segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {"transcript_path": str(transcript_path), "metadata_path": str(metadata_path)}
        print(f"Saved: {transcript_path.name}")

    return outputs


if __name__ == "__main__":
    job_id = submit_job(audio_paths=["/path/to/episode_01.wav"], language_code="hi-IN", itn=True)

    results = poll_until_complete(job_id)

    if results:
        output_dir = Path(OUTPUT_DIR) / f"job_{job_id}"
        output_dir.mkdir(parents=True, exist_ok=True)
        outputs = parse_results(results, output_dir)
        print(f"\nDone. {len(outputs)} transcript(s) saved to {output_dir}/")

Sample Output

outputs/
└── job_batch_7f3a92c1d4e8/
    ├── episode_01_transcript.txt   ← speaker-labelled, time-stamped transcript
    └── episode_01_metadata.json    ← duration, talk time, segment detail
episode_01_transcript.txt
[00:00] SPEAKER_1: नमस्ते, मैं हूँ रवि शर्मा और आज हम बात करेंगे भारत के स्टार्टअप इकोसिस्टम के बारे में।
[00:07] SPEAKER_2: हाँ रवि जी, बहुत अच्छा विषय है। पिछले पाँच साल में बहुत कुछ बदला है।
[00:14] SPEAKER_1: बिल्कुल। ₹2,00,000 करोड़ से ज़्यादा की फंडिंग आई है 2024 में।
[00:22] SPEAKER_2: और यूनिकॉर्न्स की संख्या भी 100 के पार पहुँच गई है।
episode_01_metadata.json
{
  "filename": "episode_01.wav",
  "total_duration_s": 2847.5,
  "speaker_talk_time": {
    "SPEAKER_1": 1423.8,
    "SPEAKER_2": 1389.2
  },
  "segments": [
    {
      "segment_id": 0,
      "speaker_id": 1,
      "start_time": 0.0,
      "end_time": 6.8,
      "text": "नमस्ते, मैं हूँ रवि शर्मा...",
      "confidence": 0.96,
      "language_detected": "hi-IN"
    }
  ]
}
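
Since every segment in the metadata carries start_time and end_time, subtitle files can be derived from it directly. The sketch below turns a segments list shaped like the one above into SubRip (.srt) text; srt_timestamp and segments_to_srt are hypothetical helpers and not part of the pipeline script.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp layout used by .srt files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Build SRT text from segments with start_time, end_time, speaker_id, and text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start_time"])
        end = srt_timestamp(seg["end_time"])
        blocks.append(f"{index}\n{start} --> {end}\nSPEAKER_{seg['speaker_id']}: {seg['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

Write the result with encoding="utf-8" so Devanagari and other native scripts survive.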

ITN Reference

When format=transcribe is set on the submit request, ITN post-processes every transcript — converting spoken-form numbers, currency, dates, times, and phone numbers into compact written form. Available for hi-IN and en-IN only.
| Category | Spoken input (ASR) | Written output (ITN) |
| --- | --- | --- |
| Numbers | पाँच लाख बीस हज़ार | 5,20,000 |
| Currency | तीन रुपये पचास पैसे | ₹3.50 |
| Currency (en) | five thousand rupees | ₹5,000 |
| Dates | बीस जनवरी दो हज़ार पच्चीस | 20 जनवरी 2025 |
| Times | शाम पाँच बजे | शाम 17:00 |
| Phone numbers | नौ आठ सात छह पाँच चार तीन दो एक शून्य | 9876543210 |
| Code-mixed | pay do lakh rupees by fifteenth march | pay ₹2,00,000 by 15th March |
Native script digits: Pass itn_native_numerals=true alongside format=transcribe to render digits in the native script of the target language — for example, Devanagari numerals (₹५,०००) for Hindi. English always outputs Western Arabic digits regardless of this setting.
What ITN does not change: Idiomatic and ambiguous phrases are preserved intentionally. दो तीन meaning “a few” stays as text, not 2 or 3. Imperative verbs like कर दो or ले दो are kept as words.
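
The ITN options can be folded into the form data sent with the submit request. The sketch below is illustrative: build_form_fields is a hypothetical helper, and sending itn_native_numerals as the string "true" in the form body (mirroring how is_multi_channel is sent in submit_job) is an assumption to confirm against the API reference.

```python
from typing import Dict


def build_form_fields(
    language_code: str,
    itn: bool = True,
    native_numerals: bool = False,
) -> Dict[str, str]:
    """Return the form fields for a submit request."""
    data = {
        "language_code": language_code,
        "is_multi_channel": "false",
        "format": "transcribe" if itn else "verbatim",
    }
    # Native-script digits are an ITN option, so only send the flag with ITN on.
    if itn and native_numerals:
        data["itn_native_numerals"] = "true"
    return data
```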