Podcast Transcription with Speaker Labels

Overview

Transcribe multi-speaker audio at scale using the Vachana Batch STT API. This guide walks through every step, from submitting an audio file to receiving a clean, speaker-separated transcript, using podcast transcription as the working example.

Audio-first content such as podcasts, interview recordings, and two-party discussions carries information that stays locked until it is transcribed. Speaker-level transcription is what separates a readable document from a wall of undifferentiated text: you know who said what, when they said it, and for how long.

Capability	What it enables downstream
Speaker-separated output	Per-speaker text blocks mean editors can review one voice at a time, and content teams can attribute quotes accurately.
Time-aligned segments	Every segment carries a `start_time` and `end_time`, enabling subtitle generation, chapter markers, and clip extraction at a specific timestamp.
Segment-level confidence	Flag low-confidence segments for human review rather than reviewing the entire transcript.
10 Indian languages	Transcribe Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, and English without switching providers or pipelines.
Batch processing	Submit up to 10 files in a single API call. Run overnight jobs, backfill archives, or process weekly episode batches without managing queues yourself.

Other Use Cases

The same submit-poll-parse pipeline applies to any scenario that involves long audio files, two speakers, and a need for speaker-separated text.

Use Case	Description
Journalist Interviews	Transcribe field recordings with interviewer and subject separated. Feed directly into editorial workflows without manual formatting.
Parliamentary & Panel Debates	Attribute statements to the correct speaker in two-party exchanges for political reporting, fact-checking, or archival. Supports Devanagari and regional scripts natively.
EdTech Lecture Recordings	Transcribe faculty and student exchanges. Generate accessible transcripts for students, search indexes for course platforms, and study material exports.
Legal Depositions & Hearings	Produce verbatim speaker-attributed records of proceedings. Use confidence scores to flag segments requiring court reporter verification.
Radio Archive Digitisation	Backfill years of archived broadcasts into searchable, attributed text. Batch processing handles large volumes without manual queuing.
Corporate Town Halls & Earnings Calls	Generate attributed transcripts of leadership Q&A sessions. Surface speaker-specific statements for internal comms or investor relations.
Documentary & Film Production	Auto-generate interview transcripts for rough-cut editing. Export time-coded speaker lines directly to editing software.
Doctor-Patient Consultations	Transcribe recorded consultations with doctor and patient separated. Enable structured documentation workflows for EMR systems.

Two-speaker limit: The Vachana Batch STT API supports a maximum of two distinct speakers per file. It is optimised for two-party audio — interviews, conversations, and one-on-one recordings. Panel discussions with three or more speakers are outside the current scope.

Prerequisites

Requirement	Details
Vachana API key	Available from the Vachana dashboard. You will use this as the `X-API-Key-ID` header on every request.
Python 3.9+	The pipeline is written for Python 3.9 or later and uses f-strings, `pathlib`, and `typing` annotations.
Audio files	Supported formats: AAC, WAV, FLAC, ALAC, OGG (Vorbis), Opus. Each file must be under 1 hour and the total payload under 80 MB.
ffmpeg	Required only if you plan to split files longer than 1 hour using pydub. Install with `brew install ffmpeg` (macOS) or `apt install ffmpeg` (Linux).

# HTTP client (used for submit and poll calls)
pip install requests

# Audio chunking — only needed for files over 1 hour
pip install pydub

No SDK is required for this pipeline. All calls use the standard HTTP REST endpoints.

Authentication

Header	Required	Description
`X-API-Key-ID`	Yes	Your Vachana API key. Required on both the submit and status calls.
`X-API-Request-ID`	No	A UUID you assign for tracing. Useful for correlating your application logs with platform support.

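Store your API key as an environment variable and read it at runtime. If you keep it in a .env file, export it into the process environment first (through your shell or a loader such as python-dotenv), because os.getenv does not read .env files on its own.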

.env

GNANI_API_KEY=your-api-key-here

loading credentials

import os

API_KEY = os.getenv("GNANI_API_KEY")
HEADERS = {"X-API-Key-ID": API_KEY}

Never hardcode API keys. Do not commit credentials to version control. Use environment variables, a secrets manager, or a vault. Rotate your key immediately if it is exposed.
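
The header rules above can be wrapped in one small helper that fails fast when the key is missing and attaches the optional tracing ID. This is a minimal sketch: `build_headers` is a hypothetical name rather than part of any Vachana SDK, and only the two header names and the `GNANI_API_KEY` variable come from this guide.

```python
import os
import uuid
from typing import Dict, Optional


def build_headers(request_id: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, raising early if the API key is not configured.

    Hypothetical helper: only the header names come from the table above.
    """
    api_key = os.getenv("GNANI_API_KEY")
    if not api_key:
        raise RuntimeError("GNANI_API_KEY is not set in the environment")

    return {
        "X-API-Key-ID": api_key,
        # Optional tracing ID: a fresh UUID per request unless the caller supplies one.
        "X-API-Request-ID": request_id or str(uuid.uuid4()),
    }
```

Logging the generated `X-API-Request-ID` next to the `job_id` makes it easier to correlate a request with platform support later.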

Limits & Supported Formats

Item	Limit
Max audio duration	Less than 1 hour per file
Max files per request	10 files per API call
Max total payload size	80 MB across all files and form fields combined
Minimum poll interval	60 seconds between status calls for the same `job_id`
Speaker diarization	Maximum 2 speakers per file

Supported Audio Formats

Format	Extension	Notes
WAV	`.wav`	Uncompressed. Highest quality but largest file size.
FLAC	`.flac`	Lossless compression. Good balance of quality and size for archival audio.
AAC	`.m4a`	Common podcast export format. Well-supported across all recording tools.
ALAC	`.m4a`	Lossless Apple format. Use when source is from Apple recording tools.
OGG (Vorbis)	`.ogg`	Open format. Common in Linux recording pipelines.
Opus	`.opus`	Efficient lossy compression. Smallest file sizes — recommended for high-volume batch jobs.
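
The count, size, and extension limits above can be checked locally before a request is sent. The sketch below is one way to do that; `preflight_check` is a hypothetical helper, the 80 MB limit is assumed to mean 80 × 1024 × 1024 bytes, and duration is not checked because that requires decoding the audio.

```python
from pathlib import Path
from typing import List

# Limits and extensions copied from the tables above.
MAX_FILES_PER_REQUEST = 10
MAX_PAYLOAD_BYTES = 80 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
SUPPORTED_EXTENSIONS = {".wav", ".flac", ".m4a", ".ogg", ".opus"}


def preflight_check(audio_paths: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the batch looks submittable."""
    problems = []

    if len(audio_paths) > MAX_FILES_PER_REQUEST:
        problems.append(
            f"{len(audio_paths)} files exceeds the {MAX_FILES_PER_REQUEST}-file limit"
        )

    total_bytes = 0
    for raw_path in audio_paths:
        path = Path(raw_path)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{path.name}: unsupported extension {path.suffix!r}")
        if not path.is_file():
            problems.append(f"{path.name}: file not found")
            continue
        total_bytes += path.stat().st_size

    # Form fields also count toward the limit, so file sizes are only a lower bound.
    if total_bytes >= MAX_PAYLOAD_BYTES:
        problems.append(
            f"total size {total_bytes} bytes is not under {MAX_PAYLOAD_BYTES} bytes"
        )

    return problems
```

Returning a list of messages instead of raising on the first issue lets a batch job report every problem file in one pass.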

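Files over 1 hour: Split into chunks before submitting. The full script on this page does not include a splitting step, so add one yourself, for example a split_audio() helper built on pydub (which needs ffmpeg). After parsing, stitch the resulting transcripts in order and shift each chunk's timestamps by the chunk's start offset.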
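
One possible shape for the splitting and stitching step is sketched below. It assumes pydub and ffmpeg are installed; the 30-minute chunk length, the FLAC export format, and the helper names are choices made for this sketch, not API requirements. `shift_segments` expects segment dicts with `start_time` and `end_time` in seconds.

```python
from pathlib import Path
from typing import List, Tuple

# Arbitrary margin under the 1-hour limit; shorter chunks also help with the 80 MB cap.
MAX_CHUNK_MS = 30 * 60 * 1000


def chunk_bounds(total_ms: int, max_chunk_ms: int = MAX_CHUNK_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover total_ms in order."""
    return [
        (start, min(start + max_chunk_ms, total_ms))
        for start in range(0, total_ms, max_chunk_ms)
    ]


def split_audio(audio_path: str, out_dir: str, export_format: str = "flac") -> List[Tuple[str, float]]:
    """Write chunk files and return (chunk_path, offset_seconds) pairs."""
    # Imported here so the pure helpers in this sketch work without pydub installed.
    from pydub import AudioSegment

    audio = AudioSegment.from_file(audio_path)  # decoding relies on ffmpeg
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    chunks = []
    for index, (start, end) in enumerate(chunk_bounds(len(audio))):  # len() is in ms
        name = f"{Path(audio_path).stem}_part{index:02d}.{export_format}"
        chunk_path = Path(out_dir) / name
        audio[start:end].export(str(chunk_path), format=export_format)
        chunks.append((str(chunk_path), start / 1000.0))
    return chunks


def shift_segments(segments: List[dict], offset_s: float) -> List[dict]:
    """Return copies of segments with both timestamps moved by offset_s seconds."""
    return [
        {
            **seg,
            "start_time": seg["start_time"] + offset_s,
            "end_time": seg["end_time"] + offset_s,
        }
        for seg in segments
    ]
```

Fixed-length cuts can land mid-word, and speaker labels are assigned per file, so the same person may not keep the same `speaker_id` across chunks. Check both when stitching.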

Supported Languages

Pass the BCP-47 code in the language_code field of your submit request.

Language	Code	Native Script	ITN
Bengali	`bn-IN`	বাংলা	—
English	`en-IN`	Latin	Yes
Gujarati	`gu-IN`	ગુજરાતી	—
Hindi	`hi-IN`	हिन्दी	Yes
Kannada	`kn-IN`	ಕನ್ನಡ	—
Malayalam	`ml-IN`	മലയാളം	—
Marathi	`mr-IN`	मराठी	—
Punjabi	`pa-IN`	ਪੰਜਾਬੀ	—
Tamil	`ta-IN`	தமிழ்	—
Telugu	`te-IN`	తెలుగు	—

ITN (Inverse Text Normalization) converts spoken numbers, currency, dates, times, and phone numbers into compact written form. Currently available for hi-IN and en-IN only. Set format=transcribe in the request to enable it.

Pipeline

Submit the audio file

POST to /stt/v3/batch/submit with your audio file, language code, and format preference. Receive a job_id immediately. Save it — you will need it in every poll call.

Poll for completion

GET /stt/v3/batch/status/{job_id} every 60 seconds. Loop until status reaches completed or failed. The results field is null until the job is complete.

Parse the segments

Iterate over results[].segments. Group by speaker_id. Build per-speaker text blocks with timestamps and confidence scores.

Save the transcript

Write the speaker-labelled transcript to a text file, along with a JSON file containing per-speaker talk time and segment metadata.

Step 1 — Submit

submit_job()

import os
import requests
from pathlib import Path

BATCH_SUBMIT = "https://api.vachana.ai/stt/v3/batch/submit"

def submit_job(
    audio_path: str,
    language_code: str = "hi-IN",
    itn: bool = True,
) -> str:
    """Submit one audio file and return the job_id assigned by the API."""
    api_key = os.getenv("GNANI_API_KEY")
    if not api_key:
        raise RuntimeError("GNANI_API_KEY is not set")
    headers = {"X-API-Key-ID": api_key}

    with open(audio_path, "rb") as f:
        # The part's content type is set to audio/wav here; adjust it for other formats if needed.
        files = [("audio_files", (Path(audio_path).name, f, "audio/wav"))]
        data  = {
            "language_code":   language_code,
            "is_multi_channel": "false",
            "format":          "transcribe" if itn else "verbatim",
        }
        # A timeout keeps a stalled upload from hanging forever; 120 s is an arbitrary choice.
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data, timeout=120)

    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    print(f"Submitted. job_id: {job_id}")
    return job_id

Submitting multiple files: Pass additional ("audio_files", ...) tuples to the files list. Up to 10 files are accepted per request, as long as the total payload stays under 80 MB.
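
When an archive holds more than 10 files, the paths have to be grouped into request-sized batches first. The sketch below does this greedily under both limits; `plan_batches` is a hypothetical helper, sizes are passed in as bytes (for example from `os.path.getsize`), and the 80 MB limit is assumed to mean 80 × 1024 × 1024 bytes.

```python
from typing import Iterator, List, Tuple

MAX_FILES_PER_REQUEST = 10
MAX_PAYLOAD_BYTES = 80 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def plan_batches(
    files: List[Tuple[str, int]],
    max_files: int = MAX_FILES_PER_REQUEST,
    max_bytes: int = MAX_PAYLOAD_BYTES,
) -> Iterator[List[str]]:
    """Group (path, size_in_bytes) pairs into request-sized batches, preserving order."""
    batch: List[str] = []
    batch_bytes = 0
    for path, size in files:
        if size >= max_bytes:
            raise ValueError(f"{path} alone is not under the payload limit")
        # Start a new batch when adding this file would break either limit.
        if batch and (len(batch) == max_files or batch_bytes + size >= max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded list can be passed as `audio_paths` to the multi-file `submit_job()` in the full script, with one `job_id` to poll per batch.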

Step 2 — Poll

poll_until_complete()

import os
import time
import requests
from typing import Optional

BATCH_STATUS  = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60  # seconds; enforced minimum, do not reduce

def poll_until_complete(job_id: str) -> Optional[list]:
    """Poll the status endpoint until the job completes or fails.

    Returns the results list on success, or None if the job failed.
    """
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url     = BATCH_STATUS.format(job_id=job_id)

    print(f"Polling job {job_id} every {POLL_INTERVAL}s...")

    while True:
        # Sleep first: the job will not be finished right after submission.
        time.sleep(POLL_INTERVAL)

        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        payload  = resp.json()
        status   = payload["status"]
        progress = payload.get("overall_progress")

        # Only append the percent sign when the API actually reported progress.
        progress_text = f"{progress}%" if progress is not None else "n/a"
        print(f"  [{status}]  progress: {progress_text}")

        if status == "completed":
            print(f"Job complete. {payload.get('completed_files', '?')} file(s) transcribed.")
            return payload.get("results", [])

        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None

Minimum poll interval: 60 seconds. The API enforces a rate limit of one status call per 60 seconds per job_id, so keep POLL_INTERVAL at 60 or higher.
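
The loop above runs until the job reaches a terminal status. For unattended jobs it can help to bound the total wait. The sketch below separates the waiting logic from the HTTP call so it can be exercised without the network; `poll_with_deadline`, its parameters, and the 4-hour default are assumptions of this sketch, while the `completed` and `failed` status values come from this guide.

```python
import time
from typing import Callable, Optional

POLL_INTERVAL = 60  # seconds; the documented minimum between status calls


def poll_with_deadline(
    fetch_status: Callable[[], dict],
    max_wait_s: float = 4 * 60 * 60,  # arbitrary 4-hour ceiling for illustration
    interval_s: float = POLL_INTERVAL,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch_status until the job completes or fails, or the wait budget runs out.

    Returns the final payload, or None if max_wait_s elapsed first.
    """
    waited = 0.0  # counts scheduled sleep time, not wall-clock time
    while waited < max_wait_s:
        sleep(interval_s)
        waited += interval_s
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):
            return payload
    return None
```

In practice `fetch_status` would be a small function that performs the GET on the status URL and returns `resp.json()`. A `None` return means the job is still running, so the `job_id` should be kept for a later retry.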

Step 3 — Parse

parse_results()

import json
from pathlib import Path
from typing import Dict

def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    outputs = {}

    for file_result in results:
        fname    = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])

        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []

        for seg in segments:
            spk   = seg.get("speaker_id", "UNKNOWN")
            text  = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end   = seg.get("end_time",   0.0)

            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id":        seg.get("segment_id"),
                "speaker_id":        spk,
                "start_time":        start,
                "end_time":          end,
                "text":              text,
                "confidence":        seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename":          file_result["filename"],
            "total_duration_s":  file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments":          segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {
            "transcript_path": str(transcript_path),
            "metadata_path":   str(metadata_path),
        }
        print(f"Parsed: {fname} → {len(lines)} segments, {len(speaker_times)} speaker(s)")

    return outputs
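
parse_results() writes one transcript line per segment. To get the per-speaker text blocks described in the pipeline overview, consecutive segments from the same speaker can be merged into a single turn. A minimal sketch, assuming segment dicts shaped like the `segment_meta` entries above; `merge_speaker_turns` is a hypothetical helper:

```python
from typing import List


def merge_speaker_turns(segments: List[dict]) -> List[dict]:
    """Merge consecutive segments from the same speaker into one block each."""
    turns: List[dict] = []
    for seg in segments:
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments so they do not break up a turn
        if turns and turns[-1]["speaker_id"] == seg.get("speaker_id"):
            # Same speaker is still talking: extend the current block.
            turns[-1]["text"] += " " + text
            turns[-1]["end_time"] = seg.get("end_time", turns[-1]["end_time"])
        else:
            turns.append({
                "speaker_id": seg.get("speaker_id"),
                "start_time": seg.get("start_time", 0.0),
                "end_time": seg.get("end_time", 0.0),
                "text": text,
            })
    return turns
```

Merged turns read better in an editorial document, while the unmerged segments remain the better input for subtitles and confidence review.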

Full Script

podcast_transcription.py

import os
import json
import time
import requests
from pathlib import Path
from typing  import Dict, List, Optional

BATCH_SUBMIT  = "https://api.vachana.ai/stt/v3/batch/submit"
BATCH_STATUS  = "https://api.vachana.ai/stt/v3/batch/status/{job_id}"
POLL_INTERVAL = 60
OUTPUT_DIR    = "outputs"

Path(OUTPUT_DIR).mkdir(exist_ok=True)


def submit_job(audio_paths: List[str], language_code: str = "hi-IN", itn: bool = True) -> str:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    data    = {"language_code": language_code, "is_multi_channel": "false",
               "format": "transcribe" if itn else "verbatim"}

    handles = []
    try:
        # Open every file up front; the finally block closes them even if the request fails.
        for p in audio_paths:
            handles.append(open(p, "rb"))
        files = [("audio_files", (Path(p).name, fh, "audio/wav"))
                 for p, fh in zip(audio_paths, handles)]
        resp = requests.post(BATCH_SUBMIT, headers=headers, files=files, data=data)
        resp.raise_for_status()
    finally:
        for fh in handles:
            fh.close()

    job_id = resp.json()["job_id"]
    print(f"Submitted {len(audio_paths)} file(s). job_id: {job_id}")
    return job_id


def poll_until_complete(job_id: str) -> Optional[list]:
    api_key = os.getenv("GNANI_API_KEY")
    headers = {"X-API-Key-ID": api_key}
    url     = BATCH_STATUS.format(job_id=job_id)

    print(f"Polling every {POLL_INTERVAL}s...")
    while True:
        time.sleep(POLL_INTERVAL)
        resp    = requests.get(url, headers=headers)
        resp.raise_for_status()
        payload = resp.json()
        status  = payload["status"]
        print(f"  [{status}]  {payload.get('overall_progress', '–')}%")
        if status == "completed":
            return payload.get("results", [])
        if status == "failed":
            print(f"Job failed: {payload.get('error')}")
            return None


def parse_results(results: list, output_dir: Path) -> Dict[str, dict]:
    outputs = {}

    for file_result in results:
        fname    = Path(file_result["filename"]).stem
        segments = file_result.get("segments", [])

        if not segments or file_result.get("status") == "failed":
            print(f"Skipping {fname}: {file_result.get('error', 'no segments')}")
            continue

        lines, speaker_times, segment_meta = [], {}, []

        for seg in segments:
            spk   = seg.get("speaker_id", "UNKNOWN")
            text  = seg.get("text", "").strip()
            start = seg.get("start_time", 0.0)
            end   = seg.get("end_time",   0.0)

            ts = f"{int(start // 60):02d}:{int(start % 60):02d}"
            lines.append(f"[{ts}] SPEAKER_{spk}: {text}")
            speaker_times[spk] = speaker_times.get(spk, 0.0) + (end - start)
            segment_meta.append({
                "segment_id":        seg.get("segment_id"),
                "speaker_id":        spk,
                "start_time":        start,
                "end_time":          end,
                "text":              text,
                "confidence":        seg.get("confidence"),
                "language_detected": seg.get("language_detected"),
            })

        transcript_path = output_dir / f"{fname}_transcript.txt"
        transcript_path.write_text("\n".join(lines), encoding="utf-8")

        metadata_path = output_dir / f"{fname}_metadata.json"
        metadata_path.write_text(json.dumps({
            "filename":          file_result["filename"],
            "total_duration_s":  file_result.get("total_duration"),
            "speaker_talk_time": {f"SPEAKER_{k}": round(v, 2) for k, v in speaker_times.items()},
            "segments":          segment_meta,
        }, indent=2, ensure_ascii=False), encoding="utf-8")

        outputs[fname] = {"transcript_path": str(transcript_path), "metadata_path": str(metadata_path)}
        print(f"Saved: {transcript_path.name}")

    return outputs


if __name__ == "__main__":
    job_id = submit_job(audio_paths=["/path/to/episode_01.wav"], language_code="hi-IN", itn=True)

    results = poll_until_complete(job_id)

    if results:
        output_dir = Path(OUTPUT_DIR) / f"job_{job_id}"
        output_dir.mkdir(parents=True, exist_ok=True)
        outputs = parse_results(results, output_dir)
        print(f"\nDone. {len(outputs)} transcript(s) saved to {output_dir}/")

Sample Output

outputs/
└── job_batch_7f3a92c1d4e8/
    ├── episode_01_transcript.txt   ← speaker-labelled, time-stamped transcript
    └── episode_01_metadata.json    ← duration, talk time, segment detail

episode_01_transcript.txt

[00:00] SPEAKER_1: नमस्ते, मैं हूँ रवि शर्मा और आज हम बात करेंगे भारत के स्टार्टअप इकोसिस्टम के बारे में।
[00:07] SPEAKER_2: हाँ रवि जी, बहुत अच्छा विषय है। पिछले पाँच साल में बहुत कुछ बदला है।
[00:14] SPEAKER_1: बिल्कुल। ₹2,00,000 करोड़ से ज़्यादा की फंडिंग आई है 2024 में।
[00:22] SPEAKER_2: और यूनिकॉर्न्स की संख्या भी 100 के पार पहुँच गई है।
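
The transcript file is meant for reading, but the same time-aligned segments can also be rendered as subtitles. The sketch below formats segment dicts (with `speaker_id`, `start_time`, `end_time`, and `text`, as stored in the metadata file) as SubRip (SRT) cues. The helper names are hypothetical, and prefixing each cue with the speaker label is a presentation choice.

```python
from typing import List


def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{to_srt_timestamp(seg['start_time'])} --> {to_srt_timestamp(seg['end_time'])}\n"
            f"SPEAKER_{seg['speaker_id']}: {seg['text']}\n"
        )
    return "\n".join(cues)
```

Write the returned string to a `.srt` file with UTF-8 encoding so that Devanagari and other native scripts survive intact.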

episode_01_metadata.json

{
  "filename": "episode_01.wav",
  "total_duration_s": 2847.5,
  "speaker_talk_time": {
    "SPEAKER_1": 1423.8,
    "SPEAKER_2": 1389.2
  },
  "segments": [
    {
      "segment_id": 0,
      "speaker_id": 1,
      "start_time": 0.0,
      "end_time": 6.8,
      "text": "नमस्ते, मैं हूँ रवि शर्मा...",
      "confidence": 0.96,
      "language_detected": "hi-IN"
    }
  ]
}
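
The per-segment `confidence` values in this file make it possible to send only doubtful segments to a human reviewer. A minimal sketch, assuming confidence is a number between 0 and 1 as in the sample above; the 0.80 threshold and the helper names are illustrative, not recommendations from the API:

```python
import json
from pathlib import Path
from typing import List

REVIEW_THRESHOLD = 0.80  # arbitrary cut-off for illustration; tune it on your own audio


def low_confidence_segments(segments: List[dict], threshold: float = REVIEW_THRESHOLD) -> List[dict]:
    """Return segments whose confidence is missing or below the threshold."""
    flagged = []
    for seg in segments:
        confidence = seg.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(seg)
    return flagged


def review_queue(metadata_path: str, threshold: float = REVIEW_THRESHOLD) -> List[dict]:
    """Load a *_metadata.json file written by parse_results() and return its flagged segments."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    return low_confidence_segments(metadata.get("segments", []), threshold)
```

Segments without a confidence value are flagged as well, on the assumption that an unscored segment is safer to review than to trust.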

ITN Reference

When format=transcribe is set on the submit request, ITN post-processes every transcript — converting spoken-form numbers, currency, dates, times, and phone numbers into compact written form. Available for hi-IN and en-IN only.

Category	Spoken input (ASR)	Written output (ITN)
Numbers	पाँच लाख बीस हज़ार	5,20,000
Currency	तीन रुपये पचास पैसे	₹3.50
Currency (en)	five thousand rupees	₹5,000
Dates	बीस जनवरी दो हज़ार पच्चीस	20 जनवरी 2025
Times	शाम पाँच बजे	शाम 17:00
Phone numbers	नौ आठ सात छह पाँच चार तीन दो एक शून्य	9876543210
Code-mixed	pay do lakh rupees by fifteenth march	pay ₹2,00,000 by 15th March

Native script digits: Pass itn_native_numerals=true alongside format=transcribe to render digits in the native script of the target language — for example, Devanagari numerals (₹५,०००) for Hindi. English always outputs Western Arabic digits regardless of this setting.

What ITN does not change: Idiomatic and ambiguous phrases are preserved intentionally. दो तीन meaning “a few” stays as text, not 2 or 3. Imperative verbs like कर दो or ले दो are kept as words.
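
The ITN options above can be collected into the form fields of a submit request. In the sketch below, `build_submit_fields` is a hypothetical helper. Falling back to `verbatim` for languages without ITN support and sending `itn_native_numerals` as the string "true" (mirroring how `is_multi_channel` is sent in this guide) are both assumptions.

```python
from typing import Dict

ITN_LANGUAGES = {"hi-IN", "en-IN"}  # the languages listed with ITN support in this guide


def build_submit_fields(
    language_code: str,
    itn: bool = True,
    native_numerals: bool = False,
) -> Dict[str, str]:
    """Build the form fields for a submit request (hypothetical helper).

    Assumption: for languages without ITN support, request verbatim output
    instead of sending format=transcribe.
    """
    use_itn = itn and language_code in ITN_LANGUAGES
    fields = {
        "language_code": language_code,
        "is_multi_channel": "false",
        "format": "transcribe" if use_itn else "verbatim",
    }
    # Native-script digits only apply when ITN is producing digits at all.
    if use_itn and native_numerals:
        fields["itn_native_numerals"] = "true"
    return fields
```

The returned dictionary can be passed as the `data` argument of the submit call in place of the inline dictionary.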
