Prerequisites
Before you begin, ensure you have:
- A valid API key (sign up on the Vachana platform to generate API keys)
- cURL installed, or an API client such as Postman
This quickstart covers three APIs:
- Speech-to-Text (STT)
- Text-to-Speech (TTS)
- Voice Cloning (VC)
Use a test audio file with the following requirements:
- Format: WAV, MP3, OGG, FLAC, AAC, M4A
- Sampling rate: 8 kHz – 44.1 kHz
- Maximum duration: 60 seconds
Your First Speech-to-Text Request
Minimal example to transcribe a Hindi audio file:

curl -X POST https://api.vachana.ai/stt/v3 \
-H 'Content-Type: multipart/form-data' \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file=@/path/to/your/audio.wav \
-F language_code=hi-IN
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/your/audio.wav: Path to your audio file
- hi-IN: Language code (see Language Codes for all options)
Expected STT Response
On success, you’ll receive a JSON response like:

{
"success": true,
"transcript": "नमस्ते, आप कैसे हैं?"
}
For real-time, low-latency transcription of streaming audio, Vachana provides a Realtime API.
Connection
Create a Realtime connection using your API credentials:

const ws = new Realtime("wss://api.vachana.ai/stt/v3", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Audio
Send raw PCM audio frames over the Realtime connection. Audio requirements:
- Format: PCM 16-bit
- Sample rate: 16 or 8 kHz
- Channels: Mono
- Chunk size: 1024 bytes per frame (512 samples)
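The frame-size requirement above can be met by slicing your raw PCM buffer before sending. A minimal sketch, assuming `ws` is the open Realtime connection from the previous step; the `toFrames` helper name is illustrative.

```javascript
// Sketch: split a raw PCM 16-bit buffer into 1024-byte frames
// (512 samples each) for the Realtime endpoint.
function toFrames(pcm, frameBytes = 1024) {
  const frames = [];
  for (let off = 0; off < pcm.length; off += frameBytes) {
    frames.push(pcm.subarray(off, off + frameBytes)); // last frame may be shorter
  }
  return frames;
}

// Usage over an open connection:
// for (const frame of toFrames(pcmBuffer)) ws.send(frame);
```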
Expected Realtime Response
The server sends JSON text frames containing transcription segments:

{
"type": "transcript",
"timestamp": "2024-01-15T10:30:05.987Z",
"text": "Hello, how are you today?",
"audio_duration_ms": 2340,
"segment_id": "seg_abc123",
"segment_index": 1,
"latency": 320,
"detected_language": "en"
}
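On the client side you typically accumulate these segments into a running transcript. A sketch using the field names from the frame above; ordering by segment_index to guard against out-of-order delivery is an assumption, and the `makeAssembler` helper is illustrative.

```javascript
// Sketch: assemble a running transcript from incoming JSON text frames.
function makeAssembler() {
  const segments = new Map(); // segment_index -> text
  return {
    onFrame(raw) {
      const msg = JSON.parse(raw);
      if (msg.type === "transcript") segments.set(msg.segment_index, msg.text);
    },
    transcript() {
      return [...segments.keys()]
        .sort((a, b) => a - b)
        .map((i) => segments.get(i))
        .join(" ");
    },
  };
}

// Usage: ws.on("message", (data) => assembler.onFrame(data.toString()));
```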
Have your input text ready. You’ll also need a voice name — see Voice Options for available voices.
Your First Text-to-Speech Call
Minimal example for REST TTS (synchronous audio). This endpoint returns the full synthesized audio as a binary response.

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
}
}' \
--output response.wav
Expected TTS Response
A successful request will return a 200 OK HTTP status. The response body will contain raw binary audio data representing the synthesized text, adhering to the format specified in your audio_config.

HTTP/1.1 200 OK
Content-Type: audio/wav
<binary audio data>
This endpoint streams synthesized audio using Server-Sent Events (SSE). Audio is generated and delivered incrementally as it becomes available.
Your First Streaming Call
curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}'
Expected SSE Response
A successful request will return a 200 OK HTTP status. The response body will contain a stream of server-sent events. Each chunk contains base64-encoded audio fragments.

HTTP/1.1 200 OK
Content-Type: text/event-stream
event: audio_chunk
data: UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=
event: audio_chunk
data: //NkxAAAAANIAAAAAExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
event: completed
data: {"status": "success"}
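To turn this stream into playable audio, parse the event/data lines and concatenate the decoded audio_chunk payloads. A sketch assuming the framing shown above; the `collectAudio` helper name is illustrative.

```javascript
// Sketch: parse an SSE body, decode base64 audio_chunk events,
// and concatenate them into a single audio buffer.
function collectAudio(sseText) {
  const chunks = [];
  let event = null;
  for (const line of sseText.split("\n")) {
    if (line.startsWith("event:")) {
      event = line.slice(6).trim();          // e.g. "audio_chunk", "completed"
    } else if (line.startsWith("data:") && event === "audio_chunk") {
      chunks.push(Buffer.from(line.slice(5).trim(), "base64"));
    }
  }
  return Buffer.concat(chunks);
}
```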
For ultra-low latency applications, the Realtime API allows you to stream text input and receive synthesized audio continuously.
Connection
Connect using a Realtime client with the required authentication headers:

const ws = new Realtime("wss://api.vachana.ai/api/v1/tts", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Text
Once connected, send a JSON payload containing the text to synthesize.

{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}
Audio Stream Response
Upon successful connection, the server returns a 101 Switching Protocols status to establish the Realtime connection. Once the text payload is sent, the server immediately begins streaming audio back as a sequence of binary Realtime frames containing raw PCM audio data, terminating or keeping the connection open depending on the application context.
Voice Cloning
Voice cloning works in two steps:
- Generate embeddings — upload a reference audio file to get a speaker_embedding
- Synthesize — pass the embedding with your text to any VC TTS endpoint
Step 1: Generate Voice Embeddings
Upload a reference audio file (WAV/MP3, ideally 5–30 seconds of clear speech):

curl -X POST https://api.vachana.ai/api/v1/tts/voice-clone/embeddings \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file='@/path/to/reference.wav'
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/reference.wav: Path to your reference audio file
Expected Embeddings Response
{
"embedding": "<embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
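If you are scripting the two steps, the embeddings response can be carried straight into the Step 2 request body. A sketch mirroring the field names shown in this guide; the `buildCloneRequest` helper is illustrative, not part of any SDK.

```javascript
// Sketch: turn the Step 1 embeddings response into a Step 2 request body.
function buildCloneRequest(text, embeddingResponse) {
  return {
    text,
    model: "vachana-vc-v1",
    speaker_embedding: {
      embedding: embeddingResponse.embedding,
      shape: embeddingResponse.shape,      // e.g. [1, 768]
      dtype: embeddingResponse.dtype,      // e.g. "torch.bfloat16"
    },
  };
}
```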
Step 2: Synthesize with Your Cloned Voice
Pass the speaker_embedding from Step 1 to synthesize audio in your cloned voice:

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
},
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}' \
--output cloned_voice.wav
A successful request returns a 200 OK with raw binary audio data in the specified format.
Stream cloned voice audio progressively via Server-Sent Events. The response streams base64-encoded audio chunks, identical in format to the TTS SSE response:
curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}'
For the lowest latency, stream text and receive cloned voice audio over a WebSocket. The server streams binary PCM audio chunks over the connection:
const ws = new WebSocket("wss://api.vachana.ai/api/v1/tts", {
headers: {
"Content-Type": "application/json",
"X-API-Key-ID": "<API_KEY>",
},
});
ws.on("open", () => {
ws.send(JSON.stringify({
text: "नमस्ते, आप कैसे हैं?",
model: "vachana-vc-v1",
audio_config: { sample_rate: 44100, encoding: "linear_pcm" },
speaker_embedding: {
embedding: "<your-embedding-string>",
shape: [1, 768],
dtype: "torch.bfloat16",
},
}));
});
ws.on("message", (data) => {
// Handle binary PCM audio chunks
});
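One way to fill in that message handler is to collect the binary frames into a single PCM buffer for playback or saving once the stream ends. A minimal sketch; the `makeCollector` helper name is illustrative.

```javascript
// Sketch: accumulate binary PCM frames from "message" events into one buffer.
function makeCollector() {
  const chunks = [];
  return {
    push(data) { chunks.push(Buffer.from(data)); },
    all() { return Buffer.concat(chunks); },
  };
}

// Usage with the connection above:
// const pcm = makeCollector();
// ws.on("message", (data) => pcm.push(data));
// ws.on("close", () => fs.writeFileSync("cloned_voice.pcm", pcm.all()));
```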
Next Steps
- Speech-to-Text: STT REST and STT Realtime for all STT parameters and language options
- Text-to-Speech: REST, Streaming (SSE), and Realtime for TTS options
- Voice Cloning: VC Embeddings, REST, Streaming, and Realtime for voice cloning options