Documentation Index
Fetch the complete documentation index at: https://docs.inya.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
Before you begin, ensure you have:
- A valid API key (sign up on the Vachana platform to generate API keys)
- cURL installed, or an API client such as Postman
Speech-to-Text (STT)
Text-to-Speech (TTS)
Voice Cloning (VC)
Use a test audio file with the following requirements:
- Format: WAV, MP3, OGG, FLAC, AAC, M4A
- Sampling rate: 8 kHz – 44.1 kHz
- Maximum duration: 60 seconds
Your First Speech-to-Text Request
Minimal example to transcribe a Hindi audio file:

curl -X POST https://api.vachana.ai/stt/v3 \
-H 'Content-Type: multipart/form-data' \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file='@/path/to/your/audio.wav' \
-F language_code=hi-IN
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/your/audio.wav: Path to your audio file
- hi-IN: Language code (see Language Codes for all options)
Expected STT Response
On success, you’ll receive a JSON response like:

{
"success": true,
"transcript": "नमस्ते, आप कैसे हैं?"
}
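If you are calling the endpoint from code rather than curl, the response body can be parsed into a typed result. A minimal TypeScript sketch; the `SttResponse` shape is inferred from the example above, not an official SDK type:

```typescript
// Shape inferred from the documented example response.
interface SttResponse {
  success: boolean;
  transcript: string;
}

// Parse the response body and return the transcript,
// throwing if the request did not succeed.
function parseSttResponse(body: string): string {
  const parsed = JSON.parse(body) as SttResponse;
  if (!parsed.success) {
    throw new Error("STT request failed");
  }
  return parsed.transcript;
}
```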
For real-time, low-latency transcription of streaming audio, Vachana provides a Realtime API.

Connection
Create a Realtime connection using your API credentials:

const ws = new Realtime("wss://api.vachana.ai/stt/v3", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Audio
Send raw PCM audio frames over the Realtime connection.
Audio requirements:
- Format: PCM 16-bit
- Sample rate: 16 or 8 kHz
- Channels: Mono
- Chunk size: 1024 bytes per frame (512 samples)
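The frame size above (1024 bytes, i.e. 512 samples of 16-bit mono PCM) can be produced by slicing your raw PCM buffer before sending. A minimal sketch; the `send` callback stands in for your Realtime connection's send method:

```typescript
const FRAME_BYTES = 1024; // 512 samples of 16-bit mono PCM

// Split a raw PCM buffer into fixed-size frames and pass each
// one to `send` (e.g. ws.send). The final frame may be shorter.
function sendPcmFrames(
  pcm: Uint8Array,
  send: (frame: Uint8Array) => void
): number {
  let frames = 0;
  for (let offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
    send(pcm.subarray(offset, offset + FRAME_BYTES));
    frames++;
  }
  return frames;
}
```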
Client-to-server messages must contain only binary audio frames. Do not wrap the audio in JSON.

Expected STT Response
The server sends JSON text frames containing transcription segments:

{
"type": "transcript",
"timestamp": "2024-01-15T10:30:05.987Z",
"text": "Hello, how are you today?",
"audio_duration_ms": 2340,
"segment_id": "seg_abc123",
"segment_index": 1,
"latency": 320,
"detected_language": "en"
}
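A message handler can parse these text frames and pick out the transcript segments. A sketch; the `TranscriptFrame` interface mirrors the example payload above and is not an official type:

```typescript
// Mirrors the documented transcript frame (subset of fields).
interface TranscriptFrame {
  type: string;
  text: string;
  segment_index: number;
  detected_language: string;
}

// Return the segment text for transcript frames, or null for
// any other message type the server may send.
function handleServerFrame(raw: string): string | null {
  const frame = JSON.parse(raw) as TranscriptFrame;
  return frame.type === "transcript" ? frame.text : null;
}
```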
Have your input text ready. You’ll also need a voice name; see Voice Options for available voices.

Your First Text-to-Speech Call
Minimal example for REST TTS (synchronous audio). This endpoint returns the full synthesized audio as a binary response.

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
}
}' \
--output response.wav
Expected TTS Response
A successful request will return a 200 OK HTTP status. The response body will contain raw binary audio data representing the synthesized text, adhering to the format specified in your audio_config.

HTTP/1.1 200 OK
Content-Type: audio/wav
<binary audio data>
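Since the body is raw container bytes, a quick sanity check before playback is to read the WAV header. A sketch based on the standard RIFF/WAVE layout (not a Vachana-specific format): in the canonical 44-byte header, the sample rate is a little-endian uint32 at byte offset 24.

```typescript
// Read the sample rate from a canonical RIFF/WAVE header,
// throwing if the magic bytes are not present.
function wavSampleRate(bytes: Uint8Array): number {
  const ascii = (off: number, len: number) =>
    String.fromCharCode(...bytes.subarray(off, off + len));
  if (ascii(0, 4) !== "RIFF" || ascii(8, 4) !== "WAVE") {
    throw new Error("not a WAV file");
  }
  // Sample rate is a little-endian uint32 at offset 24.
  return new DataView(bytes.buffer, bytes.byteOffset).getUint32(24, true);
}
```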
This endpoint streams synthesized audio using Server-Sent Events (SSE). Audio is generated and delivered incrementally as it becomes available.

Your First Streaming Call
curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}'
Expected SSE Response
A successful request will return a 200 OK HTTP status. The response body will contain a stream of server-sent events. Each chunk contains base64-encoded audio fragments.

HTTP/1.1 200 OK
Content-Type: text/event-stream
event: audio_chunk
data: UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=
event: audio_chunk
data: //NkxAAAAANIAAAAAExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
event: completed
data: {"status": "success"}
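Client libraries differ in how they surface SSE, but the wire format is plain text: `event:`/`data:` line pairs separated by blank lines. A sketch that collects the base64 `audio_chunk` payloads into one binary buffer, assuming Node's `Buffer` for base64 decoding:

```typescript
// Parse an SSE text stream and concatenate all audio_chunk
// payloads (base64) into a single binary buffer.
function collectSseAudio(stream: string): Buffer {
  const chunks: Buffer[] = [];
  let currentEvent = "";
  for (const line of stream.split("\n")) {
    if (line.startsWith("event:")) {
      currentEvent = line.slice(6).trim();
    } else if (line.startsWith("data:") && currentEvent === "audio_chunk") {
      chunks.push(Buffer.from(line.slice(5).trim(), "base64"));
    }
  }
  return Buffer.concat(chunks);
}
```

In a real client you would feed the response body into this parser incrementally rather than buffering the whole stream.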
For ultra-low latency applications, the Realtime API allows you to stream text input and receive synthesized audio continuously.

Connection
Connect using a Realtime client with the required authentication headers:

const ws = new Realtime("wss://api.vachana.ai/api/v1/tts", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Text
Once connected, send a JSON payload containing the text to synthesize.

{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}
Audio Stream Response
Upon successful connection, the server returns a 101 Switching Protocols status to establish the Realtime connection. Once the text payload is sent, the server immediately begins streaming audio back as a sequence of binary Realtime frames containing raw PCM audio data; depending on your application, the connection can then be closed or kept open for further requests.

Voice cloning works in two steps:
- Generate embeddings — upload a reference audio file to get a
speaker_embedding
- Synthesize — pass the embedding with your text to any VC TTS endpoint
Step 1: Generate Voice Embeddings
Upload a reference audio file (WAV/MP3, ideally 5–30 seconds of clear speech):

curl -X POST https://api.vachana.ai/api/v1/tts/voice-clone/embeddings \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file='@/path/to/reference.wav'
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/reference.wav: Path to your reference audio file
Expected Embeddings Response
{
"embedding": "<embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
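The embeddings response can be forwarded verbatim as the `speaker_embedding` field in Step 2. A small TypeScript sketch; the types are inferred from the example payloads above, not an official SDK:

```typescript
// Shape of the Step 1 embeddings response.
interface VoiceEmbedding {
  embedding: string;
  shape: number[];
  dtype: string;
}

// Build the Step 2 synthesis request body from a Step 1 response.
function buildCloneRequest(text: string, emb: VoiceEmbedding): object {
  return {
    text,
    model: "vachana-vc-v1",
    speaker_embedding: emb,
  };
}
```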
Step 2: Synthesize with Your Cloned Voice
Pass the speaker_embedding from Step 1 to synthesize audio in your cloned voice:

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
},
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}' \
--output cloned_voice.wav
A successful request returns a 200 OK with raw binary audio data in the specified format.

Stream cloned voice audio progressively via Server-Sent Events:

curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}'
The response streams base64-encoded audio chunks as server-sent events, identical in format to the TTS SSE response.

For the lowest latency, stream text and receive cloned voice audio over a WebSocket:

const ws = new WebSocket("wss://api.vachana.ai/api/v1/tts", {
headers: {
"Content-Type": "application/json",
"X-API-Key-ID": "<API_KEY>",
},
});
ws.on("open", () => {
ws.send(JSON.stringify({
text: "नमस्ते, आप कैसे हैं?",
model: "vachana-vc-v1",
audio_config: { sample_rate: 44100, encoding: "linear_pcm" },
speaker_embedding: {
embedding: "<your-embedding-string>",
shape: [1, 768],
dtype: "torch.bfloat16",
},
}));
});
ws.on("message", (data) => {
// Handle binary PCM audio chunks
});
The server streams binary PCM audio chunks over the WebSocket connection.
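To save the received PCM stream as a playable file, prepend a standard 44-byte WAV header once all chunks have arrived. A sketch using the canonical RIFF layout; the 44.1 kHz / mono / 16-bit defaults match the `audio_config` in the example above:

```typescript
// Prepend a canonical 44-byte RIFF/WAVE header to raw 16-bit PCM.
function pcmToWav(pcm: Uint8Array, sampleRate = 44100, channels = 1): Uint8Array {
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const ascii = (off: number, s: string) =>
    [...s].forEach((c, i) => (header[off + i] = c.charCodeAt(0)));
  const byteRate = sampleRate * channels * 2;
  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus 8
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, channels * 2, true);   // block align
  view.setUint16(34, 16, true);             // bits per sample
  ascii(36, "data");
  view.setUint32(40, pcm.length, true);
  const out = new Uint8Array(44 + pcm.length);
  out.set(header);
  out.set(pcm, 44);
  return out;
}
```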
Next Steps