Prerequisites
Before you begin, ensure you have:
- A valid API key (sign up on the Vachana platform to generate API keys)
- cURL installed, or an API client such as Postman
This quickstart covers three APIs:
- Speech-to-Text (STT)
- Text-to-Speech (TTS)
- Voice Cloning (VC)
Use a test audio file with the following requirements:
- Format: WAV, MP3, OGG, FLAC, AAC, M4A
- Sampling rate: 8 kHz – 44.1 kHz
- Maximum duration: 60 seconds
Your First Speech-to-Text Request
Minimal example to transcribe a Hindi audio file:

curl -X POST https://api.vachana.ai/stt/v3 \
-H 'Content-Type: multipart/form-data' \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file=@/path/to/your/audio.wav \
-F language_code=hi-IN
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/your/audio.wav: Path to your audio file
- hi-IN: Language code (see Language Codes for all options)
Expected STT Response
On success, you’ll receive a JSON response like:

{
"success": true,
"transcript": "नमस्ते, आप कैसे हैं?"
}
For real-time, low-latency transcription of streaming audio, Vachana provides a Realtime API.
Connection
Create a Realtime connection using your API credentials:

const ws = new Realtime("wss://api.vachana.ai/stt/v3", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Audio
Send raw PCM audio frames over the Realtime connection. Audio requirements:
- Format: PCM 16-bit
- Sample rate: 16 or 8 kHz
- Channels: Mono
- Chunk size: 1024 bytes per frame (512 samples)
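The frame-size requirement above can be met by slicing your raw PCM buffer before sending. A minimal sketch, assuming `ws` is the open Realtime connection from the previous step; the `toFrames` helper name is illustrative.

```javascript
// Sketch: split a raw PCM 16-bit buffer into 1024-byte frames
// (512 samples each) for the Realtime endpoint.
function toFrames(pcm, frameBytes = 1024) {
  const frames = [];
  for (let off = 0; off < pcm.length; off += frameBytes) {
    frames.push(pcm.subarray(off, off + frameBytes)); // last frame may be shorter
  }
  return frames;
}

// Usage over an open connection:
// for (const frame of toFrames(pcmBuffer)) ws.send(frame);
```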
Expected Realtime Response
The server sends JSON text frames containing transcription segments:

{
"type": "transcript",
"timestamp": "2024-01-15T10:30:05.987Z",
"text": "Hello, how are you today?",
"audio_duration_ms": 2340,
"segment_id": "seg_abc123",
"segment_index": 1,
"latency": 320,
"detected_language": "en"
}
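On the client side you typically accumulate these segments into a running transcript. A sketch using the field names from the frame above; ordering by segment_index to guard against out-of-order delivery is an assumption, and the `makeAssembler` helper is illustrative.

```javascript
// Sketch: assemble a running transcript from incoming JSON text frames.
function makeAssembler() {
  const segments = new Map(); // segment_index -> text
  return {
    onFrame(raw) {
      const msg = JSON.parse(raw);
      if (msg.type === "transcript") segments.set(msg.segment_index, msg.text);
    },
    transcript() {
      return [...segments.keys()]
        .sort((a, b) => a - b)
        .map((i) => segments.get(i))
        .join(" ");
    },
  };
}

// Usage: ws.on("message", (data) => assembler.onFrame(data.toString()));
```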
Have your input text ready. You’ll also need a voice name — see Voice Options for available voices.
Your First Text-to-Speech Call
Minimal example for REST TTS (synchronous audio). This endpoint returns the full synthesized audio as a binary response.

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
}
}' \
--output response.wav
Expected TTS Response
A successful request will return a 200 OK HTTP status. The response body will contain raw binary audio data representing the synthesized text, adhering to the format specified in your audio_config.

HTTP/1.1 200 OK
Content-Type: audio/wav
<binary audio data>
This endpoint streams synthesized audio using Server-Sent Events (SSE). Audio is generated and delivered incrementally as it becomes available.
Your First Streaming Call
curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}'
Expected SSE Response
A successful request will return a 200 OK HTTP status. The response body will contain a stream of server-sent events. Each chunk contains base64-encoded audio fragments.

HTTP/1.1 200 OK
Content-Type: text/event-stream
event: audio_chunk
data: UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=
event: audio_chunk
data: //NkxAAAAANIAAAAAExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
event: completed
data: {"status": "success"}
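To turn this stream into playable audio, parse the event/data lines and concatenate the decoded audio_chunk payloads. A sketch assuming the framing shown above; the `collectAudio` helper name is illustrative.

```javascript
// Sketch: parse an SSE body, decode base64 audio_chunk events,
// and concatenate them into a single audio buffer.
function collectAudio(sseText) {
  const chunks = [];
  let event = null;
  for (const line of sseText.split("\n")) {
    if (line.startsWith("event:")) {
      event = line.slice(6).trim();          // e.g. "audio_chunk", "completed"
    } else if (line.startsWith("data:") && event === "audio_chunk") {
      chunks.push(Buffer.from(line.slice(5).trim(), "base64"));
    }
  }
  return Buffer.concat(chunks);
}
```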
For ultra-low latency applications, the Realtime API allows you to stream text input and receive synthesized audio continuously.
Connection
Connect using a Realtime client with the required authentication headers:

const ws = new Realtime("wss://api.vachana.ai/api/v1/tts", {
headers: {
"x-api-key-id": "<API_KEY>",
}
});
Send Text
Once connected, send a JSON payload containing the text to synthesize.

{
"text": "नमस्ते, आप कैसे हैं?",
"voice": "sia",
"model": "vachana-voice-v2"
}
Audio Stream Response
Upon successful connection, the server returns a 101 Switching Protocols status to establish the Realtime connection. Once the text payload is sent, the server immediately begins streaming audio back as a sequence of binary Realtime frames containing raw PCM audio data, terminating or keeping the connection open depending on the application context.
Voice Cloning
Voice cloning works in two steps:
- Generate embeddings — upload a reference audio file to get a speaker_embedding
- Synthesize — pass the embedding with your text to any VC TTS endpoint
Step 1: Generate Voice Embeddings
Upload a reference audio file (WAV/MP3, ideally 5–30 seconds of clear speech):

curl -X POST https://api.vachana.ai/api/v1/tts/voice-clone/embeddings \
-H 'X-API-Key-ID: <API_KEY>' \
-F audio_file='@/path/to/reference.wav'
Replace these values:
- <API_KEY>: Your Vachana API key
- /path/to/reference.wav: Path to your reference audio file
Expected Embeddings Response
{
"embedding": "<embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
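If you are scripting the two steps, the embeddings response can be carried straight into the Step 2 request body. A sketch mirroring the field names shown in this guide; the `buildCloneRequest` helper is illustrative, not part of any SDK.

```javascript
// Sketch: turn the Step 1 embeddings response into a Step 2 request body.
function buildCloneRequest(text, embeddingResponse) {
  return {
    text,
    model: "vachana-vc-v1",
    speaker_embedding: {
      embedding: embeddingResponse.embedding,
      shape: embeddingResponse.shape,      // e.g. [1, 768]
      dtype: embeddingResponse.dtype,      // e.g. "torch.bfloat16"
    },
  };
}
```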
Step 2: Synthesize with Your Cloned Voice
Pass the speaker_embedding from Step 1 to synthesize audio in your cloned voice:

curl -X POST https://api.vachana.ai/api/v1/tts/inference \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"audio_config": {
"sample_rate": 44100,
"num_channels": 1,
"sample_width": 2,
"encoding": "linear_pcm",
"container": "wav"
},
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}' \
--output cloned_voice.wav
A successful request returns a 200 OK with raw binary audio data in the specified format.
Stream cloned voice audio progressively via Server-Sent Events. The response streams base64-encoded audio chunks, identical in format to the TTS SSE response:
curl -X POST https://api.vachana.ai/api/v1/tts/sse \
-H 'Content-Type: application/json' \
-H 'X-API-Key-ID: <API_KEY>' \
-d '{
"text": "नमस्ते, आप कैसे हैं?",
"model": "vachana-vc-v1",
"speaker_embedding": {
"embedding": "<your-embedding-string>",
"shape": [1, 768],
"dtype": "torch.bfloat16"
}
}'
For the lowest latency, stream text and receive cloned voice audio over a WebSocket. The server streams binary PCM audio chunks over the connection:
const ws = new WebSocket("wss://api.vachana.ai/api/v1/tts", {
headers: {
"Content-Type": "application/json",
"X-API-Key-ID": "<API_KEY>",
},
});
ws.on("open", () => {
ws.send(JSON.stringify({
text: "नमस्ते, आप कैसे हैं?",
model: "vachana-vc-v1",
audio_config: { sample_rate: 44100, encoding: "linear_pcm" },
speaker_embedding: {
embedding: "<your-embedding-string>",
shape: [1, 768],
dtype: "torch.bfloat16",
},
}));
});
ws.on("message", (data) => {
// Handle binary PCM audio chunks
});
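One way to fill in that message handler is to collect the binary frames into a single PCM buffer for playback or saving once the stream ends. A minimal sketch; the `makeCollector` helper name is illustrative.

```javascript
// Sketch: accumulate binary PCM frames from "message" events into one buffer.
function makeCollector() {
  const chunks = [];
  return {
    push(data) { chunks.push(Buffer.from(data)); },
    all() { return Buffer.concat(chunks); },
  };
}

// Usage with the connection above:
// const pcm = makeCollector();
// ws.on("message", (data) => pcm.push(data));
// ws.on("close", () => fs.writeFileSync("cloned_voice.pcm", pcm.all()));
```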
Next Steps
- Speech-to-Text: STT REST and STT Realtime for all STT parameters and language options
- Text-to-Speech: REST, Streaming (SSE), and Realtime for TTS options
- Voice Cloning: VC Embeddings, REST, Streaming, and Realtime for voice cloning options