Chapter 32

Voice AI & Conversational Interfaces

Here's a number that should reshape how you think about AI interfaces: 450 million.

That's the approximate number of internet users in India who prefer voice over text. They can speak fluently but typing — especially in their native language — is slow, error-prone, and frustrating. For these 450 million people, a chatbot with a text input box is not an AI product. It's a barrier.

Voice is the most natural interface humans have. We learn to speak years before we learn to read. We can speak roughly 150 words per minute — three times faster than most people type. Voice carries tone, emotion, urgency, and nuance that text flattens. And for millions of people who are less literate or differently abled, voice is not a preference. It's a necessity.

This chapter covers the technology stack that makes voice AI work, the unique challenges of building for India's linguistic landscape, and the production patterns that separate voice demos from voice products.

The Voice AI Stack

Every voice AI system follows the same fundamental pipeline, regardless of whether it's a customer support bot, a voice assistant, or a real-time conversational agent:

The Voice AI Pipeline:

    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │   CAPTURE    │    │     STT     │    │     LLM     │    │     TTS     │
    │              │───▶│  (Speech-   │───▶│  (Language   │───▶│  (Text-to-  │
    │  Microphone  │    │   to-Text)  │    │   Model)    │    │   Speech)   │
    │  audio input │    │  "Transcribe│    │  "Understand│    │  "Speak the │
    │              │    │   speech"   │    │   & respond" │    │   response" │
    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
         │                                                          │
         │                    ┌─────────────┐                       │
         │                    │   PLAYBACK   │                       │
         └────────────────────│              │◀──────────────────────┘
                              │  Speaker /   │
                              │  audio output│
                              └─────────────┘

    Total round-trip latency budget: &lt; 1.5 seconds for conversational feel
    ─────────────────────────────────────────────────────────────────────
    STT: ~300ms  │  LLM: ~500ms  │  TTS: ~200ms  │  Network: ~300ms

Each stage has its own technology choices, tradeoffs, and failure modes. Let's examine them.

Stage 1: Speech-to-Text (STT) — Hearing the User

STT converts raw audio — a waveform of sound — into text that an LLM can process. The state of the art has improved dramatically, but it's still not perfect, especially for non-English languages, accented speech, and noisy environments.

OpenAI Whisper is the most important STT model for builders to know. Released as open-source in September 2022, Whisper supports 99 languages, runs locally (no API needed), and achieves near-human accuracy on English transcription. Key facts:

Languages: 99, including Hindi, Tamil, Telugu, Bengali, Marathi, and most major Indian languages
Cost: Free (open-source, MIT license). You pay for compute infrastructure.
Accuracy: Word Error Rate (WER) of ~4% on English (human performance is ~5%). Hindi WER is approximately 12-18% depending on accent and recording quality.
Models: Ranges from "tiny" (39M parameters, runs on a phone) to "large-v3" (1.5B parameters, needs a GPU for real-time performance)
Special capability: Whisper handles code-mixed speech — sentences that blend languages, like "Bhai, mera order ka status kya hai?" — better than most alternatives because its training data included naturally code-mixed audio

Whisper Model Sizes and Use Cases:

  Model       Parameters   Speed (CPU)   Speed (GPU)   WER (English)
  ─────────   ──────────   ───────────   ───────────   ─────────────
  tiny        39M          ~10x realtime  ~50x         ~8.0%
  base        74M          ~7x realtime   ~40x         ~6.7%
  small       244M         ~4x realtime   ~25x         ~5.5%
  medium      769M         ~2x realtime   ~15x         ~4.8%
  large-v3    1.5B         ~0.5x          ~8x          ~4.2%
  ─────────   ──────────   ───────────   ───────────   ─────────────

  "~10x realtime" means 10 minutes of audio transcribed in 1 minute
  "~0.5x realtime" means 10 minutes of audio takes 20 minutes on CPU

Real-Life Example

In building PrabhupadaAI — a voice-enabled system that lets users have conversations with the teachings of Srila Prabhupada — we discovered a critical Whisper gotcha: CPU-only inference for long audio files is impractically slow. A 10-minute audio recording took over 20 minutes to transcribe on CPU with the large-v3 model. The solution was using the medium model for real-time interactions (acceptable quality, fast enough on CPU) and batching long-form transcriptions to a GPU-enabled service. This CPU vs. GPU decision is the first architectural choice every voice AI builder faces.

Alternatives to Whisper:

Deepgram: Commercial API, very fast (streaming transcription with <300ms latency), strong for real-time applications. Charges per minute of audio.
Google Cloud Speech-to-Text: Good Indian language support, pay-per-use, integrates with Google's ecosystem.
Azure Speech Services: Microsoft's offering, strong enterprise features, good Hindi support.
Sarvam AI: India-focused, built specifically for Indian languages and accents. More on this below.

Stage 2: The LLM — Understanding and Responding

Once speech is transcribed to text, the LLM processes it exactly as it would any text input. But voice contexts impose constraints that text contexts don't:

Transcription errors: The LLM receives "order ka states" instead of "order ka status." It needs to be robust to these kinds of errors.
Conversational brevity: Voice users speak in fragments. "Where's my order" not "Could you please check the status of order #4471?"
Response length: Voice responses must be concise. A 500-word text response that reads fine on screen is excruciating to listen to.
Turn-taking: The LLM needs to detect when the user has finished speaking (not paused mid-sentence) and when it should stop talking.

Why This Exists

The fundamental constraint of voice is that it's serial — both the user and the system can process audio at roughly 150 words per minute, no faster. A text interface can display 500 words in an instant; the user scans for what they need. A voice interface must speak those 500 words one at a time, taking over three minutes. This seriality constraint means voice AI responses must be dramatically shorter, more structured, and more direct than text responses.

Stage 3: Text-to-Speech (TTS) — The Voice of the System

TTS converts the LLM's text response into natural-sounding speech. The quality of TTS has undergone a revolution — from robotic, stilted speech five years ago to voices that are nearly indistinguishable from human speech today.

ElevenLabs is the current leader in voice quality and cloning. Key capabilities:

Voice cloning: Upload 30 seconds of audio and ElevenLabs creates a synthetic voice that sounds like the speaker. Quality improves with more audio — 5 minutes produces excellent results.
Multilingual: Supports 29 languages including Hindi, with natural-sounding prosody (the rhythm and intonation patterns of speech)
Latency: API latency of ~200-400ms for generating speech, with streaming support (start playing audio before the full response is generated)
Pricing: ~$0.15-0.30 per 1,000 characters, depending on plan

Real-Life Example

For PrabhupadaAI, we used ElevenLabs to clone Srila Prabhupada's distinctive voice from historical audio recordings. The technical approach: direct HTTP calls to ElevenLabs' API rather than their Python SDK. The SDK added unnecessary dependencies and abstraction. Direct HTTP gave us finer control over streaming, error handling, and timeout management. This "skip the SDK, use the API" pattern is a recurring theme in production voice systems where you need precise control over latency and error behavior.

Alternatives to ElevenLabs:

Google Cloud TTS: Excellent quality, good price, strong multilingual support. WaveNet voices are near-human quality.
Azure TTS: Microsoft's offering, very natural-sounding, good SSML (Speech Synthesis Markup Language) support for controlling pronunciation and pacing.
Coqui TTS: Open-source, runs locally, no API costs. Quality is good but not ElevenLabs-level.
Sarvam AI: Indian-language-first TTS with natural prosody for Hindi, Tamil, Telugu, and other languages.

Stage 4: Real-Time Voice (The Next Frontier)

The pipeline described above — STT → LLM → TTS — has an inherent latency problem. Each stage adds delay, and the total round-trip can reach 2-3 seconds. For conversational AI, that's too slow. Natural conversation has response gaps of 200-500 milliseconds.

Real-time voice models collapse the entire pipeline into a single model that processes audio input and produces audio output directly, without the text intermediary:

Traditional Pipeline:                    Real-Time Voice:
┌───────┐  ┌───────┐  ┌───────┐        ┌──────────────────────┐
│  STT  │→ │  LLM  │→ │  TTS  │        │  Single Model        │
│ 300ms │  │ 500ms │  │ 200ms │        │  Audio In → Audio Out │
└───────┘  └───────┘  └───────┘        │  ~320ms total         │
     Total: ~1,000ms + network          └──────────────────────┘

GPT-4o Realtime API: OpenAI's real-time voice mode processes audio natively — the model "hears" the user and "speaks" the response without text conversion. Latency drops below 500ms. The model can detect interruptions, adjust tone, and maintain conversational flow in ways that are impossible with the traditional pipeline.

Claude Voice: Anthropic's voice interface for Claude follows a similar real-time approach, with particular attention to maintaining Claude's careful, thoughtful response style in the voice medium.

The tradeoff: real-time voice models sacrifice some control. With the traditional pipeline, you can inspect and modify the text between STT and TTS — filtering harmful content, correcting errors, or redirecting the conversation. With real-time models, the audio-to-audio path is more opaque.

Voice AI for India: The 450 Million Opportunity

India's voice AI opportunity is unique in the world, and builders who understand its specific challenges will have an enormous advantage.

The Scale of the Opportunity

450M+ voice-preferring users: India has ~800 million internet users. Roughly 55-60% prefer voice interaction in their native language over typing.
22 official languages: Hindi, Bengali, Telugu, Marathi, Tamil, Urdu, Gujarati, Kannada, Malayalam, Odia, Punjabi, and 11 more. Each with distinct scripts, phonetics, and grammatical structures.
Code-mixing is the norm: Real Indian speech blends languages fluidly. A Hindi speaker might say "Bhai, mera order ka ETA kya hai? I ordered 2 hours back, still showing preparing." This isn't broken language — it's how 500 million+ Indians naturally communicate.
Low-bandwidth environments: Many users are on 3G or slow 4G connections. Audio streaming needs to work at low bitrates. Sending high-quality audio to a cloud API and waiting for a response is often impractical.

India's Voice AI Challenge Matrix:

  Challenge              Impact         Current State        Needed
  ────────────────────   ────────────   ────────────────     ──────────────
  Code-mixing            Very High      Whisper handles      Better models for
  (Hindi-English)                       reasonably well      low-resource pairs

  Regional accents       High           Models trained on    More diverse
  (100+ distinct)                       standard accents     training data

  Background noise       High           Works in quiet       Need robust
  (streets, shops)                      environments         noise handling

  Low bandwidth          Medium-High    Requires 4G+         Edge/offline
  (rural 3G)                            for real-time        STT models

  Script diversity       Medium         Good for major       Weak for
  (12+ scripts)                         languages            minority langs

  Conversational norms   Medium         Trained on Western   Need culturally
  (politeness markers,                  conversation         aware models
   honorifics)                          patterns

Case Study: Meesho's GenAI Voice Bot

Meesho, the Indian social commerce platform with 150M+ monthly active users, deployed a GenAI-powered voice bot for customer support in 2025. The numbers are striking:

60,000+ calls per day handled entirely by the voice bot
Languages: Hindi, English, Tamil, Telugu, Kannada, Bengali — six languages at launch
Resolution rate: 65-70% of queries resolved without human handoff
Average call duration: 2 minutes 15 seconds (down from 4+ minutes with human agents)
User satisfaction: Comparable to human agents for routine queries (order status, return initiation, delivery updates)

The architecture follows the standard pipeline (STT → LLM → TTS) but with India-specific optimizations:

Noise-robust STT: Meesho's users often call from busy marketplaces, auto-rickshaws, and crowded homes. Their STT pipeline includes a noise-suppression pre-processing step before transcription.
Code-mixing aware LLM: The system prompt explicitly instructs the LLM to understand and respond in the user's code-mixed style. If the user speaks Hinglish, the bot responds in Hinglish.
Culturally appropriate TTS: The bot's voice, speaking pace, and politeness markers (using "aap" not "tum," adding "ji" as an honorific) were designed with cultural consultants to feel natural to the target demographic.

Sarvam AI: Indian Languages First

Sarvam AI, founded in 2023 by Vivek Raghavan and Pratyush Kumar, is building AI infrastructure specifically for Indian languages. Their approach differs from global labs:

Training data: Built specifically from Indian-language audio and text corpora, not adapted from English-centric datasets
Models: STT and TTS models optimized for Indian languages, accents, and code-mixing patterns
Open models: Several of their models are open-source, enabling the Indian developer community to build on top
Latency optimization: Designed for Indian network conditions, with smaller model sizes that work on lower-end hardware

For builders creating voice products for the Indian market, Sarvam's models deserve serious evaluation alongside Whisper and ElevenLabs.

Production Voice AI: Patterns and Gotchas

Pattern 1: Streaming Everything

Voice AI has a latency budget of roughly 1.5 seconds for the full round trip. Exceeding this makes the conversation feel unnatural. The key to meeting this budget is streaming at every stage:

Streaming STT: Start processing audio while the user is still speaking
Streaming LLM: Begin generating the response as soon as enough context is available
Streaming TTS: Start synthesizing speech from the first sentence while the LLM is still generating the rest

Without streaming:
  User speaks (3s) → STT waits → STT runs (0.5s) → LLM runs (1.5s) → TTS runs (0.5s) → Play
  Total wait after user stops: 2.5 seconds

With streaming:
  User speaks → STT streams in real-time → LLM starts as STT sends partial text
  → TTS starts on first LLM sentence → Audio plays while LLM still generating
  Total wait after user stops: ~0.5-0.8 seconds

Pattern 2: Endpointing — Knowing When the User Has Stopped

One of the hardest problems in voice AI is endpointing (also called Voice Activity Detection or VAD) — determining when the user has finished speaking. Too aggressive, and you cut the user off mid-sentence. Too passive, and there's an awkward silence after the user finishes.

Common approaches:

Silence duration: Wait for 700ms-1200ms of silence before assuming the user is done. This is the simplest approach but fails when users pause to think mid-sentence.
Semantic endpointing: Use a small model to analyze the transcribed text so far and predict whether the sentence is complete. "I want to order" → incomplete. "I want to order a pizza" → likely complete.
Prosodic cues: Analyze pitch contour and energy patterns. Falling intonation typically signals the end of a statement. Rising intonation signals a question or continuation.

Why This Exists

Think about how you know when a friend has finished speaking in a phone conversation. You're not timing the silence. You're processing the meaning of their words, the intonation of their voice, and the conversational context. Good endpointing combines all of these signals.

Pattern 3: Graceful Degradation

Voice systems fail differently from text systems. When a text chatbot produces a bad response, the user can re-read it, try rephrasing, or scroll up for context. When a voice system fails, the user hears silence, garbled audio, or an irrelevant response with no way to "scroll back."

Design for failure:

STT failure (can't understand the user): "I'm sorry, I didn't catch that. Could you say that again?"
LLM latency spike: Play a brief filler sound or phrase ("Let me check that for you...") while waiting
TTS failure: Fall back to a pre-recorded message or text display
Network failure: Cache the most common responses locally. Order status, store hours, and basic FAQs can be served from device storage.

Pattern 4: The Hybrid Approach

The most practical voice AI architecture for many products isn't pure voice — it's voice with visual fallback:

Hybrid Voice + Visual Interface:

  ┌─────────────────────────────────────┐
  │         VOICE INPUT                  │  User speaks naturally
  │   "Show me red shoes under 2000"     │
  └──────────────────┬──────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────┐
  │         VOICE RESPONSE               │  Short, confirming
  │   "Here are red shoes under ₹2000"   │
  └──────────────────┬──────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────┐
  │         VISUAL DISPLAY               │  Product cards, images,
  │   [Shoe 1] [Shoe 2] [Shoe 3]        │  prices, "Add to Cart"
  │    ₹1,499   ₹1,899   ₹1,299        │  buttons
  └─────────────────────────────────────┘

Voice handles the input (faster than typing, especially in native languages). Visual handles the output (faster to scan than listening to a list of options). This hybrid approach is particularly powerful for e-commerce, food delivery, and any product where the user needs to compare options.

Production Gotcha: Audio Encoding and Format Confusion

A surprisingly common source of bugs in voice systems is audio format mismatch. Different components expect different formats:

Component	Common Expected Format
Whisper	16kHz, mono, 16-bit PCM or WAV
ElevenLabs TTS output	24kHz, mono, MP3 or PCM
Browser Web Audio API	44.1kHz or 48kHz, various formats
Phone systems (Twilio)	8kHz, mono, mulaw

Sending 44.1kHz stereo audio to Whisper (which expects 16kHz mono) doesn't produce an error — it produces garbage transcription. The model "hears" the audio at the wrong speed and frequency, like playing a 33 RPM record at 78 RPM. Always verify sample rate and channel count at every boundary in your pipeline.

The Future of Voice: Multimodal and Ambient

The trajectory of voice AI points toward two convergences:

Multimodal convergence: Models that process audio, video, text, and images simultaneously. GPT-4o already does this — it can "see" through a camera while "hearing" the user and "speaking" a response. The next generation will be seamless: point your phone at a restaurant menu, ask "What's good here?" in Hindi, and get a voice response that references both the menu items it can see and reviews it has retrieved.

Ambient AI: Voice interfaces that are always available without explicit invocation. No "Hey Siri" or button press. The system listens passively, detects when it's being addressed (a hard endpointing problem), and responds. Apple's on-device AI processing and always-on Neural Engine are designed for exactly this scenario.

For builders in India, the intersection of voice AI and India's linguistic diversity represents one of the largest underserved markets in technology. The tools exist. The demand exists. What's needed are builders who understand both the technology and the cultural context.

Try It Yourself

Voice pipeline latency test: Use any voice AI product (Siri, Google Assistant, Alexa, or an AI chatbot with voice). Time the gap between when you stop speaking and when the response audio begins playing. Is it under 1.5 seconds? Over 2 seconds? How does the latency feel from a conversational perspective?
Code-mixing challenge: If you speak a language other than English, try a voice assistant with a code-mixed query — a sentence that blends your language with English. How well does it handle the switch? Does it transcribe accurately? Does it respond in the same mixed style?
Whisper experiment: If you have a technical setup, install Whisper locally (pip install openai-whisper) and transcribe a recording of yourself speaking. Try different model sizes (tiny, small, medium, large). Compare the accuracy and speed. Notice how accuracy improves with model size but latency increases.
Voice UX audit: Pick a task you normally do by typing (ordering food, checking bank balance, booking a ride). Redesign the interaction as a voice-first experience. What changes? What information needs to be spoken vs. displayed visually? Where does voice excel and where does it struggle?

Chapter endnotes

[1] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. "Robust Speech Recognition via Large-Scale Weak Supervision." OpenAI, 2022. The Whisper paper, which demonstrated that training on 680,000 hours of multilingual audio with weak supervision produces a single model capable of 99-language transcription.

[2] India's voice-preferring user estimates drawn from multiple sources including IAMAI-Kantar ICUBE reports on Indian internet usage, Google-KPMG study on Indian languages, and Meesho's public data on user behavior. The 450M figure is approximate and growing as smartphone penetration increases.

[3] Meesho's GenAI Voice Bot metrics from public presentations by Meesho's technology team at industry conferences, 2025. Individual performance numbers may vary across languages and query types.

[4] ElevenLabs capabilities and pricing based on their public documentation and API as of March 2026. Voice cloning quality varies significantly with the quality and quantity of input audio — clean, studio-quality recordings produce markedly better clones than noisy phone recordings.

[5] GPT-4o Realtime API capabilities described in OpenAI's documentation. The real-time audio mode represents a fundamentally different architecture from the standard STT→LLM→TTS pipeline, processing audio tokens natively rather than converting to text first.

[6] Sarvam AI's capabilities based on their published models and public documentation. As an early-stage company focused specifically on Indian languages, their model offerings and performance characteristics are evolving rapidly.