AI Calling Agents: What the Numbers Actually Look Like in 2026

The claims around AI calling agents are all over the place. Some vendors say 90% cost reduction. Others talk about human-level CSAT. Most of these numbers come from 2-week pilots on 500 calls with a friendly QA team watching.
Here's what it actually looks like when you're running over a crore (10 million) voice minutes a month across real enterprise telephony: noisy audio, code-mixed languages, spiky traffic, and CRM integrations that sometimes time out.
First response time—and why the clock starts at form submission
This is the one metric where AI calling agents are unambiguously, not-even-close better than human teams.
| Response window | Lead conversion rate (avg) |
|---|---|
| < 1 minute | ~38–45% |
| 1–5 minutes | ~22–28% |
| 5–30 minutes | ~10–14% |
| 30 min – 2 hours | ~5–7% |
| > 2 hours | < 3% |
These numbers vary by industry, but the shape of the curve is consistent. Lead intent degrades fast. Someone who submitted a form at 10:43pm is still warm at 10:44pm. By the time your best rep calls at 9am, that person has looked at three competitors, talked to a channel partner, and maybe already scheduled a site visit somewhere else.
A human team can't reliably call within 60 seconds at 2am. An AI agent can. That's the gap.
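To make the cost of delay concrete, the table above can be turned into a small lookup. This is a minimal sketch: the window boundaries and rates are copied from the table, the open-ended "< 3%" row is assumed to mean 0–3%, and the function name is illustrative.

```python
from bisect import bisect_right

# Upper bound of each response window in seconds, taken from the table above.
WINDOW_UPPER_BOUNDS_S = [60, 5 * 60, 30 * 60, 2 * 60 * 60]

# (low, high) conversion-rate range per window. The last row ("< 3%") is
# assumed to mean 0-3%.
CONVERSION_RANGES = [
    (0.38, 0.45),  # < 1 minute
    (0.22, 0.28),  # 1-5 minutes
    (0.10, 0.14),  # 5-30 minutes
    (0.05, 0.07),  # 30 min - 2 hours
    (0.00, 0.03),  # > 2 hours
]


def conversion_range(response_delay_s: float) -> tuple[float, float]:
    """Return the (low, high) average conversion range for a response delay."""
    index = bisect_right(WINDOW_UPPER_BOUNDS_S, response_delay_s)
    return CONVERSION_RANGES[index]


# A form submitted at 10:43pm and called back at 9am waits 10 h 17 min.
overnight_delay_s = 10 * 3600 + 17 * 60
print(conversion_range(45))                 # AI agent dialing inside a minute
print(conversion_range(overnight_delay_s))  # human callback the next morning
```

The two calls land in the first and last rows of the table, which is the whole argument for first-response dialing in two lines.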
The production architecture for first-response dialing looks roughly like this:
# Webhook fires on new lead submission.
# Assumes a FastAPI-style app; crm, intent_scorer, dialer, LeadPayload and
# DIAL_THRESHOLD are application-specific and defined elsewhere.
from fastapi import FastAPI

app = FastAPI()

@app.post("/webhook/lead")
async def handle_new_lead(payload: LeadPayload):
    lead = await crm.upsert_lead(payload)

    # Score intent before dialing; don't waste agent time on junk
    intent_score = await intent_scorer.score(
        source=lead.source,
        form_data=lead.form_data,
        time_of_day=lead.submitted_at,
    )

    # Below the threshold the lead is stored but not dialed
    if intent_score.value < DIAL_THRESHOLD:  # typically 0.6–0.7
        return {"status": "skipped", "intent": intent_score.value}

    await dialer.queue_call(
        lead_id=lead.id,
        priority="high",
        delay_seconds=30,  # brief pause before ring
        language=lead.inferred_language or "hi-IN",
        agent_config="lead_qualification_v3",
    )
    return {"status": "queued", "intent": intent_score.value}
The 30-second delay before ring is intentional. Calling someone 4 seconds after they hit Submit feels like surveillance. 30–45 seconds feels like a fast, attentive team.
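One detail the fixed `delay_seconds=30` glosses over is that the clock starts at form submission, not at queue time. A minimal sketch of a delay that targets the 30–45 second window measured from submission follows; the function name and the use of random jitter are assumptions, not part of the handler above.

```python
import random
from typing import Optional

MIN_RING_DELAY_S = 30.0  # lower bound of the window described in the text
MAX_RING_DELAY_S = 45.0  # upper bound of the window described in the text


def ring_delay_seconds(
    elapsed_since_submit_s: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before ringing so the call lands 30-45 s after submit.

    elapsed_since_submit_s is the time already spent on CRM upsert and
    intent scoring. If that already exceeds the target, dial immediately.
    """
    rng = rng or random.Random()
    target_s = rng.uniform(MIN_RING_DELAY_S, MAX_RING_DELAY_S)
    return max(0.0, target_s - elapsed_since_submit_s)


# Example: scoring took 4 seconds, so wait roughly 26-41 more seconds.
print(round(ring_delay_seconds(4.0), 1))
```

The jitter is a design choice rather than a requirement: it keeps every callback from arriving at exactly the same offset.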
Conversation quality—the honest version
AI agents are better than your worst human agents and worse than your best ones. At scale, though, you don't get to staff only your best agents.
Here's what the performance distribution actually looks like across a 10,000-agent human contact center versus a deployed AI agent on the same call type:
| Metric | Bottom 20% human agents | Median human agent | Top 20% human agents | AI agent |
|---|---|---|---|---|
| Script adherence | 52% | 74% | 91% | 97% |
| CRM data completeness | 43% | 68% | 88% | 94% |
| First-call qualification rate | 28% | 51% | 69% | 55–62% |
| Avg handle time (qualification) | 8.2 min | 5.6 min | 4.1 min | 3.8 min |
| Language switch accuracy | 31% | 58% | 79% | 91% |
The AI lands around the 65th–70th percentile of human performance on qualification calls. That's the honest benchmark. Not "beats all humans"—beats most of the people actually taking calls on a Tuesday afternoon.
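The percentile estimate can be sanity-checked from the qualification-rate row alone. This sketch assumes the three human columns sit at roughly the 10th, 50th, and 90th percentiles (the middle of each bucket) and interpolates linearly between them; both are simplifying assumptions, not something the table states.

```python
# (percentile, first-call qualification rate in %) for the human columns.
# Placing the bottom-20% and top-20% buckets at the 10th and 90th
# percentiles is an assumption.
HUMAN_ANCHORS = [(10.0, 28.0), (50.0, 51.0), (90.0, 69.0)]


def interpolate_percentile(rate: float, anchors=HUMAN_ANCHORS) -> float:
    """Linearly interpolate the human percentile matching a given rate."""
    for (p_lo, r_lo), (p_hi, r_hi) in zip(anchors, anchors[1:]):
        if r_lo <= rate <= r_hi:
            return p_lo + (rate - r_lo) / (r_hi - r_lo) * (p_hi - p_lo)
    raise ValueError("rate is outside the range covered by the anchors")


ai_low, ai_high = 55.0, 62.0  # AI agent range from the table
ai_mid = (ai_low + ai_high) / 2
print(round(interpolate_percentile(ai_low), 1))
print(round(interpolate_percentile(ai_mid), 1))
print(round(interpolate_percentile(ai_high), 1))
```

Under these assumptions the midpoint of the AI range falls at about the 67th percentile, with the ends of the range at roughly the 59th and 74th.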
For complex objection handling, relationship selling, and anything requiring genuine empathy, humans still win. A well-built deployment is designed around that: its escalation logic is tuned to hand off when the conversation goes somewhere the playbook doesn't cover.
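A hand-off rule of this kind might look like the sketch below. Everything in it is hypothetical: the playbook intents, the signal names, and the thresholds are illustrative placeholders, not values from a real deployment.

```python
from dataclasses import dataclass

# Hypothetical set of intents the qualification playbook covers.
PLAYBOOK_INTENTS = {"pricing", "possession_timeline", "site_visit", "location"}


@dataclass
class TurnSignals:
    intent: str                       # classifier output for this turn
    intent_confidence: float          # 0.0-1.0
    caller_requested_human: bool = False
    consecutive_objections: int = 0


def should_escalate(
    turn: TurnSignals,
    min_confidence: float = 0.5,      # assumed threshold
    max_objections: int = 2,          # assumed threshold
) -> tuple[bool, str]:
    """Return (escalate?, reason) for a single conversation turn."""
    if turn.caller_requested_human:
        return True, "caller_requested_human"
    if turn.intent not in PLAYBOOK_INTENTS:
        return True, "off_playbook"
    if turn.intent_confidence < min_confidence:
        return True, "low_confidence"
    if turn.consecutive_objections > max_objections:
        return True, "repeated_objections"
    return False, "continue"


print(should_escalate(TurnSignals("loan_restructuring", 0.9)))
print(should_escalate(TurnSignals("pricing", 0.8)))
```

Returning the reason alongside the decision matters later, because escalation accuracy can only be audited if each hand-off records why it happened.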
How the STT layer actually handles Indian telephony audio
This is the part nobody explains properly in sales decks, but it's where most voice AI deployments fail quietly.
Indian telephony audio is hard. 8kHz compressed PSTN. Background noise—traffic, family members talking, TV. Regional accents on top of accents. And code-mixing that switches mid-word.
A typical production call might sound like: "Haan, main Galleria wala flat dekhna chahta hoon, but possession timeline kya hai?"
Standard off-the-shelf STT trained on US English or even formal Hindi will produce garbled output on this. The word error rate (WER) on clean English benchmarks means nothing here.
What matters in production:
| STT condition | Acceptable WER threshold | What happens above it |
|---|---|---|
| Clean 16kHz audio | < 5% | Baseline—easy |
| 8kHz PSTN, quiet | < 10% | Manageable |
| 8kHz PSTN, noisy | < 18% | Entity extraction degrades |
| Code-mixed Hinglish | < 15% | LLM can compensate partially |
| Vernacular (Odia, Punjabi) | < 20% | Needs language-specific fine-tuning |
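For reference, WER is the word-level edit distance between the reference transcript and the STT output, divided by the number of reference words. A minimal sketch for measuring it on your own recordings follows; lowercasing and whitespace tokenization are simplifying assumptions, and production scoring usually normalizes numerals and punctuation as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "main Srusti Estates dekhna chahta hoon"
hypothesis = "main Fruity Estates dekhna chahta hoon"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

One misheard entity in a six-word utterance is already about 17% WER, which is why short telephony turns are so sensitive to proper nouns.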
When WER crosses 20%, the downstream LLM starts hallucinating corrections. It fills gaps with statistically likely words, not the right ones. A customer says "Srusti Estates" and the model hears "Fruity Estates"—the reasoning layer invents a plausible-sounding answer about a property that doesn't exist.
The fix isn't purely a better model. It's domain-specific fine-tuning on your actual call recordings, plus entity biasing—injecting your property names, brand names, and product terminology into the decoding beam search so they score higher than phonetically similar nonsense.
# Entity biasing in STT request (Sarvam / similar API pattern).
# Field names are illustrative; check them against your provider's schema.
stt_config = {
    "model": "saarika:v2",
    "language_code": "hi-IN",
    "audio_encoding": "MULAW",
    "sample_rate_hertz": 8000,
    "speech_contexts": [
        {
            "phrases": DOMAIN_ENTITIES,  # ["Galleria", "Srusti", "2-BHK", ...]
            "boost": 15.0,               # confidence boost on these tokens
        }
    ],
    "enable_automatic_punctuation": True,
    "model_variant": "telephony",  # different from standard, and critical
}
The telephony model variant is not optional. Systems that use the standard model variant on 8kHz audio and wonder why accuracy is low are making a very basic mistake.
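Real entity biasing happens inside the decoder. Where a provider only returns an n-best list, a rough approximation is to rescore the hypotheses after the fact. The sketch below is that simplified post-hoc rescoring, not beam-search biasing; the hypothesis scores, the entity list, and the boost value are all made up for illustration.

```python
DOMAIN_ENTITIES = ["Galleria", "Srusti", "2-BHK"]  # illustrative entity list


def rescore_with_entities(hypotheses, entities, boost=2.0):
    """Pick the best transcript after adding a bonus per matched entity.

    hypotheses: list of (text, score) pairs where a higher score is better,
    for example log-probabilities from an n-best list.
    """
    wanted = {entity.lower() for entity in entities}

    def boosted_score(item):
        text, score = item
        words = {word.strip(".,?!").lower() for word in text.split()}
        return score + boost * len(words & wanted)

    return max(hypotheses, key=boosted_score)[0]


n_best = [
    ("Fruity Estates ka price kya hai", -4.1),  # acoustically preferred
    ("Srusti Estates ka price kya hai", -4.6),  # correct, slightly lower score
]
print(rescore_with_entities(n_best, DOMAIN_ENTITIES))
```

The boost here is on a different scale from the `boost` field in the request config and would need tuning against held-out recordings; set it too high and the agent starts hearing property names in unrelated speech.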
Language handling—the metric that's usually missing from dashboards
India has 22 scheduled languages. The production AI calling agents running today handle 12+ of them in a single deployment. The number that doesn't get reported is how many qualified conversations enterprises are not having because their AI only works in Hindi and English.
A rough estimate from campaigns across Tier-2 and Tier-3 markets:
| Language cluster | Share of inbound leads (Indian RE market avg) | Share typically handled by AI |
|---|---|---|
| Hindi | 34% | 95%+ |
| English / Hinglish | 28% | 90%+ |
| Tamil | 9% | 75–85% |
| Telugu | 8% | 75–85% |
| Kannada | 6% | 70–80% |
| Marathi | 5% | 70–80% |
| Gujarati | 4% | 65–75% |
| Bengali, Odia, others | 6% | 40–60% |
That last row is where conversations fall through the cracks. The cluster is 6% of leads and only 40–60% of it is handled, so an AI agent that can't handle Odia well is silently failing roughly 2–4% of your leads. They still get a call. The call just doesn't go anywhere useful, and the CRM records it as "no response" rather than "language mismatch."
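The size of that silent loss follows directly from the table: a cluster's share of leads multiplied by the fraction the AI does not handle. The sketch below works through it, reading "95%+" and "90%+" as 95–100% and 90–100%, which is an assumption about the open-ended rows.

```python
# cluster -> (share of inbound leads, handled low, handled high), from the table.
CLUSTERS = {
    "Hindi": (0.34, 0.95, 1.00),
    "English / Hinglish": (0.28, 0.90, 1.00),
    "Tamil": (0.09, 0.75, 0.85),
    "Telugu": (0.08, 0.75, 0.85),
    "Kannada": (0.06, 0.70, 0.80),
    "Marathi": (0.05, 0.70, 0.80),
    "Gujarati": (0.04, 0.65, 0.75),
    "Bengali, Odia, others": (0.06, 0.40, 0.60),
}


def unhandled_share(share, handled_low, handled_high):
    """Return (best case, worst case) fraction of ALL leads left unhandled."""
    return share * (1 - handled_high), share * (1 - handled_low)


best, worst = unhandled_share(*CLUSTERS["Bengali, Odia, others"])
print(f"last row: {best:.1%} to {worst:.1%} of all leads")

total_best = sum(unhandled_share(*row)[0] for row in CLUSTERS.values())
total_worst = sum(unhandled_share(*row)[1] for row in CLUSTERS.values())
print(f"all clusters: {total_best:.1%} to {total_worst:.1%} of all leads")
```

The last row alone accounts for 2.4% to 3.6% of all leads, and summed over every cluster the gap is roughly 8% to 17%, none of which shows up on a dashboard that only reports connect rate.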
Code-mixing is a separate problem from multilingual support. A model that knows Hindi and knows English doesn't automatically handle Hinglish—the syntax, the pronoun-dropping, the verb conjugations that don't match either parent language. It needs specific training on code-mixed corpora, and most commercial models haven't done this at depth beyond the major pairs.
Latency architecture—what "under 1 second" actually requires
Sub-second turn latency is not a default. It's the result of deliberate architectural choices at every layer.
Caller speaks → [end-of-speech detection] → STT stream → LLM inference → TTS stream → audio playback
Target: < 900ms total
Here's where time actually goes:
| Component | Target latency | Where it breaks |
|---|---|---|
| End-of-speech detection | 80–150ms | Silence threshold too conservative (adds 400ms) |
| STT (streaming, first token) | 120–180ms | Using batch mode instead of streaming |
| LLM first token | 200–350ms | Cold container start, large context window |
| TTS first audio chunk | 150–220ms | Waiting for full response before synthesizing |
| Network + telephony jitter | 50–120ms | Cross-region routing, no edge inference |
| Total (optimistic) | 600–1020ms | |
| Total (typical, untuned) | 1800–3500ms | Feels broken |
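The budget in the table is easy to keep honest in code. This sketch sums the per-component targets and flags any stage whose measured latency exceeds its upper bound; the component keys and the measured values are illustrative.

```python
# component -> (low target ms, high target ms), copied from the table above.
BUDGET_MS = {
    "end_of_speech_detection": (80, 150),
    "stt_first_token": (120, 180),
    "llm_first_token": (200, 350),
    "tts_first_audio_chunk": (150, 220),
    "network_and_jitter": (50, 120),
}


def total_budget(budget):
    """Sum the low and high ends of every component target."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """Return the components whose measured latency exceeds the high target."""
    return {
        name: ms for name, ms in measured_ms.items() if ms > budget[name][1]
    }


print(total_budget(BUDGET_MS))

# Illustrative measurement: a conservative silence threshold adds ~400 ms.
measured = {
    "end_of_speech_detection": 550,
    "stt_first_token": 160,
    "llm_first_token": 310,
    "tts_first_audio_chunk": 200,
    "network_and_jitter": 90,
}
print(over_budget(measured, BUDGET_MS))
```

The totals come out to 600 ms and 1020 ms, matching the optimistic row. Since the high end exceeds the 900 ms target, not every component can sit at the top of its range at once.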
The two biggest latency killers in practice: waiting for silence to confirm end-of-speech (instead of predicting it), and waiting for the full LLM response before starting TTS synthesis.
Streaming TTS—sending the first sentence to speech while the model is still generating the rest—cuts perceived latency by 300–500ms. Users hear the agent start speaking within 600ms even if the full answer takes 1.2 seconds to generate.
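SENTENCE_ENDINGS = (".", "?", "!", "।")  # assumed set; includes the Devanagari danda

def ends_sentence(text: str) -> bool:
    # Simple punctuation check; a production splitter would handle abbreviations
    return text.rstrip().endswith(SENTENCE_ENDINGS)

async def speak(text, tts_client, audio_buffer):
    # Synthesize one sentence as 8kHz mu-law and hand it to the telephony leg
    audio_chunk = await tts_client.synthesize_streaming(
        text=text,
        voice="aarya-v2",
        sample_rate=8000,
        encoding="MULAW",
    )
    await audio_buffer.push(audio_chunk)

async def stream_response_to_audio(llm_stream, tts_client, audio_buffer):
    sentence_buffer = ""
    async for token in llm_stream:
        sentence_buffer += token
        # Don't wait for full response; synthesize sentence by sentence
        if ends_sentence(sentence_buffer):
            await speak(sentence_buffer, tts_client, audio_buffer)
            sentence_buffer = ""
    # Flush trailing text that never reached sentence-final punctuation
    if sentence_buffer.strip():
        await speak(sentence_buffer, tts_client, audio_buffer)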
This is the kind of implementation detail that separates 700ms systems from 2.5-second systems. The model being used is often the same. The architecture around it is what changes the number.
What to measure—and how vendors obscure each one
If you're evaluating an AI calling agent vendor, these are the metrics that matter. For each one, the table shows what a good number looks like and the substitute figure a vendor may offer in its place.
| Metric | What good looks like | Red flag |
|---|---|---|
| First-call resolution rate | 55–70% on L1 queries | "Deflection rate" (not the same thing) |
| Escalation accuracy | > 88% appropriate escalations | Overall escalation rate without accuracy |
| CRM completeness post-call | > 90% fields populated | "Conversation logs available" |
| P95 turn latency | < 1200ms on telephony | Median latency only (hides tail) |
| WER on your audio | < 15% on production recordings | Benchmark dataset WER |
| Language switch accuracy | > 85% on code-mixed | English-only test accuracy |
| CSAT on AI-handled calls | Within 0.3 pts of human-handled | CSAT on escalated calls only |
The demos are always good. The pilots on clean audio in a quiet office are always good. The metric that reveals everything is P95 latency measured on production telephony audio during a campaign launch with 800 concurrent calls. Ask for that number. Most vendors won't have it.
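Why median latency hides the problem is easiest to see with numbers. The sketch below uses the nearest-rank percentile method on a made-up sample in which one turn in ten hits a slow path; the latency values are illustrative, not measurements.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% at or below."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative turn latencies in ms: 90 fast turns, 10 that hit a slow path
# (for example a cold container or a CRM timeout).
turn_latencies_ms = [700] * 90 + [2600] * 10

print("median:", statistics.median(turn_latencies_ms))
print("p95:", percentile_nearest_rank(turn_latencies_ms, 95))
```

The median reports 700 ms while the P95 reports 2600 ms. A caller who hits the slow path experiences the second number, and over a multi-turn call most callers hit it at least once.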
The production stack in one diagram
For reference—this is roughly what a production AI calling agent stack looks like when it's built to hold up:
┌─────────────────────────────────────────────────┐
│ TELEPHONY LAYER │
│ Exotel / TataTele / Twilio SIP │
│ Inbound + Outbound │
└────────────────────┬────────────────────────────┘
│ 8kHz MULAW audio stream
┌────────────────────▼────────────────────────────┐
│ SPEECH-TO-TEXT (STT) │
│ Fine-tuned on telephony + domain entities │
│ Streaming mode · WER < 12% on prod audio │
└────────────────────┬────────────────────────────┘
│ text + confidence scores
┌────────────────────▼────────────────────────────┐
│ REASONING LAYER (LLM) │
│ Retrieval-grounded · Policy + product docs │
│ Tool calls: CRM read/write, calendar, payment │
│ Guardrails: topic scope, escalation triggers │
└────────────────────┬────────────────────────────┘
│ response text (streaming)
┌────────────────────▼────────────────────────────┐
│ TEXT-TO-SPEECH (TTS) │
│ Streaming synthesis · MOS 4.2/5 │
│ 30+ voices · Barge-in detection │
└────────────────────┬────────────────────────────┘
│ 8kHz audio back to caller
┌────────────────────▼────────────────────────────┐
│ POST-CALL PIPELINE │
│ CRM sync · Transcript storage · QA scoring │
│ WhatsApp follow-up trigger · Analytics ingest │
└─────────────────────────────────────────────────┘
Each layer is independently scalable. The CRM sync and post-call pipeline run async—they don't sit in the critical latency path of the actual conversation.
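Keeping post-call work off the conversation path usually comes down to a queue and a background worker. This is a minimal sketch using `asyncio.Queue`; the record shape and function names are hypothetical, and the sleep stands in for the real CRM sync, transcript storage, and QA scoring.

```python
import asyncio


async def post_call_worker(queue: asyncio.Queue, handled: list) -> None:
    """Drain finished-call records and do the slow work in the background."""
    while True:
        record = await queue.get()
        try:
            # Stand-in for CRM sync, transcript storage and QA scoring.
            await asyncio.sleep(0.01)
            handled.append(record["call_id"])
        finally:
            queue.task_done()


def end_call(queue: asyncio.Queue, call_id: str, transcript: str) -> None:
    """Called at hang-up: enqueue the record and return immediately."""
    queue.put_nowait({"call_id": call_id, "transcript": transcript})


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    handled: list = []
    worker = asyncio.create_task(post_call_worker(queue, handled))

    end_call(queue, "call-001", "transcript text")
    end_call(queue, "call-002", "transcript text")

    await queue.join()  # wait only so the demo can show the result
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return handled


print(asyncio.run(demo()))
```

In production the queue would typically be a durable broker rather than an in-process one, so that a crashed worker does not lose call records.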
The vendors running this stuff well publish real numbers from production. The ones running it less well publish benchmark numbers from curated test sets.
Ask for P95 latency on telephony audio. Ask for WER on a sample of your own call recordings. Ask for first-call resolution rate, not deflection rate. These three questions filter the market faster than any demo.
