Voice AI in Production: Why Agents That Pass Testing Still Break in the Wild
Key takeaways
- Voice AI in production typically handles 60-70% of real calls successfully when pre-launch testing suggested 90-95%. The gap comes from test coverage rather than the underlying model.
- The eight categories where test sets miss production conditions: audio realism, accents, frustrated callers, integration surprises, edge-case business logic, model drift, scale, and unknown unknowns.
- Closing the gap requires three layers: pre-launch simulation, production observability, and a feedback loop that turns production failures into new test cases.
- The biggest single lever for most teams is audio realism. Clean test audio is the most common reason staging numbers don’t predict voice AI production performance.
- Coval is built around this three-layer methodology: simulation, observability, and continuous evaluation infrastructure derived from the Waymo playbook for self-driving cars.
Table of contents
- How wide the voice AI production gap is
- What production has that staging usually doesn’t
- Why test suites miss the conditions voice AI production exposes
- Closing the gap: the three-layer approach
- What “production-realistic” testing looks like
- What teams say after they’ve closed the gap
- Common mistakes when trying to close the gap
- Where to go from here
- Frequently asked questions
The most common story we hear about voice AI in production goes something like this. The team builds an agent. The agent handles 95 percent of test calls in development. The team ships. Two weeks later, support starts logging complaints: calls being dropped, callers getting routed to dead ends, the agent saying something it shouldn’t have said. The team pulls the production logs and discovers the agent is handling 60 to 70 percent of real calls successfully. The other 30 to 40 percent are failing in ways the test suite never caught.
That gap usually comes from coverage rather than the model itself. The test set didn’t include the audio conditions, the caller behaviors, the integration edge cases, or the adversarial situations that real production exposes. The evaluation missed what the model would face in production, and the team can’t tell that until customers are already complaining. Coval’s view, drawn from our work with voice AI teams across healthcare, insurance, fintech, restaurants, and government, is that this is the dominant pattern in voice AI today.
This guide is for voice AI teams that have made it to production and want to understand why the demo-to-production gap exists, what conditions it shows up in, and how to close it before the next launch.
How wide the voice AI production gap is
Across the voice AI teams we talk to, the pattern is consistent. Pre-launch evaluation suggests the agent will handle 90 to 95 percent of calls successfully. Production performance typically lands 20 to 30 percentage points below that. The agent that looked great in staging is meaningfully worse in the wild.
The number varies by use case. The gap is bigger for use cases with more variable inputs (drive-through ordering, emotional customer service calls) and smaller for use cases with more structured inputs (appointment confirmation, status checks). It’s bigger for teams with less mature evaluation infrastructure and smaller for teams that test against realistic conditions. The gap is almost never zero, and almost always larger than the team expected.
A rough sketch of how the gap scales across deployment types:
| Use case profile | Pre-launch test pass rate | Typical production rate | Gap |
|---|---|---|---|
| Structured (appointment confirm, status check) | 95% | 85-90% | 5-10 pts |
| Mixed (insurance intake, support triage) | 92% | 70-80% | 12-22 pts |
| Variable (drive-thru, emotional service) | 90% | 55-65% | 25-35 pts |
These ranges reflect what Coval sees across customer engagements. Platform choice affects the gap too. See voice AI platform comparison 2026 for benchmark data on how the major platforms differ on the metrics that translate to production behavior.
One founder of a voice AI startup framed the underlying problem directly: “We’re not going to have visibility to all the different variations.” The variations are infinite. The test set is finite. The art is in choosing the right variations to test.
What production has that staging usually doesn’t
The conditions that show up in production and that test sets typically miss are predictable. The same eight categories appear in nearly every post-launch incident.
1. Audio quality the test set never covered
Test audio is usually clean. Production audio is anything but. Road noise, restaurant background, low-bandwidth phone connections, callers in moving subway cars, iPhone noise cancellation that drops critical syllables, echo cancellation failures that produce double-talk. None of this shows up in development testing unless the team intentionally builds for it.
The pattern, as one voice AI engineering lead at a major insurance carrier put it: “Most voices that are being sold are modeled against people that speak very well. But when you look at people that are actually speaking on the phone, they’re going to have stuttering.” The test set is rarely calibrated for that reality.
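One way to build for noisy audio intentionally is to mix recorded noise into clean test utterances at a controlled signal-to-noise ratio (SNR). The sketch below is a minimal illustration under stated assumptions: audio is a plain list of float samples, a sine tone stands in for speech, and Gaussian noise stands in for road noise. The function names `rms` and `mix_at_snr` are hypothetical and are not part of any particular library.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it to `speech`."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    # SNR(dB) = 20 * log10(rms_speech / rms_noise), solved here for the target noise RMS.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]

# Stand-in signals (assumptions): a 440 Hz tone for "speech", Gaussian noise for "road noise".
rng = random.Random(0)
sample_rate = 8000  # narrowband telephony sample rate
speech = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy = mix_at_snr(speech, noise, snr_db=5)

# Sanity check: recover the added noise and confirm the SNR that was actually achieved.
added = [n - s for n, s in zip(noisy, speech)]
achieved_snr_db = 20 * math.log10(rms(speech) / rms(added))
```

Running the same utterance through the agent at several SNR levels (for example 20, 10, and 5 dB) gives a rough curve of where transcription starts to fail, which is more informative than a single clean-audio pass rate.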
2. Accents and dialects outside the development team’s experience
Southern American accents cause turn-detection issues with some STT models. Regional slang for menu items trips up ordering agents: as one drive-through customer put it, the same words can mean “very different things.” Non-native English accents degrade transcription accuracy. Multilingual deployments launching in Portuguese, Spanish, Japanese, or Hindi expose quality issues the development team often can’t validate by ear.
The development team typically has demographic blind spots. A team in San Francisco testing primarily on their own voices is not catching the conditions that surface when the agent launches in Atlanta, Mumbai, or São Paulo. Audio-environment testing with realistic accent variation is essential and underdone.
3. Frustrated, adversarial, or unusual callers
Test scenarios usually feature cooperative callers who follow the agent’s lead. Production scenarios feature callers who arrive frustrated after waiting through an 8-minute IVR menu, callers gaming the system for refunds they don’t qualify for, callers who interrupt the agent three times in a row before stating their actual question.
A fintech voice AI lead framed the recovery problem directly: “As soon as we make a mistake, the conversation completely flips.” Recovery from an adversarial caller requires a different set of behaviors than the base case, and most test suites don’t exercise those behaviors enough.
4. Integration layer surprises
The CRM returns a slightly different schema than the test environment used. The EHR system rate-limits at a level the test suite didn’t anticipate. The payment processor takes 9 seconds to respond instead of the typical 800 milliseconds. The telephony provider drops every twentieth packet on a specific carrier route.
Most production voice AI failures live at the seams between the agent and the systems it depends on, rather than inside the model itself. If the test suite uses mocked integrations, none of these conditions get exercised, and the integration layer becomes the dominant source of post-launch bugs.
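A middle ground between pure mocks and live systems is a mock that misbehaves on purpose. The sketch below wraps a stand-in backend so that some calls time out and some responses lose a field, then checks that a toy agent step degrades gracefully. Every name here (`with_faults`, `lookup_account`, `agent_turn`, the failure rates) is a hypothetical illustration, not a real CRM or agent API.

```python
import random

class IntegrationTimeout(Exception):
    """Raised when the simulated backend is too slow to answer."""

def with_faults(backend, rng, timeout_rate=0.0, drop_field_rate=0.0):
    """Wrap `backend` so some calls time out and some responses lose a field."""
    def wrapped(request):
        if rng.random() < timeout_rate:
            # Stands in for a dependency that takes seconds instead of milliseconds.
            raise IntegrationTimeout("simulated slow backend")
        response = dict(backend(request))
        if response and rng.random() < drop_field_rate:
            # Stands in for schema drift: one field silently disappears.
            response.pop(rng.choice(sorted(response)))
        return response
    return wrapped

def lookup_account(request):
    """Hypothetical CRM lookup used by the agent."""
    return {"account_id": request["phone"][-4:], "status": "active", "balance": 42.0}

def agent_turn(crm, phone):
    """Toy agent step: it should escalate cleanly instead of crashing."""
    try:
        account = crm({"phone": phone})
    except IntegrationTimeout:
        return "escalate: backend slow"
    if "status" not in account or "account_id" not in account:
        return "escalate: incomplete record"
    return f"account {account['account_id']} is {account['status']}"

rng = random.Random(7)
crm = with_faults(lookup_account, rng, timeout_rate=0.2, drop_field_rate=0.3)
outcomes = [agent_turn(crm, "555-0100") for _ in range(200)]
summary = {outcome: outcomes.count(outcome) for outcome in set(outcomes)}
```

The useful assertion in a real suite is that no call raises an unhandled exception and that every degraded call ends in an acceptable outcome, such as a clean escalation.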
5. Edge-case business logic
The customer has multiple accounts. The patient has a name change in the EHR that doesn’t match the insurance record. The order was placed at one location but the customer is asking about it at another. The caller’s voice matches one account but their phone number is registered to a different one.
Business logic edge cases multiply with the complexity of the use case. Test suites rarely cover more than a small fraction of them, and the long tail is where production hits the team unexpectedly.
6. Model drift
The model vendor releases an update. The version string doesn’t change. The model is now subtly different: more verbose, slightly different intent recognition, different tool-calling behavior. The agent that worked yesterday quietly works less well today, and the team won’t notice until enough production traffic exposes the regression.
One QA lead at a voice AI company described the pattern: “Maybe it was working 100 percent of the time and suddenly it’s working only say 96 percent.” A 4-point drop is the kind of regression that’s invisible unless you’re monitoring for it explicitly.
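Monitoring for a drop like this explicitly can be as simple as comparing the pass rate in a recent window against a baseline window and asking whether the difference is larger than noise. The sketch below uses a one-sided two-proportion z-test; this is one possible choice rather than a prescribed method, the normal approximation is rough when pass rates sit close to 100 percent, and the call counts are made up for illustration.

```python
import math

def pass_rate_drop(baseline_pass, baseline_total, recent_pass, recent_total):
    """One-sided two-proportion z-test: has the pass rate dropped since baseline?

    Returns (drop_in_points, p_value). A small p_value means the drop is
    unlikely to be sampling noise.
    """
    p_base = baseline_pass / baseline_total
    p_recent = recent_pass / recent_total
    pooled = (baseline_pass + recent_pass) / (baseline_total + recent_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / recent_total))
    if std_err == 0:
        return 0.0, 1.0  # both windows all-pass or all-fail: no evidence of a drop
    z = (p_base - p_recent) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under "no change"
    return 100 * (p_base - p_recent), p_value

# The same 4-point drop (100% -> 96%) at two hypothetical traffic volumes.
drop_small, p_small = pass_rate_drop(50, 50, 48, 50)         # 50 calls per window
drop_large, p_large = pass_rate_drop(1000, 1000, 960, 1000)  # 1,000 calls per window
```

With 50 calls per window the drop does not clear a conventional 0.05 threshold, while with 1,000 calls it clearly does. That is the practical reason a small regression stays invisible until enough traffic has accumulated, and why a fixed regression suite with a large scenario count helps detect it sooner.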
7. Scale-induced failures
The agent works fine at 50 concurrent calls. At 500 concurrent calls, latency climbs, tool calls start timing out, and the conversation experience degrades in ways the lower-volume tests didn’t predict. Load testing voice agents is meaningfully harder than load testing web services, and most teams underinvest here until production exposes the problem.
A QA lead at a government technology company captured the discomfort plainly: “The scariest piece right now is just we can’t load test this and we’re rolling out to a very large agency.” The launch event becomes the load test, and the agency is the customer.
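The shape of a concurrency ramp can be sketched without any telephony at all. The example below is only a harness outline under heavy assumptions: `FakeAgent` is a hypothetical stand-in whose response time grows once it is over capacity, and a real load test would drive actual audio streams through the real stack. What it shows is the measurement pattern: step the concurrency up and track tail latency at each level.

```python
import asyncio
import math
import time

class FakeAgent:
    """Stand-in agent (assumption): latency grows once in-flight turns exceed capacity."""
    def __init__(self, capacity=10, base_latency_s=0.005):
        self.capacity = capacity
        self.base_latency_s = base_latency_s
        self.in_flight = 0

    async def handle_turn(self):
        self.in_flight += 1
        try:
            overload = max(1.0, self.in_flight / self.capacity)
            await asyncio.sleep(self.base_latency_s * overload)
        finally:
            self.in_flight -= 1

async def timed_turn(agent):
    """Measure wall-clock latency of a single turn."""
    start = time.perf_counter()
    await agent.handle_turn()
    return time.perf_counter() - start

def percentile(values, pct):
    """Nearest-rank percentile; adequate for a sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

async def ramp(levels):
    """Fire one burst of concurrent turns per level and report p95 latency in seconds."""
    report = {}
    for concurrency in levels:
        agent = FakeAgent()
        latencies = await asyncio.gather(*(timed_turn(agent) for _ in range(concurrency)))
        report[concurrency] = percentile(latencies, 95)
    return report

p95_by_level = asyncio.run(ramp([5, 50, 100]))
```

Tracking p95 rather than the mean matters here because the callers who notice degradation are the ones in the tail.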
8. The “unknown unknowns”
The failures the team didn’t anticipate. The caller who pretended to be an existing customer to get information about that customer’s account. The conversation that triggered an unexpected interaction between two tool calls. The corner of the prompt that nobody noticed had ambiguous instructions until a real caller exposed it.
These are the failures that simulation alone cannot find. They surface in production, and the question is how quickly the team detects them and how soon they can be reproduced in the test suite so they don’t recur.
Why test suites miss the conditions voice AI production exposes
The teams that hit the demo-to-production gap haven’t done anything wrong. They’ve done the same things every voice AI team does, and discovered the systemic blind spots.
Test data is collected by the development team, not by users. Engineers test on their own voices, in their own environments, with their own use patterns. The distribution of real callers differs sharply from the distribution of developers, and the test set inherits the developer’s blind spots.
Base cases are easier to script than edge cases. Writing test scenarios for the most common user goals is straightforward. Writing test scenarios for the long tail of adversarial behavior, edge-case business logic, and integration failures is harder and less satisfying. Teams disproportionately invest in the easy scenarios.
Mocks are cleaner than real integrations. Mocked CRMs, EHRs, and payment processors are deterministic and fast. Real integrations are slow, flaky, and occasionally return data the team didn’t expect. Test suites that rely on mocks miss the integration-layer failures that produce most production bugs.
Pre-launch testing has a deadline. The team needs to ship. The bar for “good enough” becomes whether the test suite passes, regardless of whether the suite covers everything that might happen in production. Real-world coverage stays an ongoing project beyond any launch gate.
The cost of testing edge cases isn’t always visible. Adding 200 new test scenarios that cover audio edge cases, adversarial behavior, and integration failures is a real investment. Without a clear forcing function, teams default to the smaller test suite that’s faster to maintain.
Closing the gap: the three-layer approach
Teams that close the demo-to-production gap have built a three-layer evaluation infrastructure. Each layer addresses a category of failure the others can’t catch.
Layer 1: Pre-production simulation against realistic conditions
Before any agent change ships, it runs against a library of test scenarios. The scenarios cover:
- Base cases across the agent’s primary use cases.
- Audio variation: realistic phone audio, background noise, accent diversity, multilingual coverage.
- Adversarial behavior: frustrated callers, gaming attempts, ambiguous requests, conversation hijacking.
- Integration edge cases: API timeouts, unexpected response schemas, partial data, rate limiting.
- Business logic edge cases: duplicate accounts, mismatched records, unusual user states.
The scenario library starts small and grows. Every production incident becomes a new scenario. Every customer complaint becomes a new scenario. Every clever attack a tester finds becomes a new scenario. The library is the team’s institutional memory of every failure mode they’ve ever seen, codified into automated tests. (Coval’s simulations are designed around this pattern; the docs walk through how teams structure scenario libraries.)
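A scenario library of this kind can be checked for balance with very little code. The sketch below is an assumption about how a team might represent it, not a description of any particular product: each scenario is tagged with one of the five categories listed above, and a helper reports which categories fall below a chosen share of the library. The `Scenario` fields, the category names, and the 10 percent threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = ("base", "audio", "adversarial", "integration", "business_logic")

@dataclass(frozen=True)
class Scenario:
    name: str
    category: str             # one of CATEGORIES
    source: str = "authored"  # or "incident", "complaint", "red_team"

def coverage(library, min_share=0.10):
    """Return each category's share of the library and the categories below `min_share`."""
    unknown = {s.category for s in library} - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(s.category for s in library)
    total = len(library)
    shares = {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}
    thin = [c for c in CATEGORIES if shares[c] < min_share]
    return shares, thin

# A library that is typical early on: mostly base cases, one scenario of everything else.
library = [Scenario(f"base-{i}", "base") for i in range(16)] + [
    Scenario("road-noise-5db", "audio"),
    Scenario("refund-gaming", "adversarial", source="red_team"),
    Scenario("crm-timeout", "integration", source="incident"),
    Scenario("duplicate-accounts", "business_logic", source="complaint"),
]
shares, thin = coverage(library)
```

In this example the base cases make up 80 percent of the library and every other category is flagged as thin, which is the distribution problem the list above is meant to correct.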
We covered the methodology in the three-layer testing framework for voice AI. For the cost picture (what it looks like when teams skip this work), see voice AI production failures: the $500K cost of skipping evaluation infrastructure.
Layer 2: Production observability
Once the agent is live, every conversation gets graded against the same criteria the simulations use: resolution rate, escalation accuracy, tone, compliance adherence, tool-call success, and latency. The output is a dashboard that surfaces drift before it becomes a customer-affecting problem.
Drift is constant. Model vendors update. Backend APIs change. Caller demographics shift with marketing campaigns. The agent that worked last month doesn’t necessarily work this month, and production observability is how teams find out without waiting for the complaints. See what is voice AI observability for the deeper coverage.
Layer 3: The feedback loop
The part most teams underinvest in. Failures caught in production get reproduced in simulation so they don’t recur. Patterns from real call data get used to generate new test scenarios. The agent improves over time because the evaluation infrastructure is collecting evidence and feeding it back into the development cycle.
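The core of the loop, turning a production failure into a regression scenario exactly once, might look like the sketch below. It is a minimal illustration with made-up field names (`intent`, `failure_type`, `condition`) and a hypothetical `absorb` helper; a real pipeline would also capture the audio and transcript needed to replay the call.

```python
import hashlib
import json

def fingerprint(failure):
    """Stable ID for a failure mode so repeat incidents map to one scenario."""
    key = {k: failure[k] for k in ("intent", "failure_type", "condition")}
    return hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()[:12]

def absorb(regression_suite, production_failures):
    """Add one regression scenario per new failure mode; return the IDs that were added."""
    added = []
    for failure in production_failures:
        fid = fingerprint(failure)
        if fid in regression_suite:
            regression_suite[fid]["seen"] += 1  # known failure mode: just count it
            continue
        regression_suite[fid] = {
            "caller_goal": failure["intent"],
            "inject": failure["condition"],
            "must_not": failure["failure_type"],
            "seen": 1,
        }
        added.append(fid)
    return added

suite = {}
failures = [
    {"call_id": "a1", "intent": "refund", "failure_type": "dead_end_route", "condition": "caller_interrupts"},
    {"call_id": "b2", "intent": "refund", "failure_type": "dead_end_route", "condition": "caller_interrupts"},
    {"call_id": "c3", "intent": "reschedule", "failure_type": "tool_timeout", "condition": "ehr_rate_limit"},
]
new_ids = absorb(suite, failures)
```

Deduplicating by failure mode rather than by call keeps the suite from filling up with copies of the same incident, while the `seen` counter preserves how often each mode occurs in production.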
One methodology that’s proven to work at scale comes from self-driving cars. According to Waymo’s research on simulation infrastructure, the company doesn’t ship a software update because it worked once in a test drive; it ships because the update passes millions of simulated miles, the regression suite hasn’t degraded, and the production fleet’s metrics confirm the improvement. The same pattern works for voice AI. That discipline separates agents that improve over time from agents that quietly degrade.
Our voice AI agent evaluation reference covers the full methodology.
What “production-realistic” testing looks like
Five characteristics separate test suites that predict production from test suites that don’t.
The test audio is real or convincingly synthetic. Studio-quality recordings of the engineering team will not predict production. Use either real phone audio sampled from production (with appropriate legal and privacy infrastructure) or synthetic audio that includes realistic noise and accent variation.
The integrations are real or carefully shadowed. The test suite hits real APIs in a sandboxed environment, or shadow-runs against production APIs in a way that doesn’t affect production state. Mocks alone are inadequate.
The conversations are multi-turn. Single-turn tests catch the easiest failure modes. Multi-turn tests, including interruptions, reformulations, and recovery from errors, catch the failure modes that surface in real conversations.
The grading is rigorous. Pass/fail grading misses too much. Behavioral grading with rubrics covering tone, accuracy, compliance, and policy adherence gives the team the resolution they need to detect subtle regressions. The pattern matches what Stanford’s HELM project established for LLM benchmarking, where granular, multi-dimensional scoring beats single-number pass/fail.
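To see why a single pass/fail number hides regressions, consider the sketch below. It is an illustrative assumption about how rubric scores might be stored and compared, not a reproduction of any benchmark: each conversation gets a score between 0 and 1 on the four dimensions named above, and the 0.7 pass threshold and 0.05 tolerance are arbitrary.

```python
from dataclasses import dataclass

RUBRIC = ("tone", "accuracy", "compliance", "policy_adherence")

@dataclass
class Grade:
    scores: dict  # dimension -> score in [0, 1]

    def passed(self, threshold=0.7):
        """Pass/fail view: every dimension must clear the threshold."""
        return all(self.scores[d] >= threshold for d in RUBRIC)

def mean_by_dimension(grades):
    return {d: sum(g.scores[d] for g in grades) / len(grades) for d in RUBRIC}

def regressions(before, after, tolerance=0.05):
    """Dimensions whose mean score fell by more than `tolerance` between two runs."""
    old, new = mean_by_dimension(before), mean_by_dimension(after)
    return {d: round(new[d] - old[d], 3) for d in RUBRIC if old[d] - new[d] > tolerance}

# Two hypothetical suite runs: tone slips from 0.90 to 0.75, everything else is unchanged.
before = [Grade({"tone": 0.9, "accuracy": 0.9, "compliance": 1.0, "policy_adherence": 0.9})] * 4
after = [Grade({"tone": 0.75, "accuracy": 0.9, "compliance": 1.0, "policy_adherence": 0.9})] * 4

pass_rate_before = sum(g.passed() for g in before) / len(before)
pass_rate_after = sum(g.passed() for g in after) / len(after)
flagged = regressions(before, after)
```

Both runs pass every conversation, so the pass rate does not move, yet the per-dimension comparison flags the drop in tone.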
The suite runs automatically on every change. Manual test runs miss regressions that automated regression testing catches. The discipline that matters most is consistency.
We covered the methodology in voice AI evaluation in 2026: the 5 metrics that predict production success.
What teams say after they’ve closed the gap
Voice AI teams that successfully close the gap describe the shift consistently.
The most universal feedback: deployment cycles get faster. When the team trusts the test suite, they ship without the weeks of manual validation that used to precede each release. One engineering leader described it as moving from “deploy on Friday and watch it all weekend” to “deploy any day and check the dashboard Monday.”
Production incidents drop, though they don’t disappear. When they do happen, they happen for novel reasons rather than for reasons the team already knew about and didn’t catch. The retrospectives feel different: “this is a category we hadn’t tested for” instead of “we knew about this and didn’t run the test.”
Compliance reviews get easier. Auditors ask for evidence of regression testing, scenario coverage, behavioral evaluation. Teams that have built the three-layer infrastructure produce the evidence without scrambling.
The team’s confidence in the agent grows. Agents still fail in some scenarios; that part doesn’t change. What changes is that the team knows where the failures will be, monitors for them, and improves the agent on a regular cadence instead of reacting to surprises.
Common mistakes when trying to close the gap
Five recurring failure modes show up in teams trying to do this work without getting traction.
Adding more tests without changing the test distribution. Doubling the number of base-case test scenarios doesn’t close the gap. The gap closes when the team adds the kinds of scenarios the suite is missing, rather than more of the kinds it already has.
Investing only in pre-launch testing. Pre-launch testing catches the failures the team knows to look for. Production observability catches the failures they don’t. Both are necessary, and teams that invest only in the first half discover the production half the hard way.
Treating evaluation as a launch gate. Evaluation is continuous. The team that closes the gap once and stops investing finds it has reopened by the next major model update or platform migration.
Underestimating audio realism. This is the single most common gap we see. Teams build sophisticated test suites with clean audio and discover at launch that audio realism was the bottleneck they hadn’t recognized.
Mocking integrations to keep tests fast. Speed matters, but mocks that don’t behave like the real systems are worse than no mocks at all because they give the team false confidence.
Where to go from here
The demo-to-production gap is the biggest hidden cost in voice AI deployments today. Teams that close it ship faster, have fewer incidents, and operate with more confidence in what they’ve built. Teams that don’t close it find out the hard way, from customers, that the evaluation they trusted wasn’t catching what mattered.
If you’re at the point where production performance is below what testing predicted, our voice AI agent evaluation reference covers the methodology that closes the gap. Book a call with the Coval team if you want to talk through what evaluation looks like for your specific deployment.
Frequently asked questions
How big is the typical demo-to-production gap?
It varies meaningfully by use case shape. Structured, narrow agents (fixed scheduling flows, balance-check IVR replacement) tend to lose 5-10 percentage points between staging and production. Variable, customer-facing agents (sales outbound, healthcare intake, contact-center triage) routinely lose 20-30 or more. The biggest predictor isn’t model quality; it’s how far production traffic strays from what the test set covered.
What’s the single highest-ROI investment in closing the gap?
Audio realism. Most teams’ test sets use clean audio that doesn’t predict production. Investing in realistic audio conditions (accent variation, background noise, low-bandwidth phone audio) is the single highest-leverage change for most voice AI teams.
How long does it take to close the gap?
Closing it meaningfully usually takes a quarter. Closing it completely is a multi-quarter program; the scenario library grows over time as the team discovers new failure modes in production and adds them to the suite.
What’s the role of production monitoring versus pre-launch testing?
Pre-launch testing is the proactive layer; production monitoring is the discovery layer. The proactive layer catches known failure shapes against a curated set of scenarios. The discovery layer surfaces failures the team did not anticipate (caller behaviors, regional variations, integration drift) and feeds them back into the test set. Without the discovery loop, the test set ossifies and the gap reopens within a quarter of every major model or platform update.
How do we know if our evaluation is closing the gap?
The most direct measure: compare your pre-launch test results to your production performance. If they match closely, your evaluation is predictive. If the results diverge, your evaluation is missing the conditions that matter in production.
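That comparison is more useful when it is broken out by condition rather than reported as one number. The sketch below assumes both staging and production results are tagged with the same hypothetical categories and stored as (passed, total) counts; the figures are invented for illustration.

```python
def predictiveness(staging, production):
    """Per-category gap, in percentage points, between staging and production pass rates.

    Each argument maps category -> (passed, total).
    """
    gaps = {}
    for category, (s_pass, s_total) in staging.items():
        p_pass, p_total = production.get(category, (0, 0))
        if s_total == 0 or p_total == 0:
            gaps[category] = None  # no data on one side: cannot compare
            continue
        gaps[category] = round(100 * (s_pass / s_total - p_pass / p_total), 1)
    return gaps

# Hypothetical counts, tagged by the condition each call or scenario exercised.
staging = {"clean_audio": (95, 100), "noisy_audio": (18, 20), "interrupted": (9, 10)}
production = {"clean_audio": (460, 500), "noisy_audio": (130, 200), "interrupted": (60, 100)}

gaps = predictiveness(staging, production)
worst = max((c for c in gaps if gaps[c] is not None), key=lambda c: gaps[c])
```

A small gap on clean audio next to large gaps on noisy or interrupted calls points at which conditions the suite under-samples. Note that the thin staging counts in those categories (20 and 10 scenarios here) are themselves part of the finding.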