AI voice infrastructure in 2026 — what actually works
The honest map of voice infrastructure in 2026 — TTS, ASR, voice cloning, dubbing, real-time conversion, QA tooling. What ships in production, what is overpromised, and where the unbuilt opportunities sit.
If you build anything that processes human voice in 2026 — podcasts, calls, video, accessibility, language learning, automation — you sit on top of a layered stack of voice infrastructure. The pieces are increasingly capable, increasingly available as APIs, and increasingly subject to marketing claims that don’t survive contact with production.
This is the working engineer’s map of where voice infrastructure actually stands. It defines the layers, names what works, names what doesn’t, and points at where the next real work needs to happen. It is the second of the AudioLab.tools pillar pieces, following “What is audio AI?”, and it is the cornerstone document for VoiceLab.
The seven layers
The voice stack, split into the parts that actually matter:
- Capture — microphones, room acoustics, A/D conversion, format choice.
- Cleanup — noise reduction, dereverb, AGC, normalisation.
- Recognition (ASR) — speech → text, with timing, confidence, optional diarisation.
- Understanding (NLU) — intent, sentiment, entity extraction, summarisation.
- Synthesis (TTS) — text → speech with controllable expression.
- Conversion — voice-to-voice transformation, including dubbing and identity preservation.
- Delivery — distribution, routing, real-time vs offline, accessibility surfacing.
This piece covers layers 2–7. (Layer 1, capture, deserves its own dedicated piece — coming.) Each layer below ships in production today; the gaps are mostly in how they compose, not in their individual capability.
Layer 2: Cleanup
The boring layer that quietly makes everything else work.
What ships in 2026
- RNNoise-class neural noise reduction — every major video-call app (Zoom, Meet, Teams, Slack, FaceTime) defaults to it. Krisp and competitors expose it as SDKs for third-party apps.
- Studio-grade dereverb — iZotope RX, Acon Digital DeVerberate, several Adobe tools. Production-ready for podcast and broadcast post.
- AGC and loudness normalisation — ITU-R BS.1770-4 and EBU R128 compliance, ASR-optimised pre-processing, podcast platforms (Spotify, Apple) handling it automatically on ingest.
- Echo cancellation — solved at the conferencing layer. WebRTC’s AEC is good enough for most use cases.
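To make the normalisation item concrete: once a BS.1770-style meter has reported integrated loudness, the correction itself is a single static gain. The sketch below assumes that measurement already exists (it does not implement K-weighting or gating). The function names are illustrative, and the -23 LUFS default is the EBU R128 programme target, not any platform's ingest logic.

```python
def gain_to_target_db(measured_lufs, target_lufs=-23.0):
    """Static gain (dB) that moves measured integrated loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


def normalise(samples, measured_lufs, target_lufs=-23.0):
    """Apply the static gain to float samples in the range [-1.0, 1.0].

    Returns (scaled_samples, would_clip). The clip check uses sample peak,
    which is only an approximation of the true-peak limit in the standards.
    """
    gain = db_to_linear(gain_to_target_db(measured_lufs, target_lufs))
    scaled = [s * gain for s in samples]
    peak = max((abs(s) for s in scaled), default=0.0)
    return scaled, peak > 1.0
```

Podcast delivery targets are usually louder than -23 LUFS, so in practice `target_lufs` would be passed explicitly. When `would_clip` is true, a limiter or a lower target is needed rather than plain gain.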
What still falls short
- Real-time studio-grade dereverb. RX-quality dereverb runs slower than real time on most hardware. Real-time variants exist (for example RX Dialogue De-reverb and Waves Clarity Vx DeReverb), but the tradeoffs are real.
- Cross-language noise reduction. Most RNNoise variants are tuned on English speech. Performance on tonal languages or whisper-quiet voices drops measurably.
- Long-form dereverb consistency. A 90-minute podcast where the room tone shifts halfway is still hard. Most tools assume the impulse response is stationary.
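"Slower than real time" is usually expressed as a real-time factor (RTF): processing time divided by audio duration. A minimal sketch of that arithmetic follows. The 0.8 headroom threshold is an assumption for illustration, not a published figure for any tool.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time_capable(rtf, headroom=0.8):
    """Treat a process as live-capable only if it leaves some CPU headroom.

    headroom is a hypothetical safety margin: an RTF of exactly 1.0 keeps up
    on average but has no slack for scheduling jitter.
    """
    return rtf <= headroom
```

For example, a dereverb pass that takes 90 seconds on a 60-second clip has an RTF of 1.5 and cannot run live on that hardware.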
What we built
VoiceLab QA doesn’t do cleanup. It measures the evidence of needing cleanup — SNR, noise floor, room echo decay, sibilance risk — and tells you what to fix and where to fix it. The cleanup tools are commodities; the QA layer wasn’t, until recently.
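As an illustration of what "measuring the evidence" can look like, here is a minimal sketch of a noise-floor and SNR estimate from per-frame energy. It is a common percentile heuristic, not VoiceLab's actual method: the quietest frames are assumed to be room noise and the loudest frames speech. The frame length and percentiles are assumptions.

```python
import math


def frame_levels_db(samples, frame_len):
    """Mean-square level (dB relative to full scale) of each whole frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        levels.append(10.0 * math.log10(max(mean_sq, 1e-12)))
    return levels


def estimate_noise_floor_and_snr(samples, frame_len=480,
                                 floor_pct=0.1, speech_pct=0.9):
    """Return (noise_floor_db, snr_db) using low and high level percentiles."""
    levels = sorted(frame_levels_db(samples, frame_len))
    if not levels:
        raise ValueError("need at least one full frame of audio")
    last = len(levels) - 1
    noise_db = levels[int(floor_pct * last)]
    speech_db = levels[int(speech_pct * last)]
    return noise_db, speech_db - noise_db
```

The heuristic breaks down on recordings with no pauses (there are no noise-only frames to find), which is one reason production QA needs more than a single number.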
Layer 3: Recognition (ASR)
The layer that got rewritten between 2022 and 2025 and is still settling.
What ships in 2026
- Cloud ASR: OpenAI Whisper API, Deepgram, AssemblyAI, AWS Transcribe, Google STT. Accuracy is in the high 90s (%) on clean English and in the mid-to-high 80s on most major languages; it degrades on niche dialects.
- On-device ASR: whisper.cpp, MLX-based Whisper on Apple Silicon, Android’s on-device Live Caption. Production quality, no network round-trip.
- Real-time streaming ASR: Deepgram Nova-3, Google STT streaming, Speechmatics. Sub-300ms first-word latency, 95%+ accuracy on broadcast-quality audio.
- Diarisation: pyannote.audio in research; bundled in Deepgram, AssemblyAI. Reliable on 2–3 speakers with separated microphones; degrades fast with overlap.
- Word-level timing — every modern ASR provider exposes per-word timestamps suitable for caption-syncing.
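Per-word timestamps are what make caption-syncing mechanical. The sketch below groups words into SubRip (SRT) cues. The input shape (a list of dicts with `word`, `start` and `end` in seconds) is an assumption, since each provider names these fields differently, and the 42-character and 0.8-second limits are illustrative defaults.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=42, max_gap=0.8):
    """Group timed words into cues, breaking on line length or long pauses."""
    cues, current = [], []
    for word in words:
        if current:
            length = len(" ".join(w["word"] for w in current)) + 1 + len(word["word"])
            gap = word["start"] - current[-1]["end"]
            if length > max_chars or gap > max_gap:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(w["word"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Breaking on pauses as well as length keeps cues aligned with natural phrasing, which matters more for readability than hitting an exact character count.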
What still falls short
- Long-tail languages. Even mid-tier languages (Catalan, Vietnamese, Tagalog) lag English by roughly 8–15 percentage points of accuracy on typical material.
- Cross-channel diarisation. A single mono file with multiple distant mics is still a hard problem. pyannote-class models do it, but accuracy drops into the 70s (%).
- Code-switching — when speakers mix languages mid-sentence, accuracy collapses. Multi-lingual ASR exists (Whisper-large) but routes through detection that can flip mid-utterance and lose track.
- Whispered or quiet speech — pre-processing is required. Production tools handle this; raw ASR APIs less so.
- Vocabulary adaptation — domain-specific terms (medical, legal, technical) need custom language models or post-correction. Some providers expose hotwords; most don’t expose vocabulary fine-tuning.
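Accuracy gaps like the ones in this list are normally quantified with word error rate (WER): substitutions, deletions and insertions divided by the number of reference words. The sketch below assumes the percentages quoted here are word accuracy (roughly 1 minus WER), and it uses naive lowercase whitespace tokenisation, which real evaluations replace with language-aware text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """WER via word-level edit distance (substitution, deletion, insertion)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Running the same reference set through each candidate provider, per language and per domain, is the only reliable way to check headline accuracy claims against your own material.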
What we built
VoiceLab QA flags pacing, filler density, clipping, room echo — signal-level metrics that complement ASR. ASR for the words; VoiceLab for the quality of those words being captured.
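Two of these metrics fall straight out of word timestamps. The sketch below computes pacing (words per minute) and filler density. It is a simplified illustration, not VoiceLab's implementation: the filler list, the word-dict shape (`word`, `start`, `end` in seconds) and the metric names are all assumptions.

```python
FILLERS = {"um", "uh", "er", "erm", "hmm"}  # hypothetical, English-only list


def pacing_and_fillers(words, fillers=FILLERS):
    """Return words per minute, filler density and fillers per minute."""
    if not words:
        raise ValueError("need at least one timed word")
    duration_min = (words[-1]["end"] - words[0]["start"]) / 60.0
    if duration_min <= 0:
        raise ValueError("timestamps must span a positive duration")

    filler_count = sum(
        1 for w in words if w["word"].lower().strip(".,!?") in fillers
    )
    return {
        "wpm": len(words) / duration_min,
        "filler_density": filler_count / len(words),   # share of all words
        "fillers_per_min": filler_count / duration_min,
    }
```

Density (fillers as a share of words) stays comparable across fast and slow speakers, whereas a raw count does not.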
Layer 4: Understanding (NLU)
Adjacent to voice but worth a brief mention because the boundary blurs.
What ships in 2026
- Intent classification — covered by general-purpose LLMs (GPT-4-class, Claude, Gemini). Specialised IVR-focused intent providers (Dialogflow, Lex) still exist but the general LLM route covers most ground.
- Sentiment + emotion — Voiseed, Hume and similar systems do “vocal emotion” detection. Accuracy is moderate; the category is over-marketed.
- Speaker identification (recognition vs diarisation) — biometric speaker ID exists (Pindrop, Phonexia). Distinct from diarisation; less of an everyday tool.
- Summarisation + topic extraction — every transcription provider now ships a summary endpoint. Quality is good for 30-min material, degrades for hour-plus.
What still falls short
- Multi-turn understanding — long conversations where context matters across hours. RAG over the transcript helps but is bolted on, not native.
- Cultural / contextual nuance — sarcasm, regional idioms, code-switched humour. Not solved; not on a near-term roadmap.
Layer 5: Synthesis (TTS)
The layer most visible to users and most affected by the “AI hype” wave.
What ships in 2026
- High-fidelity neural TTS: ElevenLabs, Play.ht, Suno’s Bark and its derivatives, Cartesia, Resemble, Google Studio Voice, Azure Neural TTS. Quality at the top tier is indistinguishable from professional voiceover for most listeners.
- Voice cloning with explicit consent — ElevenLabs Pro, Resemble, Cartesia. Five seconds of source audio is enough for a usable clone; quality scales with sample length.
- Multilingual cross-tongue synthesis — synthesise speech in a language the speaker doesn’t themselves speak. Quality is good for major language pairs.
- Style + emotion control — sliders for excitement, calm, narrative arc. Quality varies; ElevenLabs is currently the leader.
- Real-time TTS — sub-200ms latency for short utterances. Used in voice assistants, real-time dubbing, interactive media.
What still falls short
- Performance. Synthetic voices land at “good announcer”. They don’t land at “great actor”. The last 15 % of expressive nuance is mostly missing.
- Long-form consistency — voices drift over a 30-minute synthesis. Each 60-second chunk sounds great; stitched together, prosody breaks.
- Whisper, breathiness, intimacy — high-detail vocal textures are hit or miss.
- Singing — Suno and similar exist, but quality on a-cappella singing has plateaued; instrument-accompanied output is much better.
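The long-form drift problem starts with how text is chunked. One common mitigation, sketched here under assumptions, is to split only at sentence boundaries and carry the previous sentence along as context for the next request. The `context` field is hypothetical: whether and how a given TTS engine accepts preceding text varies by provider, and the regex sentence splitter is deliberately naive.

```python
import re


def chunk_for_synthesis(text, max_chars=800):
    """Split text into sentence-aligned chunks with one sentence of context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, context = [], [], ""
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append({"context": context, "text": " ".join(current)})
            context = current[-1]  # last sentence of the chunk just closed
            current = []
        current.append(sentence)
    if current:
        chunks.append({"context": context, "text": " ".join(current)})
    return chunks
```

Chunking this way avoids mid-sentence cuts, but it does not by itself keep prosody consistent across 30 minutes. That still depends on what the synthesis engine does with the context.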
Ethics and legal
The single biggest production blocker. Voice cloning without explicit consent has been ruled actionable in multiple jurisdictions. Production providers (ElevenLabs Enterprise, Cartesia) now require:
- Explicit consent capture from the source voice.
- Audit logs of who synthesised what voice and when.
- Watermarking or content authenticity markers.
- Region-specific compliance for biometric data laws.
Building voice synthesis into a product without these is no longer a pure engineering choice — it’s a legal one.
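The audit-log requirement is the most straightforward of these to prototype. Below is a minimal sketch of a tamper-evident log in which each entry includes the hash of the previous one, so any later edit breaks the chain. The field names (`actor`, `voice_id`, `consent_ref`) are hypothetical and do not reflect any provider's schema. A real deployment would also need signed timestamps and durable storage.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """SHA-256 over a canonical (sorted-key) JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_audit_entry(log, actor, voice_id, consent_ref, timestamp):
    """Append a record of who synthesised which voice, chained to the last entry."""
    body = {
        "actor": actor,
        "voice_id": voice_id,
        "consent_ref": consent_ref,  # pointer to the stored consent record
        "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_audit_log(log):
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Hash chaining only proves the log has not been altered after the fact. It says nothing about whether the consent behind `consent_ref` was valid, which remains a legal and process question.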
Layer 6: Conversion
Voice-to-voice transformation. The newest and fastest-moving layer.
What ships in 2026
- Real-time voice conversion — Coqui XTTS and successors, ElevenLabs voice changer, Resemble’s real-time API. Convert your voice to a target voice with sub-200ms latency.
- Dubbing pipelines — synchronise voice translation while preserving the source speaker’s identity. ElevenLabs Dubbing, HeyGen, Synthesia. Quality is good for talking-head video; lip-sync remains the weakest link.
- Singing voice conversion (SVC) — RVC and successors. Used widely but mostly outside production legal contexts.
What still falls short
- Mouth movement preservation — dubbing-grade lip-sync requires video-side processing alongside voice; integrated systems (Wav2Lip, Sync Labs) are improving fast but still produce uncanny output on close-ups.
- Prosody preservation across languages — translating English to Mandarin and keeping the speaker’s rhythm is hard. The category is improving but not yet at “indistinguishable from native”.
- Studio-grade voice replacement — for film and broadcast post-production, hand-tuned voice replacement is still the norm. AI-assisted, not AI-driven.
Layer 7: Delivery
How the voice reaches the listener.
What ships in 2026
- Adaptive bitrate streaming for voice — covered by HLS/DASH; voice-specific optimisations exist but rarely matter.
- LE Audio for direct-to-hearing-device streaming — see Hearing accessibility on Android for the deeper dive.
- Captions everywhere — every major platform now expects captions to be available, often auto-generated.
- Real-time routing — WebRTC, Twilio Voice, LiveKit. Production-ready, sub-200ms typical latency.
- Accessibility surfacing — Live Caption on Android, Live Captions on iOS, screen-reader integrations.
What still falls short
- Routing visibility in apps. Users on hearing aids still rarely see where their audio is being routed. The platform exposes the info; almost no apps surface it.
- Multi-language routing — automatic per-listener language selection from a single source. Possible in principle; not productised cleanly.
The unbuilt opportunities
The places where voice infrastructure is good enough but the products aren’t yet:
1. Voice QA at scale
The ASR APIs are commodities. The quality control layer around voice production — pacing drift detection across episodes, room-tone consistency, sibilance regression — is mostly absent. This is the VoiceLab thesis: signal-level QA tooling that makes voice production reliable at volume.
2. Long-form synthesis with consistency
A 30-minute audiobook chapter synthesised end-to-end with consistent prosody, mood, and pacing. No one ships this well today. The components exist; the orchestration doesn’t.
3. Dubbing-prep pipelines
Most dubbing today is “translate the script and re-record”. Tooling for prepping the original — extracting timing landmarks, detecting unmatched gestures, flagging speaker emphasis points — is mostly in-house at studios. The product layer is sparse.
4. Consent and audit infrastructure
Production voice cloning requires consent capture, audit trails, region-specific compliance, watermarking. Build-vs-buy is currently “build” for most teams. The opportunity for a horizontal compliance product is large.
5. Accessibility-first voice products
Captions everywhere is now baseline. Real first-class accessibility — assistive listening integration, routing surfacing, multi-modal redundancy — is mostly missing from consumer products. See the accessibility audio market is bigger than people think.
What “build on AudioLab” means for voice
VoiceLab is the QA + workflow layer. The browser-side demo measures clarity, pacing, filler density, room echo, sibilance — the things that distinguish “captured speech” from “production-ready speech”. The roadmap covers:
- API endpoints for VoiceLab QA (in private design preview)
- Per-vertical thresholds for podcasting / e-learning / dubbing prep
- Localisation-aware pronunciation risk markers
- Multi-speaker turn analysis at archive scale
We don’t synthesise voices. We don’t do voice cloning. We don’t do ASR. We do the layer in between — the one that makes the voices you have, or the voices you’re generating, actually production-quality.
Where the field is going
Three predictions for the 2027 voice stack:
- ASR becomes a commodity. Quality differentiation moves to consent + audit. Whisper-class models are good enough that the new value is in trust infrastructure, not accuracy.
- Real-time voice conversion ships in mainstream consumer apps. Already in beta for some games and creator tools; will be in conferencing platforms within 12 months.
- The “studio voice production” stack consolidates. The dozens of point tools for cleanup / dereverb / pacing / loudness will collapse into 2–3 dominant suites, with API access for downstream platforms.
Related
- What is audio AI?
- The accessibility audio market is bigger than people think
- Why filler density matters more than filler count
- Estimating room echo without RT60
- A pragmatic loudness target for podcasts
Get involved
VoiceLab’s design-partner program is open across all four verticals (podcast, e-learning, dubbing prep, accessibility). If you’re building voice infrastructure, drop a note via contact. We update this map when the field shifts.