
Real-time audio in the browser: what 2026 actually makes possible

The honest 2026 floor and ceiling for browser audio: AudioWorklet latency, WebGPU inference, MediaDevices capture, the gaps that still exist vs native, and the gaps that have finally closed.

June 4, 2026 · 20 min read · Tags: webaudio, audioworklet, webgpu, browser, pillar, engineering

The phrase “you can’t do real audio in the browser” was true for most of the WebAudio era. It stopped being true around 2020, once AudioWorklet, first shipped in Chrome in 2018, had reached Firefox as well. It became uncomfortably untrue when WebGPU shipped in 2023 and let serious ML inference move off-CPU. By 2026, browser audio is doing things that would have required a native app five years ago. A few things genuinely still need native.

This is the working map of the floor and the ceiling. What you can ship in the browser today with confidence, what’s right at the edge, and what still needs a separate codebase.

The shape of the platform in 2026

Five APIs do the work:

AudioContext + AudioNode graph. The classical WebAudio primitives. Built-in nodes: gain, biquad, delay, compressor, convolver, analyser, panner. Fine for “I need to play a sound” and for the analysis-side stuff (FFT via AnalyserNode). Limited for anything custom because each node’s processing is fixed: you can rewire the graph freely, but you cannot change what a built-in node computes.
AudioWorklet. The escape hatch. JavaScript or WebAssembly running on a dedicated audio thread, processing buffers of 128 samples at a time. This is where real-time DSP, synthesis, and analysis live. AudioContext for the routing, AudioWorklet for the work.
MediaDevices.getUserMedia. Microphone and other input access. Constraints API for sample rate, channel count, echo cancellation, noise suppression. Standardised and consistent across browsers in 2026 in a way it wasn’t in 2020.
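MediaStreamTrack processing via MediaStreamTrackProcessor. Exposes a live track as a readable stream of audio frames (AudioData objects) you can process. Together with MediaStreamAudioSourceNode, which feeds a track into the AudioContext graph and on to an AudioWorklet, it is the clean bridge between capture and DSP.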
WebGPU. General-purpose GPU compute. The reason in-browser audio ML is now genuinely competitive (see the dedicated section below).
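
As a concrete starting point for the capture side, here is a minimal sketch of a getUserMedia request for custom DSP, assuming you want the browser's built-in processing switched off. The helper name rawCaptureConstraints and the local CaptureConstraints type are illustrative, not part of any API; the constraint keys themselves are standard MediaTrackConstraints properties. Browsers treat plain values as preferences, so it is worth reading back track.getSettings() to see what was actually granted.

```typescript
// Illustrative local type: mirrors the subset of MediaTrackConstraints used here.
interface CaptureConstraints {
  audio: {
    sampleRate: number;
    channelCount: number;
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
  };
  video: false;
}

// Hypothetical helper: request an unprocessed signal for custom DSP.
function rawCaptureConstraints(sampleRate = 48000, channelCount = 1): CaptureConstraints {
  return {
    audio: {
      sampleRate,
      channelCount,
      echoCancellation: false,
      noiseSuppression: false,
      autoGainControl: false,
    },
    video: false,
  };
}

// In a browser (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia(rawCaptureConstraints());
//   const granted = stream.getAudioTracks()[0].getSettings();
```

For a product that is happy with the built-in echo cancellation and noise suppression, the same request with those flags set to true is the simpler path.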

Around those, helpers: WebCodecs (encode/decode), Web Audio Stream Sink (low-latency rendering paths in modern browsers), and the AudioSession API for mobile-platform integration.

Latency: the headline number

Real-time audio is judged by round-trip latency: the time from sound entering the mic to processed sound coming out of the speakers. The 2026 floors:

Platform	Round-trip latency	Practical use
Desktop Chrome / Edge	18–35 ms	Monitoring with effects, tracking with light DSP
Desktop Safari	25–45 ms	Same uses as Chrome, at slightly higher latency
Desktop Firefox	25–40 ms	Same uses as Safari
Android Chrome	35–80 ms	Casual production; not professional tracking
iOS Safari	30–60 ms	Tighter than Android
128-sample buffer + dedicated audio interface	8–15 ms	Genuinely usable for monitoring

That last row matters. With audioContext.baseLatency and audioContext.outputLatency now reporting honest values, an audio interface configured for a 128-sample buffer at 48 kHz can deliver round-trip latencies in the same ballpark as a native JUCE app. The browser tax is now small; most of what remains is the OS-mixer tax.

For comparison, the equivalent native-app floor in 2026 with the same interface is 4–8 ms. The browser is 2–3× the latency, not 10×. For most use cases, that gap doesn’t matter. For live monitoring while tracking vocals, it still does.
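
A rough way to sanity-check these numbers on a given machine is to add up what the context reports. The sketch below assumes the input-side latency is known from somewhere else (a loopback measurement, for example), because AudioContext only reports the processing and output sides. The helper name estimateRoundTripMs and the LatencyReport type are illustrative; baseLatency and outputLatency are the real AudioContext properties, both in seconds.

```typescript
// Structural type: any object with these two fields works, including a real AudioContext.
interface LatencyReport {
  baseLatency: number;   // seconds: processing latency of the context
  outputLatency: number; // seconds: estimated latency of the output device
}

// Hypothetical helper. inputLatencySec is an assumption supplied by the caller;
// AudioContext does not report it.
function estimateRoundTripMs(report: LatencyReport, inputLatencySec: number): number {
  return (report.baseLatency + report.outputLatency + inputLatencySec) * 1000;
}

// Illustrative numbers only: one 128-sample quantum at 48 kHz as base latency,
// 5 ms on the output side, 4 ms on the input side. Comes out at about 11.7 ms.
const exampleRoundTripMs = estimateRoundTripMs(
  { baseLatency: 128 / 48000, outputLatency: 0.005 },
  0.004,
);
```

In a page, passing the live AudioContext as the first argument gives the output half of the estimate directly from the browser.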

Threading and jitter

AudioWorklet runs at audio-thread priority. A 128-sample buffer at 48 kHz means processing has to complete in 2.67 ms or you get an underrun (an audible glitch). Code on the audio thread must not trigger garbage collection, allocate, or block on IO. Code in an AudioWorkletProcessor is held to the same standards as code in a native audio callback.
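
The budget is simple arithmetic: buffer length divided by sample rate. A small sketch of that calculation, with hypothetical helper names, also shows how much of the budget a measured processing time consumes.

```typescript
// Time available to process one buffer, in milliseconds.
function renderQuantumBudgetMs(frames: number, sampleRate: number): number {
  return (frames / sampleRate) * 1000;
}

// Fraction of the budget used by a measured processing time (1.0 means an underrun is imminent).
function budgetFraction(processingMs: number, frames: number, sampleRate: number): number {
  return processingMs / renderQuantumBudgetMs(frames, sampleRate);
}

const budgetAt48k = renderQuantumBudgetMs(128, 48000); // about 2.67 ms
const budgetAt44k = renderQuantumBudgetMs(128, 44100); // about 2.90 ms
```

A processor that takes 0.4 ms per buffer at 48 kHz uses about 15% of the budget, which leaves headroom for scheduling jitter.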

In practice:

Pure JavaScript audio code. Works fine for moderate-complexity DSP: a few filters, an envelope follower, basic feature extraction. The V8/JavaScriptCore JITs are excellent on this kind of tight numerical loop. Allocations are the enemy; the discipline is the same as native.
WebAssembly audio code. The default for anything serious: full plugin emulations, FFT-heavy analysis, DSP libraries ported from C++. Build pipelines (Emscripten, Rust + wasm-bindgen) are mature; performance is within 1.2–1.5× of equivalent native code.
SharedArrayBuffer + Atomics. The transport for getting data between main-thread UI and audio-thread DSP without copying. Required for any non-trivial app. The COOP/COEP header requirements are an operational hassle but not a blocker.
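
The SharedArrayBuffer transport usually takes the form of a lock-free ring buffer. What follows is a minimal single-producer, single-consumer sketch, not a production implementation: the class name SharedRingBuffer and its memory layout (two 32-bit indices followed by float samples) are assumptions made for illustration. One side constructs it on the main thread or a worker, the other inside the AudioWorkletProcessor, both over the same SharedArrayBuffer.

```typescript
// Single-producer, single-consumer ring buffer over a SharedArrayBuffer.
// Layout (an assumption of this sketch): 8 bytes of indices, then Float32 samples.
class SharedRingBuffer {
  private readonly indices: Int32Array; // [0] = read index, [1] = write index
  private readonly data: Float32Array;
  private readonly capacity: number;

  constructor(sab: SharedArrayBuffer) {
    this.indices = new Int32Array(sab, 0, 2);
    this.data = new Float32Array(sab, 8);
    this.capacity = this.data.length;
  }

  // Allocate backing memory for `capacity` slots (one slot always stays empty).
  static allocate(capacity: number): SharedArrayBuffer {
    return new SharedArrayBuffer(8 + capacity * 4);
  }

  // Producer side: copies as many samples as fit, returns the count written.
  push(src: Float32Array): number {
    const read = Atomics.load(this.indices, 0);
    const write = Atomics.load(this.indices, 1);
    const used = (write - read + this.capacity) % this.capacity;
    const n = Math.min(this.capacity - 1 - used, src.length);
    for (let i = 0; i < n; i++) {
      this.data[(write + i) % this.capacity] = src[i];
    }
    // Publish the new write index only after the samples are in place.
    Atomics.store(this.indices, 1, (write + n) % this.capacity);
    return n;
  }

  // Consumer side: fills dst with whatever is available, returns the count read.
  pop(dst: Float32Array): number {
    const read = Atomics.load(this.indices, 0);
    const write = Atomics.load(this.indices, 1);
    const available = (write - read + this.capacity) % this.capacity;
    const n = Math.min(available, dst.length);
    for (let i = 0; i < n; i++) {
      dst[i] = this.data[(read + i) % this.capacity];
    }
    Atomics.store(this.indices, 0, (read + n) % this.capacity);
    return n;
  }
}
```

Neither push nor pop allocates or blocks, which is what makes the pattern safe to call from inside process(). When pop returns fewer samples than requested, the audio side should output silence for the remainder rather than wait.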

The honest jitter story: audio-thread runtime variance is well-controlled on desktop, less so on mobile. Apps that need bulletproof glitch-free behaviour add a small jitter buffer (5–10 ms) at the output side, the same trick native apps use on contended systems.

What’s actually possible

A non-exhaustive list of things that work in production browsers in 2026:

Solid

Multi-band processing chains. EQ, compression, gating, limiting. The standard mastering chain runs comfortably in a single AudioWorklet, using a few hundred microseconds of the per-buffer budget.
Real-time analysis. Spectrum, LUFS metering (the AudioLab MixLab Analyzer does this in production), key detection, beat tracking, onset detection. All low-CPU.
Voice processing. Noise suppression, voice activity detection, mic processing. All well within the budget. Browser-built-in NS/AEC (echoCancellation: true) is the lazy good-enough; custom DSP for products that need it.
Synthesis. Wavetable, FM, additive, subtractive: every classical synthesis technique. Polyphony in the dozens is comfortable.
Convolution reverb. The built-in ConvolverNode handles arbitrary impulse responses. Long IRs (5+ seconds) work via partitioned convolution, which the browser implementations now do correctly.
Looping, slicing, time-stretching, pitch shifting. Phase-vocoder and granular implementations in WebAssembly are competitive with native equivalents.
MIDI input and output. Web MIDI is mature on Chromium browsers; Safari got it in 2024.
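
To give a feel for how small the building blocks are, here is a sketch of an envelope follower, the kind of component that sits inside a compressor, gate, or level meter. It is a generic one-pole design with assumed attack and release defaults, not the AudioLab implementation. All state is allocated up front, so processing a block allocates nothing.

```typescript
// One-pole envelope follower with separate attack and release times.
class EnvelopeFollower {
  private envelope = 0;
  private readonly attackCoeff: number;
  private readonly releaseCoeff: number;

  // attackMs and releaseMs defaults are illustrative values.
  constructor(sampleRate: number, attackMs = 5, releaseMs = 100) {
    this.attackCoeff = Math.exp(-1 / ((sampleRate * attackMs) / 1000));
    this.releaseCoeff = Math.exp(-1 / ((sampleRate * releaseMs) / 1000));
  }

  // Processes one block in place-safe fashion (read-only) and returns the
  // envelope value after the last sample. No allocation on this path.
  process(block: Float32Array): number {
    let env = this.envelope;
    for (let i = 0; i < block.length; i++) {
      const level = Math.abs(block[i]);
      // Rising signal uses the fast attack coefficient, falling uses release.
      const coeff = level > env ? this.attackCoeff : this.releaseCoeff;
      env = level + coeff * (env - level);
    }
    this.envelope = env;
    return env;
  }
}
```

Inside an AudioWorkletProcessor, the follower would be constructed once in the constructor and called from process() with inputs[0][0], the first channel of the first input.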

Edge of the envelope

Sample-accurate timing of long arrangements. Possible but requires careful scheduling against audioContext.currentTime. The web platform’s timing primitives are good enough; the gotchas are mostly in browser-tab throttling on inactive tabs.
Low-latency network audio. WebRTC for peer-to-peer is mature; the harder problem is jitter and concealment on bad networks. Custom transports via WebTransport are an emerging path.
Plug-in hosting. AU/VST hosting is not happening in the browser. WAM (Web Audio Modules) is an emerging plugin standard, though the ecosystem is still small in 2026.
Multi-channel surround / Dolby Atmos rendering. WebAudio supports up to 32 channels in theory; in practice, routing anything past stereo at the OS level depends on the browser and the operating system.
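
For the scheduling item in this list, the usual approach is a lookahead scheduler: a main-thread timer wakes up periodically and hands the audio engine every event that falls inside a short window ahead of audioContext.currentTime. The sketch below shows only the timing calculation; the function name and parameters are illustrative. Beat times are derived from a beat index rather than by repeated addition, so rounding error does not accumulate over a long arrangement.

```typescript
// Returns the beat times (in AudioContext seconds) that fall before the
// lookahead horizon, plus the index of the first beat left for the next call.
function beatsToSchedule(
  startTime: number,      // AudioContext time of beat 0
  nextBeatIndex: number,  // first beat not yet scheduled
  bpm: number,
  currentTime: number,    // audioContext.currentTime at the moment of the call
  lookaheadSec: number,   // how far ahead to schedule
): { times: number[]; nextBeatIndex: number } {
  const secondsPerBeat = 60 / bpm;
  const horizon = currentTime + lookaheadSec;
  const times: number[] = [];
  let index = nextBeatIndex;
  while (startTime + index * secondsPerBeat < horizon) {
    times.push(startTime + index * secondsPerBeat);
    index++;
  }
  return { times, nextBeatIndex: index };
}

// In a browser (not executed here), each returned time would be passed to
// something like source.start(time) on an AudioBufferSourceNode.
```

The lookahead window is the defence against timer throttling: if the tab's timer fires late, events inside the window have already been handed to the audio thread. A heavily throttled background tab can still outrun a short window, which is the gotcha mentioned above.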

Still genuinely native-only

Sub-10 ms round-trip latency on consumer machines. The OS audio stack overhead doesn’t go away. Pro audio interfaces help; they don’t close the gap entirely.
Driver-level integration with audio interfaces. Custom mixer routing, hardware control surfaces, sample-accurate sync to external devices. Browsers don’t expose this and won’t.
Background audio processing on locked mobile devices. The browser tab goes to sleep. Native apps don’t.
OS-level audio capture from other apps. System audio routing is sandboxed away from the browser, deliberately.

WebGPU and audio ML

The biggest 2024–2026 shift. WebGPU put GPU compute in the browser without WebGL’s graphics-only constraints. For audio:

Real-time inference of small-to-medium models. Source separation, denoising, dereverberation, super-resolution: all available in the browser via ONNX Runtime Web or Transformers.js running on WebGPU. Performance is competitive with desktop-class CPU inference.
Larger models (Whisper-large, MusicGen, Demucs at the larger sizes) still run, but with frame-by-frame batching that introduces real latency. Streaming inference is the bottleneck, not raw throughput.
Model sizes that fit comfortably are in the 50–500 MB range. Bigger than that and you fight cache eviction.
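
To see why streaming is the bottleneck, a back-of-the-envelope model helps. The sketch below assumes the simplest possible pipeline: a model that needs a full chunk of audio before it can run, so added latency is roughly the chunk duration plus the inference time, ignoring transfer and scheduling overhead. The function name and the numbers are illustrative.

```typescript
interface ChunkedInferenceEstimate {
  latencyMs: number;      // chunk duration + inference time
  realTimeFactor: number; // inference time / chunk duration
  keepsUp: boolean;       // true when inference finishes before the next chunk is ready
}

// Simplified model (an assumption of this sketch): one chunk in flight at a time.
function estimateChunkedInference(chunkMs: number, inferenceMs: number): ChunkedInferenceEstimate {
  const realTimeFactor = inferenceMs / chunkMs;
  return {
    latencyMs: chunkMs + inferenceMs,
    realTimeFactor,
    keepsUp: realTimeFactor < 1,
  };
}

// A small denoiser on 20 ms chunks with 8 ms inference adds about 28 ms.
const smallModel = estimateChunkedInference(20, 8);
// A large model on 1-second chunks with 400 ms inference keeps up comfortably
// on throughput, yet adds about 1.4 s of latency.
const largeModel = estimateChunkedInference(1000, 400);
```

Both examples have the same real-time factor. The difference in latency comes entirely from how much audio the model needs to see before it can produce output.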

Two practical patterns are emerging in 2026:

Edge-only inference. The model lives on the user’s machine, runs entirely in the browser, never sends audio anywhere. The AudioLab demos lean this way for privacy reasons.
Hybrid. Cheap pre-processing in WebGPU; heavy inference in a cloud endpoint via WebTransport with sub-100 ms round-trip. This is what most production voice products look like today, and the cloud share is shrinking each year.

The build-side reality

Shipping a serious browser audio app in 2026 looks like:

Audio code in Rust or C++, compiled to WebAssembly via Emscripten or wasm-bindgen. Build pipeline matches what a native app would have.
AudioWorklet shim in TypeScript, importing the WASM module, doing the audio-thread plumbing.
Main-thread UI in React/Vue/Svelte, talking to the audio thread through SharedArrayBuffer.
COOP/COEP isolation for SharedArrayBuffer. (WebGPU needs only a secure context, not cross-origin isolation.) Operational consequence: third-party iframes and subresources have to opt in with their own headers. Annoying but solvable.
PWA + Service Worker for offline-capable installs. Real audio products in 2026 ship installable.
Worker pool for non-audio-thread heavy lifting: file decoding, analysis, ML inference.
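
The isolation step comes down to two response headers on the top-level document, plus a runtime check before touching shared memory. A minimal sketch follows; the function names are illustrative, while the header names, their values, and the crossOriginIsolated flag are standard.

```typescript
// Response headers that opt a document into cross-origin isolation.
function crossOriginIsolationHeaders(): Record<string, string> {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}

// Hypothetical guard: pass the global scope (window or self) in a browser.
function canUseSharedMemory(scope: { crossOriginIsolated?: boolean }): boolean {
  return scope.crossOriginIsolated === true && typeof SharedArrayBuffer !== "undefined";
}
```

How the headers get attached depends on the server or CDN in use. When the guard returns false, a fallback that posts copied buffers over a MessagePort still works, at the cost of extra copying and allocation.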

The build chain has matured to the point where a small team can ship a real audio product without owning a native build pipeline per platform. That’s the genuine 2026 shift, not any one API.

What this means for product decisions

A working rule:

Choose browser-first unless you specifically need (a) sub-10 ms round-trip latency, (b) AU/VST plugin hosting, (c) background audio processing on locked mobile devices, or (d) deep OS-level audio routing.

Five years ago, “browser-first” was a constrained choice. You accepted significant compromises for distribution convenience. In 2026, browser-first is a competitive choice for a wide spectrum of audio products. Distribution is easier, the platform is good enough, and the install funnel does not compete with native app marketplaces.

For AudioLab, all seven labs ship browser-first by design. The trade-offs we accept (slightly higher floor latency, no plugin hosting) are not relevant to what the labs are for. For your product, do the honest accounting of what you actually need before committing to one or the other.
