
Live Caption on Android: what it does and doesn't do

Google’s Live Caption is one of the most widely used accessibility features on Android. Here is how it actually works and what it doesn’t cover.

Android’s Live Caption is a system-wide feature that transcribes audio playing on the device. It works offline using on-device speech recognition (a quantised version of Google’s ASR models). It’s used by tens of millions of people daily and it’s probably the single most impactful accessibility ship of the last decade.

It’s also frequently misunderstood. Here is the working engineer’s view of what it does, what it doesn’t, and what that means if you’re building hearing-support tooling.

What Live Caption captures

Audio routed through Android’s media or VoiceCall streams.
Audio from system sounds, browser tabs, video apps, and most music apps.
Real-time, on-device, no network round-trip.

That’s a wide net. Watching YouTube, taking a call, listening to a podcast, opening a TikTok with audio: all get captioned automatically.

What it does not capture

The microphone. Live Caption ignores the mic. It’s about what your phone is playing, not what’s happening around you.
Non-speech and spatial audio cues. It transcribes words. Music, room tone, and environmental sound are either dropped or reduced to a simple label (“[Music]”, “[Applause]”).
Languages outside the supported set. The list has expanded but still covers only a fraction of the world’s spoken languages.
Caller identity in group calls. It transcribes, but doesn’t diarise. You won’t see “Alex:” or “Mia:”. Just text.

The microphone gap is the big one. If you want to caption the world around you (a restaurant, a meeting), you need a separate flow: Google’s own Live Transcribe app, or a third-party app with explicit mic permission and sometimes cloud ASR for accuracy.

What that means for HearLab

HearLab’s captions mode is explicitly a microphone captions mode. We don’t try to replace Live Caption. Google has done that better than we ever will for media playback. We try to do the thing Live Caption doesn’t: caption the room around the user.

That choice has implications:

We need mic permission, which means we need explicit user consent and clear UX around it.
Cloud ASR is on the table. On-device models can do a lot, but server-side ASR (Whisper running on a backend, Deepgram, Google’s Speech-to-Text API) is still meaningfully better for noisy real-world audio.
Privacy framing matters. A mic-active app that ships audio to a cloud needs to be very clear about what it’s doing, when, and for how long. Off-by-default and explicit recording UI are non-negotiable.

What Live Caption does well that we should learn from

Latency: the on-device model gets first-pass text in under a second. Anything slower than that breaks the illusion of “captions”.
Failing gracefully: when confidence is low, Live Caption shows lighter, smaller text. The user can tell it’s a guess. We do the same in HearLab: interim results render in italic, finalised results in regular weight.
Inline correction: as more context arrives, earlier words can update. This requires UI that doesn’t penalise users for reading “live” text, so the cursor stays roughly where the eye is, not where the text gets re-rendered.
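
The interim/final split and inline correction described above can be sketched as a small buffer: finalised phrases are append-only, and a single interim tail is replaced wholesale on every update, so text the user has already read never moves. This is a minimal TypeScript sketch under assumptions, not HearLab’s actual implementation; CaptionBuffer, Segment and render are hypothetical names, and the italic/regular styling is represented only by a final flag.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not HearLab's real code.

interface Segment {
  text: string;
  final: boolean; // false = interim guess (e.g. italic), true = finalised (regular weight)
}

class CaptionBuffer {
  private finals: string[] = [];
  private interim = "";

  // Replace the current interim guess. Finalised text is never touched.
  updateInterim(text: string): void {
    this.interim = text.trim();
  }

  // Commit a finalised phrase and clear the interim tail it supersedes.
  commitFinal(text: string): void {
    const cleaned = text.trim();
    if (cleaned.length > 0) this.finals.push(cleaned);
    this.interim = "";
  }

  // Segments in display order: every finalised phrase, then the interim tail if any.
  render(): Segment[] {
    const segments: Segment[] = this.finals.map((text) => ({ text, final: true }));
    if (this.interim.length > 0) {
      segments.push({ text: this.interim, final: false });
    }
    return segments;
  }
}
```

Because only the last segment can change, a renderer built on this shape can re-draw the tail without reflowing the lines above it.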

Building a captions mode in the browser

For the HearLab Companion demo, we use the Web Speech API (SpeechRecognition). In Chrome this is essentially Google’s server-side ASR exposed through the browser; other browsers that implement the interface, such as Edge and Safari, back it with their own vendor’s speech engine. It’s the right starting point for prototyping because:

It’s a few lines of setup.
It handles interim and final results out of the box.
It supports continuous mode.
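
As a rough illustration of the three points above, the setup might look like the sketch below. The properties and events used (continuous, interimResults, lang, onresult, resultIndex, isFinal, transcript) are part of the Web Speech API; the local interfaces, the startCaptions helper and the "en-GB" default are assumptions made for this example, and the constructor is passed in so the same code can be exercised outside a browser.

```typescript
// Minimal structural types for the parts of the Web Speech API used here.
// Declared locally because TypeScript's DOM typings may not include SpeechRecognition.
interface RecognitionAlternative { transcript: string }
interface RecognitionResult { isFinal: boolean; 0: RecognitionAlternative }
interface RecognitionEvent {
  resultIndex: number;
  results: { length: number; [index: number]: RecognitionResult };
}
interface Recogniser {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  onresult: ((event: RecognitionEvent) => void) | null;
  start(): void;
  stop(): void;
}

// Hypothetical helper: configures a recogniser and forwards text to a callback.
function startCaptions(
  Ctor: new () => Recogniser,
  onText: (text: string, isFinal: boolean) => void,
  lang = "en-GB", // assumed default for this sketch
): Recogniser {
  const rec = new Ctor();
  rec.continuous = true;     // keep listening across pauses
  rec.interimResults = true; // deliver guesses before they are finalised
  rec.lang = lang;
  rec.onresult = (event) => {
    // Only results from resultIndex onwards have changed since the last event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result === undefined) continue;
      onText(result[0].transcript, result.isFinal);
    }
  };
  rec.start();
  return rec;
}
```

In a browser, the constructor would typically come from window.SpeechRecognition, falling back to window.webkitSpeechRecognition where only the prefixed name exists, with a cast to the local Recogniser type as needed.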

It’s the wrong endpoint for production because:

It’s a free, browser-mediated path with no SLA.
The lifecycle is fragile: the recogniser will silently stop after a few minutes on some platforms.
You don’t control which model is used.
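
The fragile lifecycle is usually worked around by listening for the recogniser’s end event and starting it again while the user still wants captions. The sketch below shows one way to do that, with a cap on consecutive restarts so a broken session doesn’t loop forever. The onend and onerror handlers and the "not-allowed" / "service-not-allowed" error codes come from the Web Speech API; keepAlive, maxRestarts, onGiveUp and noteProgress are hypothetical names, and the default of 5 restarts is an arbitrary assumption.

```typescript
// Minimal structural type for the lifecycle parts of a speech recogniser.
interface RestartableRecogniser {
  onend: (() => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
  stop(): void;
}

// Hypothetical wrapper: restarts the recogniser when it ends on its own.
function keepAlive(
  rec: RestartableRecogniser,
  maxRestarts = 5, // assumed cap on consecutive restarts without progress
  onGiveUp: (reason: string) => void = () => {},
) {
  let wanted = false; // true while the user wants captions running
  let restarts = 0;   // consecutive restarts since the last progress

  rec.onerror = (event) => {
    // Permission errors will not fix themselves, so stop trying.
    if (event.error === "not-allowed" || event.error === "service-not-allowed") {
      wanted = false;
      onGiveUp(event.error);
    }
  };

  rec.onend = () => {
    if (!wanted) return; // the user pressed stop, or we already gave up
    if (restarts >= maxRestarts) {
      wanted = false;
      onGiveUp("too-many-restarts");
      return;
    }
    restarts += 1;
    rec.start();
  };

  return {
    start(): void { wanted = true; restarts = 0; rec.start(); },
    stop(): void { wanted = false; rec.stop(); },
    // Call when a result arrives so a healthy session resets the counter.
    noteProgress(): void { restarts = 0; },
  };
}
```

A production version would likely add a short delay before restarting and surface the give-up reason in the UI, so the user knows captions have stopped.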

For a production HearLab companion, Whisper (running on a backend) or Deepgram is the right choice. The Web Speech API gets us through the prototype phase without making infrastructure decisions before they’re needed.