
Why filler density matters more than filler count

Counting "ums" is the wrong measurement. Density tells you whether they are a problem.

Every podcast QA tool wants to count your fillers. Most of them stop there. A raw filler count is mostly noise: a 5-minute interview and a 90-minute discussion produce very different totals, but the experience of listening to them depends on something else entirely.

That something is density: fillers per minute of actual speech.

The measurement

filler_density = filler_count / speech_seconds * 60

Note the divisor: speech seconds, not total file duration. Silence would otherwise dilute the figure, so a podcast with short, filler-heavy bursts of speech separated by long silences would look cleaner than it sounds.
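A minimal sketch of the formula in Python, to make the choice of divisor concrete. The function name and the example numbers are illustrative, not part of VoiceLab's API.

```python
def filler_density(filler_count: int, speech_seconds: float) -> float:
    """Fillers per minute of actual speech (not total file duration)."""
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return filler_count / speech_seconds * 60


# Same 12 fillers in the same 10-minute file, with different amounts of speech.
all_speech = filler_density(12, 600)     # every second of the file is speech
mostly_silent = filler_density(12, 240)  # only 240 s of the file is speech

print(f"all speech:    {all_speech:.1f} /min")
print(f"mostly silent: {mostly_silent:.1f} /min")
```

Dividing by file duration would give both recordings the same score. Dividing by speech seconds separates them.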

The thresholds we use in VoiceLab:

Density	Listener experience
< 3 /min	Smooth, professional
3–6 /min	Conversational, fine for talk-shows
6–10 /min	Noticeable, may need editing
> 10 /min	Distracting: heavy edit pass needed

These are guidelines, not laws. A high-energy improv podcast at 12 fillers/min can be brilliant. A scripted explainer at 5 fillers/min is broken.
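If you want the bands in code, a lookup like the one below is enough. It is a sketch, not VoiceLab's implementation: the function name is hypothetical, and because the table leaves boundary values open (6/min sits in two bands), this version assumes a boundary value falls into the lower band.

```python
def listener_experience(density_per_min: float) -> str:
    """Map a filler density to the guideline bands from the table.

    Assumption: values exactly on a shared boundary (6, 10) go to the lower band.
    """
    if density_per_min < 3:
        return "Smooth, professional"
    if density_per_min <= 6:
        return "Conversational, fine for talk-shows"
    if density_per_min <= 10:
        return "Noticeable, may need editing"
    return "Distracting: heavy edit pass needed"


for density in (1.5, 5.0, 8.0, 12.0):
    print(f"{density:>5.1f} /min -> {listener_experience(density)}")
```

The label is a starting point for a human decision, not a verdict: the format of the show still decides whether a given band is acceptable.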

Why count alone misleads

Take two podcasts:

Podcast A: 60 minutes, 40 fillers.
Podcast B: 10 minutes, 30 fillers.

By count, Podcast A is “worse.” By density (treating the stated durations as speech time), Podcast A is about 0.67/min and Podcast B is 3/min. The density figure is the more honest summary of what a listener will feel.
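The same comparison as a few lines of Python, in case you want to run it against your own episodes. It assumes the stated running time is all speech, which the example does not say explicitly.

```python
# (minutes of speech, filler count) for the two example podcasts.
# Assumption: the stated running time is all speech.
podcasts = {"A": (60, 40), "B": (10, 30)}

densities = {
    name: fillers / minutes
    for name, (minutes, fillers) in podcasts.items()
}

for name, (minutes, fillers) in podcasts.items():
    print(f"Podcast {name}: {fillers} fillers total, {densities[name]:.2f} per minute")
```

Ranking by total puts A first. Ranking by density puts B first, by a factor of about 4.5.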

Pause length matters too

A 200 ms pause usually goes unnoticed. A 600 ms pause reads as dramatic. A 1500 ms pause is a hesitation, often standing in for a filler. VoiceLab tracks pause distribution alongside filler density, because replacing a filler with a long hesitation isn’t actually an improvement. It’s the same problem in a different shape.
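One way to picture a pause distribution is a simple histogram over pause lengths. The sketch below is illustrative only: the bucket names and cut points (400 ms and 1000 ms) are assumptions chosen to separate the three example pauses, not thresholds taken from VoiceLab.

```python
from collections import Counter

# Hypothetical cut points in milliseconds. The text only gives 200, 600 and
# 1500 ms as examples, so these boundaries are assumptions.
SHORT_MAX_MS = 400
DRAMATIC_MAX_MS = 1000


def pause_bucket(duration_ms: float) -> str:
    """Assign one pause to a coarse bucket by its length."""
    if duration_ms < SHORT_MAX_MS:
        return "short"
    if duration_ms < DRAMATIC_MAX_MS:
        return "dramatic"
    return "hesitation"


def pause_distribution(pause_durations_ms):
    """Count how many pauses fall into each bucket."""
    return Counter(pause_bucket(d) for d in pause_durations_ms)


dist = pause_distribution([200, 180, 600, 1500, 1700])
print(dict(dist))
```

If an edit pass lowers filler density while the "hesitation" bucket grows, the fillers were swapped for long pauses rather than removed.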

How VoiceLab detects fillers without ASR

Without speech-to-text, exact filler counting isn’t possible. But two signal-level proxies are useful:

Sub-syllabic energy bursts: short envelope peaks, shorter than a typical word but longer than a click, bracketed by silence. These correlate heavily with “um”, “uh”, “like”, “you know”.
Pause-cluster density: regions where 80–400 ms pauses occur more than 4 times per 10 seconds. This catches the speech rhythm typical of filler-heavy delivery even when individual fillers aren’t detected.
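Both proxies can be sketched in plain Python. This is not VoiceLab's detector. The first function reads the description as "a loud run shorter than a word and longer than a click, with silence on both sides". The frame size, silence level, burst bounds (30 to 250 ms) and minimum bracketing silence are all assumed values. The 80–400 ms range and the "more than 4 per 10 seconds" rule come from the text. The inputs (a per-frame amplitude envelope and a list of detected pauses) are assumed to come from an earlier analysis step.

```python
from itertools import groupby


def find_energy_bursts(envelope, frame_ms=10, silence_level=0.05,
                       min_burst_ms=30, max_burst_ms=250, min_silence_ms=50):
    """Return (start_ms, duration_ms) for short loud runs bracketed by silence.

    All default thresholds are illustrative assumptions.
    """
    # Run-length encode the envelope into alternating (is_loud, n_frames) runs.
    runs = [(loud, len(list(frames)))
            for loud, frames in groupby(envelope, key=lambda v: v > silence_level)]
    bursts, position = [], 0
    for i, (loud, n) in enumerate(runs):
        duration = n * frame_ms
        has_neighbours = 0 < i < len(runs) - 1
        if loud and has_neighbours and min_burst_ms <= duration <= max_burst_ms:
            # Neighbouring runs are silent by construction; check their length.
            before = runs[i - 1][1] * frame_ms
            after = runs[i + 1][1] * frame_ms
            if before >= min_silence_ms and after >= min_silence_ms:
                bursts.append((position * frame_ms, duration))
        position += n
    return bursts


def pause_cluster_starts(pauses, window_s=10.0, min_ms=80, max_ms=400, max_count=4):
    """pauses: iterable of (start_s, duration_ms).

    Return the start times of windows containing more than max_count
    pauses of 80-400 ms.
    """
    starts = sorted(s for s, d in pauses if min_ms <= d <= max_ms)
    flagged, j = [], 0
    for i, t in enumerate(starts):
        # Advance j past every qualifying pause that begins inside [t, t + window_s).
        while j < len(starts) and starts[j] < t + window_s:
            j += 1
        if j - i > max_count:
            flagged.append(t)
    return flagged


# Synthetic envelope: 80 ms burst, 600 ms word-length run, 10 ms click.
envelope = [0.0] * 10 + [1.0] * 8 + [0.0] * 10 + [1.0] * 60 + [0.0] * 10 + [1.0] + [0.0] * 10
print(find_energy_bursts(envelope))

pauses = [(0.0, 100), (1.0, 200), (2.0, 300), (3.0, 150), (4.0, 120), (30.0, 200), (31.0, 900)]
print(pause_cluster_starts(pauses))
```

Only the 80 ms run survives the burst filter: the long run is word-length and the single frame is a click. In the pause list, the five short pauses in the first few seconds form a cluster, while the isolated pause at 30 s does not.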

The signal-level layer is good enough for QA. For exact counts and word-level work, you need ASR (Whisper, Deepgram, Amazon Transcribe) downstream.

What to do with the number

If your density is above 6/min and you’re editing the show, here are three tactics from working editors:

Cut, don’t replace. Empty space between sentences sounds better than a filler in 80% of cases.
Tighten before you cut fillers. Most “filler” perception is actually pace. Tightening overall delivery makes fillers fade into the background.
Leave intentional ones. A “you know what I mean?” with character beats a sterile script. Don’t edit the personality out.
