Evaluating audio AI models: a working guide to measurements that matter

How to measure whether an audio AI model is actually good: subjective listening tests (MUSHRA, ABX, MOS), objective metrics (PESQ, STOI, SI-SDR), and the gap between the two. The practical playbook for product teams.

June 4, 2026 · 19 min read · Tags: evaluation, mushra, abx, metrics, pillar, audio ai

You shipped an audio AI feature. A model that denoises podcasts, separates stems, super-resolves voice memos, generates music, masters mixes. Pick the one that matches your week. Now: is it good? Was the new version better than the old version? Better than the competitor? Better than nothing?

Answering those questions well is harder than it looks. Audio is perceptual: two algorithms can produce signals that are mathematically very different and sound identical, or signals that are mathematically nearly identical and sound utterly different. The wrong evaluation strategy can let you ship a regression that 80% of listeners notice while every metric in your dashboard glows green.

This piece is the working playbook we use across the AudioLab labs when we’re judging an audio-AI model. Not exhaustive (there’s a 400-page literature behind any single line below), but enough to make a defensible go/no-go call without three months of academic detour.

The two-axis framework

Every audio-AI evaluation lives in a 2×2:

	Cheap to run	Expensive to run
Tracks perception	(mostly empty)	Subjective listening tests
Doesn’t track perception	Objective metrics	Specialised analysis

The bottom-left quadrant (cheap objective metrics) is where most product teams spend 95% of their time. The top-right (subjective listening tests) is what audio research labs build their reputation on. The trick is knowing which to use, when, and which combinations triangulate honestly.

Objective metrics: fast, scalable, lying often

The classics

PESQ (ITU-T P.862). Predicted MOS for narrowband and wideband speech. Validated against telecom-grade impairments: codec artefacts, packet loss, additive noise. Saturates badly above ~4 MOS, which means it can’t tell “really good” from “excellent.” Largely superseded by POLQA for new work, but PESQ is still the workhorse because it’s free and everyone has it integrated.
POLQA (ITU-T P.863). PESQ’s successor. Wider validation range, handles modern codecs better, less subject to gain mismatch. Licensed, which is the real reason most open-source projects still cite PESQ.
STOI (Short-Time Objective Intelligibility). A correlation-based intelligibility predictor. Cheap, well-correlated with human intelligibility scores for narrowband speech in noise. Less reliable for music or for “polish” judgements where intelligibility is already 100%.
SI-SDR (Scale-Invariant Signal-to-Distortion Ratio). The standard for source separation. Compares reference and estimate after optimal scaling. Scale-invariant, which matters because gain mismatch shouldn’t tank a perfectly good separation.
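
A minimal sketch of the SI-SDR calculation, to make "after optimal scaling" concrete. It uses plain Python lists so it runs without dependencies (real pipelines would use array libraries), it assumes mono signals that are already time-aligned and roughly zero-mean, and the function name and the eps guard are ours, not part of any standard.

```python
import math


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length mono signals.

    Assumes both signals are time-aligned and approximately zero-mean.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Optimal scaling: project the estimate onto the reference.
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)
    # Scaled reference is the "target"; everything else counts as distortion.
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


reference = [1.0, 1.0, -1.0, -1.0]
estimate = [1.1, 0.9, -0.9, -1.1]          # reference plus a small error
louder = [3.0 * x for x in estimate]        # same estimate with a gain change
print(si_sdr(reference, estimate), si_sdr(reference, louder))
```

Both calls print roughly 20 dB: the gain change moves the scaling factor, not the score.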

The newer cohort

DNSMOS / NISQA. Neural networks trained to predict subjective MOS scores. Reference-free: you don’t need a clean target signal. The right choice when you’re evaluating noise-suppression or speech-enhancement on real-world recordings where no ground truth exists.
ViSQOL. Google’s perceptual model for both speech and audio. Open source, well-maintained, handles a wide range of impairments.
CDPAM (Contrastive Deep Perceptual Audio Metric). Learned perceptual distance for general audio. Useful for “did the model change anything a person would notice?” sanity checks.

When objective metrics break down

Three failure modes recur:

Out-of-distribution material. PESQ was validated on phone-line speech. Run it on opera and you get numbers, but those numbers don’t mean what they mean on phone-line speech. The classical metrics are domain-specific even when their authors didn’t intend them to be.
Saturating in the “good” range. Most speech metrics distinguish bad from okay well, and okay from good poorly. Once your model is producing genuinely high-quality output, the objective metric will plateau and stop telling you which version is better.
Gaming. Any single metric is gameable. A model can learn to produce outputs that maximise PESQ without producing outputs that humans prefer. The literature on this is uncomfortable; see the perceptual-vs-PESQ debates of the 2020s for examples.

The defensible position: use 2–3 complementary objective metrics, treat them as a triage filter, and never declare victory based on objective metrics alone.

Subjective listening tests: slow, expensive, the ground truth

MUSHRA (ITU-R BS.1534)

The standard test for high-quality audio. Listeners compare 5–8 versions of the same content (hidden reference, hidden anchor, and the systems under test) and score each from 0–100. The hidden reference catches inattentive listeners: under the BS.1534-3 post-screening rule, a listener who scores it below 90 on more than 15% of items is excluded. The anchor calibrates the scale.

MUSHRA’s strength is high sensitivity in the high-quality range, exactly where PESQ falls over. It’s the right tool when you’re trying to discriminate between two models that are both pretty good.

Practical numbers: 15–20 listeners, 8–12 items per session, ~45 minutes per listener. Inter-listener variability is real; you want a test design that lets you do mixed-effects analysis.
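
The hidden-reference screening can be sketched in a few lines. This is an illustration, not a reference implementation: the data layout (a dict of listener id to hidden-reference scores) and the function name are assumptions, and the defaults follow our reading of the BS.1534-3 post-screening rule (hidden reference rated below 90 on more than 15% of items).

```python
def screen_listeners(ref_scores, threshold=90, max_fail_fraction=0.15):
    """Split listeners into (kept, excluded) lists.

    ref_scores maps a listener id to the scores (0-100) that listener gave
    the hidden reference, one score per test item.
    """
    kept, excluded = [], []
    for listener, scores in ref_scores.items():
        # Count items where the listener failed to recognise the reference.
        fails = sum(1 for s in scores if s < threshold)
        if scores and fails / len(scores) > max_fail_fraction:
            excluded.append(listener)
        else:
            kept.append(listener)
    return kept, excluded


hidden_ref = {
    "a": [100, 95, 100, 98, 100, 92, 100, 100, 97, 100],   # no misses
    "b": [100, 60, 70, 100, 100, 100, 100, 100, 100, 100],  # 2 of 10 missed
    "c": [100, 85, 100, 100, 100, 100, 100, 100, 100, 100], # 1 of 10 missed
}
kept, excluded = screen_listeners(hidden_ref)
print(kept, excluded)
```

Run the screening before any statistics, and report how many listeners were dropped alongside the results.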

ABX testing

The cleanest possible “is there an audible difference?” test. Listener hears A, B, then X; they have to identify whether X = A or X = B. Statistical significance at p < 0.05 (one-sided binomial test against guessing) needs at least 26 correct identifications out of 40 trials; 30 out of 40 puts you near p = 0.001. ABX is the gold standard for proving an audible difference exists. It tells you nothing about which is better; that’s a separate question.
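
The pass mark for an ABX run depends on the number of trials, so it is worth computing rather than remembering. A small sketch using the exact one-sided binomial tail under pure guessing (function names are ours; the 0.05 level is the conventional default, not a requirement):

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of at least `correct` hits in `trials` by guessing alone."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def min_correct(trials, alpha=0.05):
    """Smallest number of correct answers whose p-value is below alpha."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) < alpha:
            return correct
    return None  # too few trials to ever reach significance


print(min_correct(40))          # 26
print(min_correct(16))          # 12
print(abx_p_value(26, 40))      # about 0.040
```

If you run many ABX comparisons in one study, correct for multiple testing; at 0.05 per test, one in twenty null comparisons will "pass" by chance.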

When someone tells you “I can tell the difference between FLAC and 320 kbps MP3,” they’re making an ABX claim. Most people who think they can tell can’t, once they have to do it blind. The same is true of audio AI: many of the differences product teams agonise over disappear under ABX.

MOS testing

The simplest subjective test: present a sample, ask listeners to rate it 1–5. Cheaper than MUSHRA, lower sensitivity. Reasonable for “is the new noise-suppressor better than the old one in production” questions, where you can run hundreds of items.

The catch: MOS scores are not interval-scaled. A 3 is not “halfway between 2 and 4.” Statistical analysis has to use ordinal methods or you’ll overstate confidence.
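
One ordinal-friendly option is a paired sign test, which only asks whether each clip was rated higher or lower under the new model and never averages the ratings. The sketch below assumes each position in the two lists refers to the same clip (for example, the per-clip median rating under each model); the function name and data layout are illustrative.

```python
from math import comb


def sign_test(old_ratings, new_ratings):
    """Two-sided paired sign test; returns (wins, losses, p_value)."""
    pairs = list(zip(old_ratings, new_ratings))
    wins = sum(1 for old, new in pairs if new > old)
    losses = sum(1 for old, new in pairs if new < old)
    n = wins + losses  # ties carry no direction and are dropped
    if n == 0:
        return wins, losses, 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided if direction were a coin flip.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


old = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3]
new = [4, 4, 4, 3, 4, 4, 5, 4, 3, 4]
print(sign_test(old, new))
```

The sign test throws away magnitude, so it is conservative; a Wilcoxon signed-rank test or an ordinal regression recovers some power if you need it.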

Crowdsourced listening tests

Platforms like Prolific and Amazon MTurk let you run MOS-style tests at scale: hundreds of listeners, results in hours. The trade-offs:

No control over listening conditions. Headphones, laptop speakers, phone speakers: all produce different results. Mitigate with attention checks (a hidden tone they have to identify) and listener qualification.
Self-selected expertise. Crowdworkers aren’t trained critical listeners. For coarse quality judgements that’s fine; for nuanced spectral balance work, no.
Stimulus length limits. Crowdworkers won’t sit through 20-minute test sessions. Design for 5-minute attention spans.

For most product teams, a well-designed crowdsourced MOS test beats no listening test at all by a wide margin.

Triangulation: the playbook that actually works

Never bet a release on one number. Triangulate.

Stage 1: Continuous integration metrics

Every commit runs against a fixed evaluation set with 2–3 cheap objective metrics. The job is to catch regressions, not declare improvements. A 5% drop in PESQ blocks the merge; a 2% drop is flagged for human review.
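
A sketch of that gate as code. The 5% and 2% thresholds come from the rule above; the function names, the dict layout, and the choice to take the worst verdict across metrics are assumptions for illustration. Relative drops only make sense for positive, higher-is-better scores such as PESQ or STOI; a dB-scaled metric like SI-SDR would need an absolute threshold instead.

```python
SEVERITY = {"pass": 0, "review": 1, "block": 2}


def gate(baseline, candidate, block_drop=0.05, review_drop=0.02):
    """Classify one metric's change as 'block', 'review', or 'pass'."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative drop")
    drop = (baseline - candidate) / baseline
    if drop >= block_drop:
        return "block"
    if drop >= review_drop:
        return "review"
    return "pass"  # improvements and tiny drops both pass; the gate only catches regressions


def gate_all(baseline_scores, candidate_scores):
    """Apply the gate to every metric and return (worst_verdict, per_metric)."""
    verdicts = {
        name: gate(baseline_scores[name], candidate_scores[name])
        for name in baseline_scores
    }
    overall = max(verdicts.values(), key=SEVERITY.get, default="pass")
    return overall, verdicts


print(gate_all({"pesq": 3.0, "stoi": 0.90}, {"pesq": 2.92, "stoi": 0.91}))
```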

Stage 2: Pre-release listening

Before promoting a model to production, run a small expert listening pass: 4–6 trained listeners, MUSHRA design, the actual content you ship on (not the benchmark suite). If experts can’t reliably rank the new model above the old one, the release stops.

Stage 3: Crowd evaluation

For consumer-facing releases, a crowdsourced MOS test on 100+ listeners. Confirms the expert verdict at scale, catches issues with material the experts didn’t include.

Stage 4: Production monitoring

After release, automated reference-free metrics (DNSMOS, NISQA) on production traffic. Noisy at the per-sample level, but solid at the population level. A week-over-week drop in mean DNSMOS is a real signal.
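
To see why the population-level signal is trustworthy when single samples are not, compare the week-over-week drop with its standard error. This is a rough sketch: it treats the two weeks as independent samples, the z threshold of 3 is an arbitrary illustrative choice, and the scores stand in for whatever reference-free metric you log.

```python
from math import sqrt
from statistics import mean, stdev


def weekly_drop(last_week, this_week, z_threshold=3.0):
    """Return (drop, z, alert) for two weeks of per-sample scores."""
    drop = mean(last_week) - mean(this_week)
    # Standard error of the difference between two independent means.
    se = sqrt(stdev(last_week) ** 2 / len(last_week)
              + stdev(this_week) ** 2 / len(this_week))
    z = drop / se if se > 0 else 0.0
    return drop, z, z > z_threshold


# The same 0.1 drop in the mean, seen through 6 samples and through 600.
print(weekly_drop([3.0, 3.5, 4.0] * 2, [2.9, 3.4, 3.9] * 2))
print(weekly_drop([3.0, 3.5, 4.0] * 200, [2.9, 3.4, 3.9] * 200))
```

With six samples per week the drop is indistinguishable from noise; with six hundred it clears the threshold. A traffic-mix change can also move the mean, so segment by content type before blaming the model.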

The traps

A non-exhaustive list of evaluation mistakes that cost product teams dearly:

Training on the test set. Sounds obvious; happens constantly. Especially when the team that builds the model and the team that picks the evaluation set are the same people.
Evaluating on too little material. Five clips is not an evaluation set. Forty is the working minimum for any quantitative claim.
Cherry-picking failure cases. “Look how bad it sounds on this clip!” is a debugging tool, not an evaluation. The honest version reports the full distribution.
Confusing perceptual loss with perceptual quality. A model trained with a perceptual loss function is not the same as a model that produces perceptually-better output. The loss is a means; the test is the end.
Apples-to-oranges comparisons. Your model at 48 kHz vs the baseline at 16 kHz tells you the model resamples. It tells you nothing about whether the model is better.
Reading too much into 0.1 MOS. PESQ and POLQA have measurement variance. On a small evaluation set, a 0.1 MOS difference is usually within that variance. Run more samples or use a more sensitive test.
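
For the last trap, a quick way to check whether a small difference clears the noise is a bootstrap confidence interval on the per-clip score differences. A minimal sketch, assuming paired scores (new minus old on the same clips); the function name, the resample count, and the fixed seed are illustrative choices.

```python
import random


def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample clips with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Ten clips with a mean improvement of 0.05: is that distinguishable from zero?
diffs = [0.3, -0.2, 0.1, 0.0, -0.1, 0.2, 0.4, -0.3, 0.1, 0.0]
low, high = bootstrap_ci(diffs)
print(low, high)
```

If the interval straddles zero, as it does for these ten clips, the honest report is "no detectable difference at this sample size", not "slightly better".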

What AudioLab evaluates with

For the working labs:

MixLab Analyzer. No model evaluations; it’s a measurement tool. The numbers are validated against EBU R128 reference signals and we publish the agreement bands in the methodology page.
VoiceLab QA. Composite score evaluated against a 200-clip internal set with expert annotations. Reported precision/recall in the methodology.
SignalLab Indexer. Region-labelling F1 against a 1000-clip ground-truth set with three annotators per clip (inter-annotator agreement reported alongside the F1).
HearLab Companion. No clinical claims; no clinical evaluation. The captioning component reports WER against a public benchmark; the rest is UX work and gets UX evaluation.

The honest summary

Audio-AI evaluation is harder than image-classification evaluation, harder than NLP benchmark evaluation, and much harder than the single-number leaderboards in many model cards suggest. The teams that get it right run multiple methods, instrument production, and stop trying to compress sound quality into one float.

The teams that get it wrong tend to share three habits: one metric, one evaluation set, and no listening tests. Avoiding those three habits is 80% of the work.
