AI Transcription Accuracy: Why Vendor Benchmarks Lie, and How We Actually Test Models
If you are trying to figure out which speech-to-text model is the most accurate, the published benchmark number is the wrong place to look. A vendor's headline accuracy figure is an optimized result on a clean, read-aloud dataset — it tells you almost nothing about how the model handles the audio you actually have: a product review peppered with brand names and jargon, a meeting where two people talk over each other, a heavy accent, a creator who keeps flipping between languages.
I run Subanana, an AI speech-to-text tool. We route every transcription through a stack of evaluated models, and we re-test that stack constantly. This post is about how we test — the methodology, the criteria, and the real results from one of our own evaluation runs — and why we have stopped trusting vendor-published accuracy scores to make those decisions.
TL;DR
- Vendor accuracy benchmarks (a single WER score, "98% accurate") are mostly benchmark-maxing — measured on clean, scripted, single-speaker audio that looks nothing like real mixed-language, accented, or multi-speaker recordings.
- So we don't pick models by their published numbers. We test them on our own messy, real-world audio and judge the output the way a human editor would: did it fix misheard words, read as clean written language, and — crucially — not change any facts.
- In one real evaluation run, a small, fast model beat our heavier production default (about a 92% judge preference, roughly 13× faster). Bigger and slower was not more accurate.
- The failure that proves the point: a model silently rewrote a camera sensor "LYT-828" as "LYT-808" — clean-reading, factually wrong, and invisible to a WER score.
- Evaluating a tool yourself? Test your worst real audio — the accents, the cross-talk, the jargon, the language-switching — watch the timestamps on screen, and hunt for factual corruption, not the leaderboard number.
Why are vendor transcription benchmarks misleading?
A published Word Error Rate (WER) or accuracy percentage is a single number produced under conditions the vendor chose. Three things make it close to useless for picking a model for production:
- The test set is clean. Benchmark audio is usually scripted, single-speaker, recorded in a quiet room, in a high-resource language. Real audio is none of those things.
- The metric is coarse. WER counts substitutions, insertions, and deletions equally. But getting a model number wrong (a "Vivo X30" turning into a "Vivo X90") is a catastrophic error, while a dropped comma is harmless. WER scores them the same.
- It is the vendor's own scoreboard. Every lab reports the configuration where its model looks best. You are reading the high-water mark, not the expected result.
None of that is dishonest, exactly. It is just benchmark-maxing — optimizing for the leaderboard rather than for your use case. So when we evaluate models, we do not quote anyone's published figure. We run the model on the messy, code-mixed, real-world audio our users actually upload, and we judge the output on the things that matter for a finished subtitle or transcript.
That is the whole philosophy: accuracy is not a number a vendor hands you. It is something you measure on your own use case, or you do not really know it.
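To make the coarseness of WER concrete, here is a minimal sketch of the textbook calculation: word-level edit distance divided by the number of reference words. The sentence is made up for illustration, and this is not the scoring code any vendor uses. It shows a swapped sensor number and a dropped filler word earning exactly the same score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence (an assumption, not taken from the evaluation set).
reference = "the new phone uses the LYT-828 sensor and um it performs well in low light"
wrong_fact = reference.replace("LYT-828", "LYT-808")  # factual corruption
dropped_filler = reference.replace(" um", "")         # harmless cleanup

print(f"{word_error_rate(reference, wrong_fact):.3f}")      # 0.067
print(f"{word_error_rate(reference, dropped_filler):.3f}")  # 0.067
```

Both edits cost one word out of fifteen, so the metric cannot tell the harmful change from the harmless one.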
What "accuracy" actually means for a subtitle
When most people say "transcription accuracy," they are blending two completely different jobs:
- Speech-to-text (STT / ASR) — turning audio into raw text with timestamps. This is where WER lives.
- Text cleanup — turning that raw, messy ASR text into a publishable subtitle: fixing misheard words, converting spoken-language phrasing into clean written form, restoring spacing and punctuation, removing fillers, and crucially not changing any facts.
Both stages can fail, and they fail in different ways. A model can produce great raw text and still ship unusable subtitles because the timestamps drift. Another model can have perfectly aligned timestamps and still mangle a brand name. A single accuracy percentage cannot capture any of this, which is exactly why we test each stage separately, with a method suited to that stage.
The rest of this post walks through both: first the text-cleanup stage, where we have a structured evaluation with real numbers we can share, and then the raw-STT stage, where our findings are deliberately qualitative.
How we test the text-cleanup stage: LLM-as-judge on real audio
Here is the methodology for one real evaluation run we did in April 2026. The target was the model that does the cleanup pass on top of raw ASR output — the step that turns a rough machine transcript into a publishable subtitle. That pass does two distinct jobs, and we test each separately:
- Fixing mistakes: correcting the words and numbers the speech-to-text got wrong, such as a misheard brand name, a wrong model number, or a dropped negation.
- Cleaning up the wording: turning spoken, colloquial phrasing into clean written language, restoring punctuation and spacing, and trimming filler words, all without changing the meaning. (In some languages this gap is wide: Cantonese, for instance, has a distinct spoken-to-written conversion, 口語 → 書面語.)
(Scope, stated plainly: this run evaluates that text-cleanup pass, not raw speech-to-text. The two are tested differently.)
- The dataset was a small, deliberately curated set of real samples — in our case, code-switched Hong Kong Cantonese and English — chosen not for size but for the cases that break models: mixed script, technical terms and model numbers, punctuation-heavy passages, short fragile fragments, and long spans. The specific language matters less than the principle: a handful of genuinely nasty samples from your own use case surfaces more real failures than a thousand clean ones.
- The comparison was pairwise. For every sample, each candidate model's output went head-to-head against our current production baseline, and a separate judge model picked the better one — or called it a tie.
- The criteria were six things that actually define a good subtitle, scored independently per sample:
  - Fixing misheard words: did it correct what the speech-to-text got wrong?
  - Spoken-to-written cleanup: did it turn colloquial speech into clean written language? (In our Cantonese samples this is the 口語 → 書面語 conversion; every language has its own version of tidying speech into prose.)
  - Filler-word removal: did it drop the ums and false starts?
  - Factual preservation: did it leave names, numbers, and facts untouched?
  - Annotation prohibition: did it avoid inventing bracketed notes the speaker never said?
  - Thoroughness: did it actually clean the text up, or leave obvious errors in?
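The pairwise setup can be sketched in a few lines. This is a simplified illustration, not our actual harness: the `Judge` type, `pairwise_preference`, and the toy judge are hypothetical names, the real judge is a model call rather than a local function, and counting ties as non-wins is an assumption made here for simplicity.

```python
from collections import Counter
from typing import Callable, Iterable

# A judge receives (source, baseline_output, candidate_output) and returns
# "candidate", "baseline", or "tie". In practice it would wrap a judge-model
# call; here it is a plain callable so the sketch is self-contained.
Judge = Callable[[str, str, str], str]


def pairwise_preference(samples: Iterable[tuple[str, str, str]], judge: Judge) -> dict:
    """Tally head-to-head verdicts of a candidate against the baseline."""
    counts = Counter()
    for source, baseline_out, candidate_out in samples:
        verdict = judge(source, baseline_out, candidate_out)
        if verdict not in {"candidate", "baseline", "tie"}:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    total = sum(counts.values())
    return {
        "wins": counts["candidate"],
        "losses": counts["baseline"],
        "ties": counts["tie"],
        # Assumption: ties count as non-wins in the win rate.
        "win_rate": counts["candidate"] / total if total else 0.0,
    }


def toy_filler_judge(source: str, baseline_out: str, candidate_out: str) -> str:
    """Stand-in judge for one criterion: prefer the output with fewer 'um's."""
    base = baseline_out.lower().split().count("um")
    cand = candidate_out.lower().split().count("um")
    if cand < base:
        return "candidate"
    if base < cand:
        return "baseline"
    return "tie"


samples = [
    ("um so the sensor is good", "um so the sensor is good", "So the sensor is good."),
    ("it is um fine", "It is fine.", "It is um fine."),
    ("okay", "Okay.", "Okay."),
]
result = pairwise_preference(samples, toy_filler_judge)
print(result)
```

Running one such tally per criterion keeps the six scores independent, so a model that cleans up fluently but corrupts facts cannot hide behind a single blended number.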
We put 31 model configurations into that run. Only 17 were even runnable — the rest fell over at preflight with invalid model IDs, request timeouts, or unsupported settings, which is itself a useful result: a model you cannot reliably call is not a candidate, regardless of its benchmark score.
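A preflight gate of this kind can be as simple as sending one probe request per configuration and recording why each failure happened. The sketch below is an assumption about shape, not our actual tooling: `preflight`, `call_model`, and the fake client are hypothetical, and a real client would raise its own error types.

```python
from typing import Callable


def preflight(configs: dict, call_model: Callable[[dict, str], str],
              probe: str = "um this is a test") -> tuple[list, dict]:
    """Return (runnable config names, {failed name: reason})."""
    runnable, failed = [], {}
    for name, config in configs.items():
        try:
            output = call_model(config, probe)
        except Exception as exc:  # invalid ID, timeout, unsupported setting, ...
            failed[name] = f"{type(exc).__name__}: {exc}"
            continue
        if not isinstance(output, str) or not output.strip():
            failed[name] = "empty output"
            continue
        runnable.append(name)
    return runnable, failed


def fake_call_model(config: dict, text: str) -> str:
    """Stand-in for a real API client, used only to exercise the gate."""
    if config["model"] == "does-not-exist":
        raise ValueError("invalid model ID")
    if config.get("slow"):
        raise TimeoutError("request timed out")
    return text.replace("um ", "")


configs = {
    "good": {"model": "model-a"},
    "bad-id": {"model": "does-not-exist"},
    "too-slow": {"model": "model-b", "slow": True},
}
runnable, failed = preflight(configs, fake_call_model)
print(runnable)
print(failed)
```

Keeping the failure reasons matters as much as the pass list: it separates a typo in a model ID from a model that is genuinely too slow to use.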
This is documentation-grade methodology, not a vibe check. Every number below comes from that run's own output, and we are sharing it because it is ours to share — not because a vendor told us their model is good.
What we found: the numbers from our own run
A few results stood out. All figures are from our own evaluation; they are judge-preference win rates and speed on the subtitle-cleanup task — not anyone's STT accuracy percentage.
| Model configuration | Judge preference vs. baseline: fixing mistakes | Judge preference vs. baseline: cleanup | Run time |
|---|---|---|---|
| Production baseline (a Gemini 3 Flash model, default settings) | reference | reference | ~4 minutes |
| Same Gemini 3 Flash model, thinking turned off | 60% | 80% | ~18 seconds |
| A lighter Gemini 3.1 Flash Lite model, leanest run | 100% | ~67% | ~19 seconds |
| Same Gemini 3.1 Flash Lite model, best-scoring run | 100% | ~83% | ~19 seconds |
| A small GPT-5.4 nano model | up to 80% | up to ~67% | ~20–55 seconds |
| A Qwen3.6-Plus model | up to 80% | up to ~67% | ~11 minutes |
Three takeaways from our data:
- The best average judge preference in the whole run was about 92% — a lightweight Gemini 3.1 Flash Lite configuration that the judge preferred over our production baseline on the large majority of samples. A small, fast model beat the heavier default.
- The leanest runnable configuration was roughly 13x faster than the baseline — about 19 seconds versus roughly 4 minutes — at a fraction of the cost, and still won the head-to-head on fixing mistakes outright. Bigger and slower was not better.
- Capping the model's "thinking" budget was the single biggest efficiency win. The baseline spent the overwhelming majority of its budget on reasoning tokens it largely did not need. Turning that reasoning budget off on the same model family produced output the judge rated as good as or better than the baseline's, roughly an order of magnitude faster and far leaner. For a constrained, well-specified task like subtitle cleanup, extended reasoning was mostly wasted effort.
You will notice none of those are "accuracy percentages." They are relative preference scores from a judge model, on our audio, against our own baseline. That is a deliberately humbler claim than "98% accurate," and it is far more useful for actually choosing a model.
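For anyone who wants to check the arithmetic, the speed ratios fall straight out of the approximate run times in the table. The averaging step rests on an assumption: that the "about 92%" figure is the simple mean of the two judge-preference columns for the best-scoring Flash Lite run, which the run summary above does not spell out.

```python
# Approximate run times from the table, in seconds.
baseline_seconds = 4 * 60  # "~4 minutes"
run_seconds = {
    "Gemini 3 Flash, thinking off": 18,
    "Gemini 3.1 Flash Lite": 19,
    "Qwen3.6-Plus": 11 * 60,
}

# Speedup relative to the production baseline (values below 1 mean slower).
speedups = {name: baseline_seconds / secs for name, secs in run_seconds.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")

# Assumption: the average preference is the plain mean of the two columns.
best_run_preferences = [100, 83]  # fixing mistakes, cleanup (best-scoring run)
average_preference = sum(best_run_preferences) / len(best_run_preferences)
print(f"average preference: {average_preference:.1f}%")
```

The Flash Lite ratio comes out at about 12.6, which is the "roughly 13×" quoted above, and the mean is 91.5%, consistent with "about 92%" under that assumption.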
The failure that proves why human-judged, use-case testing matters
Here is the example that captures the entire argument. One candidate model, cleaning up a phone review (one of our code-switched Cantonese-English clips), did this:
Source: T-828 的 sensor 啦。那這顆 LYT-828 呢,我們,我們又來……
Baseline: ……呢粒 LYT-828 呢……
Candidate: ……嗰呢粒 LYT-808 呢……
The model silently rewrote the camera sensor "LYT-828" as "LYT-808." We saw the same class of error elsewhere in the run, where another candidate turned a "Vivo X30 Pro" into a "Vivo X90 Pro."
The text reads perfectly. The grammar is clean, the punctuation is restored, the spoken phrasing is tidied into proper written form. A WER score would barely register the change — one digit out of a long passage. But it is a factual corruption: a different product, a different sensor. For a tech reviewer, that is the kind of mistake that gets a correction demanded in the comments.
The lesson is not about any one language. It is that the most dangerous transcription errors are the fluent ones — a clean-reading sentence with a silently swapped technical term, model number, or proper noun. Those hide in exactly the mixed-script, jargon-dense audio that real users record, in any language. No published accuracy benchmark would have caught this; it only surfaced because we judged the output the way a human editor would — against the specific question "did the model change a fact?" — on the kind of audio where this failure actually happens. That is the difference between benchmark-maxing and use-case-grounded testing.
It also shows why "factual preservation" is one of our six criteria, and why we read it as a comparative signal rather than a literal error count. In the same run, a model rendered "ninety percent" two equally correct ways (百分之九十 versus 九成): semantically identical, no error at all. A naive metric would have flagged the rephrase and missed the sensor swap. Judgment, on the right material, gets that ordering right.
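A cheap complement to the judge is a mechanical check that pulls identifier-like tokens (Latin letters and digits, such as model numbers) out of two outputs and reports any that differ. The sketch below is an illustration under stated assumptions, not the criterion as the judge applies it: the regex and function names are hypothetical, and the two strings are the excerpts quoted above with the elisions dropped.

```python
import re
from collections import Counter

# ASCII letter/digit runs, optionally joined by hyphens, e.g. "LYT-828" or "X30".
IDENTIFIER = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def identifiers(text: str) -> Counter:
    """Count tokens that contain at least one digit (likely model numbers)."""
    return Counter(
        token for token in IDENTIFIER.findall(text)
        if any(ch.isdigit() for ch in token)
    )


def changed_identifiers(reference: str, candidate: str) -> dict:
    """Identifiers present in one text but not the other."""
    ref, cand = identifiers(reference), identifiers(candidate)
    return {
        "missing": sorted((ref - cand).elements()),
        "added": sorted((cand - ref).elements()),
    }


baseline_output = "呢粒 LYT-828 呢"
candidate_output = "嗰呢粒 LYT-808 呢"
print(changed_identifiers(baseline_output, candidate_output))
# {'missing': ['LYT-828'], 'added': ['LYT-808']}

# The harmless rephrase contains no Latin identifiers, so nothing is flagged.
print(changed_identifiers("百分之九十", "九成"))
# {'missing': [], 'added': []}
```

This only triages: it cannot see numbers written out in Chinese characters or a swapped proper noun without digits, so it narrows down where a human or judge model should look rather than replacing them.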
What about the raw speech-to-text stage?
For the STT stage itself — audio in, timestamped text out — our findings are intentionally qualitative. We do not publish a WER table, ours or anyone's, because the failures that matter here are not well captured by a single error rate. What breaks an STT model in production is usually one of: hallucinated content the speaker never said, missed valid speech, unstable performance on lower-resource or code-switched languages, or timestamps that drift out of sync with the audio.
A few things we have learned by testing models on our own audio, rather than reading their spec sheets:
- Good text does not mean good timestamps. We evaluated a frontier multimodal model as a transcription engine: its raw text quality was genuinely good, but its cue timestamps drifted — fine for a reading transcript, unusable for subtitles that have to land on the right frame.
- Some models produce unusable timing outright. A different model we tested for the same job is summed up in our notes as "trash timestamp": strong on paper, a non-starter for time-aligned captions.
- Low-resource and code-switched languages are where general models wobble. The cleaner, high-resource languages a benchmark leans on are the easy case; the wobble shows up on accents, dialects, and audio that switches between languages within a recording. Subanana started out using a single well-known STT model, and we were forced off the single-provider approach by exactly the kind of failure benchmarks hide: hallucinations and missed speech in real conditions, with the harder languages — Cantonese among them — least stable. That is why we now route across multiple evaluated engines and fall back automatically when one produces a bad segment.
- Real engineering lives in the gaps. When we brought a new STT provider in, the work was not "is the WER lower." It was: a stretch of background music getting mis-mapped onto the wrong caption's timing, stray [upbeat music] tags needing removal, segments getting glued together without spaces. None of that shows up in an accuracy score; all of it shows up for a user.
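Timestamp drift is one of the few failures in this list that is easy to put a number on, provided you have hand-checked cue start times for a clip. The sketch below assumes the reference and predicted cues line up one-to-one; the function name, the sample times, and the 0.5-second tolerance are illustrative assumptions, not thresholds we are recommending.

```python
from statistics import mean


def drift_report(reference_starts: list[float], predicted_starts: list[float],
                 tolerance: float = 0.5) -> dict:
    """Summarize how far predicted cue start times sit from the reference."""
    if len(reference_starts) != len(predicted_starts):
        raise ValueError("cue lists must align one-to-one")
    # Positive offset means the subtitle appears late.
    offsets = [p - r for r, p in zip(reference_starts, predicted_starts)]
    abs_offsets = [abs(o) for o in offsets]
    return {
        "mean_abs_drift": mean(abs_offsets),
        "max_abs_drift": max(abs_offsets),
        "cues_over_tolerance": sum(a > tolerance for a in abs_offsets),
    }


# Made-up cue times in seconds: the error grows as the clip goes on.
reference = [0.0, 4.0, 9.0, 15.0]
predicted = [0.1, 4.4, 9.9, 16.6]
report = drift_report(reference, predicted)
print(report)
```

A growing offset like this one is the pattern that reads fine in a text diff and looks obviously wrong on screen.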
The honest summary is that we pick the best-performing STT model per source language and per use case, and we keep re-checking — because a model that benchmarks well can still drift on timestamps or hallucinate on the harder languages, and the only way to know is to run it on the real thing. You can read more about how that routing and quality stack works on our AI subtitle tool and AI meeting transcription pages.
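In outline, per-language routing with automatic fallback looks something like the sketch below. It is a deliberately small illustration, not our production router: the engine names, the `routes` table, and the `looks_bad` quality gate are all hypothetical, and a real gate would check far more than leftover bracketed tags.

```python
import re
from typing import Callable

BRACKET_TAG = re.compile(r"\[[^\]]*\]")


def looks_bad(text: str) -> bool:
    """Toy quality gate: nothing left once bracketed tags are stripped."""
    return not BRACKET_TAG.sub("", text).strip()


def transcribe_segment(audio: bytes, language: str, routes: dict,
                       engines: dict, is_bad: Callable[[str], bool] = looks_bad):
    """Try engines in the order preferred for this language; fall back on failure."""
    order = routes.get(language, routes["default"])
    for name in order:
        try:
            text = engines[name](audio)
        except Exception:
            continue  # engine error: move on to the next one
        if not is_bad(text):
            return name, text
    raise RuntimeError(f"no engine produced usable output for {language!r}")


# Stand-in engines: the first returns only a music tag for this segment.
engines = {
    "engine_a": lambda audio: "[upbeat music]",
    "engine_b": lambda audio: "today we are testing the new phone",
}
routes = {"yue": ["engine_a", "engine_b"], "default": ["engine_b"]}

chosen, text = transcribe_segment(b"fake-audio", "yue", routes, engines)
print(chosen, "->", text)
```

The ordering in `routes` is where evaluation results feed in: re-running the tests changes the table, not the code.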
How should you evaluate transcription accuracy yourself?
You do not need an eval harness to avoid the benchmark trap. The principle is simple: test on your own audio, judge on what matters to you.
- Use your worst real audio, not a clean clip. Pick the file with the accents, the cross-talk, the jargon, the language-switching. That is where models separate.
- Check the timestamps, not just the words. Play the video with the subtitles on. Drifting cues are invisible in a text diff and obvious on screen.
- Hunt for factual corruption specifically. Scan names, numbers, and product/brand terms. A clean-reading subtitle with a wrong number is worse than an obviously rough one.
- Judge the finished output, not the raw transcript. What you ship is the corrected, formatted subtitle — so evaluate that, including how much manual cleanup it still needs.
- Re-test over time. Models change. The best one for your language this quarter may not be next quarter. We re-run our evaluation precisely because the answer keeps moving.
If you would rather not run that gauntlet yourself, that is the job we do continuously: benchmark the models, route to the best performer per language and use case, and layer hallucination detection and proofreading on top so the output you review is already the strongest the system can produce. You can try it on your own hardest audio — start with a free file and check the things above.
FAQ
Is a higher published accuracy percentage a reliable way to pick a transcription tool?
No. Published figures are optimized results on clean, often single-speaker, high-resource-language audio. They rarely predict performance on real audio with accents, cross-talk, technical terms, or language-switching. Test on your own files instead.
What is the difference between transcription accuracy and subtitle quality?
Transcription accuracy usually refers to raw speech-to-text — words and timestamps. Subtitle quality is the finished result after cleanup: misheard words corrected, spoken phrasing turned into clean written form, punctuation and spacing restored, fillers removed, and facts left intact. A tool can do one well and the other badly.
Why do you evaluate models with another model as the judge?
For the text-cleanup stage, an LLM judge lets us compare two outputs pairwise on consistent criteria, far faster than manual review, and re-run it cheaply whenever a new model ships. We treat its verdicts as a relative preference signal against our own baseline — not as an absolute accuracy score — on a deliberately hard, curated sample, and we keep humans in the loop on the failure cases that matter, like factual corruption.
Does a model with good transcription text always produce good subtitles?
No, and this is a common trap. We have seen models with genuinely good raw text produce drifting or unusable timestamps. For subtitles, which have to align to the frame, timing reliability matters as much as word accuracy, and in our testing strong text quality has not guaranteed reliable timing.
Why does Subanana use multiple speech-to-text models instead of one?
Because no single model is best across every language and use case, and any model can hallucinate or miss speech on real audio. Subanana started on a single provider and moved to a multi-model approach after production data showed the limits of one engine — especially on lower-resource and code-switched languages. We route to the best-evaluated model per source language and fall back automatically when output quality drops.