What Is ASR (Automatic Speech Recognition)? How It Works, and What Drives Accuracy

ASR stands for automatic speech recognition — the technology that converts spoken audio into written text. You'll also see it called speech-to-text (STT); they mean the same thing. Every time YouTube generates captions on a video, a voice assistant obeys a spoken command, or a meeting tool hands you a transcript afterwards, ASR is the engine doing the work of mapping sound to words.

This is a plain-English explainer, not a sales pitch. I'll cover what ASR is, how the pipeline works, why the same model can transcribe one recording near-perfectly and butcher the next, and how automated transcription stacks up against a human typist. Disclosure: I run Subanana, an AI transcription tool, so I have skin in this game — but the technical claims below are sourced to public references (Wikipedia's speech recognition and word error rate entries) and vendor documentation, fetched June 2026. I deliberately quote no accuracy percentages, and the reason why is part of the story.

The four-stage ASR pipeline: sound is captured and cleaned, mapped to phonemes, weighed against plausible word sequences, then decoded into a timestamped transcript.

What is ASR, in one sentence?

ASR is a field of computational linguistics concerned with methods that translate spoken language into text. Give it an audio signal — a voice memo, a podcast, a conference talk — and it returns a sequence of words. That sounds simple, but speech is one of the harder things to digitise: humans run words together, drop sounds, talk over each other, use accents and slang, and do it all against background noise. ASR is the decades-long engineering effort to make a computer reliably hear all of that and write it down.

A useful distinction up front: ASR gives you the words; it doesn't, by itself, give you a finished document. Labelling who spoke, cleaning up filler, adding punctuation, splitting the text into readable chunks — those are separate steps layered on top of the raw recognition. Keeping that line clear explains a lot about why transcription tools differ so much even when they share similar underlying models.

How does ASR actually work?

Modern ASR is a pipeline. Here's the path a clip of audio takes, from sound wave to text you can read.

Capture and pre-processing. The system takes the audio signal and cleans it up — normalising volume, sometimes filtering noise — then slices it into very short frames (think fractions of a second) and converts each frame into a compact set of numbers that describe its sound.
Acoustic modelling. This stage maps those sound features to phonemes — the basic units of speech, the distinct sounds that make up words. It's answering "what sounds are in this audio?"
Language modelling. Sounds alone are ambiguous ("recognise speech" vs "wreck a nice beach" sound nearly identical). A language model weighs which word sequences are actually plausible given how the language works, and resolves the ambiguity. It's answering "given those sounds, what was most likely said?"
Decoding to text. The system combines the acoustic and language evidence to output its best-guess transcript, usually with timestamps marking when each chunk was spoken.

That acoustic-model-plus-language-model split is the classic architecture, and for years the statistical glue holding it together was the Hidden Markov Model (HMM), which let researchers fold acoustics, language and syntax into one probabilistic system. It worked, but it was brittle.

The big shift was deep learning. Starting around 2009, neural networks drove a large step-change in accuracy, and the field moved from hand-engineered pipelines toward end-to-end neural models — first recurrent networks and LSTMs, then the attention-based transformer architecture that underpins most current systems. The practical upshot: instead of separately tuning an acoustic model and a language model, a modern system can learn the whole audio-to-text mapping from large amounts of example data. That's why ASR quality jumped so visibly over the last decade, and why it keeps improving.

Why does ASR get some recordings wrong?

Because not all audio is equal. The same model that nails a clean studio recording can stumble badly on a noisy café interview — and understanding the factors tells you how to get better results. Drawing on the well-documented challenges in the field:

Accent, dialect and pronunciation. A model trained mostly on one accent does worse on others. Dialects and non-standard pronunciation push error rates up.
Background noise and audio quality. Traffic, music, room echo, a bad mic — anything that muddies the signal makes the acoustic stage guess more.
Overlapping speakers. When two people talk at once, the model has to disentangle voices it was largely trained to handle one at a time.
Spontaneous, messy speech. Read-aloud speech is comparatively easy. Real conversation — false starts, "um," trailing off, mid-sentence corrections — is much harder.
Vocabulary and domain. A tiny, fixed vocabulary (say, the ten digits) is nearly trivial. A wide-open vocabulary across specialist jargon, brand names and proper nouns is where errors cluster, because rare and confusable words are easy to mishear.
Code-switching. Two languages in one sentence — common in Hong Kong, Singapore and many bilingual workplaces — is one of the toughest cases, since the model has to switch its expectations on the fly.

You don't have to take a vendor's word for how real this is. YouTube's own help page says its automatic captions are produced with "speech recognition technology" and "machine learning algorithms," and then warns plainly that they "might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise," advising creators to review and edit them. That's a multi-billion-user product telling you, in writing, that ASR output needs checking on hard audio. It's not a knock on YouTube — it's the honest nature of the technology.

How is ASR accuracy measured — and why don't I quote a percentage?

The standard yardstick is Word Error Rate (WER). The formula is straightforward:

WER = (Substitutions + Deletions + Insertions) ÷ number of words in the reference transcript

You line the machine transcript up against a correct "reference," count the words it swapped, dropped or added, and divide by the total. Lower is better; 0 means a perfect match.

WER is genuinely useful, but it's a blunt summary, and that's exactly why I won't put a single accuracy number on Subanana or anyone else in a blog post:

It treats every error as equal. Mishearing "the" as "a" counts the same as turning "we will not ship" into "we will now ship" — even though one is cosmetic and the other reverses the meaning.
It can exceed 100%. Per its definition, if a model hallucinates lots of extra words, insertions alone can push WER above 1.0 — which makes "X% accurate" marketing slippery.
It's entirely dependent on the test audio. A model's WER on clean English read-aloud and its WER on a noisy three-way Cantonese call are different universes. A headline percentage hides which one you're being shown.

So when a tool advertises "99% accurate," the fair question is always: on what audio, in what language, measured how? A number with no test conditions attached tells you almost nothing about your recordings. I wrote a separate, detailed piece on this — why vendor accuracy benchmarks tend to mislead, and how we actually evaluate models — because it's the single most misunderstood thing about this category. The honest move is to test a tool on your own audio, not to trust a context-free percentage.

ASR vs human transcription: which should you use?

For decades, accurate transcription meant a person with headphones and a keyboard. ASR has changed the default for most use cases, but not all. Here's the honest comparison.

	ASR (automated)	Human transcription
Speed	Minutes, often faster than real time	Hours; typically several times the audio length
Cost	Low, usually per-minute or a flat subscription	High, billed per audio minute
Accuracy on clean audio	Very good in well-supported languages	Very good
Accuracy on hard audio (heavy accents, crosstalk, noise, niche jargon)	Drops — needs review	Most reliable; a skilled human handles context machines miss
Consistency	Identical input gives identical output	Varies by transcriber
Scales to bulk / many languages	Easily	Limited by available people
Best for	The overwhelming majority of meetings, videos, interviews and podcasts	High-stakes, low-margin-for-error work (legal, medical, broadcast) where a human pass is mandated

The practical reality for most people: ASR gets you a usable transcript in minutes, and a quick human review of the few hard passages closes the gap — far cheaper and faster than transcribing from scratch. Pure human transcription still wins where the cost of a single error is high enough to justify it. Many professional services in fact run ASR first and have a human clean it up, which is the best of both worlds.

Where you already use ASR every day

Even if you've never heard the acronym, you rely on ASR constantly:

Captions and subtitles. YouTube auto-captions support 100+ languages — including Cantonese/Hong Kong — all generated by ASR.
Meeting transcripts. Tools that record a call and produce a transcript afterwards are running ASR on the recording. (Worth knowing: in Google Meet, recording, captions and transcripts are separate features, and recording is limited to certain Workspace editions and saved to the organiser's Drive — so "I recorded it" doesn't automatically mean "I have a transcript.")
Voice assistants and voice search. Every spoken query to a phone or speaker is ASR turning your voice into a command.
Dictation and accessibility. Speak-to-type, and live captioning that makes spoken content accessible to deaf and hard-of-hearing audiences.

How Subanana approaches the hard parts of ASR

Since I run one, here's concretely how Subanana is built around the failure modes above — useful as a worked example of what "a finished transcription tool" adds on top of raw recognition.

No single model, ever. A recurring theme above is that a model's quality depends on the language and the audio. So instead of locking to one speech engine, Subanana continuously benchmarks speech-to-text models and routes each job to the best performer for that source language — you're not betting everything on one vendor's strengths. It does well on hard cases like Cantonese and code-switched audio, alongside the major Western languages, across 80+ languages.
Catching the model's bad moments. Every transcription is quality-checked, and when a segment shows signs of trouble — the kind of hallucinated text that plagues raw ASR on difficult audio — the system automatically re-routes that segment to another evaluated model rather than shipping the garbage.
Cleaning the words, with you in control. An LLM pass flags likely misheard words and same-sounding wrong characters and proposes corrections in the editor — you approve or reject each one; nothing is silently rewritten.
The parts ASR doesn't do. In transcript mode you get automatic speaker labels and filler cleaned into readable, punctuated prose; in subtitle mode you get time-aligned caption cues (and can translate into multiple target languages at once). A scoped glossary lets you pin brand names and jargon so the open-vocabulary problem above doesn't keep mangling your product names.

If you want to go deeper on the underlying model itself, OpenAI's Whisper is the best-known open ASR model — I wrote a hands-on guide to running Whisper and where a managed tool earns its place over the raw model.

Frequently asked questions

What does ASR stand for?

ASR stands for automatic speech recognition: technology that converts spoken audio into written text. It's also called speech-to-text, or STT — the terms are interchangeable.

Is ASR the same as speech-to-text (STT)?

Yes. "ASR" and "speech-to-text" describe the same thing — software that turns speech into text. ASR is the more academic/technical term; STT is the more common product term. Some people reserve STT for the live, real-time flavour, but functionally they're the same technology.

How accurate is ASR?

It depends entirely on the audio and the language. On a clean recording of a single speaker in a well-supported language, modern ASR is very good. On noisy audio, heavy accents, overlapping speakers, or niche vocabulary, accuracy drops and you'll want to review the output. That variability is why a single "X% accurate" figure is misleading — the only number that matters is how a tool does on your audio, which is why testing on your own files beats trusting a marketing percentage.

What's the difference between ASR and a language model like ChatGPT?

ASR turns audio into text. A large language model works on text that already exists — generating, summarising or answering questions about it. They're complementary: a transcription tool uses ASR to produce the transcript, then can use an LLM on top to summarise the meeting or clean up wording. They solve different halves of the problem.

Can ASR tell who is speaking?

Not on its own — identifying speakers is a separate step called speaker diarization, layered on top of the raw recognition. Plenty of tools include it so you get a "Speaker 1 / Speaker 2" transcript automatically; here's how speaker diarization works.

Will ASR replace human transcribers?

For most everyday transcription — meetings, videos, interviews, podcasts — ASR is already the default because it's far faster and cheaper. Human transcription remains the standard for high-stakes work where a single error is costly (legal, medical, broadcast), and a common professional workflow is ASR first, human review second.

The bottom line

ASR — automatic speech recognition, a.k.a. speech-to-text — is the technology that maps spoken audio to written words, built from an acoustic stage that hears sounds and a language stage that decides which words were said, and powered today by end-to-end neural models. Its accuracy isn't a fixed number; it rises and falls with your audio quality, language, speaker overlap and vocabulary. The smartest way to judge any ASR tool is to run your own recordings through it rather than trust a headline percentage.

If you want a transcript without assembling the pipeline yourself, that's what Subanana is for — upload a file or paste a public link and get back a speaker-labelled, cleaned-up transcript in any of 80+ languages, with a free tier to test on your own audio first. You can open the app and run a recording through it right now.

Try Subanana on your own audio — free to start