How to Transcribe and Translate Any Language to English with AI

2026-06-09
Kevin Wong

To get English text from audio or video in another language, you run two steps: transcribe the speech into text in its own language, then translate that text into English. AI handles both automatically, and for clear, single-speaker recordings the result is usually good enough to edit lightly and ship — whether the source is Spanish, Japanese, Cantonese, Arabic, or any of 80+ languages.

The honest part most tool pages skip: not all languages are equally easy. High-resource languages like Spanish or German are close to solved; tonal, dialectal, and code-switched languages are where quality is actually won or lost, and where the choice of model matters more than any headline accuracy number. I run Subanana, an AI speech-to-text app, so this is the practical version — the workflow, the realistic quality breakdown, and what to do about the hard cases. The same two-step pattern is exactly how you'd translate a video into any other language.

How do you transcribe a recording and translate it to English?

Here is the end-to-end workflow. It's identical for any source language — you just set the source language to match your audio.

  1. Add your audio or video. Upload a file, or paste a public YouTube, Instagram, or Facebook link and let Subanana fetch and transcribe it — no separate download step.
  2. Set the source language. Pick the language actually spoken in the recording (Korean, Portuguese, Cantonese, whatever it is). This tells the system to route to a model tuned for that language rather than a generic one.
  3. Add English as a translation language. Subanana transcribes the source first, then translates the text to English. In subtitle mode you can add more than one target at once — say, English and another language — and get a separate subtitle file for each.
  4. Let the AI transcribe and translate. Quality checks run automatically in the background (more on those below).
  5. Review and correct in the editor. How much you do here depends heavily on the language — trivial for easy ones, real work for hard ones.
  6. Export. Download SRT, VTT, TXT, DOCX, XLSX, or Markdown, grab a bilingual (source-over-English) subtitle file, or render a video with the captions burned in.

The whole flow funnels through one screen — you can start a transcription here.
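
As a rough illustration of the two-step pattern in steps 2 to 4, here is a minimal Python sketch. The `transcribe` and `translate` callables are hypothetical stand-ins for whatever speech and translation backends you use; they are not Subanana's API, and the stub functions exist only so the example runs.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: any speech-to-text or translation backend
# could be wrapped to fit them. Neither is a real Subanana API.
Transcriber = Callable[[str, str], List[str]]  # (audio_path, source_lang) -> segments
Translator = Callable[[str, str, str], str]    # (text, source_lang, target_lang) -> text


def transcribe_then_translate(
    audio_path: str,
    source_lang: str,
    transcribe: Transcriber,
    translate: Translator,
    target_lang: str = "en",
) -> List[Tuple[str, str]]:
    """Step 1: transcribe in the source language. Step 2: translate each segment."""
    source_segments = transcribe(audio_path, source_lang)
    return [
        (text, translate(text, source_lang, target_lang))
        for text in source_segments
    ]


# Stub backends so the sketch runs without any model installed.
def fake_transcribe(audio_path: str, source_lang: str) -> List[str]:
    return ["hola", "buenos días"]


def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    return {"hola": "hello", "buenos días": "good morning"}[text]


pairs = transcribe_then_translate("talk.mp3", "es", fake_transcribe, fake_translate)
```

Keeping each source segment next to its translation is a deliberate choice in this sketch: it lets a reviewer check the transcription and the translation separately, which is where the review step earns its keep.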

Why are some languages harder for AI than others?

Most speech-to-text models are trained overwhelmingly on English and a handful of other high-resource languages. The further your language sits from that training data, the more the model struggles. Three properties make a language genuinely hard:

  • Tone. In tonal languages, pitch changes word meaning, so a model that handles tone poorly will confuse words that differ only in pitch (academic background). Mandarin, Cantonese, Vietnamese, and Thai all carry this challenge.
  • Code-mixing. Many bilingual communities drop English (or another language) into a sentence mid-stream, often pronounced with a local accent. Hindi-English, Tagalog-English, and Hong Kong's Cantonese-English are everyday examples. A typical Cantonese-English line sounds like:
我哋下個 sprint 先 follow up 啦
("Let's follow up next sprint")

Recognising speech that switches language inside one utterance is "much more than a simple integration of two monolingual systems," because the model has to find the language boundary, handle the accented words, and cope with little training data for the mixed case (code-mixing dataset, arXiv).

  • A spoken/written gap. In some languages the spoken form and the formal written form differ in vocabulary and grammar, not just style — Arabic dialects vs Modern Standard Arabic, or spoken Cantonese vs Standard Written Chinese, are closer to two related languages than to one language written down (low-resource study). That matters the moment you want clean written output.
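
To make the language-boundary problem from the code-mixing bullet concrete, here is a small Python sketch that splits the example line into Chinese-character and Latin-letter runs. This is a text-only illustration under a strong simplifying assumption: in writing, the script change marks the boundary, whereas a speech model has to find that boundary in audio with no such marker. The function name and labels are made up for this example.

```python
import re

# Either a run of CJK Unified Ideographs (basic block only), or one or
# more ASCII words separated by single spaces.
_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z]+(?: [A-Za-z]+)*")


def script_runs(text: str) -> list[tuple[str, str]]:
    """Split text into ("han" | "latin", chunk) runs, in order of appearance."""
    runs = []
    for match in _RUN.finditer(text):
        chunk = match.group()
        label = "latin" if chunk[0].isascii() else "han"
        runs.append((label, chunk))
    return runs


line = "我哋下個 sprint 先 follow up 啦"
runs = script_runs(line)
# Five runs: han, latin ("sprint"), han, latin ("follow up"), han
```

One short sentence already contains four language switches, which gives a feel for why the embedded fragments tend to be the first thing to review.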

None of this makes hard languages impossible — it makes the choice of model the thing that decides your result.

What does the AI do well, and what does it still get wrong?

Being clear-eyed about the limits saves you time, and the breakdown below holds across languages.

| Task | How AI handles it today | Your job |
| --- | --- | --- |
| Clear, single-speaker audio | Strong; light edits at most | Spot-check names and numbers |
| Code-mixed speech (e.g. Hindi-English, Cantonese-English) | Decent, but the embedded language is the weak point | Fix mis-heard foreign terms |
| Proper nouns, brands, jargon | Often wrong on first pass | Pin them in a glossary up front |
| Spoken-to-formal-written (e.g. dialect to standard) | A real option, but a genuine rewrite | Pick the right output language |
| Translation to English | Good for plain content, literal on slang | Reword idioms and culture-specific lines |
| Fast, overlapping speakers | Diarization helps, isn't perfect | Re-check speaker turns |

Where it does well: for a clean recording — one person, a decent mic — modern models produce text that is genuinely usable with a quick read-through, even in a hard language. Translation into English for straightforward, informational content (a tutorial, a product walkthrough, a lecture) is reliable across languages.

Where it still struggles: in code-mixed speech, the embedded foreign-language fragments are the most common error source, exactly as the research predicts. Slang and culture-loaded lines translate literally and lose the point in any language; AI renders the words, not the meaning. And anything name-heavy needs help.

A note on numbers: I deliberately won't quote you an "X% accuracy" figure for any language. Those numbers are cherry-picked from clean audio, age badly, and tell you nothing about your recording. The honest answer is "good on clean audio, weaker on code-mixing and slang, and the model choice matters more than any single percentage."

How does Subanana handle the hard languages?

Three things in the product map directly onto the problems above.

It picks the model per language instead of locking to one. Subanana continuously benchmarks speech-to-text models and routes each job to the best performer for the source language — so a hard language isn't served by a model that happens to be great at English. This is the direct answer to "different languages need different models," which the code-mixing research makes plain.
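
The routing idea can be sketched in a few lines of Python. This is not Subanana's implementation: the model names and error rates below are invented placeholders, not benchmark results, and `pick_model` is a hypothetical helper that simply selects the lowest error rate recorded for the source language.

```python
# Placeholder numbers for illustration only; NOT real benchmark results.
# Each value is an error rate for that language (lower is better).
ERROR_RATES = {
    "model_a": {"english": 0.05, "cantonese": 0.30},
    "model_b": {"english": 0.08, "cantonese": 0.15},
}


def pick_model(error_rates: dict, language: str, default: str) -> str:
    """Return the model with the lowest recorded error rate for `language`.

    Falls back to `default` when no model has been benchmarked on it.
    """
    candidates = {
        model: by_language[language]
        for model, by_language in error_rates.items()
        if language in by_language
    }
    if not candidates:
        return default
    return min(candidates, key=candidates.get)


english_choice = pick_model(ERROR_RATES, "english", default="model_a")
cantonese_choice = pick_model(ERROR_RATES, "cantonese", default="model_a")
unknown_choice = pick_model(ERROR_RATES, "klingon", default="model_a")
```

With these placeholder numbers the best English model is not the best Cantonese model, which is the whole point of routing per language instead of locking to one.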

It catches model failures automatically. Every transcription is quality-checked. When a segment looks like a hallucination — text that doesn't match the audio — the system re-runs that part on a different model and keeps the cleaner result. That re-run is free; you pay for the file once, not per retry.
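
A check-and-retry loop of this kind might look like the Python sketch below. The detection heuristic here (an implausible speaking rate, or one word repeated many times in a row) is an assumption chosen for illustration, not a description of Subanana's actual quality checks, and the models are plain callables standing in for real ones.

```python
from typing import Callable, Sequence


def looks_hallucinated(
    text: str,
    duration_s: float,
    max_words_per_s: float = 6.0,
    max_repeat: int = 3,
) -> bool:
    """Crude, illustrative heuristic for text that cannot match the audio.

    Splits on whitespace, so it does not suit languages written
    without spaces between words.
    """
    words = text.split()
    if duration_s <= 0:
        return bool(words)  # any text for zero-length audio is suspect
    if len(words) / duration_s > max_words_per_s:
        return True  # more words than anyone could say in that time
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True  # the same word looping back-to-back
    return False


def transcribe_with_fallback(
    segment: bytes,
    duration_s: float,
    models: Sequence[Callable[[bytes], str]],
) -> str:
    """Try each model in order; return the first output that passes the check."""
    result = ""
    for model in models:
        result = model(segment)
        if not looks_hallucinated(result, duration_s):
            return result
    return result  # every model failed the check; return the last attempt


# Stub models so the sketch runs on its own.
primary = lambda audio: "la la la la la"
backup = lambda audio: "hello there"
text = transcribe_with_fallback(b"", 5.0, [primary, backup])
```

Only the flagged segment is re-run in this sketch, not the whole file, which keeps a retry cheap.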

It handles spoken-vs-written forms as separate output languages where they diverge. For languages with a big spoken/written gap, the colloquial and the formal-written forms are offered as distinct languages — so you choose which one you want as your output or translation target rather than getting whichever the model defaults to (for example, spoken Cantonese versus Standard Written Chinese). It's a language choice at setup, not a one-click "convert" button.

For cleanup, the editor flags likely mis-heard words and same-sounding wrong words and proposes fixes you approve one by one. To stop names and jargon from being mangled in the first place, pin them in a glossary before you run the job.
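
If you are post-processing transcripts yourself, a glossary can be approximated as a find-and-replace pass, as in the Python sketch below. This is an assumption-laden stand-in: it corrects known mis-hearings after the fact, whereas a glossary pinned before the job may influence recognition itself. The glossary entries and the `apply_glossary` helper are hypothetical.

```python
import re


def apply_glossary(text: str, glossary: dict[str, list[str]]) -> str:
    """Replace known mis-heard variants with the canonical spelling.

    `glossary` maps a canonical term to the variants it should replace.
    Matching is case-insensitive and respects word boundaries, so it suits
    space-delimited languages better than scripts written without spaces.
    """
    for canonical, variants in glossary.items():
        # Longest variants first, so a short variant cannot pre-empt a long one.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = rf"\b{re.escape(variant)}\b"
            # A function replacement inserts `canonical` literally.
            text = re.sub(pattern, lambda _m, c=canonical: c, text, flags=re.IGNORECASE)
    return text


glossary = {"Subanana": ["sub banana", "subbanana"]}
fixed = apply_glossary("I uploaded it to Sub Banana yesterday", glossary)
```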

Which Subanana mode should you use?

The right mode depends on what you're producing — and a couple of capabilities are mode-specific, so this matters regardless of source language.

  • Subtitles for a video (YouTube, a course, social): use subtitle mode. It's the only mode that lets you output multiple translation targets from one job, export a bilingual two-language caption file, or burn captions into the video.
  • A readable transcript (interview, podcast, meeting minutes): use transcript mode. It adds punctuation and paragraph breaks, removes filler words, and labels speakers. Translation here is to a single target language.
  • A live event (a talk, a bilingual seminar): use live captioning. It runs in real time from a microphone or routed system audio and, for live events specifically, can auto-detect when a speaker switches between languages mid-sentence — the live-event answer to code-mixing. Live captioning is single-target (one source, one translation) and exports SRT.
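
The mode rules above can be condensed into a small decision helper. The Python sketch below encodes only what the list states; the function and parameter names are invented for illustration and do not correspond to any Subanana setting.

```python
def choose_mode(
    live: bool = False,
    want_subtitles: bool = False,
    num_targets: int = 1,
    bilingual_file: bool = False,
    burn_in: bool = False,
) -> str:
    """Map what you need to produce onto one of the three modes."""
    # Capabilities the list describes as exclusive to subtitle mode.
    subtitle_only = num_targets > 1 or bilingual_file or burn_in
    if live:
        if subtitle_only:
            raise ValueError(
                "live captioning is single-target; use subtitle mode "
                "on a recording for multiple targets, bilingual files, or burn-in"
            )
        return "live captioning"
    if want_subtitles or subtitle_only:
        return "subtitle"
    return "transcript"


podcast = choose_mode()
course_video = choose_mode(want_subtitles=True, num_targets=2)
seminar = choose_mode(live=True)
```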

A quick scope note so you don't hit a surprise: Subanana transcribes and translates speech — it isn't a video editor, it doesn't pull existing burned-in subtitles off a finished video, and it doesn't do voice dubbing. It turns the spoken words into text and then into English.

Frequently asked questions

Can AI transcribe any language and translate it to English?

For the 80+ languages Subanana supports, yes — the two-step flow (transcribe in the source language, then translate to English) is the same for each. Quality is highest on clear audio in high-resource languages and takes more review on tonal, dialectal, or code-mixed ones.

Which languages are hardest to get right?

Tonal languages (Mandarin, Cantonese, Vietnamese, Thai), heavily code-mixed speech (Hindi-English, Tagalog-English), and languages with a wide spoken/written gap (Arabic dialects, for instance). They're all workable — they just need a language-tuned model and more review.

Can AI transcribe speech that mixes two languages?

Yes. Code-mixed audio is supported and common in many bilingual regions. Expect the dominant language to come through well and the embedded second-language terms to be the most likely thing you'll touch up — that's the known hard part of code-switched recognition, not a tool-specific limit.

Can I get subtitles in two languages from one video?

Yes, in subtitle mode. You can add multiple translation targets and export a bilingual subtitle file with the source language and the English translation stacked per line, or separate files per language.
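
For a sense of what a stacked bilingual file contains, here is a minimal Python sketch that writes SRT cues with the source line above the English line. The `Cue` class and function names are illustrative, the timings are made up, and real exports may differ in detail; only the SRT cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text lines, blank line) is standard.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float       # seconds from the start of the video
    end: float
    source_text: str
    english_text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_bilingual_srt(cues: list[Cue]) -> str:
    """Render cues with the source line stacked over the English line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}\n"
            f"{cue.source_text}\n"
            f"{cue.english_text}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


cues = [Cue(0.0, 2.5, "我哋下個 sprint 先 follow up 啦", "Let's follow up next sprint")]
srt_text = to_bilingual_srt(cues)
```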

How accurate is AI on a hard language?

Good on clean, single-speaker audio; weaker on code-mixing, slang, and proper nouns. Accuracy depends far more on your recording quality and the model used than on any headline percentage, which is why a tool that routes to a language-tuned model and lets you review the output beats one that quotes a big number.

Do I have to clean up the result myself?

Plan to review, especially for names, numbers, and any second-language terms in code-mixed speech. The editor speeds this up by proposing corrections for likely mis-heard words, and a glossary you set beforehand prevents the most common mistakes.

Is there a free way to try it?

You can run a transcription and preview the result on the free tier before subscribing; exporting subtitle and transcript files is a paid feature. See pricing for what each plan includes.

The two-step pattern is the same in every language; what changes is how much the model can do on its own versus how much you review. Pick a tool that routes to a model tuned for your source language, and the hard languages become workable rather than a gamble. Try it on your own audio, or create a free account to start.
