How to Get a Transcript of a YouTube Video (2026): Built-in Captions, AI Tools, Accuracy & Export
There are two reliable ways to get a transcript of a YouTube video. The fastest is YouTube's own Show transcript panel, which turns the video's captions into scrollable text you can read and copy. The other is to paste the video link into an AI transcription tool, which generates a fresh, cleaned-up transcript you can edit and export as SRT, Word, or plain text. Which one you want depends on whether you just need to skim the words, or you need an accurate, formatted document.
Quick disclosure: I run Subanana, an AI speech-to-text tool, so I have a horse in this race. I have tried to keep the comparison below honest — the built-in transcript is genuinely the better choice for some jobs, and I will say where.

How do you open a YouTube video's built-in transcript?
Every video that has captions — uploaded by the creator or generated automatically by YouTube — exposes a transcript you can read without any extra tool. On desktop:
- Open the video and click the video description below the title to expand it.
- Click Show transcript. A panel opens beside or below the video.
- As the video plays, the transcript scrolls to the current line. Click any line of caption text to jump the video to that moment.
Per YouTube's own help docs, the transcript "will scroll to show you the current caption text," and you can "click any line of caption text to jump to that part of the video." You can select the text in the panel and copy it out manually.
That is the whole feature. It is fast, free, and built in. For a quick read-through — checking whether a 40-minute video is worth your time, grabbing a quote, finding the timestamp where a topic comes up — it is usually all you need.
Why is the built-in transcript often not enough?
The Show transcript panel is a reader, not an export tool. Three gaps show up the moment you need the transcript as a working document:
- No clean export. There is no "download as Word" or "download as SRT" button in the panel. You copy-paste, and what you get is raw lines — often with timestamps interleaved and no paragraph structure.
- No speaker labels. For an interview, panel, or podcast, the built-in transcript is one undifferentiated wall of text. It does not tell you who said what.
- It depends on captions existing — and being right. If the creator never added captions and YouTube's automatic captions did not generate (more on when that happens below), there is no transcript to show. And when automatic captions are there, their accuracy varies.
If your goal is to read, the panel is fine. If your goal is to use the words — quote them in an article, repurpose them into a blog post, feed them to a summary, or hand them to a client — you usually want a real transcript file instead.
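If you do copy the panel text by hand, a few lines of Python can strip the interleaved timestamps. This is a minimal sketch under one assumption: the pasted text alternates timestamp-only lines (such as 0:03 or 1:02:15) with caption lines. The function name `clean_pasted_transcript` and the sample text are illustrative only, not something YouTube provides.

```python
import re

# Matches a line that contains only a timestamp such as "0:03", "12:45" or "1:02:15".
TIMESTAMP_LINE = re.compile(r"^\d{1,2}(?::\d{2}){1,2}$")


def clean_pasted_transcript(raw: str) -> str:
    """Drop timestamp-only lines and join the caption lines into one paragraph."""
    kept = []
    for line in raw.splitlines():
        text = line.strip()
        if not text or TIMESTAMP_LINE.match(text):
            continue  # skip blank lines and bare timestamps
        kept.append(text)
    # Collapse any repeated whitespace left over from the paste.
    return re.sub(r"\s+", " ", " ".join(kept))


pasted = """0:00
hello and welcome back
0:03
today we are looking at captions
1:02:15
thanks for watching"""

print(clean_pasted_transcript(pasted))
# hello and welcome back today we are looking at captions thanks for watching
```

This only removes the timestamps. It cannot add punctuation, paragraphs, or speaker labels, so the result is still a raw caption dump, just a tidier one.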
How accurate are YouTube's automatic captions?
YouTube is refreshingly direct about this in its automatic captioning documentation: automatic captions "are generated by machine learning algorithms, so the quality of the captions may vary," and they "might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise." Its standing advice to creators is to "always review automatic captions and edit any parts that haven't been properly transcribed."
In practice, that means automatic captions are strongest on clean, single-speaker, clearly enunciated audio in a well-supported language, and weaker the further you get from that: accented speech, crosstalk, music beds, technical jargon, and brand- or proper-noun-heavy content all degrade the result.
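One rough way to see how far automatic captions drift on your own audio is to correct a short passage by hand and compute the word error rate against it. The sketch below is a generic word-level edit distance, not YouTube's or Subanana's method. The function name `word_error_rate` and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the captions
                curr[j - 1] + 1,     # extra word in the captions
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


corrected = "the quick brown fox"
captions = "the quick brown box"
print(word_error_rate(corrected, captions))  # 0.25
```

Lower is better, and 0.0 means the word sequences match exactly. The sketch splits on whitespace only, so punctuation attached to a word counts as a difference. Normalise both texts the same way before comparing.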
There are also videos where automatic captions never appear at all. YouTube lists the reasons: the audio is still being processed, the language is not supported, the video is too long, the sound quality is poor or the speech is not recognised, there is a long silence at the start, or there are "multiple speakers whose speech overlaps or multiple languages at the same time." Live streams are a special case — automatic captions for live streams are English only, and they "won't remain on the video" after the stream ends; YouTube regenerates captions afterward from the recorded video.
So "get the YouTube transcript" is not always a solved problem out of the box. When it is not, or when the built-in accuracy is not good enough, an AI transcription tool is the fallback.
Built-in captions vs an AI transcription tool
Here is the honest split between the two routes:
| What you need | YouTube Show transcript | AI transcription tool (e.g. Subanana) |
|---|---|---|
| Read or skim the words | Yes — instant, free | Yes, after a short processing step |
| Cost | Free | Usually credit- or subscription-based |
| Download as SRT / VTT / Word / TXT | No (manual copy-paste only) | Yes — multiple export formats |
| Speaker labels (who said what) | No | Yes (diarization) |
| Clean punctuation + paragraphs | No (raw caption lines) | Yes (auto-punctuation in transcript mode) |
| Fix mistranscribed words in an editor | No | Yes — in-app editor with proofreading suggestions |
| Works when no captions exist | No | Yes — it transcribes the audio directly |
| Get the transcript in a different language | No | Yes, by setting a translation target |
| Best for | A quick read or a single quote | An accurate, editable, exportable document |
The pattern is simple: the built-in panel wins on speed and price; an AI tool wins whenever you need a file rather than a glance.
How do you get a clean, editable transcript with an AI tool?
You do not need to download the video first. Subanana imports the audio straight from a public YouTube link:
- Open the AI transcription tool and paste the public YouTube video URL. Subanana fetches the audio and transcribes it — no local download needed. (Public videos and Shorts work; age-restricted, members-only, or private videos generally will not import.)
- Choose Transcript mode and set the spoken (source) language. Transcript mode adds automatic punctuation and paragraph breaks, removes filler words, and can label speakers — so the output reads like a document, not a caption dump.
- Review and tidy in the editor. Subanana runs several quality layers automatically — it benchmarks speech-to-text models per language and routes to the best performer, detects likely hallucinations and re-runs those segments through another model, and flags misheard or same-sounding wrong words for you to accept or reject. You stay in control; nothing is silently changed.
- Export. Download the transcript as SRT, VTT, TXT, Word (DOCX), Excel (XLSX), or Markdown, or grab a ZIP with all of them.
That is the difference between "I can read the words on screen" and "I have a formatted, speaker-labelled transcript file I can edit, quote, and send."
Languages: getting a transcript in Cantonese, Mandarin, Japanese, and more
YouTube's automatic captions already cover a long list of languages — its docs name Cantonese/Hong Kong, Chinese, Japanese, Korean, and dozens of others. But "supported" and "accurate enough to publish" are different bars, especially for tonal languages, mixed-language speech (Cantonese with English terms is a classic), and heavy proper-noun content.
This is where a dedicated tool earns its place. Subanana transcribes across 80+ languages and routes each source language to the speech-to-text model that benchmarks best for it, rather than running one model for everything. The same language list applies whether you are transcribing, translating, or captioning. If the video mixes languages, or you need the transcript in a language other than the one spoken, you set a translation target and get a second version. (Translating a video into multiple target languages at once is a subtitle-generation job rather than a transcript job, which is useful to know if you are publishing the video, not just reading it. See how to translate a video into another language.)
Export formats: from transcript to SRT, Word, or plain text
The format you want depends on what you are doing with the transcript:
- TXT or Word (DOCX) — for reading, quoting, editing, or sending to a client.
- SRT or VTT — for re-uploading captions to a video (yours or a re-edit), since these carry timecodes.
- Excel (XLSX) or Markdown — for structured workflows, spreadsheets, or pasting into a CMS.
The built-in panel gives you none of these as a file. An AI tool gives you all of them. One caveat: Subanana's transcript exports are limited to SRT, VTT, TXT, DOCX, XLSX, and Markdown. That set covers reading, captioning, and CMS workflows, but if your video editor requires a niche caption format outside that list, check that it is supported before you commit.
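To make "these carry timecodes" concrete, here is a minimal sketch of what an SRT file contains: a running cue number, a start and end time written as HH:MM:SS,mmm, the caption text, and a blank line between cues. The segment data and the helper names `srt_timestamp` and `to_srt` are hypothetical; this illustrates the format, not how any particular tool writes its exports.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm form used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    # Cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments, for illustration only.
segments = [
    (0.0, 2.5, "Hello and welcome back."),
    (2.5, 6.0, "Today we are looking at captions."),
]
print(to_srt(segments))
```

VTT is close to this: the file starts with a `WEBVTT` header line and the timestamps use a period instead of a comma before the milliseconds.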
If you are coming the other way — you already have an SRT and want to read or convert it — see how to open an SRT file.
When the built-in transcript is the right call
I am not going to pretend you need a paid tool for everything. Use YouTube's Show transcript when:
- You just want to read or skim the content, or grab a single quote.
- The video is already well-captioned by the creator and the accuracy looks good.
- You need a timestamp to link to a specific moment — the click-to-jump feature is genuinely handy.
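For the timestamp case, a transcript timestamp can be turned into a shareable link with a short script. This is a sketch that assumes the widely used `t` query parameter on YouTube watch URLs, given in seconds. `VIDEO_ID` is a placeholder, and the helper names are illustrative.

```python
from urllib.parse import urlencode


def timestamp_to_seconds(stamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def link_at(video_id: str, stamp: str) -> str:
    """Build a watch URL that starts playback at the given transcript timestamp."""
    query = urlencode({"v": video_id, "t": f"{timestamp_to_seconds(stamp)}s"})
    return f"https://www.youtube.com/watch?{query}"


print(link_at("VIDEO_ID", "1:02:15"))
# https://www.youtube.com/watch?v=VIDEO_ID&t=3735s
```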
Reach for an AI transcription tool when you need an accurate, speaker-labelled, exportable document — for interviews and podcasts, for repurposing a video into written content, for non-English or mixed-language audio where automatic captions struggle, or for any video that never got captions in the first place.
For the deeper question of how transcription accuracy should even be judged — and why headline accuracy percentages are mostly marketing — see how we actually test transcription models. And if speaker labels matter to you, here is how AI adds them.
Either way, you are never stuck: if the built-in transcript is missing or rough, paste the link into Subanana and get a clean one.
Frequently asked questions
Can I get a transcript of any YouTube video?
You can read the built-in transcript of any video that has captions — creator-added or auto-generated. If a video has none (and automatic captions did not generate), use an AI tool that transcribes the audio directly from the link. Note that age-restricted, members-only, and private videos generally will not import into third-party tools.
Is the YouTube Show transcript feature free?
Yes. It is built into YouTube at no cost. The limitation is that it is a reader, not an exporter — there is no download-to-file button, no speaker labels, and no editor.
How do I download a YouTube transcript as a text or Word file?
The built-in panel does not offer a file download; you would copy-paste the text manually. To get a proper TXT, Word, or SRT file, paste the video link into an AI transcription tool like Subanana, which exports SRT, VTT, TXT, DOCX, XLSX, and Markdown.
How accurate are YouTube's automatic captions?
Per YouTube's documentation, quality "may vary" and captions "might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise" — YouTube advises always reviewing and editing them. Expect strong results on clean, single-speaker audio and weaker results on accented, noisy, jargon-heavy, or multi-speaker content.
Can I get a transcript in Cantonese, Mandarin, or Japanese?
Yes. YouTube's automatic captions support those languages, and Subanana transcribes across 80+ languages. For tonal or mixed-language audio, a tool that routes to the best speech-to-text model per language tends to produce a cleaner transcript than generic automatic captions.
Can I get speaker labels (who said what)?
Not from the built-in panel — it produces one undivided block of text. An AI transcription tool with speaker diarization labels each speaker, which matters for interviews, panels, and podcasts.
Do I have to download the video first?
No. Tools like Subanana import the audio directly from a public YouTube link — you paste the URL and it fetches and transcribes the audio without a local download.