YouTube Subtitles 2026: the Full YouTube Studio Workflow, Plus Where Auto-Captions Break for Cantonese and Cross-Region Viewers

2026-04-23
Kevin Wong

A Cantonese-YouTuber friend messaged me last week. He'd just cut an 18-minute tech-unboxing video, clicked straight through to YouTube Studio's auto-captions, and came back with an entire track of 佢哋而家唔使咁做 ("they don't need to do that now") — clean 口語 (spoken-register) Cantonese, faithfully transcribed, unreadable for his TW and SG viewers. His question: "Is YouTube's auto-caption good enough, or do I need a different tool?"

That's the split this post is about. Below: the three native subtitle paths inside YouTube Studio, the genuine strengths of auto-captions, where they break for Cantonese and pan-zh-Hant distribution, and the Subanana round-trip I use to fix the gap without leaving YouTube as the publish platform.

Adding subtitles in YouTube Studio: the three native paths

YouTube Studio's Subtitles page gives you three ways to put captions on a video. You can use one in isolation or mix them. The flow:

  1. Open the subtitles panel. Sign in to YouTube Studio, click Subtitles in the left menu, pick the video, open the Subtitles tab. Click Add language and pick the source language (Chinese (Traditional), Cantonese, English, whatever matches the dialogue).
  2. Path A: let YouTube auto-generate. Within minutes to a few hours of upload, YouTube runs speech recognition on supported languages and produces an Automatic caption track. Open it and edit inline — typo fixes, timing nudges, line breaks. Lowest-friction option for short-form where raw accuracy is acceptable.
  3. Path B: type from scratch. If auto-captions haven't generated or the output is unusable, hit Add subtitles > Type manually. YouTube puts a player next to a text field and records timecodes as you type. Works for short clips or anywhere you want total control.
  4. Path C: upload an .srt. If you've already generated the SRT in another tool (Subanana, Premiere, a transcription service), click Add subtitles > Upload file > With timing. YouTube reads the timecodes and slots the cues onto the caption track. This is the path professional workflows default to — subtitle generation outsourced to whatever tool handles the language best, the final file lands back on YouTube for publish.
  5. Save or publish. Click Save draft or Publish. Viewers see the new language under the CC menu.

The three paths aren't mutually exclusive. A common pattern: let YouTube take a first pass, keep it if the material is English and clean, fall back to path C with a Subanana-generated SRT when the auto track doesn't hold up.
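For path C it helps to know what YouTube is parsing: an .srt file is plain text, each cue being an index, a timecode line in HH:MM:SS,mmm --> HH:MM:SS,mmm form, then the cue text, with blank lines between cues. A minimal sketch (the cue text here is invented for illustration):

```python
# Sketch of the .srt structure path C uploads: index, timecode line,
# cue text, blank-line separator. Timecodes are HH:MM:SS,mmm.
cues = [
    (0.0, 2.5, "Welcome back to the channel."),
    (2.5, 5.0, "Today we're unboxing a new phone."),
]

def fmt(seconds: float) -> str:
    # Seconds -> SRT timecode "HH:MM:SS,mmm".
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues) -> str:
    blocks = [
        f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}"
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"

print(to_srt(cues))
```

Any tool that emits this shape — Subanana, Premiere, a transcription service — produces a file YouTube's With timing upload will accept.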

What YouTube auto-captions are genuinely good at

Before the weaknesses, what YouTube auto-captions beat standalone AI subtitle tools on:

  • Free and unmetered. Upload the video and the caption job runs automatically. No separate subscription, no monthly minute quota, no file-size cap on subtitle generation.
  • Zero workflow friction. You don't download the video, run it through another tool, and re-upload. Footage, caption track, and publish all live on YouTube.
  • English accuracy is usually acceptable. Clean-audio English podcasts, vlogs, tutorials — YouTube auto-captions routinely come out in the "minor edit and ship" zone. For solo English creators that's often enough.
  • Timecode alignment is baked in. YouTube aligns the caption track internally, so you don't hit the out-of-sync drift you sometimes see after an external SRT upload.

If your content is English-dominant, short-form, and recorded in a controlled environment, YouTube auto-captions are the right answer.

Where YouTube auto-captions struggle

YouTube auto-captions are bound to Google's internal speech-recognition backend. On a few specific material types, the seams show. For Cantonese creators the first point is decisive.

1. Cantonese output stays in 口語 — no 書面語 conversion

This isn't an accuracy problem. It's a missing feature, and it's the one that matters most.

YouTube does list Cantonese on its auto-caption supported-languages page — the label is Cantonese/Hong Kong. One creator-side nuance: which language YouTube attempts depends on the setting inside Studio → Video details → Language and caption certification. Set it to Cantonese (Hong Kong) and YouTube tries Cantonese recognition. Leave it unset or set to the wrong variant and auto-captions may fail to generate or come back in a different Chinese variant entirely.

Even when the job succeeds, the output is a literal Cantonese transcription — an entire track of 佢哋而家唔使咁做喇-style 口語. YouTube doesn't run a post-processing step that rewrites 口語 into 書面語繁體, written-register Traditional Chinese (他們現在不需要這樣做了). For HK-only short-form where 口語 captions read natively, you can ship it. For YouTube long-form, courses, enterprise channels, or any pan-zh-Hant audience spanning HK, TW, SG, and MY, 口語 subtitles are unusable — you'll rewrite every cue by hand.
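To make the register gap concrete, here's a toy word-substitution sketch covering only the example phrase above. This is an illustration of what the conversion has to produce, not how any real tool (Subanana included) does it — genuine 口語 → 書面語 rewriting is context-dependent and can't be done with a lookup table:

```python
# Toy illustration of the 口語 -> 書面語 register gap. Real conversion
# needs context-aware rewriting; this mapping only covers one phrase.
SPOKEN_TO_WRITTEN = {
    "佢哋": "他們",    # "they"
    "而家": "現在",    # "now"
    "唔使": "不需要",  # "don't need to"
    "咁做": "這樣做",  # "do (it) this way"
    "喇": "了",        # sentence-final particle
}

def to_written(text: str) -> str:
    # Replace longer keys first so multi-character words win over
    # any single-character entries.
    for spoken in sorted(SPOKEN_TO_WRITTEN, key=len, reverse=True):
        text = text.replace(spoken, SPOKEN_TO_WRITTEN[spoken])
    return text

print(to_written("佢哋而家唔使咁做喇"))  # 他們現在不需要這樣做了
```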

For Cantonese creators distributing across Traditional Chinese markets, this single gap decides whether YouTube auto-captions can be the subtitle main chain. Usually the answer is no.

2. Accuracy varies with recording environment, accent, and audio quality

YouTube's own documentation acknowledges that "difficult accents, dialects, or background noise" reduce accuracy, without publishing per-language or per-scenario error-rate numbers. In practice, the same language recorded in a studio vs on the street produces meaningfully different results. If you're running auto-captions at channel scale, test a clip per material type before committing the workflow to a series.

3. Language coverage is broad, but per-language quality isn't uniform

YouTube supports auto-captions in 60+ languages — Cantonese/Hong Kong, Chinese, Japanese, Korean, Vietnamese, Thai, Indonesian, and more. Breadth is ahead of most of the standalone field.

Google doesn't publish per-language accuracy numbers, but the community-observed pattern has been consistent for years: English strongest, Mandarin solid, Cantonese noticeably weaker (a tone inventory traditionally counted at nine, plus high homophone density, are well-known recognition challenges), long-tail languages patchiest. Multilingual footage — Cantonese with English and Japanese in the same cut — is where the backend tends to break first.

The round-trip: generate the SRT in Subanana, upload it back to YouTube

If you've run YouTube auto-captions on Cantonese material and the output isn't usable — especially the 口語 issue above — you don't need to change publishing platform. The pragmatic fix is to outsource just the subtitle-generation step to a tool that handles Cantonese, Cantonese-English code-switching, and 口語 → 書面語 conversion, then upload the SRT back to YouTube through path C.

That's what Subanana's AI subtitle tool is built for. The single feature that matters most here: one-click 口語 → 書面語 conversion for Cantonese. After transcription, a toggle rewrites the whole SRT from 口語 Cantonese (佢哋而家唔使咁做) to 書面語繁體 (他們現在不需要這樣做). No equivalent in YouTube auto-captions, Premiere Speech to Text, ArcTime, or pyTranscriber. For any creator distributing Cantonese content beyond HK short-form, it's the highest-leverage step on the tool.

The workflow:

  1. Paste the YouTube URL (no download). The video is already on YouTube, so there's no local export step. Subanana accepts public YouTube URLs directly, and also public Instagram and Facebook links, so you can transcribe without pulling the file.
  2. Pick the source language. Cantonese, Mandarin, English, or mixed. Under the hood, Subanana routes to whichever STT model is currently benchmarking highest for that language — we continuously re-evaluate across providers rather than locking to a single backend.
  3. One-click 口語 → 書面語 (Cantonese-only). One toggle after transcription rewrites the SRT from 口語 to 書面語繁體. Keep the 口語 version, the 書面語 version, or export both.
  4. Proof in the Subanana editor. Three QA layers sit on top of the routed STT output:
    • Hallucination detection with auto-reroute. If a segment's output doesn't match the source audio, the system re-runs that segment through a different evaluated model. You don't see the retry, just the cleaner result.
    • LLM-assisted proofing for text-level errors. An LLM pass flags likely misheard words and same-sounding wrong characters — 在見 should be 再見, that kind of swap. Each suggestion waits for you to approve. Scope matters: this layer handles text-substitution errors only. It doesn't detect missed characters and doesn't touch timecodes.
    • CPS (characters-per-second) check. Cues that cram too many characters into too little time, or linger past the point where viewers have finished reading, get flagged so you can catch them before publish.
  5. Export SRT (or bilingual SRT). Plain SRT, or a bilingual SRT with source and translation stacked in the same cue — the 中英對照字幕 use case, one file instead of two. Six standalone formats total: SRT, VTT, TXT, DOCX, XLSX, Markdown.
  6. Upload back to YouTube Studio. Open the Subtitles page for the video, click Add subtitles > Upload file > With timing, and pick the SRT Subanana exported. YouTube slots the cues onto the caption track on their original timecodes. Confirm and publish — the track shows up under the CC menu, and you can set it as the default.
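The CPS check in step 4 is simple to reason about: character count divided by cue duration. A rough sketch — the 11-CPS ceiling is an illustrative assumption for CJK subtitles, not Subanana's actual threshold:

```python
# Rough sketch of a CPS (characters-per-second) check like step 4's.
# MAX_CPS = 11 is an assumed ceiling for CJK text, chosen for
# illustration only.
MAX_CPS = 11.0

def cps(text: str, start: float, end: float) -> float:
    # Count visible characters only; start/end are in seconds.
    chars = len(text.replace(" ", "").replace("\n", ""))
    return chars / (end - start)

def flag_cues(cues):
    # cues: list of (start, end, text); returns 1-based indices of
    # cues that make viewers read faster than the ceiling.
    return [i for i, (s, e, t) in enumerate(cues, start=1)
            if cps(t, s, e) > MAX_CPS]
```

A cue showing 他們現在不需要這樣做了 (11 characters) in 0.8 seconds runs at 13.75 CPS and gets flagged; the same text over 2 seconds passes.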

The trade-off is two extra steps a pure auto-caption flow doesn't have (transcribe in Subanana, upload SRT). What you get back is materially higher accuracy on Cantonese and code-switched content, 口語 → 書面語 handled in a single click, and a routing layer that isn't locked to one backend. If SRT is unfamiliar territory, this one-minute SRT tutorial covers the format.

Subanana paid plans start at US$9/month (about HK$68/month, annual billing) — the free tier is enough to try the Cantonese flow on a single video. See the pricing page for the full breakdown.

When YouTube auto-captions are enough

Not every upload is worth the extra hop. If you recognise yourself in these, stay native:

  • English-dominant content. English is the language with the deepest STT training data. YouTube auto-captions on clean English audio are already in the "minor edit and ship" zone.
  • Short-form, clean dialogue, controlled recording. Under-15-minute solo explainers and studio-recorded podcasts — error rates are generally acceptable, and the zero-friction flow wins.
  • Your channel already does a line-by-line proofing pass. If a human editor is reading every cue anyway, raw STT accuracy matters less.
  • Target audience is HK-local and 口語 captions are on-register. Shorts, vlogs, local meme content — 口語 on screen can read more native than 書面語 would.

Simple rule: the closer your video is to Cantonese / pan-zh-Hant audience / multi-speaker / outdoor-noisy, the better the ROI on the Subanana round-trip. The closer it is to English / solo / studio-clean, the more the YouTube-native flow is the right call.

If you cut in Premiere Pro, the Premiere Pro subtitles full workflow covers the same round-trip from inside the NLE. For subtitle readability settings once the transcription is done, four subtitle settings to get right for HK bilingual video covers CPS, line length, font, and duration.

YouTube subtitle FAQ

Q1: Are YouTube auto-captions free? Any quota? Free and unmetered. There's no monthly minute cap and no video-length ceiling — as long as the video is public or unlisted, YouTube runs the caption job. Coverage is limited to Google's published language list, and per-language accuracy varies.

Q2: How good are YouTube auto-captions for Cantonese? Workable but limited by two hard constraints. One: output stays in 口語 (佢哋 / 唔使 / 而家) with no automatic 書面語 conversion. Two: Cantonese accuracy generally trails Mandarin. For HK-only audiences you can ship the raw track for short-form. For cross-region Traditional Chinese distribution, generate the SRT in Subanana and upload it back.

Q3: If I upload my own SRT, does YouTube overwrite the auto-caption track? No. YouTube treats them as separate tracks and both show up under the CC menu. The common pattern is to set your uploaded track as the default and delete the auto-caption track once the manual one is live.

Q4: Will a Subanana-exported SRT go out of sync when uploaded to YouTube? Not as long as the SRT's timecodes match the same source video. Subanana exports absolute timecodes from 0:00 of the uploaded file, so if the YouTube upload is the same file you transcribed, timing aligns. If you re-cut the video after transcribing, regenerate the SRT against the new cut.
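One narrow exception to "regenerate": if the re-cut only trimmed a fixed amount off the head of the video, a uniform shift of every timecode can rescue the existing SRT. A sketch (not a Subanana feature; any other edit still calls for regenerating against the new cut):

```python
import re

# Shift every SRT timecode by a fixed offset, e.g. after trimming a
# fixed-length intro. Negative offsets move cues earlier; results are
# clamped at 00:00:00,000.
TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(srt_text: str, offset_seconds: float) -> str:
    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        total = max(0, h * 3_600_000 + m * 60_000 + s * 1000 + ms
                    + round(offset_seconds * 1000))
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return TIMECODE.sub(shift, srt_text)
```

Trimming a 1.5-second intro means shifting by -1.5 before re-uploading through path C.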

Q5: Can YouTube auto-captions output bilingual subtitles in one pass? No. YouTube auto-captions generate one source-language track at a time — translation relies on viewer-side Auto-translate, which is inconsistent. For proper bilingual subtitles you generate a bilingual SRT externally and upload it. Subanana's bilingual SRT export stacks source and translation in the same cue, so after upload the caption track displays both rows directly.
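Concretely, a bilingual SRT cue stacks both languages inside one block, roughly this shape (wording illustrative):

```text
1
00:00:00,000 --> 00:00:02,800
他們現在不需要這樣做了
They don't need to do that anymore
```

YouTube renders both lines of the cue together, which is what makes the 中英對照字幕 (Chinese-English side-by-side subtitles) display work from a single track.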


YouTube's native caption flow is still the right answer for English-dominant, short-form, controlled-recording content — free, zero-friction, deeply integrated with the platform. Where it hits the wall is Cantonese aimed at cross-region Traditional Chinese audiences: 口語 output with no 書面語 conversion is a structural gap, not a tuning issue. For that subset, outsource just one step — generate the SRT in Subanana, upload it back through YouTube Studio's Upload file > With timing path. You don't change publishing platform. You don't rebuild the channel workflow. You swap out one model for a better-suited one on one step.
