Premiere Pro Subtitles: The Full Speech-to-Text Workflow, Plus Where It Breaks for Cantonese and Code-Switched HK Content

I was cutting a 20-minute Cantonese product review in Premiere last week. Timeline locked, B-roll in place, audio ducking done. One step left: subtitles. I opened Speech to Text and the first line came back as 佢哋而家唔使咁做喇 ("they don't need to do it like this anymore"). Clean 口語 (colloquial spoken Cantonese), faithfully transcribed, and unusable as a YouTube subtitle for any audience outside HK short-form.

That's the gap this post is about. Below: the full Premiere Speech-to-Text workflow, where it breaks for Cantonese and code-switched HK content, and the round-trip I use to fix it without switching editors.

Premiere Pro Speech to Text: the complete native workflow

Adobe merged captioning into the Text panel a few versions back. No plugins, no third-party service, no leaving the timeline. The path:

Import footage and drop it on the timeline. Check which audio track carries dialogue — Speech to Text transcribes the selected audio track, and picking the wrong one (music stem, SFX stem) poisons the output immediately.
Open the Text panel. Window > Text, then switch to the Transcript tab. Click Transcribe sequence (or Create transcription on older builds).
Pick language and audio track. The popup asks for source language — Mandarin Chinese and Cantonese are both listed. Point it at your dialogue track, tick Mix only if the dialogue is spread across multiple tracks. Hit Transcribe.
Proof the transcript. Premiere drops the text into the Transcript tab with word-level timecodes mapped to the sequence. This is still raw text, not captions — you can fix typos, merge lines, delete filler before the caption step. If you lean on text-based editing, this is where you also trim the timeline by deleting words in the transcript.
Generate captions. Click Create captions > Create from sequence transcript. Set max characters per line, max duration per cue, line spacing. Premiere slices the transcript into cues and drops them on a caption track above your video.
Tune on the timeline. Every cue is now a draggable clip. Duration, position, text — edit in place. Font, size, stroke, background colour live in Essential Graphics. This is the part standalone tools genuinely can't replicate: text and picture in one view, one keystroke away from the next edit.
Export. In the Export panel, tick Captions. You get three options — sidecar .srt or .vtt (upload-ready for YouTube, LinkedIn), embedded caption track inside the video container, or Burn Captions Into Video for platforms that strip caption data (Instagram Reels, TikTok).
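Adobe doesn't publish exactly how Create captions slices a transcript into cues, but the settings it asks for (max characters per line, max duration per cue) point at the general idea. Below is a minimal sketch of one plausible approach: a greedy packer over word-level timings. The Word and Cue types and the segment function are hypothetical illustrations, not Premiere's implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from sequence start
    end: float


@dataclass
class Cue:
    text: str
    start: float
    end: float


def _flush(words: list[Word], joiner: str) -> Cue:
    """Turn a run of words into one cue spanning first start to last end."""
    return Cue(joiner.join(w.text for w in words), words[0].start, words[-1].end)


def segment(words: list[Word], max_chars: int = 16,
            max_duration: float = 5.0, joiner: str = "") -> list[Cue]:
    """Greedily pack consecutive words into cues.

    A new cue starts when adding the next word would push the current cue
    past max_chars characters or past max_duration seconds.
    """
    cues: list[Cue] = []
    current: list[Word] = []
    for word in words:
        if current:
            text_len = len(joiner.join(w.text for w in current + [word]))
            duration = word.end - current[0].start
            if text_len > max_chars or duration > max_duration:
                cues.append(_flush(current, joiner))
                current = []
        current.append(word)
    if current:
        cues.append(_flush(current, joiner))
    return cues


words = [Word("他們", 0.0, 0.4), Word("現在", 0.4, 0.8),
         Word("不需要", 0.8, 1.4), Word("這樣做", 1.4, 2.0)]
for cue in segment(words, max_chars=6):
    print(cue)
# Cue(text='他們現在', start=0.0, end=0.8)
# Cue(text='不需要這樣做', start=0.8, end=2.0)
```

Joining with an empty string suits Chinese; for English you'd pass joiner=" ". A single word longer than max_chars still gets a cue of its own rather than being split mid-word.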

If your source is clean English, clear Mandarin, or a studio-recorded podcast, the native workflow is complete end-to-end. Transcribe → edit → burn, one app, zero hand-offs. That's a real advantage and the reason I don't tell most Premiere users to switch.

Where Premiere Speech to Text genuinely wins

Before the weaknesses, here's where Premiere STT beats standalone AI subtitle tools:

Timeline integration. Transcript cues align to cut points. Editing a caption doesn't mean exporting anything — you're already in the right app. On a project already mid-cut in Premiere, this friction gap is real and it adds up across a long edit.
Text-based rough-cutting. Recent Premiere builds let you delete transcript segments and the timeline clip disappears with them. For interview, podcast, or long-form lecture footage, this saves serious time.
Bundled with the Creative Cloud sub. If you're already paying for Premiere, Speech to Text costs nothing extra. Premiere single-app runs US$22.99/mo on the annual commit and US$34.49/mo month-to-month. No standalone STT line item.
One-app Adobe stack. Auto Reframe, Warp Stabilizer, Lumetri Color all sit a tab away. No file shuttling between a subtitle tool and an NLE.

If your content is English-dominant or clean-Mandarin and you live on the timeline, Premiere native is the right answer and I'm not going to talk you out of it.

Five scenarios where Premiere Speech to Text struggles

Premiere's STT is bound to a single backend model. On a few specific material types, the seams show. For Cantonese creators, the first one is decisive.

1. No 口語 → 書面語 conversion for Cantonese

This isn't an accuracy problem. It's a missing feature.

Premiere STT transcribes audio literally. Spoken Cantonese comes out as written Cantonese: 佢哋而家唔使咁做 stays exactly that, character for character. There's no post-processing step that converts 口語 to 書面語繁體, standard written Chinese in Traditional characters (他們現在不需要這樣做). If your audience is HK-only and you're cutting IG Reels or Shorts where 口語 captions read natively, fine. But the moment you publish to YouTube long-form, a course, an enterprise client, or any pan-zh-Hant (Traditional Chinese) audience, 口語 subtitles are unusable, and you'll rewrite every cue by hand.

This single gap is the biggest reason Cantonese creators can't use Premiere STT as their main subtitle pipeline. 口語 → 書面語 is a Cantonese-specific linguistic conversion, and it needs a dedicated layer on top of the STT output.
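To see why this needs a dedicated layer rather than a find-and-replace, here's a deliberately naive sketch: longest-match substitution over a tiny, hypothetical 口語 → 書面語 lexicon. It handles the example sentence from point 1, but a lookup table can't deal with word-order differences or with words whose written form depends on context, which is exactly the work a real conversion layer has to do. This is an illustration only, not Subanana's implementation.

```python
# Hypothetical mini-lexicon for illustration only. A usable converter needs
# a far larger vocabulary plus rules for grammar and word order.
LEXICON = {
    "佢哋": "他們",
    "而家": "現在",
    "唔使": "不需要",
    "咁": "這樣",
    "喇": "",  # sentence-final particle: simply dropped here
}


def to_written(text: str, lexicon: dict[str, str] = LEXICON) -> str:
    """Left-to-right, longest-match substitution of colloquial tokens."""
    keys = sorted(lexicon, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(lexicon[key])
                i += len(key)
                break
        else:
            # No lexicon entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_written("佢哋而家唔使咁做喇"))  # 他們現在不需要這樣做
```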

2. Code-switching (Cantonese + English mix)

Cantonese speakers mix in English constantly: brand names, tech terms, loanwords. It's the default register for HK creators, tech channels, and anything aimed at SG/MY audiences. Premiere STT routinely phoneticizes English tokens into near-sounding Chinese, or into the wrong English word. In a line like Sharp 個 AQUOS XLED 嘅對比度 ("the contrast on Sharp's AQUOS XLED"), Sharp can come back as Shop; elsewhere, OLED comes back as 歐利. Every misfire is a manual fix.
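If you stay in Premiere, one way to cut down repeat fixes is a correction glossary built from misfires you've already seen (like Shop and 歐利 above), applied to the exported caption text before re-importing. A minimal sketch, assuming a hypothetical GLOSSARY dict. Latin-script entries only match when they aren't part of a longer English word, so Shop won't rewrite Shopping. Review each hit anyway: a genuine mention of a shop would get "fixed" too.

```python
import re

# Hypothetical glossary: misheard form -> intended term.
GLOSSARY = {
    "歐利": "OLED",
    "Shop": "Sharp",
}


def apply_glossary(line: str, glossary: dict[str, str] = GLOSSARY) -> str:
    """Replace known STT misfires in one line of caption text."""
    for wrong, right in glossary.items():
        pattern = re.escape(wrong)
        if wrong.isascii():
            # Don't match inside longer Latin words ("Shopping" stays as-is).
            # \b isn't used because CJK characters also count as word chars.
            pattern = r"(?<![A-Za-z])" + pattern + r"(?![A-Za-z])"
        line = re.sub(pattern, right, line)
    return line


print(apply_glossary("Shop 個 AQUOS XLED 嘅對比度好過歐利"))
# Sharp 個 AQUOS XLED 嘅對比度好過OLED
```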

3. Cantonese accuracy sits around 89% on same-batch testing

Even if you accept 口語-only output, the recognition layer itself isn't where the strongest Cantonese models are. On a same-batch test across five subtitle tools (full results in AI subtitle app comparison 2026), Premiere landed at about 89% on Cantonese. For a 15-minute short, that's workable. For a one-hour interview, a roughly 11% character error rate pushes QA time close to typing the whole thing by hand, and that's before the 口語 rewrite from point 1.
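If you want to check an accuracy figure like this against your own footage, the usual metric is character error rate (CER): the edit distance (substitutions, deletions, insertions) between a hand-checked reference transcript and the STT output, divided by the reference length. Treating accuracy as roughly 1 − CER is my assumption here; the comparison post may define it differently. A minimal Python sketch:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against a hand-checked reference."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("再見", "在見"))  # 0.5
```

For scale: at an assumed 200 characters per minute (a placeholder; count your own footage), a one-hour interview is about 12,000 characters, and an 11% CER works out to roughly 1,300 characters to fix.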

4. Noise, multi-speaker, and version-to-version drift

Outdoor interviews, event recordings, multi-speaker rooms: anywhere overlapping audio and background noise dominate, Premiere STT accuracy drops measurably. Adobe announced a 26.2 accuracy upgrade on March 27, 2026, citing up to 36% error reduction on "some languages." Cantonese wasn't called out, so don't assume it got the boost until you've retested your own footage.

The other version-drift issue is regressions. Earlier in 2026, Premiere 26.0.1 shipped an STT break: the phrase "in 2016 building a mobile app" came back as "16 billion moves to cost a minimum of", and the fix didn't land until 26.2. If STT is in your production path, keep a known-result clip and re-run it against every Premiere point release before you rely on it.
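The known-result check is easy to script. A minimal sketch using Python's standard difflib: compare the fresh transcript of your test clip against the one you verified by hand, and list every span that changed. The 0.98 threshold is an arbitrary placeholder, and SequenceMatcher.ratio() is a similarity score, not a character error rate.

```python
import difflib


def regression_report(known_good: str, new_run: str, threshold: float = 0.98):
    """Compare a fresh STT transcript of the test clip with the verified one.

    Returns (passed, similarity, changes). `changes` lists every span that
    differs as (operation, expected_text, new_text) for manual review.
    """
    matcher = difflib.SequenceMatcher(None, known_good, new_run, autojunk=False)
    similarity = matcher.ratio()
    changes = [
        (op, known_good[i1:i2], new_run[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
    return similarity >= threshold, similarity, changes


passed, score, changes = regression_report(
    "in 2016 building a mobile app",
    "16 billion moves to cost a minimum of",
)
print(passed, round(score, 2))
for change in changes:
    print(change)
```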

5. Only 18 supported languages

Adobe's 26.2 announcement states Premiere STT "supports 18 languages." Among them: Mandarin (Simplified and Traditional), Cantonese, English, most major European languages (French, German, Spanish, Italian, Portuguese, Russian, Dutch, Danish, Norwegian, Swedish), plus Korean, Japanese, and Hindi. If your content stays within that set, coverage is fine.

The moment you work across Southeast Asian languages, smaller-market European languages, or multilingual footage that swaps between three or four languages in one cut, Premiere hits a wall. Subanana routes across 80+ source languages. Language coverage is one of the cleanest advantages a cloud standalone tool has over an NLE-bundled STT feature.

The round-trip: generate the SRT in Subanana, import it back into Premiere

If you've run Premiere STT on Cantonese material and the output wasn't usable — especially on the 口語 issue in point 1 — you don't need to switch editors. The pragmatic fix is to outsource just the subtitle-generation step to a tool that handles Cantonese, Cantonese-English mix, and 口語 → 書面語 conversion, then import the SRT back into Premiere and keep cutting.

That's the job Subanana's AI subtitle tool is built for. The feature that matters most here is built-in, one-click 口語 → 書面語 conversion for Cantonese: a Cantonese-to-書面語繁體 conversion layer on top of the STT output, with no equivalent in Premiere, ArcTime, Taption, or pyTranscriber. For pan-zh-Hant distribution, it's the tool's single highest-leverage feature.

The workflow:

Upload the file or paste a link. Direct upload works for anything up to 15 GB / 3 hours on paid plans. If the footage is already on YouTube, Instagram, or Facebook, paste the public URL — Subanana pulls and transcribes without a local download.
Pick the source language. Cantonese, Mandarin, English, or mixed. Under the hood, Subanana routes to whichever STT model is currently benchmarking highest for that language — we continuously re-evaluate across providers rather than locking to one vendor.
One-click 口語 → 書面語 (Cantonese-only). After transcription, a single toggle converts the whole SRT from 口語 to 書面語繁體. Keep the original 口語 version, export the 書面語 version, or export both.
Proof in the editor. The Subanana editor layers a few QA checks on top of the routed STT output. Hallucination detection auto-reroutes segments that don't match the audio to a different evaluated model — you don't see the retry, just the cleaner result. An LLM proofing pass flags likely misheard words and same-sounding wrong characters (e.g. 在見 → 再見) and waits for you to approve each suggestion. Scope matters: the LLM layer handles text-level substitution errors only — it doesn't detect missed characters and doesn't touch timecodes. A CPS (characters-per-second) rule flags cues that cram too many characters into too little time, so you can spot "viewer can't finish reading this line" before the video goes up.
Export the SRT. Plain SRT, or a bilingual SRT with source and translation stacked in one cue (the 中英對照字幕, or Chinese-English parallel subtitle, use case: one file instead of two). Six standalone formats in total: SRT, VTT, TXT, DOCX, XLSX, Markdown.
Back to Premiere. In the Text panel, choose Import captions from file and pick the SRT Subanana exported. The cues drop onto the caption track at the right timecodes. From here it's the Premiere workflow you already know: tune positions, style in Essential Graphics, export or burn in.
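The CPS rule from the proofing step is simple enough to run on any SRT, including one exported straight out of Premiere. A minimal sketch, assuming a 9 characters-per-second limit (a placeholder: Subanana's actual threshold isn't stated here, and reading-speed limits vary by language and audience) and counting only non-whitespace characters:

```python
import re

TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def flag_fast_cues(srt_text: str, max_cps: float = 9.0) -> list[tuple[str, float]]:
    """Return (cue_number, cps) for cues that exceed max_cps.

    Counts non-whitespace characters; zero-length cues are always flagged.
    """
    flagged = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (number, timing, text)
        match = TIMING.search(lines[1])
        if not match:
            continue
        parts = match.groups()
        duration = _seconds(*parts[4:]) - _seconds(*parts[:4])
        chars = len(re.sub(r"\s", "", "".join(lines[2:])))
        if duration <= 0:
            flagged.append((lines[0].strip(), float("inf")))
        elif chars / duration > max_cps:
            flagged.append((lines[0].strip(), round(chars / duration, 1)))
    return flagged


srt = """1
00:00:01,000 --> 00:00:03,000
他們現在不需要這樣做

2
00:00:03,000 --> 00:00:04,000
他們現在不需要這樣做
"""
print(flag_fast_cues(srt))  # [('2', 10.0)]
```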

The trade-off is one export-import hop the pure-Premiere flow doesn't have. What you get back is materially higher accuracy on Cantonese and Cantonese-English mix, 口語 → 書面語 handled in one click, and a routing layer that isn't tied to a single backend model.

Pricing: Subanana paid plans start at US$9/month (about HK$68/month, annual billing) — added to the Creative Cloud sub, not replacing it. See the pricing page for the full tier breakdown.

When to stay pure Premiere

Not every project is worth the extra hop. If you recognise yourself in these, skip Subanana and stay in Premiere STT:

English-dominant content. English has the deepest STT training data of any language. Premiere's accuracy on clean English dialogue is already where it needs to be.
Text-based editing is core to your cut. Deleting words in the transcript to trim the timeline is a Premiere-native workflow. Standalone tools don't replicate it.
Short-form, clean dialogue, controlled recording. Studio-recorded podcasts, single-person camera-facing tutorials, anything where the audio environment is tight. Premiere's error rate is acceptable here.
Your pipeline already budgets a full proofing pass. If a human editor is reading every cue anyway, raw STT accuracy matters less.

Simple rule: the closer your footage is to Cantonese / code-switched / outdoor-noisy, the better the ROI on a Subanana round-trip. The closer it is to English / studio / clean-dialogue, the more Premiere-native is the right call.

For a full same-batch comparison across five subtitle tools — Subanana, Premiere Pro, ArcTime, Taption, and pyTranscriber — with per-language accuracy, code-switching handling, export formats, and scenario fit, see AI subtitle app comparison 2026.

Premiere Pro subtitle FAQ

Q1: Is Premiere's Speech to Text a paid add-on? No. It's bundled inside the Creative Cloud Premiere Pro subscription. There's no standalone STT purchase — to access it you're paying for full Premiere (US$22.99/mo annual-commit, US$34.49/mo month-to-month).

Q2: How well does Premiere STT handle Cantonese? It's the weakest part of the tool for HK creators. About 89% accuracy on a same-batch Cantonese test, plus phoneticized English in code-switched sentences, plus — the bigger issue — no 口語 → 書面語 conversion. If Cantonese is your main content language, run the transcription through Subanana and import the SRT back into Premiere.

Q3: Will an external SRT import out of sync with my timeline? Not if the SRT timecodes line up with where the source starts in your sequence. Subanana exports absolute timecodes measured from 0:00 of the uploaded file, so if that file starts at the beginning of your sequence, the cues land on the caption track at the right start times. If your sequence stitches multiple clips, or the uploaded file starts partway into the sequence, you may need to shift the cues.
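Caption cues are draggable clips on the Premiere timeline, but if every cue is off by the same amount it's quicker to shift the file before importing. A minimal Python sketch: offset_seconds is whatever gap you measure between the sequence and the uploaded file (a hypothetical input, not something either tool reports), and only lines containing --> are touched, so timestamp-like text in the dialogue stays put.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text: str, offset_seconds: float) -> str:
    """Shift every cue timestamp by offset_seconds (negative = earlier).

    Only timing lines (those containing "-->") are touched, and results
    below zero are clamped to 00:00:00,000.
    """
    offset_ms = round(offset_seconds * 1000)

    def shift(match: re.Match) -> str:
        h, m, s, ms = map(int, match.groups())
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        h, rest = divmod(total, 3_600_000)
        m, rest = divmod(rest, 60_000)
        s, ms = divmod(rest, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    return "".join(
        TIMESTAMP.sub(shift, line) if "-->" in line else line
        for line in srt_text.splitlines(keepends=True)
    )


print(shift_srt("1\n00:00:01,500 --> 00:00:03,000\n你好\n", 10))
```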

Q4: Can Premiere output bilingual subtitles in one pass? Not natively — its STT doesn't generate a translated second row. You'd either import two caption tracks, or generate a bilingual SRT externally and import that. Subanana's bilingual SRT export puts source and translation in the same cue, so after Import captions from file the caption track is already bilingual.
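If you already have a source SRT and a translated SRT with matching timings, stacking them yourself is a short script. A minimal sketch that assumes both files came from the same segmentation, so cue N in one lines up with cue N in the other; it refuses to merge when counts or timings disagree rather than guessing. This is a DIY illustration, not how Subanana builds its bilingual export.

```python
import re


def parse_srt(srt_text: str) -> list[tuple[str, str]]:
    """Return a list of (timing_line, text) tuples, one per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            cues.append((lines[1].strip(), "\n".join(lines[2:])))
    return cues


def merge_bilingual(source_srt: str, translation_srt: str) -> str:
    """Stack the translation under the source text in each cue."""
    src = parse_srt(source_srt)
    tr = parse_srt(translation_srt)
    if len(src) != len(tr):
        raise ValueError("cue counts differ")
    blocks = []
    for n, ((timing, text), (timing_tr, text_tr)) in enumerate(zip(src, tr), start=1):
        if timing != timing_tr:
            raise ValueError(f"cue {n}: timings differ")
        blocks.append(f"{n}\n{timing}\n{text}\n{text_tr}")
    return "\n\n".join(blocks) + "\n"


print(merge_bilingual(
    "1\n00:00:01,000 --> 00:00:02,000\n他們現在不需要這樣做\n",
    "1\n00:00:01,000 --> 00:00:02,000\nThey don't need to do it this way now.\n",
))
```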

Q5: Should I retest Premiere STT after every major update? Yes. Adobe's community threads have multiple cases of STT regressions across point releases — the 26.0.1 → 26.2 fix cycle is the most recent I've seen. If Premiere STT is in your production pipeline, keep a known-result test clip and re-run it against each update before committing to it.

Premiere Pro's native subtitle workflow is still the right choice for a lot of editing scenarios: timeline integration, text-based editing, and the one-app Adobe stack are things standalone tools can't match yet. The pressure point shows up on Cantonese, Cantonese-English code-switching, and outdoor-noise footage. For that specific subset, outsource just the subtitle-generation step: generate the SRT in Subanana, import it back into Premiere, keep cutting. You don't change editors. You don't rebuild the pipeline. You just swap in a better-suited model for one step.