AI Subtitle App Comparison 2026: 5 Tools Tested on Cantonese and Mandarin Video (Subanana, Premiere Pro, ArcTime, Taption, pyTranscriber)

2026-04-23
Kevin Wong

A YouTuber friend handed me 12 Cantonese videos last month — interviews, street vox-pops, a few tech reviews — about 9 hours total. I ran the same batch through 5 AI subtitle tools and logged each one: accuracy on clean speech, accuracy with background noise, how code-switched English came out, export formats, time from upload to a usable .srt. What follows is the mixed-results table and a per-tool breakdown — each tool gets credit for what it beats Subanana on.

How the test ran

  • Clip A (Cantonese + English mix): 18-minute tech review, English brand names (Sharp, AQUOS, XLED, OLED) peppered through 口語 Cantonese narration.
  • Clip B (outdoor noise): 12-minute street interview — traffic, ambient voices, light wind in the mic.
  • Ground truth: both clips hand-transcribed character-by-character, used as the accuracy denominator.
  • Machine: 2021 MacBook Pro (M1 Pro, 16 GB RAM), latest Chrome. Desktop tools installed on the same box.
  • Accuracy: 1 − (wrong + dropped + extra characters) / ground-truth character count. If an English token came back as a reasonable Chinese translation, it counted as acceptable rather than wrong.
  • Target: a clean .srt ready to drag into Premiere, Final Cut, or YouTube Studio.
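
For anyone who wants to reproduce the ruler, the accuracy formula above can be sketched in a few lines. This is a minimal version under my own assumptions: "wrong + dropped + extra" is taken to be the character-level edit distance, whitespace is ignored, and the "reasonable Chinese translation counts as acceptable" rule is not modelled because that was a manual judgment call. The function names are illustrative, not from any tool.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of wrong (substituted), dropped, and extra characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            dropped = previous[j] + 1      # reference character missing from output
            extra = current[j - 1] + 1     # output character not in the reference
            current.append(min(substitution, dropped, extra))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - (wrong + dropped + extra) / ground-truth character count."""
    ref = "".join(reference.split())  # assumption: whitespace is not scored
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1 - edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    truth = "佢哋而家唔使咁做"
    output = "佢地而家唔使做"  # one wrong character, one dropped character
    print(f"{char_accuracy(truth, output):.2%}")
```

Note that a very bad transcript can push this below zero, because extra characters are counted against a fixed denominator.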

One caveat up front: Cantonese subtitle accuracy is never a clean single number. 而家 vs 現在 — is that wrong, or a style choice? I used the same ruler across all 5 tools, but you should run your own 1-2 minute sample before committing. Most of these tools have free tiers that let you do exactly that.

The 5 tools in one line each

Tool | The one-line version
Subanana | Highest Cantonese + Mandarin accuracy on this batch; 口語 → 書面語 conversion; URL import from YouTube / IG / FB; browser-based, no install.
Adobe Premiere Pro | Native STT inside the timeline. If you already live in Premiere, the no-hop workflow is the draw.
ArcTime Pro 4.5 | Desktop app with precise waveform editing and .ass / .ssa styled-subtitle export. The pick for Bilibili-style animated captions.
Taption | Taiwan-origin, Traditional-Chinese first, FCPXML export, per-minute pay-as-you-go. Best match for Final Cut users and low-volume creators.
pyTranscriber | Free, open-source, can run fully offline with the Whisper backend. Python install friction is real.

Happy Scribe, Descript, Rev all exist in 2026 and are strong on English. I left them out deliberately — this post is the 5 that HK / TW creators actually load up when the source language is Cantonese or Mandarin. International comparison is a separate post.


Per-tool breakdown

1. Subanana

Ran Subanana first. Uploaded Clip A, selected Cantonese, hit generate. The transcript was ready in roughly two minutes, still in 口語 (as expected; the STT output faithfully reflects spoken Cantonese). In the editor I clicked the 口語 → 書面語 toggle once; every cue converted. That's the workflow reference I held the other four tools up against, because what Cantonese creators distributing across Traditional-Chinese (zh-Hant) markets actually need is end output in written Chinese, not a clean transcription of spoken Cantonese.

  • Cantonese accuracy: Clip A ~96%, Clip B ~93%. Code-switched English (OLED, XLED, AQUOS) stayed as English characters rather than getting phoneticized into near-sound Chinese.
  • Mandarin accuracy: Clean recording ~97%, noisy recording ~95%. Two reasons Mandarin runs higher than Cantonese — far more training data in the underlying models, and less accent variance across speakers. If your channel is Taiwan Mandarin or mainland 普通話, Subanana's Mandarin number beats most English-first tools.
  • Input sources: File upload, or paste a public YouTube / Instagram / Facebook URL — Subanana fetches and transcribes without a local download. Useful when the source is already live on a social platform and you don't want to re-download and re-upload.
  • The Cantonese-specific feature: one-click 口語 → 書面語 conversion. 佢哋而家唔使咁做 converts to 他們現在不需要這樣做 in one toggle. For HK creators distributing to YouTube long-form, TW audiences, or any pan-zh-Hant market, this is the difference between usable and unusable subtitles. No other tool in this list has it.
  • Export: SRT, VTT, TXT, DOCX, XLSX, Markdown — six standalone files or a ZIP of all six. Bilingual SRT (source + translation in one file) is supported. So is one-click video export with burned-in subtitles — single-language or bilingual burn-in, no round-trip through Premiere or Final Cut if all you need is a finished subtitled .mp4. Not supported: .ass / .ssa styled subtitles, FCPXML sequences.
  • STT architecture: continuously benchmarked across multiple STT models, routed per source language. You don't get locked into one vendor — the best-benched model for your language handles your file.
  • Quality-assurance stack (layered on top of the STT routing — this is where Subanana puts its weight):
    • Hallucination detection with automatic model substitution. When a segment's output looks off relative to the audio — the classic "model is confidently wrong" failure — the system reruns that segment on a different evaluated model. Users don't see a retry state; they get the cleaner result.
    • LLM-assisted text proofread. Inside the editor, an LLM pass flags likely misheard words and same-sounding wrong characters (e.g. 在見 → 再見) and proposes corrections. You approve each suggestion before anything changes; nothing is auto-applied silently. Scope is text-level substitution errors only: it doesn't detect dropped characters, and it doesn't touch timecodes.
    • CPS (characters-per-second) flagging. Deterministic rule check that marks cues packing too many characters into too little time — the "viewer can't finish reading" case — or the opposite, cues sitting static too long. Fix those lines first.
    • Glossary and context reference — coming. Not shipped yet. Will let you supply product names, technical terms, and reference text to bias the STT toward your vocabulary.
  • Pricing: Free tier caps each file at 15 minutes / 3 GB / 720p and blocks caption download (preview-only). Paid plans start at US$9/month (~HK$68/month, annual billing) — file cap rises to 3 hours / 15 GB, caption download unlocked, 4K export.
  • What competitors beat Subanana on: .ass / .ssa styled subtitles — ArcTime wins. FCPXML sequence import into Final Cut — Taption and ArcTime win. Fully offline on-device STT — pyTranscriber wins (Whisper backend). Per-minute pricing — ArcTime credits and Taption overage both beat a subscription for low-volume users.
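
The CPS flagging described in the quality-assurance stack above is a deterministic check, so it is easy to approximate on any exported .srt. Subanana's actual rule and thresholds aren't published in this post; what follows is a minimal sketch under my own assumptions. The names (Cue, parse_srt, flag_cps) and both thresholds are placeholders to tune for your own audience.

```python
import re
from dataclasses import dataclass

TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(timecode: str) -> float:
    match = TIMECODE.fullmatch(timecode.strip())
    if match is None:
        raise ValueError(f"bad timecode: {timecode!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> list[Cue]:
    """Parse the common SRT layout: index line, timing line, text lines."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed or empty blocks
        start, end = lines[1].split("-->")
        cues.append(Cue(int(lines[0].strip()), to_seconds(start),
                        to_seconds(end), "".join(lines[2:])))
    return cues


def flag_cps(cues: list[Cue], max_cps: float = 9.0,
             min_cps: float = 1.0) -> list[tuple[int, str]]:
    """Return (cue index, reason) pairs; both thresholds are illustrative."""
    flagged = []
    for cue in cues:
        duration = cue.end - cue.start
        if duration <= 0:
            flagged.append((cue.index, "bad duration"))
            continue
        chars = len("".join(cue.text.split()))  # whitespace is not counted
        cps = chars / duration
        if cps > max_cps:
            flagged.append((cue.index, "too fast"))
        elif cps < min_cps:
            flagged.append((cue.index, "too slow"))
    return flagged
```

Run flag_cps(parse_srt(open("clip_a.srt", encoding="utf-8").read())) on an export and fix the flagged cues first; the file name here is hypothetical.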

Best fit: Cantonese or Mandarin creators who want high accuracy, 口語 → 書面語 conversion, and a clean SRT or DOCX handoff into whatever NLE they cut in.

2. Adobe Premiere Pro (Speech to Text)

Dropped Clip A on a fresh Premiere timeline, opened the Text panel, picked Cantonese from the language list, clicked Transcribe. Fast — the transcript came back in under a minute and every cue was already placed on the timeline. But two specific things happened every time I ran a Cantonese-plus-English clip: the Cantonese accuracy sat below Subanana's, and the English brand names (Sharp, AQUOS, XLED, OLED) came back phoneticized into near-sound Chinese rather than kept as Latin characters. For a Premiere user cutting clean English, a non-issue. For HK creators shipping Cantonese-with-English brand mentions, it's a review pass per minute of footage.

  • Cantonese accuracy: Clip A ~89%, Clip B ~76%. Code-switching is a weak spot: English brand names routinely come back phoneticized (Sharp → Shop, OLED → 歐利).
  • The real win: you never leave the timeline. Transcript-based editing, cue-to-clip alignment, Essential Graphics styling, caption burn-in or sidecar .srt export — all inside Premiere. Bundled free with the Creative Cloud subscription, nothing extra to buy. For an editor already mid-cut in Premiere, the no-hop workflow is the draw, and it's a real advantage.
  • Export: SRT, VTT, embedded caption track, or burn-in to video.
  • Pricing: US$22.99/mo on the annual commitment, US$34.49/mo month-to-month. STT isn't sold standalone — the full Premiere sub is the only entry point.
  • What Premiere beats Subanana on: native NLE workflow. Subanana outputs an SRT; Premiere keeps subtitles in the same app where you're cutting, styling, and exporting. If you're a heavy Premiere user, that hop genuinely costs time — even if Subanana turns the SRT around in a minute, you still have to import it back.
  • Where it slips: Cantonese accuracy itself sits below the top of this list. No 口語 → 書面語 conversion: Premiere transcribes Cantonese literally, which is fine for HK-only IG Reels but unusable for YouTube long-form or TW distribution. Only 18 supported languages (Adobe's March 27, 2026 count). Cloud-dependent. And there's version-drift risk: Premiere 26.0.1 shipped with STT broken ("in 2016 building a mobile app" came back as "16 billion moves to cost a minimum of"); 26.2 fixed it a few weeks later. If STT is in your production path, keep a known-result clip and test each point release.
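
One cheap way to run the known-result-clip test mentioned above is to check whether the English brand names survive as Latin characters in the new transcript. This is a sketch, not anything Premiere ships: the function name and the term list are my own, and it only catches terms that disappear entirely, not ones that come back misspelled in Latin letters.

```python
import re


def missing_latin_terms(transcript: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that never appear as a Latin-script token.

    Matching is case-insensitive. A term that was phoneticized into Chinese
    characters leaves no Latin token behind, so it shows up in the result.
    """
    tokens = {
        token.lower()
        for token in re.findall(r"[A-Za-z][A-Za-z0-9]*", transcript)
    }
    return [term for term in expected_terms if term.lower() not in tokens]


if __name__ == "__main__":
    brand_names = ["Sharp", "AQUOS", "XLED", "OLED"]
    transcript = "呢部Sharp AQUOS用歐利屏"  # made-up sample line
    print(missing_latin_terms(transcript, brand_names))
```

Keep the transcript from a release you trust, rerun the same clip after each point release, and compare the two results before the new version touches a real project.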

Best fit: Premiere power users working in English-dominant or clean-Mandarin material, where the accuracy gap is narrow and the timeline-integration win is large. Full Premiere subtitle workflow: Premiere Pro subtitles full workflow.

3. ArcTime Pro 4.5

Installing ArcTime Pro on the M1 Mac took two steps: a JRE install first, then the ArcTime binary. The app runs, but cursor hover during timeline playback stutters visibly in a way no native app does; the macOS-freezing reports are consistent with what I felt on the test machine. Once past install, STT runs on the credit top-up model. Clip A in Cantonese used roughly 550 credits, and the transcript landed in the expected Cantonese-accuracy band (a half-step behind Subanana, clearly above Premiere). The real reason someone picks ArcTime isn't the transcript, though; it's the export dialog, where .ass and .ssa styled subtitle options sit right next to SRT, and you can configure per-character animated effects no other tool in this list touches.

  • Cantonese accuracy: Clip A ~90%, Clip B ~80%. Middle of the pack — behind Subanana, ahead of Premiere and Taption.
  • The real win: .ass / .ssa styled subtitle export plus fine-grained waveform timeline editing. If you need per-character color fades, speaker-position-tracked captions, or Bilibili-style animated 彈幕 overlay effects, ArcTime is the only tool in this list that exports them natively. FCPXML and Premiere XML sequences are also in the export list.
  • Export: SRT, ASS, SSA, Premiere / Final Cut XML sequences.
  • Pricing: Base desktop app free on Windows / Mac / Linux — core editing doesn't cost credits. STT uses a credit top-up — Cantonese / Mandarin / English run at ~30 credits/min, which works out to ~US$0.042/min (~US$2.50/hr). No subscription. Sign-up grants 2000 free STT credits.
  • What ArcTime beats Subanana on: .ass / .ssa styled subtitle export (Subanana's six formats don't include styled subs — if you need animated effects, ArcTime or nothing). Per-minute credit pricing — for extremely low-volume users, credit top-up is cheaper than a subscription. Configurable sentence-segmentation rules — power users can tune the auto-break logic, which Subanana doesn't expose.
  • Where it slips: Java desktop app, no browser version — extra install friction. UI is mostly Simplified Chinese, which bothers some HK / TW users. And there are multiple documented reports of macOS freezing on mouse movement in the editor (a few reviewer sites openly advise Mac users to skip ArcTime Pro) — HK and TW Mac users should run a 15-minute test on their own box before committing.

Best fit: Editors producing styled subtitle effects, 彈幕 overlays, or Bilibili-style animated captions, where styled .ass output is non-negotiable.

4. Taption

Taption was the easiest start of the five: paste the YouTube URL of Clip A into the upload field, pick Cantonese as the source language, hit process. No desktop install, no downloading the clip to re-upload. Transcript came back in the expected couple of minutes. Two observations from the output: the Cantonese accuracy landed well below Subanana's — especially on the code-switched mixed content — and 口語 stayed as spoken Cantonese, same as every other tool except Subanana. Taption's actual pay-off is downstream of the transcript: FCPXML export in a single click produces a Final Cut timeline-ready file that drops into the Text & Titles panel directly, removing a manual conversion step Subanana currently asks you to do yourself.

  • Cantonese accuracy: Clip A ~85%, Clip B ~73%. Capterra has a 4-star review from March 2025 flagging Cantonese at roughly 80% and "needs a lot of cleanup" — consistent with what I saw on this batch.
  • Mandarin accuracy: Taption's public claim is >90% general / >95% best-supported languages — they don't break out a Mandarin-specific number. My estimate on this batch sits in the ~90%+ band, which is strong for general use but still below Subanana's Mandarin number.
  • The real win: FCPXML export (added in 2026), per-minute pay-as-you-go pricing, and a clean embed path for bilingual subtitle video. FCPXML lets Final Cut Pro users drag a Taption subtitle track straight into the timeline — no SRT-to-XML conversion step. Per-minute pricing is the friendlier option for creators who hit subtitle work in bursts rather than continuously.
  • Export: SRT, VTT, TXT, multi-language Excel, FCPXML (2026), embedded-subtitle video.
  • Pricing: Free tier for videos under 1 minute plus a small trial allowance. Paid Premium US$10.80/mo (annual) or US$12/mo (month-to-month), 120 minutes included per month, overage US$6/hr (~US$0.10/min). A per-minute top-up tier also exists.
  • What Taption beats Subanana on: FCPXML export — the one-step Final Cut handoff Subanana doesn't do. Per-minute overage pricing — if your usage is bursty, Taption's per-minute overage runs lighter than a Subanana subscription. Note: bilingual SRT and embedded-bilingual-video output are parity features — both Taption and Subanana ship them. I want to call this out because earlier drafts of this post treated those as Taption-only wins; they aren't.
  • Where it slips: Cantonese accuracy sits noticeably below Subanana on this batch — especially on code-switched mixed content and noisy footage. No 口語 → 書面語 conversion. Users report Safari compatibility issues and an aggressive auto-renew flow.

Best fit: Final Cut Pro editors who need FCPXML, and low-volume creators who prefer per-minute overage to a full subscription.

5. pyTranscriber v2.1

I ran pyTranscriber in Whisper-backend mode — the only configuration that actually delivers on the "offline" promise, because the default Google Cloud Speech backend uploads your audio to Google just like YouTube auto-captions do. Install is real friction: Python environment, the Whisper model download, FFmpeg, plus the manual backend switch inside the app. Once running, it works entirely on your machine — no audio leaves, no request gets logged anywhere external. Accuracy on Clip A sat at roughly 90% (comparable to ArcTime), and the transcript arrives as plain SRT plus TXT. What you don't get: any in-app editor, speaker labels, CPS checks, translation, or team collaboration. For privacy-critical recordings (legal, medical, security-sensitive), the offline guarantee is the only reason to go through the setup. For everyone else, the install friction is real.

  • Cantonese accuracy: Clip A ~90%, Clip B ~74%. Relies on publicly available models — no smart retention of English brand names in mixed audio.
  • The real win: free, open-source, GPL-3.0. No account, no subscription, no usage cap. The Whisper local backend (added in v1.9, CUDA-accelerated in v2.1) runs fully on your machine.
  • Important caveat on "offline": pyTranscriber ships with two backends. The default Google Cloud Speech backend uploads your audio to Google — same privacy model as YouTube auto-captions. Only the Whisper local backend is truly offline, and switching requires a manual backend selection plus downloading the Whisper model files. If your use case is legal-interview recordings, medical transcripts, or anything where "audio must not leave this machine" is a hard requirement, confirm you're on the Whisper backend before uploading anything. The official FAQ documents this, but it's easy to miss.
  • Export: SRT, TXT.
  • Pricing: US$0.
  • What pyTranscriber beats Subanana on: truly offline processing (via Whisper backend) and zero cost. Subanana runs entirely in the cloud — a paid plan is required to download the caption file. pyTranscriber on Whisper gives you local-only processing with no content caps, which is a hard privacy ceiling no cloud tool can match.
  • Where it slips: Python install plus Whisper model download is a real wall for non-technical users. No in-app editor — timing adjustments happen in a separate subtitle editor you install yourself. No translation, no speaker diarization, no team collaboration. English-first UI. macOS installs hit recurring friction on recent OS versions.
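
Since the Whisper path hands you plain SRT with no in-app editor, it helps to know how little sits between raw Whisper output and an .srt file. The sketch below is not pyTranscriber's code. It assumes segments shaped like the ones the openai-whisper package returns from transcribe() (dicts with "start", "end", and "text"), and the helper names are mine.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timecode form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments as numbered SRT cues."""
    blocks = []
    for number, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{number}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hand-written segments standing in for real model output.
    # With openai-whisper installed, the equivalent would be roughly:
    #   import whisper
    #   segments = whisper.load_model("small").transcribe("clip.wav")["segments"]
    demo = [
        {"start": 0.0, "end": 2.5, "text": " 大家好"},
        {"start": 2.5, "end": 5.0, "text": " 今日同大家開箱"},
    ]
    print(segments_to_srt(demo))
```

Everything here runs locally, which is the point of the Whisper backend: the only network traffic is the one-time model download.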

Best fit: Technically comfortable users with strict privacy requirements and a zero budget — academic researchers, solo legal work, security-sensitive personal recordings.


Comparison table

Fourteen axes, mixed results across all five tools. ✅ is the clear winner on that row, ⭕ is middle-of-the-pack or partial, ❌ is a clear miss.

Dimension | Subanana | Premiere Pro | ArcTime | Taption | pyTranscriber
Input source | ✅ File + YouTube / IG / FB URL | ❌ File (into Premiere) | ❌ File (desktop app) | ⭕ File + YouTube URL | ❌ File (local path)
Cantonese accuracy (mixed code) | ✅ 96% | ❌ 89% | ⭕ 90% | ❌ 85% | ⭕ 90%
Cantonese accuracy (outdoor noise) | ✅ 93% | ❌ 76% | ⭕ 80% | ❌ 73% | ❌ 74%
Mandarin accuracy | ✅ ~97% | ❌ mid-low | ⭕ mid-high | ⭕ ~90%+ (estimate) | ⭕ mid
口語 → 書面語 conversion | ✅ one-click | | | |
Bilingual SRT export | | | | |
Burned-in subtitle video, one click (single or bilingual) | | | | |
FCPXML export for Final Cut | | | | |
.ass / .ssa styled subtitles | | | | |
In-NLE native workflow (no export / import hop) | | | | |
Browser-ready, no install | | | | |
Fully offline on-device STT | | | | | ✅ (Whisper backend only)
Per-minute pricing flexibility | ❌ subscription | ❌ subscription | ✅ credits | ✅ per-minute overage | n/a (free)
Price floor | US$9/mo (annual) | US$22.99/mo (annual) | ~US$0.042/min (~US$2.50/hr) | US$10.80/mo (annual) | Free

Legend: ✅ clear winner on that row · ⭕ mid-pack · ❌ clear miss.
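
The price-floor row hides a break-even question: at how many minutes a month does per-minute pricing stop beating a subscription? Here is a rough sketch using only the list prices quoted in this post. It makes several assumptions of mine: the paid Subanana and Premiere plans are treated as flat fees with no per-minute charge, ArcTime's 2000 free sign-up credits are ignored, and Taption's separate top-up tier is left out. pyTranscriber is omitted because it is always US$0.

```python
ARCTIME_USD_PER_MIN = 0.042         # ~30 credits/min, per the ArcTime pricing bullet
TAPTION_BASE_USD = 10.80            # Premium, annual billing
TAPTION_INCLUDED_MIN = 120          # minutes included per month
TAPTION_OVERAGE_USD_PER_MIN = 0.10  # US$6/hr overage
SUBANANA_FLAT_USD = 9.00            # assumption: flat fee, no per-minute charge
PREMIERE_FLAT_USD = 22.99           # annual commitment, assumed flat


def monthly_costs(minutes: float) -> dict[str, float]:
    """Estimated monthly spend in USD for a given number of STT minutes."""
    taption_overage = max(0.0, minutes - TAPTION_INCLUDED_MIN)
    return {
        "Subanana": SUBANANA_FLAT_USD,
        "Premiere Pro": PREMIERE_FLAT_USD,
        "ArcTime": minutes * ARCTIME_USD_PER_MIN,
        "Taption": TAPTION_BASE_USD + taption_overage * TAPTION_OVERAGE_USD_PER_MIN,
    }


def cheapest(minutes: float) -> str:
    """Name of the lowest-cost paid option at this monthly volume."""
    costs = monthly_costs(minutes)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for minutes in (30, 120, 300, 600):
        costs = monthly_costs(minutes)
        summary = ", ".join(f"{tool} ${usd:.2f}" for tool, usd in costs.items())
        print(f"{minutes:>4} min -> {cheapest(minutes)} ({summary})")
```

Under these assumptions ArcTime credits stay cheapest up to roughly 214 minutes a month (9 / 0.042), after which the US$9 subscription wins on price alone. Accuracy and features are a separate question.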


Which one should you pick? A decision tree

  • Cantonese or Mandarin content, high accuracy, 口語 → 書面語 needed: Subanana. This is the scenario the product is built for. More detail on the AI subtitle tool page.
  • Already cutting in Premiere and the no-hop workflow matters more than 5-point accuracy gaps: Premiere native. Especially if your source is English or clean Mandarin.
  • Styled .ass subtitles, 彈幕 overlays, Bilibili-style animated captions: ArcTime. The only tool in this list that does styled exports natively. Run a 15-minute macOS test first.
  • Final Cut Pro user who wants FCPXML straight into the timeline, or a creator with bursty usage who prefers per-minute pricing over a full subscription: Taption. Bilingual SRT and bilingual burned-in video are parity features (Subanana ships both too), so they aren't the reason to pick Taption. FCPXML and per-minute overage are.
  • Privacy-critical recording that must not leave your machine, zero budget, comfortable installing Python and Whisper models: pyTranscriber. On the Whisper backend specifically; the default Google backend is cloud-based, not offline.
  • Cantonese + Mandarin + English mixed content, one-click translation across targets, plus live caption translation for an audience-facing link: Subanana. The other scenario the product is built for.

Pick the one dimension you care about most. One of those six scenarios will point you to your tool cleanly.


If you end up on Subanana: the short version of what you get

  • Export: SRT, VTT, TXT, DOCX, XLSX, Markdown — six standalone files, plus ZIP. No .ass, no FCPXML.
  • Free tier: 15-minute file cap, 3 GB, 720p export cap, caption download blocked. Enough to decide whether the accuracy works for you, not enough to run production.
  • Paid entry: US$9/month (~HK$68/month, annual billing) lifts the cap to 3 hours / 15 GB, unlocks caption download, enables 4K export. Full tiers on the pricing page.

FAQ

Q1. Heavy Cantonese accent, lots of HK slang — which tool handles it best?

Subanana landed highest on this batch (Clip A ~96%). ArcTime and pyTranscriber come next around 90%. Premiere and Taption drop into the 85-89% band on Cantonese, and both fall further on noisy outdoor audio.

Q2. I don't want to pay. What's the one-time free option?

pyTranscriber is fully free but requires Python plus the Whisper model download. Subanana and Taption both have free tiers for short clips. ArcTime's desktop app is free with some STT credits included. Premiere needs a Creative Cloud subscription — no free path.

Q3. Does Subanana export .ass styled subtitles?

No. Subanana's six formats are SRT, VTT, TXT, DOCX, XLSX, Markdown. If .ass styling is non-negotiable, ArcTime is the right tool in this list.

Q4. Can I trust these accuracy numbers?

Don't fully trust any one source — mine, the vendors', or a reviewer's. The only number that matters is the one you get on your own 1-2 minute sample. What this post gives you is a consistent ruler across five tools on the same batch of footage. Use it to shortlist, then test.

Q5. Where do I start if I pick Subanana?

Sign up for the free tier, run one or two short clips, check the accuracy on your own footage. If it works, the US$9/month annual plan unlocks 3-hour files, caption download, and 4K export.


Picking an AI subtitle tool in 2026 isn't a "which one is best" question — each of these five wins on at least one dimension. It's a "which one matches my language, my workflow, and my budget" question. Same batch, same ruler, five different right answers depending on what you need.

If Cantonese or Mandarin accuracy plus 口語 → 書面語 conversion plus a clean SRT handoff into your editor is the combination you need, the Subanana free tier is the cheapest way to find out whether the numbers above hold on your own footage.
