How to Transcribe an Interview: Speaker-Labelled, Quotable Transcripts
The core of interview transcription comes down to three things: the text has to be accurate, it has to show who said which line, and it has to be quotable as-is. Whether you're doing qualitative research, UX interviews, or journalism, or you're a student handing in coursework, you don't want a rough approximation. You want text you can code line by line and quote straight into a paper or a story.
There are three main approaches: typing it out by hand, free auto-caption tools, and AI speech-to-text. This guide explains the trade-offs, then shows how I'd use Subanana's transcript mode to turn an interview recording into a transcript with speaker labels, punctuation, and paragraph breaks — so the manual cleanup afterwards is as small as possible. The short version up front: AI transcription does about nine-tenths of the grind, and you do one final proofreading pass.

What's the difference between an interview transcript and subtitles?
A lot of people reach for a tool the first time and treat "subtitles" and "a transcript" as the same thing — then end up with a file they can't use. They're two different deliverables:
- Subtitles are made to be read on screen over a video: cut into short timed lines, often with little or no punctuation, and exported as SRT or VTT.
- A transcript is made to be read by a person: it needs punctuation, paragraphs, and speaker labels so you can read it top to bottom, annotate it, and pull quotes.
An interview transcript is the second kind. So if you pick the wrong mode in a tool — running an interview through a subtitle workflow — you get a wall of short, timestamped, unpunctuated fragments that are actually harder to work with. That's why this guide keeps stressing: choose transcript mode.
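To make the difference concrete, here is a minimal Python sketch built on a made-up three-cue SRT sample. It strips the cue numbers and timing lines and joins what is left. The sample text and the function name `srt_to_plain_text` are illustrative assumptions, not anything Subanana produces.

```python
import re

# A tiny made-up SRT sample: numbered cues, timing lines, short caption fragments.
SRT_SAMPLE = """1
00:00:01,000 --> 00:00:03,200
so when did you first

2
00:00:03,200 --> 00:00:05,000
start using the app

3
00:00:05,400 --> 00:00:07,900
about two years ago I think
"""

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,200".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_plain_text(srt: str) -> str:
    """Drop blank lines, cue numbers, and timing lines, then join the caption text.

    Limitation: a caption line made only of digits would be dropped as well.
    """
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


flattened = srt_to_plain_text(SRT_SAMPLE)
print(flattened)
# prints: so when did you first start using the app about two years ago I think
```

Even with the timings removed, the result has no punctuation, no paragraphs, and no speaker labels, so you would still have to rebuild all three by hand.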
The trade-offs between the three approaches
Approach 1: Manual transcription
The most traditional method, and the one with the highest accuracy ceiling — you listen and type, line by line, yourself.
- Upside: you control every word. Tone, pauses, overlapping speech — you can annotate all of it exactly the way your research needs.
- Limit: it's extremely slow. A common industry rule of thumb is that one hour of audio takes four to six hours to type up, and it's slower again with multiple speakers, strong accents, or poor recording quality. For a reporter on a deadline or a researcher running several interviews at once, that time cost is often more than the budget allows.
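As a rough worked example of that rule of thumb, the sketch below multiplies total audio length by the four-to-six-hour range. The interview lengths are invented for illustration, and real typing time will vary with audio quality and the number of speakers.

```python
from typing import Tuple


def manual_hours(audio_hours: float, low: float = 4.0, high: float = 6.0) -> Tuple[float, float]:
    """Estimate manual typing time as a (best case, worst case) range in hours."""
    return (audio_hours * low, audio_hours * high)


# Hypothetical project: three interviews of 60, 90, and 45 minutes.
interviews = [1.0, 1.5, 0.75]
total_audio = sum(interviews)  # 3.25 hours of audio

best, worst = manual_hours(total_audio)
print(f"{total_audio} h of audio -> {best} to {worst} h of typing")
# prints: 3.25 h of audio -> 13.0 to 19.5 h of typing
```

Three fairly ordinary interviews already come to roughly two full working days of typing, before any proofreading.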
Approach 2: Free auto-caption tools
Plenty of free tools — the auto-captions on video platforms, online transcribe-this sites — will generate text quickly.
- Upside: fast, free, low barrier to entry.
- Limit: on accented speech and less-common languages the error rate is noticeably higher; most don't separate speakers, so the whole interview runs together and you can't tell who said which line; and they usually don't add punctuation or paragraphs, so it reads like a wall of text. Fine for a short English clip — but for an interview you intend to quote, you'll often spend a lot of time restructuring it afterwards.
Approach 3: AI speech-to-text tools
If what you want is "the transcript is readable and quotable the moment I get it," AI transcription is the most practical middle ground right now. The tool transcribes the audio with a speech-recognition model, adds punctuation, paragraphs, and speaker identification, and then lets you proofread the result in an editor.
- Upside: much faster than typing by hand; more accurate than free tools, and it separates speakers and adds punctuation and paragraphs automatically.
- Trade-off (worth being clear about): AI transcription does not replace the final proofread. Before you quote anyone verbatim, you should still do one human pass — checking names, proper nouns, and key numbers. High accuracy isn't zero errors, and the more weight a quote carries, the more it's worth checking.
The next section shows how I'd take the third path with Subanana.
How do you turn interview audio into a transcript with Subanana?
I run Subanana, so I'll use it to walk through the whole flow. Where it earns its place for interview transcription is multilingual accuracy, speaker identification (diarization), automatic filler-word removal, and automatic punctuation and paragraphing.
The critical first step is picking the right mode. Subanana has a subtitle mode, a transcript mode, and a meeting mode — for an interview transcript you want transcript mode, because that's what adds punctuation, breaks the text into paragraphs by meaning, and produces something readable. Subtitle mode only gives you short timed caption lines. The flow has four steps:
- Import the recording. Upload the interview's audio or video file (.mp4 / .mov / .webm / .ogg), or paste a public YouTube / Instagram / Facebook link to import it directly. If the interview is behind a private or access-restricted link, use file upload instead.
- Choose transcript mode and set the source language. Go into transcript mode and pick the language of the recording. Subanana covers 80+ languages, so most interview audio is in range. Set the number of speakers to auto-detect (or type the count in manually), and turn on automatic punctuation and paragraphing.
- Proofread and label speakers. When transcription finishes you land in the editor. The system splits the different voices into Speaker 1, Speaker 2, and so on, removes filler words ("um," "you know"), and tidies the text. From here you can:
  - Rename speakers: change Speaker 1 to "Interviewer" and Speaker 2 to "Participant A," and the whole transcript updates in sync, which is handy for quoting and annotating line by line later.
  - Fix misheard words: click any word and edit it directly. For the words most likely to be wrong (people's names, organisation names, technical terms), set up a Glossary before you transcribe, and the system will prefer your spellings while transcribing.
  - Chat with the transcript: inside the editor you can ask the AI directly, for example "where does Participant A mention X?" or "pull out the three key arguments," which saves a lot of time on a long interview.
- Export. Pick the format you need. For transcripts the most common choices are DOCX (Word, ready to edit) or TXT (to drop into Obsidian, Notion, or another notes tool); for citation, coding, or annotation, XLSX lays out timecodes, speaker, and text as a table. VTT, SRT, and Markdown are supported too.
Once you've proofread and exported, the interview transcript drops straight into your paper, your article, or your analysis. To understand how the modes are designed, see "AI subtitling and transcription" and "AI meeting transcription."
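If you export the table layout for coding or citation, a few lines of Python are enough to group passages by speaker. The sketch below is hedged on several assumptions: the rows are invented sample data, the column order (timecode, speaker, text) is a guess based on the description above rather than the documented export layout, and the rows are kept in memory because reading an actual XLSX file would need a spreadsheet library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented sample rows in an assumed (timecode, speaker, text) layout.
rows: List[Tuple[str, str, str]] = [
    ("00:00:04", "Interviewer", "How did you first hear about the service?"),
    ("00:00:09", "Participant A", "A colleague showed it to me during a team meeting."),
    ("00:00:17", "Interviewer", "What made you keep using it?"),
    ("00:00:21", "Participant A", "Mostly that it saved me time on the weekly report."),
]


def quotes_by_speaker(rows: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group transcript rows by speaker, keeping the timecode next to each quote."""
    grouped = defaultdict(list)
    for timecode, speaker, text in rows:
        grouped[speaker].append(f"[{timecode}] {text}")
    return dict(grouped)


quotes = quotes_by_speaker(rows)
for line in quotes["Participant A"]:
    print(line)
# prints:
# [00:00:09] A colleague showed it to me during a team meeting.
# [00:00:21] Mostly that it saved me time on the weekly report.
```

Keeping the timecode attached to each quote makes it easy to jump back to the recording and check the wording before you cite it.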
What if the interview is multilingual or accented?
This is exactly where general-purpose speech tools tend to be weakest — accented speech and languages outside the usual English-and-a-handful set. Two things are worth checking when you pick a tool:
- Accuracy across languages: Subanana continuously benchmarks the available speech-recognition models and picks the best-performing one for each source language, rather than locking into a single provider. And if a transcription goes wrong, it automatically re-runs on a different model — a re-run that doesn't cost you any extra minutes.
- Translating the transcript: an interview might be recorded in one language while you need the transcript in another. Transcript mode supports a single translation target, so you can transcribe in the source language and translate into one other language in the same pass.
One boundary worth flagging: mid-sentence code-switching, where a speaker flips between two languages within a single sentence and the tool detects the switch in real time, is a strength of Subanana's live caption feature, not of transcript mode. For interview transcription, what you're leaning on is multilingual accuracy and speaker identification, not real-time in-sentence language switching. If you need live captions at an actual event, see "AI real-time transcription."
Interview transcription FAQ
Can the free tier produce a complete interview transcript?
You can run a recording and preview the result, but exporting is a paid step. The free tier doesn't support subtitle or transcript file downloads, and you can't select and copy the text in the editor either. The only output is a watermarked 720p video covering the first 5 minutes, and files are limited to 3 GB each. To export usable transcript files (DOCX / TXT / XLSX), you need a paid plan, which also raises the per-file limit to 15 GB / 3 hours. See pricing for details.
Can it tell apart who said what in a multi-speaker interview?
Yes. Transcript mode supports speaker identification: it automatically separates Speaker 1, Speaker 2, and so on, and you can rename them to the actual roles (Interviewer, Participant A) in the editor, with the whole transcript updating in sync.
Can I quote an AI transcript directly?
I'd do one human proofreading pass first. AI transcription handles the vast majority of the text and the paragraphing, but the places where a wrong word actually matters (names, proper nouns, key numbers) are worth checking line by line, especially in passages where you'll quote a participant verbatim. See "3 tips for AI transcription" for how to proofread more efficiently.
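One way to focus that proofreading pass is to flag the lines most likely to contain a name or a number. The sketch below is a rough heuristic of my own, not a Subanana feature: it assumes a plain-text transcript with one "Speaker: text" turn per line, and it treats any digit or any capitalised word after the first word as worth a second look. The sample transcript is invented.

```python
import re
from typing import List, Tuple

HAS_NUMBER = re.compile(r"\d")
# A capitalised word preceded by a lowercase letter, comma, or semicolon plus a space:
# a crude proxy for names and proper nouns that are not at the start of a sentence.
MID_LINE_CAPITAL = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")

# Invented sample transcript, one speaker turn per line.
SAMPLE = "\n".join([
    "Interviewer: How long have you worked there?",
    "Participant A: I joined Acme in 2019, so about five years.",
    "Participant A: It was fine, nothing special really.",
    "Participant A: My manager, Dana, pushed for the change.",
])


def lines_to_check(transcript: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that contain a digit or a likely proper noun."""
    flagged = []
    for number, line in enumerate(transcript.splitlines(), start=1):
        # Ignore the speaker label so "Participant A" does not trigger a match.
        text = line.split(":", 1)[-1].strip()
        if HAS_NUMBER.search(text) or MID_LINE_CAPITAL.search(text):
            flagged.append((number, line))
    return flagged


for number, line in lines_to_check(SAMPLE):
    print(number, line)
# prints:
# 2 Participant A: I joined Acme in 2019, so about five years.
# 4 Participant A: My manager, Dana, pushed for the change.
```

A heuristic like this will miss names at the start of a sentence and numbers written out as words, so it narrows the pass rather than replacing it.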
Does a long interview recording (one or two hours) work?
Yes. Paid plans take up to 15 GB / 3 hours per file, which covers most interview recordings. For a long interview I'd use the editor's AI chat to find the key passages first, then closely proofread the parts I intend to quote.
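If you batch-process recordings, a quick check before uploading can save a failed import. This is a minimal sketch under stated assumptions: it treats 15 GB as decimal gigabytes, assumes the size and duration limits both apply to every file, and expects you to supply the file size and duration yourself. The function name `fits_paid_limit` is hypothetical.

```python
# Paid-plan per-file limits as quoted above.
PAID_MAX_BYTES = 15 * 10**9      # 15 GB, assuming decimal gigabytes
PAID_MAX_SECONDS = 3 * 60 * 60   # 3 hours


def fits_paid_limit(size_bytes: int, duration_seconds: float) -> bool:
    """Return True if a recording is within both the size and the duration limit."""
    return size_bytes <= PAID_MAX_BYTES and duration_seconds <= PAID_MAX_SECONDS


# A 2 GB, two-hour recording fits; a 2 GB, four-hour recording does not.
print(fits_paid_limit(2 * 10**9, 2 * 3600))  # prints: True
print(fits_paid_limit(2 * 10**9, 4 * 3600))  # prints: False
```

A recording that runs over three hours would need to be split into parts before upload.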