Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Pricing

Pay per event

Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. About $8 per 1,000 transcripts.

Developer

Scrapers Delight

Maintained by Community

Last modified: 13 days ago

🎞️ Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Pull the subtitles/captions from any Internet Archive film, TV recording, or audio item — no login, no API key, no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs; this actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT — one row per language. Point it at item URLs or an archive.org search query.

Because the captions already exist (uploaded subtitles or archive's own ASR), there's no speech-to-text compute — it's fast and cheap.

What does it do?

For each archive.org item you give it (by URL/identifier or discovered via search), it returns:

📝 Full transcript (clean plain text) — always included
⏲️ Timestamped cues — {index, start_ms, end_ms, start, end, text}
🎬 Normalized SRT / VTT — re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
🌍 One row per caption file/language — grab English ASR plus every uploaded translation
🏷️ Item metadata — title, language, mediatype, collections, item URL
🔎 Search discovery — any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
🚩 Honest flags — items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"

No ASR, no API key — it reads the caption files archive.org already publishes.

What data does it extract?

One dataset record per caption file (per language):

🆔 identifier, 🏷️ item_title, 🔗 item_url, 🌍 language, 📦 mediatype, 🗂️ collection[], ⬇️ downloads
📄 caption_file_name, caption_format (SubRip / Web Video Text Tracks), 🌍 caption_lang_code, 🤖 is_autogenerated (.asr = archive's English ASR)
🔗 caption_url, 📏 caption_size_bytes
📝 transcript, ⏲️ segments[], 🎬 srt, vtt, 🔢 cue_count
🚩 restricted, note, ✨ is_new (monitor), 🕒 scraped_at

Example output

{
  "identifier": "Doctorin1946",
  "item_title": "Doctor in Industry (Part I)",
  "item_url": "https://archive.org/details/Doctorin1946",
  "mediatype": "movies",
  "caption_file_name": "Doctorin1946.asr.srt",
  "caption_format": "SubRip",
  "caption_lang_code": "en",
  "is_autogenerated": true,
  "caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt",
  "caption_size_bytes": 14725,
  "cue_count": 217,
  "transcript": "When the thing with the name names …",
  "restricted": false,
  "scraped_at": "2026-06-12T00:00:00.000Z"
}
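
Most fields in a record like this map onto one entry of an item's public file listing. The sketch below shows how such records could be derived, assuming the listing has the shape of the `files` array returned by archive.org's metadata endpoint (entries with `name`, `format`, and a string `size`). The helper name `caption_records` and the sample listing are illustrative, not the actor's own code.

```python
from typing import Any
from urllib.parse import quote

# Format names as they appear in the search filter described in the FAQ.
CAPTION_FORMATS = {"SubRip", "Web Video Text Tracks"}


def caption_records(identifier: str, files: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one record per caption file found in an item's file listing."""
    records = []
    for entry in files:
        if entry.get("format") not in CAPTION_FORMATS:
            continue  # skip video, audio, thumbnails, and so on
        name = entry["name"]
        records.append({
            "identifier": identifier,
            "caption_file_name": name,
            "caption_format": entry["format"],
            # Assumption: ".asr." in the file name marks archive's autogenerated track.
            "is_autogenerated": ".asr." in name,
            "caption_url": f"https://archive.org/download/{identifier}/{quote(name)}",
            "caption_size_bytes": int(entry.get("size", 0)),
        })
    return records


# Illustrative listing; the second entry is a made-up non-caption file.
sample_files = [
    {"name": "Doctorin1946.asr.srt", "format": "SubRip", "size": "14725"},
    {"name": "Doctorin1946.mp4", "format": "h.264", "size": "1000"},
]
records = caption_records("Doctorin1946", sample_files)
```

Only the caption entry survives the filter, so `records` holds a single row whose URL and size match the example above.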

Who is it for?

🤖 AI / RAG dataset builders — millions of hours of public-domain era film and TV speech, already transcribed.
✍️ Documentary makers & editors — search inside classic films and newsreels, get ready-to-cut SRT/VTT.
🔎 Researchers & historians — full-text search across mid-century educational films, TV news, and lectures.
🌍 Localization & subtitle teams — pull every language track an item carries in one run.

How to use it (step by step)

1. Click Try for free.
2. Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers, or set a search query (e.g. collection:prelinger).
3. (Optional) Filter languages, toggle autogenerated (.asr) captions, add extra formats (srt, vtt, segments).
4. Click Start, then open the Dataset tab to view/export.
5. (Optional) Set monitorMode + a searchQuery + a Schedule to capture newly captioned items automatically.

Quick start

{
  "itemUrls": ["https://archive.org/details/his_girl_friday"],
  "transcriptFormats": ["txt", "srt"]
}

Search a whole collection

{
  "searchQuery": "collection:prelinger",
  "maxItems": 50,
  "transcriptFormats": ["txt", "segments"]
}
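
The same inputs can be built and submitted from code. This is a minimal sketch, assuming the third-party `apify-client` Python package is installed and that you have an Apify API token. The actor ID is taken from the store URL; `build_input` and `run_actor` are hypothetical helper names.

```python
from typing import Any, Optional, Sequence

ACTOR_ID = "scrapersdelight/archive-transcript-scraper"  # taken from the store URL
ALLOWED_FORMATS = {"txt", "segments", "srt", "vtt"}


def build_input(
    item_urls: Optional[Sequence[str]] = None,
    search_query: Optional[str] = None,
    max_items: int = 5,
    formats: Sequence[str] = ("txt",),
) -> dict[str, Any]:
    """Assemble an actor input and reject obviously invalid combinations."""
    if not item_urls and not search_query:
        raise ValueError("provide item_urls, search_query, or both")
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unknown transcript formats: {sorted(unknown)}")
    run_input: dict[str, Any] = {
        "maxItems": max_items,
        "transcriptFormats": list(formats),
    }
    if item_urls:
        run_input["itemUrls"] = list(item_urls)
    if search_query:
        run_input["searchQuery"] = search_query
    return run_input


def run_actor(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset records."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


collection_input = build_input(
    search_query="collection:prelinger",
    max_items=50,
    formats=("txt", "segments"),
)
```

`collection_input` reproduces the collection example above; passing it to `run_actor` together with a token would start a billable run.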

Input

Field	What it does
`itemUrls`	archive.org item URLs / identifiers
`searchQuery`	advanced-search (Lucene) query — auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads
`languages`	keep only these caption language codes (empty = all)
`includeAutoGenerated`	include archive's `.asr` English ASR captions (default on)
`transcriptFormats`	`txt` · `segments` · `srt` · `vtt`
`maxItems`	hard cap on items per run (default 5; 0 = unlimited)
`maxCaptionFilesPerItem`	cap caption files per item (default 5; 0 = all)
`monitorMode`, `alertOnNewItem`	recurring new-item watcher + alerts
`webhookUrl`, `slackWebhookUrl`, `emailRecipients`	alert channels
`proxyConfiguration`, `requestConcurrency`	proxy + parallelism

Output

Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.
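
When post-processing an export, it helps to separate usable transcripts from flagged rows before indexing anything. The sketch below is one way to do that. It assumes a flagged row either has `restricted` set to true or carries no transcript text; the exact shape of flagged rows is not spelled out here, so treat that rule and the sample rows as assumptions.

```python
from typing import Any, Iterable


def split_records(
    records: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (usable, flagged) lists from exported dataset records."""
    usable: list[dict[str, Any]] = []
    flagged: list[dict[str, Any]] = []
    for record in records:
        has_text = bool((record.get("transcript") or "").strip())
        # Assumption: restricted rows and rows without transcript text are flagged.
        if record.get("restricted") or not has_text:
            flagged.append(record)
        else:
            usable.append(record)
    return usable, flagged


# Illustrative rows, not real actor output.
rows = [
    {"identifier": "a", "transcript": "Hello there", "restricted": False},
    {"identifier": "b", "transcript": "", "restricted": True, "note": "access-restricted"},
    {"identifier": "c", "restricted": False, "note": "no caption files"},
]
usable, flagged = split_records(rows)
```

Logging the `note` of each flagged row gives a quick summary of why transcripts are missing.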

How much does it cost?

Pay-per-event — and with no transcription compute, it's cheap:

Event	What it covers	Price
`lot-scraped`	each record returned	$0.004 / record
`lot-detail-enriched`	each caption file downloaded + parsed	$0.004 / file
`monitor-run-completed`	each scheduled watch run	$0.05 / run
`new-lot-detected`	each new item found by the monitor	$0.02 / item
`alert-delivered`	each Slack/email/webhook push	$0.005 / alert

That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
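
The arithmetic behind that figure can be reproduced from the event table. The sketch below uses `Decimal` to avoid floating-point rounding; the function name `estimate_cost` is illustrative, and how many of each event a given run triggers is for you to estimate.

```python
from decimal import Decimal

# Per-event prices in USD, copied from the table above.
PRICES = {
    "lot-scraped": Decimal("0.004"),
    "lot-detail-enriched": Decimal("0.004"),
    "monitor-run-completed": Decimal("0.05"),
    "new-lot-detected": Decimal("0.02"),
    "alert-delivered": Decimal("0.005"),
}


def estimate_cost(
    records: int,
    files: int,
    monitor_runs: int = 0,
    new_items: int = 0,
    alerts: int = 0,
) -> Decimal:
    """Sum price times count over every billable event."""
    counts = {
        "lot-scraped": records,
        "lot-detail-enriched": files,
        "monitor-run-completed": monitor_runs,
        "new-lot-detected": new_items,
        "alert-delivered": alerts,
    }
    return sum((PRICES[event] * n for event, n in counts.items()), Decimal("0"))


# 1,000 items with one caption file each: 1,000 x ($0.004 + $0.004)
thousand_films = estimate_cost(records=1000, files=1000)
```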

Monitor & alert setup

1. Set a searchQuery (e.g. collection:prelinger or subject:"television news").
2. Turn on monitorMode (and keep alertOnNewItem on).
3. Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
4. Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
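
The baseline-then-diff behaviour described in the last step can be pictured with a short sketch. It is not the actor's code: a plain dict stands in for the named key-value store, `diff_new_items` is a hypothetical name, and it assumes the baseline run reports nothing as new.

```python
from typing import Any, Iterable


def diff_new_items(current_ids: Iterable[str], state: dict[str, Any]) -> list[str]:
    """Return identifiers not seen in earlier runs and update the state in place."""
    first_run = "seen" not in state
    seen = set(state.get("seen", []))
    unique_ids = list(dict.fromkeys(current_ids))  # de-duplicate, keep order
    # Assumption: the first run only records a baseline and reports no new items.
    new_ids = [] if first_run else [i for i in unique_ids if i not in seen]
    state["seen"] = sorted(seen | set(unique_ids))
    return new_ids


state: dict[str, Any] = {}  # stand-in for the persisted store
baseline = diff_new_items(["film_a", "film_b"], state)
second = diff_new_items(["film_a", "film_b", "film_c"], state)
third = diff_new_items(["film_a", "film_b", "film_c"], state)
```

Here `baseline` and `third` are empty, while `second` contains only `film_c`, which is the item an alert would fire for.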

How does it work without AI transcription?

Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and archive's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser built for the variants found in the wild: BOM + CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost, and run time is limited mainly by downloading and parsing the files.
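
To make those variants concrete, here is a small tolerant parser sketch. It is not the actor's parser. It handles a BOM, CRLF line endings, short fractional stamps, inline tags, a leading `WEBVTT` header line, and missing cue indices, and it only recognises `hh:mm:ss` timings. One loud assumption: a 2-digit fraction is right-padded, so `,50` is read as 500 ms. If archive's files mean something else by it, the conversion needs adjusting.

```python
import re

# "," or "." before the fraction, and 1 to 3 fractional digits.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{1,3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{1,3})"
)
TAG = re.compile(r"</?[^>]+>")


def to_ms(hours: str, minutes: str, seconds: str, fraction: str) -> int:
    # Assumption: short fractions are right-padded ("50" -> 500 ms).
    whole = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return whole * 1000 + int(fraction.ljust(3, "0"))


def parse_cues(raw: str) -> list[dict]:
    """Parse SRT-like text into cues with millisecond offsets."""
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues: list[dict] = []
    current = None
    for line in text.split("\n"):
        match = TIMING.search(line)
        if match:
            parts = match.groups()
            current = {
                "index": len(cues) + 1,  # renumber, so missing indices do not matter
                "start_ms": to_ms(*parts[:4]),
                "end_ms": to_ms(*parts[4:]),
                "text": "",
            }
            cues.append(current)
        elif current is not None:
            cleaned = TAG.sub("", line).strip()
            if not cleaned:
                current = None  # a blank line ends the cue
            else:
                current["text"] = (current["text"] + " " + cleaned).strip()
        # Lines outside a cue (index numbers, headers) are ignored.
    return cues


sample = (
    "\ufeff1\r\n00:00:01,50 --> 00:00:03,20\r\n<i>Hello</i> there\r\n\r\n"
    "00:00:04,000 --> 00:00:05,500\r\nSecond cue\r\n"
)
cues = parse_cues(sample)
transcript = " ".join(cue["text"] for cue in cues)
```

The second cue in the sample has no index line and still parses, because cues are renumbered by position rather than trusted from the file.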

Is it legal to scrape archive.org captions?

The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use — review archive.org's Terms of Use and each item's rights/license statement before redistributing content.

FAQ

Which items have captions? 3M+ movies/audio items carry .srt/.vtt files — classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).
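
For readers who want to preview what such a scoped search looks like, the sketch below builds a URL for archive.org's public `advancedsearch.php` endpoint. How the actor combines its filters is not documented here, so the exact grouping of the `format` and `mediatype` clauses is an assumption, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed scoping clauses, modelled on the filter described above.
CAPTION_SCOPE = 'format:("SubRip" OR "Web Video Text Tracks")'
MEDIA_SCOPE = "mediatype:(movies OR audio)"


def build_search_url(user_query: str, rows: int = 50, page: int = 1) -> str:
    """Wrap a user query with caption scoping and sort by downloads."""
    query = f"({user_query}) AND {CAPTION_SCOPE} AND {MEDIA_SCOPE}"
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("fl[]", "downloads"),
        ("sort[]", "downloads desc"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url("collection:prelinger")
```

Opening `url` in a browser returns a JSON page of matching identifiers, which is a cheap way to sanity-check a query before starting a run.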

Is there a Whisper/ASR step? No — it downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.

Can I get subtitles for video editing? Yes — add srt and/or vtt to transcriptFormats. The actor normalizes archive's non-standard 2-digit-millisecond stamps to proper hh:mm:ss,mmm, so the files work in any editor/player.
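
If you request only `segments` and want to produce subtitle files yourself, re-emitting them is straightforward. The sketch below formats cues that carry `start_ms`, `end_ms`, and `text` into SRT and VTT with 3-digit milliseconds. The function names are illustrative, and the output is a plain rendering without styling or positioning.

```python
def format_stamp(ms: int, sep: str = ",") -> str:
    """Render milliseconds as hh:mm:ss,mmm (SRT) or hh:mm:ss.mmm (VTT)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def to_srt(cues: list[dict]) -> str:
    """Emit numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{n}\n{format_stamp(c['start_ms'])} --> {format_stamp(c['end_ms'])}\n{c['text']}\n"
        for n, c in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


def to_vtt(cues: list[dict]) -> str:
    """Emit a WEBVTT header followed by unnumbered cues."""
    blocks = [
        f"{format_stamp(c['start_ms'], '.')} --> {format_stamp(c['end_ms'], '.')}\n{c['text']}\n"
        for c in cues
    ]
    return "WEBVTT\n\n" + "\n".join(blocks)


cues = [{"start_ms": 1500, "end_ms": 3200, "text": "Hello there"}]
srt_text = to_srt(cues)
vtt_text = to_vtt(cues)
```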

What about multiple languages? Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.

Why did an item return no transcript? Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies — restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".

Can I crawl a whole collection? Yes — searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
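
Slicing by date can be automated. The sketch below generates one query per calendar year in the same `publicdate` form as the example. Lucene's square-bracket ranges include both endpoints, so adjacent slices can return the same boundary-date item twice; the second helper de-duplicates by identifier. Both function names are illustrative, and a single year may still exceed the cap for very large collections.

```python
from typing import Any, Iterable


def yearly_slices(base_query: str, first_year: int, last_year: int) -> list[str]:
    """Return one date-bounded query per calendar year, inclusive."""
    return [
        f"({base_query}) AND publicdate:[{year}-01-01 TO {year + 1}-01-01]"
        for year in range(first_year, last_year + 1)
    ]


def merge_unique(batches: Iterable[Iterable[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Concatenate result batches, keeping the first record per identifier."""
    seen: set[str] = set()
    merged: list[dict[str, Any]] = []
    for batch in batches:
        for record in batch:
            if record["identifier"] not in seen:
                seen.add(record["identifier"])
                merged.append(record)
    return merged


queries = yearly_slices("collection:prelinger", 2020, 2021)
```

Each entry in `queries` can be used as a `searchQuery` in its own run, with the exported datasets merged afterwards.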

How fresh is monitor mode? Every scheduled run re-queries your search and diffs against the named state store — you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.

Does it need a proxy or login? No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.

How do I export? JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.

What does a 1,000-film crawl cost? With one caption file each: 1,000 × ($0.004 + $0.004) = ~$8.

Feedback

Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.

Dailymotion Transcript Scraper — Subtitles to TXT, SRT, VTT

scrapersdelight/dailymotion-transcript-scraper

Extract any public Dailymotion video's subtitle transcript — no login, no ASR. By video URL/ID or a search query: full text, timestamped segments & SRT/VTT, plus title, owner and duration, from Dailymotion's own subtitle tracks. $2 per 1,000 videos.

Scrapers Delight

TikTok Transcript Scraper - JSON, SRT, VTT

jamhimself/tiktok-transcript-scraper

Extract TikTok video transcripts and subtitles as clean JSON, text, SRT, VTT, or RAG chunks with timestamps. Native captions, bulk, no API key, pay per video.

Jaime Martinez

Vimeo Transcript Scraper — Captions to TXT, SRT & VTT

scrapersdelight/vimeo-transcript-scraper

Extract any public Vimeo video's captions and transcript — no login, no ASR. By video URL/ID or a page that links Vimeo videos: transcript text, timestamped segments & SRT/VTT, plus title, owner and duration, from Vimeo's own caption tracks. $2 per 1,000 videos.

Scrapers Delight

YouTube Transcript Scraper - JSON, SRT, VTT, RAG

jamhimself/youtube-transcript-extractor

Extract YouTube transcripts & subtitles as JSON, text, SRT, VTT, or RAG chunks - bulk, 100+ languages, timestamps & deep links. Pay per video, no subscription.

Jaime Martinez

Wistia Transcript Scraper — Captions to TXT, SRT & VTT

scrapersdelight/wistia-transcript-scraper

Extract any public Wistia video's transcript and captions — no login, no ASR. By hashedId or any page that embeds Wistia: full text, timestamped segments & SRT/VTT, plus title and duration, straight from Wistia's CDN. $2 per 1,000 videos.

Scrapers Delight

YouTube Subtitle Extractor

entertained_rattlesnake/youtube-subtitle-extractor

Extract subtitles and transcripts from YouTube videos and export them as JSON, TXT, SRT and VTT.

Entertained Rattlesnake

Archive.org Scraper

lulzasaur/archive-org-scraper

Scrape the Internet Archive (archive.org). Search 50M+ texts, 13M+ audio, 16M+ movies, and 1.3M+ software items. Get metadata, download counts, file lists, and more via public APIs.

lulz bot

YouTube Transcript Scraper – Download Subtitles & Captions

harshmaur/youtube-transcript-scraper

Extract transcripts, captions & subtitles from YouTube videos, channels or playlists — no API key. Timestamped or plain text, SRT/VTT export, 156-language translation, plus full video & channel metadata. Built for AI summaries, ChatGPT & research. Pay only for transcripts returned.

Harsh Maur

5.0

YouTube Transcript Scraper

taroyamada/youtube-transcript-bulk-api

Extract YouTube captions, timestamps, SRT, VTT, and plain text from public videos in bulk without browser automation.

naoki anzai

Podcast Transcript Scraper — Any RSS Feed to Text & SRT

scrapersdelight/podcast-transcript-scraper

Extract per-episode transcripts from any podcast RSS feed via the Podcasting 2.0 <podcast:transcript> tag — no login, no ASR. Clean text, timestamped segments & SRT/VTT per episode, plus metadata. Works with Buzzsprout, Captivate, Transistor, RSS.com & more. $2 per 1,000 episodes.

Scrapers Delight

URL: https://apify.com/scrapersdelight/archive-transcript-scraper