
URL: https://apify.com/scrapersdelight/archive-transcript-scraper

Archive.org Subtitle & Transcript Scraper: TXT, SRT & VTT · Apify



Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. $2 per 1,000 transcripts.

Pricing: Pay per event

Rating: 0.0 (0)

Developer: Scrapers Delight (maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user

Last modified: 13 days ago

Archive.org Subtitle & Transcript Scraper: TXT, SRT & VTT

Pull the subtitles and captions from any Internet Archive film, TV recording, or audio item, with no login, no API key, and no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs. This actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT, one row per language. Point it at item URLs or an archive.org search query.

Because the captions already exist (uploaded subtitles or archive's own ASR), the actor spends no speech-to-text compute, which keeps runs fast and cheap.


What does it do?

For each archive.org item you give it (by URL/identifier or discovered via search), it returns:

  • Full transcript (clean plain text): always included
  • Timestamped cues: {index, start_ms, end_ms, start, end, text}
  • Normalized SRT / VTT: re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
  • One row per caption file/language: grab English ASR plus every uploaded translation
  • Item metadata: title, language, mediatype, collections, item URL
  • Search discovery: any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
  • Honest flags: items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"

The actor runs no ASR of its own and needs no API key; it reads the caption files archive.org already publishes.
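To make the cue shape and the normalized output listed above more concrete, here is a minimal Python sketch that turns millisecond offsets into SRT and VTT text with 3-digit millisecond stamps. The helper names (`format_stamp`, `make_cue`, `cues_to_srt`, `cues_to_vtt`) are hypothetical and are not the actor's own code; only the cue field names come from the list above.

```python
def format_stamp(ms: int, sep: str = ",") -> str:
    """Render milliseconds as hh:mm:ss<sep>mmm (',' for SRT, '.' for VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def make_cue(index: int, start_ms: int, end_ms: int, text: str) -> dict:
    """Build one cue using the field names from the feature list (assumed layout)."""
    return {
        "index": index,
        "start_ms": start_ms,
        "end_ms": end_ms,
        "start": format_stamp(start_ms),
        "end": format_stamp(end_ms),
        "text": text,
    }


def cues_to_srt(cues) -> str:
    """Emit SubRip text: index line, timing line, text, blank line between cues."""
    blocks = [
        f"{c['index']}\n{format_stamp(c['start_ms'])} --> {format_stamp(c['end_ms'])}\n{c['text']}\n"
        for c in cues
    ]
    return "\n".join(blocks)


def cues_to_vtt(cues) -> str:
    """Emit WebVTT text: header, then timing line and text per cue."""
    blocks = [
        f"{format_stamp(c['start_ms'], '.')} --> {format_stamp(c['end_ms'], '.')}\n{c['text']}\n"
        for c in cues
    ]
    return "WEBVTT\n\n" + "\n".join(blocks)
```

Working from integer milliseconds and formatting only at the end avoids carrying the source file's stamp quirks into the output.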


What data does it extract?

One dataset record per caption file (per language):

  • identifier, item_title, item_url, language, mediatype, collection[], downloads
  • caption_file_name, caption_format (SubRip / Web Video Text Tracks), caption_lang_code, is_autogenerated (.asr = archive's English ASR)
  • caption_url, caption_size_bytes
  • transcript, segments[], srt, vtt, cue_count
  • restricted, note, is_new (monitor), scraped_at

Example output

{
  "identifier": "Doctorin1946",
  "item_title": "Doctor in Industry (Part I)",
  "item_url": "https://archive.org/details/Doctorin1946",
  "mediatype": "movies",
  "caption_file_name": "Doctorin1946.asr.srt",
  "caption_format": "SubRip",
  "caption_lang_code": "en",
  "is_autogenerated": true,
  "caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt",
  "caption_size_bytes": 14725,
  "cue_count": 217,
  "transcript": "When the thing with the name names …",
  "restricted": false,
  "scraped_at": "2026-06-12T00:00:00.000Z"
}
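Once records like the one above are exported, a consumer will usually want to separate usable transcripts from flagged rows. The Python sketch below shows one way to do that. The function name `split_records` is hypothetical, and the rule it applies (a row is flagged when `restricted` is true, `note` is set, or there is no transcript text) is an assumption drawn from the field descriptions, not a documented contract.

```python
def split_records(records):
    """Split exported rows into (usable, flagged) lists.

    Assumption: flagged rows carry restricted=true and/or a non-empty note,
    and usable rows have transcript text with at least one cue.
    """
    usable, flagged = [], []
    for rec in records:
        has_text = bool(rec.get("transcript")) and (rec.get("cue_count") or 0) > 0
        if rec.get("restricted") or rec.get("note") or not has_text:
            flagged.append(rec)
        else:
            usable.append(rec)
    return usable, flagged
```

Keeping the flagged rows in their own list, rather than dropping them, preserves the reason a transcript is missing.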

Who is it for?

  • AI / RAG dataset builders: millions of hours of public-domain era film and TV speech, already transcribed.
  • Documentary makers & editors: search inside classic films and newsreels, get ready-to-cut SRT/VTT.
  • Researchers & historians: full-text search across mid-century educational films, TV news, and lectures.
  • Localization & subtitle teams: pull every language track an item carries in one run.

How to use it (step by step)

  1. Click Try for free.
  2. Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers, or set a search query (e.g. collection:prelinger).
  3. (Optional) filter languages, toggle autogenerated (.asr) captions, add extra formats (srt, vtt, segments).
  4. Click Start, then open the Dataset tab to view/export.
  5. (Optional) set monitorMode + a searchQuery + a Schedule to capture newly captioned items automatically.
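Step 2 accepts either a full item URL or a bare identifier. If you prepare inputs in bulk, a small normalizer along these lines can help. It is a sketch only: `to_identifier` is a hypothetical helper, it does not check the host name, and accepting `/download/` paths in addition to `/details/` is an assumption for convenience.

```python
from urllib.parse import urlparse


def to_identifier(value: str) -> str:
    """Return the archive.org identifier from an item URL or a bare identifier."""
    value = value.strip()
    if "://" not in value:
        return value  # already a bare identifier
    parts = [p for p in urlparse(value).path.split("/") if p]
    # Expected path shape: /details/{identifier}[/...]
    if len(parts) >= 2 and parts[0] in ("details", "download"):
        return parts[1]
    raise ValueError(f"not an archive.org item URL: {value!r}")
```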

Quick start

{
  "itemUrls": ["https://archive.org/details/his_girl_friday"],
  "transcriptFormats": ["txt", "srt"]
}

Search a whole collection

{
  "searchQuery": "collection:prelinger",
  "maxItems": 50,
  "transcriptFormats": ["txt", "segments"]
}

Input

| Field | What it does |
| --- | --- |
| itemUrls | archive.org item URLs / identifiers |
| searchQuery | advanced-search (Lucene) query; auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads |
| languages | keep only these caption language codes (empty = all) |
| includeAutoGenerated | include archive's .asr English ASR captions (default on) |
| transcriptFormats | txt · segments · srt · vtt |
| maxItems | hard cap on items per run (default 5; 0 = unlimited) |
| maxCaptionFilesPerItem | cap caption files per item (default 5; 0 = all) |
| monitorMode, alertOnNewItem | recurring new-item watcher + alerts |
| webhookUrl, slackWebhookUrl, emailRecipients | alert channels |
| proxyConfiguration, requestConcurrency | proxy + parallelism |
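If you assemble the input programmatically, a small builder can catch typos before a run starts. The Python sketch below uses the field names and defaults listed above. The function `build_run_input` is hypothetical, and the `("txt",)` default for `transcriptFormats` is an assumption, since the page does not state that field's default.

```python
ALLOWED_FORMATS = ("txt", "segments", "srt", "vtt")


def build_run_input(item_urls=(), search_query="", languages=(),
                    transcript_formats=("txt",), max_items=5,
                    max_caption_files_per_item=5, include_auto_generated=True):
    """Assemble an actor input dict from the documented field names."""
    if not item_urls and not search_query:
        raise ValueError("set itemUrls and/or searchQuery")
    unknown = [f for f in transcript_formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown transcriptFormats: {unknown}")
    if max_items < 0 or max_caption_files_per_item < 0:
        raise ValueError("caps must be >= 0 (0 means no cap)")
    run_input = {
        "transcriptFormats": list(transcript_formats),
        "languages": list(languages),  # empty list keeps every language
        "includeAutoGenerated": include_auto_generated,
        "maxItems": max_items,
        "maxCaptionFilesPerItem": max_caption_files_per_item,
    }
    if item_urls:
        run_input["itemUrls"] = list(item_urls)
    if search_query:
        run_input["searchQuery"] = search_query
    return run_input
```

The resulting dict can be pasted into the actor's JSON input or passed as the run input through an Apify API client.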

Output

Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.


How much does it cost?

Pay-per-event, and with no transcription compute, it's cheap:

| Event | What it covers | Price |
| --- | --- | --- |
| lot-scraped | each record returned | $0.004 / record |
| lot-detail-enriched | each caption file downloaded + parsed | $0.004 / file |
| monitor-run-completed | each scheduled watch run | $0.05 / run |
| new-lot-detected | each new item found by the monitor | $0.02 / item |
| alert-delivered | each Slack/email/webhook push | $0.005 / alert |

That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
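The arithmetic behind that figure can be sketched in a few lines of Python. The prices are copied from the event list above; the function `estimate_cost` is a hypothetical helper, and it assumes each transcript triggers exactly one `lot-scraped` and one `lot-detail-enriched` event.

```python
from decimal import Decimal

# Per-event prices in USD, as listed on this page.
PRICES = {
    "lot-scraped": Decimal("0.004"),
    "lot-detail-enriched": Decimal("0.004"),
    "monitor-run-completed": Decimal("0.05"),
    "new-lot-detected": Decimal("0.02"),
    "alert-delivered": Decimal("0.005"),
}


def estimate_cost(event_counts: dict) -> Decimal:
    """Sum price * count over the given events; rejects unknown event names."""
    unknown = set(event_counts) - set(PRICES)
    if unknown:
        raise ValueError(f"unknown events: {sorted(unknown)}")
    return sum((PRICES[name] * count for name, count in event_counts.items()),
               Decimal("0"))


# 1,000 transcripts, one caption file each: 1000 * (0.004 + 0.004) = 8 USD
print(estimate_cost({"lot-scraped": 1000, "lot-detail-enriched": 1000}))
```

`Decimal` is used instead of floats so that small per-event prices add up exactly.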


Monitor & alert setup

  1. Set a searchQuery (e.g. collection:prelinger or subject:"television news").
  2. Turn on monitorMode (and keep alertOnNewItem on).
  3. Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
  4. Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
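The baseline-then-diff behaviour in step 4 can be pictured with the following Python sketch. It is an illustration under assumptions, not the actor's implementation: `monitor_step` is a hypothetical function, the state is a plain dict here rather than the named key-value store, and the `{"seen": [...]}` layout is invented for the example.

```python
from typing import Iterable, Optional


def monitor_step(state: Optional[dict], current_ids: Iterable[str]):
    """Return (new_state, new_items) for one scheduled run.

    state is None on the very first run; afterwards it is the dict
    returned by the previous call (assumed layout: {"seen": [...]}).
    """
    current = list(dict.fromkeys(current_ids))  # de-duplicate, keep order
    if state is None:
        # First run: record everything as seen, report nothing.
        return {"seen": sorted(current)}, []
    seen = set(state["seen"])
    new_items = [i for i in current if i not in seen]
    return {"seen": sorted(seen | set(new_items))}, new_items
```

Using `None` for "never run" (instead of an empty list) keeps a first run that finds zero items from being mistaken for a missing baseline.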

How does it work without AI transcription?

Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and archive's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser that handles the variants found in the wild: BOM + CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost and runs finish quickly.
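A tolerant caption parser of the kind described above might look like the Python sketch below. It is not the actor's parser: `parse_captions` and its helpers are hypothetical, and one point is a loud assumption, namely that a short fractional part is a decimal fraction of a second (so `,50` means 500 ms). If archive's ASR files mean something else by 2-digit stamps, `_to_ms` is the place to change.

```python
import re

# Optional hours, then mm:ss, then '.' or ',' and 1 to 3 fractional digits.
STAMP = r"(?:(\d{1,2}):)?(\d{2}):(\d{2})[.,](\d{1,3})"
TIMING = re.compile(rf"{STAMP}\s*-->\s*{STAMP}")
TAG = re.compile(r"<[^>]+>")  # <i>, </i>, <b>, ...


def _to_ms(h, m, s, frac) -> int:
    # Assumption: "50" is read as 0.50 s, i.e. right-padded to "500".
    whole = int(h or 0) * 3600 + int(m) * 60 + int(s)
    return whole * 1000 + int(frac.ljust(3, "0"))


def parse_captions(raw: str) -> list:
    """Parse SRT- or VTT-style text into cue dicts with start_ms/end_ms/text."""
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        # Locate the timing line; index lines or a WEBVTT header before it are skipped.
        pos = next((i for i, ln in enumerate(lines) if TIMING.search(ln)), None)
        if pos is None:
            continue  # header, NOTE block, or junk
        g = TIMING.search(lines[pos]).groups()
        body = TAG.sub("", " ".join(lines[pos + 1:])).strip()
        if not body:
            continue  # timing line without text
        cues.append({
            "index": len(cues) + 1,  # renumbered, so missing indices do not matter
            "start_ms": _to_ms(*g[:4]),
            "end_ms": _to_ms(*g[4:]),
            "text": body,
        })
    return cues
```

Splitting on blank lines and searching each block for a timing line, rather than expecting a fixed line order, is what lets the same loop accept files with or without cue indices.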


Is it legal to scrape archive.org captions?

The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use: review archive.org's Terms of Use and each item's rights/license statement before redistributing content.


FAQ

Which items have captions? 3M+ movies/audio items carry .srt/.vtt files: classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).
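For readers curious what such a scoped query could look like, the Python sketch below builds a URL for archive.org's public advanced-search endpoint. The exact clauses the actor adds are not published here, so the combined query string, the chosen `fl[]` fields, and the function name `build_search_url` are all assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed scoping clause, modelled on the format filter mentioned above.
CAPTION_SCOPE = '(format:"SubRip" OR format:"Web Video Text Tracks")'


def build_search_url(user_query: str, rows: int = 50, page: int = 1) -> str:
    """Wrap a Lucene query with caption/mediatype filters and sort by downloads."""
    q = f"({user_query}) AND {CAPTION_SCOPE} AND mediatype:(movies OR audio)"
    params = [
        ("q", q),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("sort[]", "downloads desc"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(build_search_url("collection:prelinger", rows=5))
```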

Is there a Whisper/ASR step? No. It downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.

Can I get subtitles for video editing? Yes. Add srt and/or vtt to transcriptFormats. The actor normalizes archive's non-standard 2-digit-millisecond stamps to proper hh:mm:ss,mmm, so the files work in any editor/player.

What about multiple languages? Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.
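As a rough picture of how caption rows could be derived from an item's file list, the Python sketch below filters a metadata dict (shaped like the response of archive.org's public metadata endpoint, with a `files` list of `name`/`format`/`size` entries) and guesses a language code from each filename. The filename convention `{name}.{lang}.{ext}` and the helper `pick_caption_files` are assumptions; only the `.asr` marker for English ASR comes from this page.

```python
CAPTION_FORMATS = {"SubRip", "Web Video Text Tracks"}


def pick_caption_files(metadata: dict, languages=(), include_auto_generated=True):
    """Return one row per caption file, with a best-effort language code."""
    rows = []
    for f in metadata.get("files", []):
        if f.get("format") not in CAPTION_FORMATS:
            continue
        name = f["name"]
        parts = name.lower().split(".")
        auto = len(parts) >= 3 and parts[-2] == "asr"
        if auto:
            lang = "en"  # .asr = archive's English ASR
        elif len(parts) >= 3 and parts[-2].isalpha() and len(parts[-2]) in (2, 3):
            lang = parts[-2]  # assumed pattern: name.es.srt
        else:
            lang = None  # no language hint in the filename
        if auto and not include_auto_generated:
            continue
        if languages and lang not in languages:
            continue
        rows.append({
            "caption_file_name": name,
            "caption_format": f["format"],
            "caption_lang_code": lang,
            "is_autogenerated": auto,
            "caption_size_bytes": int(f.get("size") or 0),
        })
    return rows
```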

Why did an item return no transcript? Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies; restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".

Can I crawl a whole collection? Yes: searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
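Slicing by date can be scripted. The Python sketch below generates one query per calendar year in the same `publicdate:[... TO ...]` form as the example above; `yearly_slices` is a hypothetical helper, and yearly granularity is just one choice (a very large collection may need monthly slices to stay under the 10,000-row window).

```python
from datetime import date


def yearly_slices(query: str, first_year: int, last_year: int) -> list:
    """Return one date-bounded query per year, from first_year to last_year."""
    slices = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        slices.append(f"({query}) AND publicdate:[{start} TO {end}]")
    return slices


for q in yearly_slices("collection:prelinger", 2020, 2021):
    print(q)
```

Square-bracket ranges in Lucene syntax are inclusive at both ends, so an item dated exactly on a boundary can show up in two adjacent slices; de-duplicate by identifier when merging results.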

How fresh is monitor mode? Every scheduled run re-queries your search and diffs against the named state store, so you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.

Does it need a proxy or login? No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.

How do I export? JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.

What does a 1,000-film crawl cost? With one caption file each: 1,000 × ($0.004 + $0.004) = ~$8.


Feedback

Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.
