URL: https://apify.com/andok/wayback-machine-scraper

Wayback Machine Archive Scraper · Apify


Pricing

$1.00 / 1,000 snapshots retrieved

Wayback Machine Archive Scraper

Fetch historical snapshots of any webpage from the Internet Archive. Perfect for digital forensics and tracking deleted content.

Rating

0.0

(0)

Developer

๐Ÿ‘ Andok

Andok

Maintained by Community

Actor stats

1

Bookmarked

68

Total users

16

Monthly active users

3 months ago

Last modified

Wayback Machine Scraper for Historical Snapshots

Retrieve historical web page snapshots from the Internet Archive for compliance checks, competitive due diligence, and content recovery. Feed it a list of URLs and get back every archived snapshot with timestamps, status codes, and archive links, or optionally fetch the full HTML of the latest snapshot. Built on the official Wayback CDX API for accurate, structured results.

Features

  • Bulk URL processing: check snapshot history for dozens of URLs in a single run
  • Date range filtering: narrow results to a specific time window with the from and to parameters
  • Deduplication: collapse identical snapshots by digest to reduce noise
  • Status code filtering: only return snapshots with specific HTTP status codes (default: 200)
  • HTML retrieval: optionally fetch the archived HTML content for the most recent snapshot
  • Concurrent processing: configurable parallelism for faster batch runs
  • Structured metadata: every snapshot includes timestamp, original URL, MIME type, and archive URL
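Several of these features correspond to query parameters of the public Wayback CDX API. The sketch below shows how such a query URL could be assembled directly. It is an illustration only, not the actor's actual code: the helper name `build_cdx_url` and its argument names are hypothetical, and the defaults simply mirror the ones listed on this page.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, from_ts=None, to_ts=None, limit=50,
                  collapse="digest", status_filter="statuscode:200"):
    """Assemble a CDX query URL (hypothetical helper for illustration)."""
    params = [("url", url), ("output", "json"), ("limit", str(limit))]
    if from_ts:
        params.append(("from", from_ts))          # date range start
    if to_ts:
        params.append(("to", to_ts))              # date range end
    if collapse:
        params.append(("collapse", collapse))     # deduplication
    if status_filter:
        params.append(("filter", status_filter))  # status code filtering
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_url("https://example.com", from_ts="2023", to_ts="2025", limit=10))
```

The function only builds the URL and sends nothing, so it can be inspected or logged before any request is made.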

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array | Yes | ["https://example.com"] | List of URLs to look up in the Wayback Machine |
| url | string | No | - | Single URL (backwards compatible, merged with urls) |
| from | string | No | - | Start date for snapshot range (format: YYYY or YYYYMMDDhhmmss) |
| to | string | No | - | End date for snapshot range (format: YYYY or YYYYMMDDhhmmss) |
| limit | integer | No | 50 | Maximum snapshots to return per URL (1-5000) |
| collapse | string | No | digest | Collapse parameter to deduplicate snapshots (e.g. digest, timestamp:8) |
| filterStatus | string | No | statuscode:200 | HTTP status filter for snapshots (e.g. statuscode:200) |
| includeHtml | boolean | No | false | Fetch the archived HTML content for the latest snapshot (experimental) |
| timeoutSeconds | integer | No | 20 | Per-request timeout in seconds (1-120) |
| concurrency | integer | No | 5 | Number of URLs to process in parallel (1-25) |

Input Example

{
  "urls": ["https://example.com", "https://news.ycombinator.com"],
  "from": "2023",
  "to": "2025",
  "limit": 10,
  "includeHtml": false
}
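An input like the one above can also be generated programmatically. The following is a minimal sketch that builds the input object and checks the numeric ranges documented in the Input table before a run is started. The function `build_run_input` and its argument names are hypothetical; only the JSON field names and ranges come from this page.

```python
import json


def build_run_input(urls, from_ts=None, to_ts=None, limit=50,
                    include_html=False, timeout_seconds=20, concurrency=5):
    """Build and validate an input object (hypothetical helper)."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    # Ranges taken from the Input table on this page.
    if not 1 <= limit <= 5000:
        raise ValueError("limit must be between 1 and 5000")
    if not 1 <= timeout_seconds <= 120:
        raise ValueError("timeoutSeconds must be between 1 and 120")
    if not 1 <= concurrency <= 25:
        raise ValueError("concurrency must be between 1 and 25")

    run_input = {
        "urls": list(urls),
        "limit": limit,
        "includeHtml": include_html,
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }
    # Optional date bounds are only included when provided.
    if from_ts:
        run_input["from"] = from_ts
    if to_ts:
        run_input["to"] = to_ts
    return run_input


if __name__ == "__main__":
    example = build_run_input(
        ["https://example.com", "https://news.ycombinator.com"],
        from_ts="2023", to_ts="2025", limit=10,
    )
    print(json.dumps(example, indent=2))
```

The resulting dictionary serializes to the same JSON shape as the example and could be pasted into the actor's input editor or passed along through an API client.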

Output

Each dataset item represents one input URL with its snapshot history. Key fields:

  • inputUrl (string): the URL that was looked up
  • snapshotCount (number): total number of matching snapshots found
  • snapshots (array): list of snapshot objects with timestamp, original, statuscode, mimetype, length, and archiveUrl
  • latestSnapshot (object): the most recent snapshot, or null if none found
  • latestHtml (string): archived HTML content (only when includeHtml is enabled)
  • checkedAt (string): ISO timestamp of when the check was performed
  • error (string): error message if the lookup failed, otherwise null

Output Example

{
  "inputUrl": "https://example.com",
  "snapshotCount": 3,
  "snapshots": [
    {
      "timestamp": "20250110153022",
      "original": "https://example.com",
      "statuscode": 200,
      "mimetype": "text/html",
      "length": 1256,
      "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"
    }
  ],
  "latestSnapshot": {
    "timestamp": "20250110153022",
    "original": "https://example.com",
    "statuscode": 200,
    "mimetype": "text/html",
    "length": 1256,
    "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"
  },
  "latestHtml": null,
  "checkedAt": "2025-01-20T12:00:00.000Z",
  "error": null
}
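Downstream code usually needs the 14-digit timestamps as real datetime values. The sketch below reduces one dataset item of the shape shown above to a short summary. The helper names are hypothetical, and the timestamps are assumed to be in UTC, as is conventional for Wayback Machine captures.

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a YYYYMMDDhhmmss timestamp, assuming it is expressed in UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def summarize_item(item: dict) -> dict:
    """Reduce one dataset item to a few fields (hypothetical helper)."""
    latest = item.get("latestSnapshot")  # may be None when nothing was found
    return {
        "url": item["inputUrl"],
        "ok": item.get("error") is None,
        "snapshot_count": item.get("snapshotCount", 0),
        "latest_capture": parse_wayback_timestamp(latest["timestamp"]) if latest else None,
        "latest_archive_url": latest["archiveUrl"] if latest else None,
    }


if __name__ == "__main__":
    item = {
        "inputUrl": "https://example.com",
        "snapshotCount": 3,
        "latestSnapshot": {
            "timestamp": "20250110153022",
            "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com",
        },
        "error": None,
    }
    print(summarize_item(item))
```

Items with a non-null error field are reported with ok set to False rather than raising, so a batch can be summarized in one pass.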

Pricing

| Event | Cost |
| --- | --- |
| Snapshot Retrieved | Pay-per-event (see actor pricing page) |
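For budgeting, a rough upper bound on run cost can be computed from the rate of $1.00 per 1,000 snapshots retrieved listed at the top of this page. The sketch below assumes that each returned snapshot is billed as one event and that no other charges apply; both are assumptions, and the actor pricing page remains the authoritative source. The function name is hypothetical.

```python
# Rate listed on this page: $1.00 per 1,000 snapshots retrieved.
PRICE_PER_1000_SNAPSHOTS_USD = 1.00


def estimate_max_cost_usd(url_count: int, limit_per_url: int) -> float:
    """Upper bound on cost, assuming one billed event per returned snapshot."""
    max_snapshots = url_count * limit_per_url
    return max_snapshots * PRICE_PER_1000_SNAPSHOTS_USD / 1000


if __name__ == "__main__":
    # 100 URLs at the default limit of 50 snapshots each.
    print(f"${estimate_max_cost_usd(100, 50):.2f}")
```

The actual cost is lower whenever a URL has fewer matching snapshots than the configured limit.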

Use Cases

  • Compliance & legal: retrieve historical versions of terms of service, privacy policies, or product pages
  • Competitive due diligence: review how a competitor's website evolved over time before a deal or partnership
  • Content recovery: recover lost or deleted web pages from the Internet Archive
  • SEO auditing: check when a page was last crawled and compare historical content changes
  • Brand monitoring: verify historical claims or track how a brand's messaging changed
  • Research & journalism: access archived versions of news articles or government pages

Related Actors

| Actor | What it adds |
| --- | --- |
| Google News Scraper | Monitor current news coverage alongside historical archive lookups |
| Broken Links Checker | Find dead links on your site, then recover them via Wayback Machine |
| Sitemap Extractor | Extract all URLs from a sitemap to feed into bulk Wayback lookups |

Notes

  • The Wayback Machine CDX API is free but may throttle under heavy load. Use the concurrency setting conservatively for large batches.
  • The includeHtml option is experimental and may fail for very large pages or pages with complex JavaScript rendering.
