Wayback Machine Scraper

Pricing: Pay per usage

Scrape Wayback Machine archive snapshots for any URL or domain. Get archived URLs, timestamps, status codes, MIME types. Export to JSON, CSV, Excel.

Rating: 0.0 (0)

Developer: Glass Ventures

Maintained by Community

Last modified: a month ago

What does Wayback Machine Scraper do?

Wayback Machine Scraper uses the official Wayback Machine CDX API to retrieve historical snapshots of any website. It lets you discover every archived version of a page and filter the results by date range, content type, and HTTP status code.

Whether you need to track how a website changed over time, recover lost content, monitor competitor website changes, or build a historical dataset of web pages, this actor returns the snapshot list in a structured form. It handles pagination and rate limiting, and exports data in JSON, CSV, or Excel format.

The Wayback Machine (Archive.org) has archived over 800 billion web pages since 1996. This actor gives you structured access to that massive archive without writing any code.
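
For readers curious what a CDX API request looks like, here is a minimal sketch that builds a query URL. It is an illustration only, not this actor's own code: the helper name `build_cdx_url` and its arguments are hypothetical, and the query parameters (`url`, `matchType`, `from`, `to`, `filter`, `limit`, `output`) should be checked against the CDX server documentation before use.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, date_from=None, date_to=None, status_code=None,
                  mime_type=None, domain_wide=False, limit=None):
    """Build a CDX query URL (hypothetical helper, not the actor's code)."""
    params = [("url", url), ("output", "json")]
    if domain_wide:
        # Match the whole domain, including subdomains, instead of one URL.
        params.append(("matchType", "domain"))
    if date_from:
        params.append(("from", date_from))
    if date_to:
        params.append(("to", date_to))
    if status_code:
        params.append(("filter", f"statuscode:{status_code}"))
    if mime_type:
        params.append(("filter", f"mimetype:{mime_type}"))
    if limit:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("example.com", date_from="2023", status_code="200", limit=10))
```

The function only assembles the URL and performs no network request, so it is safe to experiment with different filter combinations before sending anything to Archive.org.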

Use Cases

SEO specialists -- Track historical changes to competitor pages, find old URLs for redirect mapping, discover deleted content
Researchers -- Build datasets of how websites evolved over time, study web history trends
Content recovery -- Find and recover deleted or changed web pages from the archive
Compliance teams -- Document historical versions of terms of service, privacy policies, or regulatory pages
Developers -- Programmatically access Wayback Machine data via API for integration into tools and pipelines

Features

Search by exact URL or entire domain (wildcard matching)
Filter snapshots by date range (from/to)
Filter by MIME type (HTML, JSON, CSS, JavaScript, images)
Filter by HTTP status code (200, 301, 404, etc.)
Bulk processing of multiple URLs and domains
Proxy support with automatic rotation
Handles rate limiting and large datasets automatically
Export to JSON, CSV, or Excel, or connect via API

How much will it cost?

The Wayback Machine CDX API is free and public. The only cost is Apify platform compute time.

Results	Estimated Cost
1,000	~$0.01
10,000	~$0.05
100,000	~$0.25

Cost Component	Per 10,000 Results
Platform compute	~$0.05
Proxy (optional)	~$0.00
Total	~$0.05
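
The estimates above imply that the cost per result falls as runs get larger. A short sketch of that arithmetic, using only the figures from the first table (the helper name `cost_per_thousand` is hypothetical, and the numbers remain rough estimates):

```python
# Estimates copied from the pricing table: results -> estimated cost in USD.
ESTIMATES = {1_000: 0.01, 10_000: 0.05, 100_000: 0.25}


def cost_per_thousand(results, cost):
    """Return the estimated cost per 1,000 results, rounded to 4 decimals."""
    return round(cost / results * 1_000, 4)


for results, cost in ESTIMATES.items():
    print(f"{results:>7,} results: ~${cost_per_thousand(results, cost):.4f} per 1,000")
```

By these figures, a 100,000-result run works out to roughly a quarter of the per-1,000 cost of a 1,000-result run.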

How to use

1. Go to the Wayback Machine Scraper page on Apify Store
2. Click "Start" or "Try for free"
3. Enter URLs to look up in the archive, or domain names for full-domain search
4. Optionally set date range filters, MIME type, and status code filters
5. Set the maximum number of items
6. Click "Start" and wait for the results

Input parameters

Parameter	Type	Description	Default
startUrls	array	Website URLs to look up in the Wayback Machine	-
domains	array	Domain names for full-domain archive search	-
dateFrom	string	Only include snapshots after this date	-
dateTo	string	Only include snapshots before this date	-
mimeTypeFilter	string	Filter by content type (text/html, application/json, all)	all
statusCodeFilter	string	Filter by HTTP status code (e.g., "200")	-
maxItems	number	Maximum snapshot records to return	1000
proxyConfig	object	Proxy settings (optional)	-
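
As a rough sketch of how these parameters fit together, the hypothetical helper below assembles a run input and drops anything left unset so the defaults from the table apply. Several details are assumptions rather than documented behavior: that `domains` takes plain strings, that at least one URL or domain is needed, and the date format in the usage line. Check the actor's input schema before relying on them.

```python
def build_run_input(start_urls=None, domains=None, date_from=None, date_to=None,
                    mime_type_filter="all", status_code_filter=None, max_items=1000):
    """Assemble a run input dict using the parameter names from the table."""
    # Assumption: a run needs at least one URL or one domain to look up.
    if not start_urls and not domains:
        raise ValueError("Provide at least one URL or domain")
    if max_items < 1:
        raise ValueError("maxItems must be a positive number")
    run_input = {
        # startUrls entries use the {"url": ...} shape shown in the API examples.
        "startUrls": [{"url": u} for u in (start_urls or [])],
        # Assumption: domains are passed as plain strings.
        "domains": list(domains or []),
        "dateFrom": date_from,
        "dateTo": date_to,
        "mimeTypeFilter": mime_type_filter,
        "statusCodeFilter": status_code_filter,
        "maxItems": max_items,
    }
    # Drop unset or empty optional fields so the actor's defaults apply.
    return {k: v for k, v in run_input.items() if v not in (None, [], "")}


# The date format below is a guess; the table does not specify one.
print(build_run_input(domains=["example.com"], date_from="2023-01-01", max_items=50))
```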

Output

The actor produces a dataset with the following fields:

{
  "originalUrl": "https://www.example.com",
  "archiveUrl": "https://web.archive.org/web/20230115120000/https://www.example.com",
  "timestamp": "20230115120000",
  "statusCode": "200",
  "mimeType": "text/html",
  "length": "1256",
  "archivedDate": "2023-01-15T12:00:00.000Z",
  "scrapedAt": "2026-04-23T10:30:00.000Z"
}

Field	Type	Description
originalUrl	string	The original URL that was archived
archiveUrl	string	Full Wayback Machine URL to view the snapshot
timestamp	string	Raw Wayback Machine timestamp (YYYYMMDDHHmmss)
statusCode	string	HTTP status code of the archived response
mimeType	string	Content type of the archived resource
length	string	Size of the archived resource in bytes
archivedDate	string	ISO 8601 date when the snapshot was taken
scrapedAt	string	ISO 8601 timestamp when data was extracted

Integrations

Connect Wayback Machine Scraper with other tools:

Apify API -- REST API for programmatic access
Webhooks -- get notified when a run finishes
Zapier / Make -- connect to 5,000+ apps
Google Sheets -- export directly to spreadsheets

API Example (Node.js)

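import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('glassventures/wayback-machine-scraper').call({
    startUrls: [{ url: 'https://www.example.com' }],
    maxItems: 100,
});

// Read the snapshot records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();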

API Example (Python)

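from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('glassventures/wayback-machine-scraper').call(run_input={
    'startUrls': [{'url': 'https://www.example.com'}],
    'maxItems': 100,
})

# Read the snapshot records from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items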

API Example (cURL)

curl "https://api.apify.com/v2/acts/glassventures~wayback-machine-scraper/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"startUrls": [{"url": "https://www.example.com"}], "maxItems": 100}'
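
The Node.js and Python examples above end with an `items` list in memory. If you would rather write a CSV file locally than use the platform's export, a sketch along these lines should work with the standard library alone. The helper name `items_to_csv` is hypothetical, and the sample item stands in for real results:

```python
import csv
import io


def items_to_csv(items, fields):
    """Serialize dataset items to CSV text, keeping only the given fields."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# Stand-in for the `items` list returned by the API client.
items = [{"timestamp": "20230115120000", "statusCode": "200", "mimeType": "text/html"}]
csv_text = items_to_csv(items, ["timestamp", "statusCode"])
print(csv_text)
```

Passing an explicit field list keeps the column order stable even if individual records are missing a key.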

Tips and tricks

Start with a small maxItems (10-50) to test before running large scrapes
Use date filters (dateFrom/dateTo) to narrow results for popular sites with thousands of snapshots
Domain-wide searches can return very large datasets -- always set a maxItems limit
Set statusCodeFilter to "200" to get only successful snapshots (skipping redirects and errors)
The Wayback Machine API can be slow for domains with millions of snapshots -- be patient
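
If a run has already finished and you want to narrow the results further without starting a new one, the same status and date filtering can be applied client-side. The sketch below relies on full 14-digit timestamps in YYYYMMDDHHmmss form sorting correctly as plain strings; the helper name `filter_snapshots` is hypothetical.

```python
def filter_snapshots(items, status_code="200", ts_from=None, ts_to=None):
    """Keep items with the given status whose timestamp lies in [ts_from, ts_to]."""
    kept = []
    for item in items:
        ts = item["timestamp"]
        if status_code is not None and item["statusCode"] != status_code:
            continue
        # 14-digit timestamps compare chronologically as strings.
        if ts_from is not None and ts < ts_from:
            continue
        if ts_to is not None and ts > ts_to:
            continue
        kept.append(item)
    # Return the survivors oldest first.
    return sorted(kept, key=lambda item: item["timestamp"])


sample_items = [
    {"timestamp": "20230115120000", "statusCode": "200"},
    {"timestamp": "20220101000000", "statusCode": "200"},
    {"timestamp": "20230301000000", "statusCode": "404"},
]
print(filter_snapshots(sample_items, ts_from="20220601000000"))
```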

FAQ

Q: Does this actor require login credentials? A: No. The Wayback Machine CDX API is completely free and public.

Q: How fast is the scraping? A: Typically 1,000-10,000 results per minute depending on the API response time. Large domain searches may take longer.

Q: What should I do if I get rate limited? A: Enable proxy configuration (proxyConfig) to rotate IPs automatically, and reduce maxConcurrency if your run exposes that setting.

Q: Can I get the actual page content from the archive? A: This actor returns snapshot metadata (URLs, dates, status codes). Use the archiveUrl field to access the actual archived page content.

Q: Why are some snapshots missing? A: The Wayback Machine does not archive every page on every visit. Some pages may have been excluded by robots.txt or simply not crawled.
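
For the page-content question above, one possible follow-up step is to fetch the archived page yourself from the values in each record. The sketch below is an assumption-laden illustration, not part of the actor: the helper names are hypothetical, and the `id_` suffix after the timestamp is a Wayback Machine URL modifier that, to the best of our knowledge, returns the capture without the archive's toolbar and link rewriting. Verify that behavior before building on it.

```python
from urllib.request import Request, urlopen


def raw_snapshot_url(timestamp, original_url):
    """Build a Wayback URL that asks for the unmodified capture."""
    # Assumption: appending "id_" to the timestamp requests the raw content.
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def fetch_snapshot(url, timeout=30):
    """Download one archived page and return its body as bytes."""
    request = Request(url, headers={"User-Agent": "wayback-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


print(raw_snapshot_url("20230115120000", "https://www.example.com"))
```

Calling `fetch_snapshot(item["archiveUrl"])` would download the page as the archive displays it. Keep the request rate low, since each call hits Archive.org directly.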

Is it legal to scrape the Wayback Machine?

The Wayback Machine (Archive.org) provides a public API specifically designed for programmatic access to archive data. This actor uses only the official CDX API. Always review and respect Archive.org's Terms of Service. For more information, see Apify's blog on web scraping legality.

Related Actors

Website Content Crawler -- Crawl and extract content from live websites
Google Cache Scraper -- Access Google's cached versions of web pages

Limitations

The CDX API may rate-limit requests for very high-volume queries
Domain-wide searches for popular domains (e.g., google.com) can return millions of records -- use date filters and maxItems
The actor returns snapshot metadata, not the actual archived page content
Some timestamps may have reduced precision (date only, no time)
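
Code that parses the timestamp field may need to allow for the reduced-precision case in the last point. A minimal sketch, assuming the short form is an 8-digit date (YYYYMMDD) and that timestamps are in UTC; the helper name `parse_wayback_timestamp` is hypothetical:

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(timestamp):
    """Parse a full (14-digit) or date-only (8-digit) timestamp as UTC."""
    formats = {14: "%Y%m%d%H%M%S", 8: "%Y%m%d"}
    fmt = formats.get(len(timestamp))
    if fmt is None:
        # Other lengths are not handled in this sketch.
        raise ValueError(f"Unsupported timestamp length: {timestamp!r}")
    return datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)


print(parse_wayback_timestamp("20230115120000").isoformat())
print(parse_wayback_timestamp("20230115").isoformat())
```

Date-only values come back as midnight UTC, so avoid comparing them against full timestamps at finer than day granularity.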

Changelog

v0.1 (2026-04-23) -- Initial release

URL: https://apify.com/glassventures/wayback-machine-scraper
