Wayback Machine Bulk Lookup

Pricing

Pay per event

Look up Wayback Machine snapshots for any URL or list of URLs. Returns a capture timeline, optional snapshot markdown, and an optional live-vs-snapshot diff. Supports date-range filtering, a per-URL capture limit, and bulk input. Built for OSINT, journalism, SEO link-rot recovery, and legal evidence.

Developer

BowTiedRaccoon

Maintained by Community

Last modified

21 days ago

What this actor does

For each input URL, the actor:

Queries the Wayback CDX API to retrieve the snapshot index within your specified date range, up to the capture limit
Optionally fetches snapshot HTML for each capture and converts it to markdown (for reading or archiving)
Optionally fetches the current live URL and computes a line-level text diff against the most recent snapshot (to detect page changes)

Each output record contains the full snapshot timeline plus optional diff and content fields.
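
The first step above can be sketched in Python. This is a minimal illustration, not the actor's actual code: it assumes the public CDX endpoint at `https://web.archive.org/cdx/search/cdx` with its `url`, `output`, `from`, `to`, and `limit` query parameters, and the helper names (`build_cdx_query`, `parse_cdx_rows`, `to_capture`) are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public CDX endpoint; whether the actor uses exactly this URL is an assumption.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, date_from=None, date_to=None, capture_limit=100):
    """Build a CDX query URL for one input URL (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": capture_limit}
    # CDX expects compact timestamps (YYYYMMDD), so strip the ISO dashes.
    if date_from:
        params["from"] = date_from.replace("-", "")
    if date_to:
        params["to"] = date_to.replace("-", "")
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON output (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def to_capture(row):
    """Map one CDX row to the capture shape described in the Output section."""
    ts = row["timestamp"]  # 14-digit timestamp, e.g. 20260429221100
    iso = datetime.strptime(ts, "%Y%m%d%H%M%S").strftime("%Y-%m-%dT%H:%M:%SZ")
    code = row.get("statuscode", "")
    return {
        "timestamp": iso,
        "archiveUrl": f"https://web.archive.org/web/{ts}/{row['original']}",
        # Some CDX rows carry "-" instead of a numeric status code.
        "status": int(code) if code.isdigit() else None,
        "mimetype": row.get("mimetype"),
    }
```

Fetching the query URL with any HTTP client and passing the decoded JSON to `parse_cdx_rows` would yield the raw index; `to_capture` then reshapes each row into the fields listed under Output.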

Input

Field	Type	Default	Description
`urls`	array of strings	—	Required. URLs to look up in the Wayback Machine
`maxItems`	integer	—	Maximum total output records across all URLs
`dateFrom`	string	—	Earliest snapshot date to include (ISO date, e.g. `2020-01-01`)
`dateTo`	string	—	Latest snapshot date to include (ISO date, e.g. `2024-12-31`)
`captureLimit`	integer	`100`	Max snapshots per URL
`fetchSnapshotContent`	boolean	`false`	Download snapshot HTML and convert to markdown
`diffWithLive`	boolean	`false`	Compute text diff between latest snapshot and current live URL
`proxyConfiguration`	object	none	Optional proxy config (usually not needed for Wayback)

Example input:

{
  "urls": [
    "https://example.com/news/2024-article",
    "https://example.com/about"
  ],
  "dateFrom": "2023-01-01",
  "dateTo": "2024-12-31",
  "captureLimit": 50,
  "diffWithLive": true
}
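
Before submitting a large batch, it can help to check the input locally. The sketch below applies the defaults from the table above and rejects obviously malformed input. The actor's own validation rules are not documented here, so the specific checks (non-empty `urls`, positive `captureLimit`, `dateFrom` not after `dateTo`) are assumptions, and `normalize_input` is a hypothetical helper.

```python
from datetime import date

# Defaults taken from the input table; fields without a default are omitted.
DEFAULTS = {
    "captureLimit": 100,
    "fetchSnapshotContent": False,
    "diffWithLive": False,
}


def normalize_input(raw):
    """Fill in defaults and raise ValueError on malformed input (assumed rules)."""
    urls = raw.get("urls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("`urls` must be a non-empty array")
    if not all(isinstance(u, str) and u for u in urls):
        raise ValueError("every entry in `urls` must be a non-empty string")

    merged = {**DEFAULTS, **raw}

    limit = merged["captureLimit"]
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("`captureLimit` must be a positive integer")

    # date.fromisoformat raises ValueError for strings that are not ISO dates.
    parsed = {
        key: date.fromisoformat(merged[key])
        for key in ("dateFrom", "dateTo")
        if key in merged
    }
    if len(parsed) == 2 and parsed["dateFrom"] > parsed["dateTo"]:
        raise ValueError("`dateFrom` must not be later than `dateTo`")
    return merged
```

Calling `normalize_input` on the example input above returns the same fields plus `fetchSnapshotContent: False`.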

Output

One record per input URL.

Field	Type	Description
`url`	string	The input URL
`snapshotCount`	number	Number of snapshots found in the date range
`firstCaptured`	string	Earliest snapshot timestamp (ISO 8601)
`lastCaptured`	string	Latest snapshot timestamp (ISO 8601)
`captures`	array	Snapshot entries — each a JSON-encoded string with `timestamp`, `archiveUrl`, `status`, `mimetype`, and optionally `contentMarkdown`
`diff`	object	`{ addedLines, removedLines, changedRatio }` — only present when `diffWithLive=true`
`liveStatus`	number	Current HTTP status of the live URL — only present when `diffWithLive=true`
`finalLiveUrl`	string	Final URL after redirects
`status`	string	`success`, `timeout`, or `error`
`errorMsg`	string	Error details on failure, `null` on success
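
To make the `diff` field concrete, here is one way a line-level diff with these three numbers could be computed using Python's standard `difflib`. The exact algorithm the actor uses is not documented; in particular, defining `changedRatio` as one minus the similarity ratio is an assumption made for this sketch, and `line_diff` is a hypothetical name.

```python
import difflib


def line_diff(snapshot_text, live_text):
    """Count added and removed lines between snapshot and live text."""
    old_lines = snapshot_text.splitlines()
    new_lines = live_text.splitlines()
    # autojunk=False avoids difflib's heuristic that ignores very common lines.
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)

    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the snapshot
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in the live page
    return {
        "addedLines": added,
        "removedLines": removed,
        # Assumed definition: 0.0 means identical, 1.0 means nothing in common.
        "changedRatio": round(1.0 - matcher.ratio(), 4),
    }
```

Under this definition, two identical texts give `changedRatio` 0.0, and the value grows as fewer lines are shared.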

Example output record:

{
  "url": "https://example.com/news/2024-article",
  "snapshotCount": 14,
  "firstCaptured": "2024-03-12T08:42:00Z",
  "lastCaptured": "2026-04-29T22:11:00Z",
  "captures": [
    "{\"timestamp\":\"2026-04-29T22:11:00Z\",\"archiveUrl\":\"https://web.archive.org/web/20260429221100/https://example.com/news/2024-article\",\"status\":200,\"mimetype\":\"text/html\"}"
  ],
  "diff": {"addedLines": 12, "removedLines": 3, "changedRatio": 0.04},
  "liveStatus": 200,
  "finalLiveUrl": "https://example.com/news/2024-article",
  "status": "success",
  "errorMsg": null
}
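
Because each entry in `captures` is a JSON-encoded string rather than a nested object, it needs a second decoding pass before its fields can be read. A minimal sketch, with hypothetical helper names `decode_captures` and `latest_capture`:

```python
import json


def decode_captures(record):
    """Decode each JSON-encoded capture string into a dict."""
    decoded = []
    for entry in record.get("captures", []):
        # Tolerate already-decoded entries in case the shape ever changes.
        decoded.append(json.loads(entry) if isinstance(entry, str) else entry)
    return decoded


def latest_capture(record):
    """Return the most recent capture, or None when there are none."""
    captures = decode_captures(record)
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return max(captures, key=lambda c: c["timestamp"], default=None)
```

For the example record above, `latest_capture(record)["archiveUrl"]` would be the `web.archive.org` link for the 2026-04-29 capture.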

Dataset views

The actor produces two dataset views in the Apify console:

Capture Timeline — url, snapshotCount, firstCaptured, lastCaptured, captures
Live vs Snapshot Diff — url, liveStatus, diff, lastCaptured

Rate limits and performance

The actor respects Wayback Machine's rate limits:

CDX API queries: ~10 requests/second (110ms minimum delay)
Snapshot content fetches: ~1-2 requests/second (700ms minimum delay)

For large batches with `fetchSnapshotContent=true`, expect longer runtimes. The default timeout is 2 hours. Start with a small `captureLimit` (e.g. 10) to estimate runtime before running at full scale.
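
The minimum delays above also give a rough lower bound on runtime. The sketch below shows a simple minimum-delay throttle and a back-of-the-envelope estimate. It assumes requests are issued one at a time and ignores network latency, retries, and live-page fetches, so real runs will take longer; `MinDelayThrottle` and `estimate_min_runtime_s` are hypothetical names, not part of the actor.

```python
import time

CDX_DELAY_S = 0.110       # minimum delay between CDX queries (from above)
SNAPSHOT_DELAY_S = 0.700  # minimum delay between snapshot content fetches


class MinDelayThrottle:
    """Ensure at least `min_delay_s` seconds pass between consecutive calls."""

    def __init__(self, min_delay_s, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def estimate_min_runtime_s(url_count, capture_limit, fetch_snapshot_content):
    """Lower bound in seconds, assuming strictly serial requests."""
    total = url_count * CDX_DELAY_S
    if fetch_snapshot_content:
        # Worst case: every URL has `capture_limit` snapshots to download.
        total += url_count * capture_limit * SNAPSHOT_DELAY_S
    return total
```

For instance, 100 URLs with `captureLimit` 50 and content fetching enabled come to at least 100 × 0.11 + 5,000 × 0.7 = 3,511 seconds, close to an hour of the 2-hour default timeout.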

Use cases

OSINT / research: Check whether a source URL existed, when it was captured, and how its content has changed
Journalism: Verify archived versions of articles or government pages for fact-checking
SEO / link-rot recovery: Find archived versions of dead inbound links and plan redirects or outreach
Legal evidence: Retrieve timestamped snapshots of web pages for documentation
Web archiving: Bulk-check coverage for a list of URLs before deeper archiving work
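
As an illustration of the link-rot use case, the output records from a run with `diffWithLive=true` can be filtered for URLs that are dead today but have at least one archived snapshot. This is a sketch over the output fields described above; treating any `liveStatus` of 400 or higher as "dead" is an assumption, and `find_recoverable_dead_links` is a hypothetical helper.

```python
def find_recoverable_dead_links(records):
    """Return URLs whose live page fails but which have archived snapshots."""
    recoverable = []
    for record in records:
        live_status = record.get("liveStatus")
        # Assumed rule: only a numeric status of 400+ counts as a dead link.
        is_dead = isinstance(live_status, int) and live_status >= 400
        has_snapshots = record.get("snapshotCount", 0) > 0
        if record.get("status") == "success" and is_dead and has_snapshots:
            recoverable.append(
                {
                    "url": record["url"],
                    "liveStatus": live_status,
                    "lastCaptured": record.get("lastCaptured"),
                }
            )
    return recoverable
```

Each returned entry pairs a dead URL with the timestamp of its latest snapshot, which is a starting point for planning redirects or outreach.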


URL: https://apify.com/jungle_synthesizer/wayback-machine-bulk-lookup