Wayback Machine URL Extractor - Archived URLs

Pricing

from $3.50 / 1,000 results

Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.

Developer

Logiover

Maintained by Community

Last modified: 4 days ago

Wayback Machine URL Extractor 🕰️ — Archived URLs from the Internet Archive

Recover every URL of a website that the Internet Archive's Wayback Machine has on record. This Wayback Machine scraper queries the public CDX API to extract archived URLs for any domain, including pages that were deleted, renamed, or lost in a migration. Feed in one domain and get back tens of thousands of unique URLs or more, each with its capture date, archived HTTP status, MIME type, and a direct Wayback snapshot link.

The actor pages through the full historical URL inventory automatically: no API key, no login, and built-in retries when the archive is slow or rate-limits requests. The output is one row per unique archived URL.

Looking to recover old URLs after a site migration, build a redirect map, find old/deleted pages, do OSINT on a domain's history, or pull a list of Internet Archive URLs without writing CDX queries by hand? This is the Internet Archive URL extractor that does it at scale.

✨ Key features

🕰️ Full historical URL inventory — pulls every unique URL the Wayback Machine has on record for a domain, going back to 1996.
🔑 No API key required — uses the open Internet Archive CDX API; no auth, no token, no login.
🌐 Subdomain & path matching — capture the host plus all subdomains and paths, or narrow down to a single host or path prefix.
📅 Date-range filtering — restrict to snapshots captured between two dates (fromDate / toDate).
✅ Status-code filtering — keep only 200 OK captures and drop dead/redirected ones.
🔗 Direct snapshot links — every row includes a ready-to-open web.archive.org/web/... URL.
🌊 Streamed pagination — pages through massive result sets with the CDX resumeKey mechanism, so memory stays flat even on 100k+ URL domains.
🔢 Result caps — set maxResults per domain, or 0 for unlimited.
📋 Multiple domains per run — process a whole list in one go.
📤 Export-ready — JSON, CSV, and Excel output via the Apify Dataset or REST API.

💡 Use cases

SEO migration & redirect maps — recover lost/old URLs after a site move and rebuild a complete 301 redirect map so you don't lose link equity.
Content recovery — find and restore blog posts, product pages, or docs that were deleted but still live in the archive.
OSINT & research — enumerate a target domain's historical footprint, old endpoints, removed pages, and forgotten subdomains.
Link reclamation — find old URLs that still earn backlinks so you can redirect them and reclaim the link value.
Finding old endpoints — surface admin paths, legacy APIs, and orphaned pages that no longer appear on the live site.
Competitive & web-archaeology research — reconstruct how a competitor's URL structure and content changed across years of snapshots.
Datasets — build a domain's URL/MIME/capture-history dataset for analysis.
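
For the redirect-map use case, post-processing the exported rows might look roughly like the sketch below. It assumes you already have the dataset rows and your own set of paths that still resolve on the live site. `build_redirect_candidates` and its `fallback` parameter are hypothetical names for illustration and are not part of the actor.

```python
from urllib.parse import urlsplit


def build_redirect_candidates(rows, live_paths, fallback="/"):
    """Map archived paths that are missing from the live site to a target.

    rows: dataset items with a "url" field (as in the actor's output).
    live_paths: set of paths that currently resolve on the live site.
    fallback: hypothetical default target; a real map needs manual review.
    """
    redirects = {}
    for row in rows:
        path = urlsplit(row["url"]).path or "/"
        # Keep the first occurrence of each missing path.
        if path not in live_paths and path not in redirects:
            redirects[path] = fallback
    return redirects


rows = [
    {"url": "http://www.example.com/old-post.html"},
    {"url": "http://www.example.com/about"},
]
print(build_redirect_candidates(rows, live_paths={"/about"}))
```

Every old path starts out pointing at the fallback; in practice you would replace those targets with the closest matching live page before deploying the 301 rules.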

📦 What you get

One row per unique archived URL, including:

Field	Description
`domain`	The normalized domain this URL belongs to
`url`	The original archived URL
`timestamp`	Raw 14-digit Wayback capture timestamp (`YYYYMMDDhhmmss`)
`capturedAt`	ISO 8601 form of the capture timestamp
`statusCode`	HTTP status the archive recorded for that capture (e.g. `200`, `301`, `404`, or `-`)
`mimeType`	Content type recorded at capture time (e.g. `text/html`)
`digest`	Wayback content digest (used internally for de-duplication)
`snapshotUrl`	Direct link to the archived snapshot on `web.archive.org`

Example output

{
  "domain": "nasa.gov",
  "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
  "timestamp": "20120114043915",
  "capturedAt": "2012-01-14T04:39:15.000Z",
  "statusCode": "200",
  "mimeType": "text/html",
  "digest": "AB23CD45EF67GH89IJ01KL23MN45OP67",
  "snapshotUrl": "https://web.archive.org/web/20120114043915/http://www.nasa.gov/mission_pages/station/main/index.html"
}
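
The `capturedAt` and `snapshotUrl` fields can both be derived from `timestamp` and `url`. The sketch below shows one way to reproduce them in Python, treating the Wayback timestamp as UTC. The helper names `captured_at` and `snapshot_url` are illustrative and not taken from the actor's code.

```python
from datetime import datetime, timezone


def captured_at(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to ISO 8601 UTC."""
    dt = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the direct snapshot link in the web.archive.org/web/<ts>/<url> form."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


row_url = "http://www.nasa.gov/mission_pages/station/main/index.html"
print(captured_at("20120114043915"))
print(snapshot_url("20120114043915", row_url))
```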

🚀 How to use it

1. Click Try for free / Start.
2. Add one or more domains to Domains (e.g. nasa.gov, bbc.com). URLs and www. are normalized automatically.
3. (Optional) Pick a matchType, set a date range, filter by status code, or raise maxResults (0 = unlimited).
4. Click Save & Start.
5. Export the archived URL list as JSON, CSV, Excel or via API, and open any row's snapshotUrl to view the archived page.

⚙️ Input

Field	Type	Description	Default
`domains`	array	Required. One or more domains or URLs (e.g. `nasa.gov`, `bbc.com`). Match wildcards are applied automatically according to `matchType`	–
`matchType`	enum	`subdomains` (host + all subdomains + paths), `host` (exact host only), `domain` (host + subdomains), `prefix` (path prefix)	`subdomains`
`fromDate`	string	Optional `YYYYMMDD` lower bound on capture date	–
`toDate`	string	Optional `YYYYMMDD` upper bound on capture date	–
`filterStatus`	string	Optional — only return captures with this HTTP status (e.g. `200`)	–
`maxResults`	integer	Max unique URLs per domain. `0` = unlimited	`5000`
`proxyConfiguration`	object	Proxy settings. Defaults to Apify Proxy	Apify Proxy

Example input

{
  "domains": ["nasa.gov"],
  "matchType": "subdomains",
  "fromDate": "20100101",
  "toDate": "20201231",
  "filterStatus": "200",
  "maxResults": 5000,
  "proxyConfiguration": {"useApifyProxy": true}
}
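
If you prefer to start runs from code, the same input can be sent through the Apify REST API. The sketch below only builds the request and does not send it. It assumes the standard Apify API v2 `run-sync-get-dataset-items` endpoint and an actor ID derived from this actor's store URL (with `~` in place of `/`). The token value is a placeholder and `build_run_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed actor ID, derived from the store URL logiover/wayback-machine-url-extractor.
ACTOR_ID = "logiover~wayback-machine-url-extractor"


def build_run_request(run_input: dict, token: str) -> urllib.request.Request:
    """Prepare a POST request that runs the actor and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "domains": ["nasa.gov"],
    "matchType": "subdomains",
    "fromDate": "20100101",
    "toDate": "20201231",
    "filterStatus": "200",
    "maxResults": 5000,
}
request = build_run_request(run_input, token="<YOUR_APIFY_TOKEN>")
print(request.get_method(), request.full_url)
# To execute the run, pass `request` to urllib.request.urlopen and parse the JSON body.
```

Synchronous runs suit small result sets; for large domains an asynchronous run followed by a dataset download is likely the safer choice.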

🔍 How it works

Each domain you provide is normalized — scheme, www., paths and wildcards are stripped down to a bare host.
A CDX API query is built from your matchType, date range, and status filter, requesting the original, timestamp, statuscode, mimetype and digest fields with collapse=urlkey so each URL appears only once instead of returning every capture of it.
Results are paged using the CDX showResumeKey / resumeKey mechanism, and each page is pushed to the dataset in a batch — so even domains with hundreds of thousands of archived URLs stream out without exhausting memory.
For every row, a direct snapshotUrl is constructed in the https://web.archive.org/web/<timestamp>/<original-url> form, so you can open the exact archived page.
Slow responses, 5xx, and 429 errors are retried with exponential backoff on a fresh proxy IP — the CDX index can be slow, so retries keep large runs reliable.
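
The query-building step described above could be sketched as follows. This is not the actor's source code. The CDX parameters (`url`, `matchType`, `fl`, `collapse`, `filter`, `from`, `to`, `limit`, `showResumeKey`, `resumeKey`) are real CDX API options, but the mapping from the actor's `matchType` values to CDX values, the page size, and the function name `build_cdx_url` are assumptions.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Assumed mapping from the actor's matchType input to CDX matchType values.
CDX_MATCH_TYPE = {
    "subdomains": "domain",
    "domain": "domain",
    "host": "host",
    "prefix": "prefix",
}


def build_cdx_url(host, match_type="subdomains", from_date=None, to_date=None,
                  filter_status=None, limit=1000, resume_key=None):
    """Build one CDX query URL that returns each archived URL only once."""
    params = [
        ("url", host),
        ("matchType", CDX_MATCH_TYPE[match_type]),
        ("output", "json"),
        ("fl", "original,timestamp,statuscode,mimetype,digest"),
        ("collapse", "urlkey"),       # one row per unique URL
        ("showResumeKey", "true"),    # ask the server for a pagination key
        ("limit", str(limit)),        # assumed page size
    ]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if filter_status:
        params.append(("filter", f"statuscode:{filter_status}"))
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("nasa.gov", from_date="20100101", to_date="20201231",
                    filter_status="200"))
```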

🧰 Tips & best practices

Big domains (news sites, government sites) can have hundreds of thousands of archived URLs. Start with the default maxResults of 5000 to gauge volume, then raise it or set 0 for everything.
Use filterStatus: "200" to skip dead and redirected captures and keep only pages that actually resolved — ideal for building redirect maps.
Narrow with fromDate / toDate (both YYYYMMDD) when you only care about a specific era of the site.
Use matchType: "subdomains" to sweep every subdomain at once, or host for a single host without its subdomains.
Sort or filter the dataset by mimeType to isolate just HTML pages, images, PDFs, etc.
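
As an illustration of the last tip, the following sketch counts rows per `mimeType` and keeps only one content type. It assumes the rows have already been exported as a list of dictionaries with the output fields listed earlier. The function names are hypothetical.

```python
from collections import Counter


def mime_breakdown(rows):
    """Count dataset rows per mimeType to see what a domain's archive contains."""
    return Counter(row.get("mimeType", "unknown") for row in rows)


def only_mime(rows, mime_type="text/html"):
    """Keep rows whose mimeType matches exactly."""
    return [row for row in rows if row.get("mimeType") == mime_type]


rows = [
    {"url": "http://example.com/", "mimeType": "text/html"},
    {"url": "http://example.com/logo.png", "mimeType": "image/png"},
    {"url": "http://example.com/report.pdf", "mimeType": "application/pdf"},
    {"url": "http://example.com/about", "mimeType": "text/html"},
]
print(mime_breakdown(rows).most_common())
print([row["url"] for row in only_mime(rows)])
```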

❓ FAQ

How do I get all URLs of a website from the Wayback Machine?

Add the domain to Domains, leave matchType on subdomains, set maxResults to 0 for everything, and run it. The actor queries the Internet Archive CDX API and returns one row per unique archived URL.

Can I find old or deleted pages of a domain?

Yes — that's the core use case. The Wayback Machine keeps URLs even after they're removed from the live site, so deleted blog posts, retired product pages, and old endpoints all show up in the results with a snapshotUrl to view them.

How do I export archived URLs to CSV or JSON?

Run the actor, then download the dataset as CSV, JSON or Excel (or pull it via the REST API). Every archived URL is one row, so it drops straight into a spreadsheet or pipeline.

Is this free and without an API key?

The Internet Archive CDX API is public and requires no API key and no login. The only cost is running the actor on Apify, which is priced from $3.50 per 1,000 results.

Can I filter by date or status code?

Yes — set fromDate / toDate (YYYYMMDD) to restrict to a capture window, and filterStatus (e.g. 200) to keep only captures with a specific HTTP status.

How many URLs can it return?

The default cap is 5,000 unique URLs per domain; raise maxResults, or set it to 0 for unlimited. Results stream to the dataset in pages via the CDX resumeKey, so even domains with 100k+ URLs run without memory issues.
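
Resume-key pagination can be sketched as below. The sketch assumes the CDX JSON layout in which each page starts with a header row of field names and, when more results exist, ends with an empty row followed by a one-element row holding the resume key. `fetch_page` is a hypothetical callable that you would implement with an HTTP client. None of this is taken from the actor's code.

```python
def split_cdx_page(page):
    """Split one parsed JSON page into (records, resume_key)."""
    rows, resume_key = page, None
    # Assumed trailer: [] followed by ["<resumeKey>"].
    if len(rows) >= 2 and rows[-2] == [] and len(rows[-1]) == 1:
        resume_key = rows[-1][0]
        rows = rows[:-2]
    if not rows:
        return [], resume_key
    header, *data = rows
    return [dict(zip(header, row)) for row in data], resume_key


def iter_archived_urls(fetch_page, max_results=0):
    """Yield records page by page; max_results=0 means unlimited."""
    seen, resume_key = 0, None
    while True:
        records, resume_key = split_cdx_page(fetch_page(resume_key))
        for record in records:
            yield record
            seen += 1
            if max_results and seen >= max_results:
                return
        if not resume_key:
            return


# Fake two-page response standing in for real CDX calls.
HEADER = ["original", "timestamp"]
PAGES = {
    None: [HEADER, ["http://a.example/", "20200101000000"],
           ["http://a.example/x", "20200102000000"], [], ["key-1"]],
    "key-1": [HEADER, ["http://a.example/y", "20200103000000"]],
}
print([r["original"] for r in iter_archived_urls(lambda key: PAGES[key])])
```

Because records are yielded one page at a time, memory use depends on the page size rather than on the total number of archived URLs.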

Why are some `statusCode` values `-`?

The Wayback index sometimes records captures without a stored status code (e.g. revisit records). Those rows are still valid archived URLs.
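
If you load the dataset into a typed pipeline, the `-` placeholder needs explicit handling. A minimal sketch, with `parse_status` as a hypothetical helper name:

```python
from typing import Optional


def parse_status(status_code: str) -> Optional[int]:
    """Return the HTTP status as an int, or None when the archive stored '-'."""
    return int(status_code) if status_code.isdigit() else None


print(parse_status("200"), parse_status("-"))
```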

🔗 Related actors by the same author

Sitemap to URL Crawler — extract all URLs from any sitemap.xml.
Website SEO Audit Crawler — run a full on-page SEO audit across a whole site.
Bulk URL Status Checker — check HTTP status codes for a list of URLs in bulk.
Broken Link Checker — crawl a site and find dead links with HTTP status codes.

📝 Changelog

2026-06-15

Initial release — extract archived URLs from the Wayback Machine CDX API with date/status filters, CSV/JSON export, no API key.

URL: https://apify.com/logiover/wayback-machine-url-extractor