
URL: https://apify.com/logiover/wayback-machine-url-extractor


Wayback Machine URL Extractor - Archived URLs

Pricing

from $3.50 / 1,000 results


Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.


Developer: Logiover (maintained by Community)


Wayback Machine URL Extractor: Archived URLs from the Internet Archive

Recover every historical URL a website has ever published, straight from the Internet Archive's Wayback Machine. This Wayback Machine scraper queries the public CDX API to extract archived and historical URLs for any domain, including pages that were deleted, renamed, or lost in a migration. Feed in one domain and get back up to tens of thousands of unique URLs, each with its capture date, archived HTTP status, MIME type, and a direct Wayback snapshot link.

Point it at one domain and it pulls the full historical URL inventory automatically, one row per archived URL. No API key and no login are required, and rate-limit errors are retried automatically.

Looking to recover old URLs after a site migration, build a redirect map, find old/deleted pages, do OSINT on a domain's history, or pull a list of Internet Archive URLs without writing CDX queries by hand? This is the Internet Archive URL extractor that does it at scale.


Key features

  • Full historical URL inventory: pulls every unique URL the Wayback Machine has on record for a domain, going back to 1996.
  • No API key required: uses the open Internet Archive CDX API; no auth, no token, no login.
  • Subdomain & path matching: capture the host plus all subdomains and paths, or narrow down to a single host or path prefix.
  • Date-range filtering: restrict to snapshots captured between two dates (fromDate / toDate).
  • Status-code filtering: keep only 200 OK captures and drop dead/redirected ones.
  • Direct snapshot links: every row includes a ready-to-open web.archive.org/web/... URL.
  • Streamed pagination: pages through massive result sets with the CDX resumeKey mechanism, so memory stays flat even on 100k+ URL domains.
  • Result caps: set maxResults per domain, or 0 for unlimited.
  • Multiple domains per run: process a whole list in one go.
  • Export-ready: JSON, CSV, and Excel output via the Apify Dataset or REST API.

Use cases

  • SEO migration & redirect maps: recover lost/old URLs after a site move and rebuild a complete 301 redirect map so you don't lose link equity.
  • Content recovery: find and restore blog posts, product pages, or docs that were deleted but still live in the archive.
  • OSINT & research: enumerate a target domain's historical footprint, old endpoints, removed pages, and forgotten subdomains.
  • Link reclamation: find old URLs that still earn backlinks so you can redirect them and reclaim the link value.
  • Finding old endpoints: surface admin paths, legacy APIs, and orphaned pages that no longer appear on the live site.
  • Competitive & web-archaeology research: reconstruct how a competitor's URL structure and content changed across years of snapshots.
  • Datasets: build a domain's URL/MIME/capture-history dataset for analysis.

What you get

One row per unique archived URL, including:

| Field | Description |
| --- | --- |
| domain | The normalized domain this URL belongs to |
| url | The original archived URL |
| timestamp | Raw 14-digit Wayback capture timestamp (YYYYMMDDhhmmss) |
| capturedAt | ISO 8601 form of the capture timestamp |
| statusCode | HTTP status the archive recorded for that capture (e.g. 200, 301, 404, or -) |
| mimeType | Content type recorded at capture time (e.g. text/html) |
| digest | Wayback content digest (used internally for de-duplication) |
| snapshotUrl | Direct link to the archived snapshot on web.archive.org |

Example output

{
  "domain": "nasa.gov",
  "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
  "timestamp": "20120114043915",
  "capturedAt": "2012-01-14T04:39:15.000Z",
  "statusCode": "200",
  "mimeType": "text/html",
  "digest": "AB23CD45EF67GH89IJ01KL23MN45OP67",
  "snapshotUrl": "https://web.archive.org/web/20120114043915/http://www.nasa.gov/mission_pages/station/main/index.html"
}
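
The capturedAt and snapshotUrl fields can both be derived from timestamp and url. The following Python sketch shows one way to do that; the function names are hypothetical and it assumes, as is standard for Wayback timestamps, that the 14-digit value is in UTC.

```python
from datetime import datetime, timezone


def captured_at(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to ISO 8601."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # Wayback timestamps have second precision, so milliseconds are always 000.
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the direct link to the archived copy of original_url."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


row = {
    "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
    "timestamp": "20120114043915",
}
print(captured_at(row["timestamp"]))
print(snapshot_url(row["timestamp"], row["url"]))
```

Run against the example row, this reproduces the capturedAt and snapshotUrl values shown in the example output.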

How to use it

  1. Click Try for free / Start.
  2. Add one or more domains to Domains (e.g. nasa.gov, bbc.com). URLs and www. are normalized automatically.
  3. (Optional) Pick a matchType, set a date range, filter by status code, or raise maxResults (0 = unlimited).
  4. Click Save & Start.
  5. Export the archived URL list as JSON, CSV, Excel or via API, and open any row's snapshotUrl to view the archived page.
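
Step 2 says that URLs and www. prefixes are normalized automatically. The actor's own normalization code is not shown here, so the following Python function is only a sketch of what such a step might look like; normalize_domain is a hypothetical name.

```python
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a domain, URL, or wildcard pattern to a bare lowercase host."""
    value = raw.strip().lower()
    # urlsplit only recognizes the host when a "//" authority marker is present.
    if "://" not in value:
        value = "//" + value
    host = urlsplit(value).hostname or ""
    # Drop a leading wildcard such as "*." and then a leading "www.".
    host = host.lstrip("*.")
    if host.startswith("www."):
        host = host[4:]
    return host


for example in ["nasa.gov", "https://www.nasa.gov/missions/", "*.bbc.com/*"]:
    print(example, "->", normalize_domain(example))
```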

Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| domains | array | Required. One or more domains or URLs (e.g. nasa.gov, bbc.com). Wildcards are added automatically. | (none) |
| matchType | enum | subdomains (host + all subdomains + paths), host (exact host only), domain (host + subdomains), prefix (path prefix) | subdomains |
| fromDate | string | Optional YYYYMMDD lower bound on capture date | (none) |
| toDate | string | Optional YYYYMMDD upper bound on capture date | (none) |
| filterStatus | string | Optional. Only return captures with this HTTP status (e.g. 200) | (none) |
| maxResults | integer | Max unique URLs per domain. 0 = unlimited | 5000 |
| proxyConfiguration | object | Proxy settings. Defaults to Apify Proxy | Apify Proxy |

Example input

{
  "domains": ["nasa.gov"],
  "matchType": "subdomains",
  "fromDate": "20100101",
  "toDate": "20201231",
  "filterStatus": "200",
  "maxResults": 5000,
  "proxyConfiguration": {"useApifyProxy": true}
}
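
Before starting a run, it can help to check an input object against the constraints in the Input table. The Python sketch below is an illustration only: validate_input is a hypothetical helper, not part of the actor, and the rules it enforces are inferred from the table (the matchType values, YYYYMMDD dates, and a non-negative maxResults).

```python
from datetime import datetime

# Values taken from the Input table; everything else here is illustrative.
MATCH_TYPES = {"subdomains", "host", "domain", "prefix"}


def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks fine."""
    problems = []

    domains = run_input.get("domains")
    if not isinstance(domains, list) or not domains:
        problems.append("domains must be a non-empty array")

    if run_input.get("matchType", "subdomains") not in MATCH_TYPES:
        problems.append("matchType must be one of " + ", ".join(sorted(MATCH_TYPES)))

    valid_dates = {}
    for key in ("fromDate", "toDate"):
        value = run_input.get(key)
        if value is None:
            continue
        try:
            # strptime alone would accept short strings such as "2010011",
            # so the length is checked explicitly.
            if len(value) != 8:
                raise ValueError(key)
            datetime.strptime(value, "%Y%m%d")
            valid_dates[key] = value
        except (TypeError, ValueError):
            problems.append(f"{key} must be a valid YYYYMMDD date")

    # YYYYMMDD strings sort chronologically, so a string comparison is enough.
    if len(valid_dates) == 2 and valid_dates["fromDate"] > valid_dates["toDate"]:
        problems.append("fromDate must not be later than toDate")

    max_results = run_input.get("maxResults", 5000)
    if isinstance(max_results, bool) or not isinstance(max_results, int) or max_results < 0:
        problems.append("maxResults must be an integer >= 0 (0 means unlimited)")

    return problems


example = {
    "domains": ["nasa.gov"],
    "matchType": "subdomains",
    "fromDate": "20100101",
    "toDate": "20201231",
    "filterStatus": "200",
    "maxResults": 5000,
}
print(validate_input(example))
```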

How it works

  1. Each domain you provide is normalized: scheme, www., paths and wildcards are stripped down to a bare host.
  2. A CDX API query is built from your matchType, date range, and status filter, requesting the original, timestamp, statuscode, mimetype and digest fields with collapse=urlkey so each URL appears only once instead of returning every capture of it.
  3. Results are paged using the CDX showResumeKey / resumeKey mechanism, and each page is pushed to the dataset in a batch, so even domains with hundreds of thousands of archived URLs stream out without exhausting memory.
  4. For every row, a direct snapshotUrl is constructed in the https://web.archive.org/web/<timestamp>/<original-url> form, so you can open the exact archived page.
  5. Slow responses, 5xx, and 429 errors are retried with exponential backoff on a fresh proxy IP. The CDX index can be slow, so retries keep large runs reliable.
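
Steps 2 and 3 can be sketched in Python roughly as follows. This is not the actor's source code: the function names, the page size, and the choice of the plain-text CDX output (space-separated fields, with the resume key printed after a blank line when showResumeKey=true) are assumptions made for illustration. How the actor maps its own matchType values onto the CDX matchType parameter is not documented here, so the sketch passes a CDX value directly. Retries and proxies from step 5 are left out.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
FIELDS = ["original", "timestamp", "statuscode", "mimetype", "digest"]


def build_cdx_url(host: str, match_type: str = "domain",
                  from_date: Optional[str] = None, to_date: Optional[str] = None,
                  status: Optional[str] = None, page_size: int = 1000,
                  resume_key: Optional[str] = None) -> str:
    """Build one CDX query URL; collapse=urlkey keeps one capture per URL."""
    params = [
        ("url", host),
        ("matchType", match_type),
        ("fl", ",".join(FIELDS)),
        ("collapse", "urlkey"),
        ("showResumeKey", "true"),
        ("limit", str(page_size)),
    ]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_page(body: str):
    """Split a plain-text CDX page into row dicts and an optional resume key."""
    lines = body.split("\n")
    while lines and not lines[-1].strip():
        lines.pop()  # ignore trailing blank lines
    resume_key = None
    # When more results exist, the key is the last line, preceded by a blank line.
    if len(lines) >= 2 and not lines[-2].strip():
        resume_key = lines[-1].strip()
        lines = lines[:-2]
    rows = [dict(zip(FIELDS, line.split(" "))) for line in lines if line.strip()]
    return rows, resume_key


def iter_rows(fetch: Callable[[str], str], host: str,
              max_results: int = 5000, **filters) -> Iterator[dict]:
    """Yield rows page by page so only one page is held in memory at a time."""
    emitted = 0
    resume_key = None
    while True:
        url = build_cdx_url(host, resume_key=resume_key, **filters)
        rows, resume_key = parse_page(fetch(url))
        for row in rows:
            yield row
            emitted += 1
            if max_results and emitted >= max_results:  # 0 means unlimited
                return
        if not resume_key:
            return


def http_fetch(url: str) -> str:
    """Plain HTTP fetcher without retries; pass it as the fetch argument."""
    with urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")
```

Passing the fetcher in as an argument keeps the paging logic separate from networking, so a retrying or proxy-aware fetcher can be swapped in, for example `iter_rows(http_fetch, "nasa.gov", max_results=100, status="200")`.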

Tips & best practices

  • Big domains (news sites, government sites) can have hundreds of thousands of archived URLs. Start with the default maxResults of 5000 to gauge volume, then raise it or set 0 for everything.
  • Use filterStatus: "200" to skip dead and redirected captures and keep only pages that actually resolved, which is ideal for building redirect maps.
  • Narrow with fromDate / toDate (both YYYYMMDD) when you only care about a specific era of the site.
  • Use matchType: "subdomains" to sweep every subdomain at once, or host for a single host without its subdomains.
  • Sort or filter the dataset by mimeType to isolate just HTML pages, images, PDFs, etc.
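
The last tip, filtering by mimeType, can be done on an exported dataset with a few lines of Python. The sketch below assumes the rows have already been loaded as a list of dictionaries with the output fields described earlier; the sample rows and helper names are made up for illustration.

```python
from collections import Counter


def count_by_mime(rows):
    """Count how many archived URLs were recorded with each MIME type."""
    return Counter(row.get("mimeType", "unknown") for row in rows)


def filter_by_mime(rows, prefix):
    """Keep rows whose mimeType starts with prefix, e.g. "text/html" or "image/"."""
    return [row for row in rows if row.get("mimeType", "").startswith(prefix)]


# Hypothetical sample rows standing in for a downloaded dataset.
rows = [
    {"url": "http://example.com/", "mimeType": "text/html"},
    {"url": "http://example.com/logo.png", "mimeType": "image/png"},
    {"url": "http://example.com/report.pdf", "mimeType": "application/pdf"},
    {"url": "http://example.com/about", "mimeType": "text/html"},
]

html_rows = filter_by_mime(rows, "text/html")
print(count_by_mime(rows).most_common())
print([row["url"] for row in html_rows])
```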

โ“ FAQ

How do I get all URLs of a website from the Wayback Machine?

Add the domain to Domains, leave matchType on subdomains, set maxResults to 0 for everything, and run it. The actor queries the Internet Archive CDX API and returns one row per unique archived URL.

Can I find old or deleted pages of a domain?

Yes, that's the core use case. The Wayback Machine keeps URLs even after they're removed from the live site, so deleted blog posts, retired product pages, and old endpoints all show up in the results with a snapshotUrl to view them.

How do I export archived URLs to CSV or JSON?

Run the actor, then download the dataset as CSV, JSON or Excel (or pull it via the REST API). Every archived URL is one row, so it drops straight into a spreadsheet or pipeline.
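
For the REST API route, dataset items are served by the Apify API's dataset items endpoint, which takes a format query parameter. The Python sketch below only builds the download URL; dataset_items_url is a hypothetical helper, the dataset ID is a placeholder, and the Apify API reference should be checked for the authoritative list of parameters.

```python
from typing import Optional
from urllib.parse import urlencode

# Formats assumed to be accepted by the dataset items endpoint.
FORMATS = {"json", "jsonl", "csv", "xlsx", "xml", "html", "rss"}


def dataset_items_url(dataset_id: str, fmt: str = "csv",
                      token: Optional[str] = None) -> str:
    """Build the download URL for a run's dataset in the requested format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if token:
        # Only needed when the dataset is not publicly readable.
        params["token"] = token
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


# "YOUR_DATASET_ID" is a placeholder for the ID shown on the run's Storage tab.
print(dataset_items_url("YOUR_DATASET_ID", "csv"))
```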

Is this free and without an API key?

The Internet Archive CDX API is public and requires no API key and no login. You only pay for the Apify platform usage of the run itself.

Can I filter by date or status code?

Yes. Set fromDate / toDate (YYYYMMDD) to restrict to a capture window, and filterStatus (e.g. 200) to keep only captures with a specific HTTP status.

How many URLs can it return?

Up to 5,000 unique URLs per domain by default. Raise maxResults to return tens of thousands, or set it to 0 for unlimited. Results stream to the dataset in pages via the CDX resumeKey, so even 100k+ URL domains run without memory issues.

Why are some statusCode values -?

The Wayback index sometimes records captures without a stored status code (e.g. revisit records). Those rows are still valid archived URLs.
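
Because statusCode is a string that may be "-", downstream code should not assume it always converts to a number. A minimal Python sketch, with a hypothetical helper name:

```python
from typing import Optional


def parse_status(value: str) -> Optional[int]:
    """Return the status as an int, or None when the archive stored no status."""
    return int(value) if value.isdigit() else None


for raw in ["200", "301", "-"]:
    print(raw, "->", parse_status(raw))
```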


Changelog

2026-06-15

  • Initial release: extract archived URLs from the Wayback Machine CDX API with date/status filters, CSV/JSON export, no API key.
