
URL: https://apify.com/v0iddo/sitemap-url-extractor

Sitemap URL Extractor: All URLs by Domain · Apify

Sitemap URL Extractor: robots.txt + sitemap.xml Crawl

Pricing

Pay per usage


Discover every URL a site exposes via its public sitemap chain. Reads robots.txt, follows Sitemap declarations, recursively descends sitemap-index files, extracts URLs with lastmod, changefreq, priority.


Rating: 0.0 (0)

Developer: vøiddo (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 10 days ago


Extract every URL a site exposes via its public sitemap chain. Reads robots.txt for Sitemap: declarations, falls back to /sitemap.xml, recursively descends sitemap-index files, and returns one row per discovered URL with lastmod, changefreq, and priority.

Example output row

{
  "domain": "vercel.com",
  "url": "https://vercel.com/blog/nextjs-14",
  "lastmod": "2024-03-15",
  "changefreq": "weekly",
  "priority": 0.8,
  "source": "https://vercel.com/sitemap-blog.xml"
}

How to use

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| domains | string[] | ["stripe.com","shopify.com","vercel.com"] | Domains to crawl (no scheme, no trailing slash) |
| maxUrlsPerDomain | integer | 2000 | Hard cap on URLs returned per domain |
| followSitemapIndex | boolean | true | Recursively follow <sitemapindex> child links (up to depth 5) |

Minimal run

{
  "domains": ["example.com"],
  "maxUrlsPerDomain": 500,
  "followSitemapIndex": true
}
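
Because the `domains` field expects bare hosts (no scheme, no trailing slash), it can help to clean user-supplied values before starting a run. The following is a minimal sketch, not part of the actor: `normalize_domain` and `build_input` are hypothetical helper names, and only the three field names come from the input table.

```python
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a user-supplied value to a bare host: no scheme, path, or trailing slash."""
    value = raw.strip()
    # urlsplit only fills in the host when "//" is present,
    # so prepend it for bare inputs such as "example.com/".
    parts = urlsplit(value if "://" in value else "//" + value)
    host = parts.hostname or ""  # hostname is lowercased and excludes any port
    if not host:
        raise ValueError(f"not a usable domain: {raw!r}")
    return host


def build_input(domains, max_urls_per_domain=2000, follow_sitemap_index=True):
    """Assemble an input object using the field names from the input table."""
    return {
        "domains": [normalize_domain(d) for d in domains],
        "maxUrlsPerDomain": max_urls_per_domain,
        "followSitemapIndex": follow_sitemap_index,
    }
```

For example, `build_input(["https://Example.com/"], 500)` produces the same object as the minimal run above.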

Output fields

| Field | Type | Notes |
| --- | --- | --- |
| domain | string | Input domain |
| url | string | Discovered URL from <loc> |
| lastmod | string | ISO date, null if absent |
| changefreq | string | e.g. weekly, null if absent |
| priority | float | 0.0–1.0, null if absent |
| source | string | Sitemap file the URL was found in |
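
To illustrate how these fields relate to a sitemap file, here is a rough sketch that turns a `<urlset>` document into rows of this shape using Python's standard library. It is an assumption about how such a mapping could look, not the actor's own code; `rows_from_urlset` is a hypothetical name, and a malformed `<priority>` value would raise `ValueError` in this simplified version.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def rows_from_urlset(xml_text, domain, source):
    """Yield one dict per <url> entry, using the output field names listed above."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default=None, namespaces=NS)
        if not loc:
            continue  # a <url> without <loc> has nothing to report
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        yield {
            "domain": domain,
            "url": loc.strip(),
            # Optional tags become None (null in JSON) when absent.
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": float(priority) if priority is not None else None,
            "source": source,
        }
```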

Pricing

| Event | Cost | When charged |
| --- | --- | --- |
| url_extracted | $0.0001 per URL | Once per run, total = URLs pushed |

A 2,000-URL run costs $0.20 (2,000 × $0.0001). Unused budget is not charged: if a domain has only 300 URLs, you pay for 300 ($0.03).
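
The arithmetic can be sketched as a small estimator. Only the $0.0001 price and the `maxUrlsPerDomain` cap come from this page; the function names are hypothetical, and `Decimal` is used to avoid floating-point rounding on currency values.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.0001")  # url_extracted price listed on this page


def estimate_cost(urls_pushed: int) -> Decimal:
    """Return the charge in USD for a run that pushed `urls_pushed` rows."""
    if urls_pushed < 0:
        raise ValueError("urls_pushed must be non-negative")
    return PRICE_PER_URL * urls_pushed


def worst_case_cost(domain_count: int, max_urls_per_domain: int = 2000) -> Decimal:
    """Upper bound: every domain reaches the maxUrlsPerDomain cap."""
    return estimate_cost(domain_count * max_urls_per_domain)
```

With the default three domains and the default cap of 2000, the upper bound is `worst_case_cost(3)`, which equals $0.60.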

Buyer

  • SEO teams auditing crawl coverage: verify every page is in the sitemap.
  • Content operations checking lastmod staleness across thousands of URLs.
  • Competitive intelligence: map a competitor's full URL structure.
  • QA pipelines validating sitemap health after deploys.
  • Link-building researchers finding indexable pages at scale.

Source

Crawl order per domain:

  1. GET https://{domain}/robots.txt and parse all Sitemap: lines.
  2. If none found, fall back to GET https://{domain}/sitemap.xml.
  3. For each sitemap URL: fetch + parse XML.
  4. If <sitemapindex>, enqueue each <sitemap><loc> (up to depth 5).
  5. If <urlset>, emit one row per <url> until maxUrlsPerDomain is reached.
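
The five steps above can be sketched as a breadth-first walk. This is an illustrative reconstruction, not the actor's source: `fetch` is a hypothetical callable that returns the response body as text (or `None` for 404 and empty responses), and treating the robots.txt entries as depth 0 is an assumption about how "up to depth 5" is counted.

```python
import xml.etree.ElementTree as ET
from collections import deque

MAX_DEPTH = 5  # depth limit stated in the input table


def local_name(tag):
    """Strip an XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def sitemaps_from_robots(robots_txt):
    """Step 1: collect the targets of all 'Sitemap:' lines (case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def crawl_domain(domain, fetch, max_urls=2000, follow_index=True):
    """Return output rows for one domain, following the documented crawl order."""
    robots = fetch(f"https://{domain}/robots.txt") or ""
    # Step 2: fall back to /sitemap.xml when robots.txt declares nothing.
    starts = sitemaps_from_robots(robots) or [f"https://{domain}/sitemap.xml"]
    queue = deque((url, 0) for url in starts)
    seen, rows = set(), []
    while queue and len(rows) < max_urls:
        sitemap_url, depth = queue.popleft()
        if sitemap_url in seen:
            continue  # avoid refetching or looping on self-referencing indexes
        seen.add(sitemap_url)
        body = fetch(sitemap_url)  # step 3: fetch + parse XML
        if not body:
            continue
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            continue
        kind = local_name(root.tag)
        for child in root:
            fields = {local_name(el.tag): (el.text or "").strip() for el in child}
            loc = fields.get("loc")
            if not loc:
                continue
            if kind == "sitemapindex" and follow_index and depth < MAX_DEPTH:
                queue.append((loc, depth + 1))  # step 4: enqueue child sitemap
            elif kind == "urlset":  # step 5: emit one row per <url>
                priority = fields.get("priority")
                rows.append({
                    "domain": domain,
                    "url": loc,
                    "lastmod": fields.get("lastmod") or None,
                    "changefreq": fields.get("changefreq") or None,
                    "priority": float(priority) if priority else None,
                    "source": sitemap_url,
                })
                if len(rows) >= max_urls:
                    break
    return rows
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so it can be exercised against an in-memory dictionary of pages before any real HTTP client is attached.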

All requests use a polite User-Agent and are paced at 250–600 ms between calls. 404 and empty responses are skipped gracefully.
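
One way to get pacing of this kind is to wrap whatever function performs the HTTP request. The sketch below is an assumption about the approach, not the actor's code: `paced` is a hypothetical helper, the wrapped function can be any callable that takes a URL, and this simplified version also pauses before the first request.

```python
import random
import time


def paced(fetch, min_delay=0.25, max_delay=0.6, sleep=time.sleep):
    """Wrap `fetch` so each call is preceded by a random pause of
    min_delay to max_delay seconds (250-600 ms by default)."""
    def wrapper(url):
        sleep(random.uniform(min_delay, max_delay))
        return fetch(url)
    return wrapper
```

The `sleep` parameter exists so the delay can be replaced with a stub in tests instead of actually waiting.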
