Sitemap URL Extractor: robots.txt + sitemap.xml Crawl
Pricing
Pay per usage
Discover every URL a site exposes via its public sitemap chain. Reads robots.txt, follows Sitemap declarations, recursively descends sitemap-index files, extracts URLs with lastmod, changefreq, priority.
Extract every URL a site exposes via its public sitemap chain. Reads robots.txt for Sitemap: declarations, falls back to /sitemap.xml, recursively descends sitemap-index files, and returns one row per discovered URL with lastmod, changefreq, and priority.
Example output row
{"domain":"vercel.com","url":"https://vercel.com/blog/nextjs-14","lastmod":"2024-03-15","changefreq":"weekly","priority":0.8,"source":"https://vercel.com/sitemap-blog.xml"}
How to use
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `domains` | string[] | `["stripe.com","shopify.com","vercel.com"]` | Domains to crawl (no scheme, no trailing slash) |
| `maxUrlsPerDomain` | integer | `2000` | Hard cap on URLs returned per domain |
| `followSitemapIndex` | boolean | `true` | Recursively follow `<sitemapindex>` child links (up to depth 5) |
Minimal run
{"domains":["example.com"],"maxUrlsPerDomain":500,"followSitemapIndex":true}
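Because `domains` expects bare hostnames, it can help to clean pasted URLs before building the input. The following is a minimal sketch, not part of the actor itself: `normalize_domain` and `build_input` are hypothetical helper names, and the defaults simply mirror the input table above.

```python
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a pasted URL or hostname to a bare, lowercase domain."""
    text = raw.strip()
    # urlsplit only fills in the host when a scheme or a '//' prefix is present.
    if "//" not in text:
        text = "//" + text
    return (urlsplit(text).hostname or "").lower()


def build_input(raw_domains, max_urls=2000, follow_index=True):
    """Build the actor input dict (hypothetical helper; keys match the input table)."""
    domains = []
    for raw in raw_domains:
        domain = normalize_domain(raw)
        # Skip empty results and duplicates while preserving order.
        if domain and domain not in domains:
            domains.append(domain)
    return {
        "domains": domains,
        "maxUrlsPerDomain": max_urls,
        "followSitemapIndex": follow_index,
    }


if __name__ == "__main__":
    print(build_input(["https://Example.com/", "example.com"], max_urls=500))
```

Deduplicating after normalization avoids crawling (and paying for) the same domain twice when it was entered in two different forms.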
Output fields
| Field | Type | Notes |
|---|---|---|
| `domain` | string | Input domain |
| `url` | string | Discovered URL from `<loc>` |
| `lastmod` | string | ISO date, null if absent |
| `changefreq` | string | e.g. weekly, null if absent |
| `priority` | float | 0.0-1.0, null if absent |
| `source` | string | Sitemap file the URL was found in |
Pricing
| Event | Cost | When charged |
|---|---|---|
| `url_extracted` | $0.0001 per URL | Once per run, total = URLs pushed |
A 2,000-URL run costs $0.20. Unused budget is not charged: if a domain has only 300 URLs, you pay for 300 ($0.03).
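The arithmetic above can be sketched as a small estimator. This is an illustration only: `estimate_cost` is a hypothetical helper, and it assumes that billing follows the number of URLs pushed, so each domain is capped at `maxUrlsPerDomain`.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.0001")  # url_extracted price from the pricing table


def estimate_cost(url_counts, max_urls_per_domain=2000):
    """Estimate run cost in USD from a {domain: url_count} mapping.

    Assumption: only pushed URLs are billed, so each domain's count
    is capped at max_urls_per_domain.
    """
    billable = sum(min(count, max_urls_per_domain) for count in url_counts.values())
    return billable * PRICE_PER_URL


if __name__ == "__main__":
    cost = estimate_cost({"big-site.example": 5000, "small-site.example": 300})
    print(f"${cost:.2f}")
```

`Decimal` is used instead of `float` so that very small per-URL prices add up without rounding drift.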
Buyer
- SEO teams auditing crawl coverage: verify every page is in the sitemap.
- Content operations checking `lastmod` staleness across thousands of URLs.
- Competitive intelligence: map a competitor's full URL structure.
- QA pipelines validating sitemap health after deploys.
- Link-building researchers finding indexable pages at scale.
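As an illustration of the `lastmod` staleness use case, the sketch below filters output rows older than a chosen age. `stale_rows` and the 365-day threshold are assumptions made for this example. Rows are assumed to follow the output fields table, with `lastmod` either null or an ISO date, optionally followed by a time.

```python
from datetime import date, timedelta


def stale_rows(rows, today, max_age_days=365):
    """Return rows whose lastmod is older than max_age_days (hypothetical helper)."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for row in rows:
        lastmod = row.get("lastmod")
        if not lastmod:
            continue  # null lastmod: nothing to compare against
        try:
            # Keep only the YYYY-MM-DD part so full timestamps also parse.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue  # partial dates such as "2024-03" are skipped
        if modified < cutoff:
            stale.append(row)
    return stale


if __name__ == "__main__":
    rows = [
        {"url": "https://example.com/old", "lastmod": "2024-03-15"},
        {"url": "https://example.com/new", "lastmod": "2025-05-01T10:00:00Z"},
        {"url": "https://example.com/unknown", "lastmod": None},
    ]
    for row in stale_rows(rows, today=date(2025, 6, 1)):
        print(row["url"])
```

Passing `today` explicitly keeps the check reproducible in tests and scheduled jobs.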
Source
Crawl order per domain:
- `GET https://{domain}/robots.txt` and parse all `Sitemap:` lines.
- If none found, fall back to `GET https://{domain}/sitemap.xml`.
- For each sitemap URL: fetch + parse XML.
- If `<sitemapindex>`, enqueue each `<sitemap><loc>` (up to depth 5).
- If `<urlset>`, emit one row per `<url>` until `maxUrlsPerDomain` is reached.
All requests use a polite User-Agent and are paced at 250-600 ms between calls. 404 and empty responses are skipped gracefully.
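The crawl order above can be sketched roughly as follows. This is not the actor's actual code: `crawl`, `sitemaps_from_robots`, and the caller-supplied `fetch(url)` function (returning the response body, or `None` for 404 or empty responses) are hypothetical names. The User-Agent and the pacing delay would live inside `fetch` and are omitted here.

```python
import xml.etree.ElementTree as ET
from collections import deque

MAX_DEPTH = 5  # depth limit stated in the crawl order


def local(tag):
    """Strip the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the stripped text of the first child called `name`, else None."""
    for child in elem:
        if local(child.tag) == name and child.text:
            return child.text.strip()
    return None


def to_float(text):
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def sitemaps_from_robots(robots_txt):
    """Collect the URLs declared on 'Sitemap:' lines (case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def crawl(domain, fetch, max_urls=2000, follow_index=True):
    """Breadth-first walk of the sitemap chain; fetch(url) -> str | None (assumed)."""
    robots = fetch(f"https://{domain}/robots.txt") or ""
    start = sitemaps_from_robots(robots) or [f"https://{domain}/sitemap.xml"]
    queue = deque((url, 0) for url in start)
    seen, rows = set(), []
    while queue and len(rows) < max_urls:
        sitemap_url, depth = queue.popleft()
        if sitemap_url in seen:
            continue  # guard against sitemap loops
        seen.add(sitemap_url)
        body = fetch(sitemap_url)
        if not body:
            continue  # 404 or empty response: skip
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            continue  # malformed XML: skip
        kind = local(root.tag)
        if kind == "sitemapindex" and follow_index and depth < MAX_DEPTH:
            for sitemap in root:
                loc = child_text(sitemap, "loc")
                if loc:
                    queue.append((loc, depth + 1))
        elif kind == "urlset":
            for entry in root:
                loc = child_text(entry, "loc") if local(entry.tag) == "url" else None
                if not loc:
                    continue
                rows.append({
                    "domain": domain,
                    "url": loc,
                    "lastmod": child_text(entry, "lastmod"),
                    "changefreq": child_text(entry, "changefreq"),
                    "priority": to_float(child_text(entry, "priority")),
                    "source": sitemap_url,
                })
                if len(rows) >= max_urls:
                    break
    return rows
```

Injecting `fetch` keeps the traversal logic testable without network access. For example, a dictionary's `.get` method can stand in for HTTP during tests.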
