Sitemap URL Finder

Pricing

from $0.05 / 1,000 results

Find and export URLs from any website’s robots.txt and sitemaps. Enter a domain or website URL, optionally filter matching URLs by text, and get clean dataset rows with the URL, domain, path, source sitemap, and match details.

Developer

Inus Grobler

Maintained by Community

Last modified: 12 days ago

Use Cases

Build a URL inventory for SEO audits, migrations, QA checks, or crawl planning.
Find product, category, blog, documentation, listing, or support URLs before scraping page details.
Export sitemap URLs with the source sitemap attached for downstream workflows.
Filter sitemap results to one section, such as /products/, /blog/, /docs/, or /store/.
Prepare URL lists for content crawlers, monitoring, enrichment, RAG ingestion, or lead workflows.

What Data You Get

Each dataset row contains:

url: URL found in a sitemap.
domain: hostname of the found URL.
path: path part of the URL.
sourceUrl: sitemap or robots-discovered file where the URL was found.
sourceDomain: hostname of the source sitemap.
filterType: all, contains, or regex.
filterValue: text or regular expression used for matching.
matchedRegex: regular expression used, when provided through the API.
lastmod: optional last modified value from the sitemap entry.
changefreq: optional change frequency value from the sitemap entry.
priority: optional priority value from the sitemap entry.
matchedAt: UTC timestamp when the URL was saved.

The run output also links to a summary record with processed sitemap counts, discovered URL counts, saved result count, filter details, and failed request count.
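
The domain and path fields correspond to the standard components of the url value. As a minimal sketch (not the Actor's own code), the hypothetical helper split_url below derives both with Python's standard library, which can be handy for re-checking rows or grouping them by site section.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Return the hostname and path of a URL (hypothetical helper, for illustration only)."""
    parts = urlsplit(url)
    return {"domain": parts.hostname, "path": parts.path}


row = split_url("https://docs.apify.com/platform/actors")
print(row["domain"], row["path"])
```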

Input

Use the websites field for normal runs. You can enter domains, homepages, or site sections; the Actor normalizes each value to the website origin and discovers common sitemap locations automatically.

{
  "websites": [
    {
      "url": "https://docs.apify.com"
    }
  ],
  "includeUrlText": "/platform/",
  "maxResults": 25
}

Main Settings

websites: Website homepages or domains to scan. The Actor checks robots.txt and /sitemap.xml for each website.
includeUrlText: Optional text that found URLs must contain. Leave it empty to save every sitemap URL.
maxResults: Maximum number of URL rows to save.

Optional API Settings

maxRequestsPerCrawl: Safety cap for robots.txt and sitemap files fetched in one run.
targetUrlRegex: API-only regular expression filter. It takes precedence over includeUrlText.
websiteUrls and startUrls: Legacy/API aliases for existing integrations.
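
The precedence rule between targetUrlRegex and includeUrlText can be pictured with a small local filter. This is only a sketch under assumptions: the function name url_matches is hypothetical, and the Actor's real matching semantics (for example, case sensitivity or whether the regex must match the whole URL) are not documented here.

```python
import re
from typing import Optional


def url_matches(url: str, include_text: str = "", regex: Optional[str] = None) -> bool:
    """Illustrative filter: a regex, when given, takes precedence over the text filter."""
    if regex:
        # Assumption: the pattern may match anywhere in the URL.
        return re.search(regex, url) is not None
    if include_text:
        return include_text in url
    return True  # no filter set: keep every URL


urls = [
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/api/v2",
]
# Both filters are set, so only the regex decides which URLs are kept.
kept = [u for u in urls if url_matches(u, include_text="/platform/", regex=r"/api/v\d+")]
print(kept)
```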

Example Output

{
  "url": "https://docs.apify.com/platform/actors",
  "domain": "docs.apify.com",
  "path": "/platform/actors",
  "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
  "sourceDomain": "docs.apify.com",
  "lastmod": "2026-06-10",
  "changefreq": "weekly",
  "priority": "0.8",
  "filterType": "contains",
  "filterValue": "/platform/",
  "matchedAt": "2026-06-11T13:55:51.865Z"
}

How To Run

Open the Actor in Apify Console.
Add one or more websites in the Input tab.
Optionally set URL contains if you only want one site section.
Set Max results to control dataset size and cost.
Start the run and open the Dataset tab when it finishes.

Results are pushed to the dataset while the Actor runs, so partial results can still be useful if a long run is stopped or times out.

Exporting Results

After a run, open the Dataset tab and export results as JSON, CSV, Excel, XML, RSS, or HTML. API users can read the default dataset from the run response.
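
API users who would rather build their own export can serialize the rows locally. The sketch below assumes items is a list of row dictionaries, such as the one read in the Python API Example; the helper name rows_to_csv and the chosen columns are illustrative.

```python
import csv
import io


def rows_to_csv(items: list, columns: list) -> str:
    """Serialize dataset rows (dicts) to CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # skip row fields that are not in columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(items)  # missing fields are written as empty cells
    return buffer.getvalue()


items = [{
    "url": "https://docs.apify.com/platform/actors",
    "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
    "priority": "0.8",
}]
print(rows_to_csv(items, ["url", "sourceUrl", "lastmod"]))
```

Optional fields such as lastmod are not present on every row, so the sketch leaves those cells empty rather than failing.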

Python API Example

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("thescrapelab/sitemap-target-url-extractor").call(run_input={
    "websites": [{"url": "https://docs.apify.com"}],
    "includeUrlText": "/platform/",
    "maxResults": 25,
})
if run is None:
    raise RuntimeError("Actor run failed")

# Read the saved rows from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
for item in items:
    print(item["url"], item["sourceUrl"])

Limits And Caveats

The Actor reads sitemap files; it does not crawl every HTML page to discover links.
Some websites do not publish complete or valid sitemaps.
Password-protected, blocked, or private sitemaps may return no results.
Very large sites can contain many sitemap indexes. Use maxResults and maxRequestsPerCrawl to keep runs predictable.
Sitemap metadata is included only when the website provides it. Image, video, and alternate-language sitemap extensions are not currently included.

Troubleshooting

No results were found. The website may not publish sitemap URLs, or your URL contains filter may be too narrow. Try leaving the filter empty.

The run finished quickly with failed request counts. Some sitemap URLs returned permanent errors such as 404. The Actor skips those instead of wasting retries.

The output has fewer rows than expected. Check maxResults, maxRequestsPerCrawl, and any filter value. Also confirm the website sitemap actually lists the URLs you expect.

The run is slow on a large website. Keep the default 256 MB memory for most runs, raise maxResults gradually, and use maxRequestsPerCrawl to keep very large sitemap indexes predictable.

Pricing

The recommended pricing model is pay per result, plus a very small charge for the Actor start event. This keeps small tests inexpensive and makes larger runs scale with the number of useful URLs returned. Platform usage is low because the Actor uses lightweight HTTP requests instead of a browser and defaults to the 256 MB memory tier.

FAQ

Can this extract all URLs from a sitemap?

Yes. Leave URL contains empty and set Max results high enough for the website.

Can it find sitemap URLs from robots.txt?

Yes. The Actor checks robots.txt, follows sitemap directives, and also tries /sitemap.xml.
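
For readers curious what following sitemap directives involves, here is a minimal sketch of pulling Sitemap lines out of a robots.txt body. It is not the Actor's implementation; the function name sitemap_urls_from_robots is hypothetical, and the sketch assumes sitemap URLs do not contain a # character, since everything after # is treated as a comment.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list:
    """Collect the values of Sitemap: lines (the field name is case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


robots = """User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news.xml # news section
"""
print(sitemap_urls_from_robots(robots))
```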

Can it parse sitemap indexes?

Yes. It follows nested sitemap indexes until the request or result limits are reached.
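
The difference between a sitemap index and a regular sitemap comes down to the root element: a sitemapindex lists further sitemap files, while a urlset lists page URLs. The sketch below illustrates that distinction with the standard library; parse_sitemap is a hypothetical name, and the Actor's own parser may behave differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple:
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    child = "sitemap" if kind == "index" else "url"
    locs = [(el.findtext(NS + "loc") or "").strip() for el in root.findall(NS + child)]
    return kind, [loc for loc in locs if loc]


index_xml = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
    "</sitemapindex>"
)
kind, locs = parse_sitemap(index_xml)
print(kind, locs)
```

A caller would fetch each loc of an index and parse it again, stopping once its request or result limit is reached.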

Does it support gzip sitemaps?

Yes. Gzip-compressed sitemap responses are decompressed before parsing.
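
As an illustration of the decompression step, a client can detect gzip by its two-byte magic header rather than trusting the .gz file extension. This is a sketch only; maybe_gunzip is a hypothetical helper and says nothing about how the Actor detects compression.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the payload if it looks like gzip, otherwise return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


raw = b"<urlset/>"
print(maybe_gunzip(gzip.compress(raw)) == raw)
```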

Can I filter only product or blog URLs?

Yes. Use URL contains with a path fragment such as /products/, /blog/, /category/, or /docs/.

Is this a full website crawler?

No. It extracts URLs listed in sitemaps. Use a full web crawler if you need to discover links from page HTML.

Website Sitemap Extractor

glassventures/website-sitemap-extractor

Extract all URLs from any website's sitemap. Auto-discovers sitemaps from robots.txt, supports sitemap index files and .gz compression. Filter by URL pattern, date range.

Glass Ventures

Sitemap Sniffer

maximedupre/sitemap-sniffer

Find sitemap files from website roots, domains, robots.txt, and direct sitemap URLs. Export sitemap metadata, URL counts, nested index depth, and optional URL inventory rows.

Maxime Dupré

Sitemap Extractor

automationagents/web-sitemap

Extract all URLs from a website's sitemap (XML, robots.txt, or crawl discovery).

Alex Jordan

Find Sitemap from url

eesti/find-sitemap-from-url

A powerful Apify Actor that finds sitemap URLs for any website. This Actor helps you discover XML sitemaps by checking common locations, robots.txt files, and analyzing HTML content for sitemap links.

ando

Sitemap & URL Extractor — Get Every URL of a Website

dataquarry/sitemap-url-extractor

Get every URL of a website: parses sitemap.xml and sitemap-indexes (discovered via robots.txt or the default location), with a same-site crawl fallback when there's no sitemap. Returns each URL + lastmod. No API key.

Daniel Brenner

Sitemap Finder & URL Extractor · Crawl Any XML Sitemap

corent1robert/sitemap-detector

Find and crawl XML sitemaps from any website. Follows sitemap indexes, handles gzip, and exports every page URL with source file and lastmod into a clean dataset. No config needed.

Corentin Robert

Sitemap Crawler - XML Sitemap URL Extractor

miccho27/sitemap-crawler

Extract all URLs from XML sitemaps (including sitemap index) and optionally audit each page

Tatsuya Mizuno

Sitemap URL Extractor

mikolabs/sitemap-url-extractor

Extract every URL and its metadata from any sitemap.xml in seconds. Paste one or more sitemap URLs, run the Actor, and get a clean, structured dataset with url, lastmod, changefreq, priority, and more — ready to export as CSV, JSON, or Excel.

mikolabs

Sitemap URL Extractor — robots.txt + sitemap.xml Crawl

v0iddo/sitemap-url-extractor

Discover every URL a site exposes via its public sitemap chain. Reads robots.txt, follows Sitemap declarations, recursively descends sitemap-index files, extracts URLs with lastmod, changefreq, priority.

vøiddo

Sitemap URL Extractor

crawlerbros/sitemap-url-extractor

Extract every URL from any site's sitemap.xml. Handles sitemap index files (nested sitemaps), gzipped sitemaps, and robots.txt discovery. Returns URL, lastmod, changefreq, priority, and optional image/video/alternate-language fields. No proxy, no cookies, no login.

Crawler Bros

URL: https://apify.com/thescrapelab/sitemap-target-url-extractor