URL: https://apify.com/thescrapelab/sitemap-target-url-extractor

Sitemap URL Finder | Extract URLs from Website Sitemaps · Apify


Pricing

from $0.05 / 1,000 results

Sitemap URL Finder

Find and export URLs from any website’s robots.txt and sitemaps. Enter a domain or website URL, optionally filter matching URLs by text, and get clean dataset rows with the URL, domain, path, source sitemap, and match details.

Rating

0.0 (0)

Developer

πŸ‘ Inus Grobler

Inus Grobler

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 3
  • Monthly active users: 2
  • Last modified: 12 days ago

Sitemap URL Finder extracts URLs from website sitemaps and robots.txt for SEO teams, data teams, QA teams, and crawler builders who need a clean URL inventory before a larger crawl.

At a glance: this page covers input and output examples, use cases, limitations, troubleshooting, and pricing guidance, for both small URL inventory checks and recurring sitemap monitoring.

Enter one or more domains or website URLs. The Actor checks robots.txt, discovers sitemap files, follows sitemap indexes, reads XML, plain-text, and gzip sitemaps, removes duplicate URLs, and saves ready-to-export rows in the dataset.
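The parsing and de-duplication step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Actor's actual code: the function name extract_urls is hypothetical, and it assumes the sitemap body has already been downloaded and decoded to text.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(body: str) -> list[str]:
    """Return de-duplicated URLs from an XML or plain-text sitemap body."""
    text = body.strip()
    if text.startswith("<"):
        root = ET.fromstring(text)
        # Both <urlset> and <sitemapindex> keep their targets in <loc>.
        found = [(loc.text or "").strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
    else:
        # Plain-text sitemaps list one URL per line.
        found = [line.strip() for line in text.splitlines()]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(u for u in found if u))


xml_body = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
  <url><loc>https://example.com/a</loc></url>
</urlset>
"""
print(extract_urls(xml_body))
# ['https://example.com/a', 'https://example.com/b']
```

Only <loc> elements in the core sitemap namespace are read, so entries from image or video extensions are ignored in this sketch.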

Use Cases

  • Build a URL inventory for SEO audits, migrations, QA checks, or crawl planning.
  • Find product, category, blog, documentation, listing, or support URLs before scraping page details.
  • Export sitemap URLs with the source sitemap attached for downstream workflows.
  • Filter sitemap results to one section, such as /products/, /blog/, /docs/, or /store/.
  • Prepare URL lists for content crawlers, monitoring, enrichment, RAG ingestion, or lead workflows.

What Data You Get

Each dataset row contains:

  • url: URL found in a sitemap.
  • domain: hostname of the found URL.
  • path: path part of the URL.
  • sourceUrl: sitemap or robots-discovered file where the URL was found.
  • sourceDomain: hostname of the source sitemap.
  • filterType: all, contains, or regex.
  • filterValue: text or regular expression used for matching.
  • matchedRegex: regular expression used, when provided through the API.
  • lastmod: optional last modified value from the sitemap entry.
  • changefreq: optional change frequency value from the sitemap entry.
  • priority: optional priority value from the sitemap entry.
  • matchedAt: UTC timestamp when the URL was saved.

The run output also links to a summary record with processed sitemap counts, discovered URL counts, saved result count, filter details, and failed request count.
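To make the row fields concrete, here is a rough sketch of how the derived fields (domain, path, sourceDomain) relate to url and sourceUrl. The helper build_row is hypothetical, and the way filterType and matchedAt are filled in is an assumption based on the field descriptions, not the Actor's confirmed behavior.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def build_row(url: str, source_url: str, filter_value: str = "") -> dict:
    """Assemble one dataset-style row (hypothetical helper, for illustration)."""
    parts = urlsplit(url)
    # Assumption: timestamps are UTC with millisecond precision and a Z suffix.
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "url": url,
        "domain": parts.hostname or "",
        "path": parts.path,
        "sourceUrl": source_url,
        "sourceDomain": urlsplit(source_url).hostname or "",
        # Assumption: an empty filter value means every URL is kept ("all").
        "filterType": "contains" if filter_value else "all",
        "filterValue": filter_value,
        "matchedAt": now.replace("+00:00", "Z"),
    }


row = build_row(
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/sitemap_base.xml",
    "/platform/",
)
print(row["domain"], row["path"], row["sourceDomain"])
# docs.apify.com /platform/actors docs.apify.com
```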

Input

Use websites for normal runs. You can enter domains, homepages, or site sections; the Actor normalizes each value to the website origin and discovers common sitemap locations automatically.

{
  "websites": [
    {
      "url": "https://docs.apify.com"
    }
  ],
  "includeUrlText": "/platform/",
  "maxResults": 25
}
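The normalization to a website origin mentioned above might look roughly like the following. Both function names are hypothetical, the HTTPS default for bare domains is an assumption, and the candidate list only includes the two locations this page names (robots.txt and /sitemap.xml).

```python
from urllib.parse import urlsplit


def to_origin(value: str) -> str:
    """Reduce a domain, homepage, or site-section URL to scheme://host."""
    value = value.strip()
    if "://" not in value:
        # Assumption: bare domains default to HTTPS.
        value = "https://" + value
    parts = urlsplit(value)
    return f"{parts.scheme}://{parts.netloc}"


def candidate_locations(origin: str) -> list[str]:
    """Locations to probe first for sitemap discovery."""
    return [f"{origin}/robots.txt", f"{origin}/sitemap.xml"]


print(to_origin("docs.apify.com/platform/"))
# https://docs.apify.com
print(candidate_locations(to_origin("https://docs.apify.com/platform/actors?x=1")))
# ['https://docs.apify.com/robots.txt', 'https://docs.apify.com/sitemap.xml']
```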

Main Settings

  • websites: Website homepages or domains to scan. The Actor checks robots.txt and /sitemap.xml for each website.
  • includeUrlText: Optional text that found URLs must contain. Leave it empty to save every sitemap URL.
  • maxResults: Maximum number of URL rows to save.

Optional API Settings

  • maxRequestsPerCrawl: Safety cap for robots.txt and sitemap files fetched in one run.
  • targetUrlRegex: API-only regular expression filter. It takes precedence over includeUrlText.
  • websiteUrls and startUrls: Legacy/API aliases for existing integrations.
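The precedence rule between targetUrlRegex and includeUrlText can be illustrated with a small sketch. The function make_filter is hypothetical, and two details are assumptions rather than documented behavior: the regex is applied with search semantics (a match anywhere in the URL), and the text filter is a case-sensitive substring check.

```python
import re
from typing import Callable


def make_filter(include_url_text: str = "", target_url_regex: str = "") -> tuple[str, Callable[[str], bool]]:
    """Return (filterType, predicate); the regex wins when both are set."""
    if target_url_regex:
        pattern = re.compile(target_url_regex)
        return "regex", lambda url: pattern.search(url) is not None
    if include_url_text:
        return "contains", lambda url: include_url_text in url
    return "all", lambda url: True


filter_type, keep = make_filter(include_url_text="/blog/", target_url_regex=r"/docs/\d+$")
print(filter_type)                             # regex
print(keep("https://example.com/docs/42"))     # True
print(keep("https://example.com/blog/post"))   # False
```

The returned filter type lines up with the filterType values listed for dataset rows: all, contains, or regex.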

Example Output

{
  "url": "https://docs.apify.com/platform/actors",
  "domain": "docs.apify.com",
  "path": "/platform/actors",
  "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
  "sourceDomain": "docs.apify.com",
  "lastmod": "2026-06-10",
  "changefreq": "weekly",
  "priority": "0.8",
  "filterType": "contains",
  "filterValue": "/platform/",
  "matchedAt": "2026-06-11T13:55:51.865Z"
}

How To Run

  1. Open the Actor in Apify Console.
  2. Add one or more websites in the Input tab.
  3. Optionally set the URL contains field (includeUrlText) if you only want one site section.
  4. Set the Max results field (maxResults) to control dataset size and cost.
  5. Start the run and open the Dataset tab when it finishes.

Results are pushed to the dataset while the Actor runs, so partial results can still be useful if a long run is stopped or times out.

Exporting Results

After a run, open the Dataset tab and export results as JSON, CSV, Excel, XML, RSS, or HTML. API users can read the run's default dataset using the defaultDatasetId returned in the run response.

Python API Example

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("thescrapelab/sitemap-target-url-extractor").call(run_input={
    "websites": [{"url": "https://docs.apify.com"}],
    "includeUrlText": "/platform/",
    "maxResults": 25,
})
if run is None:
    raise RuntimeError("Actor run failed")

# Read the saved rows from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
for item in items:
    print(item["url"], item["sourceUrl"])

Limits And Caveats

  • The Actor reads sitemap files; it does not crawl every HTML page to discover links.
  • Some websites do not publish complete or valid sitemaps.
  • Password-protected, blocked, or private sitemaps may return no results.
  • Very large sites can contain many sitemap indexes. Use maxResults and maxRequestsPerCrawl to keep runs predictable.
  • Sitemap metadata is included only when the website provides it. Image, video, and alternate-language sitemap extensions are not currently included.

Troubleshooting

No results were found. The website may not publish sitemap URLs, or your URL contains filter may be too narrow. Try leaving the filter empty.

The run finished quickly with failed request counts. Some sitemap URLs returned permanent errors such as 404. The Actor skips those instead of wasting retries.

The output has fewer rows than expected. Check maxResults, maxRequestsPerCrawl, and any filter value. Also confirm the website sitemap actually lists the URLs you expect.

The run is slow on a large website. Keep the default 256 MB memory for most runs, raise maxResults gradually, and use maxRequestsPerCrawl to keep very large sitemap indexes predictable.

Pricing

The recommended pricing model is pay per result with a very small Actor start event. This keeps small tests inexpensive and makes larger runs scale with the number of useful URLs returned. Platform usage is low because the Actor uses lightweight HTTP requests instead of a browser and defaults to the 256 MB memory tier.

FAQ

Can this extract all URLs from a sitemap?

Yes. Leave URL contains empty and set Max results high enough for the website.

Can it find sitemap URLs from robots.txt?

Yes. The Actor checks robots.txt, follows sitemap directives, and also tries /sitemap.xml.
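As a rough illustration of what following sitemap directives involves, the sketch below pulls Sitemap: lines out of a robots.txt body. The function name sitemap_directives is hypothetical and this is not the Actor's own parser; it assumes the file has already been fetched as text.

```python
def sitemap_directives(robots_txt: str) -> list[str]:
    """Collect the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the "field: value" pair.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The field name is case-insensitive.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news.xml  # news section
"""
print(sitemap_directives(robots))
# ['https://example.com/sitemap.xml', 'https://example.com/news.xml']
```

Python's standard library offers similar functionality through urllib.robotparser.RobotFileParser.site_maps() (Python 3.8 and later).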

Can it parse sitemap indexes?

Yes. It follows nested sitemap indexes until the request or result limits are reached.
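One way to picture how nested indexes interact with the request and result limits is a breadth-first walk with two caps. This is a sketch under stated assumptions, not the Actor's implementation: walk_sitemaps and the fetch callback are hypothetical, and the example replaces network access with an in-memory dictionary.

```python
from collections import deque
from typing import Callable

# A parsed sitemap file: ("index", child sitemap URLs) or ("urlset", page URLs).
Parsed = tuple[str, list[str]]


def walk_sitemaps(start: list[str], fetch: Callable[[str], Parsed],
                  max_requests: int, max_results: int) -> list[str]:
    """Follow sitemap indexes breadth-first until either cap is reached."""
    queue = deque(start)
    seen_files: set[str] = set()
    results: dict[str, None] = {}  # dict keeps insertion order and dedupes
    requests = 0
    while queue and requests < max_requests and len(results) < max_results:
        sitemap_url = queue.popleft()
        if sitemap_url in seen_files:
            continue  # never fetch the same sitemap file twice
        seen_files.add(sitemap_url)
        requests += 1
        kind, locs = fetch(sitemap_url)
        if kind == "index":
            queue.extend(locs)
        else:
            for loc in locs:
                if len(results) >= max_results:
                    break
                results.setdefault(loc, None)
    return list(results)


FAKE_SITE = {
    "https://example.com/sitemap.xml": ("index", ["https://example.com/a.xml", "https://example.com/b.xml"]),
    "https://example.com/a.xml": ("urlset", ["https://example.com/1", "https://example.com/2"]),
    "https://example.com/b.xml": ("urlset", ["https://example.com/2", "https://example.com/3"]),
}
root = ["https://example.com/sitemap.xml"]
print(walk_sitemaps(root, FAKE_SITE.__getitem__, max_requests=10, max_results=10))
# ['https://example.com/1', 'https://example.com/2', 'https://example.com/3']
print(walk_sitemaps(root, FAKE_SITE.__getitem__, max_requests=2, max_results=10))
# ['https://example.com/1', 'https://example.com/2']
```

With max_requests=2 the walk stops after the index and the first child file, which is how a low request cap can produce fewer rows than expected.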

Does it support gzip sitemaps?

Yes. Gzip-compressed sitemap responses are decompressed before parsing.
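A minimal sketch of the decompress-before-parse step, using Python's standard gzip module. The function name decode_sitemap_body is hypothetical; detecting gzip by its two magic bytes and decoding as UTF-8 are assumptions about one reasonable approach, not a description of the Actor's internals.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_sitemap_body(raw: bytes) -> str:
    """Decompress gzip payloads when detected, then decode to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption: sitemaps are UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


compressed = gzip.compress(b"https://example.com/a\nhttps://example.com/b\n")
print(decode_sitemap_body(compressed).splitlines())
# ['https://example.com/a', 'https://example.com/b']
```

Checking the magic bytes rather than the .gz file extension also covers servers that return compressed content from a plain .xml URL.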

Can I filter only product or blog URLs?

Yes. Use URL contains with a path fragment such as /products/, /blog/, /category/, or /docs/.

Is this a full website crawler?

No. It extracts URLs listed in sitemaps. Use a full web crawler if you need to discover links from page HTML.
