URL: https://apify.com/crawlerbros/sitemap-sniffer

Sitemap Sniffer · Apify


Pricing

from $1.00 / 1,000 results

Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and recursively unpacks sitemap-index files. HTTP-only, no proxy or cookies needed.

Developer

Crawler Bros

Maintained by Community

Discover every sitemap file for a website, automatically. Reads robots.txt for Sitemap: directives, probes 16 common sitemap paths (Yoast, WordPress, sitemap-index, gzipped variants), and expands sitemap-index files into their child sitemaps. HTTP-only, no proxy, no cookies, no API key.

What it does

You point this actor at any website and get back a structured list of every sitemap file it could find:

  • /robots.txt directives: the canonical place sites declare their sitemaps.
  • Common sitemap paths: sitemap.xml, sitemap_index.xml, wp-sitemap.xml, post-sitemap.xml, sitemap.xml.gz, and 11 more.
  • Sitemap-index expansion: when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.

For each discovered sitemap, the actor reports the URL, type (sitemap / sitemap_index / txt), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.
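The robots.txt step can be illustrated with a minimal sketch. This is not the actor's own code; the function name `sitemap_directives` and the regular expression are assumptions made for illustration. It collects every `Sitemap:` line from a robots.txt body, matching the field name case-insensitively and dropping duplicates.

```python
import re

# Matches lines such as "Sitemap: https://example.com/sitemap.xml".
# The field name is matched case-insensitively; commented-out lines
# (starting with "#") do not match because only spaces/tabs may precede it.
_SITEMAP_LINE = re.compile(
    r"^[ \t]*sitemap[ \t]*:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE
)


def sitemap_directives(robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in a robots.txt body, in order, deduplicated."""
    seen: set[str] = set()
    urls: list[str] = []
    for match in _SITEMAP_LINE.finditer(robots_txt):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Each URL returned this way would correspond to a record with discoveredVia set to "robots.txt".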

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (required) | https://apify.com | Root URL or bare host (e.g. example.com). The actor extracts the origin and probes that. |
| followIndexes | boolean | true | When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to. |
| maxSitemaps | integer | 50 (1–1000) | Hard cap on the number of records emitted. Probing stops once this many are discovered. |
| fetchUrlCounts | boolean | true | Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download. |
| emitUrls | boolean | false | When true, the actor also emits one record per URL found inside each discovered sitemap (with lastmod, changefreq, priority, hreflang when present). |
| maxUrls | integer | 10000 (1–100000) | Hard cap on per-URL records when emitUrls: true. Has no effect when emitUrls: false. |
| userAgent | string (optional) | Chrome 131 | Override only if a target server filters by UA. |
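The url field accepts either a full URL or a bare host and is reduced to an origin before probing. A rough sketch of that normalization is shown here; the helper name `origin_of` and the choice to default bare hosts to https are assumptions of this example, not documented behavior of the actor.

```python
from urllib.parse import urlsplit


def origin_of(url_or_host: str) -> str:
    """Reduce a full URL or a bare host to scheme://host[:port]."""
    value = url_or_host.strip()
    # A bare host has no scheme; this sketch assumes https in that case.
    if "://" not in value:
        value = "https://" + value
    parts = urlsplit(value)
    if not parts.netloc:
        raise ValueError(f"cannot extract an origin from {url_or_host!r}")
    # Path, query string, and fragment are discarded.
    return f"{parts.scheme}://{parts.netloc}"
```

For example, `origin_of("https://www.bbc.com/news?x=1")` and `origin_of("www.bbc.com")` both reduce to the same origin, so either form of input leads to the same set of probed paths.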

Example input

{
  "url": "https://www.bbc.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true
}

Output

By default, one record per discovered sitemap. When emitUrls: true, the dataset also contains one record per URL found inside each sitemap. The two shapes can be disambiguated by recordType. Empty fields are omitted (no nulls).

Sitemap record (recordType: "sitemap")

{
  "recordType": "sitemap",
  "url": "https://www.bbc.com/sitemap.xml",
  "domainHost": "www.bbc.com",
  "type": "sitemap_index",
  "httpStatus": 200,
  "contentType": "application/xml",
  "byteCount": 13450,
  "urlCount": 78,
  "isCompressed": false,
  "lastmod": "2024-12-15",
  "discoveredVia": "robots.txt",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

URL record (recordType: "url", only when emitUrls: true)

{
  "recordType": "url",
  "url": "https://www.bbc.com/news/articles/c-12345",
  "domainHost": "www.bbc.com",
  "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
  "lastmod": "2024-12-15",
  "changefreq": "hourly",
  "priority": 0.8,
  "hreflang": [{"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}],
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
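Because both record shapes land in the same dataset, a consumer will usually split them first. The sketch below shows one way to do that on items already loaded into memory; the function names `split_records` and `urls_per_sitemap` are hypothetical, and only the field names (recordType, sitemapUrl) come from the examples above.

```python
from collections import defaultdict
from typing import Any, Iterable


def split_records(items: Iterable[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group dataset items by their recordType value."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        # Items without a recordType are kept under "unknown" rather than dropped.
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


def urls_per_sitemap(items: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Count URL records per parent sitemap, keyed by sitemapUrl."""
    counts: dict[str, int] = defaultdict(int)
    for item in items:
        if item.get("recordType") == "url" and "sitemapUrl" in item:
            counts[item["sitemapUrl"]] += 1
    return dict(counts)
```

Since empty fields are omitted rather than set to null, the sketch reads optional keys with `.get()` or a membership test instead of indexing directly.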

Output fields

  • recordType - "sitemap" for sitemap-file records (always emitted), or "url" for per-URL records (only when emitUrls: true).
  • url - absolute URL of the sitemap (or the URL referenced inside a sitemap, when recordType: "url").
  • domainHost - parsed hostname of url (handy for grouping records by site when ingesting from multiple runs).
  • type - "sitemap" (a <urlset>), "sitemap_index" (a <sitemapindex> of child sitemaps), or "txt" (plain-text sitemap with one URL per line). Sitemap records only.
  • httpStatus - HTTP status code returned (200 = success). Sitemap records only.
  • contentType - Content-Type header value (without charset). Sitemap records only.
  • byteCount - response body size in bytes. Sitemap records only.
  • urlCount - number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
  • isCompressed - true when the body is gzipped (e.g. .xml.gz paths). Sitemap records only.
  • lastmod - first <lastmod> value found in the sitemap, or the per-URL <lastmod> when recordType: "url".
  • discoveredVia - "robots.txt", "common-path", or "sitemap-index" (parent index pointed here). Sitemap records only.
  • sitemapUrl - (URL records only) URL of the sitemap that contained this URL.
  • changefreq / priority / hreflang - (URL records only) standard sitemap fields when present.
  • scrapedAt - ISO-8601 timestamp of the discovery.
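The type and urlCount fields can be derived from the document's root element. The following is a minimal sketch under stated assumptions: the name `classify_sitemap` is hypothetical, and falling back to "txt" whenever the body is not well-formed XML is a simplification that may differ from how the actor decides.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}urlset' down to 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def classify_sitemap(body: bytes) -> tuple[str, int]:
    """Return (type, urlCount) for an already-decompressed sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Assumption: anything that is not XML is a plain-text sitemap
        # with one URL per non-empty line.
        text = body.decode("utf-8", errors="replace")
        return "txt", sum(1 for line in text.splitlines() if line.strip())

    kind = _local_name(root.tag)
    if kind == "sitemapindex":
        # For an index, count the child <sitemap> links.
        return "sitemap_index", sum(1 for c in root if _local_name(c.tag) == "sitemap")
    if kind == "urlset":
        return "sitemap", sum(1 for c in root if _local_name(c.tag) == "url")
    raise ValueError(f"unexpected root element: {kind}")
```

Counting only direct children of the root keeps nested extension elements (image, video, or hreflang entries) from inflating the total.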

When to use this

  • Before crawling: feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
  • SEO audits: confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
  • Competitive research: measure how many URLs a site exposes, broken down by sitemap type (news / video / image / page).
  • Content migration: get a complete inventory of URLs declared by the source site.

FAQ

Does it need cookies, login, or a proxy? No. Sitemaps are public assets, designed to be machine-readable. The actor uses curl_cffi with a Chrome User-Agent and connects directly.

What if the site has no sitemap at all? The actor emits a single record {"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"} with a hint to check robots.txt manually. The run still completes successfully; empty datasets are not treated as failures.
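Since the run succeeds either way, a consumer has to look at the records themselves to tell "no sitemaps" apart from a normal result. A small sketch of that check, using only the two keys quoted above (the function name `no_sitemaps_found` is made up for this example):

```python
from typing import Any, Iterable


def no_sitemaps_found(items: Iterable[dict[str, Any]]) -> bool:
    """True when the dataset contains the 'no sitemaps' marker record."""
    return any(
        item.get("type") == "sitemap_sniffer_error"
        and item.get("reason") == "no_sitemaps_found"
        for item in items
    )
```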

Does it handle gzipped sitemaps? Yes. .xml.gz files are transparently decompressed in-memory before parsing.
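In-memory decompression of this kind can be sketched with the standard library. This is an illustration rather than the actor's code: the helper name `maybe_gunzip` is hypothetical, and detecting gzip by its two magic bytes (instead of by file extension or headers) is an assumption of the example.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> tuple[bytes, bool]:
    """Return (decoded_body, is_compressed) without touching the disk."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False
```

The boolean in the result maps naturally onto the isCompressed output field.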

How does it handle giant sites with thousands of sitemap files? maxSitemaps (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
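The priority order and the cap can be pictured as follows. This is a simplified sketch: the function name `prioritized` and its parameters are hypothetical, and it takes all three sources as ready-made lists, whereas in a real run the index children only become known after their parent index has been fetched.

```python
from itertools import islice
from typing import Iterable, Iterator


def prioritized(
    robots_urls: Iterable[str],
    common_path_urls: Iterable[str],
    index_children: Iterable[str],
    max_sitemaps: int = 50,
) -> Iterator[tuple[str, str]]:
    """Yield (url, discoveredVia) pairs in priority order, deduplicated and capped."""

    def ordered() -> Iterator[tuple[str, str]]:
        seen: set[str] = set()
        sources = (
            ("robots.txt", robots_urls),
            ("common-path", common_path_urls),
            ("sitemap-index", index_children),
        )
        for via, urls in sources:
            for url in urls:
                # A URL is attributed to the first source that mentions it.
                if url not in seen:
                    seen.add(url)
                    yield url, via

    # islice stops consuming as soon as the cap is reached.
    return islice(ordered(), max_sitemaps)
```

With a cap of 3, a sitemap declared in robots.txt is kept ahead of any guessed path, and index children are the first to be cut.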

Can I get the URLs inside each sitemap? Yes. Set emitUrls: true and the actor will also push one record per URL inside each discovered sitemap, with lastmod, changefreq, priority, and hreflang when present. maxUrls caps the total (default 10,000). Use recordType: "sitemap" vs recordType: "url" to disambiguate the two record shapes.

Is it safe to run on any website? Yes. The actor only fetches robots.txt and 16 well-known public paths, so the initial probe makes roughly 17 requests, plus one per sitemap-index child if followIndexes is enabled. It does not request login pages, admin paths, or API endpoints.
