Sitemap Sniffer

Pricing

from $1.00 / 1,000 results

Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and expands sitemap-index files one level deep. HTTP-only, no proxy or cookies needed.

Rating

0.0

(0)

Developer

Crawler Bros

Maintained by Community

Last modified: a month ago

What it does

You point this actor at any website and get back a structured list of every sitemap file it could find:

/robots.txt directives — the canonical place sites declare their sitemaps.
Common sitemap paths — sitemap.xml, sitemap_index.xml, wp-sitemap.xml, post-sitemap.xml, sitemap.xml.gz, and 11 more.
Sitemap-index expansion — when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.

For each discovered sitemap, the actor reports the URL, type (sitemap / sitemap_index / txt), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.
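The first two discovery sources can be sketched as follows. This is a minimal illustration, not the actor's own code: the function names `sitemap_directives` and `candidate_urls` are hypothetical, and `COMMON_PATHS` holds only the five paths named above, since the remaining 11 are not listed here.

```python
# Illustrative subset: the source names these five paths and "11 more" it does not list.
COMMON_PATHS = [
    "sitemap.xml",
    "sitemap_index.xml",
    "wp-sitemap.xml",
    "post-sitemap.xml",
    "sitemap.xml.gz",
]


def sitemap_directives(robots_txt: str) -> list[str]:
    """Return the URLs from `Sitemap:` lines, in file order, without duplicates."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop robots.txt comments
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found


def candidate_urls(origin: str, robots_txt: str) -> list[tuple[str, str]]:
    """Pair each candidate sitemap URL with how it was discovered."""
    candidates = [(url, "robots.txt") for url in sitemap_directives(robots_txt)]
    declared = {url for url, _ in candidates}
    for path in COMMON_PATHS:
        url = origin.rstrip("/") + "/" + path
        if url not in declared:  # avoid probing a declared sitemap twice
            candidates.append((url, "common-path"))
    return candidates
```

Each candidate would still need to be fetched to confirm it exists; the sketch only builds the list of URLs worth probing.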

Input

Field	Type	Default	Description
`url`	string (required)	`https://apify.com`	Root URL or bare host (e.g. `example.com`). The actor extracts the origin and probes that.
`followIndexes`	boolean	`true`	When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to.
`maxSitemaps`	integer	`50` (1–1000)	Hard cap on the number of records emitted. Probing stops once this many are discovered.
`fetchUrlCounts`	boolean	`true`	Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download.
`emitUrls`	boolean	`false`	When `true`, the actor also emits one record per URL found inside each discovered sitemap (with `lastmod`, `changefreq`, `priority`, `hreflang` when present).
`maxUrls`	integer	`10000` (1–100000)	Hard cap on per-URL records when `emitUrls: true`. Has no effect when `emitUrls: false`.
`userAgent`	string (optional)	Chrome 131	Override only if a target server filters by UA.
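As a rough illustration of how a root URL or bare host might be reduced to an origin, here is a small sketch using the standard library. The helper name `to_origin` is hypothetical, and defaulting bare hosts to HTTPS is an assumption, not documented behavior of the actor.

```python
from urllib.parse import urlsplit


def to_origin(url: str) -> str:
    """Reduce a URL or bare host to scheme://host[:port]."""
    raw = url.strip()
    if "://" not in raw:
        raw = "https://" + raw  # assumption: bare hosts default to HTTPS
    parts = urlsplit(raw)
    if not parts.netloc:
        raise ValueError(f"cannot extract a host from {url!r}")
    # Path, query string, and fragment are discarded.
    return f"{parts.scheme}://{parts.netloc}"
```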

Example input

{
  "url": "https://www.bbc.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true
}
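If you assemble the input programmatically, it can help to check the documented ranges before starting a run. The sketch below is one way to do that; `build_run_input` is a hypothetical helper, and only the field names, defaults, and ranges come from the input table above.

```python
from typing import Optional


def build_run_input(
    url: str,
    *,
    follow_indexes: bool = True,
    max_sitemaps: int = 50,
    fetch_url_counts: bool = True,
    emit_urls: bool = False,
    max_urls: int = 10_000,
    user_agent: Optional[str] = None,
) -> dict:
    """Build an input object, rejecting values outside the documented ranges."""
    if not url:
        raise ValueError("url is required")
    if not 1 <= max_sitemaps <= 1000:
        raise ValueError("maxSitemaps must be between 1 and 1000")
    if not 1 <= max_urls <= 100_000:
        raise ValueError("maxUrls must be between 1 and 100000")
    run_input = {
        "url": url,
        "followIndexes": follow_indexes,
        "maxSitemaps": max_sitemaps,
        "fetchUrlCounts": fetch_url_counts,
        "emitUrls": emit_urls,
        "maxUrls": max_urls,
    }
    if user_agent:  # optional: omitted unless overridden
        run_input["userAgent"] = user_agent
    return run_input
```

The resulting dictionary can be serialized to JSON and supplied as the actor's input.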

Output

By default, one record per discovered sitemap. When emitUrls: true, the dataset also contains one record per URL found inside each sitemap. The two shapes can be disambiguated by recordType. Empty fields are omitted (no nulls).

Sitemap record (`recordType: "sitemap"`)

{
  "recordType": "sitemap",
  "url": "https://www.bbc.com/sitemap.xml",
  "domainHost": "www.bbc.com",
  "type": "sitemap_index",
  "httpStatus": 200,
  "contentType": "application/xml",
  "byteCount": 13450,
  "urlCount": 78,
  "isCompressed": false,
  "lastmod": "2024-12-15",
  "discoveredVia": "robots.txt",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

URL record (`recordType: "url"`, only when `emitUrls: true`)

{
  "recordType": "url",
  "url": "https://www.bbc.com/news/articles/c-12345",
  "domainHost": "www.bbc.com",
  "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
  "lastmod": "2024-12-15",
  "changefreq": "hourly",
  "priority": 0.8,
  "hreflang": [{"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}],
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
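Because both record shapes land in the same dataset, a consumer typically separates them by `recordType` first. The sketch below assumes the dataset items have already been loaded as a list of dictionaries; the helper names `split_records` and `urls_by_sitemap` are hypothetical.

```python
from collections import defaultdict


def split_records(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate sitemap records from per-URL records by recordType."""
    sitemaps, urls = [], []
    for item in items:
        kind = item.get("recordType")
        if kind == "sitemap":
            sitemaps.append(item)
        elif kind == "url":
            urls.append(item)
        # Anything else (for example an error record) is ignored here.
    return sitemaps, urls


def urls_by_sitemap(url_records: list[dict]) -> dict[str, list[str]]:
    """Group page URLs under the sitemap that listed them."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for record in url_records:
        grouped[record["sitemapUrl"]].append(record["url"])
    return dict(grouped)
```

Since empty fields are omitted rather than set to null, use `record.get("lastmod")` instead of `record["lastmod"]` when reading optional fields.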

Output fields

recordType — "sitemap" for sitemap-file records (always emitted), or "url" for per-URL records (only when emitUrls: true).
url — absolute URL of the sitemap (or the URL referenced inside a sitemap, when recordType: "url").
domainHost — parsed hostname of url (handy for grouping records by site when ingesting from multiple runs).
type — "sitemap" (a <urlset>), "sitemap_index" (a <sitemapindex> of child sitemaps), or "txt" (plain-text sitemap with one URL per line). Sitemap records only.
httpStatus — HTTP status code returned (200 = success). Sitemap records only.
contentType — Content-Type header value (without charset). Sitemap records only.
byteCount — response body size in bytes. Sitemap records only.
urlCount — number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
isCompressed — true when the body is gzipped (e.g. .xml.gz paths). Sitemap records only.
lastmod — first <lastmod> value found in the sitemap, or the per-URL <lastmod> when recordType: "url".
discoveredVia — "robots.txt", "common-path", or "sitemap-index" (parent index pointed here). Sitemap records only.
sitemapUrl — (URL records only) URL of the sitemap that contained this URL.
changefreq / priority / hreflang — (URL records only) standard sitemap fields when present.
scrapedAt — ISO-8601 timestamp of the discovery.
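To make the `type` and `urlCount` fields concrete, here is a minimal sketch of how a sitemap body could be classified by its root element and its entries counted. It is an illustration under assumptions, not the actor's parser: `classify_sitemap` is a hypothetical name, and treating any non-XML body as a plain-text sitemap is a simplification (an HTML error page would come back as `"txt"` with a count of 0).

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_sitemap(body: bytes) -> tuple[str, int]:
    """Return (type, urlCount) for an already-decompressed sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML: count lines that look like absolute URLs.
        lines = body.decode("utf-8", "replace").splitlines()
        urls = [ln for ln in lines if ln.strip().startswith(("http://", "https://"))]
        return "txt", len(urls)

    root_name = local_name(root.tag)
    if root_name == "sitemapindex":
        kind, child = "sitemap_index", "sitemap"
    elif root_name == "urlset":
        kind, child = "sitemap", "url"
    else:
        raise ValueError(f"unexpected root element: {root_name}")
    # Count direct children only: <url> in a urlset, <sitemap> in an index.
    count = sum(1 for el in root if local_name(el.tag) == child)
    return kind, count
```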

When to use this

Before crawling — feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
SEO audits — confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
Competitive research — measure how many URLs a site exposes, broken down by sitemap type (news / video / image / page).
Content migration — get a complete inventory of URLs declared by the source site.

FAQ

Does it need cookies, login, or a proxy? No. Sitemaps are public assets, designed to be machine-readable. The actor uses curl_cffi with a Chrome User-Agent and connects directly.

What if the site has no sitemap at all? The actor emits a single record {"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"} with a hint to check robots.txt manually. The run still completes successfully: finding no sitemaps is not treated as a failure.
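Because the run succeeds either way, downstream code has to look for that record itself. A minimal check might look like the following; `find_error_reason` is a hypothetical helper, and it assumes the error record carries the `type` and `reason` keys shown above.

```python
from typing import Optional


def find_error_reason(items: list[dict]) -> Optional[str]:
    """Return the reason from an error record, or None if the run found sitemaps."""
    for item in items:
        if item.get("type") == "sitemap_sniffer_error":
            return item.get("reason", "unknown")
    return None
```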

Does it handle gzipped sitemaps? Yes. .xml.gz files are transparently decompressed in-memory before parsing.
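In-memory decompression of this kind can be sketched with the standard library. This is an illustration rather than the actor's code: `maybe_gunzip` is a hypothetical name, and detecting gzip by its two-byte magic number (instead of trusting the file extension) is a design choice made for the example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> tuple[bytes, bool]:
    """Return (decompressed body, isCompressed) without touching disk."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False
```

Checking the magic number also covers servers that deliver a gzipped file from a path that does not end in `.gz`.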

How does it handle giant sites with thousands of sitemap files? maxSitemaps (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
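The priority ordering and the cap can be pictured as a lazy chain of sources that is cut off after `maxSitemaps` unique entries. The sketch below is a simplification under assumptions: the function names are hypothetical, and it caps candidate URLs, whereas the actor caps emitted records.

```python
from itertools import chain, islice
from typing import Iterable, Iterator


def unique(urls: Iterable[str]) -> Iterator[str]:
    """Yield each URL once, keeping first-seen order."""
    seen: set[str] = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


def cap_in_priority_order(
    robots_urls: Iterable[str],
    common_paths: Iterable[str],
    index_children: Iterable[str],
    max_sitemaps: int = 50,
) -> list[str]:
    """Take robots.txt URLs first, then common paths, then index children."""
    ordered = chain(robots_urls, common_paths, index_children)
    # islice stops pulling from later sources once the cap is reached.
    return list(islice(unique(ordered), max_sitemaps))
```

With a cap of 3, two robots.txt URLs and three common paths (one of them already declared in robots.txt), the result contains both robots.txt URLs and one common path, and the index children are never consulted.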

Can I get the URLs inside each sitemap? Yes — set emitUrls: true and the actor will also push one record per URL inside each discovered sitemap, with lastmod, changefreq, priority, and hreflang when present. maxUrls caps the total (default 10,000). Use recordType: "sitemap" vs recordType: "url" to disambiguate the two record shapes.

Is it safe to run on any website? Yes. The actor only fetches robots.txt and 16 well-known public paths, so the initial probe makes roughly 17 requests, plus one per sitemap-index child if followIndexes is enabled. No login pages, no admin paths, no API endpoints.


URL: https://apify.com/crawlerbros/sitemap-sniffer
