Sitemap Generator Actor
Python Apify Actor that crawls a single hostname and generates sitemap files:
- `sitemap.xml` (always an XML sitemap index)
- `sitemap-00001.xml`, `sitemap-00002.xml`, ... (50,000 URLs max per chunk)
- `sitemap.html` (optional)
- `sitemap.txt` (optional)
- `sitemap-summary.json` (run summary and output key references)
The Actor respects robots.txt, includes only canonical URLs, deduplicates by normalized URL, and supports regex include/exclude filters.
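The deduplication and filtering described above can be sketched roughly as follows. This is a minimal illustration and not the Actor's actual code: the function names (`normalize_url`, `is_allowed`, `filter_and_dedupe`) are hypothetical, and the exact normalization rules and the "deny-list wins" semantics are assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme and host, drop the fragment
    and default ports, and use "/" for an empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def is_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Assumed semantics: any exclude match rejects the URL;
    an empty include list admits everything else."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return not include or any(re.search(pattern, url) for pattern in include)


def filter_and_dedupe(urls, include=(), exclude=()):
    """Keep the first occurrence of each normalized URL that passes the filters."""
    seen = set()
    kept = []
    for raw in urls:
        normalized = normalize_url(raw)
        if normalized in seen or not is_allowed(normalized, list(include), list(exclude)):
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

With `exclude=["/private"]`, the inputs `https://Example.com/a#top` and `https://example.com:443/a` collapse to a single entry, and anything under `/private` is dropped.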
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | required | Start page (http/https) |
| `maxDepth` | integer | 3 | Max crawl depth (`startUrl` is depth 0) |
| `maxPages` | integer | 1000 | Max fetched pages |
| `concurrency` | integer | 10 | Concurrent HTTP workers (1-50) |
| `allowNoindex` | boolean | false | If true, includes pages with noindex directives |
| `sitemapSeedUrls` | string[] | `[]` | Optional sitemap XML URLs to seed discovery (in addition to robots.txt `Sitemap:` entries) |
| `includePatterns` | string[] | `[]` | Regex allow-list for URLs |
| `excludePatterns` | string[] | `[]` | Regex deny-list for URLs |
| `outputFormats` | string[] | `["html","txt"]` | Optional extra outputs (`html`, `txt`) |
| `lastmodStrategy` | string | `headers` | `headers` or `crawl_time` |
| `changefreq` | string | `weekly` | Default sitemap changefreq |
| `priorityRules.defaultPriority` | number | 0.5 | Default sitemap priority |
| `priorityRules.rules` | object[] | `[]` | Ordered regex overrides (`pattern`, optional `priority`, optional `changefreq`) |
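As an illustration of how ordered regex overrides such as `priorityRules.rules` might be applied, here is a small sketch. The helper `resolve_sitemap_fields` is hypothetical, and "first matching rule wins" is an assumption about what "ordered" means; the Actor's real resolution logic may differ.

```python
import re


def resolve_sitemap_fields(url, priority_rules, default_changefreq="weekly"):
    """Return (priority, changefreq) for a URL.

    Assumption: rules are checked in order and the first match wins;
    a rule only overrides the fields it actually specifies.
    """
    priority = priority_rules.get("defaultPriority", 0.5)
    changefreq = default_changefreq
    for rule in priority_rules.get("rules", []):
        if re.search(rule["pattern"], url):
            priority = rule.get("priority", priority)
            changefreq = rule.get("changefreq", changefreq)
            break
    return priority, changefreq


rules = {
    "defaultPriority": 0.5,
    "rules": [{"pattern": "/docs/", "priority": 0.8, "changefreq": "daily"}],
}
print(resolve_sitemap_fields("https://example.com/docs/intro", rules))  # (0.8, 'daily')
print(resolve_sitemap_fields("https://example.com/about", rules))       # (0.5, 'weekly')
```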
Run Locally (Apify)
- Put your JSON input into `storage/key_value_stores/default/INPUT.json`.
- Run:

    $ apify run
Example INPUT.json:
    {
      "startUrl": "https://example.com/",
      "maxDepth": 2,
      "maxPages": 500,
      "concurrency": 10,
      "allowNoindex": false,
      "includePatterns": [],
      "excludePatterns": ["/private", "/preview"],
      "outputFormats": ["html", "txt"],
      "lastmodStrategy": "headers",
      "changefreq": "weekly",
      "priorityRules": {
        "defaultPriority": 0.5,
        "rules": [
          {"pattern": "/docs/", "priority": 0.8, "changefreq": "daily"}
        ]
      }
    }
Run Locally (CLI)
The module also supports direct CLI flags:
    python -m src \
      --start-url https://example.com/ \
      --max-depth 2 \
      --max-pages 500 \
      --concurrency 10 \
      --allow-noindex \
      --sitemap-seed-url https://example.com/sitemap.xml \
      --exclude-pattern /private \
      --output-format html \
      --output-format txt \
      --lastmod-strategy headers \
      --changefreq weekly
Priority rules can be passed on the CLI as a JSON string or as a path to a JSON file:

    $ python -m src --start-url https://example.com/ --priority-rules-json priority-rules.json
CLI precedence is higher than Actor input: CLI > INPUT JSON > defaults.
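One way to picture this precedence is a layered merge, sketched below. The `merge_settings` helper and the `DEFAULTS` subset are hypothetical, and treating `None` as "not provided" is an assumption; the module's own merge logic is not shown in this README.

```python
# Subset of the documented defaults, used here only for illustration.
DEFAULTS = {"maxDepth": 3, "maxPages": 1000, "concurrency": 10}


def merge_settings(cli_args, actor_input, defaults=DEFAULTS):
    """Layer settings so that CLI > INPUT JSON > defaults.

    Assumption: a value of None means the option was not supplied
    at that layer and must not override a lower layer.
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in actor_input.items() if v is not None})
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


settings = merge_settings(
    cli_args={"maxDepth": 2, "maxPages": None},
    actor_input={"maxPages": 500},
)
print(settings)  # {'maxDepth': 2, 'maxPages': 500, 'concurrency': 10}
```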
Output Locations
- Dataset: one item per included canonical URL (`url`, `lastmod`, `changefreq`, `priority`, `depth`, `sourceUrl`, `statusCode`, `discoveredAt`)
- Key-value store records:
  - `sitemap.xml`
  - `sitemap-00001.xml`, `sitemap-00002.xml`, ...
  - `sitemap.html` (if enabled)
  - `sitemap.txt` (if enabled)
  - `sitemap-summary.json`
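The relationship between the index file and the numbered chunks can be sketched as below. This is an assumption-laden illustration, not the Actor's implementation: `chunk_urls`, `chunk_key`, and `build_index` are hypothetical helpers, and `base_url` stands in for wherever the chunk files end up being served.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_CHUNK = 50_000  # per-chunk limit stated in this README


def chunk_urls(urls, size=MAX_URLS_PER_CHUNK):
    """Split the URL list into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def chunk_key(index):
    """Record name for the 1-based chunk index, e.g. sitemap-00001.xml."""
    return f"sitemap-{index:05d}.xml"


def build_index(base_url, chunk_count):
    """Build a sitemap index that points at every chunk.

    base_url is a hypothetical public prefix (ending in "/") for the chunks.
    """
    entries = "".join(
        f"  <sitemap><loc>{escape(base_url + chunk_key(i))}</loc></sitemap>\n"
        for i in range(1, chunk_count + 1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )
```

For example, 120,001 URLs would yield three chunks (50,000, 50,000, and 20,001 URLs), and `build_index("https://example.com/", 3)` would list `sitemap-00001.xml` through `sitemap-00003.xml`.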
Run Tests
    $ python -m unittest discover -s tests -p "test_*.py"
The integration fixture site is under `fixtures/site/`.
