URL: https://apify.com/apify/sitemap-extractor

Sitemap Extractor · Apify


Pricing

from $0.10 / 1,000 results

This Apify Actor extracts all URLs from a website's sitemaps and checks their status codes via lightweight HTTP requests. It provides a clean list of valid links, acting as an ideal pre-processor to ensure your larger crawling projects target only active URLs.

Rating

3.1 (5)

Developer

πŸ‘ Apify

Apify

Maintained by Apify

Actor stats

  • Bookmarked: 5
  • Total users: 171
  • Monthly active users: 41
  • Last modified: 3 months ago

This Actor is designed to bridge the gap between discovery and crawling. By traversing a website's sitemap.xml structure, it compiles a comprehensive list of all published pages and verifies their status before you commit resources to a full-scale scrape.

Features

  • Recursive Sitemap Discovery: Automatically detects and traverses nested sitemaps (sitemap indexes).
  • Efficiency: Validates URLs with HTTP HEAD requests, which return only the response headers and are therefore faster and use less bandwidth than full GET requests.
  • Proxy Support: Integrated with Apify Proxy to prevent rate limiting or blocking during the discovery phase.
  • Detailed Output: Provides the final URL and the corresponding HTTP status code.
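
To illustrate the HEAD-based validation described above, here is a minimal Python sketch using only the standard library. It is not the Actor's own code: the function name `head_status`, the `timeout` default, and the injectable `opener` parameter are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def head_status(
    url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,  # hypothetical hook, handy for offline tests
) -> Optional[int]:
    """Return the HTTP status code for `url`, or None if no response arrived."""
    # method="HEAD" asks the server for headers only, so no body is downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return error.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar network errors.
        return None
```

Some servers answer HEAD differently from GET, so a status obtained this way is a quick signal rather than a guarantee.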

How it Works

  1. Input: You provide one or more "Start URLs", each pointing to a domain root, a sitemap, or a sitemap index.
  2. Extraction: The Actor parses the XML, extracting both page URLs and links to further sitemaps.
  3. Validation: For every page URL found, the Actor performs a status check.
  4. Deduplication: The crawler uses unique keys to ensure that even if a URL appears in multiple sitemaps, it is only checked once.
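
The extraction and deduplication steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Actor's implementation: `collect_page_urls` and the `fetch` callable (which takes a sitemap URL and returns its XML as text) are hypothetical names, and Python sets stand in for the "unique keys" mentioned in step 4.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable, List, Set


def _local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_page_urls(start_url: str, fetch: Callable[[str], str]) -> List[str]:
    """Walk a sitemap or sitemap index and return each page URL once."""
    seen_sitemaps: Set[str] = set()
    seen_pages: Set[str] = set()
    pages: List[str] = []
    queue = deque([start_url])

    while queue:
        sitemap_url = queue.popleft()
        if sitemap_url in seen_sitemaps:
            continue  # guards against sitemaps that reference each other
        seen_sitemaps.add(sitemap_url)

        root = ET.fromstring(fetch(sitemap_url))
        is_index = _local_name(root.tag) == "sitemapindex"

        # Entries are <sitemap> elements in an index and <url> elements in a urlset.
        for entry in root:
            for child in entry:
                if _local_name(child.tag) != "loc" or not child.text:
                    continue
                loc = child.text.strip()
                if is_index:
                    queue.append(loc)  # nested sitemap: traverse it later
                elif loc not in seen_pages:
                    seen_pages.add(loc)  # page URL: keep the first occurrence only
                    pages.append(loc)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so it can be exercised with in-memory XML before being pointed at a live site.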

Usage

This Actor is ideal for:

  • Pre-crawling filter: Generating a "clean" list of URLs for Actors like Website Content Crawler or Web Scraper.
  • SEO Audits: Quickly identifying 404 Not Found or 500 Server Error pages listed in your sitemap.
  • Site Mapping: Getting a high-level overview of a site's architecture.
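
For the pre-crawling and SEO audit use cases, the results can be split into healthy and broken URLs with a few lines of Python. The sketch below assumes each result is a dictionary with `url` and `status` keys. These field names are hypothetical, so check them against the Actor's actual output before relying on them.

```python
from typing import Dict, Iterable, List


def split_by_health(results: Iterable[Dict]) -> Dict[str, List[str]]:
    """Group URLs into 'ok' (2xx/3xx) and 'broken' (anything else)."""
    report: Dict[str, List[str]] = {"ok": [], "broken": []}
    for item in results:
        status = item.get("status")  # assumed field name
        healthy = isinstance(status, int) and 200 <= status < 400
        report["ok" if healthy else "broken"].append(item["url"])  # assumed field name
    return report
```

The "ok" list could then serve as the start URLs for a larger crawl, while the "broken" list is the starting point for a sitemap cleanup.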

Configuration

Field                 Description
Start URLs            Just a domain name or a list of sitemap XML URLs to start from.
Proxy configuration   Settings for Apify Proxies.
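
As a rough illustration of how the two fields above might look in a run input, here is a small Python helper. The key names `startUrls`, `proxyConfiguration`, and `useApifyProxy` follow a convention common among Apify Actors, but they are assumptions here. The Actor's input schema is the authoritative source.

```python
from typing import Dict, List


def build_run_input(start_urls: List[str], use_apify_proxy: bool = True) -> Dict:
    """Build an input dictionary for the Actor (key names are assumed, not confirmed)."""
    return {
        # Assumed key: one object per start URL (domain root, sitemap, or sitemap index).
        "startUrls": [{"url": url} for url in start_urls],
        # Assumed key: proxy settings for the run.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }
```

With the `apify-client` Python package, a dictionary like this would typically be passed as `run_input` when calling the Actor `apify/sitemap-extractor`, assuming the key names match its schema.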
