AI-Ready Website Crawler

Pricing

Pay per usage

Crawl websites and convert each page to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Developer

Fulcria Labs

Maintained by Community

Last modified: 3 months ago

What it does

This actor takes a starting URL, crawls the website following same-domain links, and outputs each page as clean markdown with metadata. It strips out navigation, ads, scripts, and other non-content elements to produce AI-ready text.

Input

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | required | Primary URL to start crawling |
| `additionalUrls` | string[] | `[]` | Extra URLs to include in the crawl |
| `maxPages` | integer | `50` | Maximum pages to crawl (1-10000) |
| `maxDepth` | integer | `3` | Maximum link depth from start URL |
| `requestsPerSecond` | number | `2` | Rate limit for politeness |
| `respectRobotsTxt` | boolean | `true` | Honor robots.txt rules |
| `includeUrlPatterns` | string[] | `[]` | Regex patterns - only crawl matching URLs |
| `excludeUrlPatterns` | string[] | see below | Regex patterns - skip matching URLs |
| `removeSelectors` | string[] | see below | CSS selectors for elements to remove |
| `contentSelectors` | string[] | `[]` | CSS selectors to isolate main content |
| `requestTimeoutSecs` | integer | `30` | Per-request timeout |
| `userAgent` | string | `AIReadyWebsiteCrawler/1.0` | User-Agent header |
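
As an illustration of how the fields above fit together, here is a minimal sketch that merges caller overrides onto the documented defaults and checks the `maxPages` range. The helper name `build_input` and the `DEFAULTS` constant are hypothetical and not part of the actor. The two "see below" fields are deliberately left unset so the actor's own defaults would apply.

```python
import copy
from typing import Any

# Defaults copied from the input table. excludeUrlPatterns and removeSelectors
# are omitted on purpose: their full default lists are not shown in the listing.
DEFAULTS: dict[str, Any] = {
    "additionalUrls": [],
    "maxPages": 50,
    "maxDepth": 3,
    "requestsPerSecond": 2,
    "respectRobotsTxt": True,
    "includeUrlPatterns": [],
    "contentSelectors": [],
    "requestTimeoutSecs": 30,
    "userAgent": "AIReadyWebsiteCrawler/1.0",
}

# Fields that may be overridden even though no default is stored here.
OPTIONAL_FIELDS = {"excludeUrlPatterns", "removeSelectors"}


def build_input(start_url: str, **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: build an input object from the documented fields."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    # deepcopy so callers never share the mutable default lists
    run_input = {"startUrl": start_url, **copy.deepcopy(DEFAULTS), **overrides}
    if not 1 <= run_input["maxPages"] <= 10000:
        raise ValueError("maxPages must be between 1 and 10000")
    return run_input
```

Calling `build_input("https://docs.example.com", maxPages=100)` yields a dictionary that can be serialized to JSON and pasted into the actor's input editor.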

Default exclude patterns

\.(pdf|zip|tar|gz|mp4|mp3|...)$
/api/
/login,/logout,/signin,/signup,/auth/
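
The include and exclude patterns could be applied along the following lines. This is a sketch under assumptions, not the actor's actual code: it assumes patterns are matched with `re.search` against the full URL and that excludes are checked before includes. The extension pattern uses only the extensions visible above, since the original list is truncated. The function name `should_crawl` is hypothetical.

```python
import re

# Only the extensions visible in the listing are included here; the original
# pattern is truncated ("...") and its full contents are not known.
EXCLUDE_PATTERNS = [
    r"\.(pdf|zip|tar|gz|mp4|mp3)$",
    r"/api/",
    r"/login",
    r"/logout",
    r"/signin",
    r"/signup",
    r"/auth/",
]


def should_crawl(url, include_patterns=(), exclude_patterns=EXCLUDE_PATTERNS):
    """Return True if the URL passes the exclude and include filters."""
    # Assumption: an exclude match always wins.
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    # Assumption: an empty include list means "allow everything".
    if include_patterns:
        return any(re.search(p, url) for p in include_patterns)
    return True
```

Under these assumed semantics, a path such as `/docs/api/` would also match the `/api/` exclude pattern, so it may be worth checking how the actor combines the two lists before relying on the defaults for API documentation sections.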

Default remove selectors

nav, footer, header, aside, .sidebar, .advertisement, .cookie-banner, script, style, noscript, iframe, svg, and more.

Output

Each crawled page produces a dataset item with:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "markdown": "---\ntitle: \"Getting Started\"\nurl: https://...\ncrawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome to...",
  "crawl_date": "2026-02-23T12:00:00+00:00",
  "depth": 1,
  "word_count": 342
}

The markdown field includes YAML frontmatter with title, URL, and crawl date, followed by the cleaned content.
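
Downstream code often needs the frontmatter and the body separately. The sketch below splits them using only the standard library. It is a naive `key: value` parser that assumes the flat frontmatter shape shown in the example item; it is not a full YAML parser, and the function name `split_frontmatter` is hypothetical.

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' delimited block into (metadata, body)."""
    lines = markdown.split("\n")
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no frontmatter present
    # Find the closing '---' delimiter.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return {}, markdown  # unterminated block: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

For frontmatter that contains nested values or multi-line strings, a real YAML library would be the safer choice.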

Example input

Crawl documentation site

{
  "startUrl": "https://docs.example.com",
  "maxPages": 100,
  "maxDepth": 5,
  "requestsPerSecond": 2
}

Crawl specific section only

{
  "startUrl": "https://example.com/docs/api",
  "maxPages": 50,
  "maxDepth": 3,
  "includeUrlPatterns": ["/docs/api/"],
  "contentSelectors": [".docs-content", "article"]
}

Crawl multiple sites

{
  "startUrl": "https://docs.example.com",
  "additionalUrls": [
    "https://blog.example.com",
    "https://wiki.example.com"
  ],
  "maxPages": 200
}

How the content cleaning works

1. HTML fetching - Uses httpx with HTTP/2 support and configurable timeouts
2. Element removal - Strips nav, footer, ads, scripts, styles via CSS selectors
3. Content isolation - Auto-detects <main>, <article>, or content divs (or uses your custom selectors)
4. Markdown conversion - Converts to markdown preserving headings, lists, tables, code blocks, and links
5. Whitespace cleanup - Removes excessive blank lines and trailing whitespace
6. Quality filter - Skips pages with fewer than 10 words of content
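
The last two steps, whitespace cleanup and the quality filter, can be sketched with the standard library. This is an illustration under assumptions: it treats "excessive blank lines" as two or more consecutive blank lines and counts words by splitting on whitespace, which may differ from how the actor computes `word_count`. All function names are hypothetical.

```python
import re

MIN_WORDS = 10  # threshold stated in the list above


def clean_whitespace(markdown: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    # Remove trailing spaces and tabs on every line.
    text = "\n".join(line.rstrip() for line in markdown.splitlines())
    # Three or more newlines means two or more blank lines: keep just one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def word_count(markdown: str) -> int:
    """Count whitespace-separated tokens (markdown symbols count as words)."""
    return len(markdown.split())


def keep_page(markdown: str) -> bool:
    """Return False for pages below the minimum word threshold."""
    return word_count(markdown) >= MIN_WORDS
```

For example, `clean_whitespace("# Title  \n\n\n\nBody text\t\n")` returns `"# Title\n\nBody text"`, and a page reduced to only a heading would be dropped by `keep_page`.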

Use cases

Build RAG knowledge bases from documentation sites
Create training datasets for LLM fine-tuning
Index product documentation for AI assistants
Archive website content in a portable format
Feed content into vector databases (Pinecone, Weaviate, etc.)

Technical details

Async crawling with httpx for fast performance
BFS traversal with configurable depth limits
URL deduplication with fragment removal and normalization
robots.txt compliance with per-domain caching
Token bucket rate limiting for polite crawling
Same-domain restriction prevents crawling external sites
lxml parser for fast, robust HTML parsing
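
To make the traversal items above concrete, here is a minimal sketch of BFS crawling with a depth limit, URL deduplication after fragment removal, and a same-domain check. It is not the actor's implementation. The link source is abstracted as a caller-supplied `get_links` callable (hypothetical), the start URL is assumed to be depth 0, and "same domain" is assumed to mean an exact host match. Rate limiting and robots.txt handling are left out.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


def normalize(url: str) -> str:
    """Drop the fragment, lowercase scheme and host, default the path to '/'."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return parts._replace(
        scheme=parts.scheme.lower(),
        netloc=parts.netloc.lower(),
        path=parts.path or "/",
    ).geturl()


def crawl_order(start_url, get_links, max_depth=3, max_pages=50):
    """Return (url, depth) pairs in the order a BFS crawl would visit them."""
    start = normalize(start_url)
    domain = urlsplit(start).netloc
    seen = {start}                 # deduplication set of normalized URLs
    queue = deque([(start, 0)])    # FIFO queue gives breadth-first order
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            link = normalize(link)
            # Skip external hosts and URLs that were already queued.
            if urlsplit(link).netloc != domain or link in seen:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

In a real crawler `get_links` would fetch the page and extract its anchors. For experimentation, a dictionary lookup such as `lambda u: graph.get(u, [])` is enough to see how depth limits and deduplication interact.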


URL: https://apify.com/optimus-fulcria/ai-ready-website-crawler
