Website Content Crawler — AI & RAG Ready

Pricing

Pay per event

Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navigation, and footers. Depth, concurrency, and page limits are configurable. Pay per page.

Rating

0.0

(0)

Developer

Ale

Maintained by Community

Actor stats

Last modified

18 days ago

Why This Actor?

AI-optimized output: Markdown and plain text per page, with content type detection
Main content extraction: Readability-style selectors remove noise (navigation, footers, ads, sidebars)
Flexible crawl modes: Fetch a list of URLs directly (`maxDepth=0`) or follow links across a site (`maxDepth` 1 to 5)
Concurrent processing: Up to 20 parallel workers for high-throughput extraction
Pay-per-page pricing: Pay only for pages successfully crawled, plus the flat Actor start fee listed under Pricing

Use Cases

Build RAG knowledge bases from company documentation sites
Feed LLMs with up-to-date content from blog posts and news articles
Extract article text for AI summarization pipelines
Crawl competitor sites for content analysis
Bulk-convert web pages to Markdown for offline use

Input

Parameter	Type	Default	Description
`startUrls`	array	required	URLs to crawl. Use `maxDepth=0` for flat fetch, `maxDepth>0` to follow links
`maxDepth`	integer	`0`	Crawl depth. 0 = start pages only, 1 = start pages + their links, 2 = two levels, etc.
`maxPagesPerCrawl`	integer	`100`	Maximum total pages to process across all start URLs
`maxPagesPerDomain`	integer	`50`	Maximum pages per unique domain
`maxConcurrency`	integer	`5`	Number of parallel workers (1–20)
`extractMainContent`	boolean	`true`	Strip nav/footer/ads using readability-style selectors
`proxyConfiguration`	object	Apify proxy	Proxy settings
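
The defaults and limits in the table can be mirrored client-side before a run is started. The following is a minimal sketch, not the Actor's own validation: `build_run_input` is a hypothetical helper, and the checks only restate what the table says (for example, `maxConcurrency` between 1 and 20).

```python
from typing import Any

# Defaults copied from the Input table above.
DEFAULTS: dict[str, Any] = {
    "maxDepth": 0,
    "maxPagesPerCrawl": 100,
    "maxPagesPerDomain": 50,
    "maxConcurrency": 5,
    "extractMainContent": True,
}


def build_run_input(start_urls: list[str], **overrides: Any) -> dict[str, Any]:
    """Merge overrides into the documented defaults and sanity-check them.

    Hypothetical client-side helper; the Actor may validate differently.
    """
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")

    # Reject misspelled parameter names early instead of sending them along.
    unknown = set(overrides) - set(DEFAULTS) - {"proxyConfiguration"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    run_input = {"startUrls": list(start_urls), **DEFAULTS, **overrides}
    if not 1 <= run_input["maxConcurrency"] <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if run_input["maxDepth"] < 0:
        raise ValueError("maxDepth must be 0 or greater")
    return run_input
```

Calling `build_run_input(["https://blog.example.com"], maxDepth=2)` returns a dictionary with the remaining parameters filled in from the table defaults.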

Output

One record per crawled page:

Field	Type	Description
`url`	string	URL of the crawled page
`title`	string	Page title (og:title or HTML title tag)
`description`	string	Meta description (description or og:description)
`markdown`	string	Clean Markdown output, up to 50,000 characters
`text`	string	Plain text with all HTML removed, up to 10,000 characters
`word_count`	integer	Number of words in the extracted plain text
`content_type`	string	Detected type: `article`, `blog`, `documentation`, or `generic`
`depth`	integer	Crawl depth (0 = start URL)
`start_url`	string	Start URL that led to this page
`links_found`	integer	Number of new internal links discovered and added to the crawl queue
`status_code`	integer	HTTP status code
`scraped_at`	string	ISO 8601 UTC timestamp
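
For RAG ingestion, the `markdown` field usually needs to be split into smaller pieces before embedding. The sketch below shows one possible approach, paragraph-aligned chunking that carries `url` and `title` along as metadata. The helper name `chunk_record`, the `max_chars` limit, and the `chunk_index` and `text` keys are illustrative assumptions, not part of the Actor's output.

```python
def chunk_record(record: dict, max_chars: int = 1000) -> list[dict]:
    """Split one output record's markdown into paragraph-aligned chunks.

    Hypothetical post-processing step; the Actor itself returns one
    unchunked record per page.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in record.get("markdown", "").split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than max_chars becomes its own oversized chunk.
            current = paragraph
    if current:
        chunks.append(current)

    return [
        {
            "url": record["url"],
            "title": record.get("title", ""),
            "chunk_index": index,
            "text": text,
        }
        for index, text in enumerate(chunks)
    ]
```

Splitting on blank lines keeps headings and paragraphs intact, at the cost of uneven chunk sizes. A token-based splitter may suit some embedding models better.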

Example Input

Fetch a list of documentation pages (no crawling):

{
  "startUrls": [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication"
  ],
  "maxDepth": 0,
  "extractMainContent": true
}

Crawl an entire blog up to 2 levels deep:

{
  "startUrls": ["https://blog.example.com"],
  "maxDepth": 2,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 10,
  "extractMainContent": true
}
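
An input like the ones above can also be submitted programmatically. The following sketch uses only the Python standard library against the Apify API's synchronous "run and get dataset items" endpoint. The Actor ID `santamaria-automations~website-content-crawler` is an assumption derived from this page's URL, and `build_run_request` and `run_crawl` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed from the store URL; the API expects "username~actor-name".
ACTOR_ID = "santamaria-automations~website-content-crawler"


def build_run_request(run_input: dict, token: str) -> urllib.request.Request:
    """Prepare a POST request that starts a run and waits for its dataset items."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_crawl(run_input: dict, token: str, timeout: float = 300.0) -> list[dict]:
    """Send the request and return the parsed records (one per crawled page)."""
    request = build_run_request(run_input, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The synchronous endpoint suits small crawls. For larger runs, starting the Actor asynchronously and reading the dataset afterwards is likely the safer choice.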

Pricing

Event	Price
Actor start	$0.25 (flat)
Per 1,000 pages crawled	$1.00
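
As a rough worked example of the table above, the cost of a run can be estimated from the number of pages crawled. This sketch assumes pages are billed proportionally, that is, $0.001 per page, which the table does not state explicitly. `estimate_run_cost` is a hypothetical helper.

```python
ACTOR_START_USD = 0.25      # flat fee per run, from the pricing table
PER_1000_PAGES_USD = 1.00   # per 1,000 pages crawled, from the pricing table


def estimate_run_cost(pages_crawled: int) -> float:
    """Estimate the cost of one run in USD, assuming per-page proration."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled must be 0 or greater")
    return round(ACTOR_START_USD + pages_crawled * PER_1000_PAGES_USD / 1000, 4)
```

Under these assumptions, a 200-page crawl costs about $0.25 + $0.20 = $0.45, so the start fee dominates for small runs.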

MCP Integration

Use this Actor directly from Claude or any MCP-compatible AI tool with a prompt such as:

Use apify/santamaria-automations/website-content-crawler to crawl https://docs.example.com with maxDepth=1 and extractMainContent=true, then summarize the documentation

Actor URL: apify/santamaria-automations/website-content-crawler

Notes

Challenge pages (Cloudflare, DataDome, PerimeterX) are detected and skipped automatically
Deduplication prevents the same URL from being crawled twice in the same run
Content type detection identifies articles, blog posts, and documentation pages
Main content extraction uses CSS selector priority: article-specific classes → semantic tags → body fallback
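
The interaction between deduplication, `maxDepth`, and the page limits can be illustrated with a small breadth-first sketch. This is not the Actor's implementation: `plan_crawl` and `get_links` are hypothetical names, fragment stripping is an assumed normalization step, and links are restricted to the current page's host as one reading of "internal links".

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urlsplit


def plan_crawl(
    start_urls: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 0,
    max_pages: int = 100,
    max_pages_per_domain: int = 50,
) -> list[tuple[str, int]]:
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them."""
    seen: set[str] = set()
    per_domain: dict[str, int] = {}
    visited: list[tuple[str, int]] = []
    queue = deque((url, 0) for url in start_urls)

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        url = urldefrag(url).url  # treat page#a and page#b as the same page
        if url in seen:
            continue  # deduplication: never visit a URL twice in one run
        domain = urlsplit(url).netloc
        if per_domain.get(domain, 0) >= max_pages_per_domain:
            continue
        seen.add(url)
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append((url, depth))

        if depth < max_depth:
            for link in get_links(url):
                # Only follow links that stay on the same host (assumption).
                if urlsplit(link).netloc == domain:
                    queue.append((link, depth + 1))
    return visited
```

With `max_depth=0` the function returns only the start URLs, which corresponds to the flat fetch mode described under Input.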

URL: https://apify.com/santamaria-automations/website-content-crawler