URL: https://apify.com/santamaria-automations/website-content-crawler

Website Content Crawler - AI & RAG Ready · Apify


Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navs, and footers. Configurable depth, concurrency, and page limits. Pay-per-page.

Pricing: Pay per event
Rating: 0.0 (0)
Developer: Ale (maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 18 days ago

Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main content extraction strips navigation, footers, sidebars, and ads so your AI gets only the content that matters.

Why This Actor?

  • AI-optimized output: Markdown + plain text per page, with content type detection
  • Main content extraction: Readability-style selectors remove noise (nav, footer, ads, sidebars)
  • Flexible crawl modes: Fetch a list of URLs directly (maxDepth=0) or crawl entire sites (maxDepth=1-5)
  • Concurrent processing: Up to 20 parallel workers for high-throughput extraction
  • Pay-per-page pricing: Only pay for pages successfully crawled

Use Cases

  • Build RAG knowledge bases from company documentation sites
  • Feed LLMs with up-to-date content from blog posts and news articles
  • Extract article text for AI summarization pipelines
  • Crawl competitor sites for content analysis
  • Bulk-convert web pages to Markdown for offline use

Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl. Use maxDepth=0 for flat fetch, maxDepth>0 to follow links |
| maxDepth | integer | 0 | Crawl depth. 0 = start pages only, 1 = start pages + their links, 2 = two levels, etc. |
| maxPagesPerCrawl | integer | 100 | Maximum total pages to process across all start URLs |
| maxPagesPerDomain | integer | 50 | Maximum pages per unique domain |
| maxConcurrency | integer | 5 | Number of parallel workers (1-20) |
| extractMainContent | boolean | true | Strip nav/footer/ads using readability-style selectors |
| proxyConfiguration | object | Apify proxy | Proxy settings |
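
How maxDepth, maxPagesPerCrawl, and maxPagesPerDomain interact can be pictured as a bounded breadth-first traversal. The sketch below is only an illustration of that reading: the function name `plan_crawl`, the in-memory `LINKS` graph, and the choice to skip (rather than stop at) a domain that has hit its limit are all assumptions, not the Actor's documented behavior.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/c"],
    "https://docs.example.com/b": [],
    "https://docs.example.com/c": [],
}


def plan_crawl(start_urls, max_depth=0, max_pages_per_crawl=100,
               max_pages_per_domain=50, links=LINKS):
    """Return (url, depth) pairs in breadth-first order, honoring the limits."""
    queue, seen = deque(), set()
    for url in start_urls:
        if url not in seen:          # each URL is queued at most once
            seen.add(url)
            queue.append((url, 0))

    per_domain = {}                  # pages processed so far, keyed by host
    visited = []
    while queue and len(visited) < max_pages_per_crawl:
        url, depth = queue.popleft()
        domain = urlparse(url).netloc
        if per_domain.get(domain, 0) >= max_pages_per_domain:
            continue                 # assumption: over-limit domains are skipped
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append((url, depth))
        if depth < max_depth:        # follow links only while below maxDepth
            for link in links.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited
```

With the default `max_depth=0` only the start URLs are returned, which matches the "flat fetch" mode described in the table; raising it to 1 adds the pages linked from the start URLs.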

Output

One record per crawled page:

| Field | Type | Description |
|---|---|---|
| url | string | URL of the crawled page |
| title | string | Page title (og:title or HTML title tag) |
| description | string | Meta description (description or og:description) |
| markdown | string | Clean Markdown output, up to 50,000 characters |
| text | string | Plain text with all HTML removed, up to 10,000 characters |
| word_count | integer | Number of words in the extracted plain text |
| content_type | string | Detected type: article, blog, documentation, or generic |
| depth | integer | Crawl depth (0 = start URL) |
| start_url | string | Start URL that led to this page |
| links_found | integer | New internal links discovered and added to crawl queue |
| status_code | integer | HTTP status code |
| scraped_at | string | ISO 8601 UTC timestamp |
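
For a RAG pipeline, records like these usually need filtering and chunking before embedding. The following is a minimal sketch of one way to do that using only the documented field names (`url`, `markdown`, `word_count`, `status_code`); the helper names, the paragraph-based splitting strategy, and the default thresholds are assumptions chosen for illustration.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars, except that a single paragraph longer
    than max_chars is kept whole rather than split mid-paragraph.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def to_rag_chunks(records, min_words: int = 1, max_chars: int = 1000):
    """Turn crawl records into {url, chunk, text} rows ready for embedding."""
    rows = []
    for rec in records:
        # Skip failed fetches and pages with too little extracted text.
        if rec.get("status_code") != 200 or rec.get("word_count", 0) < min_words:
            continue
        for i, chunk in enumerate(chunk_markdown(rec.get("markdown", ""), max_chars)):
            rows.append({"url": rec["url"], "chunk": i, "text": chunk})
    return rows
```

Keeping the source `url` on every row makes it possible to cite the original page when a chunk is retrieved.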

Example Input

Fetch a list of documentation pages (no crawling):

{
  "startUrls": [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication"
  ],
  "maxDepth": 0,
  "extractMainContent": true
}

Crawl an entire blog up to 2 levels deep:

{
  "startUrls": ["https://blog.example.com"],
  "maxDepth": 2,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 10,
  "extractMainContent": true
}
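
Inputs like the ones above can also be submitted programmatically. The sketch below assumes the third-party `apify-client` Python package is installed and that you have an Apify API token; the helper names `build_run_input` and `run_crawl` are illustrative, and the Actor ID is taken from the store URL of this page.

```python
ACTOR_ID = "santamaria-automations/website-content-crawler"


def build_run_input(start_urls, max_depth=0, **overrides):
    """Assemble an input dict using the parameter names from the Input table."""
    run_input = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "extractMainContent": True,
    }
    # Extra keyword arguments (e.g. maxPagesPerCrawl=200) are passed through as-is.
    run_input.update(overrides)
    return run_input


def run_crawl(run_input, token):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_crawl(build_run_input(["https://blog.example.com"], max_depth=2, maxPagesPerCrawl=200), token)` with your own token should return one dictionary per crawled page, with the fields listed under Output.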

Pricing

| Event | Price |
|---|---|
| Actor start | $0.25 (flat) |
| Per 1,000 pages crawled | $1.00 |
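
As a rough budgeting aid, the two listed prices can be combined into a per-run estimate. This sketch assumes pages are billed pro rata (that is, $0.001 per page) rather than in blocks of 1,000, which the listing does not state explicitly; `estimate_cost` is a hypothetical helper name.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.25")       # flat fee per run, from the pricing list
PRICE_PER_1000_PAGES = Decimal("1.00")  # from the pricing list


def estimate_cost(pages_crawled: int) -> Decimal:
    """Estimate the run cost in USD, assuming per-page pro rata billing."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled must be non-negative")
    return ACTOR_START_FEE + PRICE_PER_1000_PAGES * pages_crawled / 1000
```

Under that assumption, a 200-page crawl comes to $0.25 + 200 × $0.001 = $0.45.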

MCP Integration

Use this actor directly from Claude or any MCP-compatible AI tool:

Use apify/santamaria-automations/website-content-crawler to crawl https://docs.example.com with maxDepth=1 and extractMainContent=true, then summarize the documentation

Actor URL: apify/santamaria-automations/website-content-crawler

Notes

  • Challenge pages (Cloudflare, DataDome, PerimeterX) are detected and skipped automatically
  • Deduplication prevents the same URL from being crawled twice in the same run
  • Content type detection identifies articles, blog posts, and documentation pages
  • Main content extraction uses CSS selector priority: article-specific classes → semantic tags → body fallback
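
The selector priority in the last note can be sketched as a fallback chain: try each rule in order and return the text of the first element that matches. The rule list below (including the class names `post-content` and `article-body`) is hypothetical, since the Actor's real selectors are not documented, and the sketch uses only the standard-library `html.parser` rather than a full CSS selector engine.

```python
from html.parser import HTMLParser

# Hypothetical priority order: article-specific classes, semantic tags, body.
PRIORITY = [
    ("class", "post-content"),
    ("class", "article-body"),
    ("tag", "article"),
    ("tag", "main"),
    ("tag", "body"),
]


class _Collector(HTMLParser):
    """Collect the text inside the first element that matches one rule."""

    def __init__(self, kind, value):
        super().__init__()
        self.kind, self.value = kind, value
        self.tag = None      # tag name of the matched element
        self.depth = 0       # > 0 while inside the matched element
        self.done = False
        self.parts = []

    def _matches(self, tag, attrs):
        if self.kind == "tag":
            return tag == self.value
        classes = (dict(attrs).get("class") or "").split()
        return self.value in classes

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if self._matches(tag, attrs):
                self.tag, self.depth = tag, 1
        elif tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return text from the highest-priority matching element, or ''."""
    for kind, value in PRIORITY:
        collector = _Collector(kind, value)
        collector.feed(html)
        if collector.parts:
            return " ".join(collector.parts)
    return ""
```

Because the body rule comes last, a page with no recognizable article container still yields its full body text instead of nothing.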
