AI / RAG Web Crawler

Pricing

from $0.50 / 1,000 results

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Developer

Group Oject

Maintained by Community

Last modified: 2 days ago

What it does

Crawls from your start URLs, following links up to a depth/page limit you set (same-domain by default).
Extracts the main content — removes nav, header, footer, sidebars, scripts, ads.
Converts it to clean Markdown (headings, lists, links, code preserved).
Chunks it into overlapping, embeddings-sized pieces for RAG.

Output is one row per chunk, each tagged with its source URL, title, and chunk position — exactly the shape you want for an embeddings/vector pipeline.
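
The chunking step can be pictured as a sliding character window. The sketch below is only an illustration under that assumption: the Actor's actual splitting logic is not documented here and may, for example, respect Markdown boundaries. The function names `chunk_text` and `to_rows` are hypothetical; the row keys mirror the output fields described in this README.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100,
               min_chars: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chars:  # drop fragments that are too small
            chunks.append(piece)
        if start + chunk_size >= len(text):  # window reached the end
            break
    return chunks


def to_rows(url: str, title: str, text: str, **chunk_options) -> list[dict]:
    """Tag each chunk with its source and position, one dict per chunk."""
    chunks = chunk_text(text, **chunk_options)
    return [
        {
            "url": url,
            "title": title,
            "chunkIndex": index,
            "chunkCount": len(chunks),
            "content": chunk,
            "contentChars": len(chunk),
        }
        for index, chunk in enumerate(chunks)
    ]
```

With the defaults, each window starts 900 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next.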

Who it's for

AI/RAG builders — turn a docs site or knowledge base into a clean corpus for retrieval.
Chatbot makers — feed your support docs into a customer-facing assistant.
Agent developers — give an agent a fresh, structured snapshot of a site.
Data teams — bulk-convert web content to Markdown without writing a parser.

Popular use cases

Docs to RAG dataset - crawl product documentation into LLM-ready Markdown chunks for embeddings.
Help center chatbot data - turn support articles, FAQs, and knowledge bases into clean chatbot context.
Website to Markdown export - convert public pages into structured Markdown for analysis or archiving.
AI agent knowledge refresh - schedule repeat crawls so agents work from current website content.
Competitor docs monitoring - snapshot competitor documentation, pricing pages, or changelogs.
Blog corpus builder - collect editorial content into chunked rows for semantic search and content analysis.

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | array | none | URLs to crawl (plain strings or `{ "url": "..." }`) |
| `maxCrawlPages` | integer | `50` | Total page cap |
| `maxCrawlDepth` | integer | `1` | Link-hops from start URLs (0 = start URLs only) |
| `sameDomainOnly` | boolean | `true` | Only follow links on the start domain(s) |
| `includeUrlGlobs` | array | none | Only crawl URLs matching these globs (e.g. `https://site.com/docs/*`) |
| `excludeUrlGlobs` | array | none | Skip URLs matching these globs (e.g. `*.pdf`) |
| `chunkContent` | boolean | `true` | Split pages into RAG chunks (one row each) |
| `chunkSize` | integer | `1000` | Target characters per chunk |
| `chunkOverlap` | integer | `100` | Overlap chars between chunks |
| `minChunkChars` | integer | `50` | Drop chunks smaller than this |
| `saveHtml` | boolean | `false` | Also include cleaned HTML |
| `maxConcurrency` | integer | `10` | Pages crawled in parallel |
| `proxyConfiguration` | object | none | Optional Apify Proxy |

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxCrawlPages": 30,
  "maxCrawlDepth": 2,
  "includeUrlGlobs": ["https://docs.apify.com/*"],
  "chunkContent": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}
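
One way to send this input from code is the `apify-client` Python package. The sketch below rests on several assumptions: the Actor ID `groupoject/ai-rag-web-crawler` is inferred from the store URL, the API token is read from an `APIFY_TOKEN` environment variable, and the helper names `build_run_input` and `run_crawler` are made up for this example.

```python
import os


def build_run_input(start_url: str, max_pages: int = 30, max_depth: int = 2) -> dict:
    """Build an input object using the field names from the Input table."""
    base = start_url.rstrip("/")
    return {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        # Keep the crawl inside the start URL's path.
        "includeUrlGlobs": [f"{base}/*"],
        "chunkContent": True,
        "chunkSize": 1000,
        "chunkOverlap": 100,
    }


def run_crawler(run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset rows."""
    # Imported here so the input builder works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    # Actor ID assumed from the store URL; adjust if it differs.
    run = client.actor("groupoject/ai-rag-web-crawler").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_crawler(build_run_input("https://docs.apify.com/"))` would start a run equivalent to the example input and return one dict per chunk.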

Output

One dataset row per chunk:

{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Docs",
  "description": "Learn how Apify Actors work.",
  "chunkIndex": 0,
  "chunkCount": 4,
  "content": "# Actors\n\nActors are serverless programs...",
  "contentChars": 980,
  "depth": 1,
  "crawledAt": "2026-06-15T12:00:00.000Z"
}

To build a vector index, embed the `content` field and store `url`, `title`, and `chunkIndex` as metadata alongside each vector.
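
As a rough sketch of that mapping, the function below turns dataset rows into generic vector-store records. Everything here other than the row field names is an assumption for illustration: the record layout (`id`, `vector`, `text`, `metadata`), the ID scheme, and `fake_embed`, which is a stand-in for a real embedding model and produces no meaningful vectors.

```python
import hashlib
from typing import Callable, Iterable


def to_vector_records(rows: Iterable[dict],
                      embed: Callable[[str], list[float]]) -> list[dict]:
    """Map crawler rows to records a vector store could ingest."""
    records = []
    for row in rows:
        # A stable ID per chunk, so re-crawls overwrite instead of duplicating.
        key = f'{row["url"]}#{row["chunkIndex"]}'
        records.append({
            "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
            "vector": embed(row["content"]),
            "text": row["content"],
            "metadata": {
                "url": row["url"],
                "title": row.get("title"),
                "chunkIndex": row["chunkIndex"],
            },
        })
    return records


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding: swap in a real model before using this."""
    return [float(len(text)), float(text.count(" "))]
```

Deriving the ID from `url` and `chunkIndex` is a design choice, not something the Actor prescribes; it keeps scheduled re-crawls idempotent as long as chunk boundaries stay the same.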

Key-value store outputs

`SUMMARY`: pages crawled and failed, total chunks, average chunk size, and the settings used for the run.

Tips for clean RAG data

Use `includeUrlGlobs` to stay inside the section you care about (e.g. `.../docs/*`) and skip marketing pages.
A `chunkSize` of 800–1200 characters suits most embedding models; raise `chunkOverlap` to 150–200 for prose-heavy sites.
Turn off `chunkContent` if you want whole pages (one row each) and prefer to chunk in your own pipeline.
Exclude noise with `excludeUrlGlobs` (`*.pdf`, `*/tag/*`, `*/author/*`).
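
To preview which URLs a set of globs would keep before spending a run, something like the following can help. It is only an approximation: it uses Python's `fnmatch`, where `*` also matches `/`, and the Actor's own glob matching may behave differently. The function name `should_crawl` and the rule that excludes take priority over includes are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def should_crawl(url: str,
                 include_globs: Optional[Sequence[str]] = None,
                 exclude_globs: Optional[Sequence[str]] = None) -> bool:
    """Approximate include/exclude glob filtering for a single URL."""
    # Assumed precedence: an exclude match always wins.
    if exclude_globs and any(fnmatchcase(url, glob) for glob in exclude_globs):
        return False
    # If include globs are given, the URL must match at least one.
    if include_globs:
        return any(fnmatchcase(url, glob) for glob in include_globs)
    return True


# Example: keep docs pages, drop PDFs and tag/author archives.
INCLUDE = ["https://site.com/docs/*"]
EXCLUDE = ["*.pdf", "*/tag/*", "*/author/*"]
```

For instance, `should_crawl("https://site.com/docs/api/intro", INCLUDE, EXCLUDE)` is true under these rules, while a URL ending in `.pdf` or one outside `/docs/` is rejected.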

Limitations & compliance

HTTP crawler — it reads server-rendered HTML. Pages that render content purely client-side (heavy SPA) may yield little; those need a browser-based crawler.
Main-content extraction is heuristic (prefers `<article>`/`<main>`, strips common boilerplate). Unusual layouts may include or drop some content.
You choose the targets. Crawl only sites you're permitted to, respect each site's terms and robots policy, and don't collect private or paywalled data. This Actor accesses publicly reachable pages only.
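
A quick way to guess whether a page is server-rendered is to count the visible text in its raw HTML. The sketch below uses only the standard library and is not the Actor's extraction logic; the names `visible_text_chars` and `_VisibleText` are hypothetical, and any threshold for "too little text" is left to you.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside script-like or head-only elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text_chars(html: str) -> int:
    """Return the number of visible text characters in raw HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not inflate the count.
    return len(" ".join(" ".join(parser.parts).split()))
```

If the count for a page fetched without a browser is near zero while the page shows plenty of text in a browser, the content is probably rendered client-side and an HTTP crawler will return little.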

Changelog

See CHANGELOG.md.

URL: https://apify.com/groupoject/ai-rag-web-crawler
