URL: https://apify.com/groupoject/ai-rag-web-crawler

AI Web Crawler for RAG: Website to LLM-Ready Markdown · Apify


Pricing

from $0.50 / 1,000 results

AI / RAG Web Crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Rating: 0.0 (0)

Developer: Group Oject

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Crawl any website and get clean, LLM-ready Markdown chunks, ready to feed AI agents, chatbots, and RAG pipelines.

Point it at a docs site, knowledge base, or blog. It crawls the pages, strips the navigation/ads/boilerplate, converts the main content to clean Markdown, and (optionally) splits it into overlapping chunks. One dataset row per chunk, so you can pipe it straight into a vector database.

⚡ Fast HTTP crawler (no headless browser). No API key required.


What it does

  1. Crawls from your start URLs, following links up to a depth/page limit you set (same-domain by default).
  2. Extracts the main content: removes nav, header, footer, sidebars, scripts, ads.
  3. Converts it to clean Markdown (headings, lists, links, code preserved).
  4. Chunks it into overlapping, embeddings-sized pieces for RAG.

Output is one row per chunk, each tagged with its source URL, title, and chunk position, which is exactly the shape you want for an embeddings/vector pipeline.
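
The depth and page limits in step 1 can be pictured as a breadth-first walk. The sketch below is only an illustration of that idea, not the Actor's actual crawler: `plan_crawl` and the in-memory `links` map are hypothetical stand-ins for real HTTP fetching, and the parameter names simply mirror the `maxCrawlPages`, `maxCrawlDepth`, and `sameDomainOnly` inputs.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, links, max_pages=50, max_depth=1, same_domain_only=True):
    """Return (url, depth) pairs in breadth-first order.

    `links` is a hypothetical map of url -> outgoing links that replaces
    real page fetching for the purpose of this sketch.
    """
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    allowed_hosts = {urlparse(u).netloc for u in starts}
    queue = deque((u, 0) for u in starts)
    seen = set(starts)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth 0 means start URLs only
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if same_domain_only and urlparse(nxt).netloc not in allowed_hosts:
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

With `max_depth=0` only the start URLs are returned; each extra level of depth admits pages one more link-hop away until the page cap is reached.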


Who it's for

  • AI/RAG builders: turn a docs site or knowledge base into a clean corpus for retrieval.
  • Chatbot makers: feed your support docs into a customer-facing assistant.
  • Agent developers: give an agent a fresh, structured snapshot of a site.
  • Data teams: bulk-convert web content to Markdown without writing a parser.

Popular use cases

  • Docs to RAG dataset - crawl product documentation into LLM-ready Markdown chunks for embeddings.
  • Help center chatbot data - turn support articles, FAQs, and knowledge bases into clean chatbot context.
  • Website to Markdown export - convert public pages into structured Markdown for analysis or archiving.
  • AI agent knowledge refresh - schedule repeat crawls so agents work from current website content.
  • Competitor docs monitoring - snapshot competitor documentation, pricing pages, or changelogs.
  • Blog corpus builder - collect editorial content into chunked rows for semantic search and content analysis.

Input

| Field | Type | Default | Description |
|---|---|---|---|
| startUrls | array | - | URLs to crawl (plain strings or { "url": "..." }) |
| maxCrawlPages | integer | 50 | Total page cap |
| maxCrawlDepth | integer | 1 | Link-hops from start URLs (0 = start URLs only) |
| sameDomainOnly | boolean | true | Only follow links on the start domain(s) |
| includeUrlGlobs | array | - | Only crawl URLs matching these globs (e.g. https://site.com/docs/*) |
| excludeUrlGlobs | array | - | Skip URLs matching these globs (e.g. *.pdf) |
| chunkContent | boolean | true | Split pages into RAG chunks (one row each) |
| chunkSize | integer | 1000 | Target characters per chunk |
| chunkOverlap | integer | 100 | Overlap chars between chunks |
| minChunkChars | integer | 50 | Drop chunks smaller than this |
| saveHtml | boolean | false | Also include cleaned HTML |
| maxConcurrency | integer | 10 | Pages crawled in parallel |
| proxyConfiguration | object | - | Optional Apify Proxy |
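
To make the interplay of `chunkSize`, `chunkOverlap`, and `minChunkChars` concrete, here is a minimal sliding-window sketch. It is an assumption about how such settings typically behave, not the Actor's documented algorithm (the real splitter may respect headings or paragraph boundaries); `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100, min_chunk_chars=50):
    """Split text into fixed-size character windows that overlap.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one; windows shorter than min_chunk_chars are dropped.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_chars:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 2,500-character page yields three chunks of 1,000, 1,000, and 700 characters, and the last 100 characters of each chunk reappear at the start of the next.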

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxCrawlPages": 30,
  "maxCrawlDepth": 2,
  "includeUrlGlobs": ["https://docs.apify.com/*"],
  "chunkContent": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}

More in examples/.
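
One way to submit an input like this programmatically is the `apify-client` Python package. The sketch below rests on two assumptions: that the Actor ID matches the path in the Store URL, and that you have an Apify account token (the "no API key" note refers to third-party keys). `build_run_input` and `run_crawler` are hypothetical helper names.

```python
ACTOR_ID = "groupoject/ai-rag-web-crawler"  # assumed from the Store URL path


def build_run_input(start_url, include_glob=None, max_pages=30, max_depth=2):
    """Build an input object using the field names from the Input table."""
    run_input = {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        "chunkContent": True,
        "chunkSize": 1000,
        "chunkOverlap": 100,
    }
    if include_glob:
        run_input["includeUrlGlobs"] = [include_glob]
    return run_input


def run_crawler(token, run_input):
    """Start the Actor, wait for it to finish, and return its dataset rows."""
    # Imported lazily so the input builder works without the package
    # installed (pip install apify-client).
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example call (requires a real token and network access):
# rows = run_crawler("<YOUR_APIFY_TOKEN>",
#                    build_run_input("https://docs.apify.com/",
#                                    "https://docs.apify.com/*"))
```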


Output

One dataset row per chunk:

{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Docs",
  "description": "Learn how Apify Actors work.",
  "chunkIndex": 0,
  "chunkCount": 4,
  "content": "# Actors\n\nActors are serverless programs...",
  "contentChars": 980,
  "depth": 1,
  "crawledAt": "2026-06-15T12:00:00.000Z"
}

To build a vector index: embed the content field, store url + title + chunkIndex as metadata. Done.
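
As a rough sketch of that mapping, the helper below turns dataset rows into generic vector-store records. Everything here is illustrative: `to_vector_records` is a hypothetical name, the `url#chunkIndex` ID scheme is an assumption, and `embed` stands for whatever embedding function your model or provider supplies.

```python
def to_vector_records(rows, embed):
    """Map Actor output rows to {id, vector, metadata} records.

    `embed` is any callable that takes a string and returns a vector;
    it is a placeholder for a real embedding model.
    """
    records = []
    for row in rows:
        records.append({
            # Hypothetical ID scheme: source URL plus chunk position.
            "id": f'{row["url"]}#{row["chunkIndex"]}',
            "vector": embed(row["content"]),
            "metadata": {
                "url": row["url"],
                "title": row["title"],
                "chunkIndex": row["chunkIndex"],
            },
        })
    return records
```

The resulting records can then be adapted to the upsert format of whichever vector database you use.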

Key-value store outputs

  • SUMMARY: pages crawled/failed, total chunks, average chunk size, settings

Tips for clean RAG data

  • Use includeUrlGlobs to stay inside the section you care about (e.g. .../docs/*) and skip marketing pages.
  • chunkSize 800–1200 chars suits most embedding models; bump chunkOverlap to 150–200 for prose-heavy sites.
  • Turn off chunkContent if you want whole pages (one row each) and prefer to chunk in your own pipeline.
  • Exclude noise with excludeUrlGlobs (*.pdf, */tag/*, */author/*).
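
To preview which URLs a pair of include/exclude lists would admit, a quick local check can help. This sketch uses Python's `fnmatch`, whose `*` also matches `/`; the Actor's own glob semantics are not documented here and may differ, so treat `should_crawl` (a hypothetical helper) as an approximation only.

```python
from fnmatch import fnmatchcase


def should_crawl(url, include_globs=(), exclude_globs=()):
    """Return True if url passes the exclude list and, when an include
    list is given, matches at least one include pattern."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if include_globs:
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    return True
```

For example, with `include_globs=["https://site.com/docs/*"]` and `exclude_globs=["*.pdf"]`, a docs page passes, a PDF under `/docs/` is skipped, and a pricing page outside `/docs/` is skipped.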

Limitations & compliance

  • HTTP crawler: it reads server-rendered HTML. Pages that render content purely client-side (heavy SPA) may yield little; those need a browser-based crawler.
  • Main-content extraction is heuristic (prefers <article>/<main>, strips common boilerplate). Unusual layouts may include or drop some content.
  • You choose the targets. Crawl only sites you're permitted to, respect each site's terms and robots policy, and don't collect private or paywalled data. This Actor accesses publicly reachable pages only.
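
To show why heuristic extraction can include or drop content, here is a deliberately simple sketch of the "prefer <article>/<main>, strip boilerplate" idea using only the standard library. It is not the Actor's implementation: it returns plain text rather than Markdown, and `MainTextExtractor` and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"article", "main"}


class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.main_depth = 0   # > 0 while inside <article> or <main>
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.main_parts = []  # text found inside main-content tags
        self.all_parts = []   # any non-boilerplate text (fallback)

    def handle_starttag(self, tag, attrs):
        if tag in MAIN_TAGS:
            self.main_depth += 1
        elif tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1
        elif tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        self.all_parts.append(text)
        if self.main_depth:
            self.main_parts.append(text)


def extract_main_text(html):
    """Prefer text inside <article>/<main>; otherwise fall back to all
    text outside boilerplate elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.main_parts or parser.all_parts)
```

A page that wraps its real content in a generic `<div>` while using `<main>` for something else would be extracted poorly by this approach, which is the kind of unusual layout the limitation above warns about.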

Changelog

See CHANGELOG.md.
